The Pennsylvania State University The Graduate School College of Engineering

MODELING AND LEVERAGING EMERGING NON-VOLATILE MEMORIES FOR FUTURE COMPUTER DESIGNS

A Dissertation in Computer Science and Engineering by Xiangyu Dong

© 2011 Xiangyu Dong

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

December 2011

The dissertation of Xiangyu Dong was reviewed and approved* by the following:

Yuan Xie Associate Professor of Computer Science and Engineering Dissertation Advisor, Chair of Committee

Mary Jane Irwin Professor of Computer Science and Engineering

Vijaykrishnan Narayanan Professor of Computer Science and Engineering

Suman Datta Professor of Electrical Engineering

Norman P. Jouppi Director of Intelligent Infrastructure Lab at Hewlett-Packard Labs Special Member

Mahmut Taylan Kandemir Professor of Computer Science and Engineering Director of Graduate Affairs

*Signatures are on file in the Graduate School.

Abstract

Energy efficiency has become a major constraint in the design of computing systems today. As CMOS continues scaling down, traditional CMOS scaling theory requires reducing the supply and threshold voltages in proportion to the device sizes, which exponentially increases the leakage. As a result, leakage power has become comparable to dynamic power in current-generation processes. Before leakage power becomes the dominant part of the power budget, disruptive emerging technologies are needed. Fortunately, many new types of non-volatile memory technologies are now evolving. For example, emerging non-volatile memories such as Spin-Torque-Transfer RAM (MRAM, STTRAM), Phase-Change RAM (PCRAM), and Resistive RAM (ReRAM) show attractive properties of high access performance, low access energy, and high cell density. Therefore, it is promising to employ these emerging non-volatile memory technologies in designing future high-performance and low-power computing systems. However, since none of these new non-volatile memories is mature yet, academic research is necessary to demonstrate the usefulness of these technologies. To that end, this dissertation investigates three aspects of deploying these emerging non-volatile memory technologies. First, a circuit-level performance, energy, and area model for various non-volatile memories is built. Second, several architecture-level techniques that mitigate the drawbacks of non-volatile memory write operations are proposed and evaluated. Third, application-level case studies of adopting these emerging technologies are conducted.

Contents

List of Figures

List of Tables

List of Symbols

Acknowledgments

Dedication

1 Introduction

2 Technology Background
   2.1 NAND Flash Switching Mechanism
   2.2 STTRAM Switching Mechanism
   2.3 PCRAM Switching Mechanism
   2.4 ReRAM Switching Mechanism
   2.5 eNVM Read Operations
   2.6 eNVM Problems
       2.6.1 Write Latency/Energy Issue
       2.6.2 Write Endurance Issue

3 Related Work
   3.1 Previous Work on Circuit Level
   3.2 Previous Work on Architecture Level
   3.3 Previous Work on Application Level

4 Circuit-Level: eNVM Modeling
   4.1 NVSim Project Overview
   4.2 NVSim Framework
       4.2.1 Device Model
       4.2.2 Array Organization
       4.2.3 Memory Bank Type
       4.2.4 Activation Mode
       4.2.5 Routing to Mats
       4.2.6 Routing to Subarrays
   4.3 Area Model
       4.3.1 Cell Area Estimation
       4.3.2 Peripheral Circuitry Area Estimation
   4.4 Timing and Power Models
       4.4.1 Generic Timing and Power Estimation
       4.4.2 Data Sensing Models
       4.4.3 Cell Switching Model
   4.5 Miscellaneous Circuitry
       4.5.1 Pulse Shaper
       4.5.2 Charge Pump
   4.6 Validation Result
       4.6.1 NAND Flash Validation
       4.6.2 STT-RAM Validation
       4.6.3 PCRAM Validation
       4.6.4 ReRAM Validation
   4.7 Summary

5 Architecture-Level: Techniques for Alleviating eNVM Write Overhead
   5.1 Directly Replacing SRAM Caches
   5.2 Read-Preemptive Write Buffer
   5.3 Hybrid SRAM-eNVM Cache
   5.4 Effectiveness of Read-Preemptive Write Buffer and Hybrid Cache
   5.5 Summary

6 Application-Level: eNVM for File Storage
   6.1 Multi-Level Cell
       6.1.1 Extra Write Overhead
       6.1.2 Extra Read Overhead
       6.1.3 Reduced Cell Lifetime
       6.1.4 PCRAM Lifetime Model
   6.2 Adaptive MLC/SLC PCRAM Array Structure
       6.2.1 MLC/SLC Write: SET, RESET, and PGM Pulses
       6.2.2 MLC/SLC Read: Dual-Mode Sense Amplifier
       6.2.3 Address Re-mapping
       6.2.4 Reconfigurable PCRAM-based Solid-State Disk
   6.3 Experimental Results
       6.3.1 PCRAM MLC/SLC Timing Model
       6.3.2 Performance-Aware Management Result
       6.3.3 Performance-Cost Analysis
       6.3.4 Lifetime Analysis
   6.4 Summary

7 Application-Level: eNVM for Exascale Fault Tolerance
   7.1 Problem
   7.2 Integrating eNVM Modules into MPP Systems
   7.3 Local/Global Hybrid Checkpoint
       7.3.1 Hybrid Checkpoint Scheme
       7.3.2 System Failure Category Analysis
       7.3.3 Theoretical Performance Model
   7.4 Experimental Results
       7.4.1 Checkpointing Scenarios
       7.4.2 Scaling Methodology
       7.4.3 Performance Analysis
       7.4.4 Power Analysis
   7.5 Summary

8 Application-Level: eNVM as On-Chip Cache
   8.1 Overview
   8.2 ReRAM-Based Cache Wear-Leveling
       8.2.1 Inter-Set Cache Line Wear-Leveling
       8.2.2 Intra-Set Cache Line Wear-Leveling
       8.2.3 Endurance Requirements for ReRAM Caches
   8.3 Circuit-Level ReRAM Model
       8.3.1 ReRAM Modeling
       8.3.2 ReRAM Array Design Spectrum
   8.4 Architecture-Level Model of Design
       8.4.1 Feed-Forward Network
       8.4.2 Training and Validation
   8.5 Experimental Methodology
       8.5.1 Circuit-Architecture Joint Exploration Framework
       8.5.2 Simulation Environment
   8.6 Design Exploration and Optimization
       8.6.1 Cache Hierarchy Design Exploration
       8.6.2 Design Optimization
       8.6.3 Discussion
   8.7 Summary

9 Conclusion

Bibliography

List of Figures

1.1 While the state-of-the-art memory hierarchy includes SRAM, DRAM, NAND flash, and HDD, we propose an eNVM-based new hierarchy that only uses non-volatile memory technology and provides both high-performance and low-power operation.
2.1 The basic string block of NAND flash, and the conceptual view of floating-gate flash (BL=bitline, WL=wordline, SG=select gate).
2.2 The conceptual view of an STTRAM cell.
2.3 The schematic view of a PCRAM cell with a MOSFET selector transistor (BL=bitline, WL=wordline, SL=sourceline).
2.4 The temperature-time relationship during SET and RESET operations.
2.5 The working mechanism of ReRAM cells.
4.1 An example of the memory array organization modeled in NVSim: a hierarchical memory organization includes banks, mats, and subarrays with decoders, multiplexers, sense amplifiers, and output drivers.
4.2 The example of the wire routing in a 4x4 mat organization for the data array of an 8-way 1MB cache with 64B cache lines.
4.3 The example of the wire routing in a 4x4 mat organization for the tag array of an 8-way 1MB cache with 64B cache lines.
4.4 An example of a mat using internal sensing and H-tree routing.
4.5 An example of a mat using external sensing and bus-like routing.
4.6 Conceptual view of a MOS-accessed cell (1T1R) and its connected word line, bit line, and source line.

4.7 Conceptual view of a cross-point cell array without diode (0T1R) and its connected word lines and bit lines.
4.8 The layout of the NAND-string cell modeled in NVSim.
4.9 Transistor sizings: (a) latency-optimized; (b) balanced; (c) area-optimized.
4.10 Analysis model for the current sensing scheme.
4.11 Analysis model for the current-in voltage sensing scheme.
4.12 Analysis model for the voltage-divider sensing scheme.
4.13 The current-voltage converter modeled in NVSim.
4.14 The circuit schematic of the slow quench pulse shaper used in [1].
5.1 In a read-preemptive write buffer, read operations can be granted over the ongoing write operation if the progress of that write operation is less than 50%.
5.2 The performance impact of the preemption condition.
5.3 STTRAM write intensity with and without hybrid SRAM-STTRAM caches.
5.4 The performance improvement and energy reduction after applying the read-preemptive write buffer and hybrid cache techniques.
6.1 The basic MLC PCRAM programming scheme.
6.2 SET and RESET resistances during PCRAM cycling, illustrating the difference between failure by "stuck-SET" and by "stuck-RESET".
6.3 The block diagram of the PCRAM array organization that supports both MLC and SLC operations.
6.4 The conceptual view of managing SLC and MLC modes.
6.5 The performance of the adaptive MLC/SLC solution under different utilizations.
6.6 The performance per cost analysis of the adaptive MLC/SLC solution.
7.1 The typical organization of a contemporary supercomputer. All the permanent storage devices are controlled by I/O nodes. There is no local permanent storage for each node.

7.2 The proposed new organization that supports hybrid checkpoints. The primary permanent storage devices are still connected through I/O nodes, but each process node also has permanent storage.
7.3 The bandwidth with different write sizes.
7.4 The main memory bandwidth with different write sizes.
7.5 The organization of a conventional 18-chip DRAM DIMM with ECC support.
7.6 The organization of the proposed 18-chip PCRAM DIMM with ECC support.
7.7 The local/global hybrid checkpoint model.
7.8 A conceptual view of execution time broken by the checkpoint interval: (a) an application running without failure; (b) an application running with a failure, where the system rewinds back to the most recent checkpoint and is recovered by the local checkpoint; (c) an application running with a failure that cannot be protected by the local checkpoint, so the system rewinds back to the most recent global checkpoint.
7.9 The checkpoint overhead comparison in a 1-petaFLOPS system (normalized to the computation time).
7.10 The checkpoint overhead comparison in a 1-exaFLOPS system (normalized to the computation time).
8.1 Inter-set L3 cache line write count variation in a simulated 8-core system with 32KB I-L1, 32KB D-L1, 1MB L2, and 8MB L3 caches.
8.2 Intra-set L3 cache line write count variation in a simulated 8-core system with 32KB I-L1, 32KB D-L1, 1MB L2, and 8MB L3 caches.
8.3 The comparison of the D-L1 intra-set cache write variations using the plain LRU policy and the proposed endurance-aware LRU policy with proactive invalidation: (a) the log-scale write counts of different applications; (b) the linear-scale write counts of selected applications that originally have large intra-set cache write variations.
8.4 The design spectrum of 32nm ReRAM: (upper) read latency vs. density; (bottom) write latency vs. density.

8.5 The basic organization of a two-layer feed-forward artificial neural network.
8.6 An accurate ANN fitting example: MG from NPB.
8.7 A typical ANN fitting example: dedup from PARSEC.
8.8 The worst ANN fitting example: x264 from PARSEC.
8.9 Overview of the optimization framework.
8.10 CDF plots of error on IPC prediction of NPB and PARSEC benchmark applications.
8.11 Pareto-optimal curves: energy and performance trade-off of the memory hierarchy. Main memory dynamic power is included for a fair comparison.
8.12 Pareto-optimal curves (cross-point ReRAM as main memory): energy and performance trade-off under different constraints on SRAM deployment.
8.13 Pareto-optimal curves (cross-point ReRAM as main memory): cache area and performance trade-off under different constraints on SRAM deployment.
8.14 The global Pareto-optimal curve (cross-point ReRAM as main memory) and feasible design options with total cache area less than 3mm^2.
8.15 The path of EDP optimization.
8.16 The path of EDAP optimization.
8.17 A proposed universal memory hierarchy using ReRAM.

List of Tables

1.1 Characteristics of memory and storage technologies [2]
1.2 Characteristics of emerging non-volatile memory (eNVM) (collected from the literature)
4.1 The initial number of wires in each routing group
4.2 NVSim's NAND flash model validation with respect to a 50nm 2Gb NAND flash chip (B-SLC2) [3]
4.3 NVSim's STT-RAM model validation with respect to a 65nm 64Mb STT-RAM prototype chip [4]
4.4 NVSim's PCRAM model validation with respect to a 0.12µm 64Mb MOS-accessed PCRAM prototype chip [5]
4.5 NVSim's PCRAM model validation with respect to a 90nm 512Mb diode-selected PCRAM prototype chip [1]
4.6 NVSim's ReRAM model validation with respect to a 0.18µm 4Mb MOSFET-selected ReRAM prototype chip [6]
5.1 Area, access time, and energy comparison between SRAM and STTRAM caches within similar silicon area (65nm technology)
5.2 Baseline configuration parameters
5.3 L2 transaction intensities
5.4 The performance and power improvement (STTRAM cache vs. SRAM cache)
6.1 The lifetime model of PCRAM cells
6.2 The timing model of PCRAM cells in SLC and MLC modes
7.1 Different Configurations of the PCRAM Chips

7.2 The Statistics of the Failure Root Cause Collected by LANL during 1996-2005
7.3 Local/Global Hybrid Checkpointing Parameters
7.4 Bottleneck Factor of Different Checkpoint Schemes
7.5 Specifications of the Baseline Petascale System and the Projected Exascale System
8.1 Input design space parameters
8.2 MOS-accessed and cross-point ReRAM main memory parameters (1Gb, 8-bit, 16-bank)
8.3 On-die cache hierarchy design parameters of 7 design options
8.4 On-die cache hierarchy design parameters of 7 design options (continued)
8.5 Overview of the proposed universal memory hierarchy

List of Symbols

eNVM     Emerging non-volatile memory
STTRAM   Spin-torque transfer random access memory
PCRAM    Phase-change random access memory
ReRAM    Resistive random access memory
NVSim    Non-volatile memory simulator

Acknowledgments

Many people have contributed to my academic progress. First of all, I thank Prof. Yuan Xie for his invaluable guidance and support as my Ph.D. advisor. I appreciate his energetic working style and his way of guiding students.

Also at Pennsylvania State University, I thank Prof. Mary Jane Irwin, Prof. Vijaykrishnan Narayanan, and Prof. Suman Datta for serving on my dissertation committee. Their feedback at the early stages gave me a strong foundation for continuing my research topic.

For enhancing my research skills with practical experience, I sincerely thank Dr. Norman P. Jouppi for hosting my internship at Hewlett-Packard Labs. His legendary research experience and great personality made my time at Hewlett-Packard Labs one of the most cherished periods of my life.

For enriching my experience of studying abroad, I thank my friends and colleagues at Pennsylvania State University and Hewlett-Packard Labs. I especially thank Xiaoxia Wu, Yang Ding, Guangyu Sun, Yibo Chen, Jin Ouyang, Dimin Niu, Tao Zhang, Cong Xu, Jishen Zhao, Jing Xie, Matt Poremba, Jingchen Liu, Rob Schreiber, Partha Ranganathan, Jichuan Chang, and Sheng Li.

Last but not least, I thank Jue Wang for making my life much more vivid than I could ever imagine. Most importantly, I thank my parents Zhiping Liu and Junkang Dong for their love. I hope I will always make them proud.

Dedication

This dissertation is dedicated to my parents Zhiping Liu and Junkang Dong, for their support, encouragement, education, and love throughout my life.

Chapter 1

Introduction

One of the major constraints in the design of computing systems today is energy efficiency. The state-of-the-art CMOS process suffers from an increasing leakage problem as the process node keeps scaling down. In order to build next-generation exascale computing systems with high energy efficiency, disruptive technologies are required that provide both high-performance and low-power computing capability. Considering that memory access latency and disk access latency are several orders of magnitude slower than the processing cores, and that system memory power (DRAM power) and disk power contribute as much as 40% of the overall power consumption in a data center [7-9], it is highly necessary to first improve the performance and power profiles of the traditional memory hierarchy.

Traditionally, SRAM on-chip caches, DRAM off-chip main memory, and hard disk drive (HDD) storage are the three major components of today's computer memory hierarchy. With the improvements in speed, density, and cost of NAND flash devices, the solid-state drive (SSD) has gained momentum as an HDD replacement or as a storage cache located between the DRAM main memory and HDD storage. Hence, the state-of-the-art memory hierarchy has four levels, as shown in Figure 1.1.

However, the state-of-the-art memory hierarchy shown in Figure 1.1 faces several challenges. On the performance side, HDD speed has become a severe bottleneck, and its mechanical structure limits the upper bound of its access speed.


Figure 1.1: While the state-of-the-art memory hierarchy includes SRAM, DRAM, NAND flash, and HDD, we propose an eNVM-based new hierarchy that only uses non-volatile memory technology and provides both high-performance and low-power operation.

Although the recent adoption of SSD storage raises the performance level, the slow programming speed of NAND flash devices and their write endurance of only 10^5 cycles still prevent SSD storage from becoming the mainstream storage of the future. On the energy consumption side, the existing DRAM main memory already contributes a large portion of the total system energy consumption, and the increasing leakage power makes SRAM on-chip caches and DRAM off-chip main memories impractical to scale down to the next fabrication process level. Therefore, disruptive technologies are strongly required to improve the memory hierarchy's performance without a proportional increase in energy consumption.

Table 1.1: Characteristics of memory and storage technologies [2]

Device type       SRAM      DRAM      NAND flash    HDD
Maturity          Product   Product   Product       Product
Cell size         150F^2    6F^2      4F^2          (2/3)F^2
MLC capability    No        No        4 bits/cell   No
Write energy      1pJ       2pJ       10nJ          N/A
Write latency     5ns       10ns      200µs         10ms
Write endurance   10^16     10^16     10^5          N/A

Recently, significant effort and resources have been devoted to the research and development of emerging memory technologies. Several promising candidates, such as Spin-Torque Transfer RAM (MRAM or STTRAM) [4,10-13], Phase-Change RAM (PCRAM) [1,5,14-24], and Resistive RAM (ReRAM) [6,25-35], have gained substantial attention and are being actively pursued by both academic and industrial research. In this dissertation, we call them eNVM (emerging non-volatile memory). eNVMs have attractive features such as

Table 1.2: Characteristics of emerging non-volatile memory (eNVM) (collected from the literature)

Device type       STTRAM          PCRAM           ReRAM
Maturity          Prototype       Prototype       Prototype
Cell size         13F^2-100F^2    4F^2-40F^2      4F^2-70F^2
MLC capability    No              4 bits/cell     4 bits/cell
Write energy      1pJ-100pJ       4pJ-200pJ       0.3pJ-25pJ
Write latency     4ns-100ns       100ns-1000ns    0.3ns-100ns
Write endurance   10^12-10^16     10^5-10^8       10^5-10^11

scalability, fast read access, zero standby power, and non-volatility. In addition, by using different types of peripheral circuitry, eNVMs can cover a wide design space, from highly performance-optimized on-chip caches to highly area-optimized secondary storage. Table 1.1 and Table 1.2 compare the characteristics of conventional memory/storage technologies (i.e., SRAM, DRAM, NAND flash, and HDD) and eNVMs. The comparison shows that eNVMs generally outperform their non-volatile counterparts (e.g., NAND flash and HDD) in all aspects. In addition, eNVMs have a cell density as high as that of DRAM and read performance similar to that of SRAM, but their drawbacks are relatively long write latency and high write energy (for PCRAM and ReRAM, the limited write endurance is another issue). Nevertheless, it is promising to redesign the memory hierarchy by adopting various styles of eNVM at different hierarchy levels, as demonstrated in Figure 1.1. Compared to the state-of-the-art designs,

• The non-volatile nature of eNVMs reduces the memory subsystem energy consumption that was originally contributed by SRAM caches and DRAM main memory;

• The nanosecond-level access speed of eNVMs easily outperforms today’s SSD and HDD in the storage level of the hierarchy.

Since eNVM technologies are still immature and have drawbacks in handling write operations, the challenges for eNVM research include: (1) at the circuit level, estimation tools are required so that high-level designs can understand and exploit the underlying eNVM devices; (2) at the architecture level, non-standard techniques are necessary to mitigate the long write latency, high write energy, and limited write endurance of eNVMs; (3) at the application level, eNVMs need to demonstrate their effectiveness in enabling the design of future high-performance and energy-efficient computing systems.

This dissertation is one of the efforts to address these challenges. First, a circuit-level performance, energy, and area model for various non-volatile memories is built and described in Chapter 4. Second, several architecture-level techniques that mitigate the drawbacks of non-volatile memory write operations are proposed and evaluated in Chapter 5. Third, application-level case studies of adopting these eNVM technologies are conducted. The case studies include eNVM-based file storage in Chapter 6, an eNVM-based checkpoint-restart scheme in Chapter 7, and eNVM-based memory hierarchy exploration in Chapter 8.

Chapter 2

Technology Background

Different eNVM technologies have their particular storage mechanisms and corresponding write methods. In this chapter, we first review the technology background of several emerging eNVM technologies, including Spin-Torque-Transfer RAM (MRAM, STT-RAM), Phase-Change RAM (PCRAM), and Resistive RAM (ReRAM). In addition, the conventional NAND flash technology is also covered.

2.1 NAND Flash Switching Mechanism

The physical mechanism of flash memory is to store bits in a floating gate and thereby control the gate threshold voltage. The series bit-cell string of NAND flash, as shown in Figure 2.1(a), eliminates contacts between the cells and approaches the minimum cell size of 4F^2 for low-cost manufacturing. The small cell size, low cost, and strong application demand make NAND flash dominate the traditional non-volatile memory market. Figure 2.1(b) shows that a flash memory cell consists of a floating gate and a control gate aligned vertically. The flash memory cell modifies its threshold voltage VT by adding electrons to or subtracting electrons from the isolated floating gate.

NAND flash charges or discharges the floating gate by using Fowler-Nordheim (FN) tunneling or hot carrier injection (HCI). A program operation adds tunneling charge to the floating gate so that the threshold voltage becomes positive, while an erase operation


Figure 2.1: The basic string block of NAND flash, and the conceptual view of a floating-gate flash memory cell (BL=bitline, WL=wordline, SG=select gate).

subtracts charge so that the threshold voltage returns to a negative value. While NAND flash currently has the largest market demand, it is not an ideal non-volatile memory technology. First, NAND flash memory can only be erased one "block" at a time, which complicates the programming scheme. Second, NAND flash is known to have a severe write endurance problem: a single flash memory cell survives only a limited number of program-erase cycles. Therefore, in practice, a flash translation layer (FTL) is always required to simplify the access scheme and maintain effective wear-leveling. In addition, NAND flash memories face significant scaling challenges due to their dependence on reductions in lithographic resolution as well as fundamental physical limitations beyond the 22nm process node, such as severe floating-gate interference, lower coupling ratio, short-channel effects, and the low electron charge in the floating gate [2].

2.2 STTRAM Switching Mechanism

STTRAM is a new type of MRAM that has drawn considerable attention in the past five years for its below-100nm scalability and its lower write current compared with conventional magnetic-field-induced switching MRAM.


Figure 2.2: The conceptual view of an STTRAM cell.

Similar to its MRAM counterpart, STTRAM uses a Magnetic Tunnel Junction (MTJ) as the memory storage element and leverages the difference in magnetic directions to represent a memory bit. As shown in Figure 2.2, an MTJ contains two ferromagnetic layers. One ferromagnetic layer has a fixed magnetization direction and is called the reference layer, while the other layer has a free magnetization direction that can be changed by passing a write current and is called the free layer. The relative magnetization directions of the two ferromagnetic layers determine the resistance of the MTJ. If the two layers have the same direction, the resistance of the MTJ is low, indicating a "1" state; if the two layers have different directions, the resistance of the MTJ is high, indicating a "0" state. Unlike conventional MRAM, in which the polarization of the free layer in the MTJ is set by an external magnetic field generated by the combined write current in the write wordline and the bitline, STTRAM directly uses the current to SET (or RESET) the polarization of the free ferromagnetic layer. As shown in Figure 2.2, when writing the "0" state into an STTRAM cell (RESET operation), a positive voltage difference is established between SL and BL; when writing the "1" state (SET operation), vice versa. The current amplitude required to reverse the direction of the free ferromagnetic layer is determined by the size and aspect ratio of the MTJ and the write pulse duration.

2.3 PCRAM Switching Mechanism


Figure 2.3: The schematic view of a PCRAM cell with a MOSFET selector transistor (BL=bitline, WL=wordline, SL=sourceline).

Figure 2.4: The temperature-time relationship during SET and RESET operations: an amorphizing RESET pulse heats GST above the melting temperature (~600°C), while a crystallizing SET pulse keeps it above the crystallization transition temperature (~300°C).

PCRAM uses a chalcogenide material (e.g., GST) to store information. The chalcogenide material can be switched between a crystalline phase (SET state) and an amorphous phase (RESET state) with the application of heat. The crystalline phase shows low resistivity, while the amorphous phase is characterized by high resistivity. Figure 2.3 shows an example of a MOS-accessed PCRAM cell.

The SET operation crystallizes GST by heating it above its crystallization temperature, and the RESET operation melt-quenches GST to make the material amorphous, as illustrated in Figure 2.4. The temperature is controlled by passing a specific electrical current profile and generating the required Joule heat. High-power pulses are applied during the RESET operation to heat the memory cell above the GST melting temperature. In contrast, moderate-power but longer-duration pulses are applied during the SET operation to heat the cell above the GST crystallization temperature but below the melting temperature [20].

2.4 ReRAM Switching Mechanism

Although many non-volatile memory technologies (e.g., the aforementioned STTRAM and PCRAM) are based on electrically induced resistive switching effects, we define ReRAM as the one that involves electro- and thermochemical effects in the resistance change of a metal-oxide-metal system. In addition, we confine our definition to bipolar ReRAM. Figure 2.5 illustrates the general concept of the ReRAM working mechanism. A ReRAM cell consists of a metal-oxide layer (based on, e.g., Ti [25], Ta [28], or Hf [29]) sandwiched between two metal (e.g., Pt [25]) electrodes. The electronic behavior of the metal/oxide interfaces depends on the oxygen vacancy concentration of the metal-oxide layer. Typically, the metal/oxide interface shows Ohmic behavior in the case of very high doping and rectifying behavior in the case of low doping [25]. In Figure 2.5, the TiO2 region is semi-insulating, indicating a lower oxygen vacancy concentration, while the TiO2-x region is conductive, indicating a higher concentration.


Figure 2.5: The working mechanism of ReRAM cells.

The oxygen vacancy in the metal oxide acts as an n-type dopant, whose drift under an electric field can change the doping profile. Thus, applying an electric current can modulate the I-V curve of the ReRAM cell and switch the cell from one state to the other. Usually, for bipolar ReRAM, the cell can be switched ON (SET operation) only by applying a negative bias and OFF (RESET operation) only by applying the opposite bias [25]. Although the fundamental mechanism of ReRAM switching is still under debate, several ReRAM prototypes [6,32,34] have been demonstrated and show promising fast switching speed and low energy consumption.

2.5 eNVM Read Operations

The read operations of these NVM technologies are almost the same. Since an NVM memory cell has different equivalent resistances in the ON and OFF states, the read operation can be accomplished either by applying a small voltage on the bitline and sensing the current that passes through the memory cell, or by injecting a small current into the bitline and sensing the voltage across the memory cell. Unlike SRAM, which generates complementary read signals from each cell, NVM usually has a group of dummy cells to generate the reference current or reference voltage. The generated current (or voltage) from the to-be-read cell is then compared to the reference current (or voltage) using sense amplifiers. Various types of sense amplifiers can be used in this step, and they are modeled later in Chapter 4.4. A minimal sketch of reference-based sensing follows.
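To make the reference-based sensing concrete, here is a minimal sketch in C++ that decides a bit by comparing the cell current against a reference derived from a pair of dummy cells, one in each state. The resistance values, the read voltage, and the mid-point reference are illustrative assumptions rather than parameters of any prototype or of NVSim.

```cpp
#include <iostream>

// Read a resistive cell by current comparison against a dummy-cell reference.
//   rCell : resistance of the selected cell (ohm)
//   rOn   : dummy-cell resistance in the low-resistance (ON) state
//   rOff  : dummy-cell resistance in the high-resistance (OFF) state
//   vRead : small read voltage applied on the bitline (assumed value)
bool readBit(double rCell, double rOn, double rOff, double vRead = 0.1) {
    double iCell = vRead / rCell;                        // current through the cell
    double iRef  = 0.5 * (vRead / rOn + vRead / rOff);   // reference from dummy cells
    return iCell > iRef;  // more current than reference -> low resistance -> "1"
}

int main() {
    std::cout << readBit(2e3,   2e3, 200e3) << "\n";  // ON cell reads 1
    std::cout << readBit(200e3, 2e3, 200e3) << "\n";  // OFF cell reads 0
}
```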

2.6 eNVM Problems

These eNVM technologies share some common properties. For example, the advantages of STTRAM, PCRAM, and ReRAM include fast memory access (several nanoseconds) and byte-accessibility, while legacy non-volatile technologies are extremely slow (e.g., several microseconds for NAND flash and several milliseconds for HDD) and can only be accessed in units of pages or blocks. Some previous research leverages these advantages to design high-performance file systems [36] and storage devices [37]. However, the major disadvantages of these emerging memory technologies are the asymmetric read/write latency and energy consumption (for all of these technologies), as well as the limited write endurance (except for MRAM and STTRAM).

2.6.1 Write Latency/Energy Issue

Compared to volatile memories such as SRAM and DRAM, eNVMs have a more stable data-keeping mechanism. Accordingly, overwriting existing data takes more time and consumes more energy. Hence, the major drawbacks of eNVMs are their relatively long write latency and high write energy. Many previous research efforts have targeted this aspect. Early write termination [38] and write pausing [39] have been proposed to mitigate the performance penalty caused by long write latency. The data-comparison write technique [40,41] is also commonly applied to reduce the energy consumed by eNVM write operations, as illustrated by the sketch below.
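As an illustration of the data-comparison write idea, the toy sketch below reads the old word and counts only the bits that actually flip, which is the quantity that tracks write energy and cell wear. The 64-bit word width, the array model, and the counter are illustrative assumptions, not an interface from the cited work.

```cpp
#include <bitset>
#include <cstdint>
#include <iostream>
#include <vector>

// Toy word-addressable eNVM array used only to demonstrate the technique.
struct ToyNvmArray {
    std::vector<uint64_t> words;
    long bits_programmed = 0;   // proxy for write energy / wear

    explicit ToyNvmArray(size_t n) : words(n, 0) {}

    // Data-comparison write: read the old word first and program only the
    // bits that actually change; unchanged cells see no write pulse.
    void write(size_t addr, uint64_t new_word) {
        uint64_t diff = words[addr] ^ new_word;            // 1 = bit must flip
        bits_programmed += std::bitset<64>(diff).count();
        words[addr] = new_word;                            // only flipped bits, in hardware
    }
};

int main() {
    ToyNvmArray nvm(16);
    nvm.write(0, 0xFFULL);  // 8 bits flip
    nvm.write(0, 0xFEULL);  // only 1 bit flips
    std::cout << "bits programmed: " << nvm.bits_programmed << "\n";  // prints 9
}
```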

2.6.2 Write Endurance Issue

Write endurance is the number of times that an NVM cell can be overwritten. Among all the NVM technologies mentioned in this dissertation, only MRAM and STTRAM are free from the write endurance issue. NAND flash, PCRAM, and ReRAM all have limited write endurance. NAND flash only has a write endurance of 10^5-10^6. The PCRAM endurance is currently in the range between 10^5 and 10^9 [1,5,14]. ReRAM research currently shows endurance numbers in the range between 10^5 and 10^11 [30,31,35]. The ITRS projection for 2024 for emerging NVMs, i.e., PCRAM and ReRAM, targets endurance on the order of 10^15 or more write cycles [42]. In previous work, several architecture-level techniques, such as ECP [43], dynamically replicated memory [44], SAFER [45], start-gap [46], security refresh [47], and FREE-p [48], have been designed to improve the memory module lifetime under the limited write endurance condition; a simplified wear-leveling sketch follows.
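To give a feel for table-free wear leveling, the sketch below follows the spirit of the start-gap scheme [46]: one spare line plus two registers (Start and Gap) gradually rotate the logical-to-physical mapping so that writes spread across all lines. The register-update rule and the gap-move interval shown here are a simplified reconstruction for illustration only, not the exact published algorithm.

```cpp
#include <cstddef>

// Simplified start-gap-style remapping over N logical lines stored in
// N+1 physical lines (the extra line is the moving gap).
class StartGapRemap {
public:
    explicit StartGapRemap(size_t n, size_t gap_interval = 100)
        : n_(n), start_(0), gap_(n), writes_(0), interval_(gap_interval) {}

    // Logical-to-physical translation.
    size_t translate(size_t logical) const {
        size_t pa = (logical + start_) % n_;
        return (pa >= gap_) ? pa + 1 : pa;   // skip over the gap line
    }

    // Call on every write; every 'interval_' writes the gap moves by one line,
    // which requires one background line copy in the real device (not shown).
    void on_write() {
        if (++writes_ % interval_ != 0) return;
        if (gap_ > 0) {
            // device copies physical[gap_-1] -> physical[gap_]
            --gap_;
        } else {
            // device copies physical[n_] -> physical[0], then the gap wraps
            gap_ = n_;
            start_ = (start_ + 1) % n_;
        }
    }

private:
    size_t n_, start_, gap_, writes_, interval_;
};
```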

Chapter 3

Related Work

Before we start to describe our contributions, we first list the previous research efforts that are eNVM-related. We summarize them into three categories.

3.1 Previous Work on Circuit Level

Many modeling tools have been developed during the last decade to enable system-level design exploration for SRAM- or DRAM-based caches and memory. For example, CACTI [49,50] is a tool that has been widely used in the computer architecture community to estimate the performance, energy, and area of SRAM and DRAM caches. Evans and Franzon [51] developed an energy model for SRAMs and used it to predict an optimum organization for caches. eCACTI [52] incorporated a leakage power model into CACTI. Muralimanohar et al. [53] modeled large-capacity caches through the use of an interconnect-centric organization composed of mats and request/reply H-tree networks.

In addition, CACTI has also been extended to evaluate the performance, energy, and area of STTRAM [54], PCRAM [55,56], cross-point ReRAM [57], and NAND flash [58]. However, as CACTI was originally designed to model an SRAM-based cache, some of its fundamental assumptions do not match actual NVM circuit implementations, and thus the NVM array organization modeled in these CACTI-like estimation tools deviates from the NVM chips that have been fabricated.


3.2 Previous Work on Architecture Level

Several architectural techniques have been proposed mainly to address the eNVM write issues. In order to alleviate the long write latency, Zhou et al. [38] applied an early write termination technique to reduce STTRAM write energy. Similarly, Qureshi et al. [39] introduced a write pausing technique for PCRAM. Wu et al. [59] proposed a data migration scheme for a hybrid cache architecture to reduce the number of eNVM write operations. In order to extend the eNVM lifetime, there are also various architectural proposals including ECP [43], SAFER [45], FREE-p [48], and dynamically replicated memory [44]. These architectural techniques add different types of data redundancy to tolerate the access errors caused by the limited eNVM write endurance. Furthermore, the limited eNVM write endurance may be exploited by an attacker running a malicious application that damages the memory. To counter this issue, Qureshi et al. [46] proposed randomized start-gap wear leveling, and Seong et al. [47] addressed this potential attack by applying security refresh.

3.3 Previous Work on Application Level

Many researchers have studied using non-volatile memories as file storage. The Non-Volatile Systems Laboratory at UCSD built a series of prototype storage systems to explore the future of fast storage. They first designed Moneta [37,60], a PCIe-attached storage array that comprises 64GB of emulated PCRAM storage and a carefully designed interface between hardware and software. Then, they built another storage array prototype called Onyx [61], which is one of the world's first PCRAM-based SSDs. Condit et al. proposed a non-volatile file system [36] that leverages byte-accessibility. Some other researchers have focused on how to manage single-level cells (SLC) and multi-level cells (MLC) dynamically. Kgil et al. [62] studied a NAND flash-based disk cache that dynamically manages NAND flash pages as they approach their lifetime limit and hence increases the overall lifetime of the disk cache system. Lee et al. [63] proposed a flexible flash file system that dynamically leverages unused flash storage and uses SLC instead of MLC to improve the access latency.

This prior research focuses on NAND flash memory, which has the erase-before-write constraint. This constraint forces flash-based file systems to be designed around journaling, which quickly fills the entire device capacity. Qureshi et al. [64] demonstrated a morphable memory system, a PCRAM-based main memory space with both MLC and SLC regions that uses memory monitoring to determine the ratio between them.

In addition, using non-volatile memories for checkpointing has also been investigated. Wang et al. [65] proposed and designed a non-volatile flip-flop structure for checkpointing processors in energy-harvesting applications. Volos et al. devised a simple interface called Mnemosyne [66] for programming with persistent memory. By leveraging the non-volatility property, Mnemosyne ensures consistency in the presence of failures.

Considering that some non-volatile memory technologies such as STTRAM and ReRAM have relatively better write performance, using them as on-chip caches can also bring benefits. Rasquinha et al. [67] and Sun et al. [68] designed STTRAM-based last-level caches targeting reduced write energy and mitigated write errors, respectively. Smullen et al. [69] investigated the trade-off among non-volatility, speed, and energy efficiency for STTRAM caches. Finally, Wu et al. [59] studied hybrid cache hierarchies that combine various volatile and non-volatile memory technologies.

Chapter 4

Circuit-Level: eNVM Modeling

As the ultimate goal of this eNVM research is to devise a universal memory hierarchy as shown in Figure 1.1, each of these eNVM technologies has to cover a wide design space that spans from highly latency-optimized microprocessor caches to highly density-optimized secondary storage. Specialized peripheral circuitry is therefore required for each optimization target. However, since few of these eNVM technologies are mature so far, only a limited number of prototype chips have been demonstrated, and they cover only a small portion of the entire design space. Circuit-level estimation models are thus necessary to facilitate eNVM research by predicting eNVM performance, energy consumption, and chip area without the heavy effort of building prototypes. Hence, as the first step of this eNVM research, we develop NVSim, a circuit-level model for eNVM performance, energy, and area estimation. As an extension of CACTI [50], NVSim supports not only SRAM and DRAM, but also NAND flash, STTRAM, PCRAM, and ReRAM.

4.1 NVSim Project Overview

The main goals of developing NVSim tool are:

• Estimate the access time, access energy, and silicon area of NVM chips with a given organization and specific design options before committing to actual fabrication;


• Explore the NVM chip design space to find the chip organization and design options that achieve the best performance, energy, or area;

• Find the NVM chip organization and design options that are optimized for one design metric while keeping the other metrics under given constraints.

We build NVSim using the same empirical modeling methodology as CACTI [49,50], but starting from a new framework and adding features specific to NVM technologies. Compared to CACTI, the NVSim framework includes the following new features:

• It allows the sense amplifiers to be moved from the inner memory subarrays to the outer bank level and factored out, improving the overall area efficiency of the memory module;

• It provides more flexible array organizations and data activation modes by considering any combinations of memory data allocation and address distribution;

• It models various types of data sensing schemes instead of only the voltage-sensing scheme;

• It allows memory banks to be formed in a bus-like manner rather than the H-tree manner only;

• It provides multiple buffer design options instead of only the latency-optimized option based on logical effort;

• It models the cross-point memory cells rather than MOS-accessed memory cells only;

• It considers the subarray size limit by analyzing the current sneak path;

• It allows advanced target users to redefine memory cell properties by providing a customization interface.

Since one of the design goals of NVSim is to facilitate architecture-level eNVM research, the accuracy of NVSim has to be sufficiently high. For this purpose, NVSim is validated against several industry prototype chips, with errors within 30%.


Figure 4.1: An example of the memory array organization modeled in NVSim: a hierarchical memory organization includes banks, mats, and subarrays with decoders, multiplexers, sense amplifiers, and output drivers.

4.2 NVSim Framework

The framework of NVSim is modified on the basis of CACTI [50, 70]. We add several new features, such as more flexible data activation modes and alternative bank organizations.

4.2.1 Device Model

NVSim uses device data from the ITRS report [42] and the MASTAR tool [71] to obtain the process parameters. NVSim covers the process nodes from 22nm to 180nm and supports 3 transistor types, which are High Performance (HP), Low Operating Power (LOP), and Low Stand-by Power (LSTP).

4.2.2 Array Organization

Figure 4.1 shows the array organization. There are 3 hierarchy levels in such organization, which are bank, mat, and subarray. Basically, the descriptions of these levels are:

• Bank is the top-level structure modeled in NVSim. One non-volatile memory chip can have multiple banks. The bank is a fully-functional memory unit, and it can be operated independently. In each bank, multiple mats are connected together in either H-tree or bus-like manner.

• Mat is the building block of a bank. Multiple mats in a bank operate simultaneously to fulfill a memory operation. Each mat consists of multiple subarrays and one predecoder block.

• Subarray is the elementary structure modeled in NVSim. Every subarray contains a set of peripheral circuitry including row decoders, column multiplexers, and output drivers.

Conventionally, sense amplifiers are integrated on the subarray level as modeled in CACTI [50, 70]. However, in NVSim model, sense amplifiers can be placed either on the subarray level or the mat level.

4.2.3 Memory Bank Type

For practical memory designs, memory cells are grouped together to form memory modules of different types. For instance,

• The main memory is a typical random-access memory (RAM), which takes the address of data as input and returns the content of data;

• The set-associative cache contains two separate RAMs (data array and tag array), and can return the data if there is a cache hit by the given set address and tag;

• The fully-associative cache usually contains a content-addressable memory (CAM).

To cover all possible memory designs, we model 5 types of memory banks in NVSim: one for RAM, one for CAM, and three for set-associative caches with different access manners. The functionalities of these 5 types of memory banks are listed as follows:

1. RAM: Output the data content at the I/O interface given the data address.

2. CAM: Output the data address at the I/O interface given the data content if there is a hit.

3. Cache with normal access: Start to access the cache data array and tag array at the same time; the data content is temporarily buffered in each mat; if there is a hit, the cache hit signal generated from the tag array is routed to the proper mats, and the content of the desired cache line is output to the I/O interface.

Table 4.1: The initial number of wires in each routing group

Memory type                            Address wires (NAW)   Broadcast data wires (NBW)   Distributed data wires (NDW)
Random-access memory (RAM)             log2(Nblock)          0                            Wblock
Content-addressable memory (CAM)       0                     Wblock                       log2(Nblock)
Cache, normal access, data array       log2(Nblock/A)        log2(A)                      Wblock
Cache, normal access, tag array        log2(Nblock/A)        Wblock                       A
Cache, sequential access, data array   log2(Nblock)          0                            Wblock
Cache, sequential access, tag array    log2(Nblock/A)        Wblock                       A
Cache, fast access, data array         log2(Nblock/A)        0                            Wblock × A
Cache, fast access, tag array          log2(Nblock/A)        Wblock                       A

4. Cache with sequential access: Access the cache tag array at first; if there is a hit, then access the cache data array with the set address and the tag hit information, and finally output the desired cache line to the I/O interface.

5. Cache with fast access: Access the cache data array and tag array simultaneously; read the entire set content from the mats to the I/O interface; selectively output the desired cache line if there is a cache hit signal generated from the tag array.

4.2.4 Activation Mode

We model the array organization and the data activation modes using eight parameters, which are

• NMR: number of rows of mat arrays in each bank;

• NMC : number of columns of mat arrays in each bank;

• NAMR: number of active rows of mat arrays during data accessing;

• NAMC : number of active columns of mat arrays during data accessing;

• NSR: number of rows of subarrays in each mat;

• NSC : number of columns of subarrays in each mat;

• NASR: number of active rows of subarrays during data accessing;

• and NASC : number of active columns of subarrays during data accessing.

The values of these parameters are all constrained to be powers of two. NMR and NMC define the number of mats in a bank, and NSR and NSC define the number of subarrays in a mat. NAMR, NAMC, NASR, and NASC define the activation patterns, and they can take any values up to NMR, NMC, NSR, and NSC, respectively. In contrast, the limited array organizations and data activation patterns in CACTI result from several constraints on these parameters, such as NAMR = 1, NAMC = NMC, and NSR = NSC = NASR = NASC = 2. NVSim supports these flexible activation patterns and is thereby able to model sophisticated memory accessing techniques, such as single subarray activation [72]. The sketch below summarizes the parameters and their constraints.
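As a concrete illustration, the sketch below captures the eight parameters in a small C++ structure along with the constraints stated above (power-of-two values; active counts no larger than the totals) and the number of mats and subarrays touched per access. The structure and helper names are illustrative assumptions, not NVSim's actual data structures.

```cpp
// Eight array-organization / activation-mode parameters (illustrative only).
struct ActivationMode {
    unsigned nmr, nmc;    // mat rows / columns per bank      (NMR, NMC)
    unsigned namr, namc;  // active mat rows / columns        (NAMR, NAMC)
    unsigned nsr, nsc;    // subarray rows / columns per mat  (NSR, NSC)
    unsigned nasr, nasc;  // active subarray rows / columns   (NASR, NASC)

    static bool isPow2(unsigned x) { return x && !(x & (x - 1)); }

    // All parameters are powers of two; active counts never exceed totals.
    bool valid() const {
        const unsigned v[8] = {nmr, nmc, namr, namc, nsr, nsc, nasr, nasc};
        for (unsigned x : v) if (!isPow2(x)) return false;
        return namr <= nmr && namc <= nmc && nasr <= nsr && nasc <= nsc;
    }

    unsigned activeMatsPerAccess() const      { return namr * namc; }
    unsigned activeSubarraysPerMat() const    { return nasr * nasc; }
};

// Example (assumed subarray numbers): the 4x4 mat organization of Chapter 4.2.5
// with two mat rows and two mat columns active per access.
//   ActivationMode m{4, 4, 2, 2, 2, 2, 1, 1};   // m.valid() == true
```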

4.2.5 Routing to Mats

In order to route the data and address signals from the I/O port to the edges of the memory mats, and then from each mat to the edges of its memory subarrays, we divide all the interconnect wires into three categories: Address Wires, Broadcast Data Wires, and Distributed Data Wires. Depending on the memory module type and the activation mode, the initial number of wires in each group is assigned according to the rules listed in Table 4.1. We use the term block to refer to the memory words in RAM and CAM designs and the cache lines in cache designs. In Table 4.1, Nblock is the number of blocks, Wblock is the block size, and A is the associativity in cache designs. The number of Broadcast Data Wires is always kept unchanged, the number of Distributed Data Wires is cut in half at each routing point where data are merged, and the number of Address Wires is decremented by one at each routing point where data are multiplexed.

We use the case of the cache bank with normal access to demonstrate how the wires are routed from the I/O port to the edges of the mats. For simplicity, we suppose the data array and the tag array are two separate modules. While the data and tag arrays usually have different mat organizations in practice, we use the same 4x4 mat organization for both for demonstration purposes, as shown in Figure 4.2 and Figure 4.3. The total 16 mats are


Figure 4.2: The example of the wire routing in a 4x4 mat organization for the data array of an 8-way 1MB cache with 64B cache lines.


Figure 4.3: The example of the wire routing in a 4x4 mat organization for the tag array of an 8-way 1MB cache with 64B cache lines.

positioned in a 4x4 formation and connected by a 4-level H-tree. Therefore, NMR and NMC are 4. As an example, we use the activation mode in which two rows and two columns of the mat array are activated for each data access, and the activation groups are Mat {0, 2, 8, 10},

Mat {1, 3, 9, 11}, Mat {4, 6, 12, 14}, and Mat {5, 7, 13, 15}. Thereby, NAMR and NAMC are 2. In addition, we set the cache line size (block size) to 64B, the cache associativity to A = 8, and the cache bank capacity to 1MB, so that the number of cache lines (blocks)

is Nblock = 8M/512 = 16384, the block size in the data array is Wblock,data = 512, and the

block size in the tag array is Wblock,tag = 16 (assuming 32-bit addressing and labeling dirty block with one bit).

According to Table 4.1, the initial number of address wires (NAW) is log2(Nblock/A) = 11

(NBW,data) is log2A = 3, which is used to transit the tag hit signals from the tag array to the corresponding mats in the data array; the initial number of distributed data wires

(NDW,data) is Wblock,data = 512, which is used to output the desired cache line from the mats to the I/O port. For tag array, the broadcast data wire (NBW,tag) is Wblock,tag = 16, which is sent from the I/O port to each of the mat in the tag array; the initial number of distributed data wires (NDW,tag) is A = 8, which is used to collect the tag hit signals from each mat to the I/O port and then send to the data array after a 8-to-3 encoding process. From the I/O port to the edges of the mats, the numbers of wires in the three categories are changed as follows and demonstraed in Figure 4.2 and Figure 4.3, respectively.

1. At node A, the activated mats are distributed in both the upper and the bottom parts, so node A is a merging node. As per the routing rule, the address wires and broadcast data wires remain the same but the distributed data wires are cut in half. Thus, the

wire segment between node A and B have: NAW = 11, NBW,data = 3, NDW,data = 256,

NBW,tag = 16, and NDW,tag = 4.

2. Node B is again a merging node. Thus, the wire segment between node B and C have:

NAW = 11, NBW,data = 3, NDW,data = 128, NBW,tag = 16, and NDW,tag = 2.

3. At node C, the activated mats are allocated only in one side, either from Mat 0/1 or from Mat 4/5, so Node C is a multiplexing node. As per the routing rule, the distributed data wires and broadcast data wires remain the same but the address wires is decremented by 1. Thus, the wire segment between node C and D have:

NAW = 10, NBW,data = 3, NDW,data = 128, NBW,tag = 16, and NDW,tag = 2.

4. Finally, node D is another multiplexing node. Thus, Thus, the wire segments at the

mat edges have: NAW = 9, NBW,data = 3, NDW,data = 128, NBW,tag = 16, and

NDW,tag = 2.

Thereby, each mat in the data array takes the input of a 9-bit set address and a 3-bit tag hit signals (which can be treated as the block address in a 8-way associative set), and it generates the output of a 128-bit data. A group of 4 data mats provides the desired output 23

Subarray 0 Subarray 1 Subarray 2 Subarray 3

Sense Sense Sense Sense amplifier amplifier Predecoder amplifier amplifier

H-tree Full-swing signals

Sense Sense Sense Sense amplifier amplifier amplifier amplifier

Subarray 4 Subarray 5 Subarray 6 Subarray 7

Mat

Figure 4.4: An example of mat using internal sensing and H-tree routing. of a 512-bit (64B) cache line, and four such groups cover the entire 11-bit set address space. On the other hand, each mat in the tag array takes the input of a 9-bit set address and a 16-bit tag, and it generates a 2-bit hit signals (01 or 10 for hit and 00 for miss). A group of 4 tag mats concatenate their hit signals and provides the information whether a 16-bit tag hits in a 8-way associated cache with a 9-bit address space, and four such groups extend the address space from 9-bit to the desired 11-bit. Other configurations in Table 4.1 can be explained in the similar manner.

4.2.6 Routing to Subarrays

The interconnect wires from mat to the edges of memory subarrays are routed using the same H-tree organization as shown in Figure 4.4, and its routing strategy is the same wire partitioning rule described in Chapter 4.2.5. However, in NVSim, we provide an option of building mat using a bus-like routing organization as illustrated in Figure 4.5. The wire partitioning rule described in Chapter 4.2.5 can also be applied to the bus-like organization with a few extension. For example, a multiplexing node with a fanout of N decrements the number of address wires by log2N instead of 1; a merging node with a fanout of N divides the number of distributed data wires by N instead of 2. 24

Partial- swing Subarray 0 Subarray 1 Subarray 2 signals Subarray 3

Subarray 4 Subarray 5 Subarray 6 Subarray 7

Predecoder Sense Sense amplifier amplifier Mat

Figure 4.5: An example of mat using external sensing and bus-like routing.

Furthermore, the default setting of including sense amplifiers in each subarray can cause a dominant portion of the total array area. As a result, for high-density memory module designs, NVSim provides an option of moving the sense amplifiers out of the subarray and using external sensing. In addition, a bus-like routing organization is designed to associate with the external sensing scheme. Figure 4.4 shows a common mat using H-tree organization to connect all the sense amplifier-included subarrays together. In contrast, the new external sensing scheme is illustrated in Figure 4.5. In this external sensing scheme, all the sense amplifiers are lo- cated at the mat level and the output signals from each sense amplifier-free subarray are partial-swing. It is obvious that the external sensing scheme has much higher area efficiency compared to its internal sensing counterpart. However, as a penalty, sophisticated global interconnect technologies, such as repeater inserting, cannot be used in the external sens- ing scheme since all the global signals are partial-swing before passing through the sense amplifiers.

4.3 Area Model

Since NVSim estimates the performance, energy, and area of non-volatile memory modules, the area model is an essential component of NVSim especially given the facts that inter- connect wires contribute a large portion of total access latency and access energy and the geometry of the module becomes highly important. In this section, we describe the NVSim 25

Storage element

Word line

Bit line

Source line

Figure 4.6: Conceptual view of a MOS-accessed cell (1T1R) and its connected word line, bit line, and source line. area model from the memory cell level to the bank level in details.

4.3.1 Cell Area Estimation

Three types of memory cells are modeled in NVSim: MOS-accessed, cross-point, and NAND-string.

MOS-Accessed Cell

MOS-accessed cell corresponds to the typical 1T1R (1-transistor-1-resistor) structure used by many NVM chips [4, 5, 11, 13, 17, 19, 73], in which an NMOS access device is connected in series with the non-volatile storage element (i.e., the MTJ in STT-RAM, GST in PCRAM, and metal oxide in ReRAM), as shown in Figure 4.6. Such an NMOS device turns the access path to the storage element on or off by tuning the voltage applied to its gate. The MOS-accessed cell usually has the best isolation among neighboring cells due to the properties of the MOSFET. In MOS-accessed cells, the size of the NMOS is bounded by the current needed by the write operation: the NMOS in each cell needs to be sufficiently large so that it is capable of driving enough write current.

The driving current of the NMOS, I_DS, can be estimated to first order as follows¹,

    I_DS = K (W/L) [ (V_GS - V_TH) V_DS - V_DS^2 / 2 ]                    (4.1)

if the NMOS is working in the linear region, or calculated by

    I_DS = (K/2) (W/L) (V_GS - V_TH)^2 (1 + λ V_DS)                       (4.2)

if the NMOS is working in the saturation region. Hence, no matter in which region the NMOS is working, its current driving ability is proportional to its width-to-length (W/L) ratio², which determines the NMOS size. To achieve high cell density, we model the MOS-accessed cell area by referring to DRAM design rules [74]. As a result, the cell size of a MOS-accessed cell in NVSim is calculated as follows,

    Area_cell,MOS-accessed = 3 (W/L + 1) F^2                              (4.3)

in which the width-to-length ratio (W/L) is determined by Equation 4.1 or Equation 4.2, and the required write current is configured as one of the input values of NVSim. In NVSim, we also allow advanced users to override this cell size calculation by directly importing a user-defined cell size.
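As a worked example of Equations 4.1-4.3, the sketch below sizes the access NMOS from a required write current (assuming saturation-region operation) and then derives the cell area; the device parameter values are illustrative placeholders, not NVSim defaults.

    #include <cstdio>

    // A minimal sketch of Equations 4.2 and 4.3: size the access NMOS so that it can
    // drive the required write current, then derive the MOS-accessed cell area.
    // All device parameters below are illustrative placeholders.
    int main() {
        const double K      = 300e-6;  // transconductance parameter (A/V^2), placeholder
        const double Vgs    = 1.1;     // gate-source voltage (V)
        const double Vth    = 0.35;    // threshold voltage (V)
        const double Vds    = 1.0;     // drain-source voltage (V)
        const double lambda = 0.1;     // channel-length modulation (1/V)
        const double Iwrite = 200e-6;  // required write current (A), an input to NVSim

        // Saturation-region current for a unit W/L (Equation 4.2), then scale W/L.
        double idsPerWL = 0.5 * K * (Vgs - Vth) * (Vgs - Vth) * (1.0 + lambda * Vds);
        double wOverL   = Iwrite / idsPerWL;

        // Cell area in units of F^2 (Equation 4.3), following DRAM design rules.
        double areaF2 = 3.0 * (wOverL + 1.0);

        std::printf("required W/L = %.2f, cell area = %.2f F^2\n", wOverL, areaF2);
        return 0;
    }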

Cross-Point Cell

Cross-point cell corresponds to the 1D1R (1-diode-1-resistor) [1, 16, 23, 34, 75] or 0T1R (0-transistor-1-resistor) [22, 26, 30] structures used by several high-density NVM chips recently. Figure 4.7 shows a cross-point array without diodes (i.e., the 0T1R structure). For the 1D1R structure, a diode is inserted between the word line and the storage element. Such cells either rely on the one-way connectivity of the diode (i.e., 1D1R) or leverage the materials' non-linearity (i.e., 0T1R) to control the memory access path.

¹Equations 4.1 and 4.2 are for long-channel drift/diffusion devices; the equations are subject to change depending on the technology, though the proportional relationship between the current and W/L still holds for very advanced technologies.
²Usually, the transistor length (L) is fixed as the minimal feature size, and the transistor width (W) is adjustable.

Figure 4.7: Conceptual view of a cross-point cell array without diode (0T1R) and its connected word lines and bit lines.

As illustrated in Figure 4.7, the widths of the word lines and bit lines can be the minimal value of 1F and the spacing in each direction is also 1F; thus, the cell size of each cross-point cell is,

    Area_cell,cross-point = 4 F^2                                         (4.4)

Compared to MOS-accessed cells, cross-point cells have worse cell isolation but provide a way of building high-density memory chips because they have much smaller cell sizes. In some cases, the cross-point cell size is constrained by the diode due to its limited current density; therefore, NVSim allows users to override the default 4F^2 setting.

NAND-String Cell

The NAND-string cell is modeled particularly for NAND flash memory. In a NAND-string cell, a group of floating gates are connected in series, and two ordinary gates with contacts are added at the end-points of the string, as shown in Figure 4.8. Since the area of each floating gate can be minimized to 2F×2F, the total area of a NAND-string cell is,

    Area_cell,NAND-string = 2 (2N + 5) F^2                                (4.5)

where N is the number of floating gates in a string, and we assume the addition of two gates and two contacts adds 5F to the total string length.

Figure 4.8: The layout of the NAND-string cell modeled in NVSim.

4.3.2 Peripheral Circuitry Area Estimation

Besides the area occupied by memory cells, a large portion of the memory chip area is contributed by the peripheral circuitry. In NVSim, we have peripheral circuitry components such as row decoders, prechargers, and column multiplexors at the subarray level, predecoders at the mat level, and sense amplifiers and write drivers at either the subarray or mat level depending on whether the internal or external data sensing scheme is used. In addition, at every level, interconnect wires might occupy extra silicon area if the wires are relayed using repeaters. In order to estimate the area of each peripheral circuitry component, we delve into the actual gate-level logic design, similarly to CACTI [50]. However, in NVSim, we size transistors in a more generalized way than CACTI does. The sizing philosophy of CACTI is to use logical effort [76] to size the circuits for minimum delay. NVSim's goal is to estimate the properties of a broad range of NVM chips, and these chips might be optimized for density or energy consumption instead of minimum delay; thus we provide optional sizing methods rather than only applying logical effort.

Figure 4.9: Transistor sizings: (a) latency-optimized (sizes 1-4-16-64-256-1024-4096; total delay = 30 units, total area = 1365 units); (b) balanced (sizes 1-4-16-64-4096; total delay = 80 units, total area = 85 units); (c) area-optimized (sizes 1-64-4096; total delay = 130 units, total area = 65 units).

In addition, for some peripheral circuitry in NVM chips, the size of some transistors is determined by their required driving current instead of their capacitive load, which violates the basic assumptions of logical effort. Therefore, we offer three transistor sizing choices in the area model of NVSim: one optimizing latency, one optimizing area, and another balancing latency and area. An example is illustrated in Figure 4.9, demonstrating the different sizing methods when an output buffer driving 4096 times the capacitance of a minimum-sized inverter is to be designed. In a latency-optimized buffer design, the number of stages and all of the inverter sizes in the inverter chain are calculated by logical effort to achieve the minimum delay (30 units) while paying a huge area penalty (1365 units). In an area-optimized buffer design, there are only two stages of inverters, and the size of the last stage is determined by the minimum driving-current requirement. This type of buffer has the minimum area (65 units) but is much slower than the latency-optimized buffer. The balanced option determines the size of the last-stage inverter by its driving-current requirement and calculates the sizes of the other inverters by logical effort. This results in a balanced delay and area metric.
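The unit-delay and unit-area numbers quoted in Figure 4.9 can be reproduced with a first-order model in which each inverter contributes its size to the area and each stage contributes (fanout + 1) delay units; the sketch below is such a reconstruction for the three chains of Figure 4.9, not NVSim's actual sizing code.

    #include <cstdio>
    #include <vector>

    // First-order evaluation of an inverter chain driving a large load: each
    // inverter adds its own size to the area, and each stage contributes
    // (fanout + 1) delay units, where the "+1" stands for the parasitic delay.
    static void evaluate(const char* name, const std::vector<double>& chain) {
        // 'chain' lists the inverter sizes followed by the final load (here 4096).
        double delay = 0.0, area = 0.0;
        for (size_t i = 0; i + 1 < chain.size(); ++i) {
            delay += chain[i + 1] / chain[i] + 1.0;  // stage effort + parasitic
            area  += chain[i];                       // the load itself is not counted
        }
        std::printf("%-18s delay = %5.0f units, area = %5.0f units\n", name, delay, area);
    }

    int main() {
        evaluate("latency-optimized", {1, 4, 16, 64, 256, 1024, 4096});
        evaluate("balanced",          {1, 4, 16, 64, 4096});
        evaluate("area-optimized",    {1, 64, 4096});
        return 0;
    }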

4.4 Timing and Power Models

As an analytical modeling tool, NVSim uses RC analysis to model the timing and power. In this section, we describe how resistances and capacitances are estimated in NVSim and how they are combined to calculate the delay and power consumption.

4.4.1 Generic Timing and Power Estimation

In NVSim, we consider the wire resistance and wire capacitance from interconnects, the turn-on resistance, switching resistance, gate and drain capacitances from transistors, and the equivalent resistance and capacitance from memory storage elements (e.g., the MTJ in STT-RAM and GST in PCRAM). The methods of estimating wire and parasitic resistances and capacitances are modified from the previous versions of CACTI [49, 50] with several enhancements. The enhancements include updating the transistor models with the latest ITRS report [42], considering the thermal impact on the wire resistance calculation, adding the drain-to-channel capacitance in the drain capacitance calculation, and so on. We build a look-up table to model the equivalent resistance and capacitance of the memory storage elements, since they are properties of each specific non-volatile memory technology. Since NVSim is a system-level estimation tool, we only model the static behavior of the storage elements and record the equivalent

resistances and capacitances of the RESET and SET states (i.e., R_RESET, R_SET, C_RESET, and C_SET)³. After calculating the resistances and capacitances of all nodes, the delay of each logic component is calculated by using a simplified version of Horowitz's timing model [77] as follows,

    Delay = τ · sqrt( (ln(1/2))^2 + αβ )                                  (4.6)

where α is the slope of the input, β = g_m R is the input transconductance normalized by the output resistance, and τ = RC is the RC time constant.

³One exception is that NVSim records the detailed I-V curves for cross-point ReRAM cells without diodes, because we need to leverage the non-linearity of the storage element.

The dynamic energy and leakage power consumptions can be modeled as

    Energy_dynamic = C V_DD^2                                             (4.7)

    Power_leakage = V_DD I_leak                                           (4.8)

where we model both gate leakage and sub-threshold leakage currents in Ileak. The overall memory access latency and energy consumption are estimated by combining all the timing and power values of circuit components together. NVSim follows the same methodology that CACTI [50] uses with minor modifications.
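A minimal sketch of how Equations 4.6-4.8 combine in practice is shown below; the resistance, capacitance, and transconductance values are placeholders chosen only for illustration and are not NVSim defaults.

    #include <cmath>
    #include <cstdio>

    // A minimal sketch of the generic timing/power estimation (Equations 4.6-4.8).
    double horowitzDelay(double r, double c, double alpha, double gm) {
        const double tau    = r * c;        // RC time constant
        const double beta   = gm * r;       // normalized input transconductance
        const double lnHalf = std::log(0.5);
        return tau * std::sqrt(lnHalf * lnHalf + alpha * beta);   // Equation 4.6
    }

    int main() {
        const double R = 10e3, C = 20e-15;        // 10 kOhm, 20 fF (placeholders)
        const double alpha = 0.5, gm = 50e-6;     // input slope, transconductance
        const double Vdd = 1.0, Ileak = 1e-9;     // supply voltage and total leakage current

        double delay   = horowitzDelay(R, C, alpha, gm);
        double dynamic = C * Vdd * Vdd;           // Equation 4.7
        double leakage = Vdd * Ileak;             // Equation 4.8

        std::printf("delay = %.3g s, dynamic energy = %.3g J, leakage = %.3g W\n",
                    delay, dynamic, leakage);
        return 0;
    }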

4.4.2 Data Sensing Models

Unlike the other peripheral circuitry, the sense amplifier is an analog design instead of a logic design. Thus, in NVSim, we develop a separate timing model for the data sensing schemes. Different sensing schemes have different impacts on the trade-off among performance, energy, and area. In NVSim, we consider three types of sensing schemes: current sensing, current-in voltage sensing, and voltage-divider sensing. In the current sensing scheme, as shown in Figure 4.10, the state of the memory cell (STT-RAM, PCRAM, or ReRAM) is read out by measuring the resulting current through the selected memory cell when a read voltage is applied: the current on the bit-line is compared to the reference current generated by reference cells, the current difference is amplified by current-mode sense amplifiers, and it is eventually converted to a voltage signal. Figure 4.11 demonstrates an alternative sensing method that applies a current source to the selected memory cell and senses the voltage via a voltage-mode sense amplifier.

The voltage-divider sensing scheme is implemented by introducing a resistor (R_x) in series with the memory cell, as illustrated in Figure 4.12. The resistance value is selected to achieve the maximum read sensing margin, and it is calculated as follows,

    R_x = sqrt( R_on × R_off )                                            (4.9)

where R_on and R_off are the equivalent resistance values of the memory cell in the LRS and HRS, respectively.

Figure 4.10: Analysis model for current sensing scheme.

Figure 4.11: Analysis model for current-in voltage sensing scheme.


Bitline RC Model

We model the bit-line RC delay analytically for each sensing scheme. The most significant difference between current-mode sensing and voltage-mode sensing is that the input resistance of ideal current-mode sensing is zero while that of ideal voltage-mode sensing is infinite. Moreover, the most significant difference between current-in voltage sensing and voltage-divider sensing is that the internal resistance of an ideal current source is infinite, while the resistor R_x serving as a voltage divider can be treated as the internal resistance of a voltage source. The delays of current-in voltage sensing, voltage-divider sensing, and current sensing are given by the following equations, using Seevinck's delay expression [78]:

    δt_v  = (R_T C_T / 2) × ( 1 + 2 R_B / R_T )                           (4.10)

    δt_vd = (R_T C_T / 2) × ( 1 + 2 (R_B || R_x) / R_T )                  (4.11)

    δt_i  = (R_T C_T / 2) × ( (R_B + R_T / 3) / (R_B + R_T) )             (4.12)

where R_T and C_T are the total line resistance and capacitance, R_B is the equivalent resistance of the memory cell, and R_x is the resistance of the voltage divider.

Figure 4.12: Analysis model for voltage-divider sensing scheme.

In these equations, δt_v, δt_vd, and δt_i are the RC delays of the current-in voltage sensing, voltage-divider sensing, and current sensing schemes, respectively. R_x || R_B, instead of R_B, is used as the effective pull-down resistance in Equation 4.11 according to the transformation from a Thevenin equivalent to a Norton equivalent. Equations 4.10 and 4.11 show that voltage-divider sensing is faster than current-in voltage sensing, at the extra cost of fabricating a large resistor. Comparing Equation 4.12 with Equations 4.10 and 4.11, we can see that current sensing is much faster than both current-in voltage sensing and voltage-divider sensing, since the former delay is less than the intrinsic line delay R_T C_T / 2 while the latter delays are larger than R_T C_T / 2. The bit-line delay analytical models are verified by comparing them with HSPICE simulation results.
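The following sketch evaluates Equations 4.9-4.12 for one set of illustrative (not measured) resistance and capacitance values; it reproduces the ordering argued above, with the current-sensing delay below R_T C_T / 2 and the other two above it.

    #include <cmath>
    #include <cstdio>

    // A minimal sketch of the bit-line delay expressions (Equations 4.9-4.12).
    int main() {
        const double Ron = 5e3, Roff = 100e3;         // LRS / HRS cell resistances (placeholders)
        const double Rt  = 2e3,  Ct  = 100e-15;       // total line resistance / capacitance
        const double Rb  = Ron;                       // equivalent cell resistance being read

        double Rx   = std::sqrt(Ron * Roff);                          // Eq. 4.9
        double Rpar = (Rb * Rx) / (Rb + Rx);                          // Rb || Rx

        double dtV  = Rt * Ct / 2.0 * (1.0 + 2.0 * Rb / Rt);          // Eq. 4.10, current-in voltage
        double dtVd = Rt * Ct / 2.0 * (1.0 + 2.0 * Rpar / Rt);        // Eq. 4.11, voltage-divider
        double dtI  = Rt * Ct / 2.0 * (Rb + Rt / 3.0) / (Rb + Rt);    // Eq. 4.12, current sensing

        std::printf("Rx = %.1f Ohm\n", Rx);
        std::printf("delays: voltage %.3g s, voltage-divider %.3g s, current %.3g s\n",
                    dtV, dtVd, dtI);
        return 0;
    }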

Current-Voltage Converter Model

As shown in Figure 4.10, the current-voltage converter in our current-mode sensing scheme is actually the first-level sense amplifier, and the CACTI-modeled voltage sense amplifier is still kept in the bit-line model as the final stage of the sensing scheme. The current-voltage converter senses the current difference I1 - I2, which is then converted into a voltage difference V1 - V2. The required voltage difference produced by the current-voltage converter is set by default to 80mV. Although this value is the minimum sensible voltage difference of the CACTI-modeled voltage sense amplifier, advanced users can override it for a specific sense amplifier design. We refer to a previous current-voltage converter design [78]; its circuit schematic is shown in Figure 4.13. This sensing scheme is similar to the hybrid-I/O approach [79], which can achieve high-speed, robust sensing and low-power operation. To avoid unnecessary calculation, the current-voltage converter is modeled by directly using HSPICE-simulated values and building a look-up table of delay, dynamic energy, and leakage power.

Figure 4.13: The current-voltage converter modeled in NVSim.

4.4.3 Cell Switching Model

Different NVM technologies have their own specific switching mechanisms. Usually, the switching phenomenon involves magnetoresistive, phase-change, thermochemical, or electrochemical effects, and it cannot be estimated by RC analysis. Hence, the cell switching model in NVSim largely relies on the NVM cell definition. The predefined NVM cell switching properties include the SET/RESET pulse durations (i.e., t_SET and t_RESET) and the SET/RESET currents (i.e., I_SET and I_RESET) or voltages. NVSim does not model the dynamic behavior during the switching of the cell state; the switching latency (i.e., cell write latency) is directly the pulse duration, and the switching energy (i.e., cell write energy) is estimated using Joule's first law, that is,

    Energy_SET   = I_SET^2   R t_SET

    Energy_RESET = I_RESET^2 R t_RESET                                    (4.13)

in which the resistance value R can be the equivalent resistance of the corresponding SET or RESET state (i.e., R_SET or R_RESET). However, for NVM technologies that exhibit a threshold switching phenomenon (e.g., PCRAM and ReRAM), the resistance value R always equals the resistance of the low-resistance state. This is because when a voltage above a particular threshold is applied to these NVM cells in the high-resistance state, the resulting large electric fields greatly increase the electrical conductivity [80].
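A minimal sketch of the cell switching energy model (Equation 4.13), including the threshold-switching rule, is given below; the pulse currents, durations, and resistances are hypothetical cell-definition values, not data from any prototype.

    #include <cstdio>

    // Cell switching (write) energy following Equation 4.13.
    struct CellDef {
        double iSet, tSet;          // SET current (A) and pulse duration (s)
        double iReset, tReset;      // RESET current (A) and pulse duration (s)
        double rSet, rReset;        // equivalent resistances of the SET / RESET states
        bool   thresholdSwitching;  // true for PCRAM/ReRAM-like threshold behavior
    };

    double setEnergy(const CellDef& c) {
        return c.iSet * c.iSet * c.rSet * c.tSet;            // I_SET^2 * R * t_SET
    }

    double resetEnergy(const CellDef& c) {
        // For threshold-switching cells, the cell effectively conducts at the
        // low-resistance value once the threshold is exceeded, so R = R_SET.
        double r = c.thresholdSwitching ? c.rSet : c.rReset;
        return c.iReset * c.iReset * r * c.tReset;           // I_RESET^2 * R * t_RESET
    }

    int main() {
        CellDef pcram{150e-6, 150e-9, 300e-6, 40e-9, 5e3, 100e3, true};
        std::printf("SET energy = %.3g J, RESET energy = %.3g J\n",
                    setEnergy(pcram), resetEnergy(pcram));
        return 0;
    }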

Figure 4.14: The circuit schematic of the slow quench pulse shaper used in [1].

4.5 Miscellaneous Circuitry

Some specialized circuitry is required for certain types of NVMs. For instance, some PCRAM chips need a pulse shaper to form accurate SET and RESET pulses, and NAND flash and some PCRAM chips need a charge pump to generate the high-voltage power plane that is necessary for write operations.

4.5.1 Pulse Shaper

Some PCRAM chips need specialized circuits to handle their RESET and SET operations. Specific pulse shapes are required to heat up the GST quickly and to cool it down gradually, especially for SET operations. This pulse shaping requirement is met by using a slow-quench pulse shaper. As shown in Figure 4.14, the slow-quench pulse shaper is composed of an arbitrary slow-quench waveform generator and a write driver. In NVSim, the delay impact of the slow-quench shaper is neglected because it is already included in the RESET/SET calculation of the timing model. The energy impact of the shaper is modeled by adding an energy efficiency factor to the RESET/SET operation, whose default value we set to 35% [19]; it can be overridden by advanced users. The area of the slow-quench shapers is modeled by measuring the die photos [1, 19].

4.5.2 Charge Pump

The write operations of NAND flash and some PCRAM chips require voltages higher than the chip supply voltage. Therefore, a charge pump that uses capacitors as energy storage elements to create a higher voltage is necessary in a NAND flash chip design. In NVSim, we neglect the silicon area occupied by the charge pump, since this area can vary a lot depending on the underlying circuit design techniques and is relatively small compared to the cell array area in a large-capacity NAND chip. However, we model the energy dissipated by charge pumps during the program and erase operations in NVSim, because they contribute a considerable portion of the total energy consumption. The energy consumed by charge pumps is taken from an actual NAND flash chip design [81], which specifies that a conventional charge pump consumes 0.25µJ at a 1.8V supply voltage. We use this value as the default in NVSim.

4.6 Validation Result

NVSim is validated against NAND flash chips and several industrial prototype designs of STT-RAM [4], PCRAM [5, 73] and ReRAM [6] in terms of area, latency, and energy. We extract the information from real chip design specifications to set the input parameters required by NVSim, such as capacity, line size, technology node, and array organization.

4.6.1 NAND Flash Validation

It is challenging to validate the NAND flash model in NVSim since the public manufacturer datasheets do not disclose sufficient data on the operation latency and power consumption for validation purposes. Instead, Grupp et al. [3] report both latency and power consumption measurements of several commercial NAND flash chips from different vendors. Grupp's report does not include the NAND flash silicon area; hence, we estimate the actual NAND flash chip area by assuming an area efficiency of 90%. The comparison between the measurements [3] and the estimations given by NVSim is listed in Table 4.2.

Table 4.2: NVSim's NAND flash model validation with respect to a 50nm 2Gb NAND flash chip (B-SLC2) [3]

  Metric            Actual      Projected   Error
  Area              23.85mm2    22.61mm2    -5.20%
  Read latency      21µs        25.2µs      +20.0%
  Program latency   200µs       200.1µs     +0.1%
  Erase latency     1.25ms      1.25ms      +0.0%
  Read energy       1.56µJ      1.85µJ      +18.6%
  Program energy    3.92µJ      4.24µJ      +8.2%
  Erase energy      34.5µJ      36.0µJ      +4.3%

Table 4.3: NVSim's STT-RAM model validation with respect to a 65nm 64Mb STT-RAM prototype chip [4]

  Metric          Actual     Projected   Error
  Area            39.1mm2    38.05mm2    -2.69%
  Read latency    11ns       11.47ns     +4.27%
  Write latency   < 30ns     27.50ns     -
  Write energy    N/A        0.26nJ      -

4.6.2 STT-RAM Validation

We validate the STT-RAM model against a 65nm prototype chip [4]. We let 1 bank = 32×8 mats and 1 mat = 1 subarray to simulate the memory array organization. We also exclude the chip area of I/O pads and duplicated cells to make a fair comparison. As the write latency is not disclosed, we assume the write pulse duration is 20ns. The validation result is listed in Table 4.3.

4.6.3 PCRAM Validation

We first validate the PCRAM model against a 0.12µm MOS-accessed prototype. The array organization is configured to have 2 banks, each of which has 8×8 mats. Every mat contains only one subarray. Table 4.4 lists the validation result, which shows a 10% underestimation of area and a 6% underestimation of read latency. The projected write latency (the SET latency, as the worst case) is also consistent with the actual value. Another PCRAM validation is made against a 90nm diode-accessed prototype [1].

Table 4.4: NVSim's PCRAM model validation with respect to a 0.12µm 64Mb MOS-accessed PCRAM prototype chip [5]

  Metric          Actual      Projected   Error
  Area            64mm2       57.44mm2    -10.25%
  Read latency    70.0ns      65.93ns     -5.81%
  Write latency   > 180.0ns   180.17ns    -
  Write energy    N/A         6.31nJ      -

Table 4.5: NVSim's PCRAM model validation with respect to a 90nm 512Mb diode-selected PCRAM prototype chip [1]

  Metric          Actual     Projected   Error
  Area            91.50mm2   93.04mm2    +1.68%
  Read latency    78ns       59.76ns     -23.40%
  Write latency   430ns      438.55ns    +1.99%
  Write energy    54nJ       47.22nJ     -12.56%

4.6.4 ReRAM Validation

We validate the ReRAM model against a 180nm 4Mb HfO2-based MOS-accessed ReRAM prototype [6]. According to the disclosed data, the subarray size is configured to 128Kb. We further model a bank with 4×8 mats, and each mat contains a single subarray. The validation result is listed in Table 4.6. Note that the estimated chip area given by NVSim is much smaller than the actual value, since the prototype chip has SLC/MLC dual modes but the current version of NVSim does not model the MLC-related circuitry.

Table 4.6: NVSim's ReRAM model validation with respect to a 0.18µm 4Mb MOSFET-selected ReRAM prototype chip [6]

  Metric          Actual          Projected   Error
  Area⁴           187.69mm2       33.42mm2    -
  Read latency    7.2ns           7.72ns      +7.22%
  Write latency   0.3ns - 7.2ns   6.56ns      -
  Write energy    N/A             0.46nJ      -

⁴A large portion of the chip area is contributed by the MLC control and test circuits, which are not modeled in NVSim.

4.7 Summary

To enable the system-level design space exploration of eNVM technologies and to help computer architects leverage these emerging technologies, it is necessary to have a quick estimation tool. While abundant estimation tools are available as SRAM/DRAM design assistants, similar tools for eNVM designs are currently missing. Therefore, we build NVSim, a circuit-level model for NVM performance, energy, and area estimation, which supports various NVM technologies including STT-RAM, PCRAM, ReRAM, and conventional NAND flash. As an extension of CACTI, NVSim also models SRAM and DRAM in a more accurate way, since some false assumptions in CACTI are fixed in NVSim. The usage of NVSim can be divided into two categories:

• NVSim can be used to optimize eNVM designs toward a certain design metric;

• NVSim can also be used to estimate the performance, energy, and area before fabricating a real prototype chip, especially when the emerging NVM device technology is still under development and no standard exists yet.

Under a given process node, the tunable memory design parameters modeled in NVSim include, but are not limited to, the array structure, subarray size, sense amplifier design, write method, repeater design, and buffer design. If necessary, NVSim can also explore different types of transistor or wire models to get the best result. NVSim is implemented in C++ from scratch and contains more than 20,000 lines of source code. We use it for our later architecture- and application-level studies. The NVSim binary code can be downloaded at http://www.rioshering.com/nvsimwiki.

Chapter 5

Architecture-Level: Techniques for Alleviating eNVM Write Overhead

Compared to volatile memories such as SRAM and DRAM, eNVMs have a more stable data-keeping mechanism. Accordingly, it takes a longer time and consumes more energy to overwrite the existing data. Hence, the major drawbacks of eNVMs are their relatively long write latency and high write energy. To alleviate this write overhead, we propose two architecture-level techniques. For these studies, NVSim is used to obtain the eNVM cache/memory parameters such as access latency, access energy, and occupied chip area.

5.1 Directly Replacing SRAM Caches

Typically, an eNVM cache has much higher cell density than its SRAM counterpart. This property makes it feasible to replace an SRAM cache with a much larger eNVM substitute occupying the same silicon area. Using STTRAM as an example, the circuit-level comparison is listed in Table 5.1. In this example, the STTRAM cache is 4 times denser than the SRAM cache. Due to the similar silicon area, the read latency and read energy of the STTRAM cache are close to those of the SRAM cache, but the write latency and the write energy of STTRAM become the major drawbacks. This conclusion holds for other eNVM technologies such as PCRAM and ReRAM.


Table 5.1: Area, access time, and energy comparison between SRAM and STTRAM caches within similar silicon area (65nm technology)

  Cache size      128KB SRAM   512KB STTRAM
  Area            3.62mm2      3.30mm2
  Read latency    2.252ns      2.318ns
  Write latency   2.264ns      11.024ns
  Read energy     0.895nJ      0.858nJ
  Write energy    0.797nJ      4.997nJ

Because of these two drawbacks, we can draw two intuitions about directly replacing SRAM caches with eNVM ones that have similar area but larger capacity:

• eNVM caches with larger capacity can reduce the cache miss rate. However, the long latency associated with the write operations to the eNVM cache has a negative impact on the performance. When the write intensity is high, the benefits caused by miss rate reductions could be offset by the long latency of eNVM write operations and eventually result in performance degradation.

• The non-volatility of eNVM caches can greatly reduce the leakage power. However, when the write intensity is high, the dynamic power increases significantly because of the high write energy. Therefore, the total amount of power savings is reduced.

These two conclusions show that, if we directly replace SRAM caches with eNVM caches using the “same area” strategy, the long latency and the high energy consumption of eNVM write operations can offset the performance and the power benefits brought by eNVM caches when the cache write intensity is high. Hence, novel architecture techniques are required. In this dissertation, two techniques are proposed:

• Read-preemptive write buffer

• Hybrid SRAM-eNVM cache

The details of these two techniques are described in Chapter 5.2 and Chapter 5.3, respectively.

5.2 Read-Preemptive Write Buffer

As mentioned, the long eNVM write latency has a serious impact on the performance and the power consumption. A write buffer is a common technique to hide long write latency. However, a normal write buffer is not effective enough to hide the extra-long latency of eNVM write operations, because multiple read operations can be blocked by an ongoing write operation, which drastically harms the overall performance. To solve this problem, we design a priority policy that grants read operations the right to cancel an ongoing write operation under certain conditions. The scheduling scheme is demonstrated in Figure 5.1, and the rules of such a buffer are as follows,

• Rule 1: The read operation always has the higher priority in a competition for the execution right;

• Rule 2: When a read request is blocked by a write retirement and the write buffer is not full, the read request can trap and stall the write retirement if the progress of that write retirement is less than a threshold. Then, the read operation is granted the execution right. The canceled write retirement will be retried later.
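The two rules can be summarized by a small arbitration function such as the sketch below; it is an illustration of the policy following Rule 2 and the Figure 5.1 example, not the simulator's code, and the threshold parameter corresponds to the retirement accomplishment degree discussed in the following paragraphs.

    #include <cstdio>

    // Read-preemptive arbitration: 'progress' is the fraction of the ongoing
    // write retirement already completed, and 'alpha' is the threshold below
    // which a blocked read may cancel that retirement (Rule 2 / Figure 5.1).
    struct WriteBufferState {
        bool   retiringWrite;   // is a buffered write currently being retired?
        double progress;        // 0.0 .. 1.0 progress of that retirement
        bool   bufferFull;      // is the write buffer full?
    };

    enum class Action { ServeRead, CancelWriteThenRead, WaitForWrite };

    Action arbitrate(bool readPending, const WriteBufferState& s, double alpha = 0.5) {
        if (!readPending) return Action::WaitForWrite;
        if (!s.retiringWrite) return Action::ServeRead;             // Rule 1: reads win
        if (!s.bufferFull && s.progress < alpha)                    // Rule 2: preempt early writes
            return Action::CancelWriteThenRead;                     // canceled write retried later
        return Action::WaitForWrite;
    }

    int main() {
        WriteBufferState early{true, 0.3, false}, late{true, 0.8, false};
        std::printf("early write: %d\n", static_cast<int>(arbitrate(true, early)));
        std::printf("late  write: %d\n", static_cast<int>(arbitrate(true, late)));
        return 0;
    }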

Figure 5.1: In a read-preemptive write buffer, the read operations can be granted over the ongoing write operation if the progress of that write operation is less than 50%.

The proposed read-preemptive policy tries to execute eNVM read requests as early as possible, but the drawback is the necessity to re-execute the canceled write and the possibility of buffer saturation. The key is to find a proper preemption condition. One extreme method is to stall the write retirement as long as there is a read request, which means that read requests can always be executed immediately. Theoretically, if the write buffer size is large enough, no read request will be blocked. However, since the buffer size

is limited, the increased possibility of a full buffer could also harm the performance. In some other cases, stalling write retirements for read requests is not always beneficial. For example, if a write retirement has almost finished, no read request should stall the retirement process. Consequently, we propose to use the retirement accomplishment degree, denoted as α, as the preemption condition. The retirement accomplishment degree is the completion percentage of the ongoing write retirement, below which no preemption will occur. Figure 5.2 compares the IPC of using different α values in the proposed read-preemptive policy. Note that α = 0% represents the non-conditional preemption policy. We can find that, for the workloads with low write intensity, such as galgel and apsi, the performance improves as the α value increases. However, for the workloads with high write intensity, like streamcluster, the performance only improves at the beginning of the α increase. Generally, the α value is set to 50% to make the proposed read-preemptive policy effective for all the workloads.

Figure 5.2: The performance impact of the preemption condition.

5.3 Hybrid SRAM-eNVM Cache

The aforementioned read-preemptive write buffer hides the long eNVM write latency, but the total number of write operations remains the same. In order to reduce the number of write operations to eNVM cache lines, another architecture-level technique, called the hybrid SRAM-eNVM cache, is proposed. The basic idea of the hybrid SRAM-eNVM cache is to make every cache set a mixture of eNVM cache lines and SRAM cache lines and to keep as much write-intensive data in the SRAM part as possible, and hence reduce the number of write

operations to the eNVM part. The management policy of the hybrid SRAM-eNVM cache can be described as follows,

• The cache controller is aware of the locations of the SRAM cache ways and the eNVM cache ways. When there is a write miss, the cache controller first tries to place the data in the SRAM cache ways.

• Data in eNVM lines are migrated to SRAM lines when they are accessed by two successive write operations; a sketch of this policy is given below.
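A minimal sketch of this set-management policy follows; the way counts, the consecutive-write counter, and the victim handling are simplified assumptions for illustration rather than the evaluated implementation.

    #include <array>
    #include <cstdio>

    // Simplified hybrid SRAM-eNVM cache set.
    struct Line { bool valid = false; unsigned tag = 0; unsigned consecutiveWrites = 0; };

    struct HybridSet {
        static constexpr int kSramWays = 1, kEnvmWays = 15;
        std::array<Line, kSramWays> sram;
        std::array<Line, kEnvmWays> envm;

        // On a write miss, prefer an SRAM way so write-intensive data starts there.
        void allocateOnWriteMiss(unsigned tag) {
            for (auto& l : sram) if (!l.valid) { l = {true, tag, 0}; return; }
            for (auto& l : envm) if (!l.valid) { l = {true, tag, 0}; return; }
            envm[0] = {true, tag, 0};   // fall back: evict an eNVM way (policy simplified)
        }

        // On a write hit in eNVM, migrate after two successive write accesses.
        void writeHitEnvm(int way) {
            if (++envm[way].consecutiveWrites >= 2) {
                sram[0] = envm[way];          // migrate into an SRAM way (victim handling omitted)
                envm[way] = Line{};
            }
        }
    };

    int main() {
        HybridSet set;
        set.allocateOnWriteMiss(0x1A);
        std::printf("SRAM way 0 holds tag 0x%X\n", set.sram[0].tag);
        return 0;
    }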

Figure 5.3: STTRAM write intensity with and without hybrid SRAM-STTRAM caches, comparing an 8MB pure STTRAM cache against an 8MB hybrid SRAM-STTRAM cache (15 STTRAM ways + 1 SRAM way).

In order to show the effectiveness of the hybrid SRAM-eNVM cache, we use STTRAM as a case study. Figure 5.3 shows that the number of STTRAM write operations per 1K instructions is reduced dramatically by using the proposed hybrid SRAM-STTRAM approach. As a result, the dynamic power associated with the STTRAM write operations is also reduced and the performance is improved.

5.4 Effectiveness of Read-Preemptive Write Buffer and Hybrid Cache

To evaluate the effectiveness of the proposed read-preemptive write buffer and the hybrid SRAM-eNVM cache techniques, we use an 8MB STTRAM cache to replace a 2MB SRAM L2 cache in a 4-core chip-multiprocessor example. The experimental settings are listed in Table 5.2. We use Simics [82] for performance simulations and a few multi-threaded benchmarks from the SPECOMP [83] and PARSEC [84] suites. Since the performance and power of STTRAM caches are closely related to the transaction intensity, we select simulation workloads, as listed in Table 5.3, so that we have a wide range of transaction intensities to the L2 caches. The average numbers of total transactions (TPKI)¹ and write transactions (WPKI) of the L2 caches are listed in Table 5.3.

Table 5.2: Baseline configuration parameters

  Processors
    Number of cores   8
    Frequency         3GHz
  Memory Hierarchy
    L1 cache          private, 16KB+16KB, 2-way, 64B line, write-through, 2-cycle read/write latency
    SRAM L2           shared, 2MB, 32-way, 64B line, write-back, 7-cycle read/write latency
    STTRAM L2         shared, 8MB, 32-way, 64B line, write-back, 7-cycle read latency, 33-cycle write latency
    Main memory       4GB, 500-cycle latency

Table 5.3: L2 transaction intensities

  Name            TPKI    WPKI
  galgel          1.01    0.31
  apsi            4.15    1.85
  equake          7.94    3.84
  fma3d           8.43    4.00
  swim            19.29   9.76
  streamcluster   55.12   23.326

The high eNVM cell density makes the 8MB STTRAM cache have comparable silicon area to the 2MB SRAM cache, and thus it has the advantage of a larger cache capacity. However, without any additional techniques, directly replacing the SRAM cache by an STTRAM one causes an overall performance loss. On the contrary, the adoption of our two proposed techniques improves the overall performance by 4.9% and reduces the total energy by 73.5%, as shown in Figure 5.4 and Table 5.4.

¹TPKI is the number of total transactions per 1K instructions and WPKI is the number of write transactions per 1K instructions.

Table 5.4: The performance and power improvement (STTRAM cache vs. SRAM cache)

                            Performance   Total Power
  Read-preemptive buffer    9.93%         67.26%
  Hybrid cache              -2.61%        85.45%
  Combined                  4.91%         73.5%

Figure 5.4: The normalized IPC and normalized energy of the 2MB SRAM cache, the 8MB STTRAM cache, and the 8MB hybrid cache with the read-preemptive write buffer, showing the performance improvement and energy reduction after applying the two proposed techniques.

5.5 Summary

Thanks to their high density, fast read access, and non-volatility, eNVM technologies are promising candidates for on-chip caches. However, one of the eNVM disadvantages is their long write latency and high write energy. Even though directly replacing SRAM caches with eNVM ones can result in considerable power savings, the drawback comes from the long latency and high energy consumption of eNVM write operations. As a result, for applications with high cache write intensities, the performance can be degraded and the power savings can be reduced. Therefore, two novel architecture-level techniques are proposed. First, the read-preemptive write buffer is designed to mitigate the performance penalty caused by the long write latency. Second, the hybrid SRAM-eNVM cache is proposed to reduce the eNVM write count. The experimental results show these two techniques can make eNVM caches work effectively for most workloads regardless of their cache write intensities.

Chapter 6

Application-Level: eNVM for File Storage

Recently, the feasibility of multi-level cells (MLC) for eNVM [6, 18, 21, 24], which enable a cell to store more than one bit of digital data, has been demonstrated. This new property makes eNVM more competitive and positions it as a successor of the NAND flash technology, which also has MLC capability but does not have an easy scaling path to reach higher densities. However, the MLC capability of eNVMs such as PCRAM and ReRAM usually comes with the penalty of longer programming time and shortened cell lifetime compared to their single-level cell (SLC) counterparts. Therefore, we propose an adaptive MLC/SLC reconfigurable eNVM design that can exploit both the fast SLC access speed and the large MLC capacity with awareness of workload characteristics and lifetime requirements. Without loss of generality, we investigate MLC PCRAM as an example.

6.1 Multi-Level Cell

Usually, eNVM exploits the large resistance contrast between the "1" and the "0" states. As for PCRAM and ReRAM, due to their large resistance contrast between the RESET and SET states, MLC PCRAM [18, 21, 24] and MLC ReRAM [6] are both feasible. However, the degree of success of such an MLC write depends on the resistance distributions over a large ensemble of eNVM cells.


Figure 6.1: The basic MLC PCRAM programming scheme.

Unlike a single-level cell (SLC) write, where the bit write quality can be ensured by over-SET or over-RESET, the intrinsic randomness associated with each write attempt and the inter-cell variability make it infeasible to have a universal pulse shape for writing an intermediate state. In order to deal with this issue, resistance distribution tightening techniques have been developed based on "program-and-verify" (P&V) procedures.

6.1.1 Extra Write Overhead

P&V is a common programming technique for multi-bit writing and is widely used in NAND flash memories. In order to achieve non-overlapping resistance distributions of different bit levels, P&V needs to iteratively apply a SET pulse and then verify that a specified precision criterion is met, which leads to a much longer write latency. Using PCRAM as an example, the MLC programming algorithm, as illustrated in Figure 6.1, first programs the cell to its low-resistance (SET) state by means of a SET-sweep pulse. This is followed by a single RESET pulse with a fast quench, whose purpose is to initialize the cell to a fully RESET state before partial SET sequences are applied. In the final PGM step, the SET pulse amplitude is gradually increased under feedback-loop control, so that a tight resistance distribution can be achieved [24]. It is obvious that, compared to the SLC RESET and SET operations that only require applying a specific pulse shape, the MLC write scheme has to incorporate at least a SET and a RESET operation in each write operation. Thus, the write latency and the write energy are tremendously larger than those in the SLC mode.

Figure 6.2: SET and RESET resistances during PCRAM cycling, illustrating the difference between failure by "stuck-SET" and by "stuck-RESET".

6.1.2 Extra Read Overhead

Reading data from an MLC device is more difficult than reading data from an SLC device, as it requires distinguishing more precisely between neighboring resistance levels. As we show later in Chapter 6.2, reading MLC data needs more comparison steps, which inevitably causes extra read latency overhead.

6.1.3 Reduced Cell Lifetime

As shown above, P&V needs to initialize all the target cells to the RESET state before each intermediate write. As we discuss in the next section, RESET is the major source of cell wear-out; hence, MLC operation reduces the cell lifetime compared to the SLC mode.

6.1.4 PCRAM Lifetime Model

eNVM usually has limited write endurance. For example, several PCRAM reliability experiments have shown PCRAM write endurance numbers in the range of 10^5 to 10^9 cycles [1, 5, 14]. Two types of failure modes have been observed to happen after cycling, called "stuck-RESET" and "stuck-SET", as shown in Figure 6.2.

Table 6.1: The lifetime model of PCRAM cells.

  RESET cycles    MLC   SLC
  0 - 10^7        Yes   Yes
  10^7 - 10^9     No    Yes
  beyond 10^9     No    No

In a stuck-RESET failure, the device resistance suddenly spikes, and the resistance is stuck at a level that is much higher than the normal RESET state. This stuck-RESET failure is typically caused by void formation or delamination that catastrophically severs the electrical path between the GST and the access device. On the contrary, in a stuck-SET failure, a gradual degradation of the RESET-to-SET resistance margin is usually observed, as demonstrated in Figure 6.2. As a PCRAM cell continues experiencing write cycles, its GST characteristics change, and it becomes more difficult to create an amorphous (RESET) phase in the GST than before. Although a larger-amplitude RESET pulse is able to force the GST to switch between the RESET and SET states, it causes a larger RESET power consumption and, worse, it eventually lets stuck-SET occur earlier. Therefore, in this work, a stronger RESET pulse is not used to prolong the PCRAM cell lifetime. Instead, during the stuck-SET degradation, the MLC cell is reconfigured to SLC mode, as we discuss in the later sections. Degrading MLC to SLC not only bypasses the issue of the decreasing resistance margin; more importantly, writing SLC data does less damage to the PCRAM cell than writing MLC data does. According to the study conducted by Goux et al., stuck-SET failure is due to a change in the RESET condition that is induced by cycling [85]. Their endurance data suggest that endurance scales inversely with the time spent melting, t_m, during each RESET pulse. Their experiment coincides with another study, which observes that cycling with only SET pulses greatly extends endurance (more than 10^12 cycles) over RESET-SET cycling (10^10 cycles) [86]. According to Figure 6.2, in this work we assume that during the first 10^7 RESET-SET cycles, each PCRAM cell can be either in MLC mode or in SLC mode depending on the external configuration; between 10^7 and 10^9 cycles, the RESET resistance degradation

makes the PCRAM cell lose its MLC capability so that it can only work in SLC mode; all PCRAM cells beyond 10^9 cycles are considered non-functional. In addition, it is obvious that each MLC write includes a RESET operation, and we further assume that the RESET/SET distribution of SLC writes is 50%/50%. Therefore, the PCRAM lifetime model can be summarized as tabulated in Table 6.1.
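The lifetime model of Table 6.1 amounts to a simple mode-decision function of the accumulated RESET cycle count, as the following sketch illustrates (the cycle thresholds are taken from the table; everything else is illustrative).

    #include <cstdio>
    #include <cstdint>

    // Allowed cell mode as a function of the RESET cycles already experienced.
    enum class CellMode { MlcOrSlc, SlcOnly, Dead };

    CellMode allowedMode(std::uint64_t resetCycles) {
        if (resetCycles < 10000000ULL)    return CellMode::MlcOrSlc;  // fewer than 10^7 cycles
        if (resetCycles < 1000000000ULL)  return CellMode::SlcOnly;   // 10^7 to 10^9 cycles
        return CellMode::Dead;                                        // beyond 10^9 cycles
    }

    int main() {
        std::printf("1e6 cycles  -> %d\n", static_cast<int>(allowedMode(1000000ULL)));
        std::printf("1e8 cycles  -> %d\n", static_cast<int>(allowedMode(100000000ULL)));
        std::printf("1e10 cycles -> %d\n", static_cast<int>(allowedMode(10000000000ULL)));
        return 0;
    }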

6.2 Adaptive MLC/SLC PCRAM Array Structure

In this section, we demonstrate how an MLC PCRAM array can be configured to support SLC accesses on the fly while incurring only a small hardware overhead. In addition, a density control layer is designed to manage the variable PCRAM capacity and to track the density mode of PCRAM cells at a proper granularity.

6.2.1 MLC/SLC Write: SET, RESET, and PGM Pulses

An MLC write has to initialize the target cell to the RESET state and then iteratively apply PGM pulses (partial SET pulses with fixed pulse duration but different pulse amplitudes, or vice versa) until the targeted intermediate resistance level is reached. A full SET pulse (SET-sweep pulse) is also used to program the cell into the SET state. Therefore, SET, RESET, and PGM pulse generators are all required in the MLC PCRAM chip, as shown in Figure 6.3. The components in grey are optional for MLC support; the other components are required for basic SLC PCRAM operations. In order for a degraded MLC PCRAM cell to work in SLC mode, it is straightforward to use only the SET and RESET pulse generators to program the cells. During the SLC writing process, the PGM pulse generator and its associated iteration control logic are bypassed because there are no intermediate resistance levels to program the cells into.

6.2.2 MLC/SLC Read: Dual-Mode Sense Amplifier

Since every PCRAM cell in MLC mode stores more than one bit, during the MLC reading process each sense amplifier output is compared to multiple references. The comparison results are latched separately and then encoded into multi-bit data. Figure 6.3 illustrates an example of the sensing scheme for a 2-bit MLC PCRAM array.

Figure 6.3: The block diagram of the PCRAM array organization that supports both MLC and SLC operations.

In this scheme, a ramp generator triggers three reference sense amplifiers at different times; the output of each reference sense amplifier triggers a corresponding flip-flop, which eventually stores the result of the comparison between the bit-line current and one of the reference currents; and finally the value stored in the flip-flops ("000", "001", "011", or "111") is encoded into 2-bit data.

Figure 6.4: The conceptual view of managing SLC and MLC modes.

However, when an MLC PCRAM cell degrades to SLC mode, most of the components in the MLC sensing scheme become unnecessary. The output node of the bit-line sense amplifier can be used directly as the SLC data output. Therefore, switching from MLC to SLC mode does not require adding any significant peripheral circuitry, only some simple control logic. The shaded part of Figure 6.3 shows the components that can be removed for SLC PCRAM read and write operations.
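The read-out encoding described in this subsection can be sketched as follows; the mapping from the thermometer code to the 2-bit value and the treatment of the SLC path are assumptions for illustration only.

    #include <bitset>
    #include <cstdio>

    // MLC read-out: the three reference comparisons produce a thermometer code
    // ("000", "001", "011", "111"), which is encoded into the 2-bit stored value.
    unsigned encodeMlc(const std::bitset<3>& thermometer) {
        return static_cast<unsigned>(thermometer.count());   // 0..3 -> 2-bit value
    }

    // In SLC mode only the plain bit-line sense amplifier output is used.
    unsigned encodeSlc(bool senseAmpOut) { return senseAmpOut ? 1u : 0u; }

    int main() {
        std::printf("MLC '011' -> %u\n", encodeMlc(std::bitset<3>("011")));
        std::printf("SLC '1'   -> %u\n", encodeSlc(true));
        return 0;
    }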

6.2.3 Address Re-mapping

With the capability of changing the PCRAM cells between the MLC mode and the SLC mode on the fly, the effective PCRAM device capacity can vary at runtime. In order to control such a variable device size, an address re-mapping mechanism is introduced, as illustrated in Figure 6.4. In our proposal, the PCRAM chip is divided into two banks, and each bank has a complete set of I/O paths. The PCRAM cells in each bank are grouped into large blocks. As the optimal block size of a state-of-the-art file system (e.g., Ext2) on large volumes (more than 1GB) is 4KB, we group every 16K PCRAM cells into one block. Thereby, if the block is in SLC mode, it has a capacity of 2KB, and if it is in 2-bit MLC mode, it has a capacity of 4KB.

Algorithm 1 Address Re-mapping Algorithm
  bankID = getBankID(blockAddress)
  rowID = getRowID(blockAddress)
  if Bitmap[rowID] = MLC then
      Activate Bank[bankID]
      Access Block[rowID]
  else
      if bankID = 0 then
          Activate all the banks
          Access Block[rowID] in all the banks
          Combine the SLC blocks into a 4KB block
      else
          Return bad block
      end if
  end if

In this design, the blocks having the same row address are always in the same mode. Hence, a bitmap is used to indicate whether a block is in SLC or MLC mode. The general address re-mapping algorithm is described by Algorithm 1. In short, if the bitmap indicates that the accessed block is MLC, then only the corresponding bank is activated. However, if the accessed block is SLC, only the address mapped to the first bank (the MSB bank) is valid, and the 4KB I/O block is accessed by combining two 2KB SLC blocks (one in the MSB bank and the other in the LSB bank). In SLC mode, all the accesses to the LSB banks are invalid and end up with a 'bad block' signal. In the case of Linux, it is possible to supply a bad-block list, which is most easily generated by running 'badblocks' at disk formatting time. The state-of-the-art file system (e.g., Ext2) uses a special sort of 'hidden file' to which it allocates all of the bad blocks on the file system. This technique ensures that those data blocks will never be accessed or used by any other files. On the other hand, when a block changes its mode from SLC to MLC, it is straightforward to remove this block from the 'bad block' list, and thus it becomes available to be allocated to other files. The bitmap indicating the MLC/SLC status is the only hardware overhead needed to enable our adaptive MLC/SLC proposal. In the case of a PCRAM device with 4G cells (512MB as pure SLC PCRAM or 1GB as pure MLC PCRAM), each bank contains 128K rows and the bitmap size is 16KB.
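For illustration, Algorithm 1 can be rendered in C++ roughly as below; the class and function names, and the use of std::optional to signal a bad block, are assumptions rather than the device firmware interface.

    #include <cstdio>
    #include <optional>
    #include <vector>

    // A C++ rendering of Algorithm 1. Blocks sharing a row index are always in
    // the same mode, so one bitmap bit per row is sufficient.
    struct RemapResult { int bank; int row; bool combineSlcPair; };

    class DensityControl {
    public:
        explicit DensityControl(std::size_t rows) : isMlc_(rows, true) {}
        void setMode(int row, bool mlc) { isMlc_[row] = mlc; }

        // Returns std::nullopt to signal a 'bad block' (an LSB-bank address in SLC mode).
        std::optional<RemapResult> remap(int bank, int row) const {
            if (isMlc_[row]) return RemapResult{bank, row, false};   // MLC: one bank holds 4KB
            if (bank == 0)   return RemapResult{0, row, true};       // SLC: combine MSB+LSB 2KB blocks
            return std::nullopt;                                     // SLC access via LSB bank is invalid
        }
    private:
        std::vector<bool> isMlc_;   // the per-row MLC/SLC bitmap
    };

    int main() {
        DensityControl ctrl(128 * 1024);
        ctrl.setMode(42, false);                       // row 42 degraded to SLC
        auto r = ctrl.remap(1, 42);
        std::printf("bank 1, row 42 is %s\n", r ? "valid" : "a bad block");
        return 0;
    }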

In our design, the first four PCRAM blocks are always set as MLC, and they store this 16KB bitmap. The bitmap is loaded into the system main memory at boot time, and it is written back to the PCRAM device at unmount time. Here, we assume the PCRAM device has a sufficient on-device capacitor so that the bitmap data can always be written back even upon power emergency events.

6.2.4 Reconfigurable PCRAM-based Solid-State Disk

In the near future, PCRAM is considered a direct substitute for NAND flash. Considering that most of the NAND flash-based devices used today have low capacity utilization (e.g., personal flash devices, SD cards in digital cameras, etc.), we propose two ways to partition the MLC and SLC blocks in order to put the adaptive MLC/SLC PCRAM device into real practice. To optimize the PCRAM device for performance, the rationale is to first set all the PCRAM blocks in SLC mode for fast read/write accesses. In this way, the initial device capacity utilization is 50%. When the required device utilization surpasses 50%, the SLC blocks that belong to the least recently modified file are merged into MLC blocks, and therefore extra device space is left for the new files. The expansion process continues until the device capacity utilization reaches 100%, at which point all the PCRAM blocks are in MLC mode.

6.3 Experimental Results

In this section, we evaluate the performance and lifetime improvement after applying the adaptive MLC/SLC technique when the PCRAM device is under-utilized. In order to estimate the efficiency of the proposed technique on a real platform, we collected actual I/O traces on a Linux 2.6.32-23 kernel using a 2GB RAMDISK formatted as an Ext2 file system with a 4KB block size. We created a synthetic file system trace by first filling up the RAMDISK with randomly generated files whose sizes range from 5KB to 10MB and then randomly accessing them 1,500,000 times.

Figure 6.5: The performance (throughput in MB/s versus PCM device utilization) of the adaptive MLC/SLC solution under different utilizations for (a) Synthetic, (b) Financial 1, (c) Financial 2, and (d) WebSearch.

In addition, we used disk traces from the Storage Performance Council [87], which were intended to model disk behavior in enterprise-level applications like web servers, database servers, and web search. Later in this section, we refer to our synthetic trace as Synthetic and to the traces from SPC as Financial 1, Financial 2, and WebSearch, respectively.

6.3.1 PCRAM MLC/SLC Timing Model

We use a preliminary version of NVSim to estimate the read and write latencies for the SLC and MLC modes. For simplicity, we average the SET latency and the RESET latency in the SLC write latency calculation, and we set the average number of P&V steps to 4 to form the MLC write latency. The rounded values are tabulated in Table 6.2. The read width is set to 64 cells (64 bits for SLC and 128 bits for MLC) and the write width is limited to 16 cells (16 bits for SLC and 32 bits for MLC) due to the large amount of current that is required for SET and RESET operations. Under these assumptions, the I/O bandwidth of the SLC mode is about twice the MLC bandwidth.

Figure 6.6: The performance-per-cost analysis (relative performance/cost versus the percentage of MLC blocks) of the adaptive MLC/SLC solution for (a) Synthetic, (b) Financial 1, (c) Financial 2, and (d) WebSearch.

Table 6.2: The timing model of PCRAM cells in SLC and MLC modes.

                    SLC        MLC
  Read latency      10ns       44ns
  Read width        64 bits    128 bits
  Read bandwidth    800MB/s    363.6MB/s
  Write latency     100ns      395ns
  Write width       16 bits    32 bits
  Write bandwidth   20.0MB/s   10.3MB/s
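The bandwidth rows of Table 6.2 follow directly from the width and latency rows (bandwidth = access width in bytes divided by access latency), as the small check below shows; the MLC write row comes out near 10.1MB/s here, slightly below the 10.3MB/s listed, presumably because of rounding in the original table.

    #include <cstdio>

    // Recompute the bandwidth entries of Table 6.2 from width and latency.
    int main() {
        struct Row { const char* name; double bits; double latencyNs; };
        const Row rows[] = {
            {"SLC read ", 64, 10}, {"SLC write", 16, 100},
            {"MLC read ", 128, 44}, {"MLC write", 32, 395},
        };
        for (const Row& r : rows) {
            double mbPerS = (r.bits / 8.0) / (r.latencyNs * 1e-9) / 1e6;
            std::printf("%s: %.1f MB/s\n", r.name, mbPerS);
        }
        return 0;
    }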

6.3.2 Performance-Aware Management Result

Firstly, we demonstrate how the performance-aware partitioning strategy can improve the performance. The I/O access distribution has a large impact on the efficiency of the proposed adaptive MLC/SLC PCRAM. If the accesses are evenly distributed to every portion of the file system, a certain amount of data has to be accessed from the MLC regions. On the other hand, if the access pattern is biased and a part of the file system is not frequently accessed, then that part of the files can be partitioned into the MLC regions and most of the frequently-accessed data can be served by accessing only the SLC regions. In this analysis, we assume that many PCRAM chips are connected in an array to form a large PCRAM device that has sufficient storage capacity to hold the working set (the number of PCRAM chips is varied to adjust the device utilization). When the device utilization is lower than 50%, all the PCRAM blocks are in SLC mode, and when the device utilization is 100%, all the PCRAM blocks are in MLC mode. When the utilization is between 50% and 100%, the adaptive MLC/SLC technique is applied to supply the required capacity. Figure 6.5 illustrates the relationship between the average throughput and the PCRAM device utilization under different workloads. It can be observed that Synthetic and Financial 2 show a gradual throughput degradation as the utilization increases. This is due to the high data locality of these two workloads. Furthermore, this phenomenon is exaggerated in Financial 1, which has a large portion of the file system that is accessed only once. However, the throughput of WebSearch drops abruptly at 50%, as this workload has an evenly distributed I/O access pattern.

6.3.3 Performance-Cost Analysis

Based on the previous result on the relationship between the performance and the device utilization, we can further derive the relationship between the performance and the cost by assuming each PCRAM chip has a fixed cost. As an example, under the workload access pattern of Financial 1, one extreme configuration is to use SLC-only PCRAM chips that can supply a bandwidth of 94MB/s, while the other extreme configuration is to use MLC-only PCRAM chips that only have a bandwidth of 44MB/s but halve the required PCRAM chip count. To investigate the throughput-per-cost metric, we rephrase the results of Figure 6.5, and Figure 6.6 shows where the throughput-per-cost reaches its peak value. The average improvement in throughput-per-cost is around 28%.

6.3.4 Lifetime Analysis

When an MLC PCRAM cell has experienced too many RESET operations, its RESET/SET resistance margin decreases. Therefore, it has to be configured as an SLC cell at that point.

In the lifetime-aware partitioning strategy, all the PCRAM blocks in the MSB bank are first set to MLC mode and the ones in the LSB bank are left empty. When the operating system monitor finds that certain blocks have a higher access probability, these blocks are switched to SLC mode by enabling their associated blocks in the LSB bank. Using the lifetime-aware partitioning strategy, the maximum lifetime improvement can be as high as 100× according to the lifetime model described in Chapter 6.1.4. This maximum amount of lifetime improvement comes from halving the device capacity.

6.4 Summary

The proposed adaptive MLC/SLC scheme exploits the fast SLC access speed and the large MLC capacity with awareness of workload characteristics and lifetime requirements. In this part of the dissertation, a circuit-level adaptive MLC/SLC eNVM array is first devised, and the management policy for the MLC/SLC modes is then designed. To evaluate the proposed adaptive MLC/SLC scheme, a case study on a PCRAM-based storage device is conducted. The simulation results based on four actual I/O traces show that the adaptive MLC/SLC technique can improve the throughput-per-cost of the PCRAM device by 28% on average, or it can extend the PCRAM device lifetime by 100× if the device utilization is under 50%.

Chapter 7

Application-Level: eNVM for Exascale Fault Tolerance

One of the system-level applications that we have considered for eNVMs is to enhance the HDD-based checkpointing/rollback scheme, which is one of the most common approaches to ensure the fault resilience of a computing system. In current petascale computing systems, HDD-based checkpointing already incurs a large performance overhead and is not a scalable solution for future exascale computing. To solve this problem, we investigate how to use eNVM technologies to provide a fault-resilience scheme with a low performance penalty.

7.1 Problem

For large-scale applications in massively parallel processing (MPP) systems, coordinated checkpoint-restart is the most widely used technique to provide fault tolerance [88]. As the scale of future MPP systems keeps increasing and the system MTTF keeps decreasing, it is foreseeable that checkpoint protection at a higher frequency will be required. However, the current state-of-the-art approach, which takes a snapshot of the entire memory image and stores it into globally accessible storage (typically built with disk arrays), as shown in Figure 7.1, is not scalable and not feasible for future exascale systems.


Figure 7.1: The typical organization of a contemporary supercomputer. All the permanent storage devices are controlled by I/O nodes; there is no local permanent storage at each node.

Figure 7.2: The proposed new organization that supports hybrid checkpoints. The primary permanent storage devices are still connected through I/O nodes, but each process node also has local permanent storage.

There are two primary obstacles that prevent performance scaling.

Bottleneck 1: HDD Data Transfer Bandwidth

As shown in Figure 7.1, the checkpoint storage device used in practice is the HDD, which implies that the most serious bottleneck of in-disk checkpointing is the sustained transfer rate of HDDs (usually less than 150MB/s). The significance of this problem is demonstrated by the facts that the I/O generated by HDD-based checkpointing consumes nearly 80% of the total file system usage even on today's MPP systems [88], and that the checkpoint overhead accounts for over 25% of the total application execution time in a petaFLOPS system [89]. Although a distributed file system, such as Lustre or Spider, can aggregate the file system bandwidth to hundreds of GB/s, in such systems the checkpoint size is also aggregated by the scale of nodes, nullifying the benefit. Since the HDD data transfer bandwidth is not easily scaled up due to its mechanical nature, it is necessary to change the future checkpoint storage from in-disk to in-memory.

In order to quantify the speed difference between in-disk and in-memory checkpointing, we measure their peak sustainable speeds on a hardware configuration with two dual-core AMD Opteron 2220 processors, 16GB of ECC-protected registered DDR2-667 memory, and Western Digital 740 hard disk drives operating at 10,000 RPM with a peak bandwidth of 150MB/s reported in the datasheet. As a block device, the HDD shows a large variation in its effective bandwidth depending on the access pattern. Although the datasheet reports a peak bandwidth of 150MB/s, the actual working bandwidth is much smaller. We measure the actual HDD bandwidth by randomly copying files of different sizes and using the system clock to track the elapsed time. The result is plotted in Figure 7.3: all the points fall into two regions, one near the y-axis and the other at the 50MB/s line. When the write size is relatively small, the effective write bandwidth of the HDD ranges from 60MB/s to 100MB/s depending on the status of the HDD internal buffer. However, when the write size reaches the megabyte scale, the effective write bandwidth of the HDD drops dramatically to about 50MB/s, only one third of its 150MB/s peak.

Figure 7.3: The hard disk drive bandwidth with different write sizes (Region 1: write sizes that fit into the HDD buffer; Region 2: sustained bandwidth for large-size write operations).
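A minimal sketch of this kind of file-copy measurement is shown below (illustrative only: the file sizes, target path, and use of os.fsync are assumptions rather than the exact methodology behind Figure 7.3).

```python
import os, time

def measure_write_bandwidth(path, size_mb):
    """Write size_mb of data to `path` and return the effective bandwidth in MB/s."""
    data = os.urandom(1 << 20)                  # 1MB of random data
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())                    # force the data out of the page cache
    return size_mb / (time.time() - start)

# Sweep write sizes to reproduce the two regions seen in Figure 7.3.
for size in (16, 64, 256, 1024):
    print(size, "MB ->", round(measure_write_bandwidth("/tmp/ckpt.bin", size), 1), "MB/s")
```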

Figure 7.4: The main memory bandwidth with different write sizes (Region 1: random small-size write operations; Region 2: sequential big-size write operations).

On the contrary, the result of the in-memory checkpointing speed is shown in Figure 7.4. Similar to the HDD bandwidth, all the collected data fall into two regions. However, unlike the HDD case, the attainable bandwidth is higher when the write size is large, thanks to spatial locality. This behavior is desirable for checkpointing, since checkpoints are usually large. In addition, the achievable bandwidth is very close to 5333MB/s, the theoretical peak bandwidth of the DDR2-667 memory used in this experiment. Therefore, compared to the in-disk checkpointing speed, the attainable in-memory speed can be two orders of magnitude faster. Hence, eNVM technologies are ideal candidates for implementing in-memory checkpointing.

Bottleneck 2: Centralized Checkpoint Storage

Another bottleneck of the current checkpointing system, as shown in Figure 7.1, comes from the centralized checkpoint storage. Typically, several nodes in the system are assigned to be I/O nodes that are in charge of the HDD accesses. Thus, the checkpoints of each node (including compute nodes and I/O nodes) have to go through the I/O nodes via network connections before reaching their final destinations, which consumes a large part of the system I/O bandwidth and causes burst congestion. Therefore, the network bandwidth and the back-end file system bandwidth both become performance bottlenecks for the system checkpointing speed. With today's technology, the node-to-node interconnect bandwidth is usually still less than 4GB/s (e.g., 4X QDR InfiniBand), and the aggregate bandwidth of the file system used by Jaguar at Oak Ridge National Laboratory (the Spider file system) is 240GB/s. Considering that coordinated checkpointing requires all the nodes to dump their checkpoints with the same time stamp, the per-node checkpointing speed is limited to 12MB/s for a 1-petaFLOPS system with 20,000 nodes (in this case, the bandwidth of the centralized file system becomes the primary bottleneck). In addition, as the system scale keeps growing, the physical distance between the checkpoint sources and targets increases, which not only hurts performance but also wastes a large amount of power on data transfers. To solve this bottleneck, we propose a hybrid checkpointing mechanism that uses both local and global checkpoints, in which the local checkpoint is faster and does not need any network connection, while the slower global checkpoint is still preserved to provide full fault coverage.
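The 12MB/s figure follows directly from dividing the aggregate file system bandwidth by the node count; a one-line check using the numbers quoted above is shown below.

```python
# Coordinated checkpointing: every node writes at the same time, so the
# centralized file system bandwidth is shared evenly across all nodes.
aggregate_fs_bandwidth_gb_s = 240       # Spider file system, as quoted above
nodes = 20_000                          # 1-petaFLOPS system
per_node_mb_s = aggregate_fs_bandwidth_gb_s * 1024 / nodes
print(per_node_mb_s)                    # ~12 MB/s per node
```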

7.2 Integrating eNVM Modules into MPP Systems

Without loss of generality, we use PCRAM as the case study for deploying eNVM resources. The integration method is similar to that of traditional DRAM Dual-Inline Memory Modules (DIMMs); in this dissertation, we call this integration method PCRAM-DIMM. The re-engineered PCRAM-DIMM is designed to be compatible with the DDR3 interface and can be plugged directly into a DIMM socket. Therefore, only a change in the memory controller firmware is needed to support the new timing parameters of PCRAM read and write operations. Although re-engineering such a PCRAM-DIMM module incurs extra cost, we envision that it can become a standard module for future fault-tolerant MPP systems, which would minimize its negative impact on the system cost. Besides the cost concern, the biggest design issue of PCRAM-DIMM is the long write latency of PCRAM. While some PCRAM prototypes show a read latency longer than 50ns [14,19,24,75], the read latency (from address decoding to data sensing) can be reduced to around 10ns by cutting the bitlines and wordlines inside a PCRAM array into small segments [56]. However, the reduction of the PCRAM write latency is limited by the long SET pulse (~100ns). If the conventional DRAM-DIMM organization shown in Figure 7.5 is directly adopted, the resulting PCRAM write bandwidth is only 0.08GB/s, which is far below the DDR3-1333 bandwidth of 10.67GB/s. To solve the bandwidth mismatch between the DDRx bus and the PCRAM chip, two modifications are made to organize the new PCRAM-DIMM:

Figure 7.5: The organization of a conventional 18-chip DRAM DIMM with ECC support.

Figure 7.6: The organization of the proposed 18-chip PCRAM DIMM with ECC support.

Table 7.1: Different Configurations of the PCRAM Chips

Process   Capacity   # of Banks   Read/RESET/SET    Leakage   Die Area
65nm      512Mb      4            27ns/55ns/115ns   64.8mW    109mm2
65nm      512Mb      8            19ns/48ns/108ns   75.5mW    126mm2
45nm      1024Mb     4            18ns/46ns/106ns   60.8mW    95mm2
45nm      1024Mb     8            16ns/46ns/106ns   62.8mW    105mm2

1. As shown in Figure 7.6, the configuration of each PCRAM chip is changed to x72 (64 bits of data and 8 bits of ECC protection), while the 8x prefetching scheme is retained for compatibility with the DDR3 protocol. As a result, there are 72 × 8 data latches in each PCRAM chip, and during each PCRAM write operation, 576 bits are written into the PCRAM cell array in parallel;

2. The 18 chips on the DIMM are re-organized in an interleaved way. For each data transfer, only one PCRAM chip is selected. An 18-to-1 data mux/demux is added on the DIMM to select the proper PCRAM chip for every DDR3 data transfer.

Consequently, the write latencies of the individual PCRAM chips can be overlapped. The overhead of this new DIMM organization includes: (1) one 18-to-1 data mux/demux; (2) 576 sets of data latches, sense amplifiers, and write drivers on each PCRAM chip. The mux/demux can be implemented by a circuit that decodes the DDR3 address into 18 chip select signals (CS#). The overhead of the data latches, sense amplifiers, and write drivers is evaluated using NVSim. Various configurations are evaluated by NVSim, and the results are listed in Table 7.1. Based primarily on SET latency and area efficiency, we use the 45nm 1024Mb 4-bank PCRAM chip design. The PCRAM-DIMM write bandwidth of this configuration is 64bit × 8(prefetch) × 18(chips)/106ns = 10.8GB/s, which can saturate the DDR3-1333 peak bandwidth of 10.67GB/s. In addition, according to our NVSim power model, each 576-bit RESET and SET operation consumes 31.5nJ and 19.6nJ of dynamic energy, respectively. Therefore, assuming that "0" and "1" are written uniformly, the average dynamic energy is 25.6nJ per 512 bits (excluding the 64 bits of ECC), and the dynamic power of the 1024Mb PCRAM-DIMM under write operations is 25.6nJ/512b × 10.8GB/s ≈ 4.34W. The leakage power of the 18-chip PCRAM-DIMM (consumed by the peripheral circuitry of the PCRAM chips) is estimated to be 60.8mW × 18 = 1.1W.
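The bandwidth and power figures above follow from simple arithmetic on the Table 7.1 parameters; a sketch of the calculation, using only numbers quoted in the text, is given below.

```python
# PCRAM-DIMM write bandwidth: 18 interleaved chips, each writing 64b x 8 (prefetch)
# per 106ns SET latency (45nm 1024Mb 4-bank configuration in Table 7.1).
bits_per_chip_write = 64 * 8
chips = 18
set_latency_s = 106e-9
write_bw_B_s = bits_per_chip_write * chips / 8 / set_latency_s
print(round(write_bw_B_s / 1e9, 1))        # ~10.8-10.9 GB/s, saturates DDR3-1333

# Average dynamic write power, assuming "0" and "1" are written uniformly.
energy_per_576b_write_nj = (31.5 + 19.6) / 2   # mean of RESET and SET energy
write_power_w = energy_per_576b_write_nj * 1e-9 / (512 / 8) * write_bw_B_s
print(round(write_power_w, 2))             # ~4.3 W dynamic power under full write load

# Leakage of the 18-chip DIMM (peripheral circuitry only).
print(round(60.8e-3 * 18, 2))              # ~1.1 W
```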

7.3 Local/Global Hybrid Checkpoint

Integrating PCRAM into future MPP systems and using it as fast in-memory checkpoint storage removes the first performance bottleneck, the slow HDD speed. However, the second bottleneck, the centralized I/O storage, still exists. To further remove this bottleneck, a hybrid checkpointing scheme with both local and global checkpoints is proposed. This scheme works efficiently because most system failures can be recovered locally without the involvement of other nodes.

7.3.1 Hybrid Checkpoint Scheme

We propose to add local checkpoints that periodically back up the state of each node in its own private storage. Every node has dedicated local storage for storing its system state. Similar to its global counterpart, the local checkpointing is done in a coordinated fashion. We assume that a global checkpoint is made from an existing local checkpoint. Figure 7.7 shows the conceptual view of the hybrid checkpoint scheme:

• Step 1: Each node dumps its memory image to its own local checkpoint;

• Step 2: After several local checkpoint intervals, a global checkpoint is initiated, and the new global checkpoints are made from the latest local checkpoints;

• Step 3: When there is a failure but all the local checkpoints are accessible, the local checkpoints are loaded to restore the computation;

• Step 4: When there is a failure and parts of the local checkpoints are lost (in this case, Node 3 is lost), the global checkpoints (which might be obsolete compared to the latest local checkpoints) are loaded, and the failure node is substituted by a backup node.


Figure 7.7: The local/global hybrid checkpoint model.

This two-level hybrid checkpointing gives us the opportunity to tune the local-to-global checkpoint ratio based on the failure types. For example, a system dominated by transient failures can be protected by frequent local checkpoints and a limited number of expensive global checkpoints without losing performance.

The proposed local/global checkpointing is also effective in handling failures that occur during a checkpoint operation. Since the scheme does not allow concurrent local and global checkpointing, there will always be a stable state for the system to roll back to even when a failure occurs during the checkpointing process. The only time the rollback operation is not possible is when a node fails completely in the middle of making a global checkpoint. While such failure events can be handled by maintaining multiple global copies, the probability of such a failure occurring in the middle of a global checkpoint is relatively negligible. Hence, we limit our proposal to a single copy of local and global checkpoints.

Whether the MPP system can be recovered using a local checkpoint after a failure depends on the failure type. In this dissertation, all the system failures are divided into two categories:

• Failures that can be recovered by local checkpoints: In this case, the local checkpoint in the failure node is still accessible. If the system error is transient (e.g., a soft error, an accidental human operation, or a software bug), the MPP system can simply be recovered by rebooting the failure node using its local checkpoint. If the system error is due to system maintenance or hot plug/unplug, the MPP system can also be recovered by simply rebooting, or by migrating the computation task from one node to another using local checkpoints.

Table 7.2: The Statistics of the Failure Root Cause Collected by LANL during 1996-2005

Cause          Occurrence   Percentage
Hardware       14341        60.4%
Software       5361         22.6%
Network        421          1.8%
Human          149          0.6%
Facilities     362          1.5%
Undetermined   3105         13.1%
Total          23739        100%

• Failures that have to be recovered by global checkpoints: In the event of some permanent failures, the local checkpoint in the failure node is not accessible any more. For example, if the CPU, the I/O controller, or the local storage itself fails to work, the local checkpoint information will be lost. This sort of failure has to be protected by the global checkpoint, which requires storing system state in either neighboring nodes or a global storage device.

As a hierarchical approach, whenever the system fails, the system will first try to recover from the local checkpoints. If one or more than one of the local checkpoints is not accessible, the system recovery mechanism will restart from the global checkpoint.

7.3.2 System Failure Category Analysis

The effectiveness of the proposed local/global hybrid checkpointing depends on how many failures can be recovered locally. A thorough analysis of failure rates of MPP systems shows that a majority of failures are transient in nature [90] and can be recovered by using local checkpoints only. In order to quantify the failure distribution, we studied the failure events collected by the Los Alamos National Laboratory (LANL) during 1996-2005 [91]. The data cover 22 high-performance computing systems, including a total of 4,750 machines and 24,101 processors. The statistics of the failure root causes are shown in Table 7.2. We conservatively assume that undetermined failures have to rely on global checkpoints for recovery, and assume that the failures caused by software, network, human, and facilities problems can be protected by local checkpoints:

• If nodes halt due to software failures or human mis-operation, we assume that some mechanism (e.g., a timeout) can detect these failures and the failure node will be rebooted automatically.

• If nodes halt due to network failures (e.g., widespread network congestion) or facilities downtime (e.g., a global power outage), automatic recovery is impossible and manual diagnosis/repair time is inevitable. However, after resolving the problem, the system can simply restart by using local checkpoints.

The remaining hardware failures account for more than 60% of total failures. However, according to research on the fatal soft error rate of the "ASCI Q" system at LANL in 2004 [90], it is estimated that about 64% of the hardware failures are attributed to soft errors. Hence, from the failure trace, we have the following statistics: 60.4% × 64% = 38.7% soft errors and 60.4% × (1 − 64%) = 21.7% hard errors. As soft errors are transient and it is highly unlikely that the same error would happen again after the system is restored from the latest checkpoint, local checkpoints are capable of covering all the soft errors. However, hard errors usually mean there is permanent damage to the failure node and the node needs to be replaced. In this case, the local checkpoint stored on the failure node is lost as well; hence only the global checkpoint can protect the system from hard errors. As a result, we estimate that in total 65.2% of failures can be corrected by local checkpoints and only 34.8% of failures need global checkpoints. Further considering that the soft error rate (SER) will greatly increase as the device size shrinks, we project that the SER increased 4 times from 2004 to 2008. Therefore, we further estimate that, for the 1-petaFLOPS system in 2008, 83.9% of failures need only local checkpoints and only 16.1% of failures need global ones. This failure distribution, biased toward locally recoverable errors, provides a significant opportunity for the local/global hybrid checkpointing scheme to reduce the overhead.
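The percentages above can be reproduced from Table 7.2 and the 64% soft-error fraction; the short sketch below walks through the arithmetic (the 4X SER scaling factor is the projection stated in the text).

```python
# Failure root-cause percentages from Table 7.2.
hardware, software, network, human, facilities, undetermined = 60.4, 22.6, 1.8, 0.6, 1.5, 13.1

soft = hardware * 0.64            # 38.7%: hardware failures that are soft errors
hard = hardware * (1 - 0.64)      # 21.7%: permanent hardware failures

local_2004 = soft + software + network + human + facilities    # 65.2%
global_2004 = hard + undetermined                               # 34.8%

# Project the soft error rate growing 4X between 2004 and 2008; all other
# failure categories are assumed to stay constant.
soft_2008 = soft * 4
local_2008 = soft_2008 + software + network + human + facilities
total_2008 = local_2008 + hard + undetermined
print(round(local_2004, 1), round(global_2004, 1))               # 65.2 34.8
print(round(100 * local_2008 / total_2008, 1))                   # ~83.9% locally recoverable
```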


Figure 7.8: A conceptual view of execution time divided into checkpoint intervals: (a) an application running without failure; (b) an application running with a failure, where the system rewinds back to the most recent checkpoint, and it is recovered by the local checkpoint; (c) an application running with a failure that cannot be protected by the local checkpoint. Hence, the system rewinds back to the most recent global checkpoint.

7.3.3 Theoretical Performance Model

In an MPP system with checkpointing, the optimal checkpoint frequency is a function of both the failure rate and the checkpoint overhead. A low checkpoint frequency reduces the impact of the checkpoint overhead on performance but loses more useful work when failures take place, and vice versa. Young [92] and Daly [93] derived expressions to determine the optimal checkpoint frequency that strikes the right balance between the checkpoint overhead and the amount of useful work lost during failures. However, their models do not support local/global hybrid checkpointing. In this dissertation, we extend Daly's work [93] and derive a new model to calculate the optimal checkpoint frequencies for both local and global checkpoints. Let us consider a scenario with the parameters listed in Table 7.3 and divide the total execution time of a checkpointed workload, T_{total}, into four parts:

T_{total} = T_S + T_{dump} + T_{rollback,recovery} + T_{extra-rollback} \quad (7.1)

where T_S is the original computation time of a workload, T_{dump} is the time spent on checkpointing, T_{rollback,recovery} is the recovery cost when a failure occurs (no matter whether it is local or global), and T_{extra-rollback} is the extra cost of discarding more useful work when a global failure occurs.

Table 7.3: Local/Global Hybrid Checkpointing Parameters

Symbol     Description
T_S        The original computation time of a workload
p_L        The percentage of local checkpoints
p_G        1 − p_L, the percentage of global checkpoints
τ          The local checkpoint interval
δ_L        The local checkpoint overhead (dumping time)
δ_G        The global checkpoint overhead (dumping time)
δ_eq       The equivalent checkpoint overhead in general
R_L        The local checkpoint recovery time
R_G        The global checkpoint recovery time
R_eq       The equivalent recovery time in general
q_L        The percentage of failures covered by local checkpoints
q_G        1 − q_L, the percentage of failures that have to be covered by global checkpoints
MTTF       The system mean time to failure, modeled as 5 years / number of nodes
T_total    The total execution time including all the overhead

The checkpoint dumping time is simply the product of the number of checkpoints, T_S/τ, and the equivalent dumping time per checkpoint, δ_eq; thus

T_{dump} = \frac{T_S}{\tau}\,\delta_{eq} \quad (7.2)

where

\delta_{eq} = \delta_L \cdot p_L + \delta_G \cdot p_G \quad (7.3)

and the parameters δ_L and δ_G are determined by the checkpoint size, the local checkpoint bandwidth, and the global checkpoint bandwidth. When a failure occurs, at least one useful work slot has to be discarded, shown as the wasted computation slots in Figure 7.8(b) and Figure 7.8(c). Together with the recovery time, this part of the overhead can be modeled as follows, with the approximation that the failure occurs, on average, halfway through the compute interval:

T_{rollback,recovery} = \left[\frac{1}{2}(\tau + \delta_{eq}) + R_{eq}\right]\frac{T_{total}}{MTTF} \quad (7.4)

where T_{total}/MTTF is the expected number of failures, and the average recovery time R_eq is expressed as

R_{eq} = R_L \cdot q_L + R_G \cdot q_G \quad (7.5)

and the recovery times R_L and R_G are equal to the checkpoint dumping times δ_L and δ_G (in the reverse direction) plus the system rebooting time. Here, q_L and q_G are the percentages of failures recovered by local and global checkpoints, respectively, and their values are determined in the same way as described in Chapter 7.3.2 at different system scales. Additionally, if a failure has to rely on global checkpoints, more useful computation slots will be discarded, shown as the second dark slot in Figure 7.8(c). In this case, as the average number of local checkpoints between two global checkpoints is p_L/p_G, the number of wasted computation slots is, on average, approximately p_L/2p_G. For example, if p_L = 80% and p_G = 20%, there are 80%/20% = 4 local checkpoints between two global checkpoints, and the expected number of wasted computation slots is p_L/p_G/2 = 2. Hence, this extra rollback cost can be modeled as follows:

T_{extra-rollback} = \frac{p_L q_G}{2 p_G}(\tau + \delta_L)\frac{T_{total}}{MTTF} \quad (7.6)

Eventually, after including all the overhead mentioned above, the total execution time of a checkpointed workload is

T_{total} = T_S + \frac{T_S}{\tau}\,\delta_{eq} + \left[\frac{1}{2}(\tau + \delta_{eq}) + R_{eq}\right]\frac{T_{total}}{MTTF} + \frac{p_L q_G}{2 p_G}(\tau + \delta_L)\frac{T_{total}}{MTTF} \quad (7.7)

It can be observed from this equation that a trade-off exists between the checkpoint frequency and the rollback time. Since many variables in the equation have strict lower bounds and can take only discrete values, we use MATLAB to optimize the two critical parameters, the checkpoint interval τ and the local checkpoint ratio p_L, with a numerical method. It is also feasible to derive closed-form expressions for τ and p_L to enable run-time adjustment for changes in workload size and failure distribution, but they are out of the scope of this dissertation.
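As an illustration of how Equation 7.7 can be optimized numerically, consider the sketch below (a sketch only: the parameter values are hypothetical placeholders, and a simple grid search stands in for the MATLAB-based optimization used in the dissertation).

```python
def total_time(Ts, tau, pL, dL, dG, RL, RG, qL, mttf):
    """Solve Equation 7.7 for T_total given a checkpoint interval and local ratio."""
    pG, qG = 1.0 - pL, 1.0 - qL
    d_eq = dL * pL + dG * pG                        # Eq. 7.3
    R_eq = RL * qL + RG * qG                        # Eq. 7.5
    # Terms that multiply T_total/MTTF (Eqs. 7.4 and 7.6); Eq. 7.7 then becomes
    # a linear equation in T_total, which is solved directly.
    k = 0.5 * (tau + d_eq) + R_eq + (pL * qG / (2 * pG)) * (tau + dL)
    denom = 1.0 - k / mttf
    if denom <= 0:                                  # checkpoint/recovery work exceeds MTTF
        return float("inf")
    return (Ts + (Ts / tau) * d_eq) / denom

# Hypothetical inputs: a 100-hour job, 10s local / 300s global checkpoint dumps,
# matching recovery costs, 83.9% of failures locally recoverable, 1-hour MTTF.
Ts, dL, dG, RL, RG, qL, mttf = 360000.0, 10.0, 300.0, 10.0, 300.0, 0.839, 3600.0
best = min((total_time(Ts, tau, pL, dL, dG, RL, RG, qL, mttf), tau, pL)
           for tau in range(60, 3601, 60)
           for pL in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95))
print("T_total = %.0f s at tau = %d s, pL = %.2f" % best)
```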

7.4 Experimental Results

The primary goal of this work is to improve the checkpoint efficiency and prevent checkpointing from becoming the bottleneck to MPP scalability. In this section, the analytical equations derived in Chapter 7.3.3 are mainly used to estimate the checkpoint overhead. In addition, simulations are also conducted to obtain quantitative parameters such as the checkpoint size.

7.4.1 Checkpointing Scenarios

In order to show how the proposed local/global hybrid checkpoint using PCRAM can reduce the performance and power consumption overhead caused by the checkpointing operations, we study the following 4 scenarios:

1. Pure-HDD: The conventional checkpoint approach that only stores checkpoints in HDD globally.

2. DIMM+HDD: Store checkpoints in PCRAM DIMM locally and in HDD globally. In each node, the PCRAM DIMM capacity is equal to the DRAM DIMM capacity.

3. DIMM+DIMM: Store both local and global checkpoints in the in-node PCRAM DIMM. In each node, the PCRAM DIMM capacity is three times the DRAM DIMM capacity: one copy for the latest local checkpoint and two copies for the global checkpoints (one for the node itself and one for the neighboring node).

4. 3D+3D: Same as DIMM+DIMM, but deploy the PCRAM resource using 3D-PCRAM rather than PCRAM-DIMM.

The bottleneck of each scenario is listed in Table 7.4.

Table 7.4: Bottleneck Factor of Different Checkpoint Schemes

Scheme       Local medium            Local bottleneck
Pure-HDD     -                       -
DIMM+HDD     Self's PCRAM DIMM       Memory bandwidth
DIMM+DIMM    Self's PCRAM DIMM       Memory bandwidth
3D+3D        Self's 3D DIMM          3D bandwidth

Scheme       Global medium           Global bottleneck
Pure-HDD     HDD on I/O nodes        HDD, network bandwidth
DIMM+HDD     HDD on I/O nodes        HDD, network bandwidth
DIMM+DIMM    Neighbor's PCRAM DIMM   Network bandwidth
3D+3D        Neighbor's 3D DIMM      Network bandwidth

Table 7.5: Specifications of the Baseline Petascale System and the Projected Exascale System

                                    1 petaFLOPS   1 exaFLOPS
FLOPS                               10^15         10^18
Year                                2008          2017
Number of sockets                   20,000        100,000
Compute/IO node ratio               15:1          15:1
Memory per socket                   4GB           210GB
Phase-change memory bandwidth       10GB/s        32GB/s
Network bandwidth                   3.5GB/s       400GB/s
Aggregate file system bandwidth     220GB/s       1600GB/s
Normalized soft error rate (SER)    1             32
Transient error percentage          91.5%         99.7%

7.4.2 Scaling Methodology

We use the specification of the IBM Roadrunner Supercomputer [89], which achieves a sustained performance of 1.026 petaFLOPS on LINPACK, to model the 1-petaFLOPS baseline MPP system. Table 7.5 shows the MPP system configurations for the petaFLOPS baseline and a projected exaFLOPS system. For configurations between these two ends, we scale the specification values according to the time frame. For all our evaluations, we assume the timing overhead of initiating a coordinated checkpoint is 1ms, which is reported as the latency of data broadcasting for hardware broadcast trees in BlueGene/L [94].


Figure 7.9: The checkpoint overhead comparison in a 1-petaFLOPS system (normalized to the computation time).


Figure 7.10: The checkpoint overhead comparison in a 1-exaFLOPS system (normalized to the computation time).

7.4.3 Performance Analysis

For all our evaluations, we employ the equations derived in Chapter 7.3.3 to determine the execution time of workloads in various systems and scenarios. For a given system, based on the system scale and the checkpoint size, the optimal checkpoint frequency can be decided. For this checkpoint frequency, an inherent trade-off exists between the proportions of local and global checkpoints. For example, as the fraction of local checkpoints increases, the overall checkpoint overhead drops, but the recovery time from global checkpoints rises; on the other hand, as the fraction of global checkpoints increases, the recovery time decreases, but the total execution time can take a hit because of the higher checkpoint overhead. This trade-off is modeled by Equation 7.7 in Chapter 7.3.3, from which the optimal values of the checkpoint interval (τ) and the percentage of local checkpointing (p_L) can be found.

Figure 7.9 shows the checkpoint overhead in a petascale system using pure-HDD, DIMM+HDD, DIMM+DIMM, and 3D+3D, respectively. DIMM+HDD reduces the checkpoint overhead by 60% on average compared to pure-HDD. Moreover, the ideal "instant checkpoint" is almost achieved by DIMM+DIMM and 3D+3D. The greatly reduced checkpoint overhead directly translates into more effective computation time, or equivalently, higher system availability. The advantages of DIMM+DIMM and 3D+3D become clear as the MPP system is scaled towards the exascale level, where pure-HDD and DIMM+HDD are no longer feasible; Figure 7.10 demonstrates the results. It can be seen that only DIMM+DIMM and 3D+3D are still workable at the exascale level. More importantly, the average overhead of 3D+3D is still less than 5% even in the exascale system. This shows that our intermediate PCRAM-DIMM and ultimate 3D-PCRAM checkpointing solutions can provide the failure resiliency required by future exascale systems with affordable overhead.

7.4.4 Power Analysis

Although the proposed techniques are targeted primarily to reduce the checkpoint overhead, they are useful for power reduction as well:

• Since PCRAM is a non-volatile memory technology, the PCRAM memory cells do not consume any power when the system is not taking checkpoints. Only a small amount of power is consumed by the peripheral circuits in the PCRAM chips, and this power can be further saved by powering the chips off completely in sleep mode. Using 3D+3D PCRAM checkpointing, the PCRAM modules consume no power during more than 95% of the system running time. Other approaches, e.g., battery-backed DRAM checkpointing, will inevitably leak power even when no checkpoints are being taken. Given that the nap power of a 2GB DRAM-DIMM is about 200mW [95], using battery-backed DRAM checkpointing in a 1-petaFLOPS system would inevitably waste about 20kW of power. In contrast, our PCRAM checkpointing modules do not consume any power during the computation time.

• With future supercomputers dissipating many megawatts, it is important to keep system availability high to ensure that the huge power budget is effectively spent on useful computation tasks. DIMM+DIMM can maintain the system availability above 91%, and 3D+3D can achieve nearly 97% system availability even at the exascale level.

7.5 Summary

Checkpoint-restart has been an effective tool for providing reliable and available MPP systems. However, current in-disk checkpointing mechanisms incur high performance penalties and are woefully inadequate for meeting future system demands. To improve the scalability of checkpointing, we introduce eNVM technology into the MPP system as a fast checkpoint device. Combined with the hybrid local/global checkpointing mechanism, PCRAM-DIMM checkpointing enables MPP systems to scale up to 500 petaFLOPS with tolerable checkpoint overhead. To provide reliable systems beyond this scale, we leverage emerging 3D die stacking and propose 3D PCRAM/DRAM memory for checkpointing. After combining all these effects, eNVM-based checkpointing incurs only 3% overhead in an exascale system by making near-instantaneous checkpoints.

Chapter 8

Application-Level: eNVM as On-Chip Cache

Given the attractive properties of high density, fast access, good scalability, and non-volatility, eNVM technologies can challenge the role of SRAM and DRAM in the mainstream memory hierarchy for the first time in more than 30 years. Much innovative research [43,54,96-98] is focusing on designing the next generation of memory systems using these memory technologies. Unlike the previous work, we discuss the feasibility of using ReRAM together with conventional CMOS-compatible SRAM to build a memory hierarchy from the very first level cache to the main memory.

8.1 Overview

Like other eNVM technologies, ReRAM often has high write energy, long write latency, and limited write endurance, which may offset its power benefits, result in performance degradation, or even disallow ReRAM cache deployment altogether. Considering that there is a trade-off among performance, energy, and silicon area (silicon cost) when designing an ReRAM-based memory hierarchy, a thorough architecture-level design space exploration is needed to address three key questions: (1) Will the write endurance limitations of ReRAM hold ReRAM caches back from deployment? (2) How should the proper amount of on-chip ReRAM cache be chosen? (3) How should different types of ReRAM be deployed and partitioned into multiple levels of the memory hierarchy? To answer these questions, in this part of the dissertation we developed a joint circuit-architecture optimization framework to design such a memory hierarchy under different optimization goals with the following four steps:

• We first analyzed the write endurance impact on ReRAM cache deployment and proposed wear-leveling techniques for ReRAM caches;

• We then built a circuit-level model to estimate the performance, energy, and area of ReRAM designs and used it to build a ReRAM circuit library;

• After that, we created a statistical architecture-level model that estimates the system performance and energy consumption under different memory hierarchy designs;

• Finally, we used a simplified simulated annealing algorithm1 to quickly find a near-optimal solution in this design space.

The challenges of this work are threefold. First, considering that ReRAM technology has limited write endurance, we need to identify whether it is feasible to deploy ReRAM in the different levels of the memory hierarchy and to design effective wear-leveling techniques for ReRAM caches. Although inter-set cache write wear-leveling is similar to that for non-volatile main memory [43,97,98], intra-set wear-leveling is a new problem to solve before using ReRAM as caches. Second, since ReRAM technology is still in an early stage, there are only a limited number of ReRAM prototypes available for calibrating the ReRAM module design parameters [6,25-27,99,100]. Moreover, the current ReRAM prototype designs often do not push technology speed or density limits, whereas ultra-fast or ultra-dense ReRAM might play an important role in the memory hierarchy. Therefore, the circuit-level ReRAM model has to be built from theoretical circuit analysis starting from basic device models. Third, we require models that reflect how the architectural performance, e.g., CPI or memory system energy consumption, changes as we tune the underlying memory hierarchy design knobs such as the cache capacity and the cache latency. Conventionally, such a model is built through simulations; however, it is impractical to run time-consuming simulations for each possible design input. To surmount this difficulty, we apply statistical analysis and effectively use limited simulation runs to approximate the entire architectural design space.

We use our framework to explore a broad space of ReRAM designs, from aggressively latency-optimized to highly area-optimized, and fit them into multiple levels of the memory hierarchy. Our work shows that, combined with SRAM L1 and possibly L2 caches, the versatility of ReRAM plus its current write endurance limit allows ReRAM to excel in the remaining memory hierarchy levels, from L3 caches to main memories. If the ReRAM write endurance is further improved by 10X, using an L2 ReRAM cache also becomes feasible and brings extra energy savings. In general, our analysis shows that such an ReRAM-based memory hierarchy has significant benefits in energy reduction with insignificant performance degradation, and achieves overall improvements in EDP and EDAP2.

1Simulated annealing is a generic algorithm for global optimization problems.

8.2 ReRAM-Based Cache Wear-Leveling

Like other non-volatile memory technologies such as NAND flash and PCRAM, ReRAM has limited write endurance (i.e., the number of times that an ReRAM cell can be overwritten). ReRAM researchers can currently achieve write endurance up to 10^10 [6] or 10^11 [35], which we believe can be improved by modifications in device geometry, materials, and processing. A projected plan for future ReRAM highlights endurance on the order of 10^12 or more write cycles [101]. Although this is considerably larger than NAND flash (10^5) or recent PCRAM chips (10^8), the limited write endurance can be an issue for ReRAM caches without wear-leveling.

8.2.1 Inter-Set Cache Line Wear-Leveling

We define the inter-set write variation as the variation of the per-set average write count. Because applications have biased address residency, the cache lines in different sets can experience totally different write access patterns and loads. As one of the extreme examples, the DC.B application repeatedly stresses only a certain portion of the cache sets, resulting in average per-set write counts that range from 1 to 10,000, as shown in Figure 8.1. However, many recent techniques [43,97,98] developed for extending the endurance of PCRAM-based memories can be used to wear-level the inter-set write imbalance of ReRAM caches. In this work, we assume cache set numbers are periodically shifted.

2EDP is energy-delay-product; EDAP is energy-delay-area-product.

Figure 8.1: Inter-set L3 cache line write count variation in a simulated 8-core system with 32KB I-L1, 32KB D-L1, 1MB L2, and 8MB L3 caches.

Figure 8.2: Intra-set L3 cache line write count variation in a simulated 8-core system with 32KB I-L1, 32KB D-L1, 1MB L2, and 8MB L3 caches.
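A minimal sketch of such periodic set-index shifting is given below (illustrative only: the shift interval and the simple additive remapping are assumptions, not the specific scheme of [97]).

```python
class SetIndexShifter:
    """Periodically rotate the physical cache set that each logical set maps to,
    so that write-heavy sets are spread over all physical sets over time."""
    def __init__(self, num_sets, shift_interval):
        self.num_sets = num_sets
        self.shift_interval = shift_interval   # writes between successive shifts
        self.offset = 0
        self.writes = 0

    def physical_set(self, logical_set):
        return (logical_set + self.offset) % self.num_sets

    def on_write(self):
        self.writes += 1
        if self.writes % self.shift_interval == 0:
            self.offset = (self.offset + 1) % self.num_sets
            # NOTE: a real design must also migrate or flush the remapped lines.
```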


Figure 8.3: The comparison of the D-L1 intra-set cache write variations using plain LRU policy and the proposed endurance-aware LRU policy with proactive invalidation: (a) The log-scale write counts of different applications; (b) the linear-scale write counts of selected applications that originally have large intra-set cache write variations.

8.2.2 Intra-Set Cache Line Wear-Leveling

We define the intra-set write variation as the average variation of the write count across a cache set. Ideally, a conventional LRU cache replacement policy should assign the access load evenly to each cache line in a set. However, a balanced access load does not necessarily lead to a balanced write load. For example, if just one cache line in a set is frequently visited by cache write hits, it will absorb a large number of cache writes, and thus the write accesses may be unevenly distributed to the remaining N-1 lines in the set (for an N-way associative cache). Figure 8.2 shows the intra-set write variation. While the intra-set variation is much smaller than the inter-set variation, it still greatly shortens the ReRAM cache lifetime. In order to alleviate the intra-set variation, it is necessary to enhance the LRU replacement policy. In this work, we augment the LRU policy with two modifications:

• We first add a write hit counter (one counter for the entire cache). The counter is 9 bits wide and only incremented at each write hit event.

• If the counter wraps around (from 511 to 0), the cache invalidates the line corresponding to the write hit and reloads the data into another cache line in the set. We call this feature proactive invalidation.

The motivation for this augmentation is twofold: (1) the intra-set write count variation is mainly caused by consecutive write hits; for example, if hot dirty data occupies a cache line, that cache line will be written over and over upon every write hit to this hot data; (2) since the data is hot, it is highly likely that the hot data saturates the write hit counter, so the proactive invalidation feature evicts the hot data from one cache line and reloads it into a relatively fresh cache line in the set. In order to observe the effectiveness of the LRU + proactive invalidation policy, we examine the average write counts of the core-level cache (i.e., the D-L1 cache) and the associated variations. Figure 8.3 is a comparison of the plain LRU policy and our augmented LRU policy with proactive invalidation. The results show that proactive invalidation greatly reduces the intra-set write variation. For some applications such as EP.C, the relative intra-set write variation is reduced from 171% to 7%. Although the average write count increases by 5% on average, the worst-case write count is alleviated, and thus the ReRAM cache lifetime is improved.
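The following sketch shows one way the augmented replacement policy could behave (a simplification under assumptions: a single cache-wide 9-bit counter as described above, per-way recency timestamps standing in for LRU state, and an immediate reload path).

```python
import itertools

class EnduranceAwareLRUSet:
    """One N-way cache set using LRU replacement plus proactive invalidation.

    A cache-wide 9-bit write-hit counter (shared by all sets, passed in as a
    one-element list) is incremented on every write hit; when it wraps from
    511 to 0, the way holding the hot line is invalidated and the data is
    reinstalled in a different, less-recently-used way of the set.
    """
    _clock = itertools.count(1)

    def __init__(self, ways):
        self.tags = [None] * ways
        self.last_used = [0] * ways

    def _lru_way(self, exclude=None):
        candidates = [w for w in range(len(self.tags)) if w != exclude]
        return min(candidates, key=lambda w: self.last_used[w])

    def write(self, tag, hit_counter):
        if tag in self.tags:                          # write hit
            way = self.tags.index(tag)
            hit_counter[0] = (hit_counter[0] + 1) % 512
            if hit_counter[0] == 0:                   # counter wrapped: proactive invalidation
                self.tags[way] = None                 # invalidate the hot way ...
                way = self._lru_way(exclude=way)      # ... and reload into a different way
                self.tags[way] = tag
        else:                                         # write miss: normal LRU replacement
            way = self._lru_way()
            self.tags[way] = tag
        self.last_used[way] = next(EnduranceAwareLRUSet._clock)
        return way

hit_counter = [0]                     # one 9-bit counter shared by the whole cache
cache_set = EnduranceAwareLRUSet(ways=8)
```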

8.2.3 Endurance Requirements for ReRAM Caches

We estimate the endurance requirement for ReRAM caches at different levels by simulating and collecting the cache write access traces over 10 billion clock cycles (i.e., 3.125s of wall clock time on a 3.2GHz CPU). After applying state-of-the-art inter-set wear-leveling [97] and the proactive invalidation-based intra-set wear-leveling, our estimation is that, in order to have a 5-year lifetime guarantee of using ReRAM in L3, L2, and L1 caches, the required ReRAM cell write endurance should be 10^10, 10^11, and 10^13 writes, respectively.

8.3 Circuit-Level ReRAM Model

In order to establish an ReRAM component library spanning from ultra-fast to ultra-dense ReRAM designs, a circuit-level ReRAM model is necessary.

8.3.1 ReRAM Modeling

We use NVSim as the circuit-level model. To properly model the new features of ReRAM, NVSim incorporates ReRAM features such as cross-point access, non-H-tree organization, external sensing, and minimum-sized row decoders. In our circuit-level ReRAM model, both MOS-accessed and cross-point ReRAM structures have been explored. 85

Figure 8.4: The design spectrum of 32nm ReRAM: (upper) read latency vs. density; (bottom) write latency vs. density. Both plots compare MOS-accessed ReRAM, cross-point ReRAM, SRAM, off-chip DRAM, and on-chip eDRAM (density in KB/mm2, latency in ns).

The new cross-point cell array structure that exploits the non-linearity of ReRAM cells makes it possible to build ultra-high-density ReRAM modules and enlarges the scope of possible ReRAM module configurations. However, the area-efficiency benefit of the cross-point structure comes with design overhead. Several design issues, such as half-select write, two-step sequential write, and external sensing [57], are included in our circuit-level ReRAM model.

8.3.2 ReRAM Array Design Spectrum

In general, MOS-accessed ReRAM is faster and cross-point ReRAM is denser. Therefore, ReRAM array designs can fit into a wide spectrum that ranges from fast-access designs to high-area-efficiency designs. This design flexibility makes ReRAM a promising universal memory technology that can be used for the entire memory hierarchy, from the first level cache to the main memory subsystem or even a secondary storage device. Figure 8.4 demonstrates the design spectrum that ReRAM covers and the corresponding regions of SRAM and DRAM. MOS-accessed ReRAM and cross-point ReRAM are more than 10X denser than SRAM, and cross-point ReRAM can be as dense as DRAM. In terms of speed, ReRAM has read speed comparable to that of SRAM, but significantly slower write speed. The write latency of MOS-accessed ReRAM is dominated by the switching pulse duration, which is configured to be 50ns in our experiments, and the latency of cross-point ReRAM is twice this due to two-step writes.

8.4 Architecture-Level Model of Memory Hierarchy Design

At the architectural level, we need performance models that predict the architectural performance of the overall system, such as CPI and access counts at all cache levels, as we change the underlying memory hierarchy. As we focus on designing a universal memory hierarchy in this work, the input parameters at this level are the tuning knobs of the memory subsystem, such as cache capacity, cache associativity, cache read latency, and cache write latency, which define a huge multi-dimensional design space. In a simulation-based approach, long run times are necessary to simulate each possible input setting, making it intractable to explore a large design space. However, simulation accuracy is not the first priority in such a large-scale design space exploration. Instead, a speedy but less accurate architecture-level model is the preferred choice. Previous research addresses this challenge by first randomly sampling a small portion of the entire design space and then building a statistical model from the simulation results to infer the impact of other input configurations on the overall performance metrics [102-107]. Although it is time-consuming to collect sufficient sample data from conventional simulation, this is a one-time effort, and all later outputs can be generated with the statistical model. Different fitting models have been used in the inference process that fits a predictive model through regression: Joseph et al. [102] used linear regression, Lee and Brooks [103] used cubic splines, whereas Azizi et al. [104] applied posynomial functions to create architecture-level models.

Figure 8.5: The basic organization of a two-layer feed-forward artificial neural network (inputs: memory hierarchy design parameters; hidden layer: sigmoid; output layer: linear; outputs: architecture-level predictions such as per-level cache access/miss counts and IPC).

8.4.1 Feed-Forward Network

In this work, we use an artificial neural network (ANN) [105-107] to fit the sampled simulation results into a predictive performance model. Figure 8.5 shows a simplified diagram of a two-layer feed-forward network with one sigmoid hidden layer (which uses sigmoid functions as the calculation kernel) and one linear output layer (which uses linear functions as the calculation kernel). The input and output design parameters are also shown in Figure 8.5. The essential architectural outputs for energy-performance-area evaluation are the read/write access counts and the read/write miss counts of every cache level, the read/write access counts of the main memory, and the number of instructions that each microprocessor core has processed. To feed the architectural model, the inputs of the architectural design space are the capacity, associativity, and read/write latency of all the cache modules and the main memory. The statistical architectural model makes an output estimate from a given input set, and it can be treated as a black box that generates predicted outputs as a function of the inputs:

L1_{readCount} = f_1(L1_{capacity}, L1_{assoc}, L1_{readLatency}, ..., L3_{capacity}, ..., Memory_{writeLatency})
...
L3_{writeMiss} = f_{n-1}(L1_{capacity}, L1_{assoc}, L1_{readLatency}, ..., L3_{capacity}, ..., Memory_{writeLatency})
IPC = f_n(L1_{capacity}, L1_{assoc}, L1_{readLatency}, ..., L3_{capacity}, ..., Memory_{writeLatency}) \quad (8.1)

In our model, the input dimension is 14 (vector I_{14}), and the output dimension is 13 (vector O_{13}). The number of neurons in the hidden layer (X) is S, which ranges from 30 to 60 depending on different fitting targets. In Figure 8.5, W and b are the weight matrix and bias vector of the hidden layer; W' and b' are those of the output layer. The feed-forward ANN is calculated as follows,

X_S = \sigma(W_{S \times 14}\, I_{14} + b_S) \quad (8.2)
O_{13} = \psi(W'_{13 \times S}\, X_S + b'_{13}) \quad (8.3)

where σ(·) is a sigmoid function and ψ(·) is a linear function.
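A minimal NumPy sketch of this two-layer feed-forward predictor is shown below (dimensions follow Equations 8.2 and 8.3; the weights here are random placeholders rather than trained values, and a logistic sigmoid is assumed).

```python
import numpy as np

S = 40                                   # hidden neurons (30-60 in the text)
rng = np.random.default_rng(0)
W,  b  = rng.standard_normal((S, 14)), rng.standard_normal(S)    # hidden layer
Wp, bp = rng.standard_normal((13, S)), rng.standard_normal(13)   # output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(design_inputs):
    """Map a 14-dimensional memory hierarchy design point to the 13 outputs
    (per-level access/miss counts and IPC), per Eqs. 8.2 and 8.3."""
    x = sigmoid(W @ design_inputs + b)   # Eq. 8.2: sigmoid hidden layer
    return Wp @ x + bp                   # Eq. 8.3: linear output layer

print(predict(rng.standard_normal(14)).shape)   # (13,)
```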

8.4.2 Training and Validation

This feed-forward network is able to fit multi-dimensional mapping problems given consistent data and enough neurons in the hidden layer. The accuracy of the statistical architectural model depends on the number of training samples provided from the actual full-system simulations. In this work, 1,000 cycle-accurate full-system simulation results are collected for each workload. Among each 1,000 samples, 800 data samples are used for training, 100 are used for testing, and the other 100 are used for validation during the training procedure to prevent over-training [108]. To reduce variability, multiple rounds of cross-validation, during which data are rotated among the training, testing, and validation sets, are performed using different partitions, and the validation results are averaged over the rounds. Every ANN is configured to have 30-60 hidden neurons and trained using the Levenberg-Marquardt algorithm [109]. The Levenberg-Marquardt algorithm trains the ANN by iteratively adjusting the weight matrices and bias vectors based on the data until the ANN accurately predicts the outputs from the input parameters.


Figure 8.6: An accurate ANN fitting example: MG from NPB.


Figure 8.7: A typical ANN fitting example: dedup from PARSEC.

To measure the model accuracy, we use the metric error = |predicted − actual|/actual. The average prediction error is 4.29%. Figure 8.6 to Figure 8.8 show three examples of the ANN fitting results: a very accurate fit (0.15% error), a typical fit (3.06% error), and the worst fit in this work. Even in the worst case, the prediction error is under 18.71%. More details of the experiment setups, the studied benchmarks, the fitting results, and techniques for reducing fitting errors are discussed in Chapter 8.5.

8.5 Experimental Methodology

In this section, we first present the design space exploration framework, and then describe the simulation environment and the experimental methodology.


Figure 8.8: The worst ANN fitting example: x264 from PARSEC.

8.5.1 Circuit-Architecture Joint Exploration Framework

After creating the circuit-level and the architecture-level models, we build a joint circuit-architecture exploration framework to explore ReRAM designs in different memory hierarchies and to evaluate the trade-off among energy, performance, and silicon area in the microprocessor memory system design space. Figure 8.9 shows an overview of this joint circuit-architecture exploration framework. In this framework, 1,000 randomly generated architecture-level inputs are used to produce 1,000 corresponding samples in the architectural design space. The samples are then fed into the ANN trainer to establish the architecture-level performance model for each benchmark workload, and the trained ANN is used as the architecture-level performance model. The circuit-level inputs are first passed through the ReRAM performance, energy, and area model, and then fed into the ANN-based architecture-level performance model to generate the predicted architecture-level results, such as IPC and power consumption, together with the silicon area estimates. When the predicted result does not meet the design requirement, feedback information containing the distance between the design optimization target and the currently achieved result is sent to a simulated annealing [110] optimization engine, and a new design trial is generated for the optimization loop. This optimization procedure steps forward iteratively until the design requirement (e.g., best EDP or best EDAP) is achieved or a near-optimal solution is reached. If a full design space exploration is required, instead of finding optimal solutions, all the knobs in the joint circuit-architecture design space can be exhaustively searched. The final goal of this framework is to find the best way of partitioning on-chip ReRAM resources to form a proper multi-level memory hierarchy that achieves the optimal energy-performance-area trade-offs.

Figure 8.9: Overview of the optimization framework (circuit-level inputs feed the ReRAM performance/energy/area model; its outputs, together with the memory hierarchy design knobs, feed the ANN-based architecture-level model; a simulated annealing engine closes the optimization loop).
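The optimization loop can be sketched as below (a generic simulated annealing skeleton under assumptions: random_design, neighbor, and evaluate stand in for the framework's design-point generator, perturbation step, and the chained circuit-level/ANN models, and the cooling schedule is arbitrary).

```python
import math, random

def simulated_annealing(random_design, neighbor, evaluate,
                        t_start=1.0, t_end=1e-3, cooling=0.95, steps_per_t=50):
    """Minimize a design cost (e.g., EDP or EDAP predicted by the models)."""
    current = random_design()
    current_cost = evaluate(current)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            trial = neighbor(current)                # perturb one design knob
            trial_cost = evaluate(trial)
            # Accept improvements always; accept worse designs with a
            # temperature-dependent probability to escape local minima.
            if trial_cost < current_cost or \
               random.random() < math.exp((current_cost - trial_cost) / t):
                current, current_cost = trial, trial_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling                                 # cool down
    return best, best_cost
```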

8.5.2 Simulation Environment

We use our joint design space optimization framework to evaluate a compute node for a four-core chip multiprocessor (CMP). Each core is configured to be a scaled 32nm in-order superscalar SPARC-V9-like processor core. In this setting, the microprocessor die has four 3.2GHz physical cores, and each core has its private instruction L1 cache (I-L1), data L1 cache (D-L1), and unified L2 cache (L2). The four cores together share an on-die L3 cache. In the design space, both MOS-accessed and cross-point ReRAM have been evaluated for the L1, L2, and L3 caches. In addition, SRAM caches have also been explored, since write latency is critical to performance in certain levels of the cache hierarchy. The detailed input design space is listed in Table 8.1. In this work, I-L1 and D-L1 are assumed to have the same design specification. We use NPB [111] and PARSEC [84] as the experimental workloads. The workload size of the NPB benchmark is CLASS-C (DC has no CLASS-C setting, so CLASS-B is used instead), and the native inputs are used for the PARSEC benchmark to generate realistic program behavior. In total, 23 benchmark applications are evaluated in the experiment. We simulate 1,000 randomly generated design configurations per benchmark using the Simics full-system simulator [82], from which we then generate our statistical architectural model. Each Simics simulation run is fast-forwarded to the pre-defined breakpoint at the code region of interest, warmed up by 1 billion instructions, and then simulated in the detailed timing mode for 10 billion cycles.

Table 8.1: Input design space parameters

Parameter                   Range
Fabrication process         32nm
Processor core              4-core, 3.2GHz, SPARC-V9-like
Transistor model            ITRS high-performance CMOS model
I-L1 (D-L1) capacity        8KB to 64KB
I-L1 (D-L1) associativity   4-way to 8-way
I-L1 (D-L1) memory type     SRAM, MOS-accessed ReRAM, cross-point ReRAM
I-L1 (D-L1) design style    24 in total3
L2 capacity                 None, or 128KB to 512KB
L2 associativity            8-way or 16-way
L2 memory type              SRAM, MOS-accessed ReRAM, cross-point ReRAM
L2 design style             24 in total
L3 capacity                 None, or 512KB to 16MB
L3 associativity            8-way to 32-way
L3 memory type              SRAM, MOS-accessed ReRAM, cross-point ReRAM
L3 design style             24 in total
Main memory type            MOS-accessed ReRAM, cross-point ReRAM
Main memory design style    External current-sensing, area-optimized

Figure 8.10 illustrates the IPC prediction errors of the architecture-level performance model after training. The x-axis shows the relative error between the predicted and the actual values, and the y-axis presents the cumulative distribution function. The prediction results of other output parameters (e.g., L1 read count, L2 write miss, etc.) are similar to the IPC prediction result. After obtaining the access activities of each cache level, the memory subsystem power consumption can be calculated. Because the dynamic energy consumption of main memory is proportional to the last-level cache miss rate, we include it as a part of the memory subsystem power consumption for a fair comparison. In addition, we use McPAT [112] to estimate the power consumption of the logic components, including the processor cores, the on-chip memory controller, and the inter-core crossbar.

3We provide 8 types of optimizations, which are for read latency, write latency, read energy, write energy, read EDP, write EDP, silicon area, and leakage power. We also provide 3 types of sensing schemes, which are external current-sensing, external voltage-sensing, and internal voltage-sensing. Internal current-sensing is not considered as it causes low area efficiency. Therefore, there are 24 memory design types in total.

100% 100% 90% 90% blackscholes 80% 80% 70% 70% canneal 60% 60% facesim 50% BT.C IS.C 50% freqmine CDF 40% CG.C LU.C CDF 40% raytrace 30% DC.B MG.C 30% streamcluster 20% EP.C SP.C 20% swaptions 10% FT.C UA.C 10% vips 0% 0% 0% 5% 10% 15% 20% 25% 30% 0% 0.05% 0.1% 0.15% 0.2% Prediction error Prediction error 100% 90% 80% 70% 60% 50% bodytrack CDF 40% dedup 30% ferret 20% fluidanimate 10% x264 0% 0% 5% 10% 15% 20% 25% 30% Prediction error

Figure 8.10: CDF plots of error on IPC prediction of NPB and PARSEC benchmark applications.

The run-time dynamic power consumption (Power_{logic,dynamic}) is scaled down from the peak dynamic power according to the actual IPC value. The total power consumption of the processor chip is calculated as follows,

Energy_{memory,dynamic} = \sum_{i=1}^{3} \left[ N_{readHit_i} E_{hit_i} + N_{readMiss_i} E_{miss_i} + (N_{writeHit_i} + N_{writeMiss_i}) E_{write_i} \right]
                          + N_{readMiss_3} E_{read_4} + N_{writeMiss_3} E_{write_4}                                (8.4)

Power_{memory,leakage} = 2 N_{core} P_1 + N_{core} P_2 + P_3                                                       (8.5)

Power_{processor,total} = Energy_{memory,dynamic} / T + Power_{logic,dynamic} + Power_{memory,leakage} + Power_{logic,leakage}      (8.6)

In Equation 8.4, N_{readHit_i}, N_{readMiss_i}, N_{writeHit_i}, and N_{writeMiss_i} are the read hit count, read miss count, write hit count, and write miss count of the Level-i cache, which are generated from the ANN prediction. E_{hit_i}, E_{miss_i}, and E_{write_i} are the dynamic energy consumption of a hit, a miss, and a write operation in the Level-i cache, and they are obtained from the circuit-level energy model. E_{read_4} and E_{write_4} are the dynamic energy consumption of main memory read and write operations, since we label the main memory as the fourth level of the memory hierarchy. In Equation 8.5, N_{core} is the number of cores, and P_i represents the leakage power consumption of each cache level. The coefficient of 2 accounts for the identical data and instruction L1 caches (D-L1 and I-L1) in this work. Equation 8.6 gives the total power consumption, where T is the simulation time (T = 10 billion cycles / 3.2GHz = 3.125s according to our experimental setup).

Table 8.2: MOS-accessed and cross-point ReRAM main memory parameters (1Gb, 8-bit, 16-bank)

                           MOS-accessed ReRAM    Cross-point ReRAM
    Die area               129mm2                48mm2
    Read latency           6.2ns                 10.0ns
    Write latency          54.9ns                107.1ns
    Burst read latency     4.3ns                 4.3ns
    Burst write latency    4.3ns                 4.3ns
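As a concrete illustration, the sketch below evaluates Equations 8.4 through 8.6 for a single design point. The function and argument names are ours; the per-access energies, leakage values, and access counts are placeholders to be filled in from the circuit-level model and the ANN prediction.

```python
# Straightforward evaluation of Equations 8.4-8.6 (names and structure are ours).

def memory_dynamic_energy(read_hits, read_misses, write_hits, write_misses,
                          e_hit, e_miss, e_write, e_read_mem, e_write_mem):
    """Equation 8.4: per-level lists are indexed 0..2 for L1..L3; main memory is 'level 4'."""
    energy = 0.0
    for i in range(3):
        energy += (read_hits[i] * e_hit[i] + read_misses[i] * e_miss[i]
                   + (write_hits[i] + write_misses[i]) * e_write[i])
    # L3 misses are served by the ReRAM main memory.
    energy += read_misses[2] * e_read_mem + write_misses[2] * e_write_mem
    return energy

def memory_leakage_power(n_core, p1, p2, p3):
    """Equation 8.5: 2*Ncore L1 instances (I-L1 and D-L1), Ncore private L2s, one shared L3."""
    return 2 * n_core * p1 + n_core * p2 + p3

def processor_total_power(e_mem_dynamic, t, p_logic_dynamic, p_mem_leakage, p_logic_leakage):
    """Equation 8.6: average memory dynamic power plus logic dynamic and all leakage power."""
    return e_mem_dynamic / t + p_logic_dynamic + p_mem_leakage + p_logic_leakage

# Example wiring (placeholder inputs, except T = 3.125s and the 7.41W logic leakage
# quoted in the text above):
# total = processor_total_power(e_mem, t=3.125, p_logic_dynamic=5.0,
#                               p_mem_leakage=1.0, p_logic_leakage=7.41)
```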

8.6 Design Exploration and Optimization

In this section, we demonstrate how to use the optimization framework to perform design space exploration for memory hierarchy design.

8.6.1 Cache Hierarchy Design Exploration

We first exhaustively explore the cache hierarchy design space via the ANN model to show the Pareto-optimal curves of the trade-off range. In this step, we separate the cache design space (L1, L2, and L3) from the memory design space. We suppose the main memory is built with either cross-point ReRAM optimized for density or MOS-accessed ReRAM optimized for latency. We model the ReRAM main memory after the DDR3 protocol, with 8-burst-deep prefetching and a 993MHz memory clock. Table 8.2 lists the timing and area parameters of the MOS-accessed ReRAM and the cross-point ReRAM main memory solutions. We assume both MOS-accessed and cross-point ReRAM main memory modules would be available in the future in a DIMM form factor, with MOS-accessed ReRAM aiming at the high-performance market thanks to its faster write speed, and cross-point ReRAM aiming at the low-cost market as it is more than 2X denser.
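Given the predicted IPC and total power for every explored configuration, the Pareto-optimal points plotted in Figure 8.11 can be extracted with a simple dominance filter such as the generic sketch below; this is an illustrative helper, not code from the framework itself.

```python
# Keep the configurations that are Pareto-optimal when maximizing IPC and
# minimizing power. 'points' holds (power_watts, ipc, label) tuples produced
# by the ANN-based model.
def pareto_front(points):
    front = []
    for power, ipc, label in points:
        dominated = any(q_power <= power and q_ipc >= ipc and
                        (q_power, q_ipc) != (power, ipc)
                        for q_power, q_ipc, _ in points)
        if not dominated:
            front.append((power, ipc, label))
    return sorted(front)  # sorted by power for plotting

# Toy example with points resembling D1, D4, and D7 from Figure 8.11:
demo = [(10.19, 0.123, "D1"), (12.58, 0.600, "D4"),
        (23.90, 0.661, "D7"), (15.00, 0.300, "dominated")]
print(pareto_front(demo))   # the dominated point is filtered out
```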

[Figure 8.11 plot: IPC (y-axis) versus total processor power and main memory dynamic power in watts (x-axis), with design points D1 through D7 and separate curves for MOS-accessed ReRAM main memory, cross-point ReRAM main memory, and PCM main memory.]

Figure 8.11: Pareto-optimal curves: energy and performance trade-off of the memory hierarchy. Main memory dynamic power is included for a fair comparison.

Focusing on the design space exploration of ReRAM-based memory hierarchies, Figure 8.11 shows the Pareto-optimal curves of the power-performance trade-off. The x-axis is the total power consumption of the processor chip, and the y-axis is the IPC performance. Figure 8.11 shows that a large amount of power can be saved at the cost of only a small performance degradation. For instance, design option D4 (using SRAM L1 and L2 caches but a ReRAM L3 cache) reaches 0.600 IPC while consuming 12.58W of total power. Compared to design option D7 (using SRAM for all of the L1, L2, and L3 caches), which reaches 0.661 IPC but consumes 23.90W of power, the power reduction is 47% while the performance degradation is only 9%. This design option also meets the constraint set by the 10^10 write endurance requirement discussed in Chapter 8.2. If ReRAM write endurance is further improved, more aggressive options (e.g., deploying L2 ReRAM caches) can further reduce the power consumption.
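The 47% and 9% figures quoted above follow directly from the numbers in Tables 8.3 and 8.4, as the short check below shows.

```python
# D4 versus D7 from Tables 8.3 and 8.4 (power in watts, performance in IPC).
power_d4, ipc_d4 = 12.58, 0.600
power_d7, ipc_d7 = 23.90, 0.661

power_reduction = (power_d7 - power_d4) / power_d7   # about 0.47
perf_degradation = (ipc_d7 - ipc_d4) / ipc_d7        # about 0.09
print(f"{power_reduction:.0%} power reduction, {perf_degradation:.0%} IPC degradation")
```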

[Figure 8.12 plot: IPC (y-axis) versus total processor power and main memory dynamic power in watts (x-axis), with curves for SRAM used in L1, L2, and L3 caches; SRAM only used in L1 and L2 caches; SRAM only used in the L1 cache; and no SRAM (ReRAM only).]

Figure 8.12: Pareto-optimal curves (cross-point ReRAM as main memory): energy and performance trade-off under different constraints on SRAM deployment.

For example, design option D3 (using ReRAM L2 and L3 caches) reaches 0.503 IPC while consuming only 11.44W of total power. To demonstrate how different cache hierarchy designs have been explored, we list the design parameters of seven design options (D1 to D7) in Table 8.3 and Table 8.4. We find that the Pareto-optimal curves are composed of several segments, such as D1-to-D2 and D4-to-D6. The switch from one segment to another comes from SRAM deployment at certain cache levels. In general, significant IPC improvements are achieved by adding more SRAM resources, and greater reductions in power consumption come from replacing SRAM with ReRAM. To show the effect of SRAM deployment, we plot three other Pareto-optimal curves in Figure 8.12. This figure shows that a pure-ReRAM cache hierarchy is on the global Pareto front but achieves less than 0.45 IPC, and that segment has a large slope. Thus, it suggests that SRAM should still be deployed at least in the L1 caches. This conclusion is consistent with our earlier estimation that the current ReRAM technology only allows the deployment of L3 ReRAM caches in terms of write endurance.

Table 8.3: On-die cache hierarchy design parameters of 7 design options

                              D1        D2        D3        D4
    L1 capacity               32KB      8KB       8KB       8KB
    L1 associativity          8         8         4         4
    L1 memory type^4          X-ReRAM   SRAM      SRAM      SRAM
    L1 optimized for          L         RP        RL        RL
    L1 sensing scheme         EX        IN        IN        IN
    L2 capacity               N/A       128KB     64KB      64KB
    L2 associativity          N/A       16        8         8
    L2 memory type            N/A       M-ReRAM   M-ReRAM   SRAM
    L2 optimized for          N/A       A         RL        RP
    L2 sensing scheme         N/A       EX        IN        IN
    L3 capacity               N/A       N/A       4MB       4MB
    L3 associativity          N/A       N/A       16        16
    L3 memory type            N/A       N/A       M-ReRAM   M-ReRAM
    L3 optimized for          N/A       N/A       L         WP
    L3 sensing scheme         N/A       N/A       EX        IN
    IPC                       0.123     0.449     0.503     0.600
    Power consumption (W)     10.19     10.90     11.44     12.58
    Silicon area (mm2)        22.83     23.06     24.00     24.90
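For later reference (e.g., in the optimization sketches of Chapter 8.6.2), a design option such as D4 can be written out as plain data. The field names below are our own; the values follow Table 8.3 with the abbreviations expanded per footnote 4.

```python
# Design option D4 from Table 8.3 as a plain dictionary
# (field names are illustrative; values follow the table).
D4 = {
    "L1": {"capacity": "8KB",  "assoc": 4,  "type": "SRAM",
           "optimized_for": "read latency", "sensing": "internal"},
    "L2": {"capacity": "64KB", "assoc": 8,  "type": "SRAM",
           "optimized_for": "read EDP",     "sensing": "internal"},
    "L3": {"capacity": "4MB",  "assoc": 16, "type": "MOS-accessed ReRAM",
           "optimized_for": "write EDP",    "sensing": "internal"},
}
```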

L2 ReRAM cache deployment can achieve a considerable power reduction by sacrificing only a small amount of performance. However, the feasibility of L2 ReRAM caches also depends on improvements in the write endurance of future ReRAM technologies. Another benefit obtained from using ReRAM caches is silicon area reduction. Figure 8.13 shows the Pareto-optimal curves of the cache area and performance trade-off, which have shapes similar to the ones in the power-performance trade-off. In Figure 8.13, the minimum chip area corresponds to the case in which there is no on-chip cache. The processor core area (including the memory controller and crossbar) is 22.8mm2 as estimated by McPAT [112]. Achieving the highest performance using pure-SRAM caches costs at least another 8mm2 of silicon area, while replacing the SRAM L3 cache with ReRAM can save more than 3mm2 of chip area by degrading performance from an IPC of 0.66 to 0.60. We show the feasible region of designs with less than 3mm2 of total cache area in Figure 8.14.

^4 Memory type abbreviations: M-ReRAM = MOS-accessed ReRAM, X-ReRAM = Cross-point ReRAM. Optimization abbreviations: RL = Read Latency, WL = Write Latency, RE = Read Energy, WE = Write Energy, RP = Read EDP, WP = Write EDP, L = Leakage, A = Area. Sensing scheme abbreviations: IN = Internal, EX = External.

Table 8.4: On-die cache hierarchy design parameters of 7 design options (Continued)

                              D5        D6        D7
    L1 capacity               32KB      8KB       8KB
    L1 associativity          8         8         4
    L1 memory type            SRAM      SRAM      SRAM
    L1 optimized for          RL        RP        RL
    L1 sensing scheme          IN        IN        IN
    L2 capacity               64KB      64KB      64KB
    L2 associativity          8         8         16
    L2 memory type            SRAM      SRAM      SRAM
    L2 optimized for          RP        RP        WP
    L2 sensing scheme         IN        IN        IN
    L3 capacity               32MB      4MB       4MB
    L3 associativity          16        16        16
    L3 memory type            M-ReRAM   SRAM      SRAM
    L3 optimized for          RP        RE        RL
    L3 sensing scheme         IN        IN        IN
    IPC                       0.607     0.609     0.661
    Power consumption (W)     16.91     20.93     23.90
    Silicon area (mm2)        31.51     30.84     32.55

This result is extremely useful in the low-cost computing segment, where the performance only needs to be good enough and the chip cost is the first priority. Figure 8.14 also indicates that using ReRAM caches can reduce power consumption and silicon area at the same time, which further improves the EDAP metric.

8.6.2 Design Optimization

Running a full design space exploration using an exhaustive search is time-consuming and may not be necessary in most cases. Therefore, to use this joint circuit-architecture model as a practical memory hierarchy design assistant, an efficient optimization method is required. In this work, we use a simplified simulated annealing [110] algorithm to find a near globally optimal solution. The simulated annealing heuristic is described in Algorithm 2. In this optimization methodology, we first randomly choose an initial design option, s_0, and calculate its annealing energy function from the joint circuit-architecture model. The annealing energy function can be EDP, EDAP, or any other energy-performance-area combination.

[Figure 8.13 plot: IPC (y-axis) versus total silicon area of the processor chip in mm2 (x-axis), with curves for SRAM used in L1, L2, and L3 caches; SRAM only used in L1 and L2 caches; SRAM only used in the L1 cache; and no SRAM (ReRAM only).]

Figure 8.13: Pareto-optimal curves (cross-point ReRAM as main memory): cache area and performance trade-off under different constraints on SRAM deployment.

Algorithm 2 Design space optimization algorithm
  state = s_0, energy = E(state)
  repeat
    new_state = neighbour(state), new_energy = E(new_state)
    if new_energy < energy then
      state = new_state, energy = new_energy   {Accept unconditionally}
    else if T(energy, new_energy) > random() then
      state = new_state, energy = new_energy   {Accept with probability}
    end if
  until energy stops improving in the last K rounds
  return state

The optimization loop continuously tries neighboring options^5 of the current one. If the new design option is better than the previous one, it is adopted unconditionally; if not, it is adopted with a probability determined by an acceptance function.
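A possible realization of the neighbour() step, following footnote 5 (two parameters from the L1/L2/L3 capacity, associativity, and memory-type set are changed at a time), is sketched below. The candidate value lists follow the input design space of Table 8.1, and the design-option representation matches the dictionary shown after Table 8.3; both are our own illustrative choices.

```python
import random

# Sketch of neighbour() from Algorithm 2: mutate two of the nine tunable
# parameters (L1/L2/L3 capacity, associativity, memory type).
CHOICES = {
    ("L1", "capacity"): ["8KB", "16KB", "32KB", "64KB"],
    ("L1", "assoc"):    [4, 8],
    ("L1", "type"):     ["SRAM", "MOS-accessed ReRAM", "Cross-point ReRAM"],
    ("L2", "capacity"): [None, "128KB", "256KB", "512KB"],
    ("L2", "assoc"):    [8, 16],
    ("L2", "type"):     ["SRAM", "MOS-accessed ReRAM", "Cross-point ReRAM"],
    ("L3", "capacity"): [None, "512KB", "1MB", "2MB", "4MB", "8MB", "16MB"],
    ("L3", "assoc"):    [8, 16, 32],
    ("L3", "type"):     ["SRAM", "MOS-accessed ReRAM", "Cross-point ReRAM"],
}

def neighbour(state):
    """Return a copy of 'state' with two randomly chosen parameters changed."""
    new_state = {level: dict(params) for level, params in state.items()}
    for level, param in random.sample(list(CHOICES), 2):
        new_state[level][param] = random.choice(CHOICES[(level, param)])
    return new_state
```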

^5 A neighboring option is generated by changing two parameters from the parameter set of L1/L2/L3 capacity, associativity, and memory type.

Figure 8.14: The global Pareto-optimal curve (cross-point ReRAM as main memory) and feasible design options with total cache area less than 3mm2.

The acceptance probability, P_{accept}, is defined by Equation 8.7 as follows,

P_{accept}(energy, new\_energy) =
    \begin{cases}
      1                     & \text{if } new\_energy < energy \\
      energy / new\_energy  & \text{if } energy \le new\_energy < 1.3 \, energy \\
      0                     & \text{otherwise}
    \end{cases}                                                                  (8.7)

Figure 8.15 shows how the simulated annealing algorithm evolves the memory hierarchy design from an initial random option to a near globally optimal one in terms of EDP. In addition, Figure 8.16 shows the EDAP optimization path. Compared to an exhaustive search of the same design space, which takes more than 8 hours on an 8-core Xeon X5570 microprocessor, the proposed optimization methodology usually finds near-optimal values in less than 30 seconds. This optimization scheme provides an almost instant design decision given any specified performance, energy, or area requirements. Furthermore, it becomes feasible to integrate this model into higher-level tools that consider not only the memory system design trade-offs but also design trade-offs within microprocessor cores [104].
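Putting Algorithm 2 and Equation 8.7 together, a compact rendering of the annealing loop might look like the sketch below. Here evaluate() stands in for the joint circuit-architecture model (returning EDP, EDAP, or another combined metric), neighbour() is the sketch given after Algorithm 2, and edp_of_design is a hypothetical evaluation function.

```python
import random

def p_accept(energy, new_energy):
    """Acceptance probability of Equation 8.7."""
    if new_energy < energy:
        return 1.0
    if new_energy < 1.3 * energy:
        return energy / new_energy
    return 0.0

def anneal(s0, evaluate, k=50):
    """Algorithm 2: stop once the energy has not improved for k consecutive rounds."""
    state, energy = s0, evaluate(s0)
    stall = 0
    while stall < k:
        candidate = neighbour(state)          # defined in the earlier sketch
        new_energy = evaluate(candidate)
        old_energy = energy
        if random.random() < p_accept(energy, new_energy):
            state, energy = candidate, new_energy
        stall = 0 if energy < old_energy else stall + 1
    return state, energy

# Usage with a hypothetical evaluate() built on the ANN and circuit models:
# best_state, best_energy = anneal(D4, evaluate=edp_of_design)
```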

[Figure 8.15 plot: IPC versus total processor power and main memory dynamic power (W), showing the global Pareto-optimal curve, an iso-EDP curve, and the optimization path from a random initialization point to the optimized solution.]

Figure 8.15: The path of EDP optimization.

8.6.3 Discussion

Power consumption has been a primary design concern for many years. Our exploration and optimization results demonstrate that both the EDP and EDAP optimal points are close to the y-axis on the IPC-versus-power plot. In this design space region, ReRAM resources are heavily adopted in the memory hierarchy design (using L2 and L3 ReRAM caches). Even when performance constraints are applied, using ReRAM starting from the L3 cache always brings energy efficiency. Moreover, locating these energy-optimal points on the IPC-versus-area plot also shows a large silicon area saving achieved from ReRAM without incurring much performance degradation. Compared to the best values of pure-SRAM designs, the introduction of ReRAM in the L2 and L3 caches improves EDP and EDAP by 36% and 51%, respectively, on a scaled 32nm 4-core SPARC-V9-like processor chip. The memory technology shift from SRAM to ReRAM achieves these improvements for the following reasons:

[Figure 8.16 plot: IPC versus the power and area product (W*mm2), showing the global Pareto-optimal curve, an iso-EDAP curve, and the optimization path from a random initialization point to the optimized solution.]

Figure 8.16: The path of EDAP optimization.

• The compact ReRAM module size greatly reduces the silicon area used for on-chip memories (EDAP improvement), or allows more on-chip memory resource deployment to improve the performance (EDP and EDAP improvement);

• The relatively smaller ReRAM size implies shorter wordlines and bitlines in the ReRAM cell array, and thus reduces the dynamic energy consumption per memory access (EDP and EDAP improvement);

• The non-volatility property of ReRAM eliminates the leakage energy consumption of memory cells (EDP and EDAP improvement).

Therefore, we envision a heterogeneous memory hierarchy as shown in Figure 8.17 and summarized in Table 8.5. In such a hierarchy, SRAM is used in the L1 and possibly L2 caches, MOS-accessed ReRAM may be used in the L2 and L3 caches and in main memory, and low-cost cross-point ReRAM may be used in the L3 cache and in main memory. Eventually, we predict that 3D cross-point ReRAM will be well suited for secondary storage devices.

[Figure 8.17 diagram: a chip package containing the microprocessor die (four cores, each with its own I-L1, D-L1, and L2 caches, sharing an on-die L3 cache), ReRAM main memory, and ReRAM secondary storage, labeled as levels 1 through 5 of the memory hierarchy.]

Figure 8.17: A proposed universal memory hierarchy using ReRAM.

Table 8.5: Overview of the proposed universal memory hierarchy

    Level          Memory type                          Write endurance requirement
    L1 cache       SRAM                                 10^13 [Chapter 8.2]
    L2 cache       SRAM or MOS-accessed ReRAM           10^11 [Chapter 8.2]
    L3 cache       MOS-accessed or cross-point ReRAM    10^10 [Chapter 8.2]
    Main memory    MOS-accessed or cross-point ReRAM    10^8 [98]

8.7 Summary

ReRAM is a promising memory technology that fits into many levels of the memory hierarchy. In this work, we first examine the write endurance requirements for deploying ReRAM in caches. Together with state-of-the-art wear leveling that alleviates inter-set write variation, we enhance the LRU cache replacement policy with proactive invalidation to reduce intra-set variation. On average, our analysis shows that the write endurance requirement for ReRAM L3 caches is 10^10, and that write endurance higher than 10^11 or 10^13 is needed for using ReRAM in L2 or L1 caches, respectively. Then, we build a circuit-level timing, power, and area estimation model for ReRAM. We use this model to explore a wide range of ReRAM circuitry implementations and generate architecturally relevant parameters. After that, we integrate this circuit-level model into an ANN-based architecture-level model and create a general energy-performance-area optimization framework for ReRAM-based memory hierarchy design in a joint circuit-architecture design space. Our validation results show that the proposed framework is sufficiently accurate for the purpose of design space exploration, and by using this framework we are able to rapidly explore a very large space of memory hierarchy designs and find good solutions in terms of energy-performance-area trade-offs. Our experimental results reveal the memory design preference for ReRAM in a 4-core CMP setting when the design targets EDP or EDAP goals. Our results show that using ReRAM starting from the L2 caches can achieve a 36% EDP improvement and a 51% EDAP improvement, and this meets current ReRAM write endurance constraints when coupled with proper wear leveling.

In general, this work is an initial effort to study the feasibility of building an energy-efficient ReRAM-based memory hierarchy. The fast read access, high density, good scalability, non-volatility, and the wide range of memory array designs make ReRAM a promising candidate for such an energy-efficient memory hierarchy, and the write endurance limit of ReRAM does not necessarily hold ReRAM back from cache applications. Our joint circuit-architecture analysis verifies the feasibility of such a ReRAM-based hierarchy and also shows its benefits for area and power reduction. We hope this work can be a step towards a new generation of energy-efficient heterogeneous hierarchies.

Chapter 9

Conclusion

Multiple non-volatile memory technologies, including STTRAM, PCRAM, and ReRAM, are emerging. These emerging non-volatile memory technologies have the attractive properties of high density, fast access, good scalability, and non-volatility. Therefore, they have drawn the attention of the computer architecture community and challenged the role of SRAM and DRAM in the mainstream memory hierarchy for the first time in more than 30 years. Given the desirable properties of these emerging memory technologies, much innovative research has been focusing on designing the next generation of high-performance and low-power computing systems using these memory technologies. This dissertation tackles this topic from three different but highly related perspectives.

First, circuit-level performance, energy, and area models for various non-volatile memories are built and described in the first part of this dissertation. The motivation for building these models comes from the current situation in which all of these emerging non-volatile memory technologies are still in their prototyping stage. Although various device-level research efforts have successfully demonstrated working non-volatile memory cells with promising properties optimized for different targets, such a large variation in the properties of memory cells leaves the circuit-level parameters, such as memory access latency, memory access energy, and memory module area, as open questions. This brings uncertainty and difficulty for system-level researchers trying to evaluate the benefit of these new technologies. Therefore, a circuit-level model that connects the device-level and the architecture-level research is required.

Second, since non-volatile memory technologies generally have drawbacks in write speed, write energy consumption, and write endurance, it is necessary to use architecture-level design enhancements to alleviate these drawbacks. Thus, after building a circuit-level model for non-volatile memories, the second part of this dissertation focuses on using architecture-level techniques to mitigate the aforementioned shortcomings of non-volatile memory write operations. In addition, architecture-level evaluation is performed to show that the proposed techniques are all effective in reaching their goals.

Last but not least, application-level case studies of adopting emerging technologies are conducted in this dissertation. These application-level case studies cover secondary storage, the memory hierarchy, and checkpointing, and demonstrate how non-volatile memory technologies can be used in these applications to greatly improve either the performance or the power efficiency. These case studies are also important to the current emerging non-volatile memory revolution because they validate the idea of replacing traditional memory/disk technologies (i.e., SRAM, DRAM, NAND flash, and HDD) with emerging non-volatile memory technologies, and this can be one of the drivers pushing these technologies toward maturity.

We hope the work of this dissertation will be useful and have an impact on eNVM-related research on future high-performance and low-power computing systems.

Bibliography

[1] K.-J. Lee, B.-H. Cho, W.-Y. Cho, S.-B. Kang et al., “A 90nm 1.8V 512Mb diode- switch PRAM with 266MB/s read throughput,” IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 150–162, 2008.

[2] M. H. Kryder and C. S. Kim, “After hard drives - what comes next?” IEEE Trans- actions on Magnetics, vol. 45, no. 10, pp. 3406–3413, 2009.

[3] L. M. Grupp, A. M. Caulfield, J. Coburn, S. Swanson et al., “Characterizing flash memory: Anomalies, observations, and applications,” in Proceedings of the Interna- tional Symposium on Microarchitecture, 2009, pp. 24–33.

[4] K. Tsuchida, T. Inaba, K. Fujita, Y. Ueda et al., “A 64Mb MRAM with clamped- reference and adequate-reference schemes,” in Proceedings of the International Solid- State Circuits Conference, 2010, pp. 268–269.

[5] S. J. Ahn, Y. J. Song, C. W. Jeong, J. M. Shin et al., “Highly manufacturable high density phase change memory of 64Mb and beyond,” in Proceedings of the Interna- tional Electron Devices Meeting, 2004, pp. 907–910.

[6] S.-S. Sheu, M.-F. Chang, K.-F. Lin, C.-W. Wu et al., “A 4Mb embedded SLC resistive- RAM macro with 7.2ns read-write random-access time and 160ns MLC-access capabil- ity,” in Proceedings of the IEEE International Solid-State Circuits Conference, 2011, pp. 200–201.

[7] D. Roberts, T. Kgil, and T. Mudge, “Using non-volatile memory to save energy in servers,” in Proceedings of the Design, Automation & Test in Europe, 2009, pp. 743– 748.

[8] K. Lim, P. Ranganathan, J. Chang, C. Patel et al., “Understanding and designing new server architectures for emerging warehouse-computing environments,” in Proceedings of the International Symposium on Computer Architecture, 2008, pp. 315–326.

[9] K. Lim, J. Chang, T. Mudge, P. Ranganathan et al., “Disaggregated memory for ex- pansion and sharing in blade servers,” in Proceedings of the International Symposium on Computer Architecture, 2009, pp. 267–278.


[10] M. Motoyoshi, T. Yamamura, W. Ohtsuka, M. Shouji et al., “A study for 0.18µm high-density MRAM,” in Proceedings of the Symposium on VLSI Technology, 2004, pp. 22–23.

[11] M. Hosomi, H. Yamagishi, T. Yamamoto, K. Bessho et al., “A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM,” in Proceedings of the International Electron Devices Meeting, 2005, pp. 459–462.

[12] H. Tanizaki, T. Tsuji, J. Otani, Y. Yamaguchi et al., “A high-density and high-speed 1T-4MTJ MRAM with voltage offset self-reference sensing scheme,” in Proceedings of the Asian Solid-State Circuits Conference, 2006, pp. 303–306.

[13] T. Kawahara, R. Takemura, K. Miura, J. Hayakawa et al., “2Mb spin-transfer torque RAM (SPRAM) with bit-by-bit bidirectional current write and parallelizing-direction current read,” in Proceedings of the International Solid-State Circuits Conference, 2007, pp. 480–617.

[14] F. Pellizzer, A. Pirovano, F. Ottogalli, M. Magistretti et al., “Novel µTrench phase-change memory cell for embedded and stand-alone non-volatile memory applications,” in Proceedings of the International Symposium on VLSI Technology, 2004, pp. 18–19.

[15] N. Matsuzaki, K. Kurotsuchi, Y. Matsui, O. Tonomura et al., “Oxygen-doped GeSbTe phase-change memory cells featuring 1.5V/100µA standard 0.13µm CMOS operations,” in Proceedings of the IEEE International Electron Devices Meeting, 2005, pp. 738–741.

[16] J. H. Oh, J. H. Park, Y. S. Lim, H. S. Lim et al., “Full integration of highly manufacturable 512Mb PRAM based on 90nm technology,” in Proceedings of the International Electron Devices Meeting, 2006, pp. 49–52.

[17] H.-R. Oh, B.-H. Cho, W.-Y. Cho, S. Kang et al., “Enhanced write performance of a 64-Mb phase-change random access memory,” IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp. 122–126, 2006.

[18] T. Nirschl, J. B. Phipp, T. D. Happ, G. W. Burr et al., “Write strategies for 2 and 4-bit multi-level phase-change memory,” in Proceedings of the IEEE International Electron Devices Meeting, 2007, pp. 461–464.

[19] S. Hanzawa, N. Kitai, K. Osada, A. Kotabe et al., “A 512kB embedded phase change memory with 416kB/s write throughput at 100µA cell write current,” in Proceedings of the International Solid-State Circuits Conference, 2007, pp. 474–616.

[20] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner et al., “Phase-change random access memory: A scalable technology,” IBM Journal of Research and Development, vol. 52, no. 4/5, 2008.

[21] D.-H. Kang, J.-H. Lee, J. Kong, D. Ha et al., “Two-bit cell operation in diode-switch phase change memory cells with 90nm technology,” in Proceedings of the Symposium on VLSI Technology, 2008, pp. 98–99.

[22] D. Kau, S. Tang, I. V. Karpov, R. Dodge et al., “A stackable cross point phase change memory,” in Proceedings of the IEEE International Electron Devices Meeting, 2009, pp. 27.1.1–27.1.4.

[23] S. Yoshitaka, K. Masaharu, M. Takahiro, K. Kenzo et al., “Cross-point phase change memory with 4F2 cell size driven by low-contact-resistivity poly-Si diode,” in Pro- ceedings of the Symposium on VLSI Technology, 2009, pp. 24–25.

[24] F. Bedeschi, R. Fackenthal, C. Resta, E. M. Donze et al., “A bipolar-selected phase change memory featuring multi-level cell storage,” IEEE Journal of Solid-State Cir- cuits, vol. 44, no. 1, pp. 217–227, 2009.

[25] J. J. Yang, M. D. Pickett, X. Li, D. A. A. Ohlberg et al., “Memristive switching mechanism for metal/oxide/metal nanodevices,” Nature Nanotechnology, vol. 3, no. 7, pp. 429–433, 2008.

[26] Y.-C. Chen, C.-F. Chen, C.-T. Chen, J.-Y. Yu et al., “An access-transistor-free (0T/1R) non-volatile resistance random access memory (RRAM) using a novel thresh- old switching, self-rectifying chalcogenide device,” in Proceedings of the International Electron Devices Meeting, 2003, pp. 750–753.

[27] L. Chen, Y. Xu, Q.-Q. Sun, H. Liu et al., “Highly uniform bipolar resistive switch- ing with Al2O3 buffer layer in robust NbAlO-based RRAM,” IEEE Electron Device Letters, vol. 31, no. 4, pp. 356–358, 2010.

[28] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh et al., “Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism,” in Proceedings of the International Electron Devices Meeting, 2008, pp. 293–296.

[29] Y. S. Chen, H. Y. Lee, P. S. Chen, P. Y. Gu et al., “Highly scalable hafnium oxide memory with improvements of resistive distribution and read disturb immunity,” in Proceedings of the International Electron Devices Meeting, 2009, pp. 105–108.

[30] K.-H. Kim, S. H. Jo, S. Gaba, and W. Lu, “Nanoscale resistive memory with intrinsic diode characteristics and long endurance,” Applied Physics Letters, vol. 96, no. 5, pp. 053 106.1–053 106.3, 2010.

[31] W. S. Lin, F. T. Chen, C. H. L. Chen, and M.-J. Tsai, “Evidence and solution of over-RESET problem for HfOx based resistive memory with sub-ns switching speed and high endurance,” in Proceedings of the International Electron Devices Meeting, 2010, pp. 19.7.1–19.7.4.

[32] W. C. Chien, Y. C. Chen, K. P. Chang, E. K. Lai et al., “Multi-level operation of fully CMOS compatible WOx resistive random access memory (RRAM),” in Proceedings of the International Memory Workshop, 2009, pp. 228–229.

[33] S. H. Jo, K.-H. Kim, and W. Lu, “High-density crossbar arrays based on a Si memristive system,” Nano Letters, vol. 9, pp. 870–874, 2009.

[34] M.-J. Lee, Y. Park, B.-S. Kang, S.-E. Ahn et al., “2-stack 1D-1R cross-point structure with oxide diodes as switch elements for high density resistance RAM applications,” in Proceedings of the IEEE International Electron Devices Meeting, 2007, pp. 771–774.

[35] Y.-B. Kim, S. Lee, D. Lee, C. Lee et al., “Bi-layered RRAM with unlimited endurance and extremely uniform switching,” in Proceedings of the Symposium on VLSI Technology, 2011, pp. 52–53.

[36] J. Condit, E. B. Nightingale, C. Frost, E. Ipek et al., “Better I/O through byte- addressable, persistent memory,” in Proceedings of the Symposium on Operating Sys- tems Principles, 2009, pp. 133–146.

[37] A. M. Caulfield, A. De, J. Coburn, T. I. Mollov et al., “Moneta: A high-performance storage array architecture for next-generation, non-volatile memories,” in Proceedings of the International Symposium on Microarchitecture, 2010, pp. 385–395.

[38] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, “Energy reduction for STT-RAM using early write termination,” in Proceedings of the International Conference on Computer- Aided Design, 2009, pp. 264–268.

[39] M. K. Qureshi, M. M. Franceschini, and L. A. Lastras, “Improving read performance of phase change memories via write cancellation and write pausing,” in Proceedings of the International Symposium on High Performance Computer Architecture, 2010, pp. 1–11.

[40] B.-D. Yang, J.-E. Lee, J.-S. Kim, J. Cho et al., “A low power phase-change ran- dom access memory using a data-comparison write scheme,” in Proceedings of the International Symposium on Circuits and Systems, 2007, pp. 3014–3017.

[41] H. Chung, B.-H. Jeong, B.-J. Min, Y. Choi et al., “A 58nm 1.8V 1Gb PRAM with 6.4MB/s program BW,” in Proceedings of the International Solid-State Circuits Conference, 2011, pp. 500–502.

[42] International Technology Roadmap for Semiconductors, “Process Integration, De- vices, and Structures 2010 Update,” http://www.itrs.net/.

[43] S. Schechter, G. H. Loh, K. Straus, and D. Burger, “Use ECP, not ECC, for hard failures in resistive memories,” in Proceedings of the International Symposium on Computer Architecture, 2010, pp. 141–152.

[44] E. Ipek, J. Condit, E. B. Nightingale, D. Burger, and T. Moscibroda, “Dynamically replicated memory: Building reliable systems from nanoscale resistive memories,” in Proceedings of the Architectural Support for Programming Languages and Operating Systems, 2010, pp. 3–14.

[45] N. H. Seong, D. H. Woo, V. Srinivasan, J. A. Rivers, and H.-H. S. Lee, “SAFER: Stuck-at-fault error recovery for memories,” in Proceedings of the International Symposium on Microarchitecture, 2010, pp. 115–124.

[46] M. K. Qureshi, J. P. Karidis, M. M. Franceschini, V. Srinivasan et al., “Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling,” in Proceedings of the International Symposium on Microarchitecture, 2009, pp. 14–23.

[47] N. H. Seong, D. H. Woo, and H.-H. S. Lee, “Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically random- ized address mapping,” in Proceedings of the International Symposium on Computer Architecture, 2010, pp. 383–394.

[48] D. H. Yoon, N. Muralimanohar, J. Chang, P. Ranganathan et al., “FREE-p: Protecting non-volatile memory against both hard and soft errors,” in Proceedings of the International Symposium on High Performance Computer Architecture, 2011, pp. 466–477.

[49] S. J. E. Wilton and N. P. Jouppi, “CACTI: An enhanced cache access and cycle time model,” IEEE Journal of Solid-State Circuits, vol. 31, pp. 677–688, 1996.

[50] S. Thoziyoor, N. Muralimanohar, J.-H. Ahn, and N. P. Jouppi, “CACTI 5.1 technical report,” HP Labs, Tech. Rep. HPL-2008-20, 2008.

[51] R. J. Evans and P. D. Franzon, “Energy consumption modeling and optimization for SRAM’s,” IEEE Journal of Solid-State Circuits, vol. 30, no. 5, pp. 571–579, 1995.

[52] M. Mamidipaka and N. Dutt, “eCACTI: An enhanced power estimation model for on-chip caches,” Center for Embedded Computer Systems, Tech. Rep. TR04-28, 2004.

[53] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “Architecting efficient interconnects for large caches with CACTI 6.0,” IEEE Micro, vol. 28, no. 1, pp. 69–79, 2008.

[54] X. Dong, X. Wu, G. Sun, Y. Xie et al., “Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement,” in Proceedings of the Design Automation Conference, 2008, pp. 554–559.

[55] P. Mangalagiri, K. Sarpatwari, A. Yanamandra, V. Narayanan et al., “A low-power phase change memory based hybrid cache architecture,” in Proceedings of the Great Lakes Symposium on VLSI, 2008, pp. 395–398.

[56] X. Dong, N. P. Jouppi, and Y. Xie, “PCRAMsim: System-level performance, energy, and area modeling for phase-change RAM,” in Proceedings of the International Conference on Computer-Aided Design, 2009, pp. 269–275.

[57] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, “Design implications of memristor-based RRAM cross-point structures,” in Proceedings of the Design, Automation & Test in Europe, 2011, pp. 1–6.

[58] V. Mohan, S. Gurumurthi, and M. R. Stan, “FlashPower: A detailed power model for NAND flash memory,” in Proceedings of Design, Automation & Test in Europe, 2010, pp. 502–507.

[59] X. Wu, J. Li, L. Zhang, E. Speight et al., “Hybrid cache architecture with disparate memory technologies,” in Proceedings of the International Symposium on Computer Architecture, 2009, pp. 34–45.

[60] A. M. Caulfield, J. Coburn, T. I. Mollov, A. De et al., “Understanding the impact of emerging non-volatile memories on high-performance, IO-intensive computing,” in Proceedings of the International Conference for High Performance Computing, Net- working, Storage and Analysis, 2010, pp. 1–11.

[61] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta, and S. Swanson, “Onyx: A prototype phase-change memory storage array,” in Proceedings of the USENIX Conference on Hot Topics in Storage and File Systems, 2011, pp. 1–5.

[62] T. Kgil, D. Roberts, and T. Mudge, “Improving NAND flash based disk caches,” in Proceedings of the International Symposium on Computer Architecture, 2008, pp. 327–338.

[63] S. Lee, K. Ha, K. Zhang, and J. Kim, “FlexFS: A flexible flash file system for MLC NAND flash memory,” in USENIX Annual Technical Conference, 2009, pp. 1–14.

[64] M. K. Qureshi, M. M. Franceschini, L. A. Lastras, and J. P. Karidis, “Morphable memory system: A robust architecture for exploiting multi-level phase change memo- ries,” in Proceedings of the International Symposium on Computer Architecture, 2010, pp. 153–162.

[65] J. Wang, Y. Liu, H. Yang, and H. Wang, “A compare-and-write ferroelectric non- volatile flip-flop for energy-harvesting applications,” in Proceedings of the Interna- tional Conference on Green Circuits and Systems, 2010, pp. 646–650.

[66] H. Volos, A. J. Tack, and M. M. Swift, “Mnemosyne: lightweight persistent memory,” in Proceedings of the International Conference on Architectural Support for Program- ming Languages and Operating Systems, 2011, pp. 91–104.

[67] M. Rasquinha, D. Choudhary, S. Chatterjee, S. Mukhopadhyay, and S. Yalamanchili, “An energy efficient cache design using spin torque transfer (STT) RAM,” in Pro- ceedings of the International Symposium on Low power Electronics and Design, 2010, pp. 389–394.

[68] H. Sun, C. Liu, N. Zheng, T. Min, and T. Zhang, “Design techniques to improve the device write margin for MRAM-based cache memory,” in Proceedings of the Great Lakes Symposium on VLSI, 2011, pp. 97–102.

[69] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan, “Relaxing non-volatility for fast and energy-efficient STT-RAM caches,” in Proceedings of the International Symposium on High Performance Computer Architecture, 2011, pp. 50–61.

[70] S. Thoziyoor, J.-H. Ahn, M. Monchiero, J. B. Brockman, and N. P. Jouppi, “A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies,” in Proceedings of the International Symposium on Computer Architecture, 2008, pp. 51–62.

[71] International Technology Roadmap for Semiconductors, “The Model for Assessment of CMOS Technologies And Roadmaps (MASTAR),” http://www.itrs.net/models.html.

[72] A. N. Udipi, N. Muralimanohar, N. Chatterjee, R. Balasubramonian et al., “Rethinking DRAM design and organization for energy-constrained multi-cores,” in Proceedings of the International Symposium on Computer Architecture, 2010, pp. 175–186.

[73] S. Kang, W. Y. Cho, B.-H. Cho, K.-J. Lee et al., “A 0.1µm 1.8V 256Mb phase-change random access memory (PRAM) with 66MHz synchronous burst-read operation,” IEEE Journal of Solid-State Circuits, vol. 42, no. 1, pp. 210–218, 2007.

[74] F. Fishburn, B. Busch, J. Dale, D. Hwang et al., “A 78nm 6F2 DRAM technology for multigigabit densities,” in Proceedings of the Symposium on VLSI Technology, 2004, pp. 28–29.

[75] Y. Zhang, S.-B. Kim, J. P. McVittie, H. Jagannathan et al., “An integrated phase change memory cell with Ge nanowire diode for cross-point memory,” in Proceedings of the IEEE Symposium on VLSI Technology, 2007, pp. 98–99.

[76] I. E. Sutherland, R. F. Sproull, and D. F. Harris, Logical effort: Designing fast CMOS circuits. Morgan Kaufmann, 1999.

[77] M. A. Horowitz, “Timing models for MOS circuits,” Stanford University, Tech. Rep., 1983.

[78] E. Seevinck, P. J. van Beers, and H. Ontrop, “Current-mode techniques for high-speed VLSI circuits with application to current sense amplifier for CMOS SRAM’s,” IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 525–536, 1991.

[79] Y. Moon, Y.-H. Cho, H.-B. Lee, B.-H. Jeong et al., “1.2V 1.6Gb/s 56nm 6F2 4Gb DDR3 SDRAM with hybrid-I/O sense amplifier and segmented sub-array architecture,” in Proceedings of the International Solid-State Circuits Conference, 2009, pp. 128–129.

[80] G. W. Burr, M. J. Breitwisch, M. Franceschini, D. Garetto et al., “Phase change memory technology,” Journal of Vacuum Science & Technology B, vol. 28, no. 2, pp. 223–262, 2010.

[81] K. Ishida, T. Yasufuku, S. Miyamoto, H. Nakai et al., “A 1.8V 30nJ adaptive program-voltage (20V) generator for 3D-integrated NAND flash SSD,” in Proceedings of the International Solid-State Circuits Conference, 2009, pp. 238–239, 239a.

[82] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren et al., “Simics: A full system simulation platform,” Computer, vol. 35, no. 2, pp. 50–58, 2002.

[83] Standard Performance Evaluation Corporation, “SPEC OMP (OpenMP Benchmark Suite),” http://www.spec.org/omp/.

[84] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: char- acterization and architectural implications,” in Proceedings of the International Con- ference on Parallel architectures and Compilation Techniques, 2008, pp. 72–81.

[85] L. Goux, D. T. Castro, G. Hurkx, J. G. Lisoni et al., “Degradation of the reset switching during endurance testing of a phase-change line cell,” IEEE Transactions on Electron Devices, vol. 56, no. 2, pp. 354–358, 2009.

[86] K. Kim and S. J. Ahn, “Reliability investigations for manufacturable high density PRAM,” in Proceedings of the IEEE International Reliability Physics Symposium, 2005, pp. 157–162.

[87] Storage Performance Council, “SPC Specifications: I/O Trace Repository,” http://www.storageperformance.org/specs/#traces.

[88] R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam et al., “Modeling the impact of checkpoints on next-generation systems,” in Proceedings of the Conference on Mass Storage Systems and Technologies, 2007, pp. 30–46.

[89] G. Grider, J. Loncaric, and D. Limpart, “Roadrunner system management report,” Los Alamos National Laboratory, Tech. Rep. LA-UR-07-7405, 2007.

[90] S. E. Michalak, K. W. Harris, N. W. Hengartner, B. E. Takala, and S. A. Wender, “Predicting the number of fatal soft errors in Los Alamos National Laboratory’s ASCI Q supercomputer,” IEEE Transactions on Device and Materials Reliability, vol. 5, no. 3, pp. 329–335, 2005.

[91] Los Alamos National Laboratory, 2009, Reliability Data Sets, http://institutes.lanl.gov/data/fdata/.

[92] J. W. Young, “A first order approximation to the optimal checkpoint interval,” Com- munications of the ACM, vol. 17, pp. 530–531, 1974.

[93] J. T. Daly, “A higher order estimate of the optimum checkpoint interval for restart dumps,” Future Generation Computer Systems, vol. 22, no. 3, pp. 303–312, 2006.

[94] N. R. Adiga, G. Almasi, G. S. Almasi, Y. Aridor et al., “An overview of the Blue- Gene/L supercomputer,” in Proceedings of the Conference on High Performance Com- puting Networking, Storage and Analysis, 2002, pp. 60–71.

[95] D. Meisner, B. T. Gold, and T. F. Wenisch, “PowerNap: Eliminating server idle power,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2009, pp. 205–216.

[96] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase change memory as a scalable DRAM alternative,” in Proceedings of the International Symposium on Computer Architecture, 2009, pp. 2–13.

[97] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, “A durable and energy efficient main memory using phase change memory technology,” in Proceedings of the International Symposium on Computer Architecture, 2009, pp. 14–23.

[98] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance main memory system using phase-change memory technology,” in Proceedings of the Inter- national Symposium on Computer Architecture, 2009, pp. 24–33.

[99] R. Waser and M. Aono, “Nanoionics-based resistive switching memories,” Nature Materials, vol. 6, no. 11, pp. 833–840, 2007.

[100] K.-C. Liu, W.-H. Tzeng, K.-M. Chang, Y.-C. Chan et al., “Transparent resistive random access memory (T-RRAM) based on Gd2O3 film and its resistive switching characteristics,” in Proceedings of the International Nanoelectronics Conference, 2010, pp. 898–899.

[101] K. Eshraghian, K.-R. Cho, O. Kavehei, S.-K. Kang et al., “Memristor MOS con- tent addressable memory (MCAM): Hybrid architecture for future high performance search engines,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. PP, no. 99, pp. 1–11, 2010.

[102] P. J. Joseph, K. Vaswani, and M. J. Thazhuthaveetil, “Construction and use of linear regression models for processor performance analysis,” in Proceedings of the Interna- tional Symposium on High-Performance Computer Architecture, 2006, pp. 99–108.

[103] B. C. Lee and D. M. Brooks, “Accurate and efficient regression modeling for microar- chitectural performance and power prediction,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Sys- tems, 2006, pp. 185–194.

[104] O. Azizi, A. Mahesri, B. C. Lee, S. J. Patel, and M. A. Horowitz, “Energy-performance tradeoffs in processor architecture and circuit design: A marginal cost analysis,” in Proceedings of the International Symposium on Computer Architecture, 2010, pp. 26– 36.

[105] E. Ipek, S. A. McKee, K. Singh, R. Caruana et al., “Efficient architectural design space exploration via predictive modeling,” ACM Transactions on Architecture and Code Optimization, vol. 4, no. 4, pp. 1:1–1:34, 2008.

[106] P. J. Joseph, K. Vaswani, and M. J. Thazhuthaveetil, “A predictive performance model for superscalar processors,” in Proceedings of the International Symposium on Microarchitecture, 2006, pp. 161–170.

[107] C. Dubach, T. Jones, and M. O’Boyle, “Microarchitectural design space exploration using an architecture-centric approach,” in Proceedings of the International Sympo- sium on Microarchitecture, 2007, pp. 262–271.

[108] W. S. Sarle, “Stopped training and other remedies for overfitting,” in Proceedings of the Symposium on the Interface of Computing Science and Statistics, 1995, pp. 55–69.

[109] D. W. Marquardt, “An algorithm for least-squares estimation of nonlinear parame- ters,” Journal of the Society for Industrial and Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.

[110] S. Kirkpatrick, J. C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated an- nealing,” Science Magazine, vol. 220, no. 4598, pp. 671–680, 1983.

[111] NASA Advanced Supercomputing (NAS) Division, “The NAS Parallel Benchmarks (NPB) 3.3,” http://www.nas.nasa.gov/Resources/Software/npb.html.

[112] S. Li, J.-H. Ahn, R. D. Strong, J. B. Brockman et al., “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in Proceedings of the International Symposium on Microarchitecture, 2009, pp. 469–480.

Vita

Xiangyu Dong

Xiangyu Dong was born in Shanghai, China. He received his B.E. degree in Electrical Engineering from Shanghai Jiao Tong University. He joined the Ph.D. program of the Department of Computer Science and Engineering at Pennsylvania State University in 2007. His research interests include non-volatile memory technology, computer architecture, and three-dimensional (3D) IC design. He has authored and co-authored nineteen conference papers, three journal papers, and three book chapters during his Ph.D. program. He is a student member of both IEEE and ACM. His work on modeling and leveraging emerging non-volatile memories was selected as one of the three graduate winners of the ACM Student Research Competition in 2011.