<<

The Pennsylvania State University

The Graduate School

ARCHITECTING BYTE-ADDRESSABLE NON-VOLATILE MEMORIES FOR

MAIN MEMORY

A Dissertation in Computer Science and Engineering by Matthew Poremba

2015 Matthew Poremba

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

May 2015

The dissertation of Matthew Poremba was reviewed and approved∗ by the following:

Yuan Xie Professor of Computer Science and Engineering Dissertation Co-Advisor, Co-Chair of Committee

John Sampson Assistant Professor of Computer Science and Engineering Dissertation Co-Advisor, Co-Chair of Committee

Mary Jane Irwin Evan Pugh Professor and Robert E. Noll Professor of Computer Science and Engineering

Vijaykrishnan Narayanan Professor of Computer Science and Engineering

Kenneth Jenkins Professor of Electrical Engineering

Lee Coraor Associate Professor of Computer Science and Engineering Director of Academic Affairs

∗Signatures are on file in the Graduate School.

Abstract

New breakthroughs in memory technology in recent years have led to increased research efforts in so-called byte-addressable non-volatile memories (NVMs). As a result, questions of how and where these types of NVMs can be used have been raised. Simultaneously, technology scaling has led to an increased number of CPU cores on a die as a way to utilize the available area. This has increased the pressure on the memory system and caused growth in the amount of main memory available in a computer system, which in turn has escalated the amount of power consumed by the de facto standard DRAM memory. Moreover, DRAM memories have run into physical limitations on scalability due to the nature of their operation. NVMs, on the other hand, provide high scalability well into the future and have decreased static power, one of the major sources of power consumption in contemporary systems. For all of these reasons, NVMs have the potential to be an attractive alternative or even a complete replacement for DRAM as main memory.

For these types of devices to be feasible, however, some obstacles must be overcome before there is a compelling reason for NVMs to augment or replace DRAM. Although their static power and scalability are better, NVMs suffer from lower performance, higher dynamic power, and lower endurance than DRAM. Furthermore, architectural and comprehensive circuit models to explore how these issues can be resolved at a high level are lacking.

This dissertation addresses these issues by proposing several models for NVMs at both the architectural and circuit level. The architectural model, NVMain, is built around the assumption that NVMs may not be complete replacements and thus provides the flexibility to model complex memory systems, including hybrid and distributed levels of memory. The circuit-level model, DESTINY, combines NVMs with more recent three-dimensional circuit design proposals to obtain performance- and energy-balanced memory designs. These two models are leveraged to explore several NVM memory designs. The first design employs a hybrid of DRAM and NVM and addresses the issue of caching large amounts of NVM in the DRAM portion. The second design reworks memory bank design to provide an extremely high-density NVM bank with the capability to access individual sub-units of the memory bank. The final design leverages the high parallelism from access to individual sub-units to schedule memory requests in a more efficient manner.

Table of Contents

List of Figures vii

List of Tables x

Acknowledgments xi

Chapter 1 Introduction 1
1.1 Background ...... 5
1.2 Related Work ...... 7

Chapter 2 Simulation Framework for Non-volatile Memories 10
2.1 Introduction ...... 10
2.2 Motivation ...... 11
2.3 Implementation ...... 12
2.3.1 Energy Modeling ...... 12
2.3.2 Non-volatile Memory Support ...... 12
2.3.3 Fine-grained Memory Architecture ...... 13
2.3.4 Memory System Flexibility ...... 13
2.3.5 Verification ...... 14
2.3.6 Timing Verification ...... 14
2.3.7 Energy Verification ...... 15
2.3.8 Data Verification ...... 15
2.3.9 Simulation Speed ...... 15
2.4 Case Studies ...... 16
2.4.1 MLC Simulation Accuracy ...... 16
2.4.2 Hybrid Memory System ...... 17
2.4.3 DRAM Cache ...... 19
2.5 Conclusions ...... 20

Chapter 3 Bank-level Modeling of 3D-stacked NVM and Embedded DRAM 21
3.1 Introduction ...... 21
3.2 Motivation ...... 22
3.2.1 Emerging Memory Technologies ...... 22
3.2.2 Modeling Tools ...... 23
3.3 Model Implementation ...... 24
3.3.1 eDRAM Model ...... 24
3.3.2 3D Model ...... 25
3.4 Validation Results ...... 26
3.4.1 3D SRAM Validation ...... 27
3.4.2 2D and 3D eDRAM Validation ...... 28
3.4.3 3D RRAM Validation ...... 29
3.5 Case Studies using DESTINY ...... 30
3.5.1 Finding the optimal memory technology ...... 30
3.5.2 Finding the optimal layer count in 3D stacking ...... 31
3.6 Conclusion ...... 31

Chapter 4 Improving Effectiveness of Hybrid-Memory Systems with High-Latency Caches 33
4.1 Motivation ...... 35
4.2 Implementation ...... 38
4.2.1 Managing the Fill Cache ...... 40
4.2.2 Re-routing Requests ...... 41
4.2.3 DRAM Cache Load ...... 42
4.2.4 Coalescing Fills ...... 42
4.2.5 Modifications to DRAM Cache ...... 43
4.3 Published Results ...... 44
4.3.1 Experimental Setup ...... 44
4.3.2 DRAM Cache Architectures ...... 45
4.3.3 Hardware Prefetcher ...... 46
4.3.4 Selection ...... 47
4.3.5 Baseline Results ...... 48
4.3.6 Average Request Latency ...... 48
4.3.7 Prefetcher Effectiveness ...... 50
4.3.8 Set Indexing Effectiveness ...... 50
4.3.9 Coalesced Requests ...... 51
4.3.10 Sensitivity of Fill Cache Size ...... 51
4.3.11 Application Classification ...... 52
4.4 Conclusion ...... 53

Chapter 5 Leveraging Non-volatility Properties for High Performance, Low Power Main Memory 54
5.1 Introduction ...... 55
5.2 Motivation ...... 56
5.2.1 Non-Volatile Memory Design ...... 56
5.2.2 The Non-Volatility Property ...... 57

5.3 Implementation ...... 59
5.3.1 Partial-Activation ...... 59
5.3.2 Multi-Activation ...... 60
5.3.3 Backgrounded Writes ...... 60
5.3.4 Ganged Subarray Groups ...... 61
5.4 Published Results ...... 62
5.4.1 Memory Controller and Scheduling ...... 64
5.4.2 Multi-Issue Memory Controller ...... 64
5.4.3 Address Interleaving ...... 65
5.4.4 Number of Column Divisions and Subarray Groups ...... 65
5.4.5 Impact of Backgrounded Writes ...... 67
5.4.6 Energy Comparison ...... 68
5.4.7 Design Optimization ...... 68
5.4.8 Sensitivity Study ...... 69
5.4.9 Future Devices ...... 70
5.4.10 Application to STT-RAM and RRAM ...... 71
5.4.11 Comparison with Contemporary DRAM ...... 71
5.5 Design Implementation ...... 72
5.5.1 Overhead Costs ...... 73
5.5.2 Area Overhead ...... 73
5.5.3 Yield and NVM Lifetime ...... 74
5.6 Conclusion ...... 75

Chapter 6 Early Activation Scheduling for Main Memories 76
6.1 Motivation ...... 77
6.1.1 Baseline System Design ...... 78
6.1.2 Oracle Analysis ...... 80
6.2 Results and Analysis ...... 80
6.2.1 Missed Prediction Implications ...... 80
6.2.2 Limiting Amounts of Early-ACTs ...... 81
6.2.3 Unrealized Performance Potential ...... 82
6.2.4 Memory Controller Implications ...... 82
6.3 Conclusions ...... 83

Chapter 7 Dissertation Conclusions 85
7.1 Future Directions ...... 87

Bibliography 88

List of Figures

1.1 Overview of Memory Architecture. Only one memory controller with one channel is shown; however, any number of channels is possible. ...... 6

2.1 Relative error of an estimated write-pulse time compared to exact measurement based on data values. ...... 11
2.2 Overview of NVMain Architecture. Only one memory controller with one channel is shown. ...... 13
2.3 Calculated memory subsystem power of NVMain normalized to DRAMSim2. ...... 15
2.4 Percentage of simulation time spent in memory subsystem. ...... 16
2.5 Absolute error of an estimated write-pulse time compared to exact measurement based on data values. ...... 17
2.6 Frequency of 2-bit data values written to MLC cells for various SPEC2006 benchmarks. ...... 18
2.7 IPC results and migration statistics for a hybrid memory. ...... 18
2.8 IPC results of DRAM Cache with varying prediction accuracy. ...... 19
2.9 Predictive DRAM Cache hit rate at various accuracies. ...... 20

3.1 High-level overview of DESTINY framework. Configurations are generated from extended input model and fed to NVSim core. Results are fine-tuned via 3D model and filtered by optimization to yield result outputs. ...... 24
3.2 4-layer monolithically stacked RRAM. Storage elements are sandwiched directly between layers of wordlines (south-east orientation) and bitlines (south-west orientation). ...... 27

4.1 A single DRAM row in a DRAM cache. The 2KB row is divided into 32 64-byte cache line sized segments. The first 3 segments are used for tags while the remaining segments are ways in the cache. ...... 36
4.2 Example fill to a baseline DRAM cache. In (a), 3 tags are read and no commands are issued until data returns. The request misses in the cache and data is fetched from off-chip memory in (b). The DRAM cache may or may not be precharged if there are other requests to this DRAM bank. In (c), the bank is re-activated if needed, and the data is written followed by a tag write. ...... 37
4.3 Example snapshot of bank queues in SPEC2006's bwaves benchmark. ...... 38
4.4 (a) Fill Cache entry being evicted where a coalesce can be performed. (b) The DRAM cache probing the Fill Cache for requests for bank 4. ...... 39

4.5 Percentage of reused requests based on type. Types are referenced prefetches (RP), unreferenced prefetches (UP), and demand requests (UD). Average is 29.22% for RP, 41.88% for UP, and 28.90% for UD. ...... 41
4.6 Example set indexing schemes for DRAM caches. In (a), indexing is similar to SRAM caches where the lowest bits above the byte offset determine the set and upper bits determine the tag. In (b), a portion of the tag is after the byte offset, promoting row-buffer hits. ...... 43
4.7 Diagram of the architecture of a main memory subsystem containing a Fill Cache (F$) along with state machine showing basic request flow in a DRAM cache memory system with Fill Cache. ...... 46
4.8 Speedup of DRAM cache with Fill Cache over DRAM cache baseline. Results for the baseline, STeMS baseline, and combined approaches shown. The x-axis represents the speed-up over the baseline of DRAM Cache with MissMap only, and the y-axis is the benchmark. ...... 48
4.9 Average DRAM cache request latency for the first 100 million execution cycles showing how Fill Cache averages out request latency during high memory periods (e.g., during application start). The darker line is the average request latency for the DRAM Cache + MissMap design and the lighter line is for the DRAM Cache + MissMap + Fill Cache design. Average latency (x-axis) and the total execution time (y-axis) are in memory cycles. ...... 49
4.10 Accuracy and number of covered and uncovered requests issued to the DRAM cache by prefetcher. ...... 50
4.11 Percentage of fill requests able to be coalesced. ...... 51
4.12 Percentage of read requests where data was returned by the Fill Cache. ...... 52

5.1 Potential Energy Savings in NVM by opening 1/Nth of a Row Buffer. ...... 55
5.2 The design of a typical NVM memory. Two NVM sub-arrays are illustrated. One sub-array is an 8x4 structure, where 8 local bitlines are selected by 4 local Y-select. The global Y-select further selects out the final I/O bitline that is sensed by the global sense amplifier (S/A). ...... 57
5.3 Non-volatility property allows sensing of only the top left cell in a tile. Neither of the right two columns are connected to S/A. ...... 58
5.4 The access schemes proposed in FgNVM. (a) Partial-Activation: Only one of the top two tiles is read from the bank; energy saved is in the other tile. (b) Multi-Activation: Data read from tiles in different rows, same bank; twice the bandwidth potential. (c) Backgrounded Write: Upper-left tile is read, lower-right tile is written. Reduces read/write interference. ...... 59
5.5 Ganged subarrays use multiple SAGs accessed in parallel to increase row buffer sizes. ...... 61
5.6 IPC improvement over baseline PCM design compared to FgNVM, multi-issue FgNVM, and an Ideal Scenario. All results show 8×2 FgNVM designs. ...... 64
5.7 Performance difference with and without interleaving on rows and columns: Shows that interleaving rows and columns is beneficial to FgNVM designs. ...... 65
5.8 IPC impact of adding more subarray groups averaged over the SPEC benchmarks: Not much performance improvement by dramatically increasing subarrays, with a maximum around 4% in libquantum, gobmk, GemsFDTD, and soplex. ...... 66

5.9 IPC impact of adding more column divisions averaged over the SPEC benchmarks: Large difference in IPC when adding more column divisions; ideal case moves opposite of real case. ...... 67
5.10 The 'nowb' bars show IPC when Backgrounded Writes are disabled. The remaining bars show the IPC improvement when Backgrounded Writes are enabled. Up to 12% increase for lbm and 5% on average. ...... 67
5.11 Energy consumption normalized to baseline NVM prototype. Shows a significant decrease in most FgNVM configurations. ...... 68
5.12 The CCD hit rate (hit rate of data in all sensed column divisions). As more column divisions are added, the rate drops due to underfetch. ...... 69
5.13 Sensitivity of our tRCD and tCAS selections on 8×32. Shows sensitivity when decreasing tCAS is proportional to CCD hit rate. ...... 70
5.14 Study of reducing total tRCD and tCAS. This simulates performance of future devices with faster activation or sensing time. ...... 70
5.15 Application of FgNVM to a subarray-ganged RRAM design. Yields results similar to PCM design. ...... 71
5.16 FgNVM vs. a mid-grade DDR3 DRAM normalized to baseline NVM. Shows that FgNVM design can help approach the speeds of DRAM. ...... 72

6.1 Example of an Early-ACT in action. The top line shows the baseline case, where ACT is created and issued when a demand request arrives at the MC. Bottom is the Early-ACT idea, where ACTs are issued speculatively before a demand request arrives. ...... 77
6.2 Baseline system setup for testing Early-ACT concept. Four CPUs with three levels of cache, access predictor (AP), memory controller (MC), and 4 channels of memory. ...... 78
6.3 Memory access predictor accuracy at L3 cache across memory intensive SPEC2006 benchmarks. High accuracy is achieved for confident issuance of Early-ACT commands. ...... 79
6.4 Results of running Early-ACT oracle prediction on PCM type memories. The average total memory latency is shown with baseline, oracle pairs for each memory type. ...... 79
6.5 Results of running Early-ACT using a naive issuance of Early-ACTs. Baseline and simulation with Early-ACT are shown. ...... 80
6.6 Total number of Early-ACT requests. The difference between Early-ACT and oracle is the total number of false hits and misses combined. ...... 81
6.7 Instructions per cycle for each benchmark when limiting the number of Early-ACTs. A less-than percentage series means Early-ACTs are only issued when the memory queue is less than said percentage full. ...... 82
6.8 Average number of ACT cycles saved when issuing an Early-ACT. This average can range from 0 to 48. In all cases, the oracle savings are much higher than a realistic implementation. ...... 83
6.9 Total number of ACT cycles saved across the entire simulation run. The difference in savings between Early-ACT and oracle mirrors the speedup shown in the performance figure. ...... 83

List of Tables

2.1 Simulated MLC Timing Parameters ...... 17

3.1 Validation for 3D SRAM model. ...... 28
3.2 Validation of 2D and 3D eDRAM. ...... 28
3.3 Design space exploration results of determining the optimal memory technology for a desired optimization target (refer to Section 3.5.1). The table shows results on all parameters for comparison purposes. ...... 29
3.4 Validation of 3D RRAM. ...... 29
3.5 Design space exploration results of determining optimal number of 3D-stacked layers for various optimization targets for STT-RAM (refer to Section 3.5.2). The table shows results on all parameters for comparison purposes. ...... 30

4.1 Experimental Setup Parameters ...... 42
4.2 Total memory traffic over first 2 billion cycles of each benchmark. Benchmarks selected are in bold. ...... 44
4.3 Categories of benchmarks run in our simulations. Some benchmarks belong to multiple categories. ...... 53

5.1 MPKI and WPKI of Simpoint slices. ...... 62
5.2 Experimental System Setup ...... 63
5.3 Summary of Area Overheads in FgNVM design. ...... 73

6.1 Experimental System Setup ...... 78

Acknowledgments

I would like to express my gratitude to Dr. Yuan Xie, who served as my research advisor for the past seven years – five years during graduate school as well as two years as an undergraduate student. I thank him for his encouragement, motivation, and ability to push me forward while working on exciting research topics, and for his strong connections to top researchers, which helped me gain exposure and get where I am today. Furthermore, I would like to thank my committee members Dr. John Sampson, Dr. Mary Jane Irwin, Dr. Vijaykrishnan Narayanan, and Dr. Kenneth Jenkins for their help and advice on my dissertation proposal and throughout the years while my advisor was away, and for matching me with other students with similar research interests so that we could work together towards top-notch publications. Thank you to all students in the department who gave me the opportunity to work with them, both those who have departed and those who remain: Dr. Jin Ouyang, Dr. Cong Xu, Dr. Tao Zhang, Dr. Mike Debole, Dr. Qiaosha Zuo, Dr. Dimin Niu, Dr. Lian Duan, Dr. Xiaoxia Wu, Dr. Jishen Zhao, Dr. Xiangyu Dong, Dr. Jue Wang, Dr. Karthik Swaminathan, Dr. Kevin Irick, Dr. Guangyu Sun, Hsiang-Yun Cheng, Jia Zhan, Ping Chi, Jing Xie, Ivan Stalev, Hang Zhang, Yang Zheng, and Kaisheng Ma. Finally, I would like to thank my family and friends for their motivation and encouragement throughout the years.

Dedication

This dissertation is dedicated to my friends, family, and colleagues who have provided support, encouragement, and guidance throughout my graduate school career.

Chapter 1

Introduction

Struggles to improve memory technologies are becoming a large problem in increasing device capability for existing mainstream memory technologies such as DRAM. For several years there have been issues in decreasing the DRAM half-pitch needed to improve memory capacity and price-per-bit metrics. As of the latest ITRS roadmap, there are still no known solutions to some of the problems being faced by DRAM, including reliable charge storage and sensing mechanisms [1]. These struggles are drawing a lot of attention towards redesigns and potentially even replacements for DRAM.

Several works have proposed the replacement of main memory with non-volatile memories (NVMs). The intrinsic characteristic of non-volatile memories is the fact that data is not lost. This implies the data does not need to be refreshed and that circuitry such as word-line drivers, sense-amps, and write drivers can be power gated without risk of data loss. Since refresh and stand-by power in the DRAM arrays themselves are two major sources of power dissipation in DRAM, this has the potential for high energy savings. Unfortunately, the operational energy of non-volatile memories is nominally high for reads and is even worse for writes, which can eat away at much of these savings. Some of these proposals therefore consider hybrids of DRAM and non-volatile memories [2, 3] to combat this.

Simply replacing DRAM cells with non-volatile memory cells is not a viable option, since there are several differences in operation between the two technologies. In addition to the operational energy, these technologies suffer from longer latencies. Specifically, the write latency of non-volatile memories can be orders of magnitude larger than the read latency. Due to the operation of the memory cells, however, it is difficult to optimize these write latencies. Some architectural-level approaches have been taken to help hide these high latencies, including write cancellation, write pausing [4], using write buffers, and smaller row buffer sizes [5]. However, there still remains much work to be done if non-volatile memories are to replace DRAM as main memory.

In addition to these issues, non-volatile memory research is increasingly focusing on other unique characteristics of the memory cells. Unlike DRAM, non-volatile memories have the ability to reliably store multiple bits per cell. Some NVM technologies additionally allow for new memory layouts, such as cross-point designs [6] and high-density 3D-stacked memories [7, 8, 9]. However, such designs currently have additional problems with reliability and stability, which cause the read and write latencies to potentially differ non-uniformly and exhibit high "sneak" currents that increase overall memory power. As a direct result of this, architectural-level simulation with knowledge of application data can be very important to explore the interactions of circuit-level issues and provide more accurate results for research prototype designs.

The Problem: Although several works have already been published exploring the design of non-volatile memory, several of these works may be inaccurate due to high-level abstractions made during simulation. The lack of accuracy in abstracting the differences in memory architectures can also mask problems that would exist in real-world embodiments of designs. As a result, better simulation frameworks are needed to find these problems and explore solutions. In addition, very few works utilize the unique features (i.e., non-volatility) of these emerging memory technologies, and simply consider the lack of leakage power as the main feature of NVMs. This leaves a large gap in research between the circuit-level implementations and architectural-level explorations of next-generation memory systems.

Demonstrated Solutions:

(I) The first portion of this dissertation introduces a simulation framework to model new characteristics of emerging bit-accessible non-volatile memories, taking into account some of the missing characteristics of NVMs. Current simulators have neither the ability to model hybrid memory systems, endurance, or fault-tolerance schemes, nor accurate models for energy, write scheduling, and real application data. Furthermore, the monolithic design of these simulators makes adding support for emerging NVMs a challenge. We implemented the NVMain simulator, a highly flexible simulator allowing for modular memory design (e.g., memory controllers, interconnects, and banks are the modules). This modular design allows for easy implementation of hybrid memory systems as well as the extensibility needed for exploring endurance, fault tolerance schemes, and write programming policies. In addition, the ability to perform per-module energy modeling and shared request flow allows for accurate energy models with data availability. We show that simply increasing the service time of write requests in existing simulators is not sufficient to model non-volatile memories. This is especially true for multi-level cell (MLC) designs. We show that estimated IPC can deviate significantly from actual measured IPC across applications, with a high correlation to their data patterns.

(II) The second portion of this dissertation leverages the need for bank-level models of energy and latency for given data pairs written to a memory cell. We extend the work in the first portion of this dissertation and build on previous widely known research tools to fill the gaps needed for circuit-level memory simulation. We implement DESTINY, a 3D design-space exploration tool for SRAM, eDRAM, and non-volatile memory. The values output from DESTINY can be used as input for NVMain. In addition to non-volatile types, the tool also fills the gaps of memory technologies missing from previous tools, such as eDRAM and general 3D-stacked memories. The tool is also able to automate design space exploration and design caches with the optimal number of 3D layers or suggest a technology for a given optimization target. Therefore, this tool is not only useful as a supplement to NVMain, but also has high utility for general computer architecture research.

(III) The third portion of this dissertation explores the challenges of hybrid memory systems. For this work, we utilize a large DRAM memory as a cache-block-sized last-level cache (LLC). Using such hybrid memories is increasing in popularity based on the intuition that the high write time of non-volatile memories can be easily hidden by a large, fast cache. In this hybrid memory design, write requests originate both from writebacks in higher-level caches as well as "Fill" requests from main memory. However, unlike their SRAM counterparts, DRAM-based caches are significantly slower than the next higher cache in the memory hierarchy. As a result, the increasing number of writes is more problematic when blocking read requests to the DRAM cache. For this work, we propose an additional "Fill Cache" to hold write requests. Unlike a write buffer, the fill cache can be designed to match cache sets to banks in the DRAM cache. As a result, the fill cache can be much larger and hold much more data than a write buffer. The increase in size allows for a residency time that is much larger than a write buffer's. We show that this allows for additional techniques to be used exclusively on fill caches. In addition to write buffering, we show that the fill cache can be used to coalesce row-buffer hits in the DRAM cache, filter requests and drop fills to the DRAM cache, as well as provide a fast path for accessing unfilled data such as long prefetch requests typically seen at lower-level caches.

(IV) The fourth portion of this dissertation involves the design of a non-volatile main memory that leverages the unique non-volatility characteristic of the memory cells. Previous works show that sense amplifiers used in non-volatile main memories are significantly larger than sense amplifiers in DRAM sub-arrays [10, 11, 12, 13]. As a result, non-volatile memories with a high enough resistance relative to the global I/O lines (i.e., the wires delivering the data) may employ global sense amplifiers. One prior work concludes it is more area- and power-efficient to reduce the size of row buffers and implement multiple SRAM-cached row buffers [5]. We study the data access patterns of multiple benchmark applications and find many interesting characteristics. First, the entirety of a row buffer is generally not used. The global I/O lines therefore need not be sensed if the corresponding data bits are never accessed. Not sensing these columns can result in significant energy reduction over an activation cycle (i.e., when a row of data is sensed). Second, the depth of the queues provides enough insight to find requests on these unused I/O lines that reside in a different row than the request currently being sensed. As a result, it is possible to sense both requests in parallel. We show that a low-area-overhead design is possible because of the non-volatility feature of the memory cells. Third, unused I/O lines in both cases can also be used to sneak in write requests. As a result, the write requests can potentially be entirely non-blocking. Using these observations, we simulate a memory system with lower energy, higher performance, and minimal area overhead.

(V) The fifth portion of this dissertation leverages the non-volatile memory design proposed in the fourth portion in order to explore its potential for architecture-level applications. Specifically, we consider the ability to decouple the standard row-based addressing system used to access main-memory-type designs and propose a method to speculatively open arbitrary rows using an activation command. We can do this by leveraging the property that multiple rows may be open in fine-granularity memory systems. This means that if we can accurately predict which rows will become open in the near future, we can preemptively issue an activation to these rows so that data is available in the memory row buffer as soon as the transaction arrives at the memory controller. By doing so, the average activation time is reduced and the number of row buffer hits for read and write transactions increases, reducing overall memory latency and boosting system performance.

The goal of this dissertation is to explore the potential for non-volatile memory to replace DRAM as main memory, either partially or completely. Through the five portions listed above, we introduce several techniques which can be used to improve performance or reduce energy consumption of a computer's memory subsystem. Although non-volatile memory may be better suited for replacement in other places in the memory hierarchy, such as caches or solid-state disks, we limit the scope to main memory in an attempt to bring power and performance metrics closer to DRAM.

1.1 Background

NVM Cell Characteristics

Non-volatile memory cells are fundamentally different from current DRAM- and SRAM-based memories because they store information as a resistance value rather than storing a charge. The main benefit of this is the ability to keep information without a power source. This is becoming increasingly important today as technology scaling to smaller and smaller dimensions has begun to cause leakage power to be the dominant source of power dissipation. Non-volatile memories can be realized in multiple ways, but the most popular types of NVM devices today are Phase-Change Memory (PCM), Resistive RAM (RRAM), and Spin-Transfer Torque RAM (STT-RAM).

Phase-change memory stores information by heating a chalcogenide element. Heating such an element to a high temperature by providing high current and quickly removing it causes the material to move and form an amorphous structure that is "frozen" once the current is removed. Heating to a lower temperature that is hot enough for the material to move and gradually cooling it induces crystal growth in the cell. This crystalline structure has much lower resistance than the amorphous state, and thus the cell provides two unique values that can be stored.

Resistive RAM stores information in a metal-oxide insulator between two electrodes. Introducing a potential difference on the electrodes causes soft breakdown of the metal-oxide material and generates oxygen vacancies in this layer. These vacancies move towards the opposing electrode, causing a conductive filament to be formed. Flowing current in the opposite direction forces the oxygen vacancies back into the metal-oxide layer, removing the conductive filament. Again, the two different states of the cell are enough to store at least two measurable values in the cell.

Spin-transfer torque RAM stores information in a reversible "free" layer. This magnetic layer is sandwiched together with a layer with permanent magnetization used as a reference. By providing current flow in one direction, the magnetic field of the free layer can be changed. The difference in directions of the magnetic fields creates two unique resistance values which can be used to store data.
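As a rough illustration of resistance-based storage, the sketch below shows how a sensed resistance could be mapped to a stored value for a single-level cell and a 2-bit multi-level cell. The reference thresholds and values are hypothetical and are not taken from any particular device or from this dissertation's models.

```cpp
#include <iostream>
#include <vector>

// Hypothetical reference resistance (ohms); real values vary widely by technology.
constexpr double kSlcReference = 100e3;

// SLC read: a single reference separates the low-resistance and high-resistance states.
int ReadSlc(double resistanceOhms) {
    return (resistanceOhms < kSlcReference) ? 1 : 0;
}

// 2-bit MLC read: three references divide the resistance range into four bands,
// so the cell stores one of four values (two bits of information).
int ReadMlc2(double resistanceOhms) {
    const std::vector<double> references = {50e3, 200e3, 800e3};  // assumed band edges
    int value = 0;
    for (double ref : references)
        if (resistanceOhms > ref) ++value;
    return value;  // 0..3
}

int main() {
    std::cout << "SLC cell at 10 kOhm reads  " << ReadSlc(10e3) << "\n";
    std::cout << "MLC cell at 400 kOhm reads " << ReadMlc2(400e3) << "\n";
    return 0;
}
```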

Memory Organization

Main memory design is typically organized into multiple channels, ranks, and banks. Figure 1.1 shows the architecture of a single memory channel; additional memory channels would be similar to this figure. The memory controller is responsible for scheduling requests to the memory system. It is typically connected via an interconnect (e.g., a transmission line from a CPU's memory controller to the memory modules). Each memory rank consists of a group of banks, and there may be multiple ranks on the same interconnect. Within each bank, there are multiple subarrays which subdivide the bank in the vertical direction and share a common local row decoder which is driven by the bank's global row decoder. Each subarray is further subdivided in the horizontal direction into multiple tiles. Each tile typically passes through a multiplexer to reduce the amount of data output by each tile.


Figure 1.1: Overview of Memory Architecture. Only one memory controller with one channel is shown; however, any number of channels is possible.


Figure 1.2: Design of DRAM tiles. Four tiles shown in 2x2 layout.

Non-volatile memory prototypes have followed a similar organization in order to reduce the cost per bit of the memory. Figures 1.2 and 1.3 show the low-level design of a 2x2 set of tiles for DRAM and NVM, respectively. DRAM typically has one voltage-based sense amplifier per bitline. However, non-volatile memory prototypes typically employ current-based sense amplifiers which are much larger than the sense amplifiers used in DRAM [10, 11, 12, 13]. As a result, these sense amplifiers may be placed globally, and isolation transistors are used at the edge of each tile.


Figure 1.3: Design of NVM tiles. Four tiles shown in 2x2 layout.

The device-level and circuit-level characteristics of emerging non-volatile memory are important to consider when developing architectural-level designs so we can utilize the unique circuit design properties advantageously.
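To make the channel/rank/bank/subarray/tile organization concrete, the sketch below decodes a physical address into the fields a controller could use to route a request. The field widths and their ordering are assumptions chosen only for illustration; real mappings differ by device and controller, and configurable translation is discussed further in Chapter 2.

```cpp
#include <cstdint>
#include <cstdio>

// Assumed geometry for illustration only: 2 channels, 2 ranks, 8 banks,
// 8 subarrays per bank, 128 columns, 64-byte bursts.
struct DecodedAddress {
    unsigned channel, rank, bank, subarray, col;
    uint64_t row;
};

DecodedAddress Decode(uint64_t addr) {
    DecodedAddress d{};
    addr >>= 6;                                     // drop the 64-byte burst offset
    d.channel  = unsigned(addr & 0x1);  addr >>= 1;
    d.col      = unsigned(addr & 0x7F); addr >>= 7;
    d.bank     = unsigned(addr & 0x7);  addr >>= 3;
    d.rank     = unsigned(addr & 0x1);  addr >>= 1;
    d.subarray = unsigned(addr & 0x7);  addr >>= 3;
    d.row      = addr;                              // remaining bits select the row
    return d;
}

int main() {
    DecodedAddress d = Decode(0x12345678ULL);
    std::printf("ch=%u rank=%u bank=%u subarray=%u row=%llu col=%u\n",
                d.channel, d.rank, d.bank, d.subarray,
                (unsigned long long)d.row, d.col);
    return 0;
}
```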

1.2 Related Work

Main Memory Modeling Research on main memory typically uses verified simulators as a baseline for their work. DRAMSim2 [14] is a popular tool for modeling DRAM memories. In addition to this tool, USIMM [15] and DrSim [16] also exist to simulate DRAM. More recently, an event-based model [17] is included with the gem5 [18] simulator. These works utilize timing constraints provided by JEDEC specifications [19, 20, 21] to evaluate performance and measurements provided in DRAM device datasheets to calculate energy according to memory vendor guidelines [22]. Other tools such as DRAMPower [23] can utilize memory command scheduling traces to determine memory system power more accurately. In addition to architecture-level models, several circuit and device models have been developed for design space exploration of individual memory banks or ranks in order to optimize a memory design for more specific constraints compared to commodity memories. The CACTI tool [24] simulates SRAM caches but has since been extended to support eDRAM and DRAM. This tool resulted in several derivative tools for more specific purposes under the same general framework. Mamidipaka and Dutt propose eCACTI [25], which adds a leakage model to CACTI, and Li et al. [26] propose CACTI-P, which models low-power caches (e.g., caches with sleep transistors). Chen et al. presented CACTI-3DD [27], which adds a TSV model for DRAM memory, and the 3DCACTI [28] tool provides the ability to model 3D SRAM. A device-level model is also provided by the RAMBUS corporation [29] to explore the trade-offs when sizing specific transistors (e.g., wordline drivers or sense amplifiers).

NVMs for Cache Memory Because of the many different types of NVMs, there is an opportunity to use NVMs at various levels of the memory hierarchy, and they may even be a candidate for [30]. As a result, some works have explored utilizing NVMs at the cache level. The most obvious usage is a drop-in replacement of SRAM or eDRAM last-level caches with NVMs. Chang et al. compare an STT-RAM LLC to SRAM and eDRAM and show minimal performance impact using STT-RAM for a large energy reduction [31]. Similar results were shown for GPGPUs by Samavatian et al., which utilize multiple levels of cache [32]. However, implementation of NVMs in higher-level caches was shown to dramatically reduce performance. Another popular area of exploration is NVMs stacked on volatile RAMs, such as NV-SRAM and NV-Flip-flops (NV-FF), known as normally-off computing. These devices can be used to periodically back up the data in the volatile cells by directly copying the data to the NVM cell. An overview of the project was given as a special session [33] describing a memory hierarchy for a processor [34] and very low power devices such as sensors [35] and medical devices [36]. More recently, this concept has been refined by improving the NV-SRAM/NV-FF devices [37].

NVMs for Main-Memory Several works consider improvements to DRAM main memory that may be applicable to non-volatile memory designs. In [38] the authors show that subdividing memory banks into multiple subarrays can provide better performance than having an increasing number of memory banks of smaller size. The DRAM-specific implementation is much lower area than the "many-bank" counterpart. In [39] the authors propose subdividing bitlines within memory arrays using additional isolation transistors to reduce the effective capacitance of near-memory (cells closer to the sense amplifiers). The result is faster near-memory but a slower far-memory (i.e., cells beyond the isolation transistor). The authors use a row-swapping approach to move more frequently accessed data into the near-memory. Another work [40] proposes accessing a single tile of memory in order to reduce the energy consumption of activating the additional tiles in the subarray. This work increases the area of the DRAM die to allow for enough internal wires to carry the data from a single tile.

Other papers address the issues of NVMs directly, such as the high write latency in PCM. In [5], the authors design a main-memory-based phase change memory where a normal DRAM-sized row buffer is replaced with several smaller row buffers. The main idea is that sense amplifiers are much larger than the extra SRAM buffers needed to temporarily store previously opened row buffers. Their results conclude that using row-buffer sizes smaller than contemporary DRAM and leveraging multiple SRAM-based caches for recently opened rows provides the best trade-off in terms of energy, performance, and area. In [4], the authors mitigate the impact of writes on read performance by allowing for write cancellation and write pausing. The motivation shows that write requests interfering with read requests cause the average memory latency of reads to more than double. As a result, system performance as a whole greatly decreases. However, phase-change memory has the unique characteristic of predictably changing its resistance. This allows writes to be canceled or "paused" and restarted later, so that read requests can be performed.

Other works have considered using a fixed portion of main memory as persistent memory. This new class acts as memory between main memory and storage-class memory, with characteristics from both. Persistent memory can be accessed in the same way as main memory through loads and stores while keeping data across system power cycles similar to a hard drive. The benefit of this memory is faster access to storage-class data without the need to page data in and out from a backing storage. However, the implementation of such memory has several issues with data consistency, such as the need for journaling, write-ordering, and atomicity. As a result, several works have explored solutions in this area, including requirements [41, 42], design of the hybrid main memory [43], consistency solutions [44], and design [45, 46, 47].

Chapter 2

Simulation Framework for Non-volatile Memories

2.1 Introduction

In the past decades, the evolution of dynamic random access memory (DRAM) technology has continued to provide a low-latency, high-bandwidth, and low-power main memory solution. In particular, the JEDEC DDR families (DDR through DDR4) have been extensively studied in academia as they are the mainstream memory standards. As a result, several DRAM simulators have been developed in recent years to accelerate research on the main memory subsystem [14, 15, 16]. Nonetheless, these DRAM simulators have little flexibility to be extended for novel DRAM technologies, such as Wide I/O [48] and LPDDR2-N [21], as well as 3D-stacked DRAM [49, 50]. In addition, these simulators are dedicated to DRAM modeling and are not built to provide accurate models of emerging non-volatile memory (NVM) technologies (e.g., STT-RAM, PCM, and RRAM), which require endurance and fault modeling, programming policy models (e.g., single-set-multiple-reset vs. single-reset-multiple-set for MLC PCM), and hybrid memory systems, as well as updated energy and data modeling. As a result, a new memory simulator equipped with NVM support as well as high flexibility and a friendly user interface is beneficial for researchers in the memory subsystem area, so they can quickly start their studies with little modification. We demonstrate the NVMain simulator to fill this gap. Many features have been designed in NVMain to support both DRAM and NVM simulation. Moreover, NVMain decouples the memory controller, DIMMs, and the interconnect between them, allowing for hierarchical energy computation rather than full-system estimation from empirical measurements. A distributed timing control is used to simplify latency modeling in custom memory designs. In addition, a fine-grained memory bank model, MLC support, and an inlined cache with better support for heterogeneity are available to encourage users to explore various memory system designs.


Figure 2.1: Relative error of an estimated write-pulse time compared to exact measurement based on data values.

2.2 Motivation

Energy modeling and performance evaluation are the most important aspects that need to be addressed by new simulation frameworks. Typical DRAM memory simulators utilize published numbers for current (IDD values) measured empirically on actual devices. This approach is problematic when published data is not available, as is commonly the case for non-volatile memories and prototype memory system designs. Additionally, calculations using only these values solely include device power; interconnect and memory controller power are not captured in this measurement.

Simple models for measuring performance in non-volatile memory can be designed by encapsulating the non-volatile memory timings as DRAM timings. Typical non-volatile memories have longer write times. This can be modeled in two separate ways: as write-back or write-through. If we assume there is an SRAM or flip-flop based row buffer to hold data while a row is opened (e.g., RDBs in LPDDR2-N [51] or the proposal in [5]), the write is a write-back and the time can be modeled by increasing the precharge timing tRP to the worst-case value for a write. If there is no buffer, writes are write-through and must be written back to cells immediately. Modifying the write recovery timing tWR is a simple way to model this approach. These timing modifications become more complicated with MLC-type cells. Worst-case timing for MLC is typically 7-8 times larger than SLC timings, due to the program-and-verify approach used. As a result, the data values for a particular application as well as programming policy models can play a significant role in simulating the most important aspects for simulation frameworks – performance and energy.

The results in Figure 2.1 show the relative error of IPC values when estimating using tWR compared with an exact approach which inspects data. In the figure, gobmk and zeusmp performance is underestimated (i.e., actual performance is higher) using the estimated approach, indicated by black. The remaining benchmarks are overestimated (i.e., actual performance is lower) by the shown amount, indicated by grey. Our data inspection approach assumes there are enough write drivers to write the maximum number of bits per device (64 bits in this case). Therefore, the simplest and most efficient programming policy in terms of latency is to have each write driver write data simultaneously.

The total time for the multi-bit write request is simply the maximum time of any write driver. As a result, writing all zeros is dramatically different from writing a memory word of random zeros and ones. For the exact approach we considered maximum deviation values for write pulses ranging from 0 to 3. Compared to no deviation in write pulses for the exact approach, the maximum error was 7.5% at 3 standard deviations. This shows that even when the write pulse count is allowed to deviate by as much as 50%, the estimates are much more disparate, indicating that the actual data values are extremely important for accuracy.
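A minimal sketch of the data-dependent write-time model described above, assuming 2-bit MLC cells, one write driver per cell, and hypothetical mean SET-pulse counts per data pattern (the actual parameters used later are in Table 2.1). The latency of a 64-bit word is that of its slowest cell.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Assumed mean SET-pulse counts for the 2-bit patterns 00, 01, 10, 11 (illustrative).
static const double kMeanSetPulses[4] = {0.0, 7.0, 5.0, 1.0};

// Effective write latency (in pulses) for a 64-bit word of 2-bit MLC cells.
// All 32 cells are programmed in parallel, so the word takes as long as its
// slowest cell; each per-cell pulse count is drawn from a Gaussian distribution.
double WordWritePulses(uint64_t data, double stddev, std::mt19937 &rng) {
    double worst = 0.0;
    for (int cell = 0; cell < 32; ++cell) {
        unsigned pattern = static_cast<unsigned>((data >> (2 * cell)) & 0x3);
        std::normal_distribution<double> pulses(kMeanSetPulses[pattern], stddev);
        worst = std::max(worst, std::max(0.0, pulses(rng)));
    }
    return worst;
}
```

Under such a model a word of all zeros completes almost immediately, while a single cell holding pattern 01 stretches the whole word to roughly seven pulses, which is why data values dominate the estimate.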

2.3 Implementation

There are multiple research interests in non-volatile memory that currently have modeling difficulties. Prior NVM simulations have been done by simply reusing existing DRAM simulators with NVM timing parameters. Nonetheless, such simulation is unable to capture the unique features of NVMs, which require correct endurance modeling, fault recovery, and MLC operation with high accuracy in terms of energy and latency. In addition, a hybrid system that has both NVM and DRAM, each with their own timings, is also becoming increasingly popular but requires more flexible simulators to be modeled easily.

2.3.1 Energy Modeling

Existing DRAM memory simulators utilize published current numbers (IDD values) measured on an actual device. This approach is problematic when published data is not available, as is commonly the case for non-volatile memories and prototype memory system designs. In order to work around this problem, we provide two device-level energy models in NVMain 2.0, Current-mode and Energy-mode, and provide independent power calculations for the remaining modules in the memory subsystem. The Current-mode model operates similarly to standard DRAM simulators using IDD values for power calculation and is applicable to simulated DRAM systems or the DRAM portions of hybrid memory systems. Alternatively, Energy-mode allows the use of values readily obtained from circuit-level simulators such as NVSim [52] or CACTI [53]. Each operation increments the system energy usage. Standby and powerdown energies are also calculated using the simulated leakage results.
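The sketch below illustrates how Energy-mode accounting can work: per-operation energies taken from a circuit-level tool such as NVSim or CACTI are accumulated as commands are issued, while leakage accrues with simulated time. The structure and numbers are illustrative and do not reproduce NVMain's actual interface.

```cpp
#include <cstdint>

// Illustrative per-operation energies (pJ) and leakage power (mW); in practice
// these would come from an NVSim/CACTI result for the simulated technology.
struct EnergyParams {
    double activatePJ = 1200.0;
    double readPJ     = 350.0;
    double writePJ    = 900.0;
    double leakageMW  = 35.0;
};

class EnergyModel {
public:
    explicit EnergyModel(const EnergyParams &p) : p_(p) {}

    void OnActivate() { dynamicPJ_ += p_.activatePJ; }
    void OnRead()     { dynamicPJ_ += p_.readPJ; }
    void OnWrite()    { dynamicPJ_ += p_.writePJ; }

    // Background (standby/powerdown) energy accrues with elapsed simulated time.
    void Tick(uint64_t cycles, double nsPerCycle) {
        backgroundPJ_ += p_.leakageMW * cycles * nsPerCycle;  // mW * ns = pJ
    }

    double TotalPJ() const { return dynamicPJ_ + backgroundPJ_; }

private:
    EnergyParams p_;
    double dynamicPJ_ = 0.0;
    double backgroundPJ_ = 0.0;
};
```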

2.3.2 Non-volatile Memory Support

Non-volatile memory requires concepts outside the realm of DRAM. DRAM simulators typically ignore the data being written, since timing and energy parameters are agnostic to these values. In contrast, innovations in endurance improvements, fault recovery mechanisms, and MLC are commonly studied in NVM systems. Although this work is not designed for comparing endurance mechanisms, many of these techniques employ some form of data encoding, which changes the data value ultimately written to the memory cells. We show in Section 2.4.1 how important data values can be and posit that simulating any data encoding for endurance or fault recovery mechanisms is very beneficial.


Figure 2.2: Overview of NVMain Architecture. Only one memory controller with one channel is shown.

2.3.3 Fine-grained Memory Architecture

In NVMain 2.0, a bank is no longer the most basic memory object. Instead, sub-arrays are defined as the basic blocks to support various sub-array-level parallelism (SALP) [38] modes seamlessly. The sub-array is selected similarly to other memory objects using the address translator. This object also contains the MLC write, endurance, and fault models. NVMain 2.0 also allows for fine-grained refresh, including all-bank refresh as in DDR, per-bank refresh as in LPDDR, or bank-group refresh as in DDR4 [20].

2.3.4 Memory System Flexibility

Figure 2.2 shows the high-level design of NVMain [54]. In the figure, each box represents a memory base object. A distributed timing model is used whereby all timings related to a specific memory object are tracked by that object itself. For example, the Sub-Array tracks the most commonly found memory timing parameters (e.g., tRCD, tCAS, tRP), while the rank objects keep track of rank timing parameters (e.g., tRTRS). All timing parameters are considered before any memory command can be issued and the worst-case value is taken. Therefore, the distributed timing model is functionally equivalent to other simulators using monolithic memory controllers which track all timing values. We leverage this distributed approach to introduce two new concepts in NVMain 2.0: Advanced Address Translators and Memory Object Hooks. First, we change the flow of requests between memory objects to always invoke the address translator. Second, we allow for "plugin" memory objects called hooks that can snoop in-flight requests and potentially change request flow.
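The following sketch illustrates the distributed timing idea: each memory object knows only its own constraints, and the controller issues a command only when every object along the path agrees. This is a simplified illustration of the concept, not NVMain's actual class hierarchy.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

enum class Op { Activate, Read, Write, Precharge };

// Each object (sub-array, bank, rank, interconnect, ...) tracks only its own timing.
class MemoryObject {
public:
    virtual ~MemoryObject() = default;
    virtual uint64_t NextIssuable(Op op) const = 0;  // earliest cycle op may issue here
    virtual void Issue(Op op, uint64_t cycle) = 0;   // update this object's local state
};

// The controller asks every object on the request's path; the worst case wins.
bool TryIssue(std::vector<MemoryObject*> &path, Op op, uint64_t now) {
    uint64_t earliest = 0;
    for (const MemoryObject *obj : path)
        earliest = std::max(earliest, obj->NextIssuable(op));
    if (earliest > now)
        return false;                 // some constraint (e.g., sub-array tRCD) not met
    for (MemoryObject *obj : path)
        obj->Issue(op, now);          // commit the command to each object
    return true;
}
```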

Address Translators Typically, the address translator takes a memory address and extracts specific bit values to ultimately determine the destination bank of a request. In this work, we further generalize this concept by simply defining the address translator as a function taking a memory address and returning the destination bank. An example system utilizing this change is presented in Section 2.4.2, where a hybrid memory system with a hardware-based page migrator is implemented. In this particular case, the address is translated as normal and in parallel the address is checked against the migrated pages list. If found, the channel is changed to the "fast" memory and the request is automatically rerouted to that channel's memory controller.
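A sketch of the generalized address translator: a plain function from address to destination, with a migrated-page table consulted alongside the normal translation so that hot pages are rerouted to the fast-memory channel. The class and field names here are assumptions for illustration, not NVMain's API.

```cpp
#include <cstdint>
#include <unordered_set>

struct Destination { unsigned channel, rank, bank; };

class MigratingTranslator {
public:
    MigratingTranslator(unsigned fastChannel, unsigned pageShift)
        : fastChannel_(fastChannel), pageShift_(pageShift) {}

    // Normal bit-slicing translation plus a lookup in the migrated-page table.
    Destination Translate(uint64_t addr) const {
        Destination d = BaseTranslate(addr);
        if (migratedPages_.count(addr >> pageShift_))
            d.channel = fastChannel_;   // reroute migrated ("hot") pages to fast memory
        return d;
    }

    void RecordMigration(uint64_t addr) { migratedPages_.insert(addr >> pageShift_); }

private:
    // Simple illustrative field extraction; a real translator is fully configurable.
    static Destination BaseTranslate(uint64_t addr) {
        return { unsigned((addr >> 6) & 0x3),
                 unsigned((addr >> 8) & 0x1),
                 unsigned((addr >> 9) & 0x7) };
    }

    unsigned fastChannel_, pageShift_;
    std::unordered_set<uint64_t> migratedPages_;
};
```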

Memory Object Hooks Hooks are external memory objects which can snoop on requests arriving or returning from particular memory objects. For example, one may want to inspect all memory requests arriving at a specific rank or bank object. In this work, hooks are used to inspect memory commands arriving at each rank in order to provide output suitable for verification against Micron’s model [55] or DRAMPower2 [23], which require input at the rank level. Inspection at the bank level is useful for visualization purposes, where we wish to instrument a specific bank and observe memory requests in action.
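A sketch of the hook concept: an object that observes requests flowing through another memory object without owning them, used here to emit a rank-level command trace of the kind an external power model could consume. The interface names are illustrative.

```cpp
#include <cstdint>
#include <cstdio>

enum class Command { ACT, RD, WR, PRE, REF };

struct Request {
    Command cmd;
    uint64_t addr;
    unsigned rank, bank, row;
};

// A hook snoops requests arriving at (or returning from) a memory object.
class Hook {
public:
    virtual ~Hook() = default;
    virtual void NotifyIssue(const Request &req, uint64_t cycle) = 0;
};

// Example hook: print one trace line per command seen at a given rank.
class RankTraceHook : public Hook {
public:
    explicit RankTraceHook(unsigned rank) : rank_(rank) {}

    void NotifyIssue(const Request &req, uint64_t cycle) override {
        if (req.rank != rank_)
            return;
        static const char *names[] = {"ACT", "RD", "WR", "PRE", "REF"};
        std::printf("%llu,%s,%u,%u\n", (unsigned long long)cycle,
                    names[static_cast<int>(req.cmd)], req.bank, req.row);
    }

private:
    unsigned rank_;
};
```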

Other Memory Objects Due to the distributed nature of the simulator design, it is possible to implement many different types of memory objects that can be selected at simulation time. Several of these memory objects are provided with the NVMain 2.0 distribution, including but not limited to FCFS, FRFCFS, FRFCFS-with-write-queue, and DRAM cache memory controllers; transmission-line and TSV-based interconnect models; and DDR and LPDDR2-N style banks. These objects exist solely as a quick start and as reference material for implementing new memory objects. More details are available at http://nvmain.org/.

2.3.5 Verification

In order to verify the correctness of NVMain 2.0, we show the validation results in this section. First, the verification for DDR3-like timings was performed against Verilog models available from Micron [55]. Then, we use the DRAMPower2 simulator [23] to obtain energy numbers and compare them with NVMain 2.0 in both Current-mode and Energy-mode. Next, the modeling of endurance data encoders and real MLC behavior requires proper data verification. Finally, we show our simulator's speed.

2.3.6 Timing Verification

We re-verified the timing model in NVMain 2.0 by using the Verilog model [55] as in NVMain 1.0. A simple DRAM-only system is simulated with an FRFCFS memory controller. Hooks are used to output requests arriving at each rank in Verilog format. We use ModelSim to run the Verilog model and checked the transcript for violations. All memory behaviors matched and no timing violation was found. The instructions for re-running verification are included with the NVMain 2.0 distribution.


Figure 2.3: Calculated memory subsystem power of NVMain normalized to DRAMSim2.

2.3.7 Energy Verification

We compare Current-mode against DRAMSim2 by running both in gem5. NVMain is configured as closely as possible to DRAMSim2, with DRAM timings and an FCFS memory controller in use. Results of the comparison are shown in Figure 2.3 (NVMain-C series). Discrepancies in the comparison are mostly due to differences in the implementation of memory controller scheduling. NVMain schedules requests more frequently, resulting in a higher activity factor. Energy-mode is verified against DRAMPower2 [23] with rank-level traces generated by hooks. We use DRAMPower2 to estimate energy values for each memory operation and compare them with Energy-mode and Current-mode results, respectively. Results of this verification are shown in the NVMain-E series.

2.3.8 Data Verification

To verify the correctness of data manipulation, we design a system with no caches and develop a small micro-benchmark that writes known values to large memory arrays repeatedly. We then check whether data is correctly read from simulator memory or the trace file by comparing the observed data with the known values.

2.3.9 Simulation Speed

NVMain's simulation speed is measured by profiling gem5 for a fixed number of memory requests. The percentage of time spent in the memory subsystem is shown in Figure 2.4, normalized to DRAMSim2. We also evaluate an MLC memory with the exact approach described in Section 2.4.1. The results show that NVMain can outperform DRAMSim2 in most tests, even when data operation is strictly modeled. Note that the memory scheduling in NVMain has not been fully optimized for simulation speed. We expect to further improve the speed in the near future.


Figure 2.4: Percentage of simulation time spent in memory subsystem.


2.4 Case Studies

2.4.1 MLC Simulation Accuracy

Simple models for measuring performance in non-volatile memory can be designed by encapsulating the non-volatile memory timings as DRAM timings, either through tRP or tWR. These timing modifications become more complicated with MLC memories. Worst-case timings for MLC can be 7-8 times larger than SLC timings, due to the program-and-verify approach used. We demonstrate the possible inaccuracy when simulating exact and estimated MLC timings for a 2-bit MLC memory. The estimated timing value assumes 3 SET pulses based on the average SETs across Table 2.1. At runtime NVMain inspects the data being written to memory and counts the number of each bit pattern1. Each combination of 2-bit values is given a programming time using a Gaussian distribution based on the mean in Table 2.1 and a standard deviation of 1. Our write model assumes enough write drivers for the entire word. Thus, write time is the maximum programming time of any cell in the entire word. The results in Figure 2.5 show the absolute error of IPC values between the estimated and exact approaches in the first column. gobmk and zeusmp performance is underestimated, indicated by black. The remaining benchmarks are overestimated, indicated by gray. We ran four simulations of the same approach with maximum set pulse deviation ranging from 0 to 3.

1 The counting technique used is highly optimized and utilizes simple bitwise operations to perform each count.

Table 2.1: Simulated MLC Timing Parameters

Value   Mean SETs
00      0
01      7
10      5
11      1


Figure 2.5: Absolute error of an estimated write-pulse time compared to exact measurement based on data values.

The maximum error compared to 0 deviation was 7.5% at 3 standard deviations. This shows that even allowing the write pulse count to deviate by as much as 50%, the results are much more disparate than the estimated technique. The second column shows the average number of write cycles and the estimated value. Comparing with the first column, we can see a strong correlation between the misestimate of write cycles and the IPC error. Inaccuracy is directly related to the distribution of 2-bit data values being written to the MLC cells. The distribution is heavily skewed in most benchmarks, and there is no clear pattern, as shown in Figure 2.6. Even benchmarks with as little as 10% combined 01/10 values begin to overestimate IPC. These results strongly motivate the need for data-aware simulation, particularly when utilizing MLC cells.
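The per-pattern counting mentioned in the footnote can be done with a handful of shifts, masks, and population counts rather than a loop over cells. The sketch below counts how many of the 32 two-bit cells in a 64-bit word hold each pattern; it is one possible implementation of the idea, not necessarily the one used in NVMain.

```cpp
#include <array>
#include <bit>      // std::popcount (C++20)
#include <cstdint>

// Count occurrences of each 2-bit pattern (00, 01, 10, 11) among the 32 cells
// of a 64-bit word using only bitwise operations.
std::array<int, 4> CountPatterns(uint64_t word) {
    const uint64_t even = 0x5555555555555555ULL;   // low bit position of every cell
    const uint64_t lo = word & even;               // low bit of each cell
    const uint64_t hi = (word >> 1) & even;        // high bit, aligned to low positions

    std::array<int, 4> count{};
    count[3] = std::popcount(hi & lo);             // cells equal to 11
    count[2] = std::popcount(hi & ~lo);            // cells equal to 10
    count[1] = std::popcount(~hi & lo);            // cells equal to 01
    count[0] = 32 - count[1] - count[2] - count[3];
    return count;
}
```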

2.4.2 Hybrid Memory System

We implemented a hybrid memory system similar to [56], which uses a hardware-based page migrator, to demonstrate the usefulness of advanced memory controllers and memory hooks. This simulated system uses a memory controller which employs a lookup table to find the destination of migrated pages and launches migrations accordingly. A memory hook is used to snoop on requests arriving at the memory controllers. For simplicity, this implementation uses a biased coin to decide whether or not to migrate a page. Each time a request is issued, there is a small chance it will be migrated to another memory channel designated as "fast" memory.


Figure 2.6: Frequency of 2-bit data values written to MLC cells for various SPEC2006 benchmarks.

Figure 2.7: IPC results and migration statistics for a hybrid memory.

Each time a request is issued, there is a small chance it will be migrated to another memory channel designated as "fast" memory. In theory, if a page is truly a "hot page", it will eventually be migrated to fast memory. In this example, DRAM is used as fast memory and the remaining channels use NVM as slow memory. Migrations are performed using a single swap buffer. Our hook object injects real read requests to fill this buffer2. Once the buffer is full, real write requests are issued to the memory controllers. After the writes complete, the buffer is free for another swap to take place. The results in Figure 2.7 show a sweep of migration probabilities ranging from 0.5% to 32%. Each result is the mean value of ten SPEC2006 benchmarks. Using this simple technique, IPC is improved by between 19% and 33%. The last column shows the percentage of migrations using the right y-axis. Since there is only one swap buffer to exchange requests, the actual number of requests that result in a migration (denoted "Migration %") does not reach 1%, and thus the increase in memory traffic due to migrations is minimized.

2Requests may be served directly from the buffer during migration.


Figure 2.8: IPC results of DRAM Cache with varying prediction accuracy.

Address translators and hooks may also load new statistics and print out results if necessary. Using this mechanism, we collect the number of requests which could not be migrated because they are already in fast memory. This number increases as the migration probability becomes more aggressive, since the data being used by the application eventually becomes entirely migrated to fast memory. In Figure 2.7, "Waits" shows the percentage of migrations that could not start because another migration was in progress. As expected, it increases with the migration probability and saturates around 19%.
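As a rough illustration of the hook just described, the sketch below flips a biased coin on every request and starts a swap only when the single swap buffer is free. The class name, the callback for looking up a page's current channel, and the statistics fields are hypothetical stand-ins, not the NVMain hook interface.

    import random

    class MigrationHook:
        """Snoops requests at the memory controller and probabilistically migrates
        the touched page to the fast (DRAM) channel; one swap buffer serializes
        migrations, as in Section 2.4.2 (illustrative sketch only)."""

        def __init__(self, probability, fast_channel=0):
            self.probability = probability          # e.g. 0.005 to 0.32, as in Figure 2.7
            self.fast_channel = fast_channel
            self.swap_busy = False                  # only a single swap buffer exists
            self.stats = {"migrations": 0, "waits": 0, "already_fast": 0}

        def on_request(self, page, channel_of):
            """Called for every request; channel_of maps a page to its current channel."""
            if random.random() >= self.probability:
                return                              # biased coin says: do not migrate
            if channel_of(page) == self.fast_channel:
                self.stats["already_fast"] += 1     # page already lives in fast memory
            elif self.swap_busy:
                self.stats["waits"] += 1            # another migration is still in progress
            else:
                self.swap_busy = True               # fill the swap buffer with reads,
                self.stats["migrations"] += 1       # then write both pages back out

        def on_swap_complete(self):
            self.swap_busy = False                  # buffer is free for the next swap

Sweeping the probability argument from 0.005 to 0.32 corresponds to the 0.5%-32% range reported in Figure 2.7.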

2.4.3 DRAM Cache

Another example of a more complex memory system is one utilizing a giga-scale DRAM Cache. Our implementation of such a system involves a memory controller which creates a new root system that acts as backing memory. Normally, all requests go to the DRAM Cache first and are routed to backing memory on a miss. The address translator is now an access predictor [57] which determines whether data reside in the DRAM Cache. If the predictor guesses that the data are not in the DRAM Cache, the request bypasses the DRAM Cache and is sent directly to backing memory3. In this work we sweep over a range of predictor accuracies from 60%-100%, including the case with no predictor, to evaluate the impact of prediction accuracy. The results are shown in Figure 2.8. The baseline "no DRC" stands for a standard off-chip memory system. Interestingly, gobmk, mcf, soplex, and most notably libquantum even drop below the performance of a prediction-less scheme if accuracy is too low. Therefore, we observe that the prediction accuracy must be quite high to ensure that the predictor actually helps performance. Figure 2.9 also reveals that hit rate is not the major factor in this performance loss.

3This bypass is not allowed for dirty data in the DRAM Cache.


Figure 2.9: Predictive DRAM Cache hit rate at various accuracies.

Instead, the problem occurs because the DRAM Cache does not see the requests that go directly to backing memory4. Furthermore, as prediction accuracy decreases, more mispredictions cause non-trivial memory traffic to both the DRAM Cache and backing memory, which in turn increases average access times.
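The routing decision in this case study can be summarized by the short sketch below; the predictor, cache, and memory objects and their method names are placeholders for illustration, and the dirty-data restriction follows the footnote above.

    def route_read(addr, predictor, dram_cache, backing_memory, may_be_dirty):
        """Send a read either to the DRAM Cache or straight to backing memory,
        following the predicted-miss bypass described above (illustrative sketch)."""
        if not predictor.predict_hit(addr) and not may_be_dirty(addr):
            # Predicted miss on clean data: skip the DRAM Cache entirely.
            return backing_memory.issue_read(addr)
        # Predicted hit, or possibly dirty data: consult the DRAM Cache;
        # an actual miss there is forwarded to backing memory as usual.
        return dram_cache.issue_read(addr)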

2.5 Conclusions

In this chapter we introduced several techniques to provide more accurate simulation of non-volatile memories. We implement an energy model based on simulated results and values rather than empirical measurements, and we verify the results against existing power modeling tools. We show that estimating NVM timings by encapsulating them in DRAM timings is not sufficient to provide accurate performance comparisons. In the future, we can use write latency and energy models other than the 2-bit MLC model provided; for example, we can simulate a program-and-verify technique for reliability-aware modeling of cross-point RRAM at the architectural level. We believe this simulator is highly beneficial to the research community and has become a repository of reference implementations for memory research proposals.

4This is specific to our implementation and is easily modified.

Chapter 3

Bank-level Modeling of 3D-stacked NVM and Embedded DRAM

3.1 Introduction

The previous chapter demonstrated the importance of knowing data values when simulating memory, especially for main memories utilizing multi-level cells (MLC). Data values are not only important for performance; energy can also be determined more accurately by combining data value knowledge with simulated energy estimates for writing each of the possible data values (i.e., the possible values of an MLC cell). In this chapter we describe work which extends prior work in order to allow estimation of MLC-type devices and other devices not currently supported by existing simulators.

Recent trends of increasing system core count and the memory bandwidth bottleneck have necessitated the use of large on-chip caches. For example, Intel's Ivytown processor has a 37.5MB SRAM LLC [58]. To overcome the limitations of SRAM, viz. high leakage power and low density, researchers and designers have explored alternate memory technologies, such as eDRAM, STT-RAM, RRAM and PCM. These technologies enable the design of large caches; for example, Intel's Haswell processor employs a 22nm 128MB L4 eDRAM cache [59] and IBM's POWER8 processor uses a 22nm 96MB L3 eDRAM cache [60]. In parallel, research has also been directed toward novel fabrication techniques such as 3D integration, which enables vertical stacking of multiple layers [61]. 3D stacking provides several benefits such as high density, the ability to integrate diverse memory technologies for designing hybrid caches, and higher flexibility in routing signals, power and clock.

Lack of comprehensive and validated modeling tools, however, hinders full study of emerging memory technologies and design approaches. Although a few modeling tools exist, they model only a subset of memory technologies; for example, NVSim [62] models only 2D designs of SRAM and NVMs but not eDRAM.

As an increasing number of commercial designs utilize 3D stacking [63, 64], architecture- and system-level research on 3D stacking has become even more important. A few 3D modeling tools exist, such as CACTI-3DD [27] and 3DCacti [28]; however, they do not model NVMs. Since different tools use different modeling frameworks and assumptions, comparing the estimates obtained from different tools may be incorrect. Also, tools such as 3DCacti do not model recent technology nodes (e.g., 32nm). Thus, a single, validated tool which can model both 2D and 3D designs using prominent memory technologies is lacking. Because of this, several architecture-level studies on 3D caches (e.g., [65]) derive their parameters using a linear extrapolation of 2D parameters, which may be inaccurate or sub-optimal.

In this chapter, we present DESTINY, a 3D design-space exploration tool for SRAM, eDRAM and non-volatile memories. DESTINY utilizes the 2D circuit-level modeling framework of NVSim for SRAM and NVMs. It also utilizes the coarse- and fine-grained TSV (through-silicon via) models from CACTI-3DD. Further, DESTINY adds a model of eDRAM and two additional types of 3D designs. Overall, DESTINY enables modeling of both 2D and 3D designs of five memory technologies (SRAM, eDRAM and three NVMs). It is also able to model technology nodes ranging from 22nm to 180nm.

We have compared the results from DESTINY against several commercial prototypes [64, 66, 67, 63, 68, 69, 70] to validate the 2D design of eDRAM and the 3D designs of SRAM, eDRAM and RRAM in DESTINY (Section 3.4). We observe that the modeling error is less than 10% for most cases and less than 20% for all cases. This can be considered reasonable for an academic modeling tool and is also in the range of the errors produced by previous tools [62].

DESTINY provides the capability to explore a large design space, which provides important insights and is also useful for early-stage estimation of emerging memory technologies. For example, while it may be straightforward to deduce the optimal memory technology for some parameters (e.g., the technology with the smallest cell size is likely to have the lowest area), this may not be easy for other parameters such as read/write EDP (energy-delay product), since they depend on multiple factors. Clearly, use of a tool such as DESTINY is imperative for full design space exploration and optimization, for example, finding the optimal memory technology or number of 3D-stacked layers for a given optimization target. We conclude this chapter by highlighting possible extensions to DESTINY.

3.2 Motivation

3.2.1 Emerging Memory Technologies

We now briefly discuss the mechanism of each memory technology. For more details, we refer the reader to previous work [71, 62].

STT-RAM: STT-RAM utilizes a magnetic tunnel junction (MTJ) as the primary memory storage. An MTJ consists of two ferromagnetic layers. The reference layer has a fixed magnetic polarization, while the free layer has a programmable magnetic polarization. Current passing through the MTJ allows the free layer to change polarization. The MTJ resistance is low when both layers are polarized in the same direction, while polarization in opposite directions yields high resistance. These two resistance values are used to represent the "1" and "0" states, respectively.

RRAM: RRAM uses a metal oxide material between two metal electrodes to store values. The value depends on the concentration of oxygen vacancies in the metal oxide. Applying current to the two electrodes can move these oxygen vacancies to either form or break down a filament which allows for high conductance in the metal oxide. A filament formed by oxygen vacancies has a low-resistance state, representing a "1". When a filament is broken down, there is a small concentration of oxygen vacancies and thus a high-resistance state, representing a "0".

PCM: PCM uses a chalcogenide material such as GeSbTe (GST) for data storage. The GST can be changed between crystalline and amorphous phases by heating the material for certain periods of time. A "SET" operation crystallizes the GST by applying a medium temperature (300°C) for a relatively long period of time (150ns). This allows the material to move and restructure itself into crystalline form. A "RESET" operation returns the material to an amorphous phase by applying a high temperature (800°C) for a shorter period of time (100ns) and then quickly removing the heat. This causes the material to melt and remain in an amorphous phase when cooled. The crystalline phase is low resistance, corresponding to a "1" bit, while the amorphous phase is high resistance, corresponding to a "0" bit.

eDRAM: In eDRAM, the data are stored as charge in a capacitor, which is either a deep-trench capacitor or a stacked capacitor between metal wire layers on a die. Access is controlled using a single transistor with the capacitor connected to the drain terminal. The gate of the transistor is used to access the device, while the source terminal is used to read or write the capacitor. Typically, a charged capacitor represents a "1" while a discharged capacitor represents a "0".

3.2.2 Modeling Tools

A few existing tools provide modeling capability individually for different memory technologies, such as SRAM, DRAM, eDRAM, and NVMs. The CACTI tool [24] simulates SRAM caches and has been extended to support eDRAM and DRAM. Several improvements have also been made to CACTI to improve its modeling capability and accuracy. Mamidipaka and Dutt propose eCACTI [25], which adds a leakage model to CACTI, and Li et al. [26] propose CACTI-P, which models low-power caches (e.g., caches with sleep transistors). Chen et al. present CACTI-3DD [27], which adds a TSV model for DRAM memory; however, this tool is designed for DRAM and hence does not allow accurate modeling of 3D SRAM caches. 3DCACTI [28] provides the ability to model 3D SRAM, however this tool has not been updated to support technology nodes below 45nm. None of these tools model emerging NVMs. NVSim provides 2D modeling of SRAM, RRAM, STT-RAM, PCM and SLC NAND Flash. However, none of these tools provide the comprehensive design space exploration capability provided by DESTINY.


Figure 3.1: High-level overview of the DESTINY framework. Configurations are generated from the extended input model and fed to the NVSim core. Results are fine-tuned via the 3D model and filtered by the optimization target to yield the output results.

3.3 Model Implementation

DESTINY is designed to be a comprehensive tool able to model multiple memory technologies. Figure 3.1 shows a high-level diagram of DESTINY. The DESTINY framework utilizes the 2D circuit-level model of NVSim, which we extended to model 2D eDRAM and 3D designs of SRAM, eDRAM and monolithic NVMs. For a given memory technology, the device-level parameters (e.g., cell size, set voltage, reset voltage) are fed to DESTINY. Then, possible configurations are generated and passed to the circuit-level modeling code. For 3D designs, 3D modeling is also performed, and the configurations include different numbers of 3D layers (e.g., 1, 2, 4, 8, 16). Designs which are physically infeasible are considered invalid and are discarded; for example, if the refresh period of an eDRAM design is greater than its retention period, it is considered invalid. This reduces the number of possible options. The remaining configurations are passed through an optimization filter which selects the optimal configuration based on a target such as lowest read latency or smallest area.

As we show in Section 3.5.1, DESTINY also allows design space exploration across multiple memory technologies, for example, finding an optimal technology for a given metric or target. For such use cases, device-level parameters for multiple memory technologies can be fed as input to DESTINY (shown on the left of Figure 3.1). Using these, the best results for each technology are found, and these are further compared to find the optimal memory technology. A similar approach is also used for finding the optimal layer count for a given target (Section 3.5.2). We now discuss the specific extensions made in the DESTINY modeling framework.
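Before turning to those extensions, the overall flow of Figure 3.1 can be summarized as a brute-force search over generated configurations. The sketch below is our own simplified rendering, with the generator, evaluator, and validity callbacks left as hypothetical parameters rather than actual DESTINY interfaces.

    def explore(technologies, layer_counts, target, generate, evaluate, is_valid):
        """Enumerate candidate organizations per technology and layer count, discard
        infeasible ones, and keep the best result for the chosen optimization target."""
        best = None
        for tech in technologies:
            for layers in layer_counts:                    # e.g. 1, 2, 4, 8, 16
                for config in generate(tech, layers):      # candidate organizations
                    result = evaluate(config)              # circuit-level + 3D model
                    if not is_valid(result):               # e.g. refresh > retention
                        continue
                    if best is None or result[target] < best[target]:
                        best = result                      # smaller is better here
        return best

Running the same search once per technology or once per layer count is essentially how the case studies in Sections 3.5.1 and 3.5.2 are organized.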

3.3.1 eDRAM Model

NVSim provides an incomplete eDRAM model which has also not been validated against any prototype. To enable modeling of eDRAM, we separate the peripheral and device logic to simulate multiple types of technologies. It is well known that eDRAM requires refresh operations to maintain its data, and typical retention periods range from 40µs to 100µs [64, 63] for temperatures in the range of 380K. We implemented a refresh model based on Kirihata et al. [72], whereby all subarrays are refreshed in parallel, row by row. This approach provides the benefit that the refresh operations do not significantly reduce the availability of banks to service requests. DESTINY can also be extended to model other refreshing schemes, such as refreshing the mats in parallel. From a performance and feasibility perspective, eDRAM cache designs whose bank availability (i.e., the percentage of time the bank is not refreshing) falls below a threshold are not desirable and hence are discarded by DESTINY. Since the retention period of eDRAM varies exponentially with temperature [73], DESTINY scales the retention period accordingly to model the effect of temperature. DESTINY computes the refresh latency, energy, and power, which are provided as outputs of the tool.
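A back-of-the-envelope version of the availability check is sketched below; the temperature-scaling constant and the example numbers are illustrative placeholders, not the fitted parameters used inside DESTINY.

    import math

    def retention_at(temp_k, base_retention_s, base_temp_k=380.0, scale_k=10.0):
        """Retention shrinks roughly exponentially with temperature;
        scale_k is a placeholder constant, not a calibrated value."""
        return base_retention_s * math.exp(-(temp_k - base_temp_k) / scale_k)

    def bank_availability(rows_per_subarray, row_refresh_s, retention_s):
        """Fraction of time a bank is not refreshing when all subarrays are
        refreshed in parallel, row by row, once per retention period."""
        refresh_time = rows_per_subarray * row_refresh_s
        return max(0.0, 1.0 - refresh_time / retention_s)

    # Example: 512 rows per subarray, 5 ns per row refresh, 40 us base retention.
    print(bank_availability(512, 5e-9, retention_at(390.0, 40e-6)))

Designs whose computed availability falls below a chosen threshold would then be discarded, mirroring the filtering step described above.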

3.3.2 3D Model

Several flavors of 3D stacking have been explored in the literature, such as face-to-face, face-to-back, and monolithic stacking [74, 75]. In all of these approaches except monolithic stacking, dies are bonded using various techniques (e.g., wafer-to-wafer, die-to-wafer, or die-to-die bonding). These approaches differ in terms of their effect on die testing and yield. Wafer-to-wafer bonding can potentially reduce yield by bonding a dysfunctional die anywhere in the stack. Die-to-wafer and die-to-die bonding can reduce this risk by testing individual dies, although this has an adverse effect on alignment.

The most common form of 3D stacking is known as face-to-back bonding. In this form, through-silicon vias (TSVs) are used to penetrate the bulk silicon and connect the first metal layer (the back) to the top metal layer (the face) of a second die. In face-to-face bonding, the top metal layer of one die is directly fused to the top metal layer of a second die. Monolithic stacking does not use TSVs at all. Instead, monolithically stacked dies build devices on higher metal layers, connected using normal metal-layer vias if necessary.

Each approach has its own advantages and disadvantages. Face-to-back bonding must carefully consider placement and avoid transistors when vias are formed through the bulk silicon, while face-to-face bonding does not. Therefore, face-to-face bonding has the potential for higher via density. The downside of face-to-face bonding is that only two layers can be formed in this manner. Monolithic stacking benefits from the highest via density; however, this technique cannot be applied in a design which requires transistors to be formed on higher layers, since this can destroy previously formed transistors.

The 3D model of DESTINY facilitates all of the aforementioned types of 3D stacking. Separate from this, the granularity at which TSVs are placed can be either coarse- or fine-grained, similar to the approach in CACTI-3DD [27]. This granularity defines how many TSVs are placed and what portions of a cache (e.g., peripheral circuits or memory cells) reside on different dies. We utilize these models in our validation. First, a model for direct die-to-die stacking with face-to-face bonding is provided [70]. Second, the monolithic stacking model for 3D horizontal RRAM is provided [68]. Face-to-back bonding using TSVs is utilized elsewhere [69, 66]. A few previous works (e.g., [27]) assume that TSVs in face-to-back bonding are buffered, which may lead to redundant buffering in some designs and increases the latency and energy overhead of the TSVs.

This overhead may be acceptable in large DRAMs such as those modeled in CACTI-3DD, but it is unacceptable in caches, which are relatively smaller in size, and it becomes increasingly significant with smaller memory macro designs. Further, several memory peripheral components already provide full-swing logic signals which do not require extra buffering. In our work, we provide a TSV model which may act as a buffered or unbuffered TSV as well as the vias used in face-to-face bonding.

With an increasing number of layers, the number of memory mats in each layer is reduced and hence we need to select a scheme for folding the memory banks. Our coarse- and fine-grained models assume a simplistic folding scheme, where the mats are divided equally among all the layers. In the coarse-grained model, TSVs are used to broadcast undecoded row and column select signals to all layers at once. One logic layer is assumed to provide the output in this model over a shared TSV bus spanning all layers. The fine-grained model differs by broadcasting decoded row and column signals to all layers. It is assumed that a dedicated logic layer is used for all pre-decoder units. The resulting design uses more TSVs, but its area and latency may be reduced.

Monolithically stacked horizontal RRAM (HRRAM) [68] uses a concept similar to cross-point designs (see Figure 3.2). The limitations of cross-point designs, however, are the increased sneak current and voltage drop associated with increasing subarray size. In a 3D design, this limitation becomes even more prominent and further limits the subarray size of 3D-stacked RRAM, as the sneak current can potentially flow into upper layers as well. To address this limitation, we extended the cross-point model in NVSim to account for the increased number of wordlines and bitlines in the third dimension. An example of HRRAM is shown in Figure 3.2 with 4 monolithically stacked layers. Monolithic stacking implies that there are no TSVs between memory layers. Instead, the memory cells are sandwiched between wordlines and bitlines, and additional layers are added in a manner similar to adding more metal layers. Our 3D RRAM model considers current flow between all inactive bitlines and wordlines when a single cell is read. This model dramatically reduces the number of valid designs when considering RRAM, so we cannot simply stack cross-point arrays which are considered optimal for a 2D design. Typically this effect can be slightly mitigated using diode-accessed cells as in [68]; hence, we provide the ability to model either a diode or no access device.

3.4 Validation Results

To show the accuracy of DESTINY, we validate it against several industrial prototypes. We first obtain the cache/macro configuration from the corresponding prototype papers and use it to set the device-level input parameters for DESTINY. We then compare the actual values against the values projected by DESTINY and report the percentage modeling error. As we show below, the modeling error is less than 10% in most cases and less than 20% in all cases.


Figure 3.2: 4-layer monolithically stacked RRAM. Storage elements are sandwiched directly between layers of wordlines (south-east orientation) and bitlines (south-west orientation).

3.4.1 3D SRAM Validation

We validate the 3D SRAM model of DESTINY against two previous works [69, 70] which utilize hSpice models to simulate the latency and energy of 3D-stacked SRAM caches. Hsu and Wu [69] sweep over various cache sizes ranging from 1MB to 16MB. Their work assumes stacking at the bank level; that is, a 2D planar cache containing N banks can be stacked into up to N layers. Since NVSim does not model banks, we only compare against the smallest cache size. Our proposed design assumes shared vertical bitlines, which corresponds to the fine-grained model in DESTINY. Analogous to their bank folding method, we assume a fixed configuration for two layers and fold along a single dimension in the bank layout to estimate the four-layer latency and energy. Our two-layer design assumes a 4×32 bank layout1. Based on the aspect ratio of our SRAM cells and the size of the subarray, this design attempts to keep the area square, which is likely the configuration of an hSpice model. The four-layer design folds along the number of mats per column, assuming a 4×16 bank layout. Table 3.1 shows the validation results of DESTINY against the 3D SRAM design of Hsu and Wu [69]. Notice that all errors are less than 4%.

Puttaswamy and Loh [70] explore the design space of 3D SRAM for the 65nm technology node. Their work considers a range of cache sizes from 16KB to 4MB. As explained above, we assume a fixed cache dimension for each cache size and fold the four-layer design in half to measure results.

1The bank layouts are specified as mat×mat, and subarray layouts are specified as wordline×bitline.

Table 3.1: Validation for 3D SRAM model.

Design             | Metric  | Actual  | Projected (DESTINY) | Error
1MB [69], 2 layers | Latency | 1.85 ns | 1.91 ns             | +3.54%
1MB [69], 2 layers | Energy  | 5.10 nJ | 5.05 nJ             | -0.98%
1MB [69], 4 layers | Latency | 1.75 ns | 1.80 ns             | +2.68%
1MB [69], 4 layers | Energy  | 4.5 nJ  | 4.51 nJ             | +0.18%
4MB [70], 2 layers | Latency | 7.85 ns | 7.23 ns             | -7.91%
4MB [70], 2 layers | Energy  | 0.13 nJ | 0.13 nJ             | -2.59%
4MB [70], 4 layers | Latency | 6.10 ns | 6.95 ns             | +14.03%
4MB [70], 4 layers | Energy  | 0.12 nJ | 0.13 nJ             | +4.75%
2MB [70], 2 layers | Latency | 5.77 ns | 5.78 ns             | +0.05%
2MB [70], 2 layers | Energy  | 0.12 nJ | 0.13 nJ             | +2.74%
2MB [70], 4 layers | Latency | 4.88 ns | 5.53 ns             | +13.5%
2MB [70], 4 layers | Energy  | 0.12 nJ | 0.13 nJ             | +8.46%
1MB [70], 2 layers | Latency | 3.95 ns | 3.90 ns             | -1.11%
1MB [70], 2 layers | Energy  | 0.11 nJ | 0.11 nJ             | -0.13%
1MB [70], 4 layers | Latency | 3.07 ns | 3.04 ns             | -0.85%
1MB [70], 4 layers | Energy  | 0.11 nJ | 0.11 nJ             | -0.89%

These validation results are also shown in Table 3.1. Clearly, all errors are less than 15%, and thus DESTINY is reasonably accurate in modeling 3D SRAM caches.

3.4.2 2D and 3D eDRAM Validation

As stated before, the eDRAM model in NVSim is incomplete and has not been validated. Hence, we validate both the 2D and 3D models of eDRAM. The prototype works referenced below typically provide information at the macro level rather than for a full bank. Macros are well suited for validation since they are memory-dense units (i.e., there is no test circuitry, ECC logic, etc.) and are closest to the modeling assumptions of DESTINY; hence, we compare against macros. Note that, for a fair comparison, we remove ECC, spare, and parity wordlines and bitlines, as these are not modeled in DESTINY.

Table 3.2: Validation of 2D and 3D eDRAM.

Design                | Metric  | Actual    | Projected (DESTINY) | Error
2D 2Mb [64], 65nm     | Latency | <2 ns     | 1.46 ns             | n/a
2D 2Mb [64], 65nm     | Area    | 0.665 mm² | 0.701 mm²           | +5.42%
2D 1Mb [63], 45nm     | Latency | 1.7 ns    | 1.73 ns             | +1.74%
2D 1Mb [63], 45nm     | Area    | 0.239 mm² | 0.234 mm²           | -2.34%
2D 2.25Mb [67], 45nm  | Latency | 1.8 ns    | 1.75 ns             | -2.86%
2D 2.25Mb [67], 45nm  | Area    | 0.420 mm² | 0.442 mm²           | +5.31%
3D 1Mb [66], 2 layers | Latency | <1.5 ns   | 1.42 ns             | n/a
3D 1Mb [66], 2 layers | Area    | 0.139 mm² | 0.149 mm²           | +9.32%

Barth and Reohr et al. [64] present a 65nm 2D eDRAM prototype. To validate against it, we use the 2Mb macro layout with a total of 8 subarrays and thus use a 1024×2048 bank organization. Klim et al. [67] and Barth and Plass et al. [63] present 45nm 2D eDRAM designs. We validate against them using the 256×1024 subarray layout that they use. From Table 3.2, it is clear that the modeling errors in the 2D eDRAM validation are less than 6% in all cases.

Golz et al. present a 3D eDRAM prototype with 2 layers in 32nm technology [66]. We use the 1Mb array as our validation target. Based on the 16Kb µArray size of 32×512 and the 1Mb layout, we assume two 1024×512 subarrays. From Table 3.2, the modeling error in area is less than 10%, and thus DESTINY can be considered reasonably accurate.

Table 3.3: Design space exploration results for determining the optimal memory technology for a desired optimization target (see Section 3.5.1). The table shows results for all parameters for comparison purposes.

Opt. Target   | Optimal Tech. | Area      | Read Latency | Write Latency | Read Energy | Write Energy | Read EDP | Write EDP
Area          | RRAM          | 0.95 µm²  | 46.66 ns     | 58.57 ns      | 0.24 nJ     | 0.14 nJ      | 11.11    | 8.00
Read Latency  | STT-RAM       | 6.61 µm²  | 2.78 ns      | 5.40 ns       | 1.12 nJ     | 0.74 nJ      | 3.10     | 4.01
Write Latency | eDRAM         | 16.95 µm² | 3.56 ns      | 1.76 ns       | 0.52 nJ     | 0.50 nJ      | 1.85     | 0.88
Read Energy   | eDRAM         | 11.83 µm² | 78.74 ns     | 40.21 ns      | 0.17 nJ     | 0.40 nJ      | 13.13    | 15.92
Write Energy  | SRAM          | 47.08 µm² | 126.84 ns    | 86.05 ns      | 2.40 nJ     | 0.02 nJ      | 304.02   | 1.67
Read EDP      | eDRAM         | 13.20 µm² | 3.92 ns      | 2.74 ns       | 0.29 nJ     | 0.35 nJ      | 1.15     | 0.95
Write EDP     | eDRAM         | 14.15 µm² | 3.28 ns      | 1.82 ns       | 0.38 nJ     | 0.35 nJ      | 1.24     | 0.64

3.4.3 3D RRAM Validation

Our final validation target is a monolithically stacked RRAM memory [68], also known as 3D horizontal RRAM2. In monolithically stacked designs, additional wordlines and bitlines are stacked directly by fabricating extra metal layers, with NVM cells used in place of vias. This type of design does not use TSV or flip-chip style bonding. Our validation therefore uses our more detailed model of a cross-point architecture spanning multiple layers. We configure the simulated memory as an 8Mb RAM. The design consists of 4 subarrays, each accessed in parallel with a 64-bit input bus. We again remove the ECC logic and specify two monolithically stacked layers per die, with one die total. The results of the validation are shown in Table 3.4.

Table 3.4: Validation of 3D RRAM.

Metric        | Actual [68] | Projected (DESTINY) | Error
Read Latency  | 25 ns       | 24.16 ns            | -3.37%
Write Latency | 17.20 ns    | 20.13 ns            | +17.05%

It is clear that the read latency projection of DESTINY is very close to the value reported in [68], while the error in write latency is higher. This can be attributed to the fact that Kawahara et al. [68] use a write optimization to reduce sneak current, which is not modeled in DESTINY. Furthermore, the range of acceptable write pulse times according to their shmoo plot is very wide, ranging from 8.2ns to 55ns. Our prediction falls in the lower end of the plot, closer to the 8.2ns write pulse that corresponds to the reported 17.2ns total write time at the bank level.

2Note that vertical RRAM designs, which reduce fabrication cost, have also been proposed; however, we are not aware of any prototypes.

Table 3.5: Design space exploration results for determining the optimal number of 3D-stacked layers for various optimization targets for STT-RAM (see Section 3.5.2). The table shows results for all parameters for comparison purposes.

Opt. Target   | Optimal Layers | Area      | Read Latency | Write Latency | Read Energy | Write Energy | Read EDP | Write EDP
Area          | 16             | 2.07 µm²  | 51.14 ns     | 45.30 ns      | 2.87 nJ     | 2.63 nJ      | 146.68   | 119.27
Read Latency  | 16             | 2.90 µm²  | 1.77 ns      | 5.38 ns       | 3.15 nJ     | 2.90 nJ      | 5.58     | 15.60
Write Latency | 16             | 3.15 µm²  | 1.82 ns      | 5.36 ns       | 3.17 nJ     | 2.92 nJ      | 5.77     | 15.63
Read Energy   | 1              | 19.21 µm² | 148.11 ns    | 113.43 ns     | 0.36 nJ     | 0.14 nJ      | 53.13    | 15.48
Write Energy  | 1              | 19.78 µm² | 148.11 ns    | 113.43 ns     | 0.77 nJ     | 0.13 nJ      | 113.89   | 15.28
Read EDP      | 4              | 7.59 µm²  | 2.94 ns      | 6.23 ns       | 0.91 nJ     | 0.69 nJ      | 2.67     | 4.28
Write EDP     | 2              | 12.31 µm² | 5.63 ns      | 6.93 ns       | 0.70 nJ     | 0.48 nJ      | 3.95     | 3.31

3.5 Case Studies using DESTINY

With an increased number of options for memory technologies and fabrication techniques, the number of possible design options increases exponentially. Hence, a designer must make the right choices to optimize for a given target. DESTINY can be a very useful design-space exploration and decision-support tool for such scenarios. In this section, we present features of DESTINY that demonstrate this.

3.5.1 Finding the optimal memory technology

We first show the capability of DESTINY to find the best memory technology for a given optimization target. We consider five options, viz. RRAM, STT-RAM, PCM, eDRAM, and SRAM. Each cache has the same configuration, viz. a 1-layer, 32MB, 16-way cache designed at the 32nm node. Table 3.3 shows the optimal memory technology found by DESTINY for each of the seven different optimization targets.

The results can be understood as follows. The RRAM is designed as a cross-point style memory, which is the most area-efficient way to design RRAM, resulting in low area usage. STT-RAM has the lowest read latency; however, its energy is higher than that of eDRAM, resulting in the eDRAM design having the lowest read energy and read EDP. NVMs are known to have high write latency and energy and hence are not optimal for write latency, energy, or EDP. The write energy of SRAM is lower than that of eDRAM; however, the write latency of SRAM is much higher than that of eDRAM when optimized for write energy, resulting in eDRAM having the lowest write EDP.

The major reason why eDRAM is optimal over SRAM in terms of latency is the size of the H-tree based interconnect in our design. Since our 32MB cache is designed as a single bank, H-tree latency dominates in both designs. However, the SRAM cell size is approximately four times larger than that of eDRAM (146 F² vs. 39 F²), which makes the SRAM cache nearly four times larger. This increase in overall size has a direct impact on the total latency.

It is noteworthy that PCM was not found to be optimal for any optimization target we used. PCM typically has very high read and write energy and latency compared to the other technologies. For area optimization, PCM comes close to RRAM (0.97 µm² for PCM compared to 0.95 µm² for RRAM); however, due to the relatively small cache size (32MB), the peripheral circuitry required for PCM makes its area larger than that of RRAM. The results of this study show that different optimization targets can yield different memory technologies as the optimal cache design, and DESTINY can be a convenient tool for finding the best technology for each target. It is also noteworthy that the results obtained here hold for the particular cell-level parameters used as input for each technology; other parameters and configurations may yield different technologies as optimal for each target.

3.5.2 Finding the optimal layer count in 3D stacking

We now show the capability of DESTINY to find the optimal number of 3D die layers for a given optimization target. In this case, DESTINY explores both a 2D design and different numbers of layers in a 3D design. In other words, it explores designs with 1, 2, 4, 8 and 16 layers (the maximum layer count is fixed at 16). We use an STT-RAM cache at 32nm, 32MB, with 16 ways (same as above). Table 3.5 shows the results. It is clear that for different optimization targets, different numbers of layers are found to be optimal.

The results can be understood as follows. The area is computed as the maximum size of any die in the stack and hence is minimized when the layer count is set to the largest value. Latency is also optimized at the maximum number of layers, since 3D stacking enables shorter global interconnect as the subarray sizes become much smaller. However, this is not always true in general. To confirm this, we checked with progressively reduced cache sizes and found that the TSV latency begins to dominate the overall latency at a size around 4MB, at which point the design with the maximum number of layers is no longer selected as the optimal result. The energy is minimized by avoiding the overhead of TSVs, and hence the design with 1 layer, i.e., a 2D design, consumes the least amount of energy.

The optimization of EDP presents an interesting case since, as shown above, the trends of variation in energy and latency values are opposite. For this reason, it is expected that an intermediate value of layer count will be optimal for EDP. With increasing layer count, the write latency decreases at a slower rate than the read latency. For this reason, the optimal value of write EDP is obtained at 2 layers while that of read EDP is obtained at 4 layers. Clearly, the choice of the number of layers can have a profound effect on the optimal value of the different parameters, and a tool such as DESTINY is vital for performing design optimization.

3.6 Conclusion

Due to the emerging nature of these memory technologies, only a limited number of prototypes have been demonstrated. Because of this lack of prototypes, we could not validate 3D STT-RAM and 3D PCM, although based on our validation results with 3D RRAM, we expect that DESTINY will be accurate in modeling them as well. We plan to perform these validations as prototypes become available. Further, we plan to extend DESTINY to model MLC (multi-level cell) support for NVMs and also to model other emerging memory technologies such as racetrack memory [71].

Furthermore, we plan to integrate DESTINY into a performance simulator (such as gem5) to enable architecture- and system-level study of these technologies at different levels of the memory hierarchy and to find the optimal memory technology for a given workload. In this chapter, we presented DESTINY, a comprehensive, validated tool for modeling both 2D and 3D designs of prominent conventional and emerging memory technologies. We described the modeling framework of DESTINY and performed validations against a large number of industrial prototypes. We demonstrated the capability of DESTINY to perform design space exploration over memory technologies and 3D layer counts. We believe that DESTINY will be useful for architects, CAD designers and researchers.

Chapter 4

Improving Effectiveness of Hybrid-Memory Systems with High-Latency Caches

The emergence of 3D-stacked integrated circuits (3D ICs) has been envisioned as a promising technology to continue to improve performance in CPUs. One of the most commonly studied applications of 3D integration is to enable stacked DRAM with large capacity to be integrated with processor cores, either via vertical integration with through-silicon vias (TSVs) or via 2.5D integration using interposer approaches. The earliest works looked at solutions for effectively utilizing the large-capacity stacked memory to both reduce latency and increase bandwidth between the CPU and the memory subsystem [50] [76]. These works assume the entire system memory can be stacked on top of the processor. However, the capacity of stacked DRAMs still ranges from hundreds of megabytes [77] [78] to a few gigabytes [76]. Consequently, using stacked DRAM as part of the main memory means using it either as a subset of addressable memory or as a cache for different portions and granularities of main memory.

Utilizing stacked memory as a cache is seen as a more near-term solution [79] because of the ability to manage the memory contents completely in hardware (i.e., without operating system support) and without costly data migration between on-package stacked DRAM and off-package regular main memory [80]. As a result, more recent works have been exploring the use of 3D-stacked memory as a last-level cache (LLC) [79][81] rather than as part of the main memory. A stacked cache can also be used to accelerate off-chip non-volatile memories and reduce the number of writes. We leverage the benefits of our work in Chapter 2 in order to set up a hybrid memory system consisting of high-latency main memory and a lower-latency LLC which operates similarly to DRAM. As a result, the LLC has much higher latency than traditional SRAM LLCs.

Unlike on-die SRAM caches, studies of stacked memory caches have so far focused on using commodity DRAM to increase the capacity of the cache. The stacked memory caches explored are designed using a traditional off-chip DRAM process technology rather than an on-chip embedded DRAM process technology. This means that accesses to stacked memory caches require the standard row and column commands of activate and precharge. Organizing the stacked memory similarly to off-chip memory provides the benefit of higher density, but greatly increases the access latency. On top of this, the tag array necessary for such a cache can only feasibly be placed within the DRAM itself. As a result, early work on stacked DRAM as a cache dismissed the idea of cache-line-sized blocks in stacked DRAM, as it would require three accesses for tag check, data access, and tag update. However, more recent work has shown that co-placement of tags and data within the same activation region can optimize accesses by reducing the number of activation cycles (e.g., MissMap [79] and Alloy Cache).

One major issue with such large-scale caches, however, is the ratio in access time between stacked DRAM and off-chip memory. Early work assumed 1:3 ratios, while more recent conservative estimates assume 1:2 or 1:1.5 ratios, meaning the stacked memory is approximately 50-100% faster than off-chip memory. Traditional caching wisdom suggests that high hit rates reduce the overall memory access time. However, in the case of stacked DRAM, the close ratio to off-chip memory can increase the queuing delays at the stacked DRAM if the cache is always consulted on higher-level cache misses.

The major problem occurs due to the interleaving of filling and accessing cache lines in the cache. Accessing the cache refers to a read request which comes from higher-level caches. These requests are generally more important than fill requests to the cache, since they are on the critical path of instruction execution. However, fill requests cannot be completely ignored, because accesses to the cache will not hit until the fill is completed. Although there is generally a long reuse time between cache lines in lower-level caches, this concept becomes important when using prefetching. In some cases it is also beneficial to schedule cache fills ahead of cache accesses.

Caches are typically designed to store spatially local addresses in consecutive cache sets. Doing so allows for higher cache associativity for a fixed set size. Because of this organization, request interleaving also tends to cause a high number of row-buffer misses. In a stacked DRAM cache this means the cache needs to precharge a row and activate a new row if sets are stored in separate rows, which is the most desirable design to obtain the highest associativity. Row-buffer hits are comparatively much less expensive. The interleaving of fill and access requests in the cache controller means that a fill request may be able to result in a row-buffer hit, even if a read request is ready to go.

Due to the desire to prioritize access requests over fill requests, and the possibility of fill requests delaying access requests in combination with high latency, the size of the transaction queue needs to be relatively large to hold all of these requests. This can result in extremely large queuing latencies when multiple requests are bound for the same bank in a stacked DRAM cache. Again, this delay can be further exacerbated by the introduction of prefetch requests. Due to the nature of caching, it is not necessary to complete all fill requests in order to have correct system functionality.
In this chapter, we propose an optimized 3D DRAM cache design and explore the issues mentioned above.

Throughout the remainder of this chapter, "DRAM Cache" refers to a 3D or 2.5D stacked DRAM memory to which cache-line-sized blocks are read and written. We consider various designs with modified cache set indexing to allow multiple spatially local cache lines to be filled to the same cache set. We introduce a small, L1-sized cache called the Fill Cache which is used to manage fill requests and track the load of banks in the DRAM cache. The Fill Cache replaces the transaction queue for stacked DRAM caches, with differences mirroring those between buffers and caches. Whereas a transaction queue holds data that will be filled or accessed in the future, a Fill Cache holds data that may or may not be filled or accessed in the future. The enumerated roles of the Fill Cache are as follows:

1. Write Buffering. The Fill Cache inherently acts like a write buffer, since all fill requests are writes and it holds requests until a good opportunity to fill arises.

2. Coalescing. Although the Fill Cache is not designed to be very large, it is large enough that multiple requests bound for the same cache set can build up in the Fill Cache, which can increase row-buffer hits with our modified cache set indexing.

3. Fill Request Filtering. We explore adding a few bits to each cache line to identify request properties. We specifically track whether the request is a prefetch or a demand request and whether it has been filled or is still waiting. Using this information, we selectively drop requests (i.e., choose not to fill) during periods of high memory traffic.

4. Bank "Load" Tracking. The number of access requests in the DRAM cache's transaction queue is tracked to get an approximate delay for a particular bank. We can use this information to selectively drop requests or re-route access requests (i.e., choose not to access) to main memory.

4.1 Motivation

3D-stacked DRAM has been intensively studied, and prototype chips have been demonstrated by companies such as Micron and Tezzaron []. 3D-stacked DRAM can be integrated with processor cores either via vertical integration, such as with through-silicon vias (TSVs), or via 2.5D integration using interposer approaches. Both approaches allow for extremely high pin counts and low-latency interfaces between the memory and the processor's on-chip memory controllers. The high pin count allows for a higher number of channels and/or wider data buses, which are not otherwise possible with off-chip memory due to the limited number of pins available on processor packaging. A DRAM cache can therefore be utilized as a large-capacity LLC with reduced latency and increased bandwidth between the CPU and memory subsystem [50] [76].

With a 3D-stacked DRAM last-level cache, requests to the DRAM cache are faster than off-chip DRAM for a few reasons. The electrical proximity of the stacked DRAM itself means less time is spent on the interconnect between the memory controller and the stacked memory. The stacked DRAM is presumed to run at the same clock frequency as the memory controller itself.

While this does not make the DRAM memory itself respond with lower latency, it means DRAM cache requests do not need to go through extra clock synchronization circuitry like off-chip requests do, which reduces the overall latency of DRAM cache requests. The higher pin count allows wide-I/O DRAMs and more channels to be used. Wide-I/O DRAMs are able to return the same amount of data in fewer clock cycles, freeing the data bus for other banks or ranks on the same channel to use and, more importantly, reducing the overall latency of requests.

Although the 3D-stacked DRAM cache is expected to help improve performance, there exist a few design challenges that may offset the aforementioned benefits and potentially hurt system performance. In particular, tag array storage overhead, cache miss overhead, and the interference between fill requests and access requests are critical to the effectiveness of a large-capacity 3D-stacked DRAM cache. There have been two basic proposals on managing tag storage and miss overhead in DRAM caches.

Storing the Tag Array in the 3D-Stacked DRAM Cache. In a large-capacity 3D-stacked DRAM cache, the tag array is often too large to be designed as an SRAM cache [79]. Storing the tags in the DRAM can also be inefficient. If the tags are stored in different DRAM rows than the data, looking up data involves two activate-to-precharge cycles on a hit, and one on a miss. A third activate-to-precharge cycle may be required to update the DRAM cache tags. More recent work has exploited row-buffer locality by storing the tags for data in the same row as the data, as shown in Figure 4.1 [79]. In this figure, a single 2KB row of DRAM is shown divided into thirty-two 64-byte cache-line segments. The first three segments are used for tag storage, while the remaining 29 segments are used for cache lines and represent the total number of ways in the cache. The three tag segments at the beginning of the row are read to determine hit or miss status. Upon a miss, memory requests are redirected to main memory as normal and the bank can be closed. On a hit, the row remains open, and reading the data results in a row-buffer hit. This technique is coined Compound Access Scheduling in [79]. Our baseline and proposed designs both employ this tag-in-row organization.

Miss Penalty in 3D-stacked DRAM. Even with faster stacked memory, misses in the large-capacity DRAM cache still incur significant overhead. A DRAM cache alone has been shown to be not very effective due to this overhead [79]. Previous work proposes a MissMap, which stores partial tag data to determine if a cache line resides in the DRAM cache.


Figure 4.1: A single DRAM row in a DRAM cache. The 2KB row is divided into 32 64-byte cache-line-sized segments. The first 3 segments are used for tags while the remaining segments are ways in the cache.
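As a quick sanity check of this layout, the arithmetic below recomputes the figures quoted above; the roughly 6.6 bytes available per tag is our own derived number, not a value stated in this chapter.

    row_bytes = 2048                          # one DRAM cache row
    line_bytes = 64                           # cache-line-sized segment
    segments = row_bytes // line_bytes        # 32 segments per row
    tag_segments = 3                          # segments reserved for tags
    ways = segments - tag_segments            # 29 data ways per set
    tag_budget = tag_segments * line_bytes    # 192 bytes of tag storage per row
    print(segments, ways, tag_budget, round(tag_budget / ways, 1))  # 32 29 192 6.6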


Figure 4.2: Example fill to a baseline DRAM cache. In (a), 3 tags are read and no commands are issued until data returns. The request misses in the cache and data is fetched from off-chip memory in (b). The DRAM cache may or may not be precharged if there are other requests to this DRAM bank. In (c), the bank is re-activated if needed, and the data is written followed by a tag write.

By storing only whether cache lines are resident, rather than the full tag, a larger number of cache lines can be tracked in the MissMap. The MissMap may be designed to evict cache lines from the DRAM cache if there is no space left in the MissMap to mark their presence. Even using this approach, the MissMap can still represent almost the entire DRAM cache, depending on cache size and workload. Since the MissMap tracks page-sized regions, the workload needs some amount of spatial locality for the MissMap to be effective.

An example of filling data in a DRAM cache without any form of tag cache is shown in Figure 4.2. Each request to the DRAM cache must perform a tag lookup in order to find the location of the data in the DRAM cache. During this time, the bank will be idle. The tag read in this example results in a cache miss, so a request must be sent to the off-chip memory. Once the data is returned from main memory, it can be sent back to the higher-level cache which issued the request. However, the data will not officially be filled in the DRAM cache until the second write (the tag update) is complete. Any read request to this address must be scheduled after the fill request to ensure memory consistency.

The use of a tag cache or MissMap optimizes the average request latency by quickly determining the hit or miss status of a request without performing a tag lookup in the DRAM cache. This avoids the tag lookup in part (a) for miss requests and sends them directly to the off-chip main memory. These miss requests will subsequently be filled in the DRAM cache by reading the tags as in part (a) and writing the data and tag as in part (c). For requests known to hit, the DRAM cache will perform a full tag lookup to locate the way holding the data, as in part (a), immediately followed by a read or write of the hit data and a tag update, as shown in part (c).

Cache Hits vs. Cache Filling. Regardless of the underlying method used to determine the residency of a cache line in the DRAM cache, this work concentrates on reducing the overall access latency for any fill or access requests that do arrive. Even with fewer requests due to the reduction in misses, the DRAM cache can still receive a significant number of incoming requests. Requests that hit in the DRAM cache are served from the DRAM cache under normal operation. Requests that miss in the DRAM cache are redirected to main memory, but should later be filled in the DRAM cache. This means that all requests that miss in upper-level caches will go to the DRAM cache, either in the form of a hit (a read/write request) or a fill from a miss (a write request). In contrast, the number of write requests in certain applications is much lower than the number of read requests. As a result, using a simple write buffer may not be very effective, since the buffer is quickly filled and must begin draining. A simple solution would be to increase the number of entries in the write queue. But since the write queue must be searched similarly to a content-addressable memory (CAM), increasing its size increases the energy needed to search it.

Figure 4.3 below shows an example snapshot of some of the bank queues when simulating the bwaves application from the SPEC2006 benchmark suite. Several requests arrive at the DRAM cache in short succession and all of these requests need to be issued to the same bank. These requests become backed up behind several fills to the DRAM cache issued prior to this point in time.
Due to the high latency of DRAM caches, as few as 3-4 read requests to the same bank can cause a sharp increase in latency for incoming requests. As a result, the impact of fill queuing can be seen in most of the SPEC2006 benchmark applications. The exact number depends on the difference between the off-chip main memory latency and the DRAM cache latency, but we cannot reasonably assume the DRAM cache will be much more than 2 or 3 times faster than off-chip memory.


Figure 4.3: Example snapshot of bank queues in SPEC2006’s bwaves benchmark.

4.2 Implementation

In order to better manage these fill requests, we propose a unit called a Fill Cache which holds all incoming fill requests until a good opportunity to issue them arises. Using this unit, we hope to avoid the problem of fill requests delaying read requests to the cache, as seen in Figure 4.3. We do this by managing the fill requests in the Fill Cache: write buffering the requests, coalescing fills, and actively dropping requests or re-routing them to off-chip memory. Effectively, the Fill Cache can be seen as an indexed set of write buffers. The location of a memory request in the Fill Cache corresponds to the bank to which the request is destined. With this organization, our optimizations only need to search the set applicable to the bank being accessed, reducing the search space compared to a single large write buffer.


Figure 4.4: (a) A Fill Cache entry being evicted where a coalesce can be performed. (b) The DRAM cache probing the Fill Cache for requests destined for bank 4.

This organization additionally allows for further optimizations beyond write buffering, such as write coalescing and bank load tracking for request dropping.

Basic Operations. The basic operation of the Fill Cache is described as follows:

1. Fill requests coming back from main memory and write requests from the higher level cache shall go to the Fill Cache first. The address of the fill request is used to select which set of the Fill Cache will store the request. The Fill Cache places the fill request in the DRAM cache’s transaction queue if there are no access requests pending for the same bank.

2. Read requests leaving the L2 cache may check the Fill Cache for the address. If the address happens to reside in the Fill Cache, the data is returned immediately. Since the Fill Cache is L1-sized, we simulate a request served by the Fill Cache to return on the next CPU clock cycle.

3. If the request is not found in the Fill Cache, it proceeds to either the DRAM cache or off-chip main memory based on the miss status determined by the MissMap.

4. Data may stay in the Fill Cache until space is needed to maximize the chances of serving from the Fill Cache.

An Illustrative Example. A pictographic example of Fill Cache operation is shown in Figure 4.4. In Figure 4.4(b), bank 4 of the DRAM cache has no requests in its transaction queue, so it probes the Fill Cache for fill requests destined for bank 4. In this case, a request is found and the fill request is returned. If no request is found, the Fill Cache records the bank as idle. Any future requests to a bank marked idle are cached and forwarded to the bank's transaction queue immediately. The bank is marked as busy until it issues a probe again. In Figure 4.4(a), set i is full and a fill request must be evicted. In this example, a fill request to bank 3, row 6 is evicted. During this eviction, a second request is found which can be issued simultaneously. This mechanism is described in Section 4.2.4.

From this description, the Fill Cache seems similar to a write queue. The major differences are how the fill requests are managed and the organization of fills into cache sets rather than a single large queue. Since our design looks for fill requests to issue when a DRAM cache bank's transaction queue is empty, a cache may be more efficient than a fully associative buffer.
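The sketch below captures the indexed-write-buffer view of the Fill Cache and its insert, lookup, and probe operations. It is a simplification: sets are indexed directly by destination bank, entries are plain address/data pairs, and all names are illustrative rather than the simulator's actual interfaces.

    from collections import OrderedDict

    class FillCache:
        """Per-set write buffers indexed (here) by destination bank; a probe from an
        idle bank pulls one pending fill, and a full set forces an eviction."""

        def __init__(self, entries_per_set, num_banks):
            self.entries_per_set = entries_per_set
            self.sets = {b: OrderedDict() for b in range(num_banks)}  # addr -> data

        def insert(self, bank, addr, data):
            """Buffer an incoming fill; evict the oldest entry of the set if it is full."""
            victim = None
            s = self.sets[bank]
            if addr not in s and len(s) >= self.entries_per_set:
                victim = s.popitem(last=False)   # oldest fill leaves the Fill Cache
            s[addr] = data
            return victim                        # caller enqueues or drops the victim

        def lookup(self, bank, addr):
            """Serve a read directly from the Fill Cache while the line is buffered."""
            return self.sets[bank].get(addr)

        def probe(self, bank):
            """Called when a bank's transaction queue is empty: hand back one fill."""
            s = self.sets[bank]
            return s.popitem(last=False) if s else None

A victim returned by insert would either be placed in the DRAM cache's transaction queue or dropped according to the policy described in Section 4.2.1.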

4.2.1 Managing the Fill Cache

Similar to a write buffer, the size of the Fill Cache is finite, and fill requests that have not been issued when it is full need to be completed before further fill requests can be added. Based on the operation of the Fill Cache, a request is evicted when the set to which a future fill request maps is full. The victim chosen in this eviction process must then be filled in the DRAM cache. Since we are trying to reduce the average latency of requests, smarter management of which evicted requests are sent to the DRAM Cache can provide some benefit. With the Fill Cache, we may choose to drop requests from being filled in the DRAM cache during periods of high pressure on the DRAM cache. If a request is dropped, the request is not filled in the DRAM cache, and therefore latency is guaranteed to decrease. However, dropping too many requests will eventually increase latency by reducing the hit rate of the DRAM cache.

We choose which requests to drop by speculating on the reuse of the data. To do this, we add some extra flag bits to each cache line to record any information needed by the algorithm which decides which requests to drop. In our design, we propose a total of 3 flag bits for each cache line: isReferenced, isPrefetch, and isFilled. The isReferenced bit is set when a request is referenced; referenced means the address of the fill was requested by a higher-level cache. Such a request can be served by the Fill Cache directly by design (bypassing the DRAM cache and off-chip memory). isPrefetch is set if the fill request was created by a prefetch request. In our work, we issue prefetch requests to better test dropping requests. We assume requests from higher-level caches also have the ability to be marked as prefetches. isFilled is set when the data is filled to the DRAM cache.

The Fill Cache is probed for fill requests when a DRAM cache bank's transaction queue is empty. Using the flags for referenced, prefetch, and filled requests, several different priority schemes can be crafted to drop or evict requests from the Fill Cache. A dropped request is a request that is removed from the Fill Cache and never placed in the DRAM cache's transaction queue. An evicted request is removed from the Fill Cache and immediately placed in the DRAM cache's transaction queue. An example of prioritizing requests to drop or evict, which we use in our work, is shown in the list below:

1. Drop Filled Requests

2. Drop Unfilled, Referenced Prefetches

3. Drop Unfilled, Unreferenced Prefetches

4. Evict Unfilled Demand Requests

This priority listing is not necessarily the best approach for every workload, but we describe our thought process below. Filled requests have already been filled, so they will hit in the DRAM cache.

Figure 4.5: Percentage of reused requests based on type. Types are referenced prefetches (RP), unreferenced prefetches (UP), and demand requests (UD). Average is 29.22% for RP, 41.88% for UP, and 28.90% for UD.

Generally, filled requests should be dropped before unfilled requests. If unfilled requests are dropped, they will not hit in the DRAM cache on the next access. If unfilled requests are evicted, they can cause avoidable queuing delay in the DRAM cache. Next, prefetch requests should be dropped before demand requests, since demand requests are non-speculative. Intuitively, we may decide to drop referenced prefetch requests, and the demand request or requests which triggered them, before unreferenced prefetches. The idea behind this is that if a prefetch is referenced, then it was a successful prefetch. Since the prefetch was successful, there is a decent chance the prefetch will be successful again in the future. The demand request or requests that triggered the prefetch are also dropped, hoping that the prefetch will be triggered again in the future by this request. Figure 4.5 shows the reuse patterns of requests issued to the DRAM cache across the SPEC benchmarks. The graph shows that unreferenced prefetches have the highest chance to be reused in the future, so we may choose to fill these over referenced prefetches. In our analysis in Sections 4.3.10 and 4.3.7, we find that around 99% of prefetches are served by the Fill Cache and accuracy is quite high. This implies that the reuse of unreferenced prefetches is not due to the prefetch being late or unsuccessful.
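A sketch of how a victim set might be scanned in the priority order listed in Section 4.2.1 follows. The FillEntry struct and its field names mirror the three flag bits described above but are illustrative, not the exact hardware encoding.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct FillEntry {
    bool isReferenced = false;  // address was requested by a higher-level cache
    bool isPrefetch   = false;  // created by a prefetch request
    bool isFilled     = false;  // data already written into the DRAM cache
};

enum class Action { Drop, Evict };

// Returns the index of the victim and whether it is dropped or evicted.
std::pair<std::size_t, Action> chooseVictim(const std::vector<FillEntry> &set) {
    // 1. Drop filled requests.
    for (std::size_t i = 0; i < set.size(); ++i)
        if (set[i].isFilled) return {i, Action::Drop};
    // 2. Drop unfilled, referenced prefetches.
    for (std::size_t i = 0; i < set.size(); ++i)
        if (set[i].isPrefetch && set[i].isReferenced) return {i, Action::Drop};
    // 3. Drop unfilled, unreferenced prefetches.
    for (std::size_t i = 0; i < set.size(); ++i)
        if (set[i].isPrefetch) return {i, Action::Drop};
    // 4. Evict an unfilled demand request (placed in the DRAM cache's queue).
    return {0, Action::Evict};
}
```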

4.2.2 Re-routing Requests

Another technique similar to dropping requests is to re-route read requests to the off-chip memory. We can re-route requests to off-chip main memory during periods of high pressure on the DRAM cache. Doing this can provide load balancing between the DRAM cache and off-chip memory. Re-routing requests must be done carefully in order for requests to be coherent. Our approach uses the dirty-bit from the MissMap to determine if the request is clean or potentially dirty. The dirty-bit in the MissMap specifies if any cache line in the region represented by the MissMap entry is dirty. If the bit is not set, we know all of the cache lines in this region are clean. In this case it is safe to issue the read request to main memory and bypass the DRAM cache. If the dirty bit is set, we do not know if the data has been modified, so the request must be sent to the DRAM cache. If the MissMap implementation provides dirty bits for every cache line in the region, we would be able to re-route requests more aggressively. In our work we only consider a single dirty bit.
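A minimal sketch of the coherence check described above, assuming a MissMap entry that carries a single dirty bit per 4KB region. The names MissMapEntry, regionDirty, and routeRead are illustrative assumptions.

```cpp
#include <cstdint>

struct MissMapEntry {
    uint64_t tag = 0;
    uint64_t presentVector = 0;  // one bit per 64B line in the 4KB region
    bool     regionDirty = false; // set if any line in the region may be dirty
};

enum class Route { DramCache, OffChip };

// Re-route a read only when the whole region is known clean, so the copy in
// off-chip memory cannot be stale.
Route routeRead(const MissMapEntry &e, bool dramCacheOverloaded) {
    if (dramCacheOverloaded && !e.regionDirty)
        return Route::OffChip;   // safe to bypass the DRAM cache
    return Route::DramCache;     // potentially dirty: must go to the DRAM cache
}
```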

4.2.3 DRAM Cache Load

Both request dropping and re-routing of requests rely on knowing when the DRAM cache is being heavily utilized. In order to determine this, we track the load of each bank and feed this information into the Fill Cache to better manage fill requests. The load of a bank refers to the number of read requests currently in a bank’s transaction queue. If the number of read requests is high, it may be beneficial to re-route a request to main memory (which may decrease the latency for this request). The number of read requests can similarly be used to drop fill requests from the Fill Cache.
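A small sketch of the per-bank load metric described above: the load is simply the number of outstanding reads in a bank's transaction queue. The bank count and threshold value are illustrative assumptions, not tuned parameters from this work.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kNumBanks      = 16;
constexpr unsigned    kLoadThreshold = 8;    // assumed; tune per design

std::array<unsigned, kNumBanks> readLoad{};  // reads currently queued per bank

// Fed into the Fill Cache's drop and re-route decisions.
bool bankIsLoaded(std::size_t bank) {
    return readLoad[bank] >= kLoadThreshold;
}
```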

4.2.4 Coalescing Fills

When we select a fill request to evict from the Fill Cache, we can also look at other requests in the same set in parallel. Due to the write buffering property, there is a good chance that multiple spatially local addresses reside in the Fill Cache. These spatially local fill requests may be destined to the same bank and same row. If this is the case, we can easily piggyback further fill requests to exploit row buffer locality. To do this, multiple fills to the same row are selected and marked as filled. The requests are added to the DRAM cache transaction queue. At this point, the memory controller will automatically issue the second data write. In this case, we also only need to update the tags once, so for each coalesced request an extra write cycle is spared.
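The sketch below illustrates the coalescing step: when a victim fill is evicted, other fills in the same Fill Cache set that target the same bank and row are piggybacked so they hit the open row buffer. The Fill struct and function name are illustrative.

```cpp
#include <vector>

struct Fill { unsigned bank = 0, row = 0; /* address/data omitted in this sketch */ };

std::vector<Fill> coalesceWithVictim(std::vector<Fill> &set, const Fill &victim) {
    std::vector<Fill> batch{victim};
    for (auto it = set.begin(); it != set.end(); ) {
        if (it->bank == victim.bank && it->row == victim.row) {
            batch.push_back(*it);   // same open row: issue together, tags updated once
            it = set.erase(it);
        } else {
            ++it;
        }
    }
    return batch;                   // all requests go to the bank's transaction queue
}
```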

Table 4.1: Experimental Setup Parameters

CPU Parameters:
  CPU Type: Out-of-Order; Frequency: 3.2 GHz
Caches:
  L1D$ and L1I$: MESI, 32kB, 2 ways
  L2 (LLC): MESI, 4MB, 8 ways
  Miss Map: 1.25MB, 20 ways
  Fill Cache: 64kB, 8 ways
DRAM Cache:
  Frequency: 1 GHz (2 GHz DDR3); Bus Width: 128 bits
  Channels: 4; Ranks per Channel: 1; Banks per Rank: 4
  Rows per Bank: 16,384; Bytes per Row: 2,048; Total Size: 512 MB
  tRCD-tCAS-tRP-tWR-tRAS: 8-8-9-12-26
  tWTR-tRTP-tRRD-tFAW-tRFC: 2-2-6-30-65
Off-Chip DRAM:
  Frequency: 800 MHz (1.6 GHz DDR3); Bus Width: 64 bits
  Channels: 4; Ranks per Channel: 1; Banks per Rank: 8
  Rows per Bank: 32,768; Bytes per Row: 2,048; Total Size: 2 GB
  tRCD-tCAS-tRP-tWR-tRAS: 18-9-18-20-36
  tWTR-tRTP-tRRD-tFAW-tRFC: 5-5-6-30-65

4.2.5 Modifications to DRAM Cache

Figure 4.6: Example set indexing schemes for DRAM caches. In (a), indexing is similar to SRAM caches where the lowest bits above the byte offset determine the set and upper bits determine the tag. In (b), a portion of the tag is after the byte offset, promoting row-buffer hits.

In order for these techniques to work more effectively given the operation of the DRAM cache, we make a simple design optimization to the memory controller for the DRAM cache. The memory controller decodes the addresses of incoming requests to the proper channel, rank, bank, row, and column values in order to issue DRAM commands. We add a simple shift of the bits selected in this address decoding to allow for better coalescing opportunities. Typical DRAM memories use set indexing similar to what is shown in Figure 4.6(a). Off-chip memory generally has the column, or a portion of the column, on less significant bits to encourage row-buffer hits. In our scheme, the channel, rank, and bank are chosen first to select the appropriate bank in the DRAM cache. If row or column bits were used in the less significant positions, the DRAM cache would have an increased number of requests going to the same bank in different rows in programs with spatial locality. This increased number of requests to the same bank is the exact problem we are trying to solve; therefore using channel, rank, and bank first is almost necessary. The remaining lower bits are used to determine the DRAM cache set (marked DRC Set in Figure 4.6). The DRAM cache set is equivalent to the row in the DRAM bank. Upper bits after this indexing are used as tag bits. The scheme in Figure 4.6(b) may be more useful in spatially local programs, as it enables row-buffer hits and coalescing. Lower bits immediately following the cache line byte offset are used as a portion of the tag. In (a), spatially local data would be distributed across multiple banks in different ranks and channels. In (b), depending on the number of bits used as part of the tag, this data can map to the same DRAM row, allowing for slightly faster fills of sequential data. The scheme shown in Figure 4.6(b) is similar to using larger cache lines. However, the cache lines are still tagged and managed as individual 64-byte cache lines. This means that even if two sequential cache lines are filled together, they may be evicted at different times. The number of bits used as part of the tag should be minimal to prevent inducing large numbers of conflict misses. Since the DRAM cache already has a large number of ways, using 1 or 2 bits immediately following the byte offset does not have a large impact on the number of conflict misses or on performance for our benchmarks. Using 1 bit and 2 bits between the byte offset and the DRAM cache set index are named 1-bit skip indexing and 2-bit skip indexing in this paper, respectively. Using more than one or two bits may not benefit as much from coalescing and row buffer hits. In our design, the starvation threshold for our memory controller is set to four, meaning an open row is closed after four row-buffer hits have been serviced while a request to another row is waiting.
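Below is a sketch of the address decoding just described, assuming a 64B cache line (6 offset bits) and the channel/rank/bank fields placed immediately above the skip bits. The field widths follow the organization in Table 4.1, but the struct and function names are illustrative.

```cpp
#include <cstdint>

struct DecodedAddr { unsigned channel, rank, bank, drcSet; uint64_t tag; };

DecodedAddr decode(uint64_t addr, unsigned skipBits /* 0, 1, or 2 */) {
    uint64_t a = addr >> 6;                          // drop the 64B line offset
    uint64_t tagLow = a & ((1u << skipBits) - 1);    // skip bits become low tag bits
    a >>= skipBits;
    DecodedAddr d;
    d.channel = a & 0x3;    a >>= 2;                 // 4 channels   -> 2 bits
    d.rank    = 0;                                   // 1 rank/channel -> 0 bits
    d.bank    = a & 0x3;    a >>= 2;                 // 4 banks/rank -> 2 bits
    d.drcSet  = a & 0x3FFF; a >>= 14;                // 16,384 sets (rows) per bank
    d.tag     = (a << skipBits) | tagLow;            // remaining upper bits + skip bits
    return d;
}
```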

Table 4.2: Total memory traffic over the first 2 billion cycles of each benchmark. Benchmarks selected for simulation are marked with an asterisk (*).

Benchmark      Traffic (MB)    Benchmark      Traffic (MB)
astar          39.8            bwaves*        1141
bzip2          77.9            cactusADM      1159
dealII         12.8            gcc            24.5
GemsFDTD*      261             gromacs        1.41
gobmk*         2986            h264ref        0.35
hmmer          0.01            lbm*           3367
leslie3d*      805             libquantum*    1205
mcf*           2822            milc*          1179
namd           12.8            omnetpp        27
sjeng          25.0            soplex         18.2
sphinx3        56.2            xalancbmk      50.4
zeusmp         0.21

4.3 Published Results

The following subsections discuss the experimental setup and the methodology behind choosing our benchmarks. We also overview the architectural design of our DRAM Cache baseline and Fill Cache design. The results of our simulations are presented at the end of this section along with more in-depth analysis of the results.

4.3.1 Experimental Setup

For our experiments we use gem5 [18] and an in-house cycle-accurate memory simulator on top of it to model the off-chip memory and DRAM cache. Our memory simulator honors all of the DRAM timings necessary for accurate simulation in both the off-chip memory and DRAM caches. The system modeled in gem5 is a two-level cache hierarchy using the MESI protocol. Detailed configuration for gem5 and timings for both the DRAM cache and off-chip memory are provided in Table 4.1. Our DRAM cache configuration only consists of 4 banks per channel. This number is selected due to the impact of the tFAW timing on the DRAM cache. The tFAW timing specifies the maximum number of activations that can occur in tFAW cycles. Because the organization of the baseline DRAM cache places sequential cache lines in different rows, row buffer hits can only occur if two strided requests happen to occur at a multiple of the number of total banks in the DRAM cache. As a result, the row buffer hit rate is extremely low (<1%), and increasing the number of banks beyond this provides negligible performance increase, since further activations cannot be scheduled.

We use the SPEC2006 benchmark suite for performance evaluation of different configurations. We first profiled the SPEC benchmarks to find benchmarks with sufficient memory traffic to exercise the DRAM cache. Experiments are run on this subset of benchmarks in the results. Each benchmark was fast-forwarded by 1 billion instructions and simulation is performed over the next billion instructions. Fast-forwarding is used to fill the caches before data collection is started so the effectiveness can be seen in the results. Benchmarks are run in gem5 using system emulation mode. Due to limitations of system emulation mode, which has no paging mechanism, the memory size must be larger than the program's working set. As a result, we run single-core benchmarks with a proportionally small DRAM cache as compared to [79]. Despite the configuration, we still observed large increases in request latency due to fill queuing, as analyzed in Section 4.3.6.

4.3.2 DRAM Cache Architectures

Here we describe the architectural design of the DRAM caches we use for our baseline and the DRAM cache used in conjunction with the Fill Cache. We make only slight modifications to the DRAM cache for use with a Fill Cache. All of the unmentioned design points in the Fill Cache optimized DRAM cache can be assumed to be the same as the baseline design. The off-chip memory system uses a FRFCFS memory controller in all the experiment runs (i.e., requests that do not go to the DRAM cache will go to this memory subsystem). Baseline DRAM Cache Architecture. We use a first-ready first-come-first-serve (FR-FCFS) memory controller [82] to control both the off-chip main memory and the DRAM cache. The controller for the DRAM cache is modified to "lock" each bank after a tag lookup. Locked banks may not be scheduled to, in order to prevent accesses from being divided into multiple activate-precharge cycles. Accesses can potentially be divided because the command bus is idle after the tag lookup request is issued. However, the next command, which will read or write on a hit or do nothing on a miss, cannot be issued until the tags are actually read to determine residency of the address. We found that dividing hit requests causes severe performance degradation in the DRAM cache. Since the final tag read returns only a few cycles after it is requested, it makes sense to lock the banks to prevent the scheduler from dividing the access. For requests that miss in the DRAM cache and are issued to off-chip main memory, the bank can be safely unlocked, as this turn-around time is longer than a DRAM cache access. The dotted activate and precharge requests in Figure 4.2 in Section 4.1 represent this potential second request to the DRAM cache while issuing to off-chip main memory. For a tag cache we use the MissMap approach to filter out definite misses. Our MissMap implementation tracks cache lines fetched in 4KB regions, resulting in a 36-bit tag for the page address, a 64-bit bit-vector marking cache lines that are resident in the DRAM cache, and a single dirty bit for the entire 4KB region. The MissMap is force-synchronized with the DRAM cache.

Figure 4.7: Block diagram of the architecture of a main memory subsystem containing a Fill Cache (F$) along with state machine showing basic request flow in a DRAM cache memory system with Fill Cache.

That is, evictions in the MissMap result in evictions in the DRAM cache, and evictions in the DRAM cache result in evictions in the MissMap. The MissMap is designed by using a portion of the L2 cache as in [79]. Accesses to the MissMap are done in parallel with L2 cache accesses, so the MissMap data is available to the DRAM cache when an L2 miss occurs and is issued to the DRAM cache. More timings and the DRAM cache size are provided in Table 4.1 in Section 4.2.4. Fill Cache Optimized DRAM Cache. We optimize the 3D DRAM cache design by introducing a Fill Cache to manage the fill requests to the DRAM cache. A Fill Cache would reside between the last level of SRAM cache and the DRAM cache controller, or in the memory controller. An architectural block diagram is shown in Figure 4.7. The Fill Cache is labeled as F$ in the diagram. Finally, we experiment with a full memory system, which includes hardware prefetching. The effectiveness of the Fill Cache is demonstrated well with prefetching enabled. We implement a modern hardware prefetcher as described in the next section.
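The following is a sketch of the bank-locking rule described for the baseline DRAM cache controller: after a tag-lookup command is issued to a bank, the scheduler does not interleave another access until the tags resolve. BankState, the bank count, and the method names are illustrative assumptions.

```cpp
#include <array>
#include <cstdint>

struct BankState {
    bool     locked   = false;  // tag lookup outstanding, do not schedule to this bank
    uint64_t unlockAt = 0;      // cycle when the tag data returns
};

class DramCacheScheduler {
public:
    void onTagLookupIssued(unsigned bank, uint64_t now, uint64_t tagLatency) {
        banks_[bank].locked   = true;
        banks_[bank].unlockAt = now + tagLatency;
    }
    // A miss goes off-chip; the off-chip turn-around is longer than a DRAM
    // cache access, so the bank can be released early.
    void onMissToOffChip(unsigned bank) { banks_[bank].locked = false; }

    bool canSchedule(unsigned bank, uint64_t now) {
        if (banks_[bank].locked && now >= banks_[bank].unlockAt)
            banks_[bank].locked = false;   // tags returned, hit/miss resolved
        return !banks_[bank].locked;
    }

private:
    std::array<BankState, 16> banks_{};
};
```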

4.3.3 Hardware Prefetcher

In order to better demonstrate the effectiveness of the Fill Cache and to create prefetch requests to test managing the Fill Cache, we implemented a DRAM cache-level (LLC-level) prefetcher based on the STeMS [83] prefetcher. The purpose of this prefetcher is to generate more memory traffic and more bursts of requests to stress test the Fill Cache. We choose the STeMS prefetcher in order to have a more modern prefetcher in use. We also experimented with a get-next-N prefetcher which fetched 2, 4, or 8 more cache lines after each access. The results made the fill cache look artificially good, since there is a flood of requests with each access, and the fill cache helps to prevent them from interfering with normal DRAM cache read requests. The STeMS prefetcher learns pattern sequences by creating offsets from the first miss that occurs from a particular program counter. STeMS learns patterns over a generation. A generation is defined as the time between the first access until a block is replaced from the cache.

Patterns learned are reconstructed into a reconstruction buffer. Prefetches are made by issuing a certain number of requests for addresses in this reconstruction buffer. If a certain number of prefetches issued from the reconstruction buffer are successful, further requests from the buffer are issued. Streams of prefetches are ordered based on their prefetch hit activity. The least recently used stream is chosen as a victim when a new stream needs to be allocated. More details on the operation of STeMS can be found in the original paper [83]. Our implementation of STeMS is located at the DRAM cache level, rather than at the L1 cache level originally proposed in their work. Although this prefetcher is oddly positioned at the LLC rather than the L1 cache, we need to place it here in order to learn the access patterns to the LLC, rather than the overall application access pattern. At this level, the definition of a generation would be too long to create patterns, since a block may not be evicted for an extremely long period of time. We change this by learning patterns up to a fixed length. Patterns are similarly rebuilt into a reconstruction buffer. Streams are deallocated if prefetches do not result in a sufficient number of hits. Patterns that successfully prefetch a majority of their blocks may be extended to include longer streams. If a pattern is not successful, a duplicate of the pattern at the same program counter may be created. We also change the algorithm slightly by weighting the patterns based on success. If a pattern is not successful, other patterns are checked for the unsuccessful address. If a pattern containing the unsuccessful address is found, it is loaded into the reconstruction buffer, replacing the previous pattern. To help reduce the number of redundant prefetches, we search the DRAM cache's MissMap vector to find blocks that already exist in the DRAM cache. The MissMap holds the existence of 64-byte cache lines in a 4KB region. Here we assume most prefetches are issued for addresses relatively close to the trigger address, meaning we can use information from the MissMap to filter redundant prefetches. Note that we do not make additional MissMap lookups when filtering requests, but instead use the bit-vector from the demand request's lookup. We can tune the aggressiveness of this prefetcher in two ways. We may increase the number of addresses fetched from the reconstruction buffer or decrease the number of prefetches which need to be successful before further requests from the buffer are issued. In our experiments we use moderate values for each of these knobs. The number of addresses fetched is limited to 4 and the number of prefetches which need to be successful is 2.
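A sketch of the redundant-prefetch filter described above: candidates close to the trigger address are checked against the 64-bit presence vector already fetched for the demand request's 4KB region, so no extra MissMap lookups are needed. Constants and names are illustrative assumptions.

```cpp
#include <cstdint>
#include <vector>

constexpr uint64_t kRegionBytes = 4096;
constexpr uint64_t kLineBytes   = 64;

bool lineResident(uint64_t presentVector, uint64_t addr) {
    unsigned idx = (addr % kRegionBytes) / kLineBytes;  // which 64B line in the region
    return (presentVector >> idx) & 1ULL;
}

std::vector<uint64_t> filterPrefetches(const std::vector<uint64_t> &candidates,
                                       uint64_t triggerAddr,
                                       uint64_t presentVector) {
    std::vector<uint64_t> issued;
    for (uint64_t addr : candidates) {
        bool sameRegion = (addr / kRegionBytes) == (triggerAddr / kRegionBytes);
        if (sameRegion && lineResident(presentVector, addr))
            continue;                    // already in the DRAM cache: skip
        issued.push_back(addr);          // outside the region or not resident: issue
    }
    return issued;
}
```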

4.3.4 Benchmark Selection

The total memory traffic, which includes both reads and writes to memory, is given in Table 4.2. This traffic refers to data leaving the L2 cache in our setup. This memory traffic will become traffic to and from our DRAM last-level cache. Benchmarks with at least a few hundred megabytes of memory traffic are selected for simulation. These benchmarks are marked with an asterisk in the table. We choose these benchmarks since they have large amounts of data being read and written from the L2 cache. Some traffic sizes are more than twice as large as the DRAM cache. This large data set helps to ensure the DRAM cache is exercised and performs all standard cache operations (i.e., fills, accesses, evictions).

Figure 4.8: Speedup of DRAM cache with Fill Cache over the DRAM cache baseline. Results for the baseline, STeMS baseline, and combined approaches are shown. The y-axis represents the speedup over the baseline DRAM Cache with MissMap only, and the x-axis is the benchmark.


4.3.5 Baseline Results

Figure 4.8 shows the results of our simulation runs. The baseline runs consist of two different designs for the DRAM cache based on designs in previous works. The first baseline is a DRAM cache alone, without a MissMap or Fill Cache. The second baseline adds in a MissMap, but does not use a Fill Cache. Our simulations include our optimized design with a Fill Cache. For reference, simulation results where a prefetcher is used without a Fill Cache are also provided. As seen in Figure 4.8, we find that the memory system with the MissMap is more effective than the original baseline, similar to [79]. This makes sense since requests that miss in the DRAM-cache-only baseline result in two requests going to the DRAM cache. The first request results in a miss, which is sent to main memory. Upon returning from main memory, the request must now be filled in the DRAM cache. Adding the MissMap to the subsystem allows misses to go straight to main memory. For our Fill Cache design, we measured an IPC improvement between 16% and 75% for our memory-intensive benchmarks. Simulation was also done on non-memory-intensive benchmarks. Some of these benchmarks improved by a few percent; however, their working sets are simply too small to see major benefit like our memory-intensive benchmarks. As for the more computationally bound applications, we did not find any performance degradation from our Fill Cache architecture.

4.3.6 Average Request Latency

The main sources of performance increases at the main memory level generally stem from decreased request latencies. Prefetching helps to hide memory latency in general. MissMaps work in nearly the same way, decreasing the latency of requests by routing compulsory misses directly to main memory.

Figure 4.9: Average DRAM cache request latency for the first 100 million execution cycles, showing how the Fill Cache smooths out request latency during high-memory-traffic periods (e.g., during application start). The darker line is the average request latency for the DRAM Cache + MissMap design and the lighter line is for the DRAM Cache + MissMap + Fill Cache design. Average latency (y-axis) and execution progress (x-axis) are in memory cycles. Panels: (a) bwaves, (b) lbm, (c) mcf, (d) leslie3d.

However, these types of latency hiding involve a lot of work that is done in the background by caches to ensure that the next request to an address will be a hit. Prefetch requests and MissMap misses must be filled in the DRAM cache after the request has already returned to the higher-level cache that requested the data. Combining the MissMap and prefetching can result in requests flooding the DRAM cache with fills, increasing the latency of both fill requests and demand reads. In our work the main sources of decreases in average request latency are coalescing, write buffering, and dropped/re-routed requests. Figure 4.9 shows the average latency of requests, comparing the baseline DRAM cache with MissMap to the DRAM cache with Fill Cache design. For space we show four applications and their average latency in memory cycles over the first 100 million cycles after fast forwarding. We notice that in all cases our Fill Cache design has much smoother latency than designs without it. In leslie3d, the number of requests to the DRAM cache becomes extremely high. The queuing latency increases proportionally, and over time shorter requests lower the overall average latency. The Fill Cache design avoids this backup by re-routing requests to main memory. Write buffering of fill requests also helps since this application has a high hit rate in the DRAM cache. In mcf the pattern has a large burst of memory requests at the beginning. The Fill Cache drops several of the successful prefetch requests at this point.

Figure 4.10: Accuracy and number of covered and uncovered requests issued to the DRAM cache by the prefetcher.

The bwaves application is fairly smooth in both cases. Reduction in latency comes from coalesced prefetches and write-buffering to prioritize reads. lbm benefits from write-buffering to prioritize reads.

4.3.7 Prefetcher Effectiveness

The effectiveness of the prefetcher is measured using the number of covered and uncovered memory accesses as well as the accuracy. Results of these measurements are displayed in Figure 4.10. Coverage for our applications is quite low. On average, coverage is 19%, ranging from 5% to 35%. Determining prefetch behavior at the last-level cache can be more difficult than determining behavior at the L1 cache. Despite the low coverage, the number of prefetches issued in some of the more prefetch-amenable applications such as bwaves, GemsFDTD, gobmk, and milc is enough to cause increased queuing latencies in the DRAM Cache + Prefetching simulation runs. With the Fill Cache added, these queuing latencies are decreased and the number of coalesced cache fills is increased significantly.

4.3.8 Set Indexing Effectiveness

Using set indexing alone results in a large increase in the number of row buffer hits in the DRAM cache banks. With set indexing and prefetching enabled, the number of row buffer hits in the DRAM cache increases from 0.44% to 8.2% on average. The performance results of set indexing do not noticeably change in any of the benchmarks, however. This result is similar to normal cache indexing due to the number of banks available to the DRAM cache. In a few applications, such as bwaves, queuing latency can be increased at certain points in the program. In bwaves, the number of requests mapping to the same bank happens to increase due to shifting the mask for the bits used. Despite having no noticeable impact on performance, modifying the set indexing allows for coalescing of requests from the Fill Cache and generally saves energy due to decreased activations.

Figure 4.11: Percentage of fill requests able to be coalesced.

4.3.9 Coalesced Requests

A significant performance increase for our benchmarks comes from the increase in coalesced requests. Using set indexing and prefetching alone, the row-buffer hit rate increased to 8.2%. With our Fill Cache's coalescing, this hit rate is increased even further. On average, 28.53% of requests are coalesced. The percentage of coalesced requests ranges from 15.44% in bwaves to 54.09% in gobmk. This result is shown in Figure 4.11 for all our benchmarks. Since fill requests can be generated from read misses as well as write requests, we see major benefit across all applications rather than only write-intensive applications.

4.3.10 Sensitivity of Fill Cache Size

In order to further explore the benefit of using a fill cache, we also test the combined implementation with the STeMS prefetcher using a smaller 32KB fill cache. In theory the fill cache benefits as long as it is at least as large as the size of the DRAM cache queue. Larger sizes allow requests to potentially hit in the fill cache even after the request is filled to the DRAM cache. More requests hitting in the fill cache also allows for more interesting drop policies to remove less useful fill requests. We compare these two size points using the number of evictions before filling the request, the number of requests served from the fill cache, and the number of requests dropped from the fill cache. The results from the combined approach use a 128KB fill cache. Across the workloads tested, 99.98% of all requests in the fill cache were evicted after they had been filled in the DRAM cache. If a request is removed from the Fill Cache before it is filled, the request will simply be issued to the DRAM cache, but this will result in increased overall latency. With the smaller 32KB fill cache, 99.96% of all requests are evicted after they have already been filled in the DRAM cache. The major differences between the two sizes are the number of dropped requests and the percentage of requests that are served directly from the fill cache.

Figure 4.12: Percentage of read requests where data was returned by the Fill Cache.

The larger 256KB fill cache served an average of 16.74% of all requests to main memory, while the number of requests served from the smaller 32KB cache was 14.28% on average. Read requests served from the Fill Cache across all applications are shown in Figure 4.12. In the larger cache, prefetches account for 95% of requests served from the fill cache, while in the smaller cache prefetches account for 99% of requests served. Because of this, Figure 4.12 is close to the prefetch coverage times the prefetch accuracy shown in Figure 4.10. This implies that the effectiveness of the fill cache does not increase much as the fill cache size increases, so smaller fill caches may also be viable.

4.3.11 Application Classification

Finally, based on our analysis we classify the benchmarks that were run into categories of applications and describe how the Fill Cache helps each type of application gain performance. We choose three categories that summarize the types of applications simulated: High Hit Rate applications (i.e., read-intensive with successful reads), Write-Intensive applications, and Prefetch Amenable applications. We define High Hit Rate to be a DRAM cache read hit rate of over 40%, Write-Intensive to have more than 25% writes, and Prefetch Amenable to have a high prefetch coverage. The categories for each of the benchmarks run are shown in Table 4.3 below. The Fill Cache helps High Hit Rate applications by prioritizing reads to the DRAM cache and by re-routing read requests for known clean blocks to main memory. An example of a High Hit Rate application is leslie3d. The hit rate for reads in this application is 76.7% and read requests constitute 78.3% of all requests. Write-Intensive applications benefit from coalescing due to the modified set indexing. Write buffering in these applications also helps prioritize read requests first. An example of write-intensity can be seen in the bwaves application.

Table 4.3: Categories of benchmarks run in our simulations. Some benchmarks belong to multiple categories.

High Hit Rate:      leslie3d, libquantum, mcf
Write-Intensive:    bwaves, GemsFDTD, lbm, leslie3d, libquantum
Prefetch Amenable:  bwaves, GemsFDTD, gobmk, milc

In this application, the number of write requests is 47.6% of all requests. Approximately 34.6% of fill requests are coalesced. Prefetch Amenable applications also benefit from coalescing and can benefit from request dropping and fast access to prefetch hits from the Fill Cache. One such prefetch-amenable application is milc. Milc coalesces 50.4% of fill requests and serves 99% of prefetch hits from the Fill Cache. Based on our results, we expect similar applications with these characteristics to benefit from Fill Cache enabled DRAM caches.

4.4 Conclusion

In this chapter we presented techniques to further optimize a high-latency 3D DRAM cache. Specifically, problems that can arise from the interaction between fill requests to the cache and read requests are addressed. We observed that naive issuing of requests results in increased latency in several memory-intensive applications. A Fill Cache which manages fill requests and a modified stacked DRAM cache architecture were proposed to help with this problem. Results showed that the functions of the fill cache allowed for an average 1.18x speedup over our baseline DRAM cache with MissMap. Prioritization of read requests, write buffering and coalescing of fills, and allowing requests to be re-routed or dropped all contributed towards a more effective stacked DRAM cache. From our analysis we created categories of applications and ways they can benefit from the fill cache.

Chapter 5

Leveraging Non-volatility Properties for High Performance, Low Power Main Memory

DRAM has been the de facto technology of main memory for decades. As the DRAM technology is hitting various fundamental scaling limits, energy and capacity scaling in DRAM has become very difficult [84, 20, 1]. Alternatively, emerging memory technologies such as phase-change memory (PCM), resistive RAM including Memristor-based memory (RRAM), and spin-transfer torque RAM (STT-RAM) can potentially provide significant capacity, energy-efficiency, and performance benefits over conventional DRAM technologies [85, 5, 86, 87]. As the scalability of non-volatile memories (NVMs) is predicted to extend well into the future [1], these technologies may eventually have a lower cost per bit ($/b) than DRAM. Even though these emerging NVM technologies are promising, some new challenges, such as the read and write latency as well as the cell endurance (i.e., the number of writes during the lifetime), must be addressed before they can be adopted as mainstream technologies. Several prior works have explored the opportunity of mitigating these problems by proposing various write techniques that exploit properties of the memory cells. For example, multiple row buffers are evaluated in PCM-based main memory [5]. Write pausing and write cancellation are proposed to mitigate the large write latency [4]. In addition, suppressing duplicate writes, write-buffering, and even hybrid memory systems have been developed to relax the constraint of write latency/power [86, 88, 89, 90]. These prior works, however, still rely on the traditional memory bank structure and thus fail to exploit the benefits of intra-bank optimization. In contrast, in this work we propose a novel intra-bank optimization to improve the read and write performance of NVM memories.

Figure 5.1: Potential Energy Savings in NVM by opening 1/Nth of a Row Buffer.

5.1 Introduction

In general, memory systems are hierarchically divided into channels, ranks, and banks. A bank is the atomic unit that can be accessed independently and is organized as a series of rows and columns. As a result of this atomicity, bank-level parallelism has been extensively explored to improve memory performance. However, a bank can be further divided into multiple subarrays. Some researchers have realized that the coarse granularity of banks inevitably limits the amount of parallelism available to memory. Kim et al. have studied subarray-level parallelism in DRAM [38], and Udipi et al. have verified the benefit of accessing multiple subarrays in parallel [40]. These approaches, however, cannot be adopted directly in non-volatile memories. In [40] the authors propose to subdivide bank rows and even access only single subarrays. Unfortunately this technique requires massive area overhead, ranging from a 12.5% up to a 100% increase. In [38] the authors leverage the local sense amplifier design of DRAM to allow multiple rows to be open at once (so-called MASA). They briefly discuss the energy implications of MASA, which are negligible in DRAM. In NVMs, the energy consumption of an activation is much greater than that of DRAM. It is also costly for NVMs to keep multiple local sense amplifiers open since the energy is not dominated by background energy (i.e., near-zero leakage power, no refresh power). As a result, naively reusing MASA in NVMs can dramatically increase the power footprint, which is neither affordable nor acceptable in terms of power and area. Furthermore, it is possible to have only global sense amplifiers in NVMs, in which case the idea cannot be applied at all. Therefore, a fine-granularity NVM should be redesigned to address the issue. Figure 5.1 shows the potential energy savings in NVM by reading only a portion of a memory row. Even small segmentation of a row in NVMs can lead to enormous energy savings, nearly proportional to the percentage of the row that is opened¹. However, in typical DRAM memories, such segmentation of rows results in too much area overhead to be feasible [40].

¹This figure includes read, write, and background energy consumption.

Fortunately, according to our observation, the fine-grained (byte-level) addressing and non-volatility of emerging NVMs allow for such segmentation without a significant increase in die area. Our goal in this work is to leverage the non-volatility to enable intra-bank access in non-volatile memories at low energy cost. Our proposed design, Fine-granularity Non-Volatile Memory (FgNVM), splits the subarrays of memory into multiple subarray groups and column divisions. The FgNVM design eliminates the restriction that only one row in each bank may be opened at any given time and that all the data in this row be sensed. As a consequence, FgNVM provides the ability for three new types of memory accesses to the same bank. The first type is Partial-Activation, which only activates 1/Nth of a row. The second access type combines multiple Partial-Activations so that portions of different rows are activated as one heterogeneous row buffer, which is called Multi-Activation. The final access type is referred to as Backgrounded Writes, which allows reads and writes to be performed in parallel by taking advantage of Multi-Activation.

5.2 Motivation

In this section we present background information about the structure of prototype NVM memory banks. This also explains the baseline design we are comparing against in future sections. We then present opportunities to exploit these designs in a new fashion for interesting optimizations. These optimizations are the basis of our proposed design in the next section.

5.2.1 Non-Volatile Memory Design

Main memories are typically laid out as a series of memory cells in rows and columns as shown in Figure 5.2. The simplest NVM memory structure is referred to as a tile; in this example it consists of 8 cells across 2 rows and 4 columns as well as peripheral circuitry. Multiplexers (MUX) connect multiple bitlines to a single I/O line. Recent prototypes refer to these multiplexers as Y-select, and there may be a Local Y-select for tiles and a Global Y-select for the entire bank. Sense amplifiers (S/A) are used to compare the NVM cell resistance against a reference to determine the stored data value. When memory is read, a wordline is selected and the shared multiplexer control signals (MUX CTL) are sent to the multiplexers to select the correct column for output. The sense amplifiers are used to determine the memory data more quickly, and also function as a row buffer to hold the sensed data. For write operations, the sense amplifiers are typically paired with write drivers to change the stored data value. Real tiles are much larger than 2 rows by 4 columns, normally ranging from 512 rows by 512 columns up to 4K rows by 4K columns. Memory banks consist of a large matrix of tiles (e.g., 32 x 32). At this much larger scale, global wordlines are also used to drive the wordlines in each tile in a subarray. Global bitlines connect to all multiplexer outputs of each tile. As a result, the row buffer size is increased by the number of row-wise tiles in a bank. It is important to note that sense amplifiers in NVMs are typically much larger than the width of a single column of memory. Several prototypes have sense amplifier sizes measuring between 10⁴F² and 10⁵F² in area [10, 11, 12, 13], where F² is the minimum feature size of the fabrication process.

Figure 5.2: The design of a typical NVM memory. Two NVM sub-arrays are illustrated. One sub-array is an 8x4 structure, where 8 local bitlines are selected by 4 local Y-select. The global Y-select further selects out the final I/O bitline that is sensed by the global sense amplifier (S/A).

Typical NVM cells range from 4 to 40F², which means the multiplexer must output one global bitline for every 4 to 32 local bitlines in a tile (i.e., a high degree of column multiplexing is required). As a result, the row buffer size is much smaller than the number of columns across an entire bank. Sense amplifiers may also differ in NVM designs depending on the technology used. NVM technologies with small low-resistance-state values (e.g., STT-RAM) require local sense amplifiers. Local sense amplifiers are placed in the tiles and their output is connected to the global bitlines. RRAM and PCM may have a high enough low-resistance state to utilize global sense amplifiers. This means the multiplexer output is connected to the global bitlines and there is only one set of sense amplifiers in the bank.

5.2.2 The Non-Volatility Property

One observation that can be made from Figure 5.2 is that multiple columns are connected to one or more multiplexers before the sense amplifier. When we select a wordline in an NVM design, we are simply changing the wordline voltage. As long as the multiplexer does not have any column selected, there is no path to the sense amplifier. The non-volatility property means that cell data will not be lost even if it is not sensed. This is not possible in DRAM because memory data is stored as a charge. Once the wordline is selected, data in a DRAM cell is lost to the bitline. As a result, data in DRAM must always be read by the sense amplifier so the data may be restored.

Figure 5.3: The non-volatility property allows reading of only the top-left cell in a tile. Neither of the right two columns is connected to the S/A.

A simple example of this concept is shown in Figure 5.3, a zoomed-in version of the previous example in Figure 5.2. The MUX for the two right columns can be disabled if we do not wish to sense cells in these columns. Disabled cells on the selected wordline effectively present infinite resistance, allowing sensing of the enabled cells without disturbing the disabled cells. Note that the memory design remains the same; we only duplicate the MUX control signals in order to control each MUX individually. This concept can be applied at the bank level. We can assume that within a memory tile the multiplexer control signals are shared, meaning the same inputs are selected for each multiplexer in a tile. Then, we can create individual MUX control signals for one or more groups of tiles at the bank level. This allows us to activate only a portion of the entire bank. This creates the possibility for energy savings by reducing overfetch, that is, reducing the amount of data sensed which is never read from the row buffer. Additionally, this frees the global bitlines to be used for other purposes, such as reading or writing a different row. Latency savings are also possible via increased parallelism of reads and hiding the latency of long write requests.

Figure 5.4: The access schemes proposed in FgNVM. (a) Partial-Activation: only one of the top two tiles is read from the bank; energy is saved in the other tile. (b) Multi-Activation: data is read from tiles in different rows of the same bank, with twice the bandwidth potential. (c) Backgrounded Write: the upper-left tile is read while the lower-right tile is written, reducing read/write interference.

5.3 Implementation

We explained how the non-volatility property allows for potential energy savings and reduced latency. Here we propose three new memory access types exploiting this property to create a Fine-granularity Non-Volatile Memory (FgNVM). First, energy can be saved by Partial-Activations, which reduce the number of cells read from NVM banks when not necessary. Next, we can utilize the freed bitlines in order to provide read parallelism via Multi-Activations. Finally, freed bitlines can also be used to mitigate write latency using Backgrounded Writes. In this section we give more detail about the three new types of access modes supported by an FgNVM design. In our explanation, groups of memory tiles in the same row are denoted as subarray groups (SAGs). Groups of memory tiles in the same column are denoted as column divisions (CDs). Note that these groups are logical and that a subarray group or column division may contain multiple rows or columns of physical tiles.

5.3.1 Partial-Activation

A Partial-Activation is a row activation in which only a subset of bits in the row are sensed. This approach allows for reduced energy consumption by only sensing a subset of data based on requests from the memory controller. A simple example of this operation is shown in Figure 5.4a. In the example, only the multiplexer control signals for the upper-left tile are enabled, as indicated by a black background on the column MUX. Only data bits from the upper-left array are sensed in this case, even though the wordline is selected across the entire upper row of tiles. Similarly, the bottom row has a wordline selected, but none of the multiplexer control signals are enabled. Therefore no data is sensed from the bottom tiles and they do not conflict with the sensing of the upper-left tile. In this case, we can reduce the sensing energy of reading this particular row by up to 50%.

5.3.2 Multi-Activation

Multi-Activation can fully make use of the freed bitlines to allow multiple rows to be selected and sensed in parallel². Figure 5.4b illustrates how Multi-Activation works. Compared to Figure 5.4a, this time the multiplexer control signals for the upper-left tile and the lower-right tile are enabled, as indicated by the column multiplexer blocks shaded black. In this way, the memory can sense data from different rows simultaneously. Again, the rest of the tiles (upper-right and lower-left in the figure) are still disconnected. Therefore, no data is sensed in those disconnected tiles and the sensing operation in the active tiles is not disturbed. In the baseline NVM design, accesses to two tiles must be serialized due to lack of parallelism. By leveraging Multi-Activation, two tiles can be accessed in parallel to improve memory bandwidth and latency. The time to perform two reads is reduced by a factor of two accordingly. Multi-Activation, however, still has two constraints. First of all, it cannot activate any other tile in the same CD as a tile being sensed. For instance, the upper-left and lower-left tiles in the figure cannot be sensed in parallel (and similarly for the upper-right and lower-right tiles). Additionally, if two rows are to be sensed but the rows reside in the same tile or SAG, the data cannot be sensed in parallel, as only one wordline can be selected within a tile or SAG at a time. These constraints limit the maximum subarray-level parallelism that can be achieved.
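The sketch below captures the legality check implied by these two constraints: two accesses may proceed in parallel only if they use different column divisions and do not require two different wordlines within the same subarray group. The Access struct is an illustrative abstraction of a scheduled request.

```cpp
struct Access {
    unsigned sag;   // subarray group holding the target row
    unsigned cd;    // column division holding the target columns
    unsigned row;   // wordline within the subarray group
};

bool canMultiActivate(const Access &a, const Access &b) {
    if (a.cd == b.cd)                         // a column division serves one tile at a time
        return false;
    if (a.sag == b.sag && a.row != b.row)     // only one wordline per SAG may be selected
        return false;
    return true;                              // different CDs, compatible wordlines: parallel sense
}
```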

5.3.3 Backgrounded Writes

Backgrounded Writes operate similarly to Multi-Activation except that sensing is replaced with write driving. An example of a backgrounded write is shown in Figure 5.4c. Different from Multi-Activation, the lower-right tile is now being written instead of read. In the example, the upper-left and lower-right tiles are both selected, as indicated by the column MUX shaded black. While a time-consuming write operation to the lower-right tile takes place, the upper-left tile is still available to be accessed. As shown, a read operation takes place while the write operation is still incomplete. Similar to Multi-Activation, it is prohibited to read from a tile in the same CD that is being written. Furthermore, the subarray group is also unavailable until the write completes. In this simple bank (2×2 tiles), only one tile is available for reads. Nonetheless, for more realistically sized banks such as a 32×32 tile bank, the remaining 31×31 tiles are still available for up to 31 reads. In that case, approximately 93.8% of the data in the bank is still able to be accessed during a backgrounded write operation. In general, the fraction of tiles available in an n by m tile bank can be determined by Equation (5.1).

(n − 1)(m − 1) / (nm)    (5.1)

²In the case of a single shared command bus, subsequent activations must wait for one command cycle.

Figure 5.5: Ganged subarrays use multiple SAGs accessed in parallel to increase row buffer sizes.

5.3.4 Ganged Subarray Groups

One major concern that arises with NVMs is the smaller row buffer size due to the high degree of column multiplexing described in Section 5.2.1. This is especially true when local sense amplifiers are used, for example in STT-RAM. In order to alleviate this concern and allow the FgNVM concept to be applied, we can use ganged tiles across different subarrays. For this approach, multiple SAGs share the same row decoder signals and multiplexer control signals. By performing a read or write operation across a gang of tiles, we can increase the number of data bits that can be read from a SAG. For large subarray gangs, this increases the scope of the FgNVM idea, since it can be applied to all types of NVM memories. An example of ganged subarrays is shown in Figure 5.5, where each gang consists of 2 SAGs. If we assume each tile provides 16 bits, the baseline case of reading from one subarray provides 64 bits – enough for only one cache line³. With just one cache line per row buffer, it is not desirable to have more than one CD. However, with ganged subarrays, we can combine the outputs of two subarrays to provide up to two cache lines and thus up to two CDs. It is worth noting that the number of sense amplifiers used in ganged subarrays per cache line is the same as in the unganged approach, and therefore the additional sensing energy is negligible. However, in our example, the grouping of subarrays allows for up to 2 CDs by reducing the maximum number of SAGs. Furthermore, the area overhead of running additional I/O lines is also negligible, because it is offset by the reduction in MUX control signals, which are now shared per gang.

³Note that typical memory modules use multiple devices operating in parallel to provide 64 bytes.

Table 5.1: MPKI and WPKI of Simpoint slices.

Benchmark     MPKI    WPKI
astar         31.72   12.69
bwaves        17.33    6.04
GemsFDTD      49.29   24.48
gobmk         15.12    7.38
leslie3d      18.94    4.18
lbm           54.77   23.33
libquantum    54.84   15.73
mcf           68.88   13.69
milc          23.24    5.15
soplex        18.40    5.67
zeusmp        30.97    6.07

5.4 Published Results

CPU Simulation

We evaluate our design using the gem5 [18] CPU and cache simulator. Our experimental setup models a Nehalem-like CPU [91] as shown in Table 5.2. We evaluated single-threaded benchmarks from the SPEC2006 benchmark suite [92]. Simpoint [93] is used to identify a single region of interest, one quarter billion instructions in size, within each benchmark. All of the SPEC2006 benchmarks were profiled using Simpoint, and we selected benchmarks with at least 10 misses per kilo-instruction (MPKI) at the last-level cache. Table 5.1 overviews the MPKI and WPKI of our selected benchmarks. We use gem5 in system emulation mode and restore from a checkpoint to run a single Simpoint slice.

Memory Simulation

We use an in-house simulator, verified against a Micron VerilogHDL model [55] and DRAMPower2 [23], as the baseline to simulate our FgNVM design. Since the FgNVM design does not limit scheduling algorithms, we utilize a first-ready first-come-first-serve (FRFCFS) memory controller scheduler [94]. We discuss the implications for the memory controller in more detail in Section 5.4.1. Our simulated memory is an FgNVM design based on the PCM baseline prototype in [95]. We consider this PCM design as it is one of the more mature prototypes available, includes timing parameters, and has density comparable to contemporary DRAM devices. Later in this chapter we consider the use of Ganged Subarrays on an RRAM memory design. The overall memory system design is similar to the standard main memory hierarchy. We assume memory is an off-chip dual in-line memory module (DIMM) divided into channels, ranks, and then banks. Each rank is assumed to have 8 PCM devices, each of which has a global 1KB row buffer and provides 8 bits of output over 8 DDR cycles to supply one 64B cache line per column command. Table 5.2 outlines our memory parameters for what is referred to as "FgNVM" throughout the rest of this chapter. We choose a mid-ranged FgNVM with 8 SAGs and 2 CDs as a starting point so that we may reasonably compare against multiple independent banks. Section 5.4.4 shows results using different values of SAGs and CDs.

Table 5.2: Experimental System Setup

Processor: 4 cores, 2 GHz, Out-of-Order; 128 ROB entries, 5 issue width; 48 LQ size, 32 SQ size
L1I-cache: 2 cycle latency, 32kB, 2-way; 4 mshr, degree 2 tagged prefetcher
L1D-cache: 5 cycle latency, 32kB, 4-way; 16 mshr, degree 2 stride prefetcher
L2-cache: 8 cycle latency, 256kB, 8-way; 16 mshr, degree 4 stride prefetcher
L3-cache: 24 cycle latency, 4MB, 16-way; 16 mshr, degree 8 stride prefetcher
Main Memory: 4GB PCM memory, 400 MHz, 4 channels; 1 rank, 8 banks, 16K rows, 8K columns; 1024-bit row buffer, FRFCFS; 64 write drivers, 32 queue entries; 2 column divisions, 8 subarray groups
PCM Timings: tRCD=25ns, tCAS=95ns, tRAS=0ns, tRP=0ns, tCCD=4cy, tBURST=4cy, tCWD=7.5ns, tWP=150ns, tWR=7.5ns
DRAM Timings (Section 5.4.11): 533 MHz: tRAS=37.5ns, tRCD=13ns, tRP=13ns, tCAS=13ns; 800 MHz: tRAS=35ns, tRCD=13ns, tRP=13ns, tCAS=13ns

Experimental Results

Results of our specified design are shown in Figure 5.6. These results show IPC improvements normalized to the NVM baseline. We compared the results of an 8 subarray group, 2 column division FgNVM design against a memory system with 128 banks per rank. Each bank is sized to be the same as any (SAG,CD) pair. Based on our configuration in Table 5.2 this equates to the same amount of accessible memory arrays. Remember that an SAG being paired with a CD prevents all tiles in that CD from being accessed until the previous access is complete. Other columns may access that SAG assuming the same row is being accessed. Because of this, the 128 bank design will have more overall free memory arrays than FgNVM.

Perfect Memory

Some of the results shown are compared with perfect memories. The perfect memory designs assume a row buffer hit always occurs. This is useful in Section 5.4.4 to show performance trends if the memory controller were to have knowledge of the future. Results from perfect memory designs are labeled as N×M Perfect in figures throughout the remainder of the paper.

Figure 5.6: IPC improvement over baseline PCM design compared to FgNVM, multi-issue FgNVM, and an Ideal Scenario. All results show 8×2 FgNVM designs.

5.4.1 Memory Controller and Scheduling

Subdividing memory banks into nearly independent tiles increases the complexity of the memory controller and of potential scheduling. For each SAG that is added, the memory controller requires a counter to know which wordline is selected, if any, and at what time. Additional CDs need to keep track of which SAG is connected to their bitlines, if any. Each CD also tracks when the next read and write can be issued to it. If a read is issued to a CD which is not connected to any SAG, the CD must first sense the data before reading. This occurs in tCAS (column address strobe) time. If a read is issued and the data is already sensed, the timing is tCCD (column-to-column delay) time to read from the row buffer. If the SAG is not connected to the appropriate wordline, the read cannot be issued. Write commands only require that the CD is connected to a SAG with the appropriate wordline selected. Activates may occur when no write command or sensing operation is in progress. Activations change the selected wordline, which means any data in associated row buffers is lost. Our work assumes write-through operation for writes, and that no precharge command is necessary.
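Below is a sketch of the extra per-SAG and per-CD bookkeeping described above. The struct fields, the sagId parameter, and the simple cycle comparisons are illustrative assumptions; the real controller also honors all device and bus timings.

```cpp
#include <cstdint>

struct SagState {
    bool     rowOpen  = false;
    unsigned openRow  = 0;      // which wordline is currently selected
    uint64_t openedAt = 0;      // cycle of the activation
};

struct CdState {
    int      connectedSag = -1; // SAG currently driving this CD's bitlines (-1: none)
    uint64_t nextReadAt   = 0;  // earliest cycle a read may be issued
    uint64_t nextWriteAt  = 0;  // earliest cycle a write may be issued
};

// Returns true if a read to (sagId, row) may be issued now; latency is tCCD if
// the data is already sensed onto this CD, or tCAS if it must be sensed first.
bool canIssueRead(unsigned sagId, unsigned row, uint64_t now,
                  const SagState &sag, const CdState &cd,
                  uint64_t tCAS, uint64_t tCCD, uint64_t &latency) {
    if (!sag.rowOpen || sag.openRow != row)
        return false;                          // wrong or no wordline selected: activate first
    if (now < cd.nextReadAt)
        return false;                          // CD still busy with an earlier command
    latency = (cd.connectedSag == static_cast<int>(sagId)) ? tCCD : tCAS;
    return true;
}
```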

5.4.2 Multi-Issue Memory Controller

Our original experiment assumed an FRFCFS memory controller. This scheduling scheme is suitable because we are limited by the off-chip bus, which assumes only one request can be sent per cycle. That means that, so far, any multi-activation is not truly parallel but is actually sent back-to-back in different cycles. Therefore, our FRFCFS operates the same as always: one command per cycle; there are simply more constraints to check.

To explore the possibility of simultaneous request issue, we simulated a wide memory bus which duplicates control and data signals. In this bus design we allow up to 4 memory requests to be issued at any time. We implemented a Multi-Issue Memory Controller which operates normally using an FRFCFS algorithm, searching each queue in round-robin fashion and issuing all non-conflicting requests which are ready. Unlike normal operation, we also allow the round-robin to search the next queue during the same cycle. This process is repeated until either the maximum of 4 requests is found or all queues have been searched in this cycle. The FgNVM+Multi-Issue series in Figure 5.6 shows the IPC results of this simulation. This interconnect and memory controller modification gains an additional 6.2% performance over the single-issue FgNVM design. For the rest of this chapter, all FgNVM results use this multi-issue memory controller and wide memory bus.
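A minimal sketch of this multi-issue selection loop follows, for illustration only: the queue layout, the Request fields, and conflicts_with() are assumptions rather than the dissertation's simulator code, and the FRFCFS ordering is simplified to row-buffer hits first, then oldest first.

from dataclasses import dataclass

MAX_ISSUE = 4  # the wide bus allows up to four commands per cycle

@dataclass
class Request:
    arrival: int        # cycle the request entered the queue
    row_hit: bool       # would this request hit an already-sensed row buffer?

    def conflicts_with(self, other: "Request") -> bool:
        return False    # placeholder: a real check compares SAG/CD/bank resources

def select_for_issue(queues, start):
    """Round-robin over per-bank queues, FRFCFS within each queue,
    stopping once MAX_ISSUE non-conflicting ready requests are found."""
    selected = []
    for i in range(len(queues)):
        queue = queues[(start + i) % len(queues)]
        for req in sorted(queue, key=lambda r: (not r.row_hit, r.arrival)):
            if not any(req.conflicts_with(s) for s in selected):
                selected.append(req)
                break            # take at most one request from this queue per pass
        if len(selected) == MAX_ISSUE:
            break
    return selected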

Figure 5.7: Performance difference with and without interleaving on rows and columns: Shows that interleaving rows and columns is beneficial to FgNVM designs.

5.4.3 Address Interleaving

Another simple optimization our memory controller makes use of is address interleaving, which creates more intra-row and intra-column parallelism. By interleaving rows, we place each memory row in a SAG based on the modulo of the row address, rather than using the address as a direct index. This places spatially local rows in different SAGs, which allows portions of each row to be open simultaneously. Likewise with CDs, interleaving uses the modulo of the column address. In theory this allows cache lines to be striped across CDs, allowing them to be accessed in parallel.

Results for interleaving are shown in Figure 5.7 using an 8×8 FgNVM design. Row interleaving provides a significant boost in performance in most cases. Column interleaving did not make a large impact in most benchmarks, even though in theory sequential cache lines are striped across CDs rather than residing in the same CD as the previous address. The rest of the experiments in this chapter use the combined row and column interleaving in their results.
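A small sketch of the modulo-based mapping described in this section; the SAG and CD counts below match the 8×8 configuration used for Figure 5.7, but the exact bit fields of the address mapping are an assumption made for illustration.

NUM_SAGS = 8
NUM_CDS = 8

def map_address(row: int, column: int):
    """Interleaved mapping: adjacent rows land in different SAGs and adjacent
    cache-line columns are striped across different CDs."""
    sag = row % NUM_SAGS             # modulo of the row address, not a direct index
    local_row = row // NUM_SAGS
    cd = column % NUM_CDS            # modulo of the column address
    local_col = column // NUM_CDS
    return sag, local_row, cd, local_col

# Example: rows 0..7 map to SAGs 0..7, so spatially local rows can be open at once.
print([map_address(r, 0)[0] for r in range(8)])   # [0, 1, 2, 3, 4, 5, 6, 7]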

5.4.4 Number of Column Divisions and Subarray Groups

The number of SAGs and CDs for our baseline design in Figure 5.6 was 8 and 2, respectively. Here we run two sweeps over the number of CDs and SAGs in our FgNVM design.

Figure 5.8: IPC impact of adding more subarray groups averaged over the SPEC benchmarks (speedup series: 2×8, 8×8, 32×8 and their Perfect variants; SLP shown as lines on the right axis): Not much performance improvement from dramatically increasing subarrays, with a maximum of around 4% in libquantum, gobmk, GemsFDTD, and soplex.

These results are normalized to a control design with a fixed number of either 8 CDs or 8 SAGs. Figure 5.8 shows the results with a fixed number of 8 CDs, sweeping from 2 up to 32 SAGs. Assuming 64K rows per bank in our baseline design and 2048 rows per subarray as in [95], 32 is the maximum number of SAGs without further reducing the size of tiles. Two SAGs is the minimum, otherwise there is no opportunity for SLP at all. The IPCs relative to 8 SAGs (our baseline) range from an average of 39% performance improvement with only 2 SAGs to a 41% IPC improvement with 32 SAGs. Across the benchmarks we observed very little benefit in this design from dramatically increasing the number of SAGs. In the best cases, libquantum, gobmk, GemsFDTD, and soplex realized a 4% improvement when increasing from 2 to 32 subarrays.

This small increase is due to the limited amount of SLP exhibited in single-threaded benchmarks. The lines in Figure 5.8 show the average SLP across all banks for each benchmark. SLP is defined as the average number of outstanding requests in a bank at any given simulation cycle. This average is only sampled when at least one request is outstanding in the bank. Based on this figure, none of the benchmarks average more than two outstanding requests at a time. Additionally, the SLP does not dramatically increase as the number of SAGs increases, suggesting the applications themselves have low SLP. This observation is also true in the case of perfect memory.

Figure 5.9 shows results from adding additional CDs to the design. This figure sweeps from 2 up to 32 CDs with a fixed number of 8 SAGs. Using the design from [95], 32 is the maximum number of CDs possible. At least 2 CDs are necessary to provide SLP using the Multi-Activation concept. Results for this sweep are much different from those for increasing the number of SAGs. Most benchmarks decrease in performance with increasing numbers of CDs; however, in the perfect memory case the performance increases. Increasing the number of CDs, while allowing for Partial-Activations, can have a detrimental impact on the number of row buffer hits since less data is fetched. For example, with 32 CDs, each CD contains less than one cache line. This is explained in further detail in Section 5.4.7.
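For clarity, the SLP metric defined above can be computed as in the following sketch; the per-cycle trace format is an assumption used only for illustration.

def average_slp(outstanding_per_cycle):
    """outstanding_per_cycle: request counts for one bank, one entry per cycle.
    Cycles with zero outstanding requests are excluded from the average."""
    busy = [n for n in outstanding_per_cycle if n > 0]
    return sum(busy) / len(busy) if busy else 0.0

print(average_slp([0, 0, 1, 2, 1, 0, 3]))   # 1.75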

Figure 5.9: IPC impact of adding more column divisions averaged over the SPEC benchmarks (series: 8×2, 8×8, 8×32 and their Perfect variants): Large difference in IPC when adding more column divisions; the ideal case moves opposite to the real case.

Figure 5.10: The 'No Backgrounded Writes' bars show IPC when Backgrounded Writes are disabled. The remaining bars show the IPC improvement when Backgrounded Writes are enabled. Up to a 12% increase for lbm and 5% on average.

5.4.5 Impact of Backgrounded Writes

Another interesting portion of the FgNVM design is the availability of Backgrounded Writes, where writes can complete in one (SAG, CD) pair while different (SAG, CD) pairs are still accessible. To outline the effectiveness of Backgrounded Writes, we compare the results without Backgrounded Writes to our main results, which include Backgrounded Writes. Figure 5.10 shows the IPC results of Backgrounded Writes normalized to the baseline results. In some cases, the addition of Backgrounded Writes has a significant impact on performance, up to 12% and 5% on average. Without Backgrounded Writes, the SPEC benchmarks average around a 32.3% IPC improvement. Backgrounded Writes increase this improvement by another 6.1% to achieve a total of 38.4% improvement over the SPEC benchmarks. Similar to the case of increasing subarrays, the SLP is low enough that reads do not normally occur during write operations. We additionally count the number of read operations that would have been blocked due to a write operation in progress. For the 8×2, 8×8, and 8×32 designs, an average of 4.7%, 12.7%, and 13.0% of reads are blocked due to writes without backgrounding, respectively. Also note that in our baseline prototype [95], the write time is only 25% longer than a read command.

Figure 5.11: Energy consumption normalized to the baseline NVM prototype (series: 8×2, 8×8, 8×32, 8×32 Perfect). Shows a significant decrease in most FgNVM configurations.

5.4.6 Energy Comparison

To see how well our design achieves our goal of saving energy via Partial-Activations, we perform an energy comparison of the FgNVM designs by sweeping over a varying number of CDs. We assume a read operation consumes 2pJ per bit, a write consumes 16pJ per bit, and background power averages 0.08pJ per bit of memory. Results are shown in Figure 5.11 and normalized to the NVM baseline. In the NVM baseline, we assume the entire row buffer is sensed during an activation. Therefore, 1KB of data must be sensed, compared to 512B for 8×2, 128B for 8×8, and 32B for 8×32. For write requests, we still assume that only 64 bits of data can be written in parallel, independent of the dimensions of the FgNVM array.

From the figure we can see that in all cases the energy of any FgNVM design is significantly lower than that of the baseline NVM design. Ideally, the energy consumption should decrease by a factor of two when the number of column divisions is doubled. However, due to the background energy and the inability to decrease the energy of writes, the scaling is not ideal. On average, the energy is reduced by 37%, 65%, and 73% in 8×2, 8×8, and 8×32, respectively. The 8×32 configuration is able to come close to ideal since this configuration reads no more than one cache line at a time.
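As a back-of-the-envelope illustration of the sensing-energy component discussed above (using the stated 2 pJ per sensed bit and ignoring write and background energy), the helper below reproduces the 1KB/512B/128B/32B sensing amounts; it is a sketch, not the evaluation's full energy model.

READ_PJ_PER_BIT = 2   # from the assumptions stated above

def activation_energy_pj(row_buffer_bytes: int, column_divisions: int) -> int:
    """Only one CD's share of the row buffer is sensed per Partial-Activation."""
    sensed_bits = (row_buffer_bytes // column_divisions) * 8
    return sensed_bits * READ_PJ_PER_BIT

for label, cds in [("baseline", 1), ("8x2", 2), ("8x8", 8), ("8x32", 32)]:
    print(label, activation_energy_pj(1024, cds), "pJ")
# baseline 16384 pJ, 8x2 8192 pJ, 8x8 2048 pJ, 8x32 512 pJ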

5.4.7 Design Optimization

The proposed FgNVM does suffer from some design limitations in a few benchmarks that do not exhibit high SLP or interference from writes. This problem is the reverse of the "problem" referred to as overfetch in [40]: an underfetch problem. Benchmarks with relatively high row-buffer hit rates but low request inter-arrival times have degraded performance. This happens because a CD sensing operation cannot be amortized effectively unless the next request address is known. For a small number of column divisions, the underfetch problem is relaxed.

Figure 5.12 shows the CCD hit rate of open column divisions across a varying number of column divisions. The CCD hit rate is similar to the row buffer hit rate, except it measures only hits in CDs. That is, if a CD must first be sensed before any data can be read, it is a CCD miss. If a CD was previously sensed and data can immediately be read, it is a CCD hit. We use the term CCD hit as this implies the request can be served in tCCD time.

Figure 5.12: The CCD hit rate (hit rate of data in all sensed column divisions). As more column divisions are added, the rate drops due to underfetch.

5.4.8 Sensitivity Study

Throughout the results section we have used the activation time (tRCD) and sensing time (tCAS) given in Table 5.2. Since an activation in the FgNVM simply pulls down the wordline, we assumed tRCD to be much smaller than tCAS, while using the combined total time based on the prototype in [95]. These timing parameters are the most important timings in terms of read performance for an FgNVM design. To explore the sensitivity of these timings, we modify the values while keeping the same total tRCD+tCAS time as the original results: 48 cycles, or 120ns.

Figure 5.13 shows the results of modifying tRCD and tCAS while using a fixed 8×32 FgNVM configuration, normalized to the NVM baseline. Timings are displayed in the format "tRCD|tCAS". The results across these benchmarks are very mixed. The benchmarks GemsFDTD, gobmk, lbm, and mcf show a clear trend of increasing IPC as the sensing time is lowered, while the benchmarks milc and soplex show a clear trend of decreasing IPC. The remaining benchmarks either reverse or stagnate at a particular point. This performance change correlates with the CCD hit rate of 8×32 shown in Figure 5.12. Benchmarks which have a CCD hit rate higher than 50% (e.g., milc) typically benefit from having lower activation times. Conversely, benchmarks with CCD hit rates lower than 50% (e.g., gobmk, lbm, mcf, zeusmp) benefit from having lower sensing times. However, for applications with very sporadic access patterns, there are equal benefits from reducing either timing.
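As a quick arithmetic check on these splits (assuming the 400 MHz memory clock from Table 5.2): one memory cycle is $1/(400\,\mathrm{MHz}) = 2.5\,\mathrm{ns}$, so $t_{RCD} + t_{CAS} = 48 \times 2.5\,\mathrm{ns} = 120\,\mathrm{ns}$, and the swept splits (10|38, 20|28, 30|18, and 40|08 cycles) all preserve this 48-cycle total.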


Figure 5.13: Sensitivity of our tRCD and tCAS selections on 8×32 (series: 10|38, 20|28, 30|18, 40|08). Shows that the sensitivity when decreasing tCAS is proportional to the CCD hit rate.

5.4.9 Future Devices

As devices mature, the read and write latencies may potentially decrease. As a result, the activation and sensing times would be reduced along with them. Here we study lowering the timing values as a way to demonstrate the future performance of an FgNVM design. For this study we consider an 8×32 FgNVM design. Figure 5.14 shows the results when lowering the total tRCD and tCAS from 48 cycles (120ns) to 40 and 30 cycles (i.e., 100ns and 75ns). Timings are displayed in the format "tRCD|tCAS" and normalized to the NVM baseline. The "10|38" timing shows the original experimental results for comparison. Similar to the results in the sensitivity study, the same applications decrease in performance when tCAS is reduced, as in Figure 5.13. The results for "10|30" compared to "20|20" can again be explained by the CCD hit rates in Figure 5.12. Naturally, as the overall memory access time decreases, the performance increases.

Figure 5.14: Study of reducing total tRCD and tCAS (series: 10|38, 10|30, 20|20, 10|20). This simulates the performance of future devices with faster activation or sensing times.

Figure 5.15: Application of FgNVM to a subarray-ganged RRAM design (configurations: 2×2, 8×2, 2×8, 8×8). Yields results similar to the PCM design.

5.4.10 Application to STT-RAM and RRAM

The analysis section showed that FgNVM benefits greatly in terms of energy from an increased number of column divisions. For the PCM design we were able to utilize a high number of CDs due to the large number of global I/O lines leading to the global sense amplifiers. NVMs using local sense amplifiers do not have this luxury; nevertheless, the FgNVM concept is still applicable to them, although the number of column divisions would be quite limited. Here we show the performance of an RRAM design with local sense amplifiers and a small row buffer. Note that the same technique could be applied to STT-RAM; we show only RRAM due to space constraints.

We use the concept of ganged subarrays described in Section 5.3.4 to increase the row buffer size and ultimately the number of possible CDs. Assuming the same memory configuration as in Table 5.2, we would have 128Mb banks per memory device. We further assume 512 bit × 512 bit tiles and 2 cache lines per row buffer (footnote 4), resulting in 32×16 tiles. This increases the row-buffer size from 128 bits to 512 bits, allowing for up to 8 cache lines per row across a rank. Figure 5.15 shows results similar to those observed throughout this chapter; the overall performance is increased, with large gains for exceptionally memory-intensive benchmarks: GemsFDTD, lbm, and libquantum.

5.4.11 Comparison with Contemporary DRAM

We compare our design with contemporary DDR3 DRAM designs to see how close the design comes in terms of performance. The results of these runs are shown in Figure 5.16. Using FgNVM we can build a PCM design that sits roughly half-way between the baseline NVM and DDR3-15E, which operates at a higher frequency than our FgNVM design.

Footnote 4: This small size is due to heavy multiplexing before sense amplification in RRAM.

Figure 5.16: FgNVM (8×8) vs. a mid-grade DDR3 DRAM (DDR3-15E), normalized to the baseline NVM. Shows that the FgNVM design can help approach the speeds of DRAM.

5.5 Design Implementation

To implement FgNVM in real hardware, modifications to the row decoders and a method to control the Y-select for CDs are required. Typical row decoder designs use two-stage decoding, where signals are pre-decoded and sent to a second-stage decoder. The row address must be held in a row address latch so that the row decoder persistently selects the wordline. In other words, the contemporary row decoder is unable to open multiple wordlines simultaneously. Similarly for the Y-select logic, an NVM design must persistently enable the local Y-select within a tile in order for data to be sensed. The NVM baseline enables the local Y-select corresponding to the selected columns. We assume that the column select is broadcast to all tiles in a row at the same time the wordline is selected. We need to subdivide this broadcast into individual CDs to provide fine-granularity access.

Row Decoder Designs.

Similar to the changes needed for subarray-level parallelism [38] and concurrent refresh [96, 97], the row decoder must be split to provide dedicated row decoders for each subarray group. For a design supporting Multi-Activation, we pair each row decoder with a local latch. A multiplexer is used to select which latch should be written when an activate command arrives by looking at the address bits. For the wide bus design proposed in Section 5.4.2, we can remove this multiplexer and assume truly parallel address signals.

This particular technique has super-linear area overhead in terms of the number of subarray groups, since doubling the number of subarrays doubles the number of latches but only decreases the number of bits per latch by 1 bit. For example, if the incoming row address is 16 bits, we require a 16-bit latch supporting 64K wordlines in the bank at the cost of 16 latch bits. With two subarray groups, each row latch needs only 15 bits, supporting the same 64K (32K + 32K) wordlines in the bank at a cost of 30 (15 + 15) latch bits.

Table 5.3: Summary of Area Overheads in FgNVM design.

Component          Min Overhead            Max Overhead
Row Decoder        < −1 µm²                < −50 nm²
Row Latches        2,325 µm²               9,333 µm²
Y-select Enables   0 µm²                   0.1 mm²
Total              2,325 µm² (<0.1%)       0.1 mm² (0.32%)
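To make the super-linear growth of the row-latch overhead concrete, a minimal illustrative calculation (assuming the 16-bit row address used in the example above):

from math import log2

def total_latch_bits(row_bits: int, sags: int) -> int:
    bits_per_latch = row_bits - int(log2(sags))   # each SAG decodes proportionally fewer rows
    return sags * bits_per_latch

for sags in (1, 2, 8, 32):
    print(sags, "SAGs ->", total_latch_bits(16, sags), "latch bits")
# 1 SAGs -> 16, 2 SAGs -> 30, 8 SAGs -> 104, 32 SAGs -> 352 latch bits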

Y-select Designs.

Different from SALP and concurrent refresh, FgNVM also requires additional logic for CDs. For the local Y-select, we propose to add an additional enable signal per SAG in each CD. The enable signals are assumed to be routed parallel to the I/O lines in Figure 5.2. These signals enable the local Y-select in only a single SAG in any given CD. The Y-select enables are one-hot signals and pass through an AND gate with the enable signal corresponding to the SAG. In this way, we can still broadcast the column select signals normally across a CD without modification to the column select logic. The difference is that select signals are inhibited at other SAGs, meaning only one SAG can be paired with any CD. Since we cannot share bitlines, as mentioned in Section 5.3.2, this becomes an extremely low-cost implementation. In the bitline direction, one extra signal is needed for each SAG to select which one is connected to the CD. This requires SAGs × CDs new enable signals and allows each SAG to be connected to any CD. In addition to each enable signal, each column division requires a latch to hold the enable signal value for the duration of the sense operation.
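The following is a behavioral sketch, not a circuit netlist, of the per-CD one-hot enable just described; the data-structure layout is an assumption made for illustration.

NUM_SAGS, NUM_CDS = 8, 2

# enable[cd] holds a one-hot vector over SAGs, latched for the duration of a sense
enable = [[False] * NUM_SAGS for _ in range(NUM_CDS)]

def connect(cd: int, sag: int) -> None:
    enable[cd] = [s == sag for s in range(NUM_SAGS)]   # one-hot latch update

def local_y_select(cd: int, sag: int, column_select: bool) -> bool:
    """AND of the broadcast column select with this SAG's enable bit in this CD."""
    return column_select and enable[cd][sag]

connect(0, 3)
print(local_y_select(0, 3, True))   # True: CD 0 is currently paired with SAG 3
print(local_y_select(0, 5, True))   # False: the select is inhibited at other SAGs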

5.5.1 Overhead Costs

Area overhead must be carefully considered, as even small increases in area have large impacts on the cost of memory. Overhead costs may potentially arise from additional latch overhead for each subarray, modifications to the row decoder, and additional bitlines for the local Y-select enable. Below we consider the area overheads and the impact on the lifetime and yield of an FgNVM device. We summarize the area overhead in Table 5.3.

5.5.2 Area Overhead

Row Decoder

For comparing row decoder designs, we assume a two-stage decoder with a pre-decoder and a second-stage decoder. Based on this, we approximate the number of transistors following an equation from [98]. The size of this row decoder grows by Ω(N log N) for N rows in a bank. Doubling the number of row decoders, each of N/2 size, results in a negligible change in area using this equation.

Latch Overhead

We measure latch overhead by implementing the additional latches required in VerilogHDL and synthesizing the design via Synopsys Design Compiler [99] with TSMC 45nm low-power technology. The area overhead of adding extra latches is similar to that of [38, 96, 97]. We compile designs supporting both 8 and 32 subarray groups and obtain 2,325 µm² and 9,333 µm², respectively.

Y-select Enables

We estimate the area of adding additional enable wires using moderately sized metal 3 wire pitches and spacing. In the best case, there is enough extra space to route the Y-select enables above the tiles with the global I/O lines, resulting in no extra area overhead. In the worst case, we conservatively assume that the enable lines cannot be routed above the tiles at all, incurring the maximum area penalty. We assume a wire width and spacing of 3F (feature size) each at 45nm, resulting in a combined 6F wire-and-spacing pitch of 270nm. Using the maximum number of 32 subarray groups and 32 column divisions results in an enable signal bus width of 246µm. Assuming these wires must stretch over the entire bank, using [95] we assume a length of 4mm, resulting in a total overhead of 0.1mm² (0.15%).

5.5.3 Yield and NVM Lifetime

Subdividing rows and columns between multiple decoding regions such as subarray groups and column divisions has implications for memory yield and the lifetime of non-volatile memory devices. Typical designs use spare rows and columns to work around fabrication defects, increasing the number of dies that are operational and thus the yield. Rows and columns are tested after manufacture, and working rows and columns are connected to decoder outputs through one-time programmable fuses. Subdividing the memory into different decoding regions limits the total number of defects that can be corrected in each region. In order to keep the number of correctable defects in each region constant, we can duplicate the spare rows and columns in each NVM tile. Equation (5.2) shows the additional overhead in terms of the percentage of extra wordlines and bitlines. In the equation, w and b are the wordlines and bitlines in a tile, t is the number of tiles, and w_s and b_s represent the number of spare wordlines and bitlines. SAGs and CDs represent the number of SAGs and CDs in the FgNVM design. This estimated value is linearly proportional to area overhead.

\frac{b \times w \times t \;+\; b \times w_s \times SAGs \;+\; w \times b_s \times CDs \;+\; w_s \times b_s}{b \times w \times t \;+\; b \times w_s \;+\; w \times b_s \;+\; w_s \times b_s} \;-\; 1 \qquad (5.2)

The equation shows that the memory array size is increased by 0.11% in a 2×2 design, 0.87% in an 8×8 design, and 6.14% in a 32×32 design. Additionally, lifetime-increasing techniques can potentially be used to correct manufacturing defects. Techniques to correct stuck-at faults and similar errors at the end of NVM cell lifetime typically employ either encoding techniques [100, 90], memory [101, 102], or pointers to error-free regions of memory [103, 104]. Our work does not interfere with these recovery techniques, so the lifetime of NVM is not degraded by using FgNVM.
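Equation (5.2) can be transcribed directly as a helper function; the tile dimensions and spare counts passed in the example call are illustrative placeholders, not the exact values behind the percentages quoted above.

def spare_overhead(w, b, t, ws, bs, sags, cds):
    """Fractional array-size increase from Equation (5.2) when spares are
    duplicated per subarray group and per column division."""
    numer = b * w * t + b * ws * sags + w * bs * cds + ws * bs
    denom = b * w * t + b * ws + w * bs + ws * bs
    return numer / denom - 1.0

# Example call with assumed (hypothetical) tile dimensions and spare counts.
print(f"{spare_overhead(w=2048, b=2048, t=16, ws=4, bs=4, sags=8, cds=8) * 100:.3f} %")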

5.6 Conclusion

Non-volatile memories exhibit excellent scalability and may be poised to replace DRAM once certain obstacles are overcome. In this work, we introduced a new design which leverages the non-volatility of NVMs to design a fine-granularity NVM: FgNVM. Our design allows for low-energy intra-bank parallelism via Partial-Activations, Multi-Activations, and read/write conflict reduction through Backgrounded Writes. Partial-Activation provides the ability to save energy by mitigating overfetch. Multi-Activation allows for irregular access patterns in workloads with high memory parallelism. Backgrounded Writes allow simultaneous reads and writes without the need for buffering, cancellation, or pausing. The area overhead of the design ranges from 0.1% to 0.32%, similar to previous designs providing subarray-level access. Our techniques combined provide an average performance improvement of 56.5% over a baseline NVM design with up to 73% reduced energy.

Chapter 6

Early Activation Scheduling for Main Memories

Recent work in memory architecture has accelerated greatly due to scaling issues with DRAM [84, 20, 1]. With the added exploration of non-volatile memories, this acceleration in research has led to many interesting directions in bank-level designs. More specifically, several designs have become increasingly fine-grained in terms of how memory is accessed and moved. For DRAM, the capability to access smaller-than-bank units allows for overlapping accesses [38]. A similar proposal is made in [40] to enable selective subarray access (i.e., less than one whole row). This can be done either by accessing a single subarray or by enabling only a portion of the entire wordline.

Non-volatile memory has also seen specific concepts tailored directly to its unique properties. Early work recognizes the value of opening several smaller rows at once by using several "swap" rows where recently opened row data may be placed [5]. The swap rows can be accessed in much shorter time and can also be used as write buffers for the assigned row. Chapter 5 of this dissertation discusses the FgNVM circuit design to enable more ways to access memory. To summarize, FgNVM allows a portion of a row to be accessed without designing the bank with explicitly smaller rows. Furthermore, subarray-level access similar to DRAM is made possible without the requirement that local sense amplifiers be used. Lastly, FgNVM allows simultaneous reads and writes from different subarrays in a bank, which is useful for NVMs with extremely high write times (e.g., MLC PCM).

Despite these new architectures, access to main memory is still done using the traditional triple of row-based commands: activate (ACT), read/write (RD/WR), and precharge (PRE). For most NVMs we may disregard the precharge command; it is typically used to reset bitlines to 1/2 of VDD for voltage sensing, whereas most NVMs are expected to use current sensing [52]. These row-based commands do not have any timing or energy constraints on the delay between them, besides performance implications. With the ability to have multiple row buffers [5, 38], and the capability through FgNVM (see Chapter 5) and other techniques [40] to open portions of a row, we have much more flexibility in how we access data using row-based commands.

In this chapter, we discuss a technique to leverage the ability to have multiple simultaneously open rows and no imposed delay between row-based commands. Our goal is to accurately predict which rows will be accessed next in idle subarrays in memory. By doing so, we can issue an Early Activation, that is, an ACT command issued speculatively, in order to decrease the latency of reading or writing data to that row. We leverage memory access predictors popular for large DRAM caches [57, 105] as our prediction scheme and issue Early-ACTs directly from predicted LLC misses. In this work we explore non-volatile memories with only ACT+RD/WR pairs. However, the idea is the same for DRAM if we assume open-page mode with back-to-back PRE+ACT commands.

6.1 Motivation

An example of an Early-ACT is shown in Figure 6.1. In the figure, the top timing diagram shows the state-of-the-art method to access memory, where an ACT command for a read request is generated by the memory controller (MC) when the demand request arrives. The bottom timing diagram shows the Early-ACT idea, where an ACT is issued speculatively by an access predictor before the demand request arrives at the MC. The bottom diagram can be shifted in either direction, either completely hiding the ACT command for maximum savings or issuing an ACT command when it normally would be issued, for no performance benefit or degradation.

We gathered motivational results using an oracle implementation to explore the potential for Early-ACT and verify our predictor. The baseline system we explore is also discussed in this section. We will show that systems that deviate from the baseline also have potential to use Early-ACT given a sufficiently large LLC to amortize ACT command timing and queuing delay.

Figure 6.1: Example of an Early-ACT in action. The top line shows the baseline case, where the ACT is created and issued when the demand request arrives at the MC. The bottom line shows the Early-ACT idea, where ACTs are issued speculatively before a demand request arrives.

Figure 6.2: Baseline system setup for testing Early-ACT concept. Four CPUs with three levels of cache, access predictor (AP), memory controller (MC), and 4 channels of memory.

Table 6.1: Experimental System Setup

Processor          4-cores, 2GHz, Out-of-Order; 128 ROB entries, 5 issue width; 48 LQ size, 32 SQ size
L1I-cache          2 cycle latency, 32kB, 2-way; 4 mshr, degree 2 tagged prefetcher
L1D-cache          5 cycle latency, 32kB, 4-way; 16 mshr, degree 2 stride prefetcher
L2-cache           8 cycle latency, 256kB, 8-way; 16 mshr, degree 4 stride prefetcher
L3-cache           24 cycle latency, 4MB, 16-way; 16 mshr, degree 8 stride prefetcher
Access Predictor   256 entries per core, 8 address bits per entry; 4 access threshold to predict, 8 maximum value
Main Memory        4GB memory, 400 MHz, 4 channels; 1 rank, 8 banks, 16K rows, 8K columns; 1024-bit row buffer, FRFCFS; 64 write drivers, 32 queue entries
PCM Timings        tRCD=120ns, tCAS=10ns, tRAS=0ns, tRP=0ns, tCCD=4cy, tBURST=4cy, tCWD=7.5ns, tWP=150ns, tWR=7.5ns

6.1.1 Baseline System Design

The baseline system design is shown in Figure 6.2. We consider a system with 3 levels of cache. Notice that the access predictor (AP) sits in parallel with the L3 cache. The access predictor watches L3 accesses (i.e., L2 misses) and uses this information to make predictions. If we predict the request will miss in the L3 cache, the access predictor may choose to perform an Early-ACT. The access predictor is connected directly to the memory controller on the other side to issue Early-ACTs. Cache sizes, the CPU setup, and the memory configuration and latencies are shown in Table 6.1.

The access predictor works by storing multiple entries with a page number and a counter for the number of times accesses to this page are seen. If the counter is below the threshold value, the predictor guesses that the access will miss in the L3 cache and has the option to send an Early-ACT to the memory controller. Counters for each page number are updated based on hit or miss in the L3 cache. Requests that miss will decrease the counter by 1 if the counter is non-zero. Requests that hit will increase the counter by 1 if the counter is less than the maximum value. Note that this is not a way to bypass the LLC, since we need the cache access result to update our predictor.

Figure 6.3 shows the accuracy of our prediction scheme at the L3 cache. These results cover several million accesses, and from the results we can see that relatively few predictions were unsuccessful in most benchmarks. The mcf benchmark has the lowest prediction accuracy, with a rate of 86%. This relatively low prediction rate is due to the massive amount of memory traffic generated by this benchmark. A higher number of entries in the access predictor brings up the prediction accuracy at the cost of more predictor area overhead.

Figure 6.3: Memory access predictor accuracy at the L3 cache across memory-intensive SPEC2006 benchmarks. High accuracy is achieved for confident issuance of Early-ACT commands.
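A minimal sketch of the counter-based predictor just described, using the Table 6.1 parameters (256 entries per core, a threshold of 4, and a maximum counter value of 8); the eviction policy is an assumption made for illustration.

class AccessPredictor:
    def __init__(self, entries=256, threshold=4, max_count=8):
        self.entries = entries
        self.threshold = threshold
        self.max_count = max_count
        self.counters = {}            # page number -> access counter

    def predict_miss(self, page: int) -> bool:
        """Below the threshold, guess an L3 miss; an Early-ACT may then be sent."""
        return self.counters.get(page, 0) < self.threshold

    def update(self, page: int, l3_hit: bool) -> None:
        """Train on the actual L3 outcome (the LLC is therefore never bypassed)."""
        count = self.counters.get(page, 0)
        count = min(count + 1, self.max_count) if l3_hit else max(count - 1, 0)
        if page not in self.counters and len(self.counters) >= self.entries:
            self.counters.pop(next(iter(self.counters)))   # simple eviction (assumed)
        self.counters[page] = count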

Figure 6.4: Results of running Early-ACT oracle prediction on PCM-type memories. The average total memory latency is shown as baseline, oracle pairs for each memory type.

6.1.2 Oracle Analysis

The theoretical maximum results are determined by using an oracle prediction. Our oracle scheme modifies the timings of the memory simulator outright to emulate an ACT arriving earlier than it truly did. Internally, our simulator stores when the next activation can occur. This time is either the current simulation time or some time in the past. In the case that the next activation can occur at the current simulation time, no Early-ACT is possible. In all other cases, we assume that the ACT could have arrived at the next activation time in the past. We limit this value to tRCD to have less skew on the results (i.e., if the next activation was possible millions of cycles ago, we assume the Early-ACT is issued tRCD cycles ago, preventing simulation of "infinite" speed-up). Figure 6.4 shows the results from the oracle test runs. In this case we show the PCM memory system, which is typically the type of NVM with the highest activation cycle time. By using Early-ACT we have huge performance potential in memories with very high activation times.
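The oracle adjustment can be summarized by a small helper, sketched here under the stated assumption that the speculative lead time is capped at tRCD; the names are illustrative.

def oracle_act_issue_time(now: int, next_activate_ready: int, tRCD: int) -> int:
    """When the bank could have activated earlier, pretend the ACT arrived then,
    but no more than tRCD cycles in the past."""
    if next_activate_ready >= now:
        return now                                  # no Early-ACT possible for this access
    return max(next_activate_ready, now - tRCD)     # cap the speculative lead time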

6.2 Results and Analysis

In our first round of results, we test a naive scheme to issue Early-ACTs: when our access predictor guesses that an L3 access will miss, we always issue an Early-ACT. The results in Figure 6.5 show very mixed outcomes. In some benchmarks, the naive scheme achieves nearly the full potential of the oracle (e.g., GemsFDTD, gobmk, libquantum, zeusmp). In other benchmarks, the scheme provides some benefit; however, there is much more performance to be realized.

6.2.1 Missed Prediction Implications

One of the problems with the naive Early-ACT scheme is the number of false hits and false misses. A false hit is defined as an Early-ACT that was not issued because the access predictor falsely predicted an L3 cache hit. A false miss occurs when the access predictor claims an L3 miss will occur when it is actually a hit, causing an Early-ACT to be issued unnecessarily.

Figure 6.5: Results of running Early-ACT using a naive issuance of Early-ACTs. Baseline, naive Early-ACT, and oracle results are shown.

False hits can degrade the potential performance improvement, but cannot reduce performance. False misses can potentially degrade performance in two ways. First, a false miss can issue an ACT to a subarray which is about to activate another row in the future, delaying the future ACT and causing an ACT conflict. Second, false misses can issue an ACT to a subarray, closing a row which would have been a row buffer hit for a future access and causing a premature closure. Based on our experimental runs for the chosen benchmarks, the number of ACT conflicts and premature closures is so minuscule that we do not explore techniques to prevent them. Data for this observation is shown in Figure 6.6. The total number of false hits and false misses can be quantified as the difference between the oracle and Early-ACT bars in the figure. The number of ACT conflicts and premature closures is a subset of this already minimal number of Early-ACTs. Note that a false miss can issue an ACT which does not delay a future ACT to the same subarray. We do not count these "hidden" activates as ACT conflicts; however, there is still some wasted energy.

Figure 6.6: Total number of Early-ACT requests. The difference between Early-ACT and oracle is the total number of false hits and misses combined.

6.2.2 Limiting Amounts of Early-ACTs

Another issue that may arise when using Early-ACTs is overfilling the memory queue. This can occur when too many Early-ACTs are issued, causing the memory queue to become too full to hold demand requests. Since demand requests are on the critical path and speculative ACTs are not, this can potentially degrade performance as well. We experimented with limiting the number of Early-ACTs by disallowing Early-ACTs from being issued when the memory queue is more than a certain percentage full. Results are shown in Figure 6.7. Similar to the measurements of ACT conflicts and premature closures, we found the impact of Early-ACTs in the queue interfering with demand requests to be minimal. This can be seen by the near-zero variation in performance across all occupancy limitations. In this figure, the Early-ACT series is our naive scheme with no limitation on the number of Early-ACTs issued.

Figure 6.7: Instructions per cycle for each benchmark when limiting the number of Early-ACTs. A series labeled with a less-than percentage means Early-ACTs are only issued when the memory queue is less than that percentage full.

6.2.3 Unrealized Performance Potential

Based on these observations, the main cause of Early-ACT not achieving the performance potential of the oracle approach in all benchmarks is the location of the access prediction. Table 6.1 shows the latency of an L3 cache access. If we predict at the L3, which has a 24 memory-cycle latency, and our ACT time is 48 cycles (i.e., 120ns at 400MHz), then we can save at most 24 cycles given that the memory queue is empty. The empty-queue assumption means the demand request will be issued immediately upon arriving at the memory, so only 24 of the 48 activation cycles can be hidden. However, when the queue is not empty, the demand request can potentially be queued for long periods of time.

The average number of ACT cycles saved is defined as the difference between when the Early-ACT was issued and when a normal ACT would have been issued. This number can range from 0 cycles saved to 48 cycles saved (i.e., the maximum ACT time, tRCD). Zero cycles are saved when the Early-ACT was queued for just as long as the ACT would have been, due to a subarray conflict or other timing constraints. Maximum savings occur when the demand request is queued long enough for the Early-ACT to be completely hidden.

Figure 6.8 shows the average number of ACT cycles saved per Early-ACT. Figure 6.9 shows the total number of ACT cycles saved across the entire simulation runs. The data in Figure 6.9 lines up with the speedup in Figure 6.5: benchmarks whose total ACT cycle savings with Early-ACT are comparable to the oracle case have speedups that approach the oracle case in Figure 6.5. In order to realize more savings, either the L3 miss latency would need to be increased or the activation time of the main memory would need to be reduced. However, neither of these solutions would be considered in a realistic system implementation.
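One way to restate the bound discussed above (a summary derived from the Table 6.1 parameters, not a formula given explicitly in the text) is $\text{cycles saved} = \min\left(t_{RCD},\; t_{L3} + t_{queue}\right)$, so with $t_{RCD} = 48$ cycles and $t_{L3} = 24$ cycles, an empty queue ($t_{queue} = 0$) yields at most $\min(48, 24) = 24$ saved cycles, while sufficient queuing delay allows the full 48-cycle activation to be hidden.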

6.2.4 Memory Controller Implications

Note that in this work we assumed an FRFCFS memory controller. Some memory controllers may have separate memory queues for read and write commands.

Figure 6.8: Average number of ACT cycles saved when issuing an Early-ACT. This average can range from 0 to 48. In all cases, the oracle savings are much higher than a realistic implementation.

Figure 6.9: Total number of ACT cycles saved across the entire simulation run. The difference in savings between Early-ACT and oracle mirrors the speedup shown in the performance figure.

In memory controllers with separate read and write queues, the write commands are typically drained either during periods of low memory activity at the memory controller or when the write queue reaches a specific occupancy. Memory controllers with write queues predictably issue write commands, and therefore Early-ACT is not needed for these types of memory controllers. In this work, we utilize the L3 cache replacement scheme to determine which address will be the victim upon a miss to a full cache set. Using this victim address, we can issue an Early-ACT at the time of the L3 miss, similar to how read misses are handled.

6.3 Conclusions

Due to recent work in the scope of main memory, several new circuit-level designs have shown the possibility of creating finer-granularity memory banks. In this chapter, we make an effort to introduce new architecture-level implications of these designs. We describe a method to allow rows to be opened speculatively without a corresponding read or write transaction. The goal behind this is to decrease overall activation time, increase transaction hit rates, and improve overall system performance. We have shown an accurate prediction method placed at the LLC to introduce early activations (Early-ACTs) to the memory system. Our preliminary work shows potential for this technique to reduce overall memory latency, especially in slow NVM-type memories. However, the Early-ACTs must be carefully issued in order to avoid hindering our goals. The technique was evaluated using non-volatile memory, but may also be applied to DRAM under specific memory controller conditions.

Chapter 7

Dissertation Conclusions

Non-volatile memory exhibits excellent scalability and has the potential to supplement DRAM once certain obstacles are overcome. Specifically, the write endurance, energy, and latency are problematic. As research progresses, several new prototypes have emerged to further increase the scalability of non-volatile memories. These new prototypes have several problems that result in increasingly disparate operation compared to contemporary DRAM. However, the problems with DRAM are causing progress in that technology to slow down and its future to become uncertain, making non-volatile memory an attractive solution despite these obstacles. Therefore, it is important to be able to model these emerging technologies at an architectural level. Moreover, the intrinsic device characteristics should be considered when designing future memory systems. In this dissertation we introduced multiple research directions to bridge the gap between circuit-level knowledge and architectural designs.

First, we designed a new simulation tool which considers the unique device characteristics of differing memory cell and array designs. As motivation, we showed that the relative error of performance estimates when using estimated latency values for multi-level cell phase-change memory can be as much as 150%. The research direction for non-volatile memories is moving towards more unique array designs in an effort to extract as much density as possible from the devices. As a result, the latency and energy of these designs is becoming more and more non-uniform. Circuit-level models can give a range of energy and latency values for specific data inputs or changes in memory data. However, these models do not fully capture the data patterns exhibited by real-world applications at the architectural level. Furthermore, current simulators do not allow for desirable simulated systems, such as hybrid memory systems, without significant modification to the baseline model. To this end, we introduced the NVMain simulator to provide such a model.

Second, in order to leverage the important capabilities of modeling simulated data with energy and performance characteristics, we implemented a bank-level model which extends the popular NVSim tool. Our bank-level model is able to ascertain energy and performance values for bit patterns written to a cell in a specific memory design. More specifically, multi-level cell models are introduced for so-called horizontal 3D-stacked memory devices. The tool also allows types of memory currently missing from NVSim, such as embedded DRAM, which is becoming increasingly popular as a replacement for SRAM LLCs which interface with main memory. We expect this tool to have high utility in design space exploration through finding optimal design points and the best technologies for given optimization targets.

Third, we leveraged the previous models to design a hybrid-memory system consisting of a DRAM-based cache which stores data at the cache line granularity to improve the performance of a non-volatile main memory. The cache in this hybrid-memory system can help significantly by mitigating all the problems with writes to non-volatile memory cells. However, the number of resulting writes that arrive at the DRAM-based cache doubles due to writebacks from higher-level caches as well as fill requests from misses in the DRAM-based cache. Unlike their SRAM-based counterparts, DRAM-based caches are significantly slower since they must incur the typical DRAM activate-precharge cycle.
In this work, we introduce the concept of a Fill Cache to store write requests for potentially long periods of time, reducing interference with read requests caused by the increased number of writes. Our initial results show a much more uniform average memory access time. In addition to this, we propose several other uses of Fill Caches to explore in the future.

Next, we explore the low-level characteristics of non-volatile main memory designs in an effort to apply modern-day DRAM research to non-volatile memories. We proposed a novel design which subdivides the memory into multiple independently accessible dimensions and allows for three optimizations of non-volatile memories. The first optimization reduces energy by not sensing data that is not requested. The second allows sensing data from disparate wordlines simultaneously. Lastly, we provide the unique capability to perform writes in the background while allowing read requests to be serviced. Our results show that the design works very well, with the potential to reach DRAM performance. However, the results demonstrated are far from the ideal case. To this end, we propose multiple techniques as future work to potentially boost the performance of the design as close to ideal as possible.

Finally, we attempt to leverage the architectural implications of our proposed non-volatile memory design. Specifically, we make use of the ability to decouple the issuance of row-based commands such as activate, read/write, and precharge. In doing this, we are able to issue early activations (Early-ACTs). This work shows that Early-ACTs have the capability to reduce the overall activation cycle time in memories and potentially hide the latency completely. We explore several different methods to manage the issuance of Early-ACTs such that we do not increase the queuing latency of a memory channel, the average activation latency, or the overall memory latency beyond the baseline. We explored this using a non-volatile type of memory; however, the concept could be applied to other memories such as DRAM under certain restrictions.

7.1 Future Directions

These five chapters covered non-volatile memory simulation and design assuming an architecture and protocol similar to contemporary DRAM. In the future, these assumptions will change due to the momentum of newer memory trends and other emerging technologies. Most notably, three-dimensional integrated circuits and the introduction of non-DDR standards will impact how memory is designed and therefore how it is accessed most efficiently.

3D-ICs allow memory to be stacked directly in the vertical direction rather than in the planar layout of DIMMs. As a result, the memory devices are in much closer proximity, changing many of the assumptions used in this work. As discussed in the DESTINY model, it is possible to stack memory using either TSVs or monolithic stacking with NVMs. These types of designs, specifically monolithically stacked NVMs, will change how memory is accessed. The addition of the third dimension allows for different permutations of row, column, and layer to access individual cache lines or pages. This change can mean concepts such as FgNVM and Early-ACT need to be modified to work with these new designs.

Newer non-DDR protocols have also been emerging, such as high-bandwidth memory (HBM) [106] and the hybrid memory cube (HMC) [49]. HBM defines a minimum requirement for the available bandwidth from the memory, which imposes several design constraints requiring 3D-ICs. HMC has similar design constraints demanding 3D-IC usage. These protocols act in an asynchronous manner, and low-level memory control is moved from the processor die to the memory stack. As a result, the methods used to model and simulate these types of memory will change dramatically. Although DESTINY supports modeling the memory stack itself, the NVMain simulator would need to be updated to model these interfaces.

This dissertation explores several different ways in which non-volatile memory can be leveraged to be used as main memory. Although NVM is still in its infancy, we feel the techniques proposed in this dissertation will be valuable in allowing NVMs to overtake DRAM as the memory of choice should DRAM scaling and energy become a major limitation. The techniques are also useful in supplementing DRAM to add a new level of memory with lower performance but more energy savings.

Bibliography

[1] (2009), "International Technology Roadmap for Semiconductors." URL http://www.itrs.net/reports.html

[2] Mogul, J. C., E. Argollo, M. Shah, and P. Faraboschi (2009) “Operating System Support for NVM+DRAM Hybrid Main Memory,” in Proceedings of the 12th Conference on Hot Topics in Operating Systems, HotOS’09, USENIX Association, Berkeley, CA, USA, pp. 14–14. URL http://dl.acm.org/citation.cfm?id=1855568.1855582

[3] Qureshi, M. K., V. Srinivasan, and J. A. Rivers (2009) “Scalable High Performance Main Memory System Using Phase-change Memory Technology,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, ACM, New York, NY, USA, pp. 24–33. URL http://doi.acm.org/10.1145/1555754.1555760

[4] Qureshi, M., M. Franceschini, and L. Lastras-Montano (2010) “Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing,” in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pp. 1–11.

[5] Lee, B. C., E. Ipek, O. Mutlu, and D. Burger (2009) “Architecting Phase Change Memory As a Scalable Dram Alternative,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, ACM, New York, NY, USA, pp. 2–13. URL http://doi.acm.org/10.1145/1555754.1555758

[6] Xu, C., X. Dong, N. Jouppi, and Y. Xie (2011) “Design implications of memristor- based RRAM cross-point structures,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2011, pp. 1–6.

[7] Baek, I., C. Park, H. Ju, D. J. Seong, H. S. Ahn, J. Kim, M. K. Yang, S. Song, E. Kim, S. Park, C. Park, C. Song, G. Jeong, S. Choi, H.-K. Kang, and C. Chung (2011) “Realization of vertical resistive memory (VRRAM) using cost effective 3D process,” in Electron Devices Meeting (IEDM), 2011 IEEE International, pp. 31.8.1–31.8.4.

[8] Chien, W., F. M. Lee, Y. Y. Lin, M. H. Lee, S. H. Chen, C. C. Hsieh, E. K. Lai, H. H. Hui, Y. K. Huang, C. C. Yu, C. Chen, H. L. Lung, K. Y. Hsieh, and C.-Y. Lu (2012) “Multi-layer sidewall WOX resistive memory suitable for 3D ReRAM,” in VLSI Technology (VLSIT), 2012 Symposium on, pp. 153–154. 89

[9] Chen, H.-Y., S. Yu, B. Gao, P. Huang, J. Kang, and H.-S. Wong (2012) “HfOx based vertical resistive random access memory for cost-effective 3D cross-point architecture without cell selector,” in Electron Devices Meeting (IEDM), 2012 IEEE International, pp. 20.7.1–20.7.4.

[10] Cho, W. Y., B.-H. Cho, B.-G. Choi, H.-R. Oh, S. Kang, K.-S. Kim, K.-H. Kim, D.-E. Kim, C.-K. Kwak, H.-G. Byun, Y. Hwang, S. Ahn, G.-H. Koh, G. Jeong, H. Jeong, and K. Kim (2005) “A 0.18-um 3.0-V 64-Mb nonvolatile phase-transition ran- dom access memory (PRAM),” Solid-State Circuits, IEEE Journal of, 40(1), pp. 293–300.

[11] Kang, S., W. Y. Cho, B.-H. Cho, K.-J. Lee, C.-S. Lee, H.-R. Oh, B.-G. Choi, Q. Wang, H.-J. Kim, M.-H. Park, Y. H. Ro, S. Kim, C.-D. Ha, K.-S. Kim, Y.- R. Kim, D.-E. Kim, C.-K. Kwak, H.-G. Byun, G. Jeong, H. Jeong, K. Kim, and Y. Shin (2007) “A 0.1-um 1.8-V 256-Mb Phase-Change Random Access Memory (PRAM) With 66-MHz Synchronous Burst-Read Operation,” Solid-State Circuits, IEEE Journal of, 42(1), pp. 210–218.

[12] De Sandre, G., L. Bettini, A. Pirola, L. Marmonier, M. Pasotti, M. Borghi, P. Mattavelli, P. Zuliani, L. Scotti, G. Mastracchio, F. Bedeschi, R. Gastaldi, and R. Bez (2011) “A 4 Mb LV MOS-Selected Embedded Phase Change Memory in 90 nm Standard CMOS Technology,” Solid-State Circuits, IEEE Journal of, 46(1), pp. 52–63.

[13] Chang, M.-F., C.-W. Wu, C.-C. Kuo, S.-J. Shen, K.-F. Lin, S.-M. Yang, Y.-C. King, C.-J. Lin, and Y.-D. Chih (2012) “A 0.5V 4Mb logic-process compatible embedded resistive RAM (ReRAM) in 65nm CMOS using low-voltage current-mode sensing scheme with 45ns random read time,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, pp. 434–436.

[14] Rosenfeld, P., E. Cooper-Balis, and B. Jacob (2011) "DRAMSim2: A Cycle Accurate Memory System Simulator," Computer Architecture Letters, 10(1), pp. 16–19.

[15] Chatterjee, N., R. Balasubramonian, M. Shevgoor, S. H. Pugsley, A. N. Udipi, et al. (2012) “USIMM: the Utah SImulated Memory Module,” Technical Report.

[16] Jeong, M. K., D. H. Yoon, and M. Erez, “DrSim: A Platform for Flexible DRAM System Research,” http://lph.ece.utexas.edu/public/DrSim.

[17] Hansson, A., N. Agarwal, A. Kolli, T. Wenisch, and A. Udipi (2014) “Simulating DRAM controllers for future system architecture exploration,” in Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on, pp. 201–210.

[18] Binkert, N., B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, et al. (2011) “The gem5 Simulator,” Computer Architecture News, 39(2), pp. 1–7.

[19] JEDEC Solid State Technology Association (2009), “JEDEC Standard: DDR3 SDRAM Specification,” http://www.jedec.org/standards-documents/docs/jesd-79-3d.

[20] ——— (2012), “JEDEC Standard: DDR4 SDRAM,” http://www.jedec.org/sites/default/files/docs/JESD79-4.pdf. [21] ——— (2012), “JEDEC Standard: LPDDR3,” http://www.jedec.org/standards- documents/results/jesd209-3.

[22] Micron (2007), “Calculating Memory System Power for DDR3,” http://www.micron.com/products/support/power-calc. 90

[23] Chandrasekar, K., B. Akesson, and K. Goossens (2011) “Improved Power Modeling of DDR SDRAMs,” in Digital System Design (DSD), 2011 14th Euromicro Conference on, pp. 99–108.

[24] Wilton, S. J. E. and N. Jouppi (1996) “CACTI: an enhanced cache access and cycle time model,” IEEE JSSC, 31(5), pp. 677–688.

[25] Mamidipaka, M. and N. Dutt (2004) eCACTI: An enhanced power estimation model for on-chip caches, Tech. rep., TR-04-28, CECS, UCI.

[26] Li, S., K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi (2011) “CACTI-P: Architecture-level Modeling for SRAM-based Structures with Advanced Leakage Reduction Techniques,” in Proceedings of the International Conference on Computer-Aided Design, ICCAD ’11, pp. 694–701.

[27] Chen, K. et al. (2012) “CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory,” in DATE, pp. 33–38.

[28] Tsai, Y.-F., Y. Xie, N. Vijaykrishnan, and M. J. Irwin (2005) “Three-dimensional cache design exploration using 3DCacti,” in ICCD, pp. 519–524.

[29] Vogelsang, T. (2010) “Understanding the Energy Consumption of Dynamic Random Access Memories,” in MICRO’43, pp. 363–374.

[30] Dong, X., X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen (2008) "Circuit and Microarchitecture Evaluation of 3D Stacking Magnetic RAM (MRAM) As a Universal Memory Replacement," in Proceedings of the 45th Annual Design Automation Conference, DAC '08.

[31] Chang, M.-T., P. Rosenfeld, S.-L. Lu, and B. Jacob (2013) “Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM,” in High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on.

[32] Samavatian, M. H., H. Abbasitabar, M. Arjomand, and H. Sarbazi-Azad (2014) “An Efficient STT-RAM Last Level Cache Architecture for GPUs,” in Proceedings of the 51st Annual Design Automation Conference, DAC ’14.

[33] Nakamura, H., T. Nakada, and S. Miwa (2014) “Normally-off computing project: Challenges and opportunities,” in Design Automation Conference (ASP-DAC), 2014 19th Asia and South Pacific.

[34] Fujita, S., K. Nomura, H. Noguchi, S. Takeda, and K. Abe (2014) “Novel nonvolatile memory hierarchies to realize normally-off mobile processors,” in Design Automation Con- ference (ASP-DAC), 2014 19th Asia and South Pacific.

[35] Hayashikoshi, M., Y. Sato, H. Ueki, H. Kawai, and T. Shimizu (2014) “Normally-off MCU architecture for low-power sensor node,” in Design Automation Conference (ASP- DAC), 2014 19th Asia and South Pacific.

[36] Izumi, S., H. Kawaguchi, M. Yoshimoto, and Y. Fujimori (2014) “Normally-off technologies for healthcare appliance,” in Design Automation Conference (ASP-DAC), 2014 19th Asia and South Pacific.

[37] Shuto, Y., S. Yamamoto, and S. Sugahara (2015) “Comparative Study of Power-gating Architectures for Nonvolatile FinFET-SRAM Using Spintronics-based Retention Technology,” in Proceedings of the 2015 Design, Automation and Test in Europe Conference and Exhibition, DATE ’15.

[38] Kim, Y., V. Seshadri, D. Lee, J. Liu, and O. Mutlu (2012) “A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM,” in ISCA’39, pp. 368–379.

[39] Lee, D., Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu (2013) “Tiered-latency DRAM: A Low Latency and Low Cost DRAM Architecture,” in Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), HPCA ’13, IEEE Computer Society, Washington, DC, USA, pp. 615–626. URL http://dx.doi.org/10.1109/HPCA.2013.6522354

[40] Udipi, A. N., N. Muralimanohar, N. Chatterjee, R. Balasubramonian, A. Davis, and N. P. Jouppi (2010) “Rethinking DRAM Design and Organization for Energy-constrained Multi-cores,” in ISCA’37, pp. 175–186.

[41] Bailey, K., L. Ceze, S. D. Gribble, and H. M. Levy (2011) “Operating System Implications of Fast, Cheap, Non-volatile Memory,” in Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, HotOS’13.

[42] Condit, J., E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee (2009) “Better I/O Through Byte-addressable, Persistent Memory,” in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09.

[43] Lee, B., P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger (2010) “Phase-Change Technology and the Future of Main Memory,” Micro, IEEE.

[44] Lu, Y., J. Shu, L. Sun, and O. Mutlu (2014) “Loose-Ordering Consistency for persistent memory,” in 32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 19-22, 2014.

[45] Narayanan, D. and O. Hodson (2012) “Whole-system Persistence with Non-volatile Memories,” in Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012), ACM.

[46] Venkataraman, S., N. Tolia, P. Ranganathan, and R. H. Campbell (2011) “Consistent and Durable Data Structures for Non-volatile Byte-addressable Memory,” in Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST’11.

[47] Volos, H., A. J. Tack, and M. M. Swift (2011) “Mnemosyne: Lightweight Persistent Memory,” in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI.

[48] JEDEC Solid State Technology Association (2011), “JEDEC Standard: Wide I/O Single Data Rate Specification,” http://www.jedec.org/standards-documents/results/jesd229.

[49] Pawlowski, J. T. (2011) “Hybrid Memory Cube,” Hot Chips ’11.

[50] Xie, Y., G. H. Loh, B. Black, and K. Bernstein (2006) “Design space exploration for 3D architectures,” J. Emerg. Technol. Comput. Syst., 2(2), pp. 65–103.

[51] JEDEC Solid State Technology Association (2012), “JEDEC Standard: LPDDR2,” http://www.jedec.org/standards-documents/results/jesd209-2f.

[52] Dong, X., C. Xu, Y. Xie, and N. Jouppi (2012) “NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 31(7), pp. 994–1007.

[53] Muralimanohar, N., R. Balasubramonian, and N. P. Jouppi (2009) CACTI 6.0, Tech. Rep. HPL-2009-85, HP Laboratories.

[54] Poremba, M. and Y. Xie (2012) “NVMain: An Architectural-Level Main Memory Simulator for Emerging Non-volatile Memories,” in ISVLSI, IEEE, pp. 392–397.

[55] Micron (2013), “DDR3 SDRAM Verilog Model.”

[56] Dong, X., Y. Xie, N. Muralimanohar, and N. P. Jouppi (2010) “Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support,” in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’10, IEEE Computer Society, Washington, DC, USA, pp. 1–11. URL http://dx.doi.org/10.1109/SC.2010.50

[57] Qureshi, M. and G. Loh (2012) “Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design,” in Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on, pp. 235–246.

[58] Rusu, S., H. Muljono, D. Ayers, S. Tam, W. Chen, A. Martin, S. Li, S. Vora, R. Varada, and E. Wang (2014) “Ivytown: A 22nm 15-core enterprise Xeon® processor family,” in IEEE ISSCC, pp. 102–103.

[59] Kurd, N. et al. (2014) “Haswell: A family of IA 22nm processors,” in IEEE ISSCC, pp. 112–113.

[60] Fluhr, E. J., J. Friedrich, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, A. Hall, D. Hogenmiller, F. Malgioglio, R. Nett, et al. (2014) “POWER8: A 12-core server-class processor in 22nm SOI with 7.6 Tb/s off-chip bandwidth,” in ISSCC, pp. 96–97.

[61] Loh, G. H., Y. Xie, and B. Black (2007) “Processor Design in 3D Die-Stacking Technologies,” IEEE Micro, 27(3), pp. 31–48.

[62] Dong, X. et al. (2012) “NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory,” IEEE TCAD.

[63] Barth, J., D. Plass, E. Nelson, C. Hwang, G. Fredeman, M. Sperling, A. Mathews, T. Kirihata, W. Reohr, K. Nair, and N. Caon (2011) “A 45 nm SOI Embedded DRAM Macro for the POWER Processor 32 MByte On-Chip L3 Cache,” IEEE JSSC.

[64] Barth, J., W. Reohr, P. Parries, G. Fredeman, J. Golz, S. Schuster, R. E. Matick, H. Hunter, C. Tanner, J. Harig, H. Kim, B. Khan, J. Griesemer, R. Havreluk, K. Yanagisawa, T. Kirihata, and S. Iyer (2008) “A 500 MHz Random Cycle, 1.5 ns Latency, SOI Embedded DRAM Macro Featuring a Three-Transistor Micro Sense Amplifier,” IEEE JSSC, 43(1), pp. 86–95.

[65] Sun, G. et al. (2009) “A novel architecture of the 3D stacked MRAM L2 cache for CMPs,” in HPCA, pp. 239–249.

[66] Golz, J., J. Safran, B. He, D. Leu, M. Yin, T. Weaver, A. Vehabovic, Y. Sun, A. Cestero, B. Himmel, G. Maier, C. Kothandaraman, D. Fainstein, J. Barth, N. Robson, T. Kirihata, K. Rim, and S. Iyer (2011) “3D stackable 32nm High-K/Metal Gate SOI embedded DRAM prototype,” in VLSIC, pp. 228–229.

[67] Klim, P., J. Barth, W. Reohr, D. Dick, G. Fredeman, G. Koch, H. Le, A. Khargonekar, P. Wilcox, J. Golz, J. B. Kuang, A. Mathews, T. Luong, H. Ngo, R. Freese, H. Hunter, E. Nelson, P. Parries, T. Kirihata, and S. Iyer (2008) “A one MB cache subsystem prototype with 2GHz embedded DRAMs in 45nm SOI CMOS,” in VLSIC.

[68] Kawahara, A., R. Azuma, Y. Ikeda, K. Kawai, Y. Katoh, K. Tanabe, T. Nakamura, Y. Sumimoto, N. Yamada, N. Nakai, S. Sakamoto, Y. Hayakawa, K. Tsuji, S. Yoneda, A. Himeno, K. Origasa, K. Shimakawa, T. Takagi, T. Mikawa, and K. Aono (2012) “An 8Mb multi-layered cross-point ReRAM macro with 443MB/s write throughput,” in ISSCC, pp. 432–434.

[69] Hsu, C.-L. and C.-F. Wu (2010) “High-performance 3D-SRAM architecture design,” in IEEE APCCAS, pp. 907–910.

[70] Puttaswamy, K. and G. Loh (2009) “3D-Integrated SRAM Components for High-Performance Microprocessors,” IEEE TC.

[71] Mittal, S., J. S. Vetter, and D. Li (2014) “A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-volatile On-chip Caches,” IEEE Transactions on Parallel and Distributed Systems (TPDS).

[72] Kirihata, T., P. Parries, D. Hanson, H. Kim, J. Golz, G. Fredeman, R. Rajeevakumar, J. Griesemer, N. Robson, A. Cestero, M. Wordeman, and S. Iyer (2004) “An 800MHz embedded DRAM with a concurrent refresh mode,” in ISSCC, pp. 206–523.

[73] Agrawal, A. et al. (2014) “Mosaic: Exploiting the Spatial Locality of Process Variation to Reduce Refresh Energy in On-Chip eDRAM Modules,” in HPCA.

[74] Black, B., D. Nelson, C. Webb, and N. Samra (2004) “3D processing technology and its impact on IA32 microprocessors,” in ICCD, pp. 316–318.

[75] Patti, R. (2006) “Three-Dimensional Integrated Circuits and the Future of System-on-Chip Designs,” Proc. of the IEEE, 94(6), pp. 1214–1224.

[76] Loh, G. H. (2008) “3D-Stacked Memory Architectures for Multi-core Processors,” in Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA ’08, IEEE Computer Society, Washington, DC, USA, pp. 453–464.

[77] Black, B., M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCauley, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb (2006) “Die Stacking (3D) Microarchitecture,” in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, IEEE Computer Society, Washington, DC, USA, pp. 469–479.

[78] Zhao, L., R. Iyer, R. Illikkal, and D. Newell (2007) “Exploring DRAM Cache Architectures for CMP Server Platforms,” in Proceedings of the 25th International Conference on Computer Design.

[79] Loh, G. H. and M. D. Hill (2011) “Efficiently enabling conventional block sizes for very large die-stacked DRAM caches,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 454–464.

[80] Dong, X., Y. Xie, N. Muralimanohar, and N. P. Jouppi (2010) “Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support,” in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’10, IEEE Computer Society, Washington, DC, USA, pp. 1–11.

[81] Woo, D. H., N. H. Seong, D. L. Lewis, and H.-H. S. Lee (2010) “Optimized 3D-Stacked Memory Architecture by Exploiting Excessive, High-Density TSV Bandwidth,” in Proceedings of the 16th International Symposium on High Performance Computer Architecture, pp. 429–440.

[82] Rixner, S., W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens (2000) “Memory access scheduling,” in Proceedings of the 27th annual international symposium on Computer architecture, ISCA ’00, ACM, New York, NY, USA, pp. 128–138.

[83] Somogyi, S., T. F. Wenisch, A. Ailamaki, and B. Falsafi (2009) “Spatio-temporal memory streaming,” in Proceedings of the 36th annual international symposium on Computer architecture, ISCA ’09, ACM, New York, NY, USA, pp. 69–80.

[84] Jacob, B., S. W. Ng, and D. T. Wang (2007) Memory Systems: Cache, DRAM, Disk, Morgan Kaufmann.

[85] Xie, Y. (2011) “Modeling, Architecture, and Applications for Emerging Memory Technologies,” Design & Test of Computers, IEEE, 28(1), pp. 44–51.

[86] Qureshi, M. K., V. Srinivasan, and J. A. Rivers (2009) “Scalable High Performance Main Memory System Using Phase-change Memory Technology,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, ACM, New York, NY, USA, pp. 24–33. URL http://doi.acm.org/10.1145/1555754.1555760

[87] Raoux, S., G. Burr, M. Breitwisch, C. Rettner, Y. Chen, R. Shelby, M. Salinga, D. Krebs, S.-H. Chen, H. L. Lung, and C. Lam (2008) “Phase-change random access memory: A scalable technology,” IBM Journal of Research and Development, 52(4.5), pp. 465–479.

[88] Jiang, L., Y. Zhang, B. R. Childers, and J. Yang (2012) “FPB: Fine-grained Power Budgeting to Improve Write Throughput of Multi-level Cell Phase Change Memory,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’12, IEEE Computer Society, Washington, DC, USA, pp. 1–12. URL http://dx.doi.org/10.1109/MICRO.2012.10

[89] Hay, A., K. Strauss, T. Sherwood, G. H. Loh, and D. Burger (2011) “Preventing PCM Banks from Seizing Too Much Power,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 ’11, ACM, New York, NY, USA, pp. 186–195. URL http://doi.acm.org/10.1145/2155620.2155642

[90] Cho, S. and H. Lee (2009) “Flip-N-Write: A simple deterministic technique to improve PRAM write performance, energy and endurance,” in Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pp. 347–357.

[91] Gunadi, E. and M. Lipasti (2011) “CRIB: Consolidated rename, issue, and bypass,” in Computer Architecture (ISCA), 2011 38th Annual International Symposium on, pp. 23–32.

[92] Standard Performance Evaluation Corporation, “SPEC2006 CPU,” http://www.spec.org/cpu2006.

[93] Sherwood, T., E. Perelman, G. Hamerly, and B. Calder (2002) “Automatically Characterizing Large Scale Program Behavior,” in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS X, ACM, New York, NY, USA, pp. 45–57. URL http://doi.acm.org/10.1145/605397.605403

[94] Rixner, S., W. Dally, U. Kapasi, P. Mattson, and J. Owens (2000) “Memory access scheduling,” in Computer Architecture, 2000. Proceedings of the 27th International Symposium on, pp. 128–138.

[95] Choi, Y., I. Song, M.-H. Park, H. Chung, S. Chang, B. Cho, J. Kim, Y. Oh, D. Kwon, J. Sunwoo, J. Shin, Y. Rho, C. Lee, M.-G. Kang, J. Lee, Y. Kwon, S. Kim, J. Kim, Y.-J. Lee, Q. Wang, S. Cha, S. Ahn, H. Horii, J. Lee, K. Kim, H. Joo, K. Lee, Y.-T. Lee, J. Yoo, and G. Jeong (2012) “A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, pp. 46–48.

[96] Zhang, T., M. Poremba, C. Xu, G. Sun, and Y. Xie (2014) “CREAM: A Concurrent-Refresh-Aware DRAM Memory System,” in Proceedings of the 20th Annual IEEE International Symposium on High Performance Computer Architecture, HPCA ’14.

[97] Chang, K. K., D. Lee, Z. Chishti, C. Wilkerson, A. Alameldeen, Y. Kim, and O. Mutlu (2014) “Improving DRAM Performance by Parallelizing Refreshes with Accesses,” in Proceedings of the 20th Annual IEEE International Symposium on High Performance Computer Architecture, HPCA ’14.

[98] Rabaey, J. M. (1996) Digital Integrated Circuits: A Design Perspective, Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

[99] Synopsys, “Design Compiler,” http://www.synopsys.com.

[100] Jacobvitz, A. N., R. Calderbank, and D. J. Sorin (2013) “Coset Coding to Extend the Lifetime of Memory,” in Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), HPCA ’13, IEEE Computer Society, Washington, DC, USA, pp. 222–233.

[101] Ipek, E., J. Condit, E. B. Nightingale, D. Burger, and T. Moscibroda (2010) “Dynamically Replicated Memory: Building Reliable Systems from Nanoscale Resistive Memories,” in Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, ACM, New York, NY, USA, pp. 3–14.

[102] Yoon, D. H., N. Muralimanohar, J. Chang, P. Ranganathan, N. Jouppi, and M. Erez (2011) “FREE-p: Protecting non-volatile memory against both hard and soft errors,” in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pp. 466–477.

[103] Schechter, S., G. H. Loh, K. Strauss, and D. Burger (2010) “Use ECP, Not ECC, for Hard Failures in Resistive Memories,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, ACM, New York, NY, USA, pp. 141–152.

[104] Qureshi, M. K. (2011) “Pay-As-You-Go: Low-overhead Hard-error Correction for Phase Change Memories,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 ’11, ACM, New York, NY, USA, pp. 318–328.

[105] Sim, J., G. H. Loh, H. Kim, M. O’Connor, and M. Thottethodi (2012) “A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, IEEE Computer Society, Washington, DC, USA, pp. 247–257. URL http://dx.doi.org/10.1109/MICRO.2012.31

[106] JEDEC Solid State Technology Association (2013), “JEDEC Standard: High Bandwidth Memory (HBM) DRAM,” https://www.jedec.org/sites/default/files/docs/JESD235.pdf.

Vita

Matthew Poremba

Matthew Poremba is a fifth-year Ph.D. candidate in the Department of Computer Science and Engineering at the Pennsylvania State University. Before enrolling in the Ph.D. program in 2010, he received a B.S. degree in Computer Engineering from the Pennsylvania State University. He is now working in the Microsystems Design Laboratory and has research interests in Computer Architecture with a focus on memory hierarchy, emerging technologies, and three-dimensional integrated circuits. He has published 12 refereed first- and co-authored papers in various conferences and journals including Design, Automation and Test in Europe (DATE), High Performance Computer Architecture (HPCA), Asia and South Pacific Design Automation Conference (ASP-DAC), Design Automation Conference (DAC), IEEE/ACM International Conference on Computer-Aided Design (ICCAD), IEEE Symposium on VLSI (ISVLSI), Computer Architecture Letters (CAL), ACM Transactions on Architecture and Code Optimization (TACO), Non-Volatile Memory Workshop (NVMW), IEEE International Conference on 3D System Integration (3DIC), and IEEE Workshop on Signal Processing Systems (SiPS).