The Pennsylvania State University
The Graduate School

ARCHITECTING BYTE-ADDRESSABLE NON-VOLATILE MEMORIES FOR MAIN MEMORY

A Dissertation in Computer Science and Engineering
by Matthew Poremba

© 2015 Matthew Poremba

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

May 2015

The dissertation of Matthew Poremba was reviewed and approved* by the following:
Yuan Xie, Professor of Computer Science and Engineering, Dissertation Co-Advisor, Co-Chair of Committee
John Sampson, Assistant Professor of Computer Science and Engineering, Dissertation Co-Advisor, Co-Chair of Committee
Mary Jane Irwin, Professor of Computer Science and Engineering, Robert E. Noll Professor, Evan Pugh Professor
Vijaykrishnan Narayanan, Professor of Computer Science and Engineering
Kenneth Jenkins, Professor of Electrical Engineering
Lee Coraor, Associate Professor of Computer Science and Engineering, Director of Academic Affairs
*Signatures are on file in the Graduate School.

Abstract
New breakthroughs in memory technology in recent years have led to increased research efforts in so-called byte-addressable non-volatile memories (NVMs). As a result, questions of how and where these types of NVMs can be used have been raised. Simultaneously, semiconductor scaling has led to an increased number of CPU cores on a processor die as a way to utilize the available area. This has increased the pressure on the memory system and caused growth in the amount of main memory available in a computer system, which in turn has escalated the power consumed by DRAM, the de facto main memory technology. Moreover, DRAM has run into physical limitations on scalability due to the nature of its operation. NVMs, on the other hand, are expected to scale well into the future and have lower static power, one of the major sources of power consumption in contemporary systems. For all of these reasons, NVMs have the potential to be an attractive alternative or even a complete replacement for DRAM as main memory.

For NVMs to augment or replace DRAM, however, several obstacles must be overcome. Although their static power and scalability are better, NVMs suffer from lower performance, higher dynamic power, and lower endurance than DRAM. Furthermore, architectural and comprehensive circuit models for exploring how these issues can be resolved at a high level are lacking. This dissertation addresses these issues by proposing several models for NVMs at both the architectural and circuit level. The architectural model, NVMain, is built around the assumption that NVMs may not be complete replacements for DRAM and thus provides the flexibility to model complex memory systems, including hybrid and distributed levels of memory. The circuit-level model, DESTINY, combines NVMs with recent three-dimensional circuit design proposals to obtain performance- and energy-balanced memory designs. These two models are leveraged to explore several NVM memory designs. The first design employs a hybrid of DRAM and NVM and addresses the issue of caching large amounts of NVM data in the DRAM portion. The second design reworks memory bank design to provide an extremely high-density NVM bank whose individual sub-units can be accessed independently. The final design leverages the high parallelism afforded by access to individual sub-units to schedule memory requests more efficiently.
Table of Contents

List of Figures vii
List of Tables x
Acknowledgments xi

Chapter 1 Introduction 1
  1.1 Background 5
  1.2 Related Work 7

Chapter 2 Simulation Framework for Non-volatile Memories 10
  2.1 Introduction 10
  2.2 Motivation 11
  2.3 Implementation 12
    2.3.1 Energy Modeling 12
    2.3.2 Non-volatile Memory Support 12
    2.3.3 Fine-grained Memory Architecture 13
    2.3.4 Memory System Flexibility 13
    2.3.5 Verification 14
    2.3.6 Timing Verification 14
    2.3.7 Energy Verification 15
    2.3.8 Data Verification 15
    2.3.9 Simulation Speed 15
  2.4 Case Studies 16
    2.4.1 MLC Simulation Accuracy 16
    2.4.2 Hybrid Memory System 17
    2.4.3 DRAM Cache 19
  2.5 Conclusions 20

Chapter 3 Bank-level Modeling of 3D-stacked NVM and Embedded DRAM 21
  3.1 Introduction 21
  3.2 Motivation 22
    3.2.1 Emerging Memory Technologies 22
    3.2.2 Modeling Tools 23
  3.3 Model Implementation 24
    3.3.1 eDRAM Model 24
    3.3.2 3D Model 25
  3.4 Validation Results 26
    3.4.1 3D SRAM Validation 27
    3.4.2 2D and 3D eDRAM Validation 28
    3.4.3 3D RRAM Validation 29
  3.5 Case Studies using DESTINY 30
    3.5.1 Finding the optimal memory technology 30
    3.5.2 Finding the optimal layer count in 3D stacking 31
  3.6 Conclusion 31

Chapter 4 Improving Effectiveness of Hybrid-Memory Systems with High-Latency Caches 33
  4.1 Motivation 35
  4.2 Implementation 38
    4.2.1 Managing the Fill Cache 40
    4.2.2 Re-routing Requests 41
    4.2.3 DRAM Cache Load 42
    4.2.4 Coalescing Fills 42
    4.2.5 Modifications to DRAM Cache 43
  4.3 Published Results 44
    4.3.1 Experimental Setup 44
    4.3.2 DRAM Cache Architectures 45
    4.3.3 Hardware Prefetcher 46
    4.3.4 Benchmark Selection 47
    4.3.5 Baseline Results 48
    4.3.6 Average Request Latency 48
    4.3.7 Prefetcher Effectiveness 50
    4.3.8 Set Indexing Effectiveness 50
    4.3.9 Coalesced Requests 51
    4.3.10 Sensitivity of Fill Cache Size 51
    4.3.11 Application Classification 52
  4.4 Conclusion 53

Chapter 5 Leveraging Non-volatility Properties for High Performance, Low Power Main Memory 54
  5.1 Introduction 55
  5.2 Motivation 56
    5.2.1 Non-Volatile Memory Design 56
    5.2.2 The Non-Volatility Property 57
  5.3 Implementation 59
    5.3.1 Partial-Activation 59
    5.3.2 Multi-Activation 60
    5.3.3 Backgrounded Writes 60
    5.3.4 Ganged Subarray Groups 61
  5.4 Published Results 62
    5.4.1 Memory Controller and Scheduling 64
    5.4.2 Multi-Issue Memory Controller 64
    5.4.3 Address Interleaving 65
    5.4.4 Number of Column Divisions and Subarray Groups 65
    5.4.5 Impact of Backgrounded Writes 67
    5.4.6 Energy Comparison 68
    5.4.7 Design Optimization 68
    5.4.8 Sensitivity Study 69
    5.4.9 Future Devices 70
    5.4.10 Application to STT-RAM and RRAM 71
    5.4.11 Comparison with Contemporary DRAM 71
  5.5 Design Implementation 72
    5.5.1 Overhead Costs 73
    5.5.2 Area Overhead 73
    5.5.3 Yield and NVM Lifetime 74
  5.6 Conclusion 75

Chapter 6 Early Activation Scheduling for Main Memories 76
  6.1 Motivation 77
    6.1.1 Baseline System Design 78
    6.1.2 Oracle Analysis 80
  6.2 Results and Analysis 80
    6.2.1 Missed Prediction Implications 80
    6.2.2 Limiting Amounts of Early-ACTs 81
    6.2.3 Unrealized Performance Potential 82
    6.2.4 Memory Controller Implications 82
  6.3 Conclusions 83

Chapter 7 Dissertation Conclusions 85
  7.1 Future Directions 87

Bibliography 88
List of Figures

1.1 Overview of Memory Architecture. Only one memory controller with one channel is shown; however, any number of channels is possible. 6

2.1 Relative error of an estimated write-pulse time compared to exact measurement based on data values. 11
2.2 Overview of NVMain Architecture. Only one memory controller with one channel is shown. 13
2.3 Calculated memory subsystem power of NVMain normalized to DRAMSim2. 15
2.4 Percentage of simulation time spent in the memory subsystem. 16
2.5 Absolute error of an estimated write-pulse time compared to exact measurement based on data values. 17
2.6 Frequency of 2-bit data values written to MLC cells for various SPEC2006 benchmarks. 18
2.7 IPC results and migration statistics for a hybrid memory. 18
2.8 IPC results of DRAM Cache with varying prediction accuracy. 19
2.9 Predictive DRAM Cache hit rate at various accuracies. 20

3.1 High-level overview of the DESTINY framework. Configurations are generated from the extended input model and fed to the NVSim core. Results are fine-tuned via the 3D model and filtered by optimization to yield result outputs. 24
3.2 4-layer monolithically stacked RRAM. Storage elements are sandwiched directly between layers of wordlines (south-east orientation) and bitlines (south-west orientation). 27

4.1 A single DRAM row in a DRAM cache. The 2KB row is divided into 32 64-byte cache-line-sized segments. The first 3 segments are used for tags while the remaining segments are ways in the cache. 36
4.2 Example fill to a baseline DRAM cache. In (a), 3 tags are read and no commands are issued until data returns. The request misses in the cache and data is fetched from off-chip memory in (b). The DRAM cache may or may not be precharged if there are other requests to this DRAM bank. In (c), the bank is re-activated if needed, and the data is written followed by a tag write. 37
4.3 Example snapshot of bank queues in SPEC2006's bwaves benchmark. 38
4.4 (a) Fill Cache entry being evicted where a coalesce can be performed. (b) The DRAM cache probing the Fill Cache for requests for bank 4. 39
4.5 Percentage of reused requests based on type. Types are referenced prefetches (RP), unreferenced prefetches (UP), and demand requests (UD). Average is 29.22% for RP, 41.88% for UP, and 28.90% for UD. 41
4.6 Example set indexing schemes for DRAM caches. In (a), indexing is similar to SRAM caches, where the lowest bits above the byte offset determine the set and upper bits determine the tag. In (b), a portion of the tag is after the byte offset, promoting row-buffer hits. 43
4.7 Block diagram of the architecture of a main memory subsystem containing a Fill Cache (F$), along with a state machine showing basic request flow in a DRAM cache memory system with a Fill Cache. 46
4.8 Speedup of DRAM cache with Fill Cache over the DRAM cache baseline. Results for the baseline, STeMS baseline, and combined approaches are shown. The x-axis represents the speed-up over the baseline of DRAM Cache with MissMap only, and the y-axis is the benchmark. 48
4.9 Average DRAM cache request latency for the first 100 million execution cycles, showing how the Fill Cache smooths request latency during high-memory-traffic periods (e.g., during application start). The darker line is the average request latency for the DRAM Cache + MissMap design and the lighter line is for the DRAM Cache + MissMap + Fill Cache design. Average latency (x-axis) and total execution time (y-axis) are in memory cycles. 49
4.10 Accuracy and number of covered and uncovered requests issued to the DRAM cache by the prefetcher. 50
4.11 Percentage of fill requests able to be coalesced. 51
4.12 Percentage of read requests where data was returned by the Fill Cache. 52

5.1 Potential energy savings in NVM by opening 1/Nth of a row buffer. 55
5.2 The design of a typical NVM memory. Two NVM sub-arrays are illustrated. One sub-array is an 8x4 structure, where 8 local bitlines are selected by 4 local Y-selects. The global Y-select further selects the final I/O bitline that is sensed by the global sense amplifier (S/A). 57
5.3 The non-volatility property allows reading of only the top-left cell in a tile. Neither of the right two columns is connected to the S/A. 58
5.4 The access schemes proposed in FgNVM. (a) Partial-Activation: only one of the top two tiles is read from the bank; energy is saved in the other tile. (b) Multi-Activation: data is read from tiles in different rows of the same bank, doubling the bandwidth potential. (c) Backgrounded Write: the upper-left tile is read while the lower-right tile is written, reducing read/write interference. 59
5.5 Ganged subarrays use multiple SAGs accessed in parallel to increase row buffer sizes. 61
5.6 IPC improvement over the baseline PCM design compared to FgNVM, multi-issue FgNVM, and an ideal scenario. All results show 8×2 FgNVM designs. 64
5.7 Performance difference with and without interleaving on rows and columns: shows that interleaving rows and columns is beneficial to FgNVM designs. 65
5.8 IPC impact of adding more subarray groups, averaged over the SPEC benchmarks: dramatically increasing subarrays yields little performance improvement, with a maximum of around 4% in libquantum, gobmk, GemsFDTD, and soplex. 66
5.9 IPC impact of adding more column divisions, averaged over the SPEC benchmarks: large difference in IPC when adding more column divisions; the ideal case moves opposite to the real case. 67
5.10 The 'nowb' bars show IPC when Backgrounded Writes are disabled. The remaining bars show the IPC improvement when Backgrounded Writes are enabled. Up to a 12% increase for lbm and 5% on average. 67
5.11 Energy consumption normalized to the baseline NVM prototype. Shows a significant decrease in most FgNVM configurations. 68
5.12 The CCD hit rate (hit rate of data in all sensed column divisions). As more column divisions are added, the rate drops due to underfetch. 69
5.13 Sensitivity of our tRCD and tCAS selections on 8×32. Shows that sensitivity when decreasing tCAS is proportional to the CCD hit rate. 70
5.14 Study of reducing total tRCD and tCAS. This simulates the performance of future devices with faster activation or sensing time. 70
5.15 Application of FgNVM to a subarray-ganged RRAM design. Yields results similar to the PCM design. 71
5.16 FgNVM vs. a mid-grade DDR3 DRAM, normalized to the baseline NVM. Shows that the FgNVM design can approach the speed of DRAM. 72

6.1 Example of an Early-ACT in action. The top line shows the baseline case, where the ACT is created and issued when the demand request arrives at the MC. The bottom line shows the Early-ACT idea, where ACTs are issued speculatively before a demand request arrives. 77
6.2 Baseline system setup for testing the Early-ACT concept. Four CPUs with three levels of cache, access predictor (AP), memory controller (MC), and 4 channels of memory. 78
6.3 Memory access predictor accuracy at the L3 cache across memory-intensive SPEC2006 benchmarks. High accuracy is achieved for confident issuance of Early-ACT commands. 79
6.4 Results of running Early-ACT oracle prediction on PCM-type memories. The average total memory latency is shown with baseline, oracle pairs for each memory type. 79
6.5 Results of running Early-ACT using naive issuance of Early-ACTs. Baseline and simulation with Early-ACT are shown. 80
6.6 Total number of Early-ACT requests. The difference between Early-ACT and oracle is the total number of false hits and misses combined. 81
6.7 Instructions per cycle for each benchmark when limiting the number of Early-ACTs. A "less than N%" series means Early-ACTs are only issued when the memory queue is less than N% full. 82
6.8 Average number of ACT cycles saved when issuing an Early-ACT. This average can range from 0 to 48. In all cases, the oracle savings are much higher than those of a realistic implementation. 83
6.9 Total number of ACT cycles saved across the entire simulation run. The difference in savings between Early-ACT and oracle mirrors the speedup shown in the performance figure. 83
List of Tables

2.1 Simulated MLC Timing Parameters 17

3.1 Validation for 3D SRAM model. 28
3.2 Validation of 2D and 3D eDRAM. 28
3.3 Design space exploration results for determining the optimal memory technology for a desired optimization target (refer to Section 3.5.1). The table shows results on all parameters for comparison purposes. 29
3.4 Validation of 3D RRAM. 29
3.5 Design space exploration results for determining the optimal number of 3D-stacked layers for various optimization targets for STT-RAM (refer to Section 3.5.2). The table shows results on all parameters for comparison purposes. 30

4.1 Experimental Setup Parameters 42
4.2 Total memory traffic over the first 2 billion cycles of each benchmark. Selected benchmarks are in bold. 44
4.3 Categories of benchmarks run in our simulations. Some benchmarks belong to multiple categories. 53

5.1 MPKI and WPKI of SimPoint slices. 62
5.2 Experimental System Setup 63
5.3 Summary of area overheads in the FgNVM design. 73

6.1 Experimental System Setup 78
Acknowledgments
I would like to express my gratitude to Dr. Yuan Xie, who served as my research advisor for the past seven years: five years during graduate school as well as two years as an undergraduate student. I thank him for his encouragement, motivation, and ability to push me forward while working on exciting research topics, and for his strong connections to top researchers, which helped me gain exposure and get where I am today. Furthermore, I would like to thank my committee members Dr. John Sampson, Dr. Mary Jane Irwin, Dr. Vijaykrishnan Narayanan, and Dr. Kenneth Jenkins for their help and advice on my dissertation proposal and throughout the years while my advisor was away, and for matching me with other students with similar research interests so that we could work together towards top-notch publications. Thank you to all students in the department who gave me the opportunity to work with them, both those who have departed and those who remain: Dr. Jin Ouyang, Dr. Cong Xu, Dr. Tao Zhang, Dr. Mike Debole, Dr. Qiaosha Zuo, Dr. Dimin Niu, Dr. Lian Duan, Dr. Xiaoxia Wu, Dr. Jishen Zhao, Dr. Xiangyu Dong, Dr. Jue Wang, Dr. Karthik Swaminathan, Dr. Kevin Irick, Dr. Guangyu Sun, Hsiang-Yun Cheng, Jia Zhan, Ping Chi, Jing Xie, Ivan Stalev, Hang Zhang, Yang Zheng, and Kaisheng Ma. Finally, I would like to thank my family and friends for their motivation and encouragement throughout the years.
Dedication
This dissertation is dedicated to my friends, family, and colleagues who have provided support, encouragement, and guidance throughout my graduate school career.
Chapter 1
Introduction
Difficulties in improving existing mainstream memory technologies such as DRAM have become a major obstacle to increasing device capability. For several years there have been issues in decreasing the DRAM half-pitch needed to improve memory capacity and price-per-bit metrics. As of the latest ITRS roadmap, there are still no known solutions to some of the problems being faced by DRAM, including reliable charge storage and sensing mechanisms [1]. These struggles are drawing attention towards redesigns of, and potentially even replacements for, DRAM.

Several works have proposed replacing main memory with non-volatile memories (NVMs). The intrinsic characteristic of non-volatile memories is that data is not lost when power is removed. This implies that data does not need to be refreshed and that peripheral circuitry such as wordline drivers, sense amplifiers, and write drivers can be power gated without risk of data loss. Since refresh and stand-by power in the DRAM arrays themselves are two major sources of power dissipation in DRAM, this has the potential for high energy savings. Unfortunately, the operational energy of non-volatile memories is nominally high for reads and even higher for writes, which can eat away much of these savings. Some proposals therefore consider hybrids of DRAM and non-volatile memories [2, 3] to combat this.

Simply replacing DRAM cells with non-volatile memory cells is not a viable option, since there are several differences in operation between the two technologies. Beyond operational energy, these technologies also suffer from longer latencies. Specifically, the write latency of non-volatile memories can be orders of magnitude larger than the read latency, and due to the operation of the memory cell it is difficult to optimize these write latencies. Some architectural-level approaches have been taken to help hide these high latencies, including write cancellation, write pausing [4], write buffers, and smaller row buffer sizes [5]. However, much work remains before non-volatile memories can replace DRAM as main memory.

In addition to these issues, non-volatile memory research is increasingly focusing on other unique characteristics of the memory cells. Unlike DRAM, non-volatile memories have the ability to reliably store multiple bits per cell. Some NVM technologies additionally allow for new memory layouts, such as cross-point designs [6] and high-density 3D-stacked memories [7, 8, 9]. However, such designs currently have additional problems with reliability and stability, which cause read and write latencies to differ non-uniformly and introduce high "sneak" currents that increase overall memory power. As a direct result, architectural-level simulation with knowledge of application data can be very important for exploring the interactions of circuit-level issues and providing more accurate results for research prototype designs.
The Problem: Although several works exploring the design of non-volatile memory have already been published, many of them may be inaccurate due to high-level abstractions made during simulation. The lack of accuracy in abstracting away the differences between memory architectures can also mask problems that would exist in real-world embodiments of memory hierarchy designs. As a result, better simulation frameworks are needed to find these problems and explore solutions. In addition, very few works utilize the unique features (i.e., non-volatility) of these emerging memory technologies, and instead simply consider the lack of leakage power as the main feature of NVMs. This leaves a large gap in research between circuit-level implementations and architectural-level explorations of next-generation memory systems.
Demonstrated Solutions:
(I) The first portion of this dissertation introduces a simulation framework to model the new characteristics of emerging bit-accessible non-volatile memories, taking into account characteristics of NVMs missing from existing tools. Current simulators neither have the ability to model hybrid memory systems, endurance, or fault tolerance schemes, nor accurate models for energy, write scheduling, and real application data. Furthermore, the monolithic design of these simulators makes adding support for emerging NVMs a challenge. We implemented NVMain, a highly flexible simulator allowing for modular memory design (e.g., memory controllers, interconnects, and banks are the modules). This modular design allows for easy implementation of hybrid memory systems as well as the extensibility needed to explore endurance, fault tolerance schemes, and write programming policies. In addition, per-module energy modeling and a shared request flow allow for accurate energy models with data availability. We show that simply increasing the service time of write requests in existing simulators is not sufficient to model non-volatile memories. This is especially true for multi-level cell (MLC) designs, where the measured IPC can otherwise be highly inaccurate across applications, with the error correlating strongly with their data patterns.
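To illustrate the modular design described above, the following is a minimal sketch of how controllers and banks might share one module interface with per-module, data-dependent energy accounting. The class names, interface, and energy numbers are illustrative assumptions for this sketch and do not reproduce NVMain's actual API.

```cpp
// Sketch of a modular memory-simulator design in the spirit of NVMain.
// Requires C++20 (std::popcount). All names and numbers are illustrative.
#include <bit>
#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

struct Request {
    uint64_t address;
    bool     isWrite;
    uint64_t data;  // real application data enables data-dependent models
};

// Every component (controller, interconnect, rank, bank) shares one
// interface, so modules can be composed into hybrid hierarchies.
class MemoryModule {
public:
    virtual ~MemoryModule() = default;
    virtual void issue(const Request& req) = 0;
    virtual double energyConsumed() const = 0;  // per-module energy modeling
};

class Bank : public MemoryModule {
    double energy_ = 0.0;
public:
    void issue(const Request& req) override {
        // Data-dependent write energy: e.g., an MLC write pulse whose cost
        // depends on the value written (coefficients are placeholders).
        energy_ += req.isWrite ? 2.0 + 0.5 * std::popcount(req.data) : 1.0;
    }
    double energyConsumed() const override { return energy_; }
};

class Controller : public MemoryModule {
    std::vector<std::unique_ptr<MemoryModule>> children_;
public:
    void addChild(std::unique_ptr<MemoryModule> m) { children_.push_back(std::move(m)); }
    void issue(const Request& req) override {
        // Trivial routing: interleave requests across children by address.
        children_[req.address % children_.size()]->issue(req);
    }
    double energyConsumed() const override {
        double total = 0.0;
        for (const auto& c : children_) total += c->energyConsumed();
        return total;
    }
};

int main() {
    Controller mc;
    mc.addChild(std::make_unique<Bank>());
    mc.addChild(std::make_unique<Bank>());  // e.g., one DRAM, one NVM bank
    mc.issue({0x100, true, 0xFF});
    mc.issue({0x101, false, 0});
    std::cout << "total energy: " << mc.energyConsumed() << "\n";
}
```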
(II) The second portion of this dissertation addresses the need for bank-level models of energy and latency for the data values written to a memory cell. We extend the work in the first portion of this dissertation and build on previous widely known research tools to fill the gaps needed for circuit-level memory simulation. We implement DESTINY, a 3D design-space exploration tool for SRAM, eDRAM, and non-volatile memory. The values output by DESTINY can be used as input to NVMain. In addition to non-volatile types, the tool also covers memory technologies missing from previous tools, such as eDRAM and general 3D-stacked memories. The tool is also able to automate design-space exploration, designing caches with the optimal number of 3D layers or suggesting a technology for a given optimization target. Therefore, this tool is not only useful as a supplement to NVMain, but also has high utility for general computer architecture research.
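As an illustration of this kind of automated exploration, the sketch below sweeps 3D layer counts with a toy cost model and keeps the best design for a chosen optimization target. The cost formulas and trends are placeholders, not DESTINY's circuit models.

```cpp
// Hedged sketch of a DESTINY-style design-space sweep: enumerate 3D layer
// counts, estimate latency/energy/area with a toy model, and keep the best
// design for one optimization target. All formulas are placeholders.
#include <iostream>
#include <limits>

struct Design { int layers; double latency, energy, area; };

static Design evaluate(int layers) {
    // Toy trends: more layers shrink the footprint (area) but lengthen the
    // through-silicon path (latency) and add peripheral energy.
    Design d;
    d.layers  = layers;
    d.area    = 100.0 / layers + 5.0 * layers;   // mm^2, placeholder
    d.latency = 10.0 + 1.5 * layers;             // ns,   placeholder
    d.energy  = 2.0 + 0.3 * layers;              // nJ,   placeholder
    return d;
}

int main() {
    double best = std::numeric_limits<double>::max();
    Design bestDesign{};
    for (int layers = 1; layers <= 16; layers *= 2) {
        Design d = evaluate(layers);
        double cost = d.area;  // optimization target: minimize area
        if (cost < best) { best = cost; bestDesign = d; }
    }
    std::cout << "optimal layer count for area: " << bestDesign.layers << "\n";
}
```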
(III) The third portion of this dissertation explores the challenges of hybrid memory systems. For this work, we utilize a large DRAM memory as a cache-block-sized last-level cache (LLC). Such hybrid memories are increasing in popularity based on the intuition that the high write time of non-volatile memories can easily be hidden by a large, fast cache. In this hybrid memory design, write requests originate both from writebacks in higher-level caches and from "fill" requests from main memory. However, unlike their SRAM counterparts, DRAM-based caches are significantly slower than the next higher cache in the memory hierarchy. As a result, the increased number of writes becomes problematic when it blocks read requests to the DRAM cache. For this work, we propose an additional "Fill Cache" to hold write requests. Unlike a write buffer, the Fill Cache can be designed to match cache sets to banks in the DRAM cache. As a result, the Fill Cache can be much larger and hold much more data than a write buffer, and the increase in size allows for a much longer residency time. We show that this enables additional techniques that can be applied exclusively to fill caches. In addition to write buffering, we show that the Fill Cache can be used to coalesce row-buffer hits in the DRAM cache, to filter requests and drop fills to the DRAM cache, and to provide a fast path for accessing unfilled data such as the long prefetch requests typically seen at lower-level caches, as sketched below.
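The sketch below illustrates two of the Fill Cache mechanisms named above, per-bank indexing with row coalescing and the read fast path, under an assumed structure and naming; it is not the implementation evaluated in Chapter 4.

```cpp
// Sketch of the Fill Cache idea: buffered fills are indexed by DRAM-cache
// bank so that co-resident fills can be coalesced into an open row and
// pending reads can be serviced directly. Names are illustrative only.
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

struct FillEntry { uint64_t lineAddr; std::vector<uint8_t> data; };

class FillCache {
    // One bucket of pending fills per DRAM-cache bank.
    std::unordered_map<unsigned, std::vector<FillEntry>> perBank_;
public:
    void insert(unsigned bank, FillEntry e) { perBank_[bank].push_back(std::move(e)); }

    // Fast path: a read hit in the Fill Cache returns unfilled data without
    // touching the (slow) DRAM cache at all.
    std::optional<FillEntry> probeRead(unsigned bank, uint64_t lineAddr) {
        for (auto& e : perBank_[bank])
            if (e.lineAddr == lineAddr) return e;
        return std::nullopt;
    }

    // Coalescing: drain every pending fill that maps to the currently open
    // row in this bank, turning would-be activations into row-buffer hits.
    std::vector<FillEntry> drainRow(unsigned bank, uint64_t openRow,
                                    uint64_t linesPerRow) {
        std::vector<FillEntry> hits;
        auto& q = perBank_[bank];
        for (auto it = q.begin(); it != q.end();) {
            if (it->lineAddr / linesPerRow == openRow) {
                hits.push_back(std::move(*it));
                it = q.erase(it);
            } else ++it;
        }
        return hits;
    }
};

int main() {
    FillCache fc;
    fc.insert(4, {0x2001, {}});
    fc.insert(4, {0x9000, {}});
    // Drains the fill to 0x2001 into the open row; 0x9000 stays queued.
    auto coalesced = fc.drainRow(4, 0x2001 / 32, /*linesPerRow=*/32);
    (void)coalesced;
}
```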
(IV) The fourth portion of this dissertation involves the design of a non-volatile main memory that leverages the unique non-volatility characteristic of the memory cells. Previous works show that the sense amplifiers used in non-volatile main memories are significantly larger than the sense amplifiers in DRAM sub-arrays [10, 11, 12, 13]. As a result, non-volatile memories with a high enough resistance relative to the global I/O lines (i.e., the wires delivering the data) may employ global sense amplifiers. One prior work concludes that it is more area- and power-efficient to reduce the size of row buffers and implement multiple SRAM-cached row buffers [5]. We studied the data access patterns of multiple benchmark applications and found several interesting characteristics. First, the entirety of a row buffer is generally not used. The global I/O lines therefore need not be sensed if the corresponding data bits are never accessed, and not sensing these columns can result in significant energy reduction over an activation cycle (i.e., when a row of data is sensed). Second, the depth of the queues provides enough insight to find requests on these unused I/O lines that reside in a different row than the request currently being sensed; as a result, it is possible to sense both requests in parallel, and we show that a low-area-overhead design is possible because of the non-volatility of the memory cells. Third, unused I/O lines in both cases can also be used to sneak in write requests, so that writes can potentially become entirely non-blocking. Using these observations, we simulate a memory system with lower energy, higher performance, and minimal area overhead. A sketch of the parallel-sensing check appears below.
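The following is a minimal sketch of the second observation: scanning the bank queue for a request in a different row that needs only idle column divisions. The request format and division count are assumptions made for illustration.

```cpp
// Sketch of a Multi-Activation opportunity check in the spirit of Chapter 5:
// while one request is being sensed in a subset of column divisions, scan the
// bank queue for a request in a different row whose columns map to the
// still-unused divisions, so both can be sensed in parallel.
#include <bitset>
#include <cstdint>
#include <optional>
#include <vector>

constexpr unsigned kColumnDivisions = 8;  // e.g., an 8-division bank (assumed)

struct MemRequest {
    uint64_t row;
    unsigned columnDivision;  // which 1/8th of the row holds the data
};

// Returns a queued request that can be multi-activated alongside `active`,
// i.e., one in a different row that needs only an idle column division.
std::optional<MemRequest> findParallelRequest(
        const MemRequest& active,
        const std::vector<MemRequest>& bankQueue) {
    std::bitset<kColumnDivisions> busy;
    busy.set(active.columnDivision);  // divisions already being sensed
    for (const auto& r : bankQueue) {
        if (r.row != active.row && !busy.test(r.columnDivision))
            return r;
    }
    return std::nullopt;
}

int main() {
    MemRequest active{/*row=*/10, /*columnDivision=*/0};
    std::vector<MemRequest> queue{{10, 1}, {42, 3}};
    // Finds row 42, division 3: it can be sensed in parallel with row 10.
    auto extra = findParallelRequest(active, queue);
    (void)extra;
}
```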
(V) The fifth portion of this dissertation leverages the non-volatile memory design proposed in the fourth portion to explore its potential for architecture-level applications. Specifically, we consider the ability to decouple the standard row-based addressing system used to access main-memory-style designs and propose a method to speculatively open arbitrary rows using an activation command. We can do this by leveraging the property that multiple rows may be open in fine-granularity memory systems. This means that if we can accurately predict which rows will be needed in the near future, we can preemptively issue an activation to these rows so that data is available in the memory row buffer as soon as the transaction arrives at the memory controller. By doing so, the average activation time is reduced and the number of row-buffer hits for read and write transactions increases, reducing overall memory latency and boosting system performance.
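The sketch below illustrates this Early-ACT flow under an assumed predictor interface: a speculative activate is issued only when prediction confidence is high and the command queue is lightly loaded (cf. the queue-occupancy limits studied in Chapter 6). The thresholds and encoding are placeholders.

```cpp
// Sketch of the Early-ACT idea: when a predictor is confident that a row
// will soon be demanded, issue its ACT speculatively so the row is already
// open when the request reaches the controller. Interface is illustrative.
#include <cstdint>
#include <deque>

struct Prediction { uint64_t bank, row; double confidence; };

class MemoryController {
    std::deque<uint64_t> commandQueue_;
public:
    void issueActivate(uint64_t bank, uint64_t row) {
        commandQueue_.push_back((bank << 32) | row);  // toy ACT encoding
    }
    size_t queueOccupancyPercent(size_t capacity) const {
        return 100 * commandQueue_.size() / capacity;
    }
};

// Only issue an Early-ACT when the predictor is confident AND the command
// queue is lightly loaded, so speculation does not delay demand requests.
void maybeEarlyActivate(MemoryController& mc, const Prediction& p) {
    constexpr double kMinConfidence = 0.9;  // assumed threshold
    constexpr size_t kMaxQueuePct   = 25;   // cf. the "less than N%" series
    if (p.confidence >= kMinConfidence &&
        mc.queueOccupancyPercent(/*capacity=*/64) < kMaxQueuePct)
        mc.issueActivate(p.bank, p.row);
}

int main() {
    MemoryController mc;
    maybeEarlyActivate(mc, {/*bank=*/2, /*row=*/0x1A3, /*confidence=*/0.95});
}
```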
The goal of this dissertation is to explore the potential for non-volatile memory to replace DRAM as main memory, either partially or completely. Through the five portions listed above, we introduce several techniques that can be used to improve performance or reduce the energy consumption of a computer's memory subsystem. Although non-volatile memory may be better suited for other places in the memory hierarchy, such as caches or solid-state disks, we limit the scope to main memory in an attempt to bring power and performance metrics closer to those of DRAM.
1.1 Background
NVM Cell Characteristics
Non-volatile memory cells are fundamentally different from current DRAM and SRAM-based memories because they store information as a resistance value rather than as a charge. The main benefit of this is the ability to keep information without a power source. This is becoming increasingly important as technology scaling has begun to make leakage power the dominant source of power dissipation. Non-volatile memories can be realized in multiple ways, but the most popular types of NVM devices today are Phase-Change Memory (PCM), Resistive RAM or Memristor (RRAM), and Spin-Transfer Torque RAM (STT-RAM).

Phase-change memory stores information by heating a chalcogenide element. Heating the element to a high temperature with a high current and then quickly removing the current causes the material to form an amorphous structure that is "frozen" in place. Heating to a lower temperature, hot enough for the material to move, and gradually cooling it instead induces crystal growth in the cell. The crystalline structure has much lower resistance than the amorphous state, providing two distinct values that can be stored in the cell.

Resistive RAM stores information in a metal-oxide insulator between two electrodes. Introducing a potential difference across the electrodes causes soft breakdown of the metal-oxide material and generates oxygen vacancies in this layer. These vacancies move towards the opposing electrode, forming a conductive filament. Flowing current in the opposite direction forces the oxygen vacancies back into the metal-oxide layer, removing the conductive filament. Again, the two different states of the cell are enough to store at least two measurable values.

Spin-transfer torque RAM stores information in a reversible "free" magnetic layer. This layer is sandwiched together with a layer of permanent magnetization used as a reference. By providing current flow in one direction, the magnetic field of the free layer can be changed. The difference in the directions of the magnetic fields creates two distinct resistance values which can be used to store data. In all three technologies, reading amounts to comparing the sensed resistance against one or more reference values, as sketched below.
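The sketch below shows this resistance-to-bits mapping for a single-level cell and for a 2-bit multi-level cell (the MLC mode studied in Chapter 2). All threshold values are invented for illustration and are not calibrated to any real device.

```cpp
// Illustrative mapping from sensed cell resistance to stored bits. Real
// devices calibrate references per technology and per array; these numbers
// are placeholders.
#include <cstdint>

// Single-level cell: one reference resistance separates the two states.
uint8_t senseSLC(double resistanceOhms) {
    constexpr double kRef = 50e3;           // placeholder 50 kOhm reference
    return resistanceOhms > kRef ? 1 : 0;   // high-R (e.g., amorphous) -> 1
}

// Multi-level cell: three references divide the resistance range into four
// states, storing two bits per cell.
uint8_t senseMLC2bit(double resistanceOhms) {
    constexpr double kRefs[3] = {20e3, 80e3, 300e3};  // placeholders
    uint8_t level = 0;
    for (double ref : kRefs)
        if (resistanceOhms > ref) ++level;
    return level;  // 0..3, i.e., two bits
}

int main() {
    uint8_t bits = senseMLC2bit(120e3);  // falls between refs 1 and 2 -> 2
    (void)bits;
}
```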
Memory Organization
Main memory is typically organized into multiple channels, ranks, and banks. Figure 1.1 shows the architecture of a single memory channel; additional memory channels would be similar. The memory controller is responsible for scheduling requests to the memory system. It is typically connected via an interconnect (e.g., a transmission-line bus from a CPU's memory controller to the memory modules). Each memory rank consists of a group of banks, and there may be multiple ranks on the same interconnect. Within each bank, there are multiple subarrays which subdivide the bank in the vertical direction and share a common local row decoder driven by the bank's global row decoder. Each subarray is further subdivided in the horizontal direction into multiple tiles. Each tile typically passes through a multiplexer to reduce the number of I/O bitlines that must be sensed. A sketch of one possible address decomposition for this organization follows the figure.
Figure 1.1: Overview of Memory Architecture. Only one memory controller with one channel is shown; however, any number of channels is possible.
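As a concrete, assumed example of how a physical address maps onto this organization, the sketch below decomposes an address into channel, rank, bank, row, and column fields. The field widths are one possible layout chosen for illustration, not a fixed standard.

```cpp
// Hedged sketch of decomposing a physical address into the channel/rank/
// bank/row/column fields of the organization in Figure 1.1. Field widths
// below are one example layout (2 channels, 2 ranks, 8 banks, 64B lines).
#include <cstdint>
#include <cstdio>

struct DecodedAddress {
    unsigned channel, rank, bank, row, column;
};

DecodedAddress decode(uint64_t addr) {
    DecodedAddress d;
    addr >>= 6;                              // drop 64-byte line offset
    d.channel = addr & 0x1;  addr >>= 1;     // 1 bit:  2 channels
    d.column  = addr & 0x7F; addr >>= 7;     // 7 bits: 128 line columns/row
    d.bank    = addr & 0x7;  addr >>= 3;     // 3 bits: 8 banks per rank
    d.rank    = addr & 0x1;  addr >>= 1;     // 1 bit:  2 ranks per channel
    d.row     = static_cast<unsigned>(addr); // remaining bits select the row
    return d;
}

int main() {
    DecodedAddress d = decode(0x12345678);
    std::printf("ch=%u rank=%u bank=%u row=%u col=%u\n",
                d.channel, d.rank, d.bank, d.row, d.column);
}
```

Interleaving low-order bits across channels and banks, as above, spreads consecutive lines over parallel resources; Chapter 5 revisits this choice when evaluating row and column interleaving for FgNVM designs.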