CROW: A Low-Cost Substrate for Improving DRAM Performance, Energy Efficiency, and Reliability

Hasan Hassan† Minesh Patel† Jeremie S. Kim†§ A. Giray Yaglikci† Nandita Vijaykumar†§ Nika Mansouri Ghiasi† Saugata Ghose§ Onur Mutlu†§

†ETH Zürich §Carnegie Mellon University

ABSTRACT

DRAM has been the dominant technology for architecting main memory for decades. Recent trends in multi-core system design and large-dataset applications have amplified the role of DRAM as a critical system bottleneck. We propose Copy-Row DRAM (CROW), a flexible substrate that enables new mechanisms for improving DRAM performance, energy efficiency, and reliability. We use the CROW substrate to implement 1) a low-cost in-DRAM caching mechanism that lowers DRAM activation latency to frequently-accessed rows by 38% and 2) a mechanism that avoids the use of short-retention-time rows to mitigate the performance and energy overhead of DRAM refresh operations. CROW's flexibility allows the implementation of both mechanisms at the same time. Our evaluations show that the two mechanisms synergistically improve system performance by 20.0% and reduce DRAM energy by 22.3% for memory-intensive four-core workloads, while incurring 0.48% extra area overhead in the DRAM chip and 11.3 KiB storage overhead in the memory controller, and consuming 1.6% of DRAM storage capacity, for one particular implementation.

KEYWORDS

DRAM, memory systems, performance, power, energy, reliability

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. ISCA '19, June 22–26, 2019, Phoenix, AZ, USA. © 2019 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery. ACM ISBN 978-1-4503-6669-4/19/06...$15.00. https://doi.org/10.1145/3307650.3322231

1 INTRODUCTION

DRAM has long been the dominant technology for architecting main memory systems due to the high capacity it offers at low cost. As the memory demands of applications have been growing, manufacturers have been scaling the DRAM process technology to keep pace. Unfortunately, while the density of DRAM chips has been increasing as a result of scaling, DRAM faces three critical challenges in meeting application demands [78, 82]: (1) high access latencies and (2) high refresh overheads, both of which degrade system performance and energy efficiency; and (3) increasing exposure to vulnerabilities, which reduces the reliability of DRAM.

First, the high DRAM access latency is a challenge to improving system performance and energy efficiency. While DRAM capacity increased significantly over the last two decades [6, 35, 37, 57, 58, 105], DRAM access latency decreased only slightly [6, 58, 82]. The high DRAM access latency significantly degrades the performance of many workloads [14, 23, 28, 30, 123]. The performance impact is particularly large for applications that 1) have working sets exceeding the cache capacity of the system, 2) suffer from high instruction and data cache miss rates, and 3) have low memory-level parallelism. While manufacturers offer latency-optimized DRAM modules [72, 98], these modules have significantly lower capacity and higher cost compared to commodity DRAM [8, 53, 58]. Thus, reducing the high DRAM access latency without trading off capacity and cost in commodity DRAM remains an important challenge [17, 18, 58, 78].

Second, the high DRAM refresh overhead is a challenge to improving system performance and energy consumption. A DRAM cell stores data in a capacitor that leaks charge over time. To maintain correctness, every DRAM cell requires periodic refresh operations that restore the charge level in a cell. As the DRAM cell size decreases with process technology scaling, newer DRAM devices contain more DRAM cells than older DRAM devices [34]. As a result, while DRAM capacity increases, the performance and energy overheads of the refresh operations scale unfavorably [7, 40, 64]. In modern LPDDR4 [73] devices, the memory controller refreshes every DRAM cell every 32 ms. Previous studies show that 1) refresh operations incur large performance overheads, as DRAM cells cannot be accessed when the cells are being refreshed [7, 64, 76, 84]; and 2) up to 50% of the total DRAM energy is consumed by the refresh operations [7, 64].

Third, the increasing vulnerability of DRAM cells to various failure mechanisms is an important challenge to maintaining DRAM reliability. As the process technology shrinks, DRAM cells become smaller and get closer to each other, and thus become more susceptible to failures [56, 64, 65, 70, 78, 91, 118]. A tangible example of such a failure mechanism in modern DRAM is RowHammer [52, 79, 80]. RowHammer causes disturbance errors (i.e., bit flips in vulnerable DRAM cells that are not being accessed) in DRAM rows physically adjacent to a row that is repeatedly activated many times.

These three challenges are difficult to solve efficiently by directly modifying the underlying cell array structure. This is because commodity DRAM implements an extremely dense DRAM cell array that is optimized for low area-per-bit [57, 58, 105]. Because of its density, even a small change in the DRAM cell array structure may incur non-negligible area overhead [58, 78, 112, 119]. Our goal in this work is to lower the DRAM access latency, reduce the refresh overhead, and improve DRAM reliability with no changes to the DRAM cell architecture, and with only minimal changes to the DRAM chip.

To this end, we propose Copy-Row DRAM (CROW), a flexible in-DRAM substrate that can be used in multiple different ways to address the performance, energy efficiency, and reliability challenges of DRAM. The key idea of CROW is to provide a fast, low-cost mechanism to duplicate select rows in DRAM that contain data that is most sensitive or vulnerable to latency, refresh, or reliability issues. CROW requires only small changes in commodity DRAM chips. At a high level, CROW partitions each DRAM subarray into two regions (regular rows and copy rows) and enables independent control over the rows in each region. The key components required for independent control include 1) an independent decoder for the copy rows and 2) small changes in the memory controller interface

1 to address the copy rows. To store information about the state of DRAM chip. CROW-ref has two important properties that prior the copy rows (e.g., which regular rows are duplicated in which works [2, 3, 11, 12, 19, 33, 38, 49, 50, 64, 66, 69, 76, 83, 84, 86, 88, 107] copy rows), CROW implements a small CROW-table in the memory fail to address at the same time. First, to avoid using weak rows, controller (Section 3.3). CROW-ref does not require system software changes that compli- CROW takes advantage of the fact that DRAM rows in the same cate data allocation. Second, once our versatile CROW substrate is subarray share sense amplifiers to enable two types of interaction implemented, CROW-ref does not require any additional changes between a regular row and a copy row. First, CROW can activate in DRAM that are specific to enabling a new refresh scheme. (i.e., open) a copy row slightly after activating a regular row in Results Overview. Our evaluations show that CROW-cache the same subarray. Once the copy row is activated, the charge and CROW-ref significantly improve performance and energy ef- restoration operation in DRAM (used to restore the charge that was ficiency. First, our access latency reduction mechanism, CROW- drained from the cells of a row during activation) charges both the cache, provides 7.4% higher performance and consumes 13.3% less regular row and the copy row at once. The simultaneous charge DRAM energy, averaged across our 20 memory-intensive four-core restoration of the two rows copies the contents of the regular row workloads, while reserving only 1.6% of the DRAM storage capacity into the copy row, similar to the in-DRAM row copy operation for copy rows on a system with four LPDDR4 channels. Second, our proposed by RowClone [100]. Second, CROW can simultaneously refresh rate reduction mechanism, CROW-ref, improves system activate a regular row and a copy row that contain the same data. performance by 8.1% and reduces DRAM energy by 5.4%, averaged The simultaneous activation reduces the latency required to access across the same four-core workloads using the same system con- the data by driving each sense amplifier using the charge from two figuration with futuristic 64 Gbit DRAM chips. CROW-cache and cells in each row (instead of using the charge from only one cell in CROW-ref are complementary to each other. When combined with- one row, as is done in commodity DRAM). out increasing the number of copy rows available, the two mech- The CROW substrate is flexible, and can be used to implement a anisms together improve average system performance by 20.0% variety of mechanisms. In this work, we discuss and evaluate two and reduce DRAM energy consumption by 22.3%, achieving greater such novel mechanisms in detail: 1) CROW-cache, which reduces the improvements than either mechanism does alone. DRAM access latency (Section 4.1) and 2) CROW-ref, which reduces To analyze the activation and restoration latency impact of simul- the DRAM refresh overhead (Section 4.2). We also briefly discuss a taneously activating multiple rows, we develop a detailed circuit- third mechanism that mitigates RowHammer errors (Section 4.3). level DRAM model. We release this circuit-level DRAM model and Reducing DRAM Access Latency with CROW-cache. 
Prior our simulation infrastructure that we use to evaluate the perfor- works [26, 39, 116] observe that a significant fraction of the most- mance and energy saving benefits of CROW as publicly available recently-accessed rows are activated again soon after the rows tools [96]. We hope that this will encourage researchers to develop are closed (i.e., the rows exhibit high reuse). To take advantage of other novel mechanisms using the CROW substrate. the reuse of recently-accessed rows, we use CROW to implement This paper makes the following key contributions: CROW-cache, which reduces the DRAM access latency when a • We propose CROW, a flexible and low-cost substrate in com- recently-accessed row is activated again. The key idea of CROW- modity DRAM that enables mechanisms for improving DRAM cache is to 1) duplicate data from recently-accessed regular rows performance, energy efficiency, and reliability by providing two into a small cache made up of copy rows, and 2) simultaneously sets of rows that have independent control in each subarray. activate a duplicated regular row with its corresponding copy row CROW does not change the extremely-dense cell array and has when the regular row needs to be activated again. By activating low cost (0.48% additional area overhead in the DRAM chip, both the regular row and its corresponding copy row, CROW-cache 11.3 KiB storage overhead in the memory controller, and 1.6% reduces the time needed to open the row and begin performing DRAM storage capacity overhead). read and/or write requests to the row by 38%. • We propose three mechanisms that exploit the CROW substrate: Reducing DRAM Refresh Overheads with CROW-ref. Prior 1) CROW-cache, an in-DRAM cache to reduce the DRAM access works [24, 41, 64, 65, 87, 88] show that only a small fraction of latency, 2) CROW-ref, a remapping scheme for weak DRAM rows DRAM cells in a DRAM chip (e.g., fewer than 1000 cells in 32 GiB to reduce the DRAM refresh overhead, and 3) a mechanism for of DRAM [64]), called weak cells, have to be refreshed at the mitigating the RowHammer vulnerability. We show that CROW standard-specified refresh interval (32 ms in LPDDR4 [36], 64 ms allows these mechanisms to be employed at the same time. in DDR3/DDR4 [35, 37]), and that an overwhelming majority of • We evaluate the performance, energy savings, and overhead of cells (called strong cells) can be refreshed much less frequently CROW-cache and CROW-ref, showing significant performance than the standard-specified rate. To take advantage of typical-case and energy benefits over a state-of-the-art commodity DRAM DRAM cell data retention behavior, we use CROW to implement chip. We also compare CROW-cache to two prior proposals to CROW-ref, which remaps weak regular DRAM rows to strong reduce DRAM latency [53, 58] and show that CROW-cache is copy rows. The key idea of CROW-ref is to avoid storing any more area- and energy-efficient at reducing DRAM latency. data in DRAM rows that contain weak cells, such that the entire DRAM chip can use a longer refresh interval. During system boot 2 BACKGROUND or runtime, CROW-ref uses an efficient profiling mechanism87 [ ] to identify weak DRAM cells, remaps the regular rows contain- We describe the low-level organization of DRAM and illustrate ing weak cells to strong copy rows, and records the remapping how DRAM is accessed to provide the necessary background for information in the CROW-table. When the memory controller understanding how CROW operates. 
We refer the reader to prior needs to activate a row, it first checks the CROW-table to see if work [6–9, 26, 32, 53, 57–61, 64, 65, 100–102, 119] for a more detailed the row is remapped, and, if so, activates the corresponding copy treatment of various aspects of DRAM operation. row instead of the regular row. Because only a very small fraction of DRAM cells are weak, remapping only a few rows can signifi- 2.1 DRAM Organization cantly reduce the refresh overhead, as the weakest DRAM cell in A DRAM device is organized hierarchically. The smallest building use determines the minimum required refresh rate of the entire block, i.e., the DRAM cell, consists of a capacitor and an access transistor, as shown in Figure 1a. The capacitor encodes a single bit

SA SA SA global row buffer ready-to-access local row buffer 퐕퐝퐝 voltage level (c) DRAM bank ퟐ Bitline Voltage (a) DRAM cell (b) DRAM subarray Charge 2 Voltage Restoration Figure 1: DRAM organization. Sharing Time Vss of data by storing different charge levels, i.e., empty or full. During Figure 2: Commands, timing parameters, and cell/bitline an access to a cell, the wordline enables the access transistor, which voltages during a DRAM read operation. connects the cell capacitor to the bitline, so that the cell’s state can be read or modified. Activate (ACT). The ACT command activates (opens) a DRAM A subarray consists of a 2D array of DRAM cells, as shown in row by transferring the data contained in the cell capacitors to the Figure 1b. The cells that share the same wordline in a subarray are row buffer. ACT latency is governed by the tRCD timing parameter, referred to as a DRAM row. Each subarray is typically composed which ensures that enough time has passed since the ACT is issued of 512–1024 rows. To prepare a row for an access, the local row for the data to stabilize in the row buffer (such that it can be read). decoder selects the row by enabling the corresponding wordline in ACT consists of two major steps: 1) capacitor-bitline charge shar- the subarray. In conventional DRAM, only one row in a subarray ing and 2) charge restoration. Charge sharing begins by enabling the can be enabled at a time, and, thus, only one cell is enabled on each wordline (❶ in Figure 2), which allows the cell capacitor to share bitline. When a cell is enabled, charge sharing takes place between charge with the bitline, and thus perturb the precharged bitline the cell and the bitline (which is initially precharged to a reference voltage. Once the cell and bitline voltages equalize due to charge voltage). This charge sharing shifts the bitline voltage up or down sharing, charge restoration starts (❷). During charge restoration, based on the voltage of the cell. The bitline’s sense amplifier detects the sense amplifiers are enabled to first detect the bitline voltage the bitline voltage shift and amplifies it to a full 0 or 1 (i.e., Vss or shift, and later restore the bitline to a full Vss or Vdd depending on Vdd ). All sense amplifiers that detect and amplify the charge levels the direction of the shift. Once the bitline is restored to a ready- of the cells of the open row inside a subarray are referred to as the to-access voltage level (❸), restoration is complete, and the other subarray’s local row buffer. DRAM commands (e.g., RD, WR) can be issued to the bank. As illustrated in Figure 1c, multiple subarrays are organized into Read (RD). After a row activation, the memory controller reads a group to form a bank. To select a particular row, the memory data from the open row by issuing an RD command. The RD includes controller sends the row address to the global row decoder, which a column address, which indicates the portion of the open row partially decodes the address to select the local row decoder of the to be read. When a DRAM chip receives an RD command, it first subarray that contains the row. Each bank includes a global row loads the requested portion of the open row into the global row buffer. On a read operation, the requested portion of the target row buffer. After the data is in the global row buffer, the DRAMchip is copied from the local row buffer of the subarray that contains the sends the data across the data bus to the memory controller. The target row into the global row buffer. 
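To make the command/timing relationship above concrete, the short sketch below models a memory controller that issues an ACT and then an RD, checking that tRCD has elapsed before the read and adding tCL before the data appears on the bus. This is purely an illustrative model: the Bank class and the 14 ns tCL value are our own placeholders, not part of the paper.

```python
# Minimal model of the ACT -> RD -> data timing described above.
# tRCD follows the 18 ns baseline used later in the paper; tCL is a made-up placeholder.
TIMINGS_NS = {"tRCD": 18.0, "tCL": 14.0}

class Bank:
    def __init__(self):
        self.open_row = None
        self.act_time = None

    def activate(self, row, now):
        """ACT: open `row`; its data is readable only after tRCD has elapsed."""
        self.open_row, self.act_time = row, now

    def read(self, row, now):
        """RD: legal only if `row` is open and tRCD has elapsed since the ACT.
        Returns the time at which the requested data appears on the data bus."""
        assert self.open_row == row, "row buffer conflict: PRE + ACT needed first"
        assert now - self.act_time >= TIMINGS_NS["tRCD"], "tRCD not yet satisfied"
        return now + TIMINGS_NS["tCL"]

bank = Bank()
bank.activate(row=42, now=0.0)
print("data ready at t =", bank.read(row=42, now=18.0), "ns")  # -> 32.0 ns
```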
Once this is done, the global RD command is governed by the timing parameter tCL, after which row buffer sends its data to the memory controller. the data appears on the data bus. A single DRAM chip contains multiple banks that operate in Write (WR). The WR command (not shown in Figure 2) modifies parallel. To provide high bandwidth, multiple chips are grouped data in an open DRAM row. The operation of WR is analogous to ACT together into a rank. The chips in a rank share the same ad- in that both commands require waiting enough time for the sense dress/command bus, which causes the chips to operate in lockstep amplifiers to restore the data in the DRAM cells. Similar tohowa (i.e., the chips receive the same commands and perform the same sense amplifier restores a cell capacitor during the second stepof operations on different portions), but each chip has its own data ACT, in case of a WR, the sense amplifier restores the capacitor with bus connection. To increase the system’s total memory capacity, the new data value that the WR command provides. The restoration multiple ranks are connected to the same DRAM channel, operating latency for WR is governed by the tWR timing parameter. For both ACT in a time-multiplexed manner by sharing the channel bus (i.e., and WR commands, the restoration latency originates from the sense address/command/data pins). In a typical system, each memory amplifier driving a bitline to replenish the charge of theDRAM controller interfaces with a single DRAM channel and sends ad- cell capacitor [40, 53, 57, 120]. This makes any optimizations to the dresses/data/commands over the channel bus to manipulate the charge restoration step of ACT equally applicable to WR. Thus, using data stored within all ranks that are part of the channel. such optimizations we can decrease both tRAS and tWR. Precharge (PRE). PRE is used to close an open DRAM row and 2.2 DRAM Operation prepare the DRAM bank for activation of another row. The memory There are four major commands that are used to access DRAM: ACT, controller can follow an ACT with PRE to the same bank after at WR, RD, and PRE. DRAM command scheduling is tightly regulated least the time interval specified by tRAS. tRAS ensures that enough by a set of timing parameters, which guarantee that enough time time has passed to fully restore the DRAM cells of the activated has passed after a certain command such that DRAM provides or row to a ready-to-precharge voltage (❹ in Figure 2). The latency of retains data correctly. Figure 2 illustrates the relationship between PRE is governed by the timing parameter tRP, which allows enough the commands issued to perform a DRAM read and their governing time to set the bitline voltage back to reference level (e.g., Vdd /2). timing parameters. The memory controller enforces these timing After tRP, the memory controller can issue an ACT to open a new parameters as it schedules each DRAM command. Aside from the row in the same bank. DRAM access commands, the memory controller also periodically Refresh (REF). A DRAM cell cannot store its data permanently, issues a REF command to prevent data loss due to leakage of charge as the cell capacitor leaks charge over time. The retention time of a from the cell capacitors over time. DRAM cell is defined as the length of time for which the datacan

3 still be correctly read out of the cell after data is stored in the cell. To duplicate copy row. If so, the memory controller issues a custom ensure data integrity, the memory controller periodically issues REF reduced-latency DRAM command to simultaneously activate both commands to the DRAM chip. Each REF command replenishes the the regular row and the copy row, instead of a single activate to charge of several DRAM rows starting from the one that a refresh the regular row. Doing so enables faster access to the duplicated counter (implemented in the DRAM chip) points to. The memory data stored in both rows. Next, we explain CROW in detail. controller is responsible for issuing a sufficient number of REF com- mands to refresh the entire DRAM during a manufacturer-specified 3.2 Copy Rows refresh interval (typically 64 ms in DDR3 and DDR4 DRAM, and CROW logically categorizes the rows in a subarray into two sets, 32 ms in LPDDR4). as shown in Figure 3: (1) regular rows, which operate the same as in conventional DRAM; and (2) copy rows, which can be acti- 3 COPY-ROW DRAM vated independently of the regular rows. In our design, we add To efficiently solve the challenges of 1) high access latency, 2)high a small number of copy rows to the subarray, and add a second refresh overhead, and 3) increasing vulnerability to failure mecha- local row decoder (called the CROW decoder) for the copy rows.1 nisms in DRAM, without requiring any changes to the DRAM cell Because the CROW decoder drives a much smaller number of rows array, we introduce Copy-Row DRAM (CROW). CROW is a new, than the existing local row decoder, it has a much smaller area practical substrate that exploits the existing DRAM architecture cost (Section 6.2). The memory controller provides the regular row to efficiently duplicate a small number of rows in a subarray at and copy row addresses to the respective decoders when issuing runtime. CROW is versatile, and enables new mechanisms for im- an ACT command. proving DRAM performance, energy efficiency, and reliability, as regular row we show in Section 4. address regular

3.1 CROW: A High-Level Overview

CROW consists of two key components: 1) a small number of copy rows in each subarray, which can be used to duplicate or remap
SA SA the data stored in one of the remaining rows (called regular rows) SA of the subarray, 2) a table in the memory controller, called the Figure 3: Regular and copy rows in a subarray in CROW. CROW-table, that tracks which rows are duplicated or remapped. CROW divides a DRAM subarray into two types of rows: regular 3.3 CROW-table rows and copy rows. A copy row is similar to a regular row in the sense that it has the same row width and DRAM cell structure. As shown in Figure 4, the CROW-table in the memory controller However, copy rows have their own small local row decoder within stores information on 1) whether a copy row is allocated, and 2) the subarray, separate from the existing local row decoder for regu- which regular row is duplicated or remapped to a copy row. For ex- lar rows. This enables the memory controller to activate copy rows ample, in our weak row remapping scheme (Section 4.2), a CROW- independently from regular rows (which continue to use the exist- table entry holds the address of the regular row that the copy row ing local row decoder). By allowing copy rows and regular rows to replaces. The CROW-table stores an entry for each copy row in a be activated independently, we enable two DRAM primitives that DRAM channel, and is n-way set associative, where n is the number make use of multiple-row activation (MRA) in CROW. of copy rows in each subarray. First, CROW can perform bulk data movement, where the DRAM copy row 0 copy row 1 copy row N-1 copies an entire row of data at once from a regular row to a copy row. To do so, the memory controller first activates the regular row, and next activates the copy row, immediately after the local ... row buffer latches the data of the regular row. After the second activation, the sense amplifiers restore the data initially stored only {Bank ID, Subarray ID} Allocated RegularRowID in the regular row to both the regular row and the copy row (similar Special to the RowClone [100] mechanism). Second, CROW can perform reduced-latency DRAM access. When Figure 4: Organization of the CROW-table. a regular row and a copy row contain the same data, the memory controller activates both rows simultaneously. This causes two cells The memory controller indexes the CROW-table using a com- on the same bitline to inject charge into the bitline at a faster total bination of bank and subarray addresses, which are part of the rate than a single cell can. Thereby, the sense amplifier to operate physical address of a memory request. The memory controller faster, which leads to lower activation latency via an effect similar checks the table for an entry associated with the regular row that the memory request is to access. Our current design stores three to increasing the amount of charge stored in a single cell [10]. 2 CROW makes use of a table in the memory controller, CROW- fields in each CROW-table entry: 1) the Allocated field indicates table, that tracks which regular rows are duplicated or mapped whether the entry is valid or not; 2) the RegularRowID field stores to copy rows. A mechanism that takes advantage of the CROW a pointer to the regular row that the corresponding copy row is substrate checks the CROW-table and uses the information in the associated with 3) the Special field stores additional information table to either 1) simultaneously activate duplicate regular and copy specific to the mechanism that is implemented using CROW. rows or 2) activate the copy row that a regular row is remapped to. 
For example, the CROW-cache mechanism (Section 4.1) updates 1 a CROW-table entry with the address of the regular row that has Alternatively, CROW can use a small set of the existing rows in a conventional subarray as copy rows, but this requires changing the existing local row decoder, and been copied to the corresponding copy row. Prior to issuing an ACT may not keep the number of addressable rows in the subarray as a power of two. command to activate a target regular row, the memory controller 2The entry can be expanded to contain more information if needed by other mecha- queries the CROW-table to check whether the regular row has a nisms that make use of CROW.
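To illustrate the organization described in this section, the sketch below implements a CROW-table as an n-way set-associative structure indexed by {Bank ID, Subarray ID}, with one entry (Allocated, RegularRowID, Special) per copy row. The class and method names are ours, chosen for illustration; the paper specifies the table contents but not a concrete implementation.

```python
from dataclasses import dataclass

@dataclass
class CROWTableEntry:
    allocated: bool = False   # is this copy row currently in use?
    regular_row_id: int = 0   # index of the associated regular row within the subarray
    special: int = 0          # mechanism-specific information, e.g., isFullyRestored

class CROWTable:
    """n-way set-associative table: one set per (bank, subarray),
    one way per copy row in that subarray."""
    def __init__(self, num_banks, subarrays_per_bank, copy_rows_per_subarray):
        self.ways = copy_rows_per_subarray
        self.sets = {
            (b, s): [CROWTableEntry() for _ in range(copy_rows_per_subarray)]
            for b in range(num_banks) for s in range(subarrays_per_bank)
        }

    def lookup(self, bank, subarray, regular_row_id):
        """Return the copy-row index (way) holding `regular_row_id`, or None on a miss."""
        for way, entry in enumerate(self.sets[(bank, subarray)]):
            if entry.allocated and entry.regular_row_id == regular_row_id:
                return way
        return None

    def allocate(self, bank, subarray, way, regular_row_id, special=0):
        self.sets[(bank, subarray)][way] = CROWTableEntry(True, regular_row_id, special)

# Example: the CROW-8 configuration evaluated in the paper (8 copy rows per subarray).
table = CROWTable(num_banks=8, subarrays_per_bank=128, copy_rows_per_subarray=8)
table.allocate(bank=0, subarray=3, way=0, regular_row_id=17)
assert table.lookup(0, 3, 17) == 0 and table.lookup(0, 3, 18) is None
```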

4 4 APPLICATIONS OF CROW that the latency of duplicating the row is amortized by the reduced CROW is a versatile substrate that enables multiple mechanisms activation latency of future accesses to the row. for improving DRAM performance, energy efficiency, and reliabil- ity. We propose three such new mechanisms based on CROW: 1) 4.1.2 Reducing Activation Latency. CROW-cache intro- CROW-cache, a mechanism that reduces DRAM access latency by duces a second new DRAM command, Activate-two (ACT-t), to exploiting CROW’s ability to simultaneously activate a regular row simultaneously activate a regular row and its duplicate copy row. and a copy row that store the same data; 2) CROW-ref, a mechanism If the CROW-table (see Section 3.3) contains an entry indicating that reduces DRAM refresh overhead by exploiting CROW’s ability that a copy row is a duplicate of the regular row to be activated, to remap weak rows with low retention times to strong copy rows. the memory controller issues ACT-t to perform MRA on both rows, This mechanism relies on retention time profiling [65, 87, 88] to achieving low-latency activation. In Section 5, we perform detailed determine weak and strong rows in DRAM; 3) a mechanism that circuit-level SPICE simulations to analyze the activation latency protects against the RowHammer [52] vulnerability by identify- reduction with ACT-t. ing rows that are vulnerable to RowHammer-induced errors and Note that the modifications needed for ACT-c and ACT-t in remapping the vulnerable rows to copy rows. the row decoder logic are nearly identical, as they both perform multiple-row activation. The difference is that rather than acti- vating the copy rowafter the regular row as with ACT-c, ACT-t 4.1 In-DRAM Caching (CROW-cache) activates both of the rows concurrently. Our in-DRAM caching mechanism, CROW-cache, exploits both of our new multiple-row activation primitives in CROW (Section 3.1). 4.1.3 Reducing Restoration Latency. In addition to de- The key idea of CROW-cache is to use a copy row as a duplicate of creasing the amount of time required for charge sharing during a recently-accessed regular row within the same subarray, and to activation (see Section 2.2), the increased capacitance on the bitline activate both the regular row and the copy row. By simultaneously due to the activation of multiple cells enables the reduction of activating both rows, CROW-cache reduces activation latency and tRAS by using partial restoration [116, 121]. The main idea of the starts servicing read and write requests sooner. partial restoration technique is to terminate row activation earlier than normal, i.e., issue PRE by relaxing (i.e., reducing) tRAS, on a 4.1.1 Copying a Regular Row to a Copy Row. Given N multiple-row activation such that the cells are not fully restored. copy rows, we would like to duplicate the N regular rows that While this degrades the retention time of each cell individually, will be most frequently activated in the near future to maximize the partially restored cells contain just enough charge to maintain the benefits of CROW-cache. However, determining the N most data integrity until the next refresh operation [93] (we verify this frequently-activated rows is a difficult problem, as applications using SPICE simulations in Section 5). We combine partial cell often access main memory with irregular access patterns. 
Therefore, restoration (i.e., reduction in tRAS) with the reduction in tRCD in we follow a similar approach used by processor caches, where the order to provide an even greater speedup than possible by simply most-recently-used rows are cached instead of the most frequently- decreasing tRCD alone (see Section 8). accessed rows. Depending on how many copy rows are available, Taking the concept one step further, we can also apply partial CROW-cache maintains copies of the N most-recently-activated restoration to write accesses, decreasing the amount of time re- regular rows in each subarray. When the memory controller needs quired for the WR command. Since the write restoration latency to activate a regular row that is not already duplicated, the memory (tWR) is analogous to cell restoration during activation (Section 2.2), controller copies the regular row to an available copy row, using we can terminate the write process early enough such that the multiple-row activation. While this is a simple scheme, we achieve written cells are only partially charged. a significant hit rate on the CROW-table with a small numberof copy rows (see Section 8.1.1). 4.1.4 Evicting an Entry from the CROW-table. As de- To efficiently copy a regular row to a copy row within a subarray, scribed in Section 4.1.3, CROW-cache relaxes tRAS for the ACT-t we introduce a new DRAM command, which we call Activate-and- command to further improve the DRAM access latency. Relaxing copy (ACT-c). The ACT-c command performs a three-step row copy tRAS may terminate the restoration operation early, which puts the operation in DRAM, using a technique similar to RowClone [100]. precharged regular and copy rows into a partially-restored state. First, the memory controller issues ACT-c to DRAM and sends the Note that there are two cases where relaxing tRAS does not addresses of 1) the source regular row and 2) the destination copy necessarily lead to partial restoration. First, the memory controller row, over multiple command bus cycles. Second, upon receiving an can exploit an open regular row + copy row pair to serve multiple ACT-c command, DRAM enables the wordline of only the regular memory requests, if available in the request queue, from different row. The process of reading and latching the row’s data into the portions of the row. The time needed to serve these requests can sense amplifiers completes as usual (see Section 2.2). Third, assoon delay precharging the associated regular and copy rows until they as the data is latched in the sense amplifiers, DRAM enables the are fully restored. Second, if there are no other memory requests wordline of the copy row. This causes the sense amplifiers to restore to different rows in the same bank, the currently-opened regular charge to both the regular row and the copy row. On completion row and copy row can be left open, providing enough time for full of the restoration phase, both the regular row and the copy row charge restoration. contain the same data. CROW-cache maintains the restoration state of the paired regu- The ACT-c operation slightly increases the restoration time com- lar and copy rows by utilizing a single bit of the Special field of the pared to regular ACT. This is because, after the wordline of the copy CROW-table, which in the context of CROW-cache we refer to as row is enabled, the capacitance that the sense amplifier drives the the isFullyRestored field. 
CROW-cache sets isFullyRestored bitline against increases, since two cell capacitors are connected to false only if 1) ACT-t was used to activate the currently open to the same bitline. In our circuit-level SPICE model, for the ACT-c row, and 2) the memory controller issues a PRE command to close command, we find that the activate-to-precharge latencytRAS ( ) the rows before the default tRAS is satisfied. In contrast, CROW- increases by 18%. However, this activation latency overhead has a cache sets isFullyRestored to true if the memory controller is- very limited impact on overall system performance because CROW- sues the next PRE after the default tRAS (i.e., a time interval suffi- table typically achieves a high hit rate (Section 8.1.1), which means cient for fully restoring the open regular row and copy row). Note

5 that existing memory controllers already maintain timing infor- extended (due to the existence of at least one weak cell in the row). mation for conventional command scheduling, e.g., when the last CROW-ref consists of three key components. First, CROW-ref uses ACT to each bank was issued. By taking advantage of the existing retention time profiling at system boot or during runtime to identify timing information, CROW-cache does not incur significant area the weak rows. Second, CROW-ref utilizes the strong copy rows overhead for managing the isFullyRestored field. provided by CROW in each subarray to store the data that would Partially restoring the duplicate rows helps to significantly re- have originally been stored in weak regular rows. Third, CROW-ref duce restoration latency (see Section 5). However, partial restoration uses the CROW-table to maintain the remapping information, i.e., introduces a new challenge to the design of CROW-cache, because which strong copy row replaces which weak regular row. a partially-restored row can only be correctly accessed using ACT-t, which activates a regular row along with its duplicate copy row. 4.2.1 Identifying Weak Rows. To identify the weak rows in Duplicating a new regular row (RR ) to a copy row that is al- each subarray, we rely on retention time profiling methodologies new proposed by prior work [41–43, 52, 65, 87, 88, 92, 118]. A retention ready a duplicate of another regular row (RRold ), i.e., evicting RRold from the CROW-table, causes RR to be activated as a single row time profiler tests the DRAM device with various data patterns and old a wide range of operating temperatures to cover all DRAM cells during a future access. If RRold is partially restored prior to evic- tion from the CROW-table, performing a single-row activation on that fail at a chosen target retention time. Prior works find that RR in the future may lead to data corruption. Thus, the memory very few DRAM cells fail when the refresh interval is extended by old ∼ controller needs to guarantee that a partially-restored regular row 2x-4x. For example, Liu et al. [65] show that only 1000 cells in a is never evicted from the CROW-table. 32 GiB DRAM module fail when the refresh interval is extended To prevent data corruption due to the eviction of a partially- to 256 ms. Assuming the experimentally-demonstrated uniform random distribution of these weak cells in a DRAM chip [2, 64, 65, restored row from the CROW-table, if the isFullyRestored field is −9 false, the memory controller first issues an ACT-t and fully restores 87, 88], we calculate that the bit error rate (BER) is 4 · 10 when operating at a 256 ms refresh interval. RRold by conforming to the default tRAS. Since this operation sets Based on this observation, we can determine how effective isFullyRestored to true, RRold can now safely be evicted from the CROW-table to be replaced with RRnew . The disadvantage of this CROW-ref is likely to be for a given number of copy rows. First, approach is the overhead of issuing an additional ACT-t followed using the BER, we can calculate Pweak_row, the probability that a by a PRE to perform full restoration. However, in Section 8.1.1, we row contains at least one weak cell, as follows: Ncells_per_row show that this overhead has a negligible performance impact since Pweak_row = 1 − (1 − BER) (1) CROW-table has a high hit rate. where Ncells_per_row is the number of DRAM cells in a row. Second, 4.1.5 Implementing the New DRAM Commands. 
To im- we can use Prow to determine Psubarray−n, which is the probability plement CROW-cache, we introduce the ACT-c and ACT-t DRAM that a subarray with Nrows rows contains more than n weak rows: commands, both of which activate a regular row and a copy row. n ! The wordline of a regular row is enabled in the same way as in X Nrows − P = 1 − Pk (1 − P )Nrows k (2) conventional DRAM. To enable the wordline of a copy row, we subarray−n k row row extend the existing local row decoder logic such that it can drive k=0 the wordline of a copy row independently of the regular row. As Using these equations, for a DRAM chip with 8 banks, 128 sub- we show in Section 6.2, for the default configuration of the CROW arrays per bank, 512 rows per subarray, and 8 KiB per row, the substrate with eight copy rows, our modifications increase the de- probability of any subarray having more than 1/2/4/8 weak rows coder area by 4.76%, leading to 0.48% area overhead in the entire is 0.99/3.1 × 10−1/3.3 × 10−4/3.3 × 10−11. We conclude that the DRAM chip. The additional eight copy rows per subarray require probability of having a subarray with a large number of weak rows only 1.6% of the DRAM storage capacity. is extremely low, and thus CROW-ref is highly effective even when Unlike ACT, which specifies a single row address, ACT-c and the CROW substrate provides only 8 copy rows per subarray. In ACT-t need to specify the addresses of both a regular and a copy row. the unlikely case where the DRAM has a subarray with more rows ACT-c and ACT-t need only a small number of copy row address bits with weak cells than the number of available copy rows, CROW-ref in addition to the regular row address bits since the corresponding falls back to the default refresh interval, which does not provide copy row is in the same subarray as the activated regular row. For performance and energy efficiency benefits but ensures correct CROW-8, where each subarray has eight copy rows, we need only DRAM operation.4 three additional bits to encode the copy row address. We can either add three wires to the address bus (likely undesirable) or send the 4.2.2 Operation of the Remapping Scheme. In each sub- address over multiple cycles as done in current LPDDR4 chips for array, CROW-ref remaps weak regular rows to strong copy rows 5 the ACT, RD, and WR commands [36].3 in the same subarray. CROW-ref tracks the remapping using the CROW-table. When a weak regular row is remapped to a strong 4.2 Reducing Refresh Overhead (CROW-ref) copy row, CROW-ref stores the row address of the regular row into the RegularRowID field of the CROW-table entry that corresponds To reduce the DRAM refresh overhead, we take advantage of the to the copy row. When the memory controller needs to activate a CROW substrate to develop CROW-ref, a software-transparent row, it checks the CROW-table to see if any of the RegularRowID weak row remapping scheme. CROW-ref extends the refresh inter- fields contains the address of the row to be activated. Ifoneof entire val of the DRAM chip beyond the worst-case value defined in the RegularRowID fields matches, the memory controller issues DRAM standards (64 ms for DDR3 and DDR4, 32 ms for LPDDR4) an activate command to the copy row that replaces the regular by avoiding the use of the very small set of weak rows in a given row, instead of to the regular row. On a CROW-table miss, which DRAM chip, i.e., rows that would fail when the refresh interval is

3In our evaluations, we assume an additional cycle on the command/address bus to 4Alternatively, CROW-ref can be combined with a heterogeneous refresh-rate scheme, send the copy row address to DRAM. This additional cycle does not always impact the e.g., RAIDR [64] or AVATAR [88]. activation latency, as the memory controller can issue the ACT-c and ACT-t commands 5We do not use a weak copy row to replace a weak regular row, as a weak copy row one cycle earlier if the command/address bus is idle. Also, DRAM does not immediately would also not maintain data correctly for an extended refresh interval. The retention need the address of the copy row for ACT-c. time profiler also identifies the weak copy rows.
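As a concrete check of Equations (1) and (2), the script below recomputes the weak-row probabilities for the example configuration quoted above (BER = 4·10⁻⁹, 8 KiB rows, 512 rows per subarray, 8 banks × 128 subarrays). The last step, which extends the per-subarray probability to "any subarray" in the chip, reflects our reading of the text; the printed values approximately match the 0.99 / 3.1·10⁻¹ / 3.3·10⁻⁴ / 3.3·10⁻¹¹ figures reported above.

```python
from math import comb

BER = 4e-9                    # bit error rate at a 256 ms refresh interval
CELLS_PER_ROW = 8 * 1024 * 8  # 8 KiB row
ROWS_PER_SUBARRAY = 512
NUM_SUBARRAYS = 8 * 128       # 8 banks x 128 subarrays per bank

# Equation (1): probability that a row contains at least one weak cell.
p_weak_row = 1 - (1 - BER) ** CELLS_PER_ROW

# Equation (2): probability that a single subarray has more than n weak rows.
def p_subarray_more_than(n):
    p_at_most_n = sum(
        comb(ROWS_PER_SUBARRAY, k)
        * p_weak_row**k
        * (1 - p_weak_row) ** (ROWS_PER_SUBARRAY - k)
        for k in range(n + 1)
    )
    return 1 - p_at_most_n

for n in (1, 2, 4, 8):
    p_any = 1 - (1 - p_subarray_more_than(n)) ** NUM_SUBARRAYS
    print(f"P(any subarray has > {n} weak rows) ~= {p_any:.2g}")
```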

6 indicates that the row to be activated contains only strong cells, Table 1: Timing parameters for new DRAM commands. the memory controller activates only the original regular row. This DRAM Command tRCD tRAS tWR row remapping operates transparently from the software, as the activating fully-restored rows -38% -7%a (-33%b) +14%a (-13%b) ACT-t regular-to-copy-row remapping is not exposed to the software. activating partially-restored rows -21% -7%a (-25%b) +14%a (-13%b) 4.2.3 Support for Dynamic Remapping. CROW-ref can ACT-c 0% +18%a (-7%b) +14%a (-13%b) dynamically change the remapping of weak rows if new weak rows a When fully restoring the charge. b When terminating charge restoration early. are detected at runtime. This functionality is important as DRAM cells are known to be susceptible to a failure mechanism known In our simulations, we model the entire cell array of a modern as variable retention time (VRT) [42, 43, 65, 74, 87, 88, 92, 118]. DRAM chip using 22 nm DRAM technology parameters, which we As a VRT cell can nondeterministically transition between high obtain by scaling the reference 55 nm technology parameters [89] and low retention states, new VRT cells need to be continuously based on the ITRS roadmap [34, 115]. We use 22 nm PTM low-power identified by periodically profiling DRAM, as proposed byprior transistor models [1, 122] to implement the access transistors and work [41, 87, 88]. To remap a newly-identified weak regular row the sense amplifiers. In our SPICE simulations, werun 104 iterations at runtime, the memory controller simply allocates an unused of Monte-Carlo simulations with a 5% margin on every parameter strong copy row from the subarray, and issues an ACT-c to copy of each circuit component, to account for manufacturing process the regular row’s data into the copy row. Once this is done, the variation. Across all iterations, we observe that the ACT-t and ACT-c memory controller accesses the strong copy row instead of the commands operate correctly. We report the latency of the ACT-t and weak regular row, as we explain in Section 4.2.2. ACT-c commands based on the Monte-Carlo simulation iteration that has the highest access latency for each of these commands. We 4.3 Mitigating RowHammer release our SPICE model as an open-source tool [96]. We propose a third mechanism enabled by the CROW substrate that protects against the RowHammer vulnerability [52, 79, 80]. Due to 5.1 Simultaneous Row Activation Latency the small DRAM cell size and short distance between DRAM cells, Simultaneously activating multiple rows that store the same data electromagnetic coupling effects that result from rapidly activating accelerates the charge-sharing process, as the increased amount of and precharging (i.e., hammering) a DRAM row cause bit flips on charge driven on each bitline perturbs the bitline faster compared the physically-adjacent (i.e., victim) rows [52, 79, 80]. This effect is to single-row activation. Figure 5a plots the reduction in activation known as RowHammer, and has been demonstrated on real DRAM latency (tRCD) for a varying number of simultaneously-activated chips [52]. Aside from the decreased reliability of DRAM due to rows. As seen in the figure, we observe a tRCD reduction of 38% RowHammer, prior works (e.g., [20, 21, 63, 90, 99, 110, 113, 117]) when simultaneously activating two rows. tRCD reduces further exploit RowHammer to perform attacks that gain privileged access when we increase the number of simultaneously-activated rows, to the system. 
Therefore, it is important to mitigate the effects of but the latency reduction per additional activated row becomes RowHammer in commodity DRAM. smaller. We empirically find that simultaneously activating only We propose a CROW-based mechanism that mitigates RowHam- two rows rather than more rows achieves a large portion of the mer by remapping the victim rows adjacent to the hammered row maximum possible tRCD reduction potential (i.e., as we approach to copy rows. By doing so, the mechanism prevents the attacker an infinite number of rows being activated simultaneously) with from flipping bits on the data originally allocated in the victim rows. low area and power overhead (see Section 6.2). Several prior works [16, 45, 62, 103] propose techniques for detect- 2.2 tRAS ing access patterns that rapidly activate and precharge a specific 1.0 2.0 row. Typically, these works implement a counter-based structure 0.8 1.8 Restoration tWR that stores the number of ACT commands issued to each row. Our 0.6 1.6 1.4 0.4 RowHammer mitigation mechanism can use a similar technique to 1.2 0.2 1.0 detect a RowHammer attack. When the memory controller detects a Latency Norm. 0.0 0.8

DRAM row that is being hammered, it issues two ACT-c commands tRCD Normalized 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 to copy the victim rows that are adjacent to the hammered row to Simultaneously Activated Rows Simultaneously Activated Rows two of the available copy rows in the same subarray. Similar to the CROW-ref mechanism (Section 4.2), our RowHammer mitigation (a) tRCD (18 ns) (b) tRAS (42 ns), tWR (18 ns), and mechanism tracks a remapped victim row using the RegularRowID restoration (24 ns) field in each CROW-table entry, and looks up the CROW-table Figure 5: Change in various DRAM latencies with differ- every time a row is activated to determine if a victim row has been ent number of simultaneously-activated rows, normalized remapped to a copy row. to baseline DRAM timing parameters (absolute baseline val- CROW enables a simple yet effective mechanism for mitigating ues in parentheses). the RowHammer vulnerability that can reuse much of the logic of CROW-ref. We leave the evaluation of our RowHammer mitigation mechanism to future work. Change in Restoration Latency. Although two-row activa- tion reduces tRCD, fully restoring the capacitors of two cells takes 5 CIRCUIT SIMULATIONS more time compared to fully restoring a single cell. Therefore, We perform detailed circuit-level SPICE simulations to find the two-row activation can potentially increase tRAS and tWR, which latency of 1) simultaneously activating two DRAM rows that store could offset the benefits of the tRCD reduction. Figure 5b shows the the same data (ACT-t) and 2) copying an entire regular row to a copy change in tRAS, restoration time, and tWR for a varying number of row inside a subarray (ACT-c). Table 1 shows a summary of change simultaneously-activated rows. Although restoration time always in the tRCD, tRAS, and tWR timing parameters that we use for the increases with the number of rows, we see a slight decrease in tRAS two new commands, based on the SPICE simulations. We discuss when the number of rows is small. This is because the reduction in in detail how the timings are derived for ACT-t in Section 5.1, and tRCD, which is part of tRAS (as we explain in Section 2.2), is larger for ACT-c in Section 5.2. than the increase in restoration time. However, for five or more

7 simultaneously-activated rows, the overhead in restoration time ex- correctly read the data stored in the regular row. Using our SPICE ceeds the benefits of tRCD reduction, and, as a result, tRAS increases. model, we find that ACT-c does not affect tRCD because the copy row tWRalways increases with the number of simultaneously-activated is activated after satisfying tRCD. This is because, to start restoring rows, because writing to a DRAM cell is similar to restoring a cell the data of the regular row to both the regular row and the copy (Section 2.2). row, the local row buffer first needs to correctly latch the data ofthe Terminating Restoration Early. We use CROW to enable in- regular row. In contrast to tRCD, the ACT-c command increases tRAS DRAM caching by duplicating recently-accessed rows and using by 18% (reduces tRAS by 7% when terminating restoration early), two-row activation to reduce tRCD significantly for future activation as restoring two DRAM rows requires more time than restoring a of such rows. However, to make the in-DRAM caching mechanism single row. We model this additional time in all of our evaluations. more effective, we aim to reduce tRAS further, as it is a critical timing parameter that affects how quickly we can switch between rows in the event of a row buffer conflict. We make three observa- 6 HARDWARE OVERHEAD tions based on two-row activation, which leads us to a trade-off Implementing the CROW substrate requires only a small number between reducing tRCD and reducing tRAS. First, when two DRAM of changes in the memory controller and the DRAM die. rows are used to store the same data, the data is correctly retained for a longer period of time compared to when the data is stored in a single row. This is due to the increased aggregate capacitance 6.1 Memory Controller Overhead that storing each bit of data using two cells provides. Second, since CROW introduces the CROW-table (Section 3.3), which incurs data is correctly retained for a longer period of time when stored modest storage overhead in the memory controller. The storage in two rows, we can terminate the restoration operation early to requirement for each entry (Storaдeentry) in CROW-table can be reduce tRAS and still achieve the target retention time, e.g., 64 ms. calculated in terms of bits using the following equation: Third, as the amount of charge stored in a cell capacitor decreases, Storaдe = ⌈loд (RR)⌉ + Bits + Bits (3) the activation latency increases [26, 57]. Therefore, terminating entry 2 Special Allocated restoration early reduces tRAS at the expense of a slight increase where RR is the number of regular rows in each subarray (we take in tRCD (due to less charge). the log of RR to represent the number of bits needed to store the We explore the trade-off space between reducing tRCD and reduc- RegularRowID field), BitsSpecial is the size of the Special field in ing tRAS for a varying number of simultaneously-activated rows bits, and BitsAllocated is set to 1 to indicate the single bit used to track using our SPICE model. In Figure 6, we show the different tRCD and whether the entry is currently valid. Note that the RegularRowID tRAS latencies that can be used with multiple-row activation (MRA) field does not have to store the entire row address. Instead, itis while still ensuring data correctness. For two-row activation, we sufficient to store a pointer to the position of the row in the subarray. 
empirically find that a 21% reduction in tRCD and a 33% reduction in For an example subarray with 512 regular rows, an index range of tRAS provides the best performance on average for the workloads 0–511 is sufficient to address all regular rows in same subarray. We that we evaluate (Section 8.1.1). assume one bit for the Special field, which we need to distinguish 1.1 between the CROW-cache and CROW-ref mechanisms (Section 4). 2 rows 1.0 We calculate Storaдe , the storage requirement for the -21% tRCD 3 rows CROW-table 0.9 -33% tRAS entire CROW-table, as: 0.8 4 rows 0.7 StoraдeCROW-table = Storaдeentry ∗ CR ∗ SA (4) 0.6 where CR is the number of copy rows in each subarray, and SA 0.5 is the total number of subarrays in the DRAM. We calculate the Normalized tRCD 0.4 0.3 storage requirement of the CROW-table for a single-channel mem- 0.4 0.5 0.6 0.7 0.8 0.9 1.0 ory system with 512 regular rows per subarray, 1024 subarrays (8 Normalized tRAS banks; 128 subarrays per bank), and 8 copy rows per subarray to Figure 6: Normalized tRCD latency as a function of normal- be 11.3 KiB. Thus, the processor die overhead is small. ized tRAS latency for different number of simultaneously ac- The CROW-table storage overhead is proportional to the mem- tivated DRAM rows. ory size. Although the overhead is small, we can further optimize the CROW-table implementation to reduce its storage overhead for We also eliminate the tWR overhead of MRA by terminating the large memories. One such optimization shares one set of CROW- restoration step early when writing to simultaneously-activated table entries across multiple subarrays. While this limits the number tWR tRCD DRAM rows. Since reducing increases the latency for of copy rows that can be utilized simultaneously, CROW can still the next activation of the row, there exists a trade-off between capture the majority of the benefits that would be provided ifthe reducing tRCD and reducing tWR, similar to the trade-off between CROW substrate had a separate CROW-table for each subarray. reducing tRCD and reducing tRAS. In our evaluations, we find that From our evaluations, when sharing each CROW-table entry be- to achieve a 21% reduction in tRCD, we can reduce tWR by 13% (see Table 1) by terminating the restoration step early during a write to tween 4 subarrays (i.e., approximately a factor of 4 reduction in simultaneously-activated DRAM rows. CROW-table storage requirements), we observe that the average speedup that CROW-cache provides for single-core workloads re- 5.2 MRA-based Row Copy Latency duces from 7.1% to only 6.1%. As we explain in Section 4.1.1, CROW-cache uses the new ACT-c We evaluate the access time of a CROW-table with the configu- command to efficiently copy a regular row to a copy row inthe ration given in Table 2 using CACTI [77], and find that the access same subarray by activating the copy row slightly after activating time is only 0.14 ns. We do not expect the table access time to have the regular row. This gives the sense amplifiers sufficient time to any impact on the overall cycle time of the memory controller.

6.2 DRAM Die Area Overhead
To activate multiple rows in the same subarray, we modify the row decoder to drive multiple wordlines at the same time. These modifications incur a small area overhead and cause MRA to consume more power compared to a single-row activation.

In Figure 7 (left), we show the activation power overhead for simultaneously activating up to nine DRAM rows in the same subarray. The ACT-c and ACT-t commands simultaneously activate two rows, and consume only 5.8% additional power compared to a conventional ACT command that activates only a single row. The slight increase in power consumption is mainly due to the small copy row decoder that CROW introduces to drive the copy rows.

Figure 7: Power consumption and area overhead of MRA. (a) Activation power overhead vs. the number of simultaneously-activated DRAM rows. (b) Copy row decoder and DRAM chip area overhead vs. the number of copy rows in a subarray (1–128).

Figure 7 (right) plots the area overhead of a copy row decoder that enables one of the copy rows in the subarray independently from the existing regular row decoder. The figure shows the copy row decoder area overhead as we vary the number of copy rows in a subarray. For CROW-8, which has eight copy rows per subarray, the decoder area increases by 4.8%, which corresponds to only a 0.48% area overhead for the entire DRAM chip. The area overhead of CROW-8 is small because the decoder required for eight copy rows occupies as little as 9.6 µm², while our evaluations show that the local row decoder for 512 regular rows occupies 200.9 µm².

7 METHODOLOGY
We use Ramulator [55, 97], a cycle-accurate DRAM simulator, to evaluate the performance of the two mechanisms that we propose based on the CROW substrate. We run Ramulator in CPU-trace-driven mode and collect application traces using a custom Pintool [68]. The traces include the virtual addresses that the application accesses during the trace collection run. In Ramulator, we perform virtual-to-physical address translation by randomly allocating a 4 KiB physical frame for each access to a new virtual page, which emulates the behavior of a steady-state system [85].

Table 2 provides the system configuration we evaluate. Unless stated otherwise, we implement CROW with eight copy rows per subarray. We analyze the performance of CROW-cache and CROW-ref on single- and multi-core system configurations using typical LPDDR4 timing parameters, as shown in the table. Although our evaluation is based on LPDDR4 DRAM, which is predominantly used in various low-power systems [17], the mechanisms that we propose can also reduce access latency in other DRAM-based memories, including 3D-stacked DRAM [59].

Table 2: Simulated system configuration.
  Processor: 1–4 cores, 4 GHz clock frequency, 4-wide issue, 8 MSHRs per core, 128-entry instruction window
  Last-Level Cache: 64 B cache line, 8-way associative, 8 MiB capacity
  Memory Controller: 64-entry read/write request queue, FR-FCFS-Cap scheduling policy [81], timeout-based row buffer policy
  DRAM: LPDDR4 [36], 1600 MHz bus frequency, 4 channels, 1 rank, 8 banks/rank, 64K rows/bank, 512 rows/subarray, 8 KiB row buffer size, tRCD/tRAS/tWR = 29 (18) / 67 (42) / 29 (18) cycles (ns)
  Note: We use a variation of the FR-FCFS [95, 124] policy, called FR-FCFS-Cap [81], which improves fairness by enforcing an upper limit on the number of read/write requests that a row can service once activated. This policy performs better, on average, than the conventional FR-FCFS policy [95, 124], as also shown in [54, 81, 108, 109]. The timeout-based row buffer management policy closes an open row after 75 ns if there are no requests in the memory controller queues to that row.

Workloads. We evaluate 44 single-core applications from four benchmark suites: SPEC CPU2006 [106], TPC [111], STREAM [71], and MediaBench [15]. In addition, we evaluate two synthetic applications [75] (excluded from our average performance calculations): 1) random, which accesses memory locations at random and has very limited row-level locality; and 2) streaming, which has high row-level locality because it accesses contiguous locations in DRAM such that the time interval between two consecutive memory requests is long enough for the memory controller to precharge the recently-opened row.

We classify the applications into three groups based on the misses per kilo-instruction (MPKI) in the last-level cache. We obtain the MPKI of each application by analyzing SimPoint [25] traces (200M instructions) of each application's representative phases using the single-core configuration. The three groups are:
• L (low memory intensity): MPKI < 1
• M (medium memory intensity): 1 ≤ MPKI < 10
• H (high memory intensity): MPKI ≥ 10

We create eight multi-programmed workload groups for the four-core configuration, where each group consists of 20 multi-programmed workloads. Each group has a mix of workloads of different memory intensity classes. For example, LLHH indicates a group of 20 four-core multi-programmed workloads, where each workload consists of two randomly-selected single-core applications with low memory intensity (L) and two randomly-selected single-core applications with high memory intensity (H). In total, we evaluate 160 multi-programmed workloads. We simulate the multi-core workloads until each core executes at least 200 million instructions. For all configurations, we initially warm up the caches by fast-forwarding 100 million instructions.

Metrics. We measure the speedup for single-core applications using the instructions per cycle (IPC) metric. For multi-core evaluations, we use the weighted speedup metric [104], which prior work shows is a good measure of job throughput [13].

We use CACTI [77] to evaluate the DRAM area and power overhead of our two mechanisms (i.e., CROW-cache and CROW-ref) and of the prior works (i.e., TL-DRAM [58] and SALP [53]) that we compare against. We perform a detailed evaluation of the latency impact of MRA using our circuit-level SPICE model [96]. We use DRAMPower [5] to estimate the DRAM energy consumption of our workloads.
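As a minimal illustration of the multi-core metric used above, the following C++ sketch computes weighted speedup following the standard definition of [104]; the code and its identifiers are our own hypothetical example rather than part of our simulation infrastructure.

```cpp
#include <vector>

// Weighted speedup for a multi-programmed workload [104]:
// the sum over all cores of (IPC when sharing the memory system)
// divided by (IPC when the application runs alone on the same system).
double weighted_speedup(const std::vector<double>& ipc_shared,
                        const std::vector<double>& ipc_alone) {
    double ws = 0.0;
    for (size_t i = 0; i < ipc_shared.size(); ++i)
        ws += ipc_shared[i] / ipc_alone[i];
    return ws;
}
```

Here, ipc_alone[i] is measured by running application i by itself on the evaluated configuration, and ipc_shared[i] is its IPC when it runs together with the other applications of the four-core workload.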
8 EVALUATION
We evaluate the performance, energy efficiency, and area overhead of the CROW-cache and CROW-ref mechanisms.

8.1 CROW-cache
We evaluate CROW-cache with different numbers of copy rows per subarray in the CROW substrate. We denote each configuration of CROW in the form CROW-Cr, where Cr specifies the number of copy rows (e.g., we use CROW-8 to refer to CROW with eight copy rows). To show the potential of CROW-cache, we also evaluate a hypothetical configuration with a 100% CROW-table hit rate, referred to as Ideal CROW-cache.

8.1.1 Single-core Performance. Figure 8 (top) shows the speedup of CROW-cache over the baseline and the CROW-table hit rate (bottom) for single-core applications. We also list the last-level cache MPKI of each application to indicate memory intensity. We make four observations from the figure.

Figure 8: Speedup and CROW-table hit rate of different configurations of CROW-cache (CROW-1, CROW-8, CROW-64, CROW-128, CROW-256, and Ideal CROW-cache with a 100% hit rate) for single-core applications. The MPKI of each application is listed in parentheses.

First, CROW-cache provides significant speedups. On average, CROW-1, CROW-8, and CROW-256 improve performance by 5.5%, 7.1%, and 7.8%, respectively, over the baseline. In general, the CROW-cache speedup increases with the application's memory intensity. Some memory-intensive applications (e.g., libq, h264-dec) have lower speedups than less-memory-intensive applications because they exhibit high row buffer locality, which reduces the benefits of CROW-cache. Applications with low MPKI (< 3, not plotted for brevity) achieve speedups below 5% due to limited memory activity, but no application experiences a slowdown. We also notice that, for some applications (e.g., jp2-encode), using fewer copy rows slightly outperforms a configuration with more copy rows; we find the reason to be changes in the memory request service order, since the memory controller makes different command scheduling decisions for different configurations.

Second, for most applications, CROW-1, with only one copy row per subarray, achieves most of the performance of configurations with more copy rows (on average, 60% of the performance improvement of Ideal CROW-cache).

Third, the CROW-table hit rate is very high for most of the applications. This is indicative of high in-DRAM locality and translates directly into performance improvement. On average, the hit rate for CROW-1/CROW-8/CROW-256 is 68.8%/85.3%/91.1%.

Fourth, the overhead of fully restoring rows evicted from the CROW-table is negligible (not plotted). For CROW-1, which has the highest overhead from these evictions, restoring evicted rows accounts for only 0.6% of the total activation operations.

8.1.2 Multi-core Performance. Figure 9 shows the weighted speedup that CROW-cache achieves with different CROW substrate configurations for the four-core workload groups. The bars show the average speedup for the workloads in the corresponding workload group, and the vertical lines show the maximum and minimum speedup among the workloads in the group.

Figure 9: CROW-cache speedup (160 four-core workloads), shown as normalized weighted speedup for the LLLL, LLLM, LLMM, LLLH, MMMM, LLHH, LHHH, and HHHH workload groups and their geometric mean.

We observe that all CROW-cache configurations achieve a higher speedup as the memory intensity of the workloads increases. On average, CROW-8 provides 7.4% speedup for the workload group with four high-intensity workloads (i.e., HHHH), whereas it provides only 0.4% speedup for LLLL.

In contrast to single-core workloads, CROW-8 provides significantly better speedup compared to CROW-1 on four-core configurations. This is because simultaneously-running workloads are more likely to generate requests that compete for the same subarray. As a result, the CROW-table cannot achieve a high hit rate with a single copy row per subarray. In most cases, CROW-8 performs close to Ideal CROW-cache with a 100% hit rate, and requires only 1.6% of the DRAM storage capacity for in-DRAM caching.

8.1.3 DRAM Energy Consumption. We evaluate the total DRAM energy consumed during the execution of single-core and four-core workloads. Figure 10 shows the average DRAM energy consumption with CROW-cache, normalized to the baseline. Although each ACT-c and ACT-t command consumes more power than ACT, CROW-cache reduces the total DRAM energy due to the improvement in execution time. On average, CROW-cache decreases DRAM energy consumption by 8.2% and 6.9% on single- and four-core systems, respectively.

Figure 10: DRAM energy consumption with CROW-cache, normalized to the baseline, for single-core workloads and the four-core workload groups.

8.1.4 Comparison to TL-DRAM and SALP. We compare CROW-cache to two previous works that also enable in-DRAM caching in different ways: TL-DRAM [58] and SALP [53].


TL-DRAM uses isolation transistors on each bitline to split a DRAM subarray into a far and a near segment. When the isolation transistors are turned off, the DRAM rows in the near segment, which is composed of a small number of rows close to the sense amplifiers, can be accessed with low tRCD and low tRAS due to reduced parasitic capacitance and resistance on the short bitline. In contrast, accessing the far segment requires a slightly higher tRCD and tRAS than in conventional DRAM due to the additional parasitic capacitance and resistance that the isolation transistor adds to the bitline. We extend our circuit-level DRAM model to evaluate the reduction in DRAM latencies for different far and near segment sizes. We use the notation TL-DRAM-Nr, where Nr specifies the number of rows in the near segment. TL-DRAM uses the rows in the near segment as a cache by copying the most-recently-accessed rows in the far segment to the near segment. Thus, similar to our caching mechanism, TL-DRAM requires an efficient in-DRAM row copy operation. We reuse the ACT-c command that we implement for CROW-cache to perform the row copy in TL-DRAM.

SALP [53] modifies the row decoder logic to enable parallelism among subarrays. As opposed to a conventional DRAM bank where only a single row can be active at a time, SALP enables the activation of multiple local row buffers independently from each other to provide fast access to the most-recently-activated row of each subarray. We evaluate the SALP-MASA mode, which outperforms the other two modes that Kim et al. [53] propose. We evaluate SALP with a different number of subarrays per bank, which we indicate as SALP-Ns, where Ns stands for the number of subarrays in a bank. For SALP, we use both the timeout-based and the open-page (denoted as SALP-Ns-O) row buffer management policies. The open-page policy keeps a row open until a local row buffer conflict. Although this increases the performance benefit of SALP by preventing a local row buffer from being precharged before it is reused, SALP with the open-page policy consumes more energy since rows remain active for a longer time. Note that, in SALP, the in-DRAM cache capacity changes with the number of subarrays in the DRAM chip, as each subarray can cache a single row in its local row buffer. To evaluate SALP with higher cache capacity, we reduce the number of rows in each subarray by increasing the number of subarrays in a bank, thus keeping the DRAM capacity constant.

Figure 11 compares the performance, energy efficiency, and DRAM chip area overhead of different configurations of CROW-cache, TL-DRAM [58], and SALP [53] for single-core workloads. We draw four conclusions from the figure.

Figure 11: CROW-cache vs. TL-DRAM [58] & SALP [53]. (a) Normalized DRAM energy vs. speedup; (b) DRAM chip area overhead vs. speedup.

First, all SALP configurations using the open-page policy outperform CROW-cache. However, SALP increases the DRAM energy consumption significantly, as it frequently keeps multiple local row buffers active, each of which consumes significant static power (as also shown in [53]). An idle LPDDR4 chip that has only a single bank with an open row draws 10.9% more current (IDD3N) compared to the current (IDD2N) when all banks are in the closed state [73]. In SALP, the static power consumption is much higher than in conventional DRAM, since multiple local row buffers per bank can be active at the same time, whereas only one local row buffer per bank can be active in conventional DRAM.

Second, increasing the number of subarrays in a bank increases the in-DRAM cache capacity and, thus, the performance benefits of SALP. However, with the open-page policy, SALP-256 has a 58.4% DRAM energy and 28.9% area overhead over the baseline, which is much higher than the 27.3% DRAM energy and 0.6% area overhead of SALP-128. SALP-512 has even larger DRAM energy (119.3%) and area (84.5%) overheads (not plotted). In contrast, CROW-8 reduces DRAM energy (by 8.2%) at very low (0.48%) DRAM chip area overhead.

Third, TL-DRAM-8 provides a higher speedup of 13.8% compared to CROW-8, which provides 7.1% speedup (while reserving the same DRAM storage capacity for caching as TL-DRAM-8). This is because the latency reduction benefit of a very small TL-DRAM near segment, which comprises only one or eight rows, is higher than the latency reduction benefit of CROW's two-row activation. According to our circuit-level simulations, a TL-DRAM near segment with eight rows can be accessed with a 73% reduction in tRCD and an 80% reduction in tRAS. However, this comes at the cost of high DRAM chip area overhead, as we explain next.

Fourth, in TL-DRAM, the addition of an isolation transistor to each bitline incurs high area overhead. As seen in Figure 11b, TL-DRAM-8 incurs 6.9% DRAM chip area overhead, whereas CROW-8 incurs only 0.48% DRAM chip area overhead.

We conclude that CROW enables a more practical and lower-cost in-DRAM caching mechanism than TL-DRAM and SALP.

8.1.5 CROW-cache and Prefetcher. We evaluate the performance benefits of CROW-cache on a system with a stride prefetcher, which we implement based on the RPT prefetcher [31]. In Figure 12, we show the speedup that the prefetcher, CROW-cache, and the combination of the prefetcher and CROW-cache achieve over the baseline, which does not implement prefetching. For brevity, we only show results for a small number of workloads, which we sampled to be representative of workloads where the prefetcher provides different levels of effectiveness. We observe that, in most cases, CROW-cache operates synergistically with the prefetcher, i.e., CROW-cache serves both read and prefetch requests with low latency, which further improves average system performance by 5.7% over the prefetcher across all single-core workloads.

Figure 12: CROW-cache and prefetching (speedup of the prefetcher, CROW-8, and CROW-8 + prefetcher over a non-prefetching baseline for representative single-core workloads, and the 1-core and HHHH averages).
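For context, a stride prefetcher of the kind used in this experiment can be summarized by the following C++ sketch. This is a generic RPT-style design; the table organization, confidence policy, and all identifiers are our own assumptions, not the exact configuration evaluated above.

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal RPT-style stride prefetcher sketch: per-PC last address, stride,
// and a small confidence counter. A prefetch for addr + stride is issued
// once the same stride has been observed repeatedly for a given PC.
class StridePrefetcher {
 public:
  // Returns true and sets prefetch_addr when a prefetch should be issued.
  bool access(uint64_t pc, uint64_t addr, uint64_t& prefetch_addr) {
    Entry& e = table_[pc];
    int64_t stride = static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
    if (e.valid && stride == e.stride) {
      if (e.confidence < 3) ++e.confidence;   // same stride seen again
    } else {
      e.stride = stride;                      // learn a new stride
      e.confidence = 0;
    }
    e.last_addr = addr;
    e.valid = true;
    if (e.confidence >= 2 && e.stride != 0) {
      prefetch_addr = addr + e.stride;
      return true;
    }
    return false;
  }

 private:
  struct Entry {
    uint64_t last_addr = 0;
    int64_t stride = 0;
    int confidence = 0;
    bool valid = false;
  };
  std::unordered_map<uint64_t, Entry> table_;  // indexed by instruction PC
};
```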

8.2 CROW-ref
Our weak row remapping scheme, CROW-ref, extends the refresh interval of the DRAM chip from 64 ms to 128 ms by eliminating the small set of rows that have a retention time below 128 ms. For our evaluations, we assume three weak rows for each subarray, which is much more than expected given data from real DRAM devices [51, 64, 87] (based on Equation 2 in Section 4.2.1, the probability of having a subarray with more than 3 weak rows across the entire DRAM is 9.3 × 10⁻³). Extending the refresh interval provides performance and energy efficiency benefits, as fewer refresh operations occur and they interfere less with application requests. CROW-ref does not incur any additional overhead other than allocating a few of the copy rows that are available with the CROW substrate. The rest of the copy rows can potentially be used by other CROW-based mechanisms, e.g., CROW-cache.

Figure 13 shows CROW-ref's average speedup and normalized DRAM energy consumption for all single-core workloads and the memory-intensive four-core workloads (HHHH) for four DRAM chip densities (8, 16, 32, 64 Gbit). We observe that, for a futuristic 64 Gbit DRAM chip, CROW-ref improves average performance by 7.1%/11.9% and reduces DRAM energy consumption by 17.2%/7.8% for single-/multi-core workloads. The energy benefits are lower for four-core workloads, as DRAM accesses contribute a larger portion of the overall DRAM energy compared to single-core workloads, which inject relatively few requests to the DRAM.

Figure 13: CROW-ref speedup and DRAM energy for 8, 16, 32, and 64 Gbit chip densities (single-core and HHHH four-core workloads).
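As a rough illustration of the bookkeeping that CROW-ref implies in the memory controller, the sketch below allocates a copy row for every regular row whose profiled retention time falls below the target refresh interval. The CROW-table entry fields follow Section 6.1, but the function and field names are hypothetical, and the per-row retention times are assumed to come from an existing retention profiling mechanism (e.g., [64, 87]).

```cpp
#include <cstdint>
#include <vector>

// One CROW-table entry per copy row (Section 6.1): which regular row it maps,
// whether it is used for CROW-ref (Special = 1) or CROW-cache (Special = 0),
// and whether the entry is currently in use.
struct CrowTableEntry {
    uint16_t regular_row_id = 0;  // pointer within the subarray (0..RR-1)
    bool special = false;         // true: CROW-ref remap, false: CROW-cache copy
    bool allocated = false;
};

// For one subarray, remap every regular row whose profiled retention time is
// below the target refresh interval (e.g., 128 ms) to a free copy row.
// Returns false if the subarray has more weak rows than available copy rows,
// in which case the refresh interval cannot be extended for this subarray.
bool allocate_crow_ref(const std::vector<uint32_t>& retention_ms,  // per regular row
                       uint32_t target_interval_ms,
                       std::vector<CrowTableEntry>& entries) {     // one per copy row
    size_t next_free = 0;
    for (size_t row = 0; row < retention_ms.size(); ++row) {
        if (retention_ms[row] >= target_interval_ms)
            continue;  // row retains data long enough; no remap needed
        if (next_free == entries.size())
            return false;  // not enough copy rows for all weak rows
        entries[next_free] = {static_cast<uint16_t>(row),
                              /*special=*/true, /*allocated=*/true};
        ++next_free;
    }
    return true;
}
```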

8.3 Combining CROW-cache and CROW-ref
In Section 8.1, we note that CROW-cache does not effectively use all available copy rows in all applications, since CROW-1 provides a speedup close to that of CROW-cache configurations with more copy rows. Similarly, for CROW-ref, previous work shows that there are only a small number of DRAM rows (e.g., < 1000 in a 32 GiB DRAM [64]) that must be refreshed at the lowest refresh intervals, so it is very unlikely to have more than a few weak rows (or even one) in a subarray. These two observations lead us to believe that the two mechanisms, CROW-cache and CROW-ref, can be combined to operate synergistically.

Combining the two mechanisms requires only a single additional bit per CROW-table entry to indicate whether a copy row is allocated for CROW-cache or CROW-ref. We combine the two mechanisms such that CROW-cache utilizes the copy rows that remain available after the weak row remapping of CROW-ref.

Figure 14 summarizes the performance and energy efficiency benefits of CROW-cache, CROW-ref, and their combination for LLC capacities ranging from 512 KiB to 32 MiB. The figure also compares the benefits of the two CROW-based mechanisms against a hypothetical ideal mechanism that has CROW-cache with a 100% hit rate and does not require any DRAM refresh. We make three observations from the figure.

Figure 14: CROW-(cache+ref) speedup and DRAM energy for different LLC capacities ((a) single-core workloads, (b) four-core workloads).

First, the two mechanisms combined together provide higher performance and energy efficiency than either mechanism alone. This is because each mechanism improves a different aspect of DRAM. With an 8 MiB LLC, the combination of the two mechanisms provides a 20.0% performance improvement and a 22.3% DRAM energy reduction for four-core workloads. The combined performance and energy benefits of CROW-cache/CROW-ref are larger than the sum of the benefits of each individual mechanism. This is because CROW-ref, by eliminating many refresh operations, decreases the interference of refresh operations with DRAM access operations. This helps CROW-cache: a row that would have been closed by a refresh operation may now stay open and no longer needs to be reactivated. As a result, CROW-cache can use its limited copy rows to help accelerate the activation of rows other than the ones that would have been closed by refresh.

Second, CROW-cache and CROW-ref significantly improve performance and reduce DRAM energy for all LLC capacities. For 512 KiB/8 MiB/32 MiB LLCs, we observe 22.4%/20.0%/15.7% performance improvement and 22.0%/22.3%/22.8% DRAM energy reduction for four-core workloads.

Third, the combination of CROW-cache and CROW-ref achieves performance and DRAM energy improvements close to those of the ideal mechanism (100% hit rate CROW-cache and no refresh). With an 8 MiB LLC, the combination of the two mechanisms achieves 71% of the performance and 99% of the DRAM energy improvements of the ideal mechanism. (The combination almost reaches the energy reduction of the ideal mechanism because the ideal mechanism always hits in the CROW-table, and thus always uses an ACT-t command to activate a row, which consumes more energy than a regular ACT.)

We conclude that CROW is a flexible substrate that enables multiple different mechanisms to operate simultaneously, thereby improving both DRAM performance and energy efficiency.

9 RELATED WORK
To our knowledge, this paper is the first to propose a flexible and low-cost DRAM substrate that enables mechanisms for improving DRAM performance, energy efficiency, and reliability. We briefly discuss closely related prior work that proposes in-DRAM caching, latency reduction, refresh reduction, and RowHammer protection mechanisms.

In-DRAM Caching Mechanisms. In Section 8.1.4, we qualitatively and quantitatively compare CROW to the two most-closely related prior works, TL-DRAM [58] and SALP [53], which also propose in-DRAM caching. We show that CROW-cache is lower-cost and consumes less DRAM energy compared to the two mechanisms. Since TL-DRAM and SALP exploit different observations to enable low-latency regions in DRAM, CROW-cache can be implemented in combination with both TL-DRAM and SALP to further improve system performance.

Hidaka et al. [29] propose to add SRAM memory cells inside a DRAM chip to use as a cache. Implementing SRAM-based memory in a DRAM chip incurs very high area overhead (e.g., 38.8% of the DRAM chip area for a 64 KiB SRAM cache, as shown in [53]). Prior works [22, 27, 94] that propose implementing multiple row buffers (which can be used as a cache) also suffer from high area overhead.

Chang et al. [8] propose LISA to enable efficient data movement between two subarrays. They also propose changes to the bank architecture to include small (but fast) subarrays that are used as an in-DRAM cache (LISA-VILLA). Compared to CROW-cache, LISA-VILLA requires more changes to the DRAM chip; it is also orthogonal to CROW-cache, as the copy rows can be implemented in the regular subarrays of a DRAM chip. CROW-cache can be used in combination with all these prior mechanisms to reduce the latency of fetching data from a DRAM row to the in-DRAM cache.

Multiple Clone Row DRAM (MCR-DRAM) [10] is based on an idea that is similar to simultaneously activating multiple duplicate rows. The key idea is to dynamically configure a DRAM bank into different access-latency regions, where a region stores data in a single row or duplicates data into multiple rows. However, MCR-DRAM is not a hardware-managed in-DRAM cache, as it requires support from the operating system (or the application) to manage each region. Data movement operations between the different regions are difficult to perform due to memory fragmentation, which complicates resizing a dynamic region and determining which data would benefit from duplication, and the system does not guarantee that data is always fetched from a low-latency region.
in-DRAM dynamic enable to ] , 27 always , 94 8 htpooeipeetn utperwbuffers row multiple implementing propose that ] 29 rps IAt nbeefcetdt movement data efficient enable to LISA propose ] ssan uses rps oadSA eoyclsisd a inside cells memory SRAM add to propose ] 64 ACT KiB ). ACT RMccea hw n[ in shown as cache SRAM 10 tcmadt ciaearw(hc consumes (which row a activate to command -t fteDA nryimprovements energy DRAM the of 58 n AP[ SALP and ] 10 nScin814 equalita- we 8.1.4, Section In multiple sbsdo nie that idea an on based is ] all always L aaiis For capacities. LLC LISA-VILLA 53 ,wihas pro- also which ], ehnssfor mechanisms isi h CROW- the in hits different 53 LISA-VILLA not , 58 propose which , ) Prior ]). obser- . from a low-latency region. In contrast, CROW is much more prac- REFERENCES tical to implement, as it is hardware managed and thus completely [1] Arizona State Univ., NIMO Group, “Predictive Technology Model,” http://ptm. transparent to the software. asu.edu/, 2012. [2] S. Baek et al., “Refresh Now and Then,” TC, 2014. Mitigating Refresh Overhead. Many prior works tackle the [3] I. Bhati et al., “Coordinated Refresh: Energy Efficient Techniques for DRAM refresh overhead problem in DRAM. RAIDR [64] proposes to bin Refresh Scheduling,” in ISLPED, 2013. DRAM rows based on their retention times and perform refresh [4] K. Chandrasekar et al., “Exploiting Expendable Process-Margins in DRAMs for Run-Time Performance Optimization,” in DATE, 2014. operations at a different rate on each bin. Riho et al.[93] use a tech- [5] K. Chandrasekar et al., “DRAMPower: Open-Source DRAM Power & Energy nique based on simultaneous multiple-row activation. However, Estimation Tool,” http://www.drampower.info. their mechanism is specifically designed for optimizing only refresh [6] K. K. Chang et al., “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” in SIGMETRICS, operations, and thus it is not as flexible as the CROW substrate, 2016. which provides copy rows that enable multiple orthogonal mecha- [7] K. K. Chang et al., “Improving DRAM Performance by Parallelizing Refreshes with Accesses,” in HPCA, 2014. nisms. Other prior works [2, 3, 7, 11, 12, 19, 33, 38, 41, 42, 44, 49, 50, [8] K. K. Chang et al., “Low-cost Inter-linked Subarrays (LISA): Enabling Fast 66, 69, 76, 83, 84, 86–88, 107, 114] propose various techniques to op- Inter-subarray Data Movement in DRAM,” in HPCA, 2016. timize refresh operations. These works are specifically designed for [9] K. K. Chang et al., “Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms,” only mitigating the refresh overhead. In contrast, CROW provides SIGMETRICS, 2017. a versatile substrate that can simultaneously reduce the refresh [10] J. Choi et al., “Multiple Clone Row DRAM: A Low Latency and Area Optimized overhead, reduce latency, and improve reliability. DRAM,” in ISCA, 2015. [11] Z. Cui et al., “DTail: A Flexible Approach to DRAM Refresh Management,” in Reducing DRAM Latency. ChargeCache [26] reduces the ICS, 2014. average DRAM latency based on the observation that recently- [12] P. G. Emma et al., “Rethinking Refresh: Increasing Availability and Reducing precharged rows, which are fully restored, can be accessed faster Power in DRAM for Cache Applications,” IEEE Micro, 2008. [13] S. Eyerman and L. 
Eeckhout, “System-Level Performance Metrics for Multipro- compared to rows that have leaked some charge and are about gram Workloads,” IEEE Micro, 2008. to be refreshed soon. Note that ChargeCache enables low-latency [14] M. Ferdman et al., “Clearing the Clouds: A Study of Emerging Scale-Out Work- access when DRAM rows are repeatedly activated in very short loads on Modern Hardware,” in ASPLOS, 2012. [15] J. E. Fritts et al., “Mediabench II Video: Expediting the Next Generation of Video intervals, e.g., 1 ms. In contrast, a copy row in CROW-cache enables Systems Research,” in Electronic Imaging, 2005. low-latency access to a regular row for an indefinite amount of [16] M. Ghasempour et al., “Armor: A Run-Time Memory Hot-Row Detector,” http: time (until the row is evicted from CROW-table). Thus, CROW- //apt.cs.manchester.ac.uk/projects/ARMOR/RowHammer, 2015. [17] S. Ghose et al., “Demystifying Complex Workload–DRAM Interactions: An cache captures higher amount of in-DRAM locality. CROW-cache Experimental Study,” in SIGMETRICS, 2019. is also orthogonal to ChargeCache and the two techniques can be [18] S. Ghose et al., “Understanding the Interactions of Workloads and DRAM Types: implemented together. A Comprehensive Experimental Study,” arXiv:1902.07609 [cs.AR], 2019. [19] M. Ghosh and H.-H. S. Lee, “Smart Refresh: An Enhanced Memory Controller Recent studies [4, 6, 47, 48, 57] propose mechanisms to reduce Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs,” in the DRAM access latency by reducing the margins present in tim- MICRO, 2007. ing parameters when operating under appropriate conditions (e.g., [20] D. Gruss et al., “Another Flip in the Wall of Rowhammer Defenses,” in SP, 2018. [21] D. Gruss et al., “Rowhammer.js: A Remote Software-Induced Fault Attack in low temperature). Other works [7, 9, 22, 46, 57, 60, 101, 105, 119] Javascript,” in DIMVA, 2016. propose different methods to reduce DRAM latency. These works [22] N. D. Gulur et al., “Multiple Sub-Row Buffers in DRAM: Unlocking Performance are orthogonal to CROW, and they can be implemented along with and Energy Improvement Opportunities,” in SC, 2012. [23] A. Gutierrez et al., “Full-System Analysis and Characterization of Interactive our mechanism to further reduce DRAM latency. Smartphone Applications,” in IISWC, 2011. [24] T. Hamamoto et al., “On the Retention Time Distribution of Dynamic Random 10 CONCLUSION Access Memory (DRAM),” TED, 1998. We propose CROW, a low-cost substrate that partitions each DRAM [25] G. Hamerly et al., “SimPoint 3.0: Faster and More Flexible Program Phase Analysis,” JILP, 2005. subarray into two regions (regular rows and copy rows) and en- [26] H. Hassan et al., “ChargeCache: Reducing DRAM Latency by Exploiting Row ables independent control over the rows in each region. We lever- Access Locality,” in HPCA, 2016. age CROW to design two new mechanisms, CROW-cache and [27] E. Herrero et al., “Thread Row Buffers: Improving Memory Performance Isola- tion and Throughput in Multiprogrammed Environments,” TC, 2012. CROW-ref, that improve DRAM performance and energy-efficiency. [28] J. Hestness et al., “A Comparative Analysis of Microarchitecture Effects on CPU CROW-cache uses a copy row to duplicate a regular row and simul- and GPU Memory System Behavior,” in IISWC, 2014. taneously activates a regular row together with its duplicated copy [29] H. 
Hidaka et al., “The Cache DRAM Architecture: A DRAM with an On-Chip row to reduce the DRAM activation latency (by 38% in our experi- Cache Memory,” IEEE Micro, 1990. [30] Y. Huang et al., “Moby: A Mobile Benchmark Suite for Architectural Simulators,” ments). CROW-ref remaps retention-weak regular rows to strong in ISPASS, 2014. copy rows, thereby reducing the DRAM refresh rate. CROW’s flexi- [31] S. Iacobovici et al., “Effective Stream-Based and Execution-Based Data Prefetch- ing,” in ICS, 2004. bility allows us to simultaneously employ both CROW-cache and [32] E. Ipek et al., “Self-Optimizing Memory Controllers: A Reinforcement Learning CROW-ref to provide 20.0% speedup and 22.3% DRAM energy sav- Approach,” in ISCA, 2008. ings over conventional DRAM. We conclude that CROW is a flexi- [33] C. Isen and L. John, “ESKIMO - Energy Savings Using Semantic Knowledge of Inconsequential Memory Occupancy for DRAM Subsystem,” in MICRO, 2009. ble DRAM substrate that enables a wide range of mechanisms to [34] ITRS Reports, http://www.itrs2.net/itrs-reports.html. improve performance, energy efficiency, and reliability. We hope [35] JEDEC Solid State Technology Assn., “JESD79-3F: DDR3 SDRAM Standard,” future work exploits CROW to devise more use cases that can take July 2012. advantage of its low-cost and versatile substrate. [36] JEDEC Solid State Technology Assn., “JESD209-4B: Low Power Double Data Rate 4 (LPDDR4) Standard,” March 2017. [37] JEDEC Solid State Technology Assn., “JESD79-4B: DDR4 SDRAM Standard,” ACKNOWLEDGMENTS June 2017. We thank the anonymous reviewers for feedback. We thank the [38] M. Jung et al., “Omitting Refresh: A Case Study for Commodity and Wide I/O DRAMs,” in MEMSYS, 2015. SAFARI Research Group members for feedback and the stimulating [39] M. Kandemir et al., “Memory Row Reuse Distance and Its Role in Optimizing intellectual environment they provide. We acknowledge the gen- Application Performance,” in SIGMETRICS, 2015. erous gifts provided by our industrial partners: Alibaba, Facebook, [40] U. Kang et al., “Co-Architecting Controllers and DRAM to Enhance DRAM Google, Huawei, Intel, Microsoft, and VMware. This research was Process Scaling,” in The Memory Forum, 2014. supported in part by the Semiconductor Research Corporation.

[41] S. Khan et al., "The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study," in SIGMETRICS, 2014.
[42] S. Khan et al., "PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM," in DSN, 2016.
[43] S. Khan et al., "A Case for Memory Content-Based Detection and Mitigation of Data-Dependent Failures in DRAM," CAL, 2016.
[44] S. Khan et al., "Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content," in MICRO, 2017.
[45] D.-H. Kim et al., "Architectural Support for Mitigating Row Hammering in DRAM Memories," CAL, 2015.
[46] J. Kim et al., "Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines," in ICCD, 2018.
[47] J. S. Kim et al., "The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability Tradeoff in Modern Commodity DRAM Devices," in HPCA, 2018.
[48] J. S. Kim et al., "D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput," in HPCA, 2019.
[49] J. Kim and M. C. Papaefthymiou, "Dynamic Memory Design for Low Data-Retention Power," in PATMOS, 2000.
[50] J. Kim and M. C. Papaefthymiou, "Block-Based Multiperiod Dynamic Memory Design for Low Data-Retention Power," TVLSI, 2003.
[51] K. Kim and J. Lee, "A New Investigation of Data Retention Time in Truly Nanoscaled DRAMs," EDL, 2009.
[52] Y. Kim et al., "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," in ISCA, 2014.
[53] Y. Kim et al., "A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM," in ISCA, 2012.
[54] Y. Kim et al., "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior," in MICRO, 2010.
[55] Y. Kim et al., "Ramulator: A Fast and Extensible DRAM Simulator," CAL, 2015.
[56] Y. Konishi et al., "Analysis of Coupling Noise Between Adjacent Bit Lines in Megabit DRAMs," JSSC, 1989.
[57] D. Lee et al., "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case," in HPCA, 2015.
[58] D. Lee et al., "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," in HPCA, 2013.
[59] D. Lee et al., "Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost," TACO, 2016.
[60] D. Lee et al., "Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms," SIGMETRICS, 2017.
[61] D. Lee et al., "Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM," in PACT, 2015.
[62] E. Lee et al., "TWiCe: Time Window Counter Based Row Refresh to Prevent Row-Hammering," CAL, 2018.
[63] M. Lipp et al., "Nethammer: Inducing Rowhammer Faults Through Network Requests," arXiv, 2018.
[64] J. Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," in ISCA, 2012.
[65] J. Liu et al., "An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms," in ISCA, 2013.
[66] S. Liu et al., "Flikker: Saving DRAM Refresh-Power Through Critical Data Partitioning," ASPLOS, 2012.
[67] S.-L. Lu et al., "Improving DRAM Latency with Dynamic Asymmetric Subarray," in MICRO, 2015.
[68] C.-K. Luk et al., "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," in PLDI, 2005.
[69] Y. Luo et al., "Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory," in DSN, 2014.
[70] J. Mandelman et al., "Challenges and Future Directions for the Scaling of Dynamic Random-Access Memory (DRAM)," IBM JRD, 2002.
[71] J. D. McCalpin, "STREAM: Sustainable Memory Bandwidth in High Performance Computers," https://www.cs.virginia.edu/stream/.
[72] Micron Technology, Inc., "RLDRAM 2 and 3 Specifications," http://www.micron.com/products/dram/rldram-memory.
[73] Micron Technology, Inc., "x64 Mobile LPDDR4 SDRAM Datasheet," https://prod.micron.com/~/media/documents/products/data-sheet/dram/mobile-dram/low-power-dram/lpddr4/272b_z9am_qdp_mobile_lpddr4.pdf.
[74] Y. Mori et al., "The Origin of Variable Retention Time in DRAM," in IEDM, 2005.
[75] T. Moscibroda and O. Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," in USENIX Security, 2007.
[76] J. Mukundan et al., "Understanding and Mitigating Refresh Overheads in High-Density DDR4 DRAM Systems," in ISCA, 2013.
[77] N. Muralimanohar et al., "CACTI 6.0: A Tool to Model Large Caches," HP Laboratories, Tech. Rep. HPL-2009-85, 2009.
[78] O. Mutlu, "Memory Scaling: A Systems Architecture Perspective," IMW, 2013.
[79] O. Mutlu, "The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser," in DATE, 2017.
[80] O. Mutlu and J. S. Kim, "RowHammer: A Retrospective," TCAD, 2019.
[81] O. Mutlu and T. Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," in MICRO, 2007.
[82] O. Mutlu and L. Subramanian, "Research Problems and Opportunities in Memory Systems," SUPERFRI, 2015.
[83] P. Nair et al., "A Case for Refresh Pausing in DRAM Memory Systems," in HPCA, 2013.
[84] P. J. Nair et al., "Refresh Pausing in DRAM Memory Systems," TACO, 2014.
[85] H. Park et al., "Regularities Considered Harmful: Forcing Randomness to Memory Accesses to Reduce Row Buffer Conflicts for Multi-Core, Multi-Bank Systems," in ASPLOS, 2013.
[86] K. Patel et al., "Energy-Efficient Value-Based Selective Refresh for Embedded DRAMs," in PATMOS, 2005.
[87] M. Patel et al., "The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions," ISCA, 2017.
[88] M. Qureshi et al., "AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems," in DSN, 2015.
[89] Rambus Inc., "DRAM Power Model," http://www.rambus.com/energy/, 2016.
[90] K. Razavi et al., "Flip Feng Shui: Hammering a Needle in the Software Stack," in USENIX Sec., 2016.
[91] M. Redeker et al., "An Investigation into Crosstalk Noise in DRAM Structures," in MTDT, 2002.
[92] P. J. Restle et al., "DRAM Variable Retention Time," in IEDM, 1992.
[93] Y. Riho and K. Nakazato, "Partial Access Mode: New Method for Reducing Power Consumption of Dynamic Random Access Memory," TVLSI, 2014.
[94] S. Rixner, "Memory Controller Optimizations for Web Servers," in MICRO, 2004.
[95] S. Rixner et al., "Memory Access Scheduling," in ISCA, 2000.
[96] SAFARI Research Group, "CROW — GitHub Repository," https://github.com/CMU-SAFARI/CROW.
[97] SAFARI Research Group, "Ramulator: A DRAM Simulator — GitHub Repository," https://github.com/CMU-SAFARI/ramulator.
[98] Y. Sato et al., "Fast Cycle RAM (FCRAM): A 20-ns Random Row Access, Pipelined Operating DRAM," in VLSIC, 1998.
[99] M. Seaborn and T. Dullien, "Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges," https://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html, 2015.
[100] V. Seshadri et al., "RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization," in MICRO, 2013.
[101] V. Seshadri et al., "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," in MICRO, 2017.
[102] V. Seshadri et al., "Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-Unit Strided Accesses," in MICRO, 2015.
[103] S. M. Seyedzadeh et al., "Mitigating Wordline Crosstalk Using Adaptive Trees of Counters," in ISCA, 2018.
[104] A. Snavely and D. M. Tullsen, "Symbiotic Jobscheduling for a Simultaneous Multithreading Processor," ASPLOS, 2000.
[105] Y. Son et al., "Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations," ISCA, 2013.
[106] Standard Performance Evaluation Corp., "SPEC CPU® 2006," http://www.spec.org/cpu2006/, 2006.
[107] J. Stuecheli et al., "Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory," in MICRO, 2010.
[108] L. Subramanian et al., "The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost," in ICCD, 2014.
[109] L. Subramanian et al., "BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling," TPDS, 2016.
[110] A. Tatar et al., "Defeating Software Mitigations Against Rowhammer: A Surgical Precision Hammer," in RAID, 2018.
[111] Transaction Processing Performance Council, "TPC Benchmarks," http://www.tpc.org/.
[112] A. N. Udipi et al., "Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores," in ISCA, 2010.
[113] V. van der Veen et al., "Drammer: Deterministic Rowhammer Attacks on Mobile Platforms," in CCS, 2016.
[114] R. Venkatesan et al., "Retention-Aware Placement in DRAM (RAPID): Software Methods for Quasi-Non-Volatile DRAM," in HPCA, 2006.
[115] T. Vogelsang, "Understanding the Energy Consumption of Dynamic Random Access Memories," in MICRO, 2010.
[116] Y. Wang et al., "Reducing DRAM Latency via Charge-Level-Aware Look-Ahead Partial Restoration," in MICRO, 2018.
[117] Y. Xiao et al., "One Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege Escalation," in USENIX Sec., 2016.
[118] D. Yaney et al., "A Meta-Stable Leakage Phenomenon in DRAM Charge Storage - Variable Hold Time," in IEDM, 1987.
[119] T. Zhang et al., "Half-DRAM: A High-Bandwidth and Low-Power DRAM Architecture from the Rethinking of Fine-Grained Activation," in ISCA, 2014.
[120] X. Zhang et al., "Exploiting DRAM Restore Time Variations in Deep Sub-Micron Scaling," in DATE, 2015.
[121] X. Zhang et al., "Restore Truncation for Performance Improvement in Future DRAM Systems," in HPCA, 2016.
[122] W. Zhao and Y. Cao, "New Generation of Predictive Technology Model for Sub-45 nm Early Design Exploration," TED, 2006.
[123] Y. Zhu et al., "Microarchitectural Implications of Event-Driven Server-Side Web Applications," in MICRO, 2015.
[124] W. K. Zuravleff and T. Robinson, "Controller for a Synchronous DRAM That Maximizes Throughput by Allowing Memory Requests and Commands to Be Issued Out of Order," U.S. Patent No. 5,630,096, 1997.
