Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture Donghyuk Lee Yoongu Kim Vivek Seshadri Jamie Liu Lavanya Subramanian Onur Mutlu Carnegie Mellon University

Abstract

The capacity and cost-per-bit of DRAM have historically scaled to satisfy the needs of increasingly large and complex computer systems. However, DRAM latency has remained almost constant, making memory latency the performance bottleneck in today's systems. We observe that the high access latency is not intrinsic to DRAM, but a trade-off made to decrease cost-per-bit. To mitigate the high area overhead of DRAM sensing structures, commodity DRAMs connect many DRAM cells to each sense-amplifier through a wire called a bitline. These bitlines have a high parasitic capacitance due to their long length, and this bitline capacitance is the dominant source of DRAM latency. Specialized low-latency DRAMs use shorter bitlines with fewer cells, but have a higher cost-per-bit due to greater sense-amplifier area overhead. In this work, we introduce Tiered-Latency DRAM (TL-DRAM), which achieves both low latency and low cost-per-bit. In TL-DRAM, each long bitline is split into two shorter segments by an isolation transistor, allowing one segment to be accessed with the latency of a short-bitline DRAM without incurring high cost-per-bit. We propose mechanisms that use the low-latency segment as a hardware-managed or software-managed cache. Evaluations show that our proposed mechanisms improve both performance and energy-efficiency for both single-core and multi-programmed workloads.

[Figure 1. DRAM Capacity & Latency Over Time [5, 19, 35, 43] — capacity (Gb, 0–2.5) and latency (ns, 0–100; tRCD and tRC) for SDR-200 (2000), DDR-400 (2003), DDR2-800 (2006), DDR3-1066 (2008), and DDR3-1333 (2011). † We refer to the dominant DRAM chips during the period of time [5, 19].]

1 Introduction

Primarily due to its low cost-per-bit, DRAM has long been the choice substrate for architecting main memory subsystems. In fact, DRAM's cost-per-bit has been decreasing at a rapid rate as DRAM process technology scales to integrate ever more DRAM cells into the same die area. As a result, each successive generation of DRAM has enabled increasingly large-capacity main memory subsystems at low cost.

In stark contrast to the continued scaling of cost-per-bit, the latency of DRAM has remained almost constant. During the same 11-year interval in which DRAM's cost-per-bit decreased by a factor of 16, DRAM latency (as measured by the tRCD and tRC timing constraints¹) decreased by only 30.5% and 26.3%, as shown in Fig 1. From the perspective of the processor, an access to DRAM takes hundreds of cycles – time during which the processor may be stalled, waiting for DRAM. Such wasted time leads to large performance degradations commonly referred to as the "memory wall" [58] or the "memory gap" [56].

1 The overall DRAM access latency can be decomposed into individual DRAM timing constraints (Sec 2.2). Two of the most important timing constraints are tRCD (row-to-column delay) and tRC ("row-conflict" latency). When DRAM is lightly loaded, tRCD is often the bottleneck, whereas when DRAM is heavily loaded, tRC is often the bottleneck (Secs 2.2 and 2.3).

However, the high latency of commodity DRAM chips is in fact a deliberate trade-off made by DRAM manufacturers. While process technology scaling has enabled DRAM designs with both lower cost-per-bit and lower latency [16], DRAM manufacturers have usually sacrificed the latency benefits of scaling in order to achieve even lower cost-per-bit, as we explain below. Hence, while low-latency DRAM chips exist [25, 27, 45], their higher cost-per-bit relegates them to specialized applications such as high-end networking equipment that require very low latency even at a very high cost [6].

In DRAM, each bit is represented by electrical charge on a capacitor-based cell. The small size of this capacitor necessitates the use of an auxiliary structure, called a sense-amplifier, to sense the small amount of charge held by the cell and amplify it to a full digital logic value. A sense-amplifier is up to three orders of magnitude larger than a cell [39]. To mitigate the large size of sense-amplifiers, each sense-amplifier is connected to many DRAM cells through a wire called a bitline.

DRAM manufacturers trade latency for cost-per-bit by adjusting the length of these bitlines. Shorter bitlines (fewer cells connected to each bitline) constitute a smaller electrical load on the bitline, resulting in decreased latency, but require a larger number of sense-amplifiers for a given DRAM capacity (Fig 2a), resulting in higher cost-per-bit. Longer bitlines (more cells connected to each bitline) require fewer sense-amplifiers for a given DRAM capacity (Fig 2b), reducing cost-per-bit, but impose a higher electrical load on the bitline, increasing latency. As a result, neither of these two approaches can optimize for both cost-per-bit and latency.

[Figure 2. DRAM: Latency vs. Cost Optimized, Our Proposal — (a) Latency Opt.: short bitlines, each with few cells and its own sense-amps; (b) Cost Opt.: one long bitline with many cells per sense-amp; (c) Our Proposal: a long bitline split by an isolation transistor (Isolation TR.) into a near and a far segment sharing one set of sense-amps.]

Our goal is to design a new DRAM architecture that provides low latency for the common case while still retaining a low cost-per-bit overall. Our proposal, which we call Tiered-Latency DRAM, is based on the key observation that long bitlines are the dominant source of DRAM latency [47].

Our key idea is to adopt long bitlines to achieve low cost-per-bit, while allowing their lengths to appear shorter in order to achieve low latency. In our mechanism (Tiered-Latency DRAM), each long bitline is split into two shorter segments using an isolation transistor, as shown in Fig 2c: the segment that is connected directly to the sense-amplifier is called the near segment, whereas the other is called the far segment.

Table 1 summarizes the latency and die-size (i.e., cost-per-bit²) characteristics of Tiered-Latency DRAM and other DRAMs (refer to Sec 3 for details). Compared to commodity DRAM (long bitlines), which incurs high access latency for all cells, Tiered-Latency DRAM offers significantly reduced access latency for cells in the near segment, while not always increasing the access latency for cells in the far segment. In fact, for cells in the far segment, although tRC is increased, tRCD is actually decreased.

2 For the same number of bits, we assume that a DRAM chip's cost-per-bit is linearly correlated with the die-size.

Table 1. Latency & Die-Size Comparison of DRAMs (Sec 3)

                              Short Bitline    Long Bitline     Segmented Bitline (Fig 2c)
                              (Fig 2a,          (Fig 2b,         Near            Far
                              unsegmented)      unsegmented)
  Length (cells)              32                512              32              480
  Latency: tRCD               Lowest (8.2ns)    High (15ns)      Lowest (8.2ns)  Low (12.1ns)
  Latency: tRC                Lowest (23.1ns)   High (52.5ns)    Lowest (23.1ns) Higher (65.8ns)
  Normalized Die-Size (Cost)  Highest (3.76)    Lowest (1.00)    Low (1.03)

To access a cell in the near segment, the isolation transistor is turned off, so that the cell and the sense-amplifier see³ only the portion of the bitline corresponding to the near segment (i.e., reduced electrical load). Therefore, the near segment can be accessed quickly. On the other hand, to access a cell in the far segment, the isolation transistor is turned on to connect the entire length of the bitline to the sense-amplifier. In this case, however, the cell and the sense-amplifier see the full electrical load of the bitline in addition to the extra load of the isolation transistor. Due to the increased total load, tRC of the far segment is indeed higher than that of a long unsegmented bitline (Table 1). However, counter-intuitively, tRCD of the far segment is actually lower than that of a long unsegmented bitline (highlighted in Table 1). As we will explain in Section 4.1, this is because the isolation transistor's resistance decouples the two segments to a certain degree.

3 In a closed electrical circuit that connects multiple electrical components to each other, one component is said to "see" (i.e., experience) the effective electrical load of another component, and vice versa.

We describe two different approaches to exploit the asymmetric latency characteristics of the near and the far segments. In the first approach, the near segment capacity is not exposed to the operating system (OS). Rather, the memory controller uses the near segment as a hardware-managed cache for the far segment. We propose three different policies to manage the near segment cache. The three policies differ in when a row in the far segment is cached into the near segment and when it is evicted from the near segment. In the second approach, the near segment capacity is exposed to the OS, enabling the OS to use the full available DRAM capacity. We propose two mechanisms, one where the memory controller uses an additional layer of indirection to map frequently accessed pages to the near segment, and another where the OS uses static/dynamic profiling to directly map frequently accessed pages to the near segment. In both approaches, the accesses to pages that are mapped to the near segment are served faster and with lower power than in conventional DRAM, resulting in improved system performance and energy efficiency.

This paper makes the following contributions.
• Based on the observation that long internal wires (bitlines) are the dominant source of DRAM latency, we propose a new DRAM architecture, Tiered-Latency DRAM. To our knowledge, this is the first work that enables low-latency DRAM without significantly increasing cost-per-bit.
• We quantitatively evaluate the latency, area, and power characteristics of Tiered-Latency DRAM through circuit simulations based on a publicly available 55nm DRAM process technology [39].
• We describe two major ways of leveraging TL-DRAM: 1) by not exposing the near segment capacity to the OS and using it as a hardware-managed cache, and 2) by exposing the near segment capacity to the OS and using hardware/software to map frequently accessed pages to the near segment. We propose two new policies to manage the near segment cache that specifically exploit the asymmetric latency characteristics of TL-DRAM.
• We evaluate our proposed memory management policies on top of the Tiered-Latency DRAM substrate and show that they improve both system performance and energy efficiency for both single-program and multi-programmed workloads.

2 DRAM Background

A DRAM chip consists of numerous cells that are grouped at different granularities to form a hierarchical organization, as shown in Fig 3. First, a DRAM chip is the set of all cells on the same silicon die, along with the I/O circuitry which allows the cells to be accessed from outside the chip. Second, a DRAM chip is divided into multiple banks (e.g., eight for DDR3), where a bank is a set of cells that share peripheral circuitry such as the address decoders (row and column). Third, a bank is further sub-divided into tens of subarrays, where a subarray [24] is a set of cells that share bitlines and sense-amplifiers.

[Figure 3. DRAM Organization — a chip contains multiple banks plus I/O and peripheral circuitry connected to the memory bus; each bank contains multiple subarrays.]

In this work, we are predominantly concerned with the operation and latency of the bitlines, which are entirely encapsulated at the subarray-level scope. While I/O and peripherals at the chip-level and bank-level also contribute to the DRAM access latency, to a certain extent, this latency is overlapped with the large latency at the subarray, as we will explain in Sec 2.2.

2.1 Subarray Organization

A DRAM subarray is a two-dimensional array of elementary units called cells. As shown in Fig 4a, a cell consists of two components: i) a capacitor that represents binary data in the form of stored electrical charge and ii) an access-transistor that is switched on/off to connect/disconnect the capacitor to a wire called the bitline. As shown in Fig 4b, there are approximately 512 cells in the vertical direction (a "column" of cells), all of which share the same bitline. For each bitline, there is a sense-amplifier whose main purpose is to read from a cell by reliably detecting the very small amount of electrical charge stored in the cell. When writing to a cell, on the other hand, the sense-amplifier acts as an electrical driver and programs the cell by filling or depleting its stored charge.
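To keep the hierarchy concrete, the sketch below tallies rows using the geometry the paper later evaluates (Table 4: 8 banks/rank, 32 subarrays/bank, 512 rows per subarray). It is only illustrative bookkeeping, not part of the proposed design, and the specific counts are taken from that evaluated configuration rather than from this section.

```python
# Structural bookkeeping for the organization in Fig 3, using the evaluated
# configuration from Table 4 (illustration only).
BANKS_PER_CHIP = 8
SUBARRAYS_PER_BANK = 32
ROWS_PER_SUBARRAY = 512   # = cells per bitline; every row shares one wordline

rows_per_bank = SUBARRAYS_PER_BANK * ROWS_PER_SUBARRAY
rows_per_chip = BANKS_PER_CHIP * rows_per_bank
print(f"{rows_per_bank} rows per bank, {rows_per_chip} rows per chip")
# Each bitline in a subarray connects ROWS_PER_SUBARRAY cells to a single
# sense-amplifier, which is why sense-amplifier area is amortized over many cells.
```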

[Figure 4. DRAM Elementary Components — (a) Cell: capacitor plus access transistor, connected to the wordline and the bitline; (b) Bitline & Sense-Amplifier: 512 cells sharing one bitline and one sense-amplifier.]

Numerous bitlines (and their associated sense-amplifiers) are laid side-by-side in parallel to compose a subarray (Fig 3). All cells in the horizontal direction (a "row" of cells) have their access-transistors controlled by a shared wire called the wordline. When the wordline voltage is raised to VDD, all cells of a row are connected to their respective bitlines and sensed in lockstep by the sense-amplifiers. This is why the set of all sense-amplifiers in a subarray is also called a row-buffer. At any given time, at most one wordline in a subarray is ever raised (i.e., at most one cell per column is connected to the bitline) – otherwise, cells in the same column would corrupt each other's data.

2.2 Three Phases of DRAM Access

As the timelines in Fig 5 show, a DRAM chip access can be broken down into three distinct phases: i) activation, ii) I/O, and iii) precharging. Activation and precharging occur entirely within the subarray, whereas I/O occurs in the peripherals and I/O circuitry. First, during the activation phase, a wordline within a subarray is raised, connecting a row of cells to the bitlines. Soon thereafter, the data in the row of cells is copied to the sense-amplifiers of that subarray. Second, during the I/O phase, the sense-amplifiers transfer the data through the peripherals to the DRAM chip's I/O circuitry. From there, the data leaves the DRAM chip and is sent to the processor over the memory bus. As Fig 5 shows, the I/O phase's latency is overlapped with the latency of the activation phase. Third, during the precharging phase, the raised wordline in the subarray is lowered, disconnecting the row of cells from the bitlines. Also, the subarray's sense-amplifiers and bitlines are initialized (i.e., cleared of their data) to prepare for the next access to a new row of cells.

Three DRAM Commands. The DRAM controller (typically residing on the processor die) issues commands to the DRAM chip to initiate the three phases listed above. As shown in Fig 5, there are three commands, one for each phase. In their respective order, they are: ACTIVATE (ACT), READ/WRITE, and PRECHARGE (PRE). Among the commands, ACTIVATE and PRECHARGE are subarray-related commands since they directly operate on the subarray, whereas READ and WRITE are I/O-related commands.

Timing Constraints. After the DRAM controller issues a command to initiate a phase, it must wait for a sufficient amount of time before issuing the next command. Such restrictions imposed between the issuing of commands are called timing constraints. DRAM timing constraints are visualized in Fig 5 and summarized in Table 2. Two of the most important timing constraints are tRCD (row-to-column delay) and tRC (row-cycle time). Every time a new row of cells is accessed, the subarray incurs tRCD (15ns; ACTIVATE→READ/WRITE) to copy the row into the sense-amplifiers. On the other hand, when there are multiple accesses to different rows in the same subarray, an earlier access delays all later accesses by tRC (52.5ns; ACTIVATE→ACTIVATE). This is because the subarray needs time to complete the activation phase (tRAS) and the precharging phase (tRP) for the earlier access, whose sum is defined as tRC (= tRAS + tRP), as shown in Fig 5.

Access Latency. Fig 5 illustrates how the DRAM access latency can be decomposed into individual DRAM timing constraints. Specifically, the figure shows the latencies of two read accesses (to different rows in the same subarray) that are served one after the other. From the perspective of the first access, DRAM is "unloaded" (i.e., no prior timing constraints are in effect), so the DRAM controller immediately issues an ACTIVATE on its behalf. After waiting for tRCD, the controller issues a READ, at which point the data leaves the subarray and incurs additional latencies of tCL (peripherals and I/O circuitry) and tBL (bus) before it reaches the processor. Therefore, the latency of the first access is 37.5ns (tRCD + tCL + tBL). On the other hand, the second access is delayed by the timing constraint that is in effect due to the first access (tRC) and experiences a large "loaded" latency of 90ns (tRC + tRCD + tCL + tBL).

2.3 Subarray Operation: A Detailed Look

Subarray-related timing constraints (tRCD and tRC) constitute a significant portion of the unloaded and loaded DRAM access latencies: 40% of 37.5ns and 75% of 90ns, respectively. Since tRCD and tRC exist only to safeguard the timely operation of the underlying subarray, in order to understand why their values are so large, we must first understand how the subarray operates during the activation and the precharging phases. (As previously explained, the I/O phase does not occur within the subarray and its latency is overlapped with the activation phase.) Specifically, we will show how the bitline plays a crucial role in both activation and precharging, such that it heavily influences both tRCD and tRC (Fig 6).

• Activation (Charge Sharing). Before it is accessed, the cell is initially in the quiescent state (State 1, Fig 6). An access to the cell begins when its access-transistor is turned on by an ACTIVATE, connecting its capacitor to the bitline. Slowly, the charge in the cell capacitor (or the lack thereof) is shared with the bitline parasitic capacitor, thereby perturbing the bitline voltage away from its quiescent value (0.5VDD) in the positive (or negative) direction (State 2). During charge sharing, note that the cell's charge is modified (i.e., data is lost) because it is shared with the bitline. But this is only temporary, since the cell's charge is restored as part of the next step, as described below.

• Activation (Sensing & Amplification). After allowing sufficient time for the charge sharing to occur, the sense-amplifier is turned on. Immediately, the sense-amplifier "senses" (i.e., observes) the polarity of the perturbation on the bitline voltage. Then the sense-amplifier "amplifies" the perturbation by injecting (or withdrawing) charge into (or from) both the cell capacitor and the bitline parasitic capacitor. After a latency of tRCD, midway through amplification, enough charge has been injected (or withdrawn) such that the bitline voltage reaches a threshold state of 0.75VDD (or 0.25VDD). At this point, data is considered to have been "copied" from the cell to the sense-amplifier (State 3). In other words, the bitline voltage is now close enough to VDD (or 0) for the sense-amplifier to detect a binary data value of '1' (or '0') and transfer the data to the I/O circuitry, allowing READ and WRITE commands to be issued. Eventually, the voltages of the bitline and the cell are fully amplified to VDD or 0 (State 4). Only at this point is the charge in the cell fully restored to its original value. The latency to reach this restored state (State 4) is tRAS (which is one component of tRC).
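To make the decomposition concrete, the short sketch below recomputes the unloaded and loaded access latencies from the DDR3-1066 timing values quoted above (and listed in Table 2). It is only an illustrative arithmetic check, not part of the proposed mechanism.

```python
# Access-latency decomposition of Sec 2.2, using DDR3-1066 values (ns).
tRCD, tRAS, tRP = 15.0, 37.5, 15.0
tCL, tBL = 15.0, 7.5
tRC = tRAS + tRP                       # row-cycle time = 52.5 ns

unloaded = tRCD + tCL + tBL            # first access: 37.5 ns
loaded = tRC + tRCD + tCL + tBL        # second access to a different row: 90 ns

print(f"unloaded = {unloaded} ns, subarray share = {tRCD / unloaded:.0%}")        # 40%
print(f"loaded   = {loaded} ns, subarray share = {(tRC + tRCD) / loaded:.0%}")    # 75%
```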

[Figure 5. Three Phases of DRAM Access — timelines for the subarray (1. Activation, 3. Precharging), the peripherals & I/O circuitry (2. I/O), and the bus, annotated with the ACT, READ, and PRE commands and the tRCD, tRAS, tRP, tCL, and tBL constraints; the first access latency is tRCD + tCL + tBL, and the second access latency additionally includes tRC (= tRAS + tRP).]

Table 2. Timing Constraints (DDR3-1066) [43]

  Phase   Commands            Name                 Value
  1       ACT → READ/WRITE    tRCD                 15ns
  1       ACT → PRE           tRAS                 37.5ns
  2       READ → data         tCL                  15ns
  2       WRITE → data        tCWL                 11.25ns
  2       data burst          tBL                  7.5ns
  3       PRE → ACT           tRP                  15ns
  1 & 3   ACT → ACT           tRC (= tRAS + tRP)   52.5ns

[Figure 6. Charge Flow Between the Cell Capacitor (CC), Bitline Parasitic Capacitor (CB), and the Sense-Amplifier (CB ≈ 3.5CC [39]) — States 1–5 of the bitline voltage: 1 Quiescent (0.5VDD), 2 after Charge Sharing (0.5VDD + δ), 3 Threshold (0.75VDD), 4 Restored/Fully Charged (VDD), 5 Precharged (0.5VDD); phases: 1. Activation (1.1 Charge Sharing, 1.2 Sensing & Amplification) and 3. Precharging; tRCD (15ns), tRAS (37.5ns), tRP (15ns), tRC (52.5ns).]

• Precharging. An access to a cell is terminated by turning off its access-transistor by a PRECHARGE. By doing so, the cell becomes decoupled from the bitline and is not affected by any future changes in the bitline voltage. The sense-amplifier then withdraws (or injects) charge from the bitline parasitic capacitor such that the bitline voltage reaches the quiescent value of 0.5VDD (State 5). Precharging is required to ensure that the next accessed cell can perturb the bitline voltage in either direction (towards VDD or towards 0). This would not be possible if the bitline were left unprecharged at VDD or 0. The latency to precharge the bitline is tRP (which is the other component of tRC).

Summary. Through our discussion, we have established a relationship between the subarray timing constraints (tRCD and tRC) and the bitline parasitic capacitance. To summarize, tRCD and tRC are determined by how quickly the bitline voltage can be driven – for tRCD, from 0.5VDD to 0.75VDD (threshold); for tRC, from 0.5VDD to VDD (restored) and back again to 0.5VDD. In turn, the drivability of the bitline is determined by the bitline parasitic capacitance, whose value is a function of the bitline length, as we discuss next.

3 Motivation: Short vs. Long Bitlines

The key parameter in the design of a DRAM subarray is the number of DRAM cells connected to each bitline (cells-per-bitline) – i.e., the number of DRAM rows in a subarray. This number directly affects the length of the bitline, which in turn affects both the access latency and the area of the subarray. As we describe in this section, the choice of the number of cells-per-bitline presents a crucial trade-off between the DRAM access latency and the DRAM die-size.

3.1 Latency Impact of Cells-per-Bitline

Every bitline has an associated parasitic capacitance whose value is proportional to the length of the bitline. This parasitic capacitance increases the latencies of the subarray operations we discussed in Fig 6: i) charge sharing, ii) sensing & amplification, and iii) precharging.

First, the bitline capacitance determines the bitline voltage after charge sharing. The larger the bitline capacitance, the closer its voltage will be to 0.5VDD after charge sharing. Although this does not significantly impact the latency of charge sharing, it causes the sense-amplifier to take longer to amplify the voltage to the final restored value (VDD or 0).

Second, in order to amplify the voltage perturbation on the bitline, the sense-amplifier injects (or withdraws) charge into (or from) both the cell and the bitline. Since the sense-amplifier can do so only at a fixed rate, the aggregate capacitance of the cell and the bitline determines how fast the bitline voltage reaches the threshold and the restored states (States 3 and 4, Fig 6). A long bitline, which has a large parasitic capacitance, slows down the bitline voltage from reaching these states, thereby lengthening both tRCD and tRAS, respectively.

Third, to precharge the bitline, the sense-amplifier drives the bitline voltage to the quiescent value of 0.5VDD. Again, a long bitline with a large capacitance is driven more slowly and hence has a large tRP.
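The effect of bitline capacitance on the charge-sharing perturbation can be illustrated with the standard charge-conservation relation. The CB ≈ 3.5·CC ratio is taken from Figure 6; treating the cell as ideally charged to VDD and the bitline as ideally precharged to 0.5VDD is a simplifying assumption of this sketch, not the paper's detailed circuit model.

```python
# Illustrative charge-sharing estimate (not the paper's SPICE model):
# a cell storing '1' at VDD shares charge with a bitline precharged to VDD/2.
def bitline_voltage_after_sharing(c_cell, c_bitline, vdd=1.0):
    # Charge conservation: Cc*VDD + Cb*(VDD/2) = (Cc + Cb) * V_final
    return (c_cell * vdd + c_bitline * vdd / 2) / (c_cell + c_bitline)

CC = 1.0
for ratio in (3.5, 7.0):   # CB ≈ 3.5*CC per Fig 6; a 2x longer bitline doubles CB
    v = bitline_voltage_after_sharing(CC, ratio * CC)
    print(f"CB = {ratio:.1f}*CC -> bitline at {v:.3f}*VDD "
          f"(perturbation {v - 0.5:+.3f}*VDD)")
# A larger CB leaves the bitline closer to 0.5*VDD after charge sharing, so the
# sense-amplifier must inject more charge to reach the 0.75*VDD threshold and
# the VDD restored states -- lengthening tRCD and tRAS as described above.
```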

3.2 Die-Size Impact of Cells-per-Bitline

Since each cell on a bitline belongs to a row of cells (spanning horizontally across multiple bitlines), the number of cells-per-bitline in a subarray is equal to the number of rows-per-subarray. Therefore, for a DRAM chip with a given capacity (i.e., fixed total number of rows), one can either have many subarrays with short bitlines (Fig 2a) or few subarrays with long bitlines (Fig 2b). However, since each subarray requires its own set of sense-amplifiers, the size of the DRAM chip increases along with the number of subarrays. As a result, for a given DRAM capacity, its die-size is inversely proportional to the number of cells-per-bitline (as a first-order approximation).

3.3 Trade-Off: Latency vs. Die-Size

From the above discussion, it is clear that a short bitline (fewer cells-per-bitline) has the benefit of lower subarray latency, but incurs a large die-size overhead due to additional sense-amplifiers. On the other hand, a long bitline (more cells-per-bitline), as a result of the large bitline capacitance, incurs high subarray latency, but has the benefit of reduced die-size overhead. To study this trade-off quantitatively, we ran transistor-level circuit simulations based on a publicly available 55nm DRAM process technology [39]. Fig 7 shows the results of these simulations. Specifically, the figure shows the latency (tRCD and tRC) and the die-size for different values of cells-per-bitline. The figure clearly shows the above-described trade-off between DRAM access latency and DRAM die-size.

[Figure 7. Bitline Length: Latency vs. Die-Size — normalized die-size vs. latency (ns, tRCD and tRC) for 16 (492mm²), 32 (276), 64 (168), 128 (114), 256 (87), and 512 (73.5) cells-per-bitline, with RLDRAM, FCRAM, and DDR3 marked for reference; shorter bitlines are faster but larger (lower-left is cheaper and faster). † RLDRAM bitline length is estimated from its latency and die-size [21, 27]. † The reference DRAM is 55nm 2Gb DDR3 [39].]

As Fig 7 shows, existing DRAM architectures are either optimized for die-size (commodity DDR3 [43, 30]) and are thus low cost but high latency, or optimized for latency (RLDRAM [27], FCRAM [45]) and are thus low latency but high cost. Our goal is to design a DRAM architecture that achieves the best of both worlds – i.e., low access latency and low cost.
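As a back-of-the-envelope check of the die-size trend in Fig 7 (and the normalized die-size row of Table 1), the sketch below applies the paper's own first-order approximation from Sec 3.2, using the sense-amplifier height relative to a cell that is reported later in Sec 4.3 (115.2 cell heights, from the power model [39]). It is an illustrative model, not the authors' area methodology.

```python
# First-order per-bit area model: area/bit ~ (sense-amp height + cells) / cells.
SENSE_AMP_HEIGHT = 115.2   # in units of one cell height (Sec 4.3, [39])

def relative_area_per_bit(cells_per_bitline):
    return (SENSE_AMP_HEIGHT + cells_per_bitline) / cells_per_bitline

ref = relative_area_per_bit(512)   # 512-cell commodity baseline
for n in (16, 32, 64, 128, 256, 512):
    print(f"{n:3d} cells/bitline -> {relative_area_per_bit(n) / ref:.2f}x die-size")
# 32 cells/bitline comes out at ~3.76x the 512-cell baseline, consistent with the
# "Highest (3.76)" entry in Table 1 and the 276 vs. 73.5 mm^2 points in Fig 7.
```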

4 Tiered-Latency DRAM (TL-DRAM)

To obtain both the latency advantages of short bitlines and the cost advantages of long bitlines, we propose the Tiered-Latency DRAM (TL-DRAM) architecture, as shown in Fig 8. The key idea of TL-DRAM is to introduce an isolation transistor that divides a long bitline into two segments: the near segment, connected directly to the sense-amplifier, and the far segment, connected through the isolation transistor. Unless otherwise stated, throughout the following discussion, we assume, without loss of generality, that the isolation transistor divides the bitline such that the lengths of the near and far segments are 128 cells and 384 cells (= 512 − 128), respectively. (Sec 4.1 discusses the latency sensitivity to the segment lengths.)

[Figure 8. TL-DRAM: Near vs. Far Segments — the far segment, the isolation transistor, the near segment, and the sense-amps on one bitline.]

4.1 Latency Analysis (Overview)

The primary role of the isolation transistor is to electrically decouple the two segments from each other. As a result, the effective bitline length (and also the effective bitline capacitance) as seen by the cell and sense-amplifier is changed. Correspondingly, the latency to access a cell is also changed – albeit differently depending on whether the cell is in the near or the far segment (Table 3),⁴ as will be explained next.

4 Note that the latencies shown in Table 3 differ from those shown in Table 1 due to differing near segment lengths.

Table 3. Segmented Bitline: Effect on Latency

          Near Segment (128 cells)     Far Segment (384 cells)
  tRCD    Reduced (15ns → 9.3ns)       Reduced (15ns → 13.2ns)
  tRC     Reduced (52.5ns → 27.8ns)    Increased (52.5ns → 64.1ns)

Near Segment. When accessing a cell in the near segment, the isolation transistor is turned off, disconnecting the far segment (Fig 9a). Since the cell and the sense-amplifier see only the reduced bitline capacitance of the shortened near segment, they can drive the bitline voltage more easily. In other words, for the same amount of charge that the cell or the sense-amplifier injects into the bitline, the bitline voltage is higher for the shortened near segment compared to a long bitline. As a result, the bitline voltage reaches the threshold and restored states (Fig 6, Sec 2.3) more quickly, such that tRCD and tRAS for the near segment are significantly reduced. Similarly, the bitline can be precharged to 0.5VDD more quickly, leading to a reduced tRP. Since tRC is defined as the sum of tRAS and tRP (Sec 2.2), tRC is reduced as well.

[Figure 9. Circuit Model of Segmented Bitline — (a) Near Segment Access: isolation transistor off, so the cell capacitor (CC) and sense-amplifier see only the near-segment bitline capacitance; (b) Far Segment Access: isolation transistor on, connecting both the near- and far-segment bitline capacitances through the transistor's resistance.]

Far Segment. On the other hand, when accessing a cell in the far segment, the isolation transistor is turned on to connect the entire length of the bitline to the sense-amplifier. In this case, however, the isolation transistor acts like a resistor inserted between the two segments (Fig 9b).
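A memory controller that is aware of TL-DRAM would keep segment-dependent timing parameters such as those in Table 3. The lookup below is only an illustrative sketch of such a timing table (the function names are ours, not an interface defined by the paper), using the 128/384-cell split of Table 3.

```python
# Segment-dependent timings from Table 3 (128-cell near / 384-cell far split), as
# a controller timing model might store them. Illustrative sketch only.
TIMING_NS = {
    "long_bitline": {"tRCD": 15.0, "tRC": 52.5},   # unsegmented 512-cell baseline
    "near":         {"tRCD": 9.3,  "tRC": 27.8},
    "far":          {"tRCD": 13.2, "tRC": 64.1},
}

def activate_to_read_delay(segment):
    """Minimum ACTIVATE -> READ/WRITE delay for a row in the given segment."""
    return TIMING_NS[segment]["tRCD"]

def activate_to_activate_delay(segment):
    """Minimum ACTIVATE -> ACTIVATE delay for back-to-back rows in one subarray."""
    return TIMING_NS[segment]["tRC"]

print(activate_to_read_delay("far"))      # 13.2 ns: lower than the 15 ns baseline
print(activate_to_activate_delay("far"))  # 64.1 ns: higher than the 52.5 ns baseline
```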

[Figure 10. Latency Analysis — (a) tRCD, tRAS, and tRP of a cell in the near segment as the near segment length varies from 1 to 512 cells; (b) the same latencies for a cell in the far segment as the far segment length varies from 511 down to 256 cells; the rightmost bars are an unsegmented 512-cell bitline (Ref.).]

[Figure 11. Activation: Bitline Voltage — voltage vs. time (0–40ns) for (a) a cell in the near segment (128 cells) and (b) a cell in the far segment (384 cells), compared against unsegmented short and long bitlines, with the 0.5VDD, 0.75VDD, and VDD levels marked.]

[Figure 12. Precharging — bitline voltage vs. time (0–20ns) for (a) a cell in the near segment and (b) a cell in the far segment, compared against unsegmented short and long bitlines.]

When the sense-amplifier is turned on during activation (as explained in Sec 2.3), it begins to drive charge onto the bitline. In commodity DRAM, this charge is spread across the large capacitance of the entire long bitline. However, in TL-DRAM, the resistance of the isolation transistor limits how quickly charge flows to the far segment, such that the reduced capacitance of the shortened near segment is charged more quickly than that of the far segment. This has two key consequences. First, because the near segment capacitance is charged quickly, the near segment voltage rises more quickly than the bitline voltage in commodity DRAM. As a result, the near segment voltage more quickly reaches 0.75VDD (threshold state, Sec 2.3) and, correspondingly, the sense-amplifier more quickly detects the binary data value of '1' that was stored in the far segment cell. That is why tRCD is lower in TL-DRAM than in commodity DRAM even for the far segment. Second, because the far segment capacitance is charged more slowly, it takes longer for the far segment voltage — and hence the cell voltage — to be restored to VDD or 0. Since tRAS is the latency to reach the restored state (Sec 2.3), tRAS is increased for cells in the far segment. Similarly, during precharging, the far segment voltage reaches 0.5VDD more slowly, for an increased tRP. Since tRAS and tRP both increase, their sum tRC also increases.

Sensitivity to Segment Length. The lengths of the two segments are determined by where the isolation transistor is placed on the bitline. Assuming that the number of cells per bitline is fixed at 512 cells, the near segment length can range from as short as a single cell to as long as 511 cells. Based on our circuit simulations, Fig 10a and Fig 10b plot the latencies of the near and far segments as a function of their length, respectively. For reference, the rightmost bars in each figure are the latencies of an unsegmented long bitline whose length is 512 cells. From these figures, we draw three conclusions. First, the shorter the near segment, the lower its latencies (tRCD and tRC). This is expected since a shorter near segment has a lower effective bitline capacitance, allowing it to be driven to target voltages more quickly. Second, the longer the far segment, the lower the far segment's tRCD. Recall from our previous discussion that the far segment's tRCD depends on how quickly the near segment (not the far segment) can be driven. A longer far segment implies a shorter near segment (lower capacitance), and that is why tRCD of the far segment decreases. Third, the shorter the far segment, the smaller its tRC. The far segment's tRC is determined by how quickly it reaches the full voltage (VDD or 0). Regardless of the length of the far segment, the current that trickles into it through the isolation transistor does not change significantly. Therefore, a shorter far segment (lower capacitance) reaches the full voltage more quickly.

4.2 Latency Analysis (Circuit Evaluation)

We model TL-DRAM in detail using SPICE simulations. Simulation parameters are mostly derived from a publicly available 55nm DDR3 2Gb process technology file [39], which includes information such as cell and bitline capacitances and resistances, physical floorplanning, and transistor dimensions. Transistor device characteristics were derived from [33] and scaled to agree with [39].

Fig 11 and Fig 12 show the bitline voltages during activation and precharging, respectively. The x-axis origin (time 0) in the two figures corresponds to when the subarray receives the ACTIVATE or the PRECHARGE command, respectively. In addition to the voltages of the segmented bitline (near and far segments), the figures also show the voltages of two unsegmented bitlines (short and long) for reference.

Activation (Fig 11). First, during an access to a cell in the near segment (Fig 11a), the far segment is disconnected and is floating (hence its voltage is not shown). Due to the reduced bitline capacitance of the near segment, its voltage increases almost as quickly as the voltage of a short bitline (the two curves are overlapped) during both charge sharing and sensing & amplification. Since the near segment voltage reaches 0.75VDD and VDD (the threshold and restored states) quickly, its tRCD and tRAS, respectively, are significantly reduced compared to a long bitline. Second, during an access to a cell in the far segment (Fig 11b), we can indeed verify that the voltages of the near and the far segments increase at different rates due to the resistance of the isolation transistor, as previously explained. Compared to a long bitline, while the near segment voltage reaches 0.75VDD more quickly, the far segment voltage reaches VDD more slowly. As a result, tRCD of the far segment is reduced while its tRAS is increased.

Precharging (Fig 12). While precharging the bitline after accessing a cell in the near segment (Fig 12a), the near segment reaches 0.5VDD quickly due to the smaller capacitance, almost as quickly as the short bitline (the two curves are overlapped). On the other hand, precharging the bitline after accessing a cell in the far segment (Fig 12b) takes longer compared to the long-bitline baseline. As a result, tRP is reduced for the near segment and increased for the far segment.
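The qualitative behavior in Figs 11–12 can be reproduced with a toy lumped-RC model: the sense-amplifier drives the near-segment node, and the far-segment node is coupled to it only through the isolation transistor's on-resistance. All component values below are invented for illustration and do not come from the 55nm technology file used in the paper's SPICE evaluation.

```python
# Toy lumped-RC model of activation in TL-DRAM (illustration only; R/C values are
# hypothetical, not taken from the 55nm technology file of Sec 4.2).
VDD = 1.0
R_SA, R_ISO = 1.0, 4.0       # hypothetical sense-amp drive / isolation-transistor resistance
C_NEAR, C_FAR = 1.0, 3.0     # hypothetical near/far segment capacitances

v_near = v_far = 0.5 * VDD   # both segments start precharged to VDD/2
dt, t = 1e-3, 0.0
t_threshold = t_restored = None
while t_restored is None:
    i_sa = (VDD - v_near) / R_SA        # charge injected by the sense-amplifier
    i_iso = (v_near - v_far) / R_ISO    # charge trickling into the far segment
    v_near += dt * (i_sa - i_iso) / C_NEAR
    v_far += dt * i_iso / C_FAR
    t += dt
    if t_threshold is None and v_near >= 0.75 * VDD:
        t_threshold = t                 # near segment hits the threshold state early
    if v_far >= 0.95 * VDD:
        t_restored = t                  # far-segment cell is (nearly) restored much later

print(f"threshold (tRCD-like) at t = {t_threshold:.2f}, "
      f"far segment restored (tRAS-like) at t = {t_restored:.2f}")
```

The near node reaches 0.75·VDD quickly (lower effective tRCD) while the far node approaches VDD slowly (higher effective tRAS), matching the trend the SPICE results show.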

4.3 Die-Size Analysis

Adding an isolation transistor to the bitline increases only the height of the subarray and not the width (Fig 8). Without the isolation transistor, the height of a baseline subarray is equal to the sum of the heights of the cells and the sense-amplifier. In the following analysis, we use values from the power model [39].⁵ The sense-amplifier and the isolation transistor are respectively 115.2x and 11.5x taller than an individual cell. For a subarray with 512 cells, the overhead of adding a single isolation transistor is 11.5 / (115.2 + 512) = 1.83%.

Until now we have assumed that all cells of a DRAM row are connected to the same row of sense-amplifiers. However, the sense-amplifiers are twice as wide as an individual cell. Therefore, in practice, only every other bitline (cell) is connected to a bottom row of sense-amplifiers. The remaining bitlines are connected to another row of sense-amplifiers belonging to the vertically adjacent (top) subarray. This allows for tighter packing of DRAM cells within a subarray. As a result, each subarray requires two sets of isolation transistors. Therefore, the increase in subarray area due to the two isolation transistors is 3.66%. Once we include the area of the peripheral and I/O circuitry, which does not change due to the addition of the isolation transistors, the resulting DRAM die-size area overhead is 3.15%.

5 We expect the values to be of similar orders of magnitude for other designs.
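The subarray-height arithmetic above can be checked directly; the sketch below reproduces the 1.83% and ~3.66% figures. The final whole-die number (3.15%) additionally depends on the fraction of the die occupied by unchanged peripheral and I/O circuitry, which the paper reports but this sketch does not derive.

```python
# Check of the Sec 4.3 subarray-area overhead (heights in units of one cell,
# per the 55nm power model [39]). Illustrative arithmetic only.
CELL_H, SENSE_AMP_H, ISO_TR_H = 1.0, 115.2, 11.5
CELLS_PER_BITLINE = 512

baseline_height = SENSE_AMP_H + CELLS_PER_BITLINE * CELL_H
one_transistor = ISO_TR_H / baseline_height        # ~1.83% taller subarray
two_transistors = 2 * ISO_TR_H / baseline_height   # ~3.66% (two sense-amp rows)

print(f"one isolation transistor:  +{one_transistor:.2%}")
print(f"two isolation transistors: +{two_transistors:.2%}")
# The reported whole-die overhead (3.15%) is lower than the subarray-only figure
# because peripheral/I/O area, which does not grow, dilutes the increase.
```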
4.4 Enabling Inter-Segment Data Transfer

One way of exploiting TL-DRAM's asymmetric latencies is to use the near segment as a cache to the far segment. Compared to the far segment, the near segment is smaller and faster. Therefore, frequently-used or latency-critical data can benefit significantly from being placed in the near segment as opposed to the far segment. The problem lies in enabling an efficient way of transferring (copying or migrating) data between the two segments. Unfortunately, in existing DRAM chips, even to transfer data from one row to another row within the same subarray, the data must be read out externally to the DRAM controller and then written back to the chip. This wastes significant amounts of power and bandwidth on the memory bus (that connects the DRAM chip to the DRAM controller) and incurs large latency.

In TL-DRAM, data transfer between the two segments occurs entirely within DRAM without involving the external memory bus. TL-DRAM leverages the fact that the bitline itself is essentially a bus that connects to all cells in both the near and far segments. As an example, let us assume that we are accessing a cell in the far segment (transfer source). When the bitline has reached the restored state (Sec 2.3), the data in the cell is fully copied onto the bitline. Normally, at this point, the DRAM controller would issue a PRECHARGE to clear the bitline voltage to 0.5VDD. Instead, TL-DRAM allows the DRAM controller to issue another ACTIVATE, this time to a cell in the near segment (transfer destination). Since the bitline is at a full voltage, the bitline drives the near segment cell so that data is copied into it. According to our simulations, writing into the destination cell takes about 4ns. In fact, the destination cell can be connected to the bitline even before the source cell has reached the restored state, thereby overlapping the copying latency with the tRAS latency. More generally, the source and the destination can be any two cells connected to the same bitline, regardless of which segment they lie on.

5 Leveraging the TL-DRAM Substrate

One simple way of leveraging the TL-DRAM substrate is to use the near segment as a hardware-managed cache for the far segment. In this approach, the memory controller does not expose the near segment capacity to the operating system (OS). While this approach reduces the overall available memory capacity, it keeps both the hardware and the software design simple. An alternative approach is to expose the near segment capacity to the OS. As we will describe in Section 5.2, effectively exploiting the TL-DRAM substrate using this alternate approach may slightly increase the complexity of the hardware or the software. We now describe our different mechanisms to leverage the TL-DRAM substrate.

5.1 Near Segment as an OS-Transparent Hardware-Managed Cache

We describe three different mechanisms that use the near segment as a hardware-managed cache to the far segment. In all three mechanisms, the memory controller tracks the rows in the far segment that are cached in the near segment (for each subarray). The three mechanisms differ in 1) when they cache a far-segment row in the near segment, and 2) when they evict a row from the near segment.

Mechanism 1: Simple Caching (SC). Our first mechanism, Simple Caching (SC), utilizes the near segment as an LRU cache to the far segment. Under this mechanism, the DRAM controller categorizes a DRAM access into one of three cases: i) sense-amplifier hit: the corresponding row is already activated; ii) near segment hit: the row is already cached in the near segment; and iii) near segment miss: the row is not cached in the near segment. In the first case, sense-amplifier hit (alternatively, row-buffer hit), the access is served directly from the row-buffer. Meanwhile, the LRU ordering of the rows cached in the near segment remains unaffected. In the second case, near segment hit, the DRAM controller quickly activates the row in the near segment, while also updating it as the MRU row. In the third case, near segment miss, the DRAM controller checks whether the LRU row (eviction candidate) in the near segment is dirty. If so, the LRU row must first be copied (or written back) to the far segment using the transfer mechanism described in Section 4.4. Otherwise, the DRAM controller directly activates the far segment row (that needs to be accessed), copies (or caches) it into the near segment, and updates it as the MRU row.

Mechanism 2: Wait-Minimized Caching (WMC). When two accesses to two different rows of a subarray arrive almost simultaneously, the first access delays the second access by a large amount, tRC. Since the first access causes the second access to wait for a long time, we refer to the first access as a wait-inducing access. Assuming both rows are in the far segment, the latency at the subarray experienced by the second access is tRCfar + tRCDfar (77.3ns). Such a large latency is mostly due to the wait caused by the first access, tRCfar (64.1ns). Hence, it is important for the second access that this wait be minimized, which can be achieved by caching the first accessed data in the near segment. By doing so, the wait is significantly reduced from tRCfar (64.1ns) to tRCnear (27.8ns). In contrast, caching the second row is not as useful, since it yields only a small latency reduction from tRCDfar (13.2ns) to tRCDnear (9.3ns).

Our second mechanism, Wait-Minimized Caching (WMC), caches only wait-inducing rows. These are rows that, while they are accessed, cause a large wait (tRCfar) for the next access to a different row. More specifically, a row in the far segment is classified as wait-inducing if the next access to a different row arrives while the row is still being activated. WMC operates similarly to our SC mechanism except for the following differences. First, WMC copies a row from the far segment to the near segment only if the row is wait-inducing. Second, instead of evicting the LRU row from the near segment, WMC evicts the least-recently wait-inducing row.

Third, when a row is accessed from the near segment, it is updated as the most-recently wait-inducing row only if the access would have caused the next access to wait had the row been in the far segment. The memory controller is augmented with appropriate structures to keep track of the necessary information to identify wait-inducing rows (details omitted due to space constraints).

Mechanism 3: Benefit-Based Caching (BBC). Accessing a row in the near segment provides two benefits compared to accessing a row in the far segment: 1) reduced tRCD (faster access) and 2) reduced tRC (lower wait time for the subsequent access). Simple Caching (SC) and Wait-Minimized Caching (WMC) take into account only one of the two benefits. Our third mechanism, Benefit-Based Caching (BBC), explicitly takes into account both benefits of caching a row in the near segment. More specifically, the memory controller keeps track of a benefit value for each row in the near segment. When a near segment row is accessed, its benefit is incremented by the number of DRAM cycles saved due to reduced access latency and reduced wait time for the subsequent access. When a far-segment row is accessed, it is immediately promoted to the near segment, replacing the near-segment row with the least benefit. To prevent benefit values from becoming stale, on every eviction, the benefit for every row is halved. (Implementation details are omitted due to space constraints.)

5.2 Exposing Near Segment Capacity to the OS

Our second approach to leverage the TL-DRAM substrate is to expose the near segment capacity to the operating system. Note that simply replacing the conventional DRAM with our proposed TL-DRAM can potentially improve system performance due to the reduced tRCDnear, tRCDfar, and tRCnear, while not reducing the available memory capacity. Although this mechanism incurs no additional complexity at the memory controller or the operating system, we find that the overall performance improvement due to this mechanism is low. To better exploit the benefits of the low access latency of the near segment, frequently accessed pages should be mapped to the near segment. This can be done by the hardware or by the OS. To this end, we describe two different mechanisms.

Exclusive Cache. In this mechanism, we use the near segment as an exclusive cache to the rows in the far segment. The memory controller uses one of the three mechanisms proposed in Section 5.1 to determine caching and eviction candidates. To cache a particular row, the data of that row is swapped with the data of the row to be evicted from the near segment. For this purpose, each subarray requires a dummy-row (D-row). To swap the data of the to-be-cached row (C-row) and the to-be-evicted row (E-row), the memory controller simply performs the following three migrations:

  C-row → D-row    E-row → C-row    D-row → E-row

The exclusive cache mechanism provides almost full main memory capacity (except one dummy row per subarray, < 0.2% loss in capacity) at the expense of two overheads. First, since row swapping changes the mappings of rows in both the near and the far segment, the memory controller must maintain the mapping of rows in both segments. Second, each swap requires three migrations, which increases the latency of caching.

Profile-Based Page Mapping. In this mechanism, the OS controls the virtual-to-physical mapping to map frequently accessed pages to the near segment. The OS needs to be informed of the bits in the physical address that control the near-segment/far-segment mapping. Information about frequently accessed pages can either be obtained statically using compiler-based profiling, or dynamically using hardware-based profiling. In our evaluations, we show the potential performance improvement due to this mechanism using hardware-based profiling. Compared to the exclusive caching mechanism, this approach requires much lower hardware storage overhead.

6 Implementation Details & Further Analysis

6.1 Near Segment Row Decoder Wiring

To avoid the need for a large, monolithic row address decoder at each subarray, DRAM makes use of predecoding. Outside the subarray, the row address is divided into M sets of bits. Each set of N bits is decoded into 2^N wires, referred to as N:2^N predecoding.⁶ The input to each row's wordline driver is then simply the logical AND of the M wires that correspond to the row's address. This allows the per-subarray row decoding logic to be simple and compact, at the cost of increasing the wiring overhead associated with row decoding at each subarray. As a result, wiring overhead dominates the cost of row decoding.

6 N may differ between sets.

The inter-segment data transfer mechanism described in Sec 4.4 requires up to two rows to be activated in the same subarray at once, necessitating a second row decoder. However, since one of the two activated rows is always in the near segment, this second row decoder only needs to address rows in the near segment. For a near segment with 32 rows, a scheme that splits the 5 (log2 32) near segment address bits into 3-bit and 2-bit sets requires 12 additional wires to be routed to each subarray (3:8 predecoding + 2:4 predecoding). The corresponding die-size penalty is 0.33%, calculated based on the total die-size and wire-pitch derived from Vogelsang [39, 54].

6.2 Additional Storage in DRAM Controller

The memory controller requires additional storage to keep track of the rows that are cached in the near segment. In the inclusive caching mechanisms, each subarray contains a near segment of length N, serving as an N-way (fully-associative) cache of the far segment of length F. Therefore, each near-segment row requires a ⌈log2 F⌉-bit tag. Hence, each subarray requires N⌈log2 F⌉ bits for tag storage, and a system with S subarrays requires SN⌈log2 F⌉ bits for tags. In the exclusive caching mechanism, row swapping can lead to any physical page within a subarray getting mapped to any row within the subarray. Hence, each row within the subarray requires the tag of the physical page whose data is stored in that row. Thus, each row in a subarray requires a ⌈log2(N + F)⌉-bit tag, and a system with S subarrays requires S(N + F)⌈log2(N + F)⌉ bits for tags. In the system configuration we evaluate (described in Sec 7), (N, F, S) = (32, 480, 256), and the tag storage overhead is 9 KB for inclusive caching and 144 KB for exclusive caching. SC and WMC additionally require log2 N bits per near segment row for replacement information (SN log2 N bits total), while our implementation of BBC uses an 8-bit benefit field per near segment row (8SN bits total). For our evaluated system, these are overheads of 5 KB and 8 KB, respectively.
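The storage overheads quoted in Sec 6.2 follow directly from the formulas above; a small sketch that reproduces them for the evaluated configuration is shown below (illustrative arithmetic only).

```python
# Controller storage overheads of Sec 6.2 for (N, F, S) = (32, 480, 256).
from math import ceil, log2

N, F, S = 32, 480, 256        # near rows, far rows per subarray, subarrays

inclusive_bits = S * N * ceil(log2(F))             # ceil(log2 F)-bit tag per near row
exclusive_bits = S * (N + F) * ceil(log2(N + F))   # ceil(log2(N+F))-bit tag per row
lru_bits = S * N * int(log2(N))                    # SC/WMC replacement information
benefit_bits = 8 * S * N                           # BBC 8-bit benefit field per near row

to_kb = lambda bits: bits / 8 / 1024
print(f"inclusive tags: {to_kb(inclusive_bits):.0f} KB")   # 9 KB
print(f"exclusive tags: {to_kb(exclusive_bits):.0f} KB")   # 144 KB
print(f"SC/WMC LRU:     {to_kb(lru_bits):.0f} KB")         # 5 KB
print(f"BBC benefit:    {to_kb(benefit_bits):.0f} KB")     # 8 KB
```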

6.3 Power Analysis

The non-I/O power consumption of the DRAM device can be broken into three dominant components: i) raising and lowering the wordline during ACTIVATE and PRECHARGE, ii) driving the bitline during ACTIVATE, and iii) transferring data from the sense-amplifiers to the peripherals. The first two of these components differ between a conventional DRAM and TL-DRAM, for two reasons:

Reduced Power Due to Reduced Bitline Capacitance in Near Segment. The energy required to restore a bitline is proportional to the bitline's capacitance. In TL-DRAM, the near segment has a lower capacitance than that of a conventional DRAM's bitline, resulting in decreased power consumption.

Additional Power Due to Isolation Transistors. The additional power required to control the isolation transistors when accessing the far segment is approximately twice that of raising the wordline, since raising the wordline requires driving one access transistor per bitline, while accessing the far segment requires driving two isolation transistors per bitline (Sec 4.3).

Using DRAM power models from Rambus and Micron [29, 39, 54], we estimate the power consumption of TL-DRAM and conventional DRAM in Figure 13. Note that, while the near segment's power consumption increases with the near segment length, the far segment's power does not change as long as the total bitline length is constant. Our evaluations in Sec 8 take these differences in power consumption into account.

[Figure 13. Power Consumption Analysis — normalized power of (a) ACTIVATE (wordline, bitline, and logic components) and (b) PRECHARGE (bitline precharging, logic, and control components) for near segment lengths of 1–256 cells, relative to a long-bitline reference (Ref.); F: far segment, I: inter-segment data transfer.]

6.4 More Tiers

Although we have described TL-DRAM with two segments per subarray, one can imagine adding more isolation transistors to divide a subarray into more tiers, each tier with its own latency and power consumption characteristics. As a case study, we evaluated the latency characteristics of TL-DRAM with three tiers per subarray (near/middle/far with 32/224/256 cells). The normalized tRCD and tRC of the near/middle/far segments are 54.8%/70.7%/104.1% and 44.0%/77.8%/156.9%, respectively. While adding more tiers can enable more fine-grained caching and partitioning mechanisms, they come with an additional area cost of 3.15% per additional tier, additional power consumption to control the multiple isolation transistors, and logic complexity in DRAM and the DRAM controller.

7 Evaluation Methodology

Simulators. We developed a cycle-accurate DDR3-SDRAM simulator that we validated against Micron's Verilog behavioral model [28] and DRAMSim2 [42]. We use this memory simulator as part of a cycle-level in-house x86 multi-core simulator, whose front-end is based on Pin [26].

System Configuration. The evaluated system is configured as shown in Table 4. Unless otherwise stated, the evaluated TL-DRAM has a near segment size of 32 rows.

Table 4. Evaluated System Configuration

  Processor          5.3 GHz, 3-wide issue, 8 MSHRs/core, 128-entry instruction window
  Last-Level Cache   64B cache line, 16-way associative, 512kB private cache slice per core
  Memory Controller  64/64-entry read/write request queues, row-interleaved mapping,
                     closed-page policy, FR-FCFS scheduling [41]
  DRAM               2GB DDR3-1066, 1/2/4 channels (at 1/2/4 cores), 1 rank/channel,
                     8 banks/rank, 32 subarrays/bank, 512 rows/bitline;
                     tRCD (unsegmented): 15.0ns, tRC (unsegmented): 52.5ns
  TL-DRAM            32 rows/near segment, 480 rows/far segment;
                     tRCD (near/far): 8.2/12.1ns, tRC (near/far): 23.1/65.8ns

Parameters. DRAM latency is as calculated in Sec 4.2. DRAM dynamic energy consumption is evaluated by associating an energy cost with each DRAM command, derived using the tools [29, 39, 54] and the methodology given in Sec 6.3.

Benchmarks. We use 32 benchmarks from SPEC CPU2006, TPC [52], STREAM [1] and a random microbenchmark similar in behavior to GUPS [14]. We classify benchmarks whose performance is significantly affected by near segment size as sensitive, and all other benchmarks as non-sensitive. For single-core sensitivity studies, we report results that are averaged across all 32 benchmarks. We also present multi-programmed multi-core evaluations in Sec 8.4. For each multi-core workload group, we report results averaged across 16 workloads, generated by randomly selecting from specific benchmark groups (sensitive or non-sensitive).

Simulation and Evaluation. We simulate each benchmark for 100 million instructions. For multi-core evaluations, we ensure that even the slowest core executes 100 million instructions, while other cores continue to exert pressure on the memory subsystem. To measure performance, we use instruction throughput (IPC) for single-core systems and weighted speedup [49] for multi-core systems.
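Weighted speedup [49] is the standard multi-programmed metric: each program's IPC when co-running is normalized to its IPC when running alone, and the per-program ratios are summed. A minimal sketch is shown below; the IPC values are placeholders, not measured results.

```python
# Minimal sketch of the multi-core performance metric used in Sec 7:
# weighted speedup [49] = sum over cores of IPC_shared / IPC_alone.
def weighted_speedup(ipc_shared, ipc_alone):
    assert len(ipc_shared) == len(ipc_alone)
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# Hypothetical 4-core workload (placeholder IPC values, for illustration only).
print(weighted_speedup(ipc_shared=[0.8, 1.1, 0.5, 0.9],
                       ipc_alone=[1.0, 1.4, 0.9, 1.2]))
```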

8 Results

8.1 Single-Core Results: Inclusive Cache

Fig 14 compares our proposed TL-DRAM based mechanisms (SC, WMC, and BBC) to the baseline with conventional DRAM. For each benchmark, the figure plots four metrics: i) the performance improvement of the TL-DRAM based mechanisms over the baseline, ii) the number of misses per instruction in the last-level cache, iii) the fraction of accesses that are served at the row buffer, the near segment, and the far segment under each of the three proposed mechanisms, and iv) the power consumption of the TL-DRAM based mechanisms relative to the baseline. We draw three conclusions from the figure.

First, for most benchmarks, all of our proposed mechanisms improve performance significantly compared to the baseline. On average, SC, WMC and BBC improve performance by 12.3%, 11.3% and 12.8%, respectively. As expected, performance benefits are highest for benchmarks with high memory intensity (MPKI: Last-Level Cache Misses-Per-Kilo-Instruction) and high near-segment hit rate.

Second, all three of our proposed mechanisms perform similarly for most benchmarks. However, there are a few benchmarks where WMC significantly degrades performance compared to SC. This is because WMC only caches wait-inducing rows, ignoring rows with high reuse that cause few conflicts. BBC, which takes both reuse and wait into account, outperforms both SC and WMC. For example, BBC significantly improves performance compared to SC (> 8% for omnetpp, > 5% for xalancbmk) by reducing wait, and compared to WMC (> 9% for omnetpp, > 2% for xalancbmk) by providing more reuse.

Third, BBC degrades performance only for the microbenchmark random. This benchmark has high memory intensity (MPKI = 40) and very low reuse (near-segment hit rate < 10%). These two factors together result in frequent bank conflicts in the far segment. As a result, most requests experience the full penalty of the far segment's longer tRC. We analyze this microbenchmark in detail in Sec 8.2.


Figure 14. Single-core: IPC improvement, LLC MPKI, fraction of accesses serviced at the row buffer/near segment/far segment, and power consumption, shown per benchmark for SC, WMC and BBC.
Power Analysis. As described in Sec 6.3, power consumption is significantly lower for near-segment accesses, but higher for far-segment accesses, compared to accesses in conventional DRAM. As a result, TL-DRAM achieves significant power savings when most accesses are to the near segment. The bottom-most plot of Fig 14 compares the power consumption of our TL-DRAM-based mechanisms to that of the baseline. Our mechanisms produce significant power savings (23.6% for BBC vs. baseline) on average, since over 90% of accesses hit in the row buffer or near segment for most benchmarks.
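The following back-of-the-envelope sketch shows how an access breakdown like the one in Fig 14 translates into average dynamic energy. The per-access energy ratios are placeholder assumptions for illustration only; the actual values are derived with the methodology of Sec 6.3, and background power is ignored.

    # Rough estimate of average dynamic energy per access from the access breakdown.
    # Energy ratios are assumed placeholders (relative to a conventional array access);
    # they are not the values derived from [29, 39, 54].

    def relative_dynamic_energy(frac_row_buffer, frac_near, frac_far,
                                e_hit=0.3, e_conv=1.0, e_near=0.8, e_far=1.3):
        tl_dram  = frac_row_buffer * e_hit + frac_near * e_near + frac_far * e_far
        baseline = frac_row_buffer * e_hit + (frac_near + frac_far) * e_conv
        return tl_dram / baseline

    # With over 90% of accesses hitting in the row buffer or the near segment, the
    # weighted energy of TL-DRAM falls clearly below the conventional-DRAM baseline.
    print(relative_dynamic_energy(frac_row_buffer=0.45, frac_near=0.47, frac_far=0.08))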

8.2 Effect of Far Segment Latency: Inclusive Cache
As we describe in Sec 4.1, the tRCD of the far segment is lower than the tRCD of conventional DRAM. Therefore, even if most of the accesses are to the far segment, TL-DRAM can still improve performance as long as the accesses are sufficiently far apart in time that tRC is not a bottleneck. We study this effect using our random microbenchmark, which has very little data reuse and hence usually accesses the far segment. Fig 15 shows the performance improvement of our proposed mechanisms compared to the baseline as we vary the memory intensity (MPKI) of random. As the figure shows, when the memory intensity is low (< 2), the reduced tRCD of the far segment dominates, and our mechanisms improve performance by up to 5%. However, further increasing the intensity of random makes tRC the main bottleneck, as evidenced by the performance degradation caused by our proposed mechanisms.
Figure 15. TL-DRAM performance on the benchmark random: IPC improvement of SC, WMC and BBC as the MPKI of random is varied from 0.125 to 16.
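The toy timing calculation below illustrates why low intensity favors the far segment's reduced tRCD while high intensity exposes its longer tRC. The timing values are assumed round numbers, not the parameters computed in Sec 4.2, and the model ignores all other timing constraints.

    # Toy model of far-segment access latency under low vs. high memory intensity.
    # All timing values (in ns) are assumed round numbers for illustration only.

    T_RCD_CONV, T_RC_CONV = 14.0, 49.0  # assumed conventional DRAM timings
    T_RCD_FAR,  T_RC_FAR  = 12.0, 65.0  # assumed far segment: lower tRCD, higher tRC

    def effective_latency(gap_ns, t_rcd, t_rc):
        # A request that conflicts with the previous request to the same bank must
        # wait out the remainder of tRC before its activation, then pay tRCD.
        return max(0.0, t_rc - gap_ns) + t_rcd

    # Requests far apart (low MPKI): tRC never limits, so the far segment's lower
    # tRCD wins (12 ns vs. 14 ns).
    print(effective_latency(200.0, T_RCD_FAR, T_RC_FAR),
          effective_latency(200.0, T_RCD_CONV, T_RC_CONV))

    # Back-to-back conflicting requests (high MPKI): the longer far-segment tRC
    # dominates (65 + 12 = 77 ns vs. 49 + 14 = 63 ns).
    print(effective_latency(0.0, T_RCD_FAR, T_RC_FAR),
          effective_latency(0.0, T_RCD_CONV, T_RC_CONV))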

8.3 Sensitivity Results: Inclusive Cache
Sensitivity to Near Segment Capacity. The number of rows in each near segment presents a trade-off: increasing the near segment's size increases its capacity but also increases its access latency. Fig 16 shows the average performance improvement of our proposed mechanisms over the baseline as we vary the near segment size. As expected, performance initially improves as the number of rows in the near segment is increased, due to the increased near segment capacity. However, increasing the number of rows per near segment beyond 32 reduces the performance benefits due to the increased near segment access latency (a simple latency model illustrating this trade-off is sketched at the end of this subsection).
Figure 16. Varying Near Segment Capacity (Inclusive Cache): average IPC improvement of SC, WMC and BBC for near segment lengths of 1 to 256 cells.
Sensitivity to Channel Count. For 1-core systems with 1, 2 and 4 memory channels, TL-DRAM provides 12.8%, 13.8% and 14.2% performance improvement, respectively, compared to the baseline. The performance improvement of TL-DRAM increases with an increasing number of channels. This is because, with more channels, the negative impact of channel conflicts is reduced and bank access latency becomes the primary bottleneck. Therefore, TL-DRAM, which reduces the average bank access latency, provides greater benefit with more channels.
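The sketch below is a minimal latency model of the near segment capacity trade-off. The hit-rate curve, the latency growth with segment length, and all constants are illustrative assumptions, not our simulator's parameters.

    # Toy model of the near segment capacity trade-off (Sec 8.3): more rows raise
    # the near-segment hit rate but also its access latency. All curves and
    # constants are illustrative assumptions.

    def near_hit_rate(rows):
        # Assumed diminishing-returns hit-rate curve (grows roughly logarithmically).
        return min(0.95, 0.35 + 0.1 * rows.bit_length())

    def near_latency(rows, base=20.0, per_row=0.08):
        # Assumed linear growth of near-segment access latency with segment length.
        return base + per_row * rows

    FAR_LATENCY = 55.0  # assumed, treated as constant

    def avg_latency(rows):
        h = near_hit_rate(rows)
        return h * near_latency(rows) + (1.0 - h) * FAR_LATENCY

    for rows in (8, 16, 32, 64, 128, 256):
        print(rows, round(avg_latency(rows), 1))
    # The average latency first drops as the hit rate grows, then rises once the
    # added near-segment latency outweighs the benefit of the extra capacity.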

8.4 Multi-Core Results: Inclusive Cache
Fig 17 shows the system performance improvement of our proposed mechanisms compared to the baseline for three different workload categories. As expected, when both benchmarks are sensitive to near segment capacity (Fig 17a), the improvement in weighted speedup increases with increasing near segment capacity; when neither benchmark is sensitive to the near segment capacity, the weighted speedup improvement decreases with increasing near segment capacity due to the increased near segment access latency (Fig 17c).

Figure 17. System Performance: 2-core, Inclusive Cache. Weighted speedup improvement of SC, WMC and BBC over the baseline as the near segment length is varied from 16 to 256 cells, for (a) Sensitive - Sensitive, (b) Sensitive - Non-Sensitive, and (c) Non-Sensitive - Non-Sensitive workload pairs.


Figure 18. Inclusive Cache Analysis (BBC): performance improvement and power reduction for 1-, 2- and 4-core systems with 1, 2 and 4 memory channels, respectively.
For almost all workload categories and near segment capacities, BBC performs comparably to or significantly better than SC, emphasizing the advantages of our benefit-based near segment management policy. As shown in Fig 18, on average, BBC improves weighted speedup by 12.3% and reduces power consumption by 26.4% compared to the baseline. We observe similar trends on 4-core systems, where BBC improves weighted speedup by 11.0% and reduces power consumption by 28.6% on average.
8.5 Exclusive Cache
Figure 19. Exclusive Cache Analysis (WMC): performance improvement and power reduction for 1-, 2- and 4-core systems with 1, 2 and 4 memory channels, respectively.
Fig 19 shows the performance improvement of TL-DRAM with the exclusive caching mechanism (Sec 5.2) and 32 rows in the near segment over the baseline. For 1-/2-/4-core systems, TL-DRAM with exclusive caching improves performance by 7.9%/8.2%/9.9% and reduces power consumption by 9.4%/11.8%/14.3%, respectively. The performance improvement due to exclusive caching is lower than that of inclusive caching because of the increased caching latency, as explained in Sec 5.2. Unlike inclusive caching, where BBC outperforms SC and WMC, WMC performs best for exclusive caching. This is because, unlike BBC and SC, which cache any row that is accessed in the far segment, WMC caches a row only if it is wait-inducing. As a result, WMC performs fewer caching operations and thus reduces the bank unavailability caused by the increased caching latency.
8.6 Profile-Based Page Mapping
Figure 20. Profile-Based Page Mapping: performance improvement and power reduction for 1-, 2- and 4-core systems with 1, 2 and 4 memory channels, respectively.
Fig 20 shows the performance improvement of TL-DRAM with profile-based page mapping over the baseline. The evaluated memory subsystem has 64 rows in the near segment. Therefore, the top 64 most frequently accessed rows in each subarray, as determined by a profiling run, are allocated in the near segment. For 1-/2-/4-core systems, TL-DRAM with profile-based page mapping improves performance by 8.9%/11.6%/7.2% and reduces power consumption by 19.2%/24.8%/21.4%, respectively. These results indicate a significant potential for such a profiling-based mapping mechanism. We leave a more rigorous evaluation of such a mechanism to future work.
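A minimal sketch of such a profiling-based mapping follows. The profile format and the helper are hypothetical, and the OS page-table and data-migration machinery needed to actually apply the mapping is omitted.

    # Sketch of profile-based page mapping (Sec 8.6): after a profiling run, allocate
    # the most frequently accessed rows of each subarray in that subarray's near
    # segment. The access-count format below is a hypothetical stand-in.
    from collections import Counter

    NEAR_SEGMENT_ROWS = 64  # rows per near segment in the evaluated configuration

    def build_row_mapping(access_counts_per_subarray):
        """access_counts_per_subarray: {subarray_id: Counter({row_id: count})}.
        Returns {subarray_id: set(row_ids)} of rows to place in the near segment."""
        mapping = {}
        for subarray, counts in access_counts_per_subarray.items():
            hottest = [row for row, _ in counts.most_common(NEAR_SEGMENT_ROWS)]
            mapping[subarray] = set(hottest)
        return mapping

    # Toy profile: rows 5 and 12 of subarray 0 are the hottest and would be
    # allocated in its near segment (up to 64 rows per subarray).
    profile = {0: Counter({5: 900, 12: 750, 301: 40, 77: 3})}
    print(build_row_mapping(profile))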
9 Related Work
Short Bitlines. Some specialized DRAMs reduce access latency by reducing the number of cells per bitline, e.g., Micron's RLDRAM [27] and Fujitsu's FCRAM [45]. However, as discussed in Secs 1 and 3, this requires the addition of more sense-amplifiers, incurring an area overhead of 30-80% [21, 45]. This results in significantly higher cost-per-bit.
Cached DRAM. Cached DRAM [10, 12, 13, 15, 20, 34, 44, 57, 59] adds an SRAM cache to DRAM chips. However, SRAM-cached DRAM approaches have two major limitations. First, an SRAM cache incurs significant area overhead; using CACTI-D [51], we estimate that an SRAM-cached DRAM with capacity equivalent to that of a TL-DRAM with 32 rows per near segment would incur a 145.3% area overhead. Second, transferring data between the DRAM array and the SRAM cache requires use of the relatively narrow global I/O bus within the DRAM chip, leading to high caching latency. In contrast, TL-DRAM incurs a minimal area overhead of 3.15% (Sec 4.3) and facilitates fast in-DRAM transfer of data between segments (Sec 4.4). With the same amount of cache capacity, the performance improvement of Cached DRAM (8.3%) is lower than that of TL-DRAM (12.8%) for 1-core systems, primarily due to the large caching latency incurred by Cached DRAM.
Increased DRAM Parallelism. Kim et al. [24] propose schemes to parallelize accesses to different subarrays within a bank, thereby overlapping their latencies. Multiple works [2, 55, 60] have proposed partitioning a DRAM rank into multiple independent rank subsets that can be accessed in parallel [3]. All of these proposals reduce the frequency of row-buffer conflicts, but not their latency, and can be applied in conjunction with our TL-DRAM substrate.
DRAM Controller Optimizations. Sudan et al. [50] propose a mechanism to co-locate heavily reused data in the same row, with the goal of improving row-buffer locality. A large body of prior work has explored DRAM access scheduling in the controller (e.g., [4, 9, 22, 23, 31, 32]) to mitigate inter-application interference and queueing latencies in multi-core systems. Our TL-DRAM substrate is orthogonal to all of these approaches.
Segmented Bitlines in Other Technologies. Prior works [11, 40, 48] have proposed the use of segmented bitlines in other memory technologies, such as SRAM caches and Flash memory. In contrast, this is, to our knowledge, the first work to propose the use of segmented bitlines in DRAM. Our approach and the resulting trade-offs are different, as we take into account characteristics and operation that are unique to DRAM, such as the DRAM-specific subarray organization, sensing mechanisms, and timing constraints.
Caching and Paging Techniques. We leverage our TL-DRAM substrate to cache data in the near segment. Many previous works (e.g., [8, 17, 18, 36, 37, 38, 46]) have proposed sophisticated cache management policies in the context of processor SRAM caches. These techniques are potentially applicable to TL-DRAM to manage which rows get cached in the near segment. The TL-DRAM substrate also allows the operating system to exploit the asymmetric latencies of the near and far segments in software, through intelligent page placement and migration techniques [7, 53]. We leave the investigation of such caching and paging techniques on TL-DRAM to future work.

10 Conclusion
Existing DRAM architectures present a trade-off between cost-per-bit and access latency. One can either achieve low cost-per-bit using long bitlines or low access latency using short bitlines, but not both. We introduced Tiered-Latency DRAM (TL-DRAM), a DRAM architecture that provides both low latency (in the common case) and low cost-per-bit. The key idea behind TL-DRAM is to segment a long bitline using an isolation transistor, creating a segment of rows with low access latency while keeping cost-per-bit on par with commodity DRAM.
We presented mechanisms that take advantage of our TL-DRAM substrate by using its low-latency segment as a hardware-managed cache. Our most sophisticated cache management algorithm, Benefit-Based Caching (BBC), selects rows to cache that maximize access latency savings. We show that our proposed techniques significantly improve both system performance and energy efficiency across a variety of systems and workloads. We conclude that TL-DRAM provides a promising low-latency and low-cost substrate for building main memories, on top of which existing and new caching and page allocation mechanisms can be implemented to provide even higher performance and higher energy efficiency.

Acknowledgments
Many thanks to Uksong Kang, Hak-soo Yu, Churoo Park, Jung-Bae Lee, and Joo Sun Choi from Samsung, and Brian Hirano from Oracle for their helpful comments. We thank the reviewers for their feedback. We thank the SAFARI group members for the feedback and stimulating research environment they provide. We acknowledge the support of our industrial partners: AMD, HP Labs, IBM, Intel, Oracle, Qualcomm, and Samsung. This research was also partially supported by grants from NSF (Awards CCF-0953246 and CCF-1212962), GSRC, and the Intel URO Memory Hierarchy Program. Donghyuk Lee is supported in part by a Ph.D. scholarship from Samsung.

References
[1] STREAM Benchmark. http://www.streambench.org/.
[2] J. H. Ahn et al. Multicore DIMM: An energy efficient memory module with independently controlled DRAMs. IEEE CAL, January 2009.
[3] J. H. Ahn et al. Improving system energy efficiency with memory rank subsetting. ACM TACO, March 2012.
[4] R. Ausavarungnirun et al. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. In ISCA, 2012.
[5] S. Borkar and A. A. Chien. The future of microprocessors. Commun. ACM, 2011.
[6] D. Burns et al. An RLDRAM II implementation of a 10Gbps shared packet buffer for network processing. In AHS, 2007.
[7] R. Chandra et al. Scheduling and page migration for multiprocessor compute servers. In ASPLOS, 1994.
[8] J. Collins and D. M. Tullsen. Hardware identification of cache conflict misses. In MICRO, 1999.
[9] E. Ebrahimi et al. Parallel application memory scheduling. In MICRO, 2011.
[10] Enhanced Memory Systems. Enhanced SDRAM SM2604, 2002.
[11] K. Ghose and M. B. Kamble. Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation. In ISLPED, 1999.
[12] C. A. Hart. CDRAM in a unified memory architecture. In Compcon Spring '94, Digest of Papers, 1994.
[13] H. Hidaka et al. The cache DRAM architecture: A DRAM with an on-chip cache memory. IEEE Micro, March 1990.
[14] HPC Challenge. GUPS. http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/.
[15] W.-C. Hsu and J. E. Smith. Performance of cached DRAM organizations in vector supercomputers. In ISCA, 1993.
[16] ITRS. International Technology Roadmap for Semiconductors: Process Integration, Devices, and Structures. http://www.itrs.net/Links/2007ITRS/Home2007.htm, 2007.
[17] A. Jaleel et al. High performance cache replacement using re-reference interval prediction. In ISCA, 2010.
[18] T. Johnson et al. Run-time cache bypassing. IEEE TC, 1999.
[19] T. S. Jung. Memory technology and solutions roadmap. http://www.sec.co.kr/images/corp/ir/irevent/techforum_01.pdf, 2005.
[20] G. Kedem and R. P. Koganti. WCDRAM: A fully associative integrated cached-DRAM with wide cache lines. Duke University Technical Report CS-1997-03, 1997.
[21] B. Keeth et al. DRAM Circuit Design: Fundamental and High-Speed Topics. Wiley-IEEE Press, 2007.
[22] Y. Kim et al. ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. In HPCA, 2010.
[23] Y. Kim et al. Thread cluster memory scheduling: Exploiting differences in memory access behavior. In MICRO, 2010.
[24] Y. Kim et al. A case for exploiting subarray-level parallelism (SALP) in DRAM. In ISCA, 2012.
[25] T. Kimura et al. 64Mb 6.8ns random row access DRAM macro for ASICs. In ISSCC, 1999.
[26] C.-K. Luk et al. Pin: Building customized program analysis tools with dynamic instrumentation. In PLDI, 2005.
[27] Micron. RLDRAM 2 and 3 Specifications. http://www.micron.com/products/dram/rldram-memory.
[28] Micron. Verilog: DDR3 SDRAM Verilog model. http://www.micron.com/get-document/?documentId=808.
[29] Micron. DDR3 SDRAM System-Power Calculator, 2010.
[30] Y. Moon et al. 1.2V 1.6Gb/s 56nm 6F2 4Gb DDR3 SDRAM with hybrid-I/O sense amplifier and segmented sub-array architecture. In ISSCC, 2009.
[31] O. Mutlu and T. Moscibroda. Stall-time fair memory access scheduling for chip multiprocessors. In MICRO, 2007.
[32] O. Mutlu and T. Moscibroda. Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems. In ISCA, 2008.
[33] S. Narasimha et al. High performance 45-nm SOI technology with enhanced strain, porous low-k BEOL, and immersion lithography. In IEDM, 2006.
[34] NEC. Virtual Channel SDRAM uPD4565421, 1999.
[35] D. A. Patterson. Latency lags bandwidth. Commun. ACM, October 2004.
[36] T. Piquet et al. Exploiting single-usage for effective memory management. In ACSAC, 2007.
[37] M. K. Qureshi et al. Adaptive insertion policies for high performance caching. In ISCA, 2007.
[38] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO, 2006.
[39] Rambus. DRAM Power Model. http://www.rambus.com/energy, 2010.
[40] R. Rao et al. Exploiting non-uniform memory access patterns through bitline segmentation. In WMPI, 2006.
[41] S. Rixner et al. Memory access scheduling. In ISCA, 2000.
[42] P. Rosenfeld et al. DRAMSim2: A cycle accurate memory system simulator. IEEE CAL, January 2011.
[43] Samsung. DRAM Data Sheet. http://www.samsung.com/global/business/semiconductor/product.
[44] R. H. Sartore et al. Enhanced DRAM with embedded registers. U.S. Patent 5887272, 1999.
[45] Y. Sato et al. Fast Cycle RAM (FCRAM): A 20-ns random row access, pipe-lined operating DRAM. In Symposium on VLSI Circuits, 1998.
[46] V. Seshadri et al. The Evicted-Address Filter: A unified mechanism to address both cache pollution and thrashing. In PACT, 2012.
[47] S. M. Sharroush et al. Dynamic random-access memories without sense amplifiers. Elektrotechnik & Informationstechnik, 2012.
[48] J. M. Sibigtroth et al. Memory bit line segment isolation. U.S. Patent 7042765, 2006.
[49] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In ASPLOS, 2000.
[50] K. Sudan et al. Micro-pages: Increasing DRAM efficiency with locality-aware data placement. In ASPLOS, 2010.
[51] S. Thoziyoor et al. A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies. In ISCA, 2008.
[52] Transaction Processing Performance Council. http://www.tpc.org/.
[53] B. Verghese et al. Operating system support for improving data locality on cc-NUMA compute servers. In ASPLOS, 1996.
[54] T. Vogelsang. Understanding the energy consumption of dynamic random access memories. In MICRO, 2010.
[55] F. Ware and C. Hampel. Improving power and data efficiency with threaded memory modules. In ICCD, 2006.
[56] M. V. Wilkes. The memory gap and the future of high performance memories. SIGARCH Comput. Archit. News, March 2001.
[57] W. A. Wong and J.-L. Baer. DRAM caching. Technical Report UW-CSE-97-03-04, 1997.
[58] W. A. Wulf and S. A. McKee. Hitting the memory wall: Implications of the obvious. SIGARCH Comput. Archit. News, March 1995.
[59] Z. Zhang et al. Cached DRAM for ILP processor memory access latency reduction. IEEE Micro, July 2001.
[60] H. Zheng et al. Mini-rank: Adaptive DRAM architecture for improving memory power efficiency. In MICRO, 2008.
