In-DRAM Cache Management for Low Latency and Low Power 3D-Stacked DRAMs
Ho Hyun Shin 1,2 and Eui-Young Chung 2,*

1 Samsung Electronics Company, Ltd., Hwasung 18448, Korea; [email protected]
2 School of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Korea
* Correspondence: [email protected]; Tel.: +82-2-2123-5866

Received: 24 December 2018; Accepted: 5 February 2019; Published: 14 February 2019
Micromachines 2019, 10, 124; doi:10.3390/mi10020124

Abstract: Recently, 3D-stacked dynamic random access memory (DRAM) has become a promising solution for ultra-high-capacity and high-bandwidth memory implementations. However, like typical 2D DRAMs, it suffers from the memory wall problem caused by long latency. Although various cache management techniques and latency-hiding schemes reduce DRAM access time, in a high-performance system using high-capacity 3D-stacked DRAM it is ultimately essential to reduce the latency of the DRAM itself. To this end, various asymmetric in-DRAM cache structures have recently been proposed; they are especially attractive for high-capacity DRAMs because they can be implemented at low cost in 3D-stacked DRAMs. However, most prior work focuses on the architecture of the in-DRAM cache itself and pays little attention to proper management methods. In this paper, we propose two new management algorithms for in-DRAM caches to achieve a low-latency and low-power 3D-stacked DRAM device. Through computing-system simulation, we demonstrate an energy-delay product improvement of up to 67%.

Keywords: 3D-stacked; DRAM; in-DRAM cache; low-latency; low-power

1. Introduction

The latency of dynamic random access memory (DRAM) has been a critical issue for two primary reasons [1]. First, while the processing speed of the central processing unit (CPU) has improved continuously, DRAM latency has remained relatively unchanged for decades. This speed gap, called the memory wall, causes significant bottlenecks in overall computing performance [2,3]. As shown in Figure 1a, capacity and bandwidth have increased 16 and 6 times over that period, respectively, while the timing constraints that represent DRAM latency (Figure 1b), namely the row address to column address delay (tRCD) and the row cycle time (tRC), have improved by only 11.2% and 20.0%, respectively [4–7]. Second, the processing speed of big-data workloads is affected by memory latency as well as bandwidth. Russell et al. showed that the instructions per cycle of applications dealing with big data can be significantly improved by reducing DRAM latency [8]. This is because big-data streams tend to have strong dependencies between their elements. In particular, online transaction processing (OLTP), which supports highly transaction-oriented applications, is a representative example of a latency-sensitive workload [9]. In addition, recent artificial intelligence applications require large amounts of memory to handle massive datasets, together with low latency for real-time data processing. In other words, we expect an increasing number of applications to demand high capacity and low latency simultaneously.

Figure 1. Comparison of dynamic random access memory (DRAM) capacity, bandwidth, and latency improvement by DRAM generation [4–7]. (a) Capacity and bandwidth of DRAM. (b) DRAM access latency: row address to column address delay (tRCD) and row cycle time (tRC).
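For concreteness, the short sketch below reproduces the scaling arithmetic quoted above. The DDR-400 and DDR4-2400 endpoint values are illustrative assumptions chosen to be consistent with the ratios and percentages in the text, not data taken from [4–7].

```python
# Illustrative check of the scaling trends in Figure 1. The DDR-400 and
# DDR4-2400 endpoint values are assumptions chosen to be consistent with
# the ratios quoted in the text, not measured data from [4-7].
endpoints = {
    "capacity_GB":    (0.5, 8.0),     # roughly 16x growth
    "bandwidth_GBps": (3.2, 19.2),    # roughly 6x growth (peak transfer rate)
    "tRCD_ns":        (15.0, 13.32),  # about 11.2% improvement
    "tRC_ns":         (55.0, 44.0),   # about 20.0% improvement
}

for name, (ddr400, ddr4_2400) in endpoints.items():
    if name.startswith("t"):  # timing constraint: lower is better
        gain = (ddr400 - ddr4_2400) / ddr400 * 100
        print(f"{name:14s}: {ddr400:6.2f} -> {ddr4_2400:6.2f} ({gain:.1f}% lower)")
    else:                     # capacity/bandwidth: higher is better
        print(f"{name:14s}: {ddr400:6.2f} -> {ddr4_2400:6.2f} ({ddr4_2400/ddr400:.0f}x)")
```

The point of the comparison stands regardless of the exact endpoints: capacity and bandwidth scale multiplicatively across generations, while the core timing constraints improve by only modest percentages.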
DRAM devices are being transformed into various structures as a result of recent developments in die stacking with through-silicon vias (TSVs) [10]. For example, stacking homogeneous DRAM dies extends capacity without power and performance losses [11,12]. Moreover, a heterogeneous combination of logic and DRAM dies, as in high-bandwidth memory (HBM) or the hybrid memory cube (HMC), increases the data bandwidth without a significant power overhead [13,14]. The power referred to here is power relative to a performance metric such as capacity or bandwidth; for example, at the same capacity and bandwidth, HBM consumes significantly less power than Graphics Double Data Rate 5 (GDDR5). However, although these technologies have enhanced the memory sub-system in terms of capacity and bandwidth, latency improvements have been neglected.

To overcome the long latency of DRAM, many computers embed multiple caches in the CPU. The cache not only hides the long latency of DRAM, but also exploits the locality of pre-fetched pages, thereby offering large bandwidth locally within the CPU. However, since a typical cache is implemented with static random access memory (SRAM), it incurs a large area cost and consumes considerable leakage power. It is therefore essential to reduce the DRAM latency itself to improve memory access latency. (In this paper, DRAM latency refers to the time required for the DRAM controller to read or write data in a DRAM device, and memory access latency refers to the latency required for processor instructions to access data in the cache or in DRAM.)

The in-DRAM cache, which is embedded in a DRAM device, has several unique characteristics that distinguish it from a processor cache [15]. First, although the cache itself is placed in the DRAM, its operation is managed by the DRAM controller. This is because the interface between the controller and the DRAM must follow the DRAM timing constraints specified by the Joint Electron Device Engineering Council (JEDEC), which maintains compatibility with current computing systems. There are, of course, other ways to implement the in-DRAM cache and its manager, such as operating system (OS) or processor modifications; however, such methods require many changes to the current computing system and ultimately degrade compatibility. We therefore designed the manager into the DRAM controller so that the proposed method follows the JEDEC specification, and implemented the in-DRAM cache in the DRAM device.

Secondly, the capacity of the in-DRAM cache increases in proportion to the DRAM capacity and is much larger than that of the processor cache. For example, when hundreds of gigabytes of DRAM are mounted in a system, the processor cache remains constant at several hundred megabytes, while the in-DRAM cache can grow to tens of gigabytes. However, such a large-capacity in-DRAM cache requires more tag storage, which results in long tag access latency and, in turn, increases the overall memory access latency.
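To make the tag-overhead argument concrete, here is a minimal back-of-the-envelope sketch assuming a direct-mapped in-DRAM cache. The device capacity, cache capacity, block sizes, and the two status bits per entry are hypothetical parameters chosen for illustration, not values from this paper.

```python
import math

def tag_storage_bytes(dram_bytes, cache_bytes, block_bytes, status_bits=2):
    """Rough tag-array size for a direct-mapped in-DRAM cache.

    status_bits models valid/dirty bits per entry. All parameters are
    hypothetical, chosen only to illustrate the block-size trade-off.
    """
    entries = cache_bytes // block_bytes
    # A tag must distinguish the DRAM blocks that map to the same entry.
    tag_bits = math.ceil(math.log2(dram_bytes // cache_bytes))
    return entries * (tag_bits + status_bits) / 8

GiB, KiB = 1 << 30, 1 << 10
dram, cache = 256 * GiB, 8 * GiB  # hypothetical 3D-stacked device and cache
for block in (64, 256, 1 * KiB, 4 * KiB):
    mib = tag_storage_bytes(dram, cache, block) / (1 << 20)
    print(f"block size {block:5d} B -> tag array {mib:7.2f} MiB")
```

Under these assumptions, a 64 B block requires over a hundred mebibytes of tags, while a 4 KiB block shrinks the tag array by a factor of 64, which motivates the larger transfer granularity discussed next.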
To overcome this tag-overhead problem, the data transfer granularity between the DRAM and the in-DRAM cache, called the cache block size, must be increased. However, a larger block size causes significant power consumption. Power issues in DRAMs matter for minimizing the energy consumed by the DRAM chip itself, and they are also critical design parameters for 3D-stacked DRAMs from a thermal point of view. Since a 3D-stacked DRAM chip consists of several stacked dies, it is very difficult to dissipate the heat generated inside the chip. This heat degrades the retention characteristics of the DRAM cells, so the DRAM requires a shorter refresh cycle. However, shortening the refresh cycle of a high-capacity 3D-stacked DRAM generates more heat, which degrades the retention time of the DRAM cells again. Thermal behavior in 3D-stacked DRAMs is therefore a very sensitive design constraint and must be managed carefully (a simple illustrative model of this feedback is sketched after Figure 2 below).

Considering the various properties of the in-DRAM cache, this paper proposes two new in-DRAM cache management algorithms for data replacement, aimed in particular at maximizing cache efficiency and minimizing energy consumption. The proposed algorithms are not tied to a specific in-DRAM cache architecture and can be adapted to general architectures.

2. Background and In-Dynamic Random Access Memory (DRAM) Cache Architecture

A DRAM chip consists of a DRAM cell array area and peripheral circuits, including several in-out ports (Figure 2). The cell array region is composed of multiple sub-arrays, each containing DRAM cells and bit-line sense amplifiers. As mentioned in Section 1, DRAM latency has improved very slowly, and the reasons are directly related to cost and power consumption [16,17]. For example, to reduce the sensing and pre-charge time, the number of cells connected to each bit-line should be reduced [18]. However, this increases the number of bit-line sense amplifiers and thus the chip size. Moreover, timing constraints such as the CAS latency (tCL) are mainly determined by the speed of the data path. To improve this speed, the capacitive metal loading of the data-path signals should be decreased, or their driver strength should be increased; however, both approaches increase cost or power consumption. Consequently, the latency of a DRAM device must be optimized while simultaneously considering multiple side effects. Among the various techniques for reducing DRAM latency, this paper focuses on the in-DRAM cache and discusses its management method.

Figure 2. DRAM chip organization: DRAM cell arrays built from sub-arrays of cells, word lines, bit lines, and bit-line sense amplifiers, together with peripheral circuits and the in-out port.
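As a rough illustration of the thermal feedback discussed above, the sketch below models a temperature-dependent refresh policy in the style of common DDR practice, where the average refresh interval (tREFI) is halved in the extended temperature range above 85 °C. The threshold, intervals, and per-command energy are assumptions for illustration, not parameters from this paper.

```python
def refresh_interval_us(temp_c, base_trefi_us=7.8):
    """Average refresh-command interval (tREFI) as a function of temperature.

    Follows the common DDR convention of halving tREFI in the extended
    temperature range above 85 C (64 ms / 8192 rows is about 7.8 us).
    The threshold and intervals are illustrative assumptions.
    """
    return base_trefi_us if temp_c <= 85 else base_trefi_us / 2

def refresh_power_w(temp_c, energy_per_ref_nj=100.0):
    """Refresh power = refresh commands per second x energy per command.

    energy_per_ref_nj is a placeholder; real values depend on the device
    and on how many rows each REF command refreshes.
    """
    cmds_per_sec = 1e6 / refresh_interval_us(temp_c)
    return cmds_per_sec * energy_per_ref_nj * 1e-9

for t in (45, 85, 95):
    print(f"{t:3d} C: tREFI = {refresh_interval_us(t):4.2f} us, "
          f"refresh power ~ {refresh_power_w(t) * 1000:.1f} mW")
```

Crossing the temperature threshold doubles the refresh rate and hence roughly doubles the refresh power, which in turn adds heat; this is precisely the feedback loop that makes thermal behavior such a sensitive design constraint in 3D-stacked DRAMs.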