Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems

Mike O’Connor∗†‡ Niladrish Chatterjee∗† Donghyuk Lee† John Wilson† Aditya Agrawal† Stephen W. Keckler†‡ William J. Dally†⋄

†NVIDIA   ‡The University of Texas at Austin   ⋄Stanford University
{moconnor, nchatterjee, donghyukl, jowilson, adityaa, skeckler, bdally}@nvidia.com

ABSTRACT

Future GPUs and other high-performance throughput processors will require multiple TB/s of bandwidth to DRAM. Satisfying this bandwidth demand within an acceptable energy budget is a challenge in these extreme bandwidth memory systems. We propose a new high-bandwidth DRAM architecture, Fine-Grained DRAM (FGDRAM), which improves bandwidth by 4× and improves the energy efficiency of DRAM by 2× relative to the highest-bandwidth, most energy-efficient contemporary DRAM, High Bandwidth Memory (HBM2). These benefits are in large measure achieved by partitioning the DRAM die into many independent units, called grains, each of which has a local, adjacent I/O. This approach unlocks the bandwidth of all the banks in the DRAM to be used simultaneously, eliminating shared buses interconnecting various banks. Furthermore, the on-DRAM data movement energy is significantly reduced due to the much shorter wiring distance between the cell array and the local I/O. This FGDRAM architecture readily lends itself to leveraging existing techniques to reduce the effective DRAM row size in an area-efficient manner, reducing wasteful row activate energy in applications with low locality. In addition, when FGDRAM is paired with a memory controller optimized to exploit the additional concurrency provided by the independent grains, it improves GPU system performance by 19% over an iso-bandwidth and iso-capacity future HBM baseline. Thus, this energy-efficient, high-bandwidth FGDRAM architecture addresses the needs of future extreme-bandwidth memory systems.

CCS CONCEPTS

• Hardware → Dynamic memory; Power and energy; • Computing methodologies → Graphics processors; • Computer systems organization → Parallel architectures;

KEYWORDS

DRAM, Energy-Efficiency, High Bandwidth, GPU

ACM Reference format:
M. O'Connor, N. Chatterjee, D. Lee, J. Wilson, A. Agrawal, S.W. Keckler, and W.J. Dally. 2017. Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems. In Proceedings of MICRO-50, Cambridge, MA, USA, October 14–18, 2017, 14 pages. https://doi.org/10.1145/3123939.3124545

1 INTRODUCTION

High bandwidth DRAM has been a key enabler of the continuous performance scaling of Graphics Processing Units (GPUs) and other throughput-oriented parallel processors. Successive generations of GPU-specific DRAMs, optimized primarily to maximize bandwidth rather than minimize cost per bit, have increased aggregate system bandwidth; first through high-frequency off-chip signaling with Graphics Double-Data Rate memories (GDDR3/5/5X [18, 21, 24]) and, most recently, through on-package integration of the processor die and wide, high-bandwidth interfaces to stacks of DRAM (e.g., High Bandwidth Memory (HBM/HBM2) [20, 23] and Multi-Channel DRAM (MCDRAM) [15]). Future GPUs will demand multiple TB/s of DRAM bandwidth, requiring further improvements in the bandwidth of GPU-specific DRAM devices.

In this paper, we show that traditional techniques for extending the bandwidth of DRAMs will either add to the system energy and/or add to the cost/area of DRAM devices. To meet the bandwidth objectives of the future, DRAM devices must be more energy-efficient than they are today without significantly sacrificing area-efficiency. To architect a DRAM device that meets these objectives, we carry out a detailed design space exploration of high-bandwidth DRAM microarchitectures.
∗Both authors contributed equally to this paper.

Using constraints imposed by practical DRAM layouts and insights from GPU memory access behaviors to inform the design process, we arrive at a DRAM and memory controller architecture, Fine-Grained DRAM (FGDRAM), suited to future high-bandwidth GPUs.

The most formidable challenge to scaling the bandwidth of GPU DRAMs is the energy of DRAM accesses. Every system is designed to operate within a fixed maximum power envelope, and the energy spent on DRAM access eats into the total power budget available for the rest of the system. Traditionally, high-end GPU cards have been limited to approximately 300 W, of which no more than about 20% is budgeted to the DRAM when operating at peak bandwidth.

Figure 1a shows the DRAM energy per access that can be tolerated at a given peak DRAM bandwidth while remaining within a 60W DRAM power budget. We see that the energy improvements of on-die stacked High Bandwidth Memory (HBM2) over off-chip GDDR5 memories have allowed modern GPUs to approach a terabyte-per-second of memory bandwidth at comparable power to previous GPUs that provided less than half the bandwidth using GDDR5. This figure also demonstrates, however, that even with HBM2, systems with more than 2 TB/s of bandwidth won't be possible within this traditional power budget. A future exascale GPU with 4 TB/s of DRAM bandwidth would dissipate upwards of 120 W of DRAM power.

The energy to access a bit in HBM2 is approximately 3.97 pJ/bit, and, as shown in Figure 1b, it consists largely of data movement energy (the energy to move data from the row buffer to the I/O pins) and activation energy (the energy to precharge a bank and activate a row of cells into the row-buffer); the I/O energy accounts for the small remainder. The activation energy is a function of the row size and the row locality of the memory access stream, and it is a significant factor because most GPU workloads access only a small fraction of the 1KB row activated in HBM2. The data movement energy is determined primarily by the distance that the data moves on both the DRAM die and the base layer die to reach the I/O pins, the capacitance of these wires, and the rate of switching on this datapath. Since most current DRAM devices, including HBM2, send data from banks spread across the die to a common I/O interface, data may travel from the farthest corners of the device to the I/O PHYs on the base-layer, leading energy for data movement to dominate the overall energy. FGDRAM reduces both of these components of DRAM energy.

In FGDRAM, a DRAM die is a collection of small units called grains, with each grain having a local, dedicated, and narrow data interface. Much like a traditional DRAM channel, each grain serves a DRAM request in its entirety. However, there are two fundamental differences between an FGDRAM grain and an HBM2 channel. First, unlike a traditional HBM2 die where 16 DRAM banks share a single wide I/O interface, each FGDRAM grain fetches data from only a single DRAM bank. Second, each grain has a fraction of the bandwidth of a traditional HBM2 channel. These two architectural changes enable the main benefits of FGDRAM. First, eliminating the sharing of a DRAM channel by multiple banks eliminates the inter-bank global data bus on a DRAM die. This architecture reduces the distance moved by data from a row-buffer to the I/O hub, thereby reducing the on-DRAM data movement energy. Second, because each FGDRAM bank needs to provide less bandwidth than a traditional bank, FGDRAM is able to use techniques explained in Section 3.2 to achieve lower activation energy without significant area overheads. While these benefits combine synergistically to reduce the DRAM access energy, the allocation of private data channels to the individual banks on a die also exposes the entire bandwidth of the DRAM die to the GPU and paves the way for area-efficient bandwidth scaling. The throughput-optimized memory controllers on a GPU can easily exploit this architecture to provide high bandwidth to memory intensive applications.

In summary, this paper makes the following contributions:

• Based on a detailed analysis of GPU workloads (both compute and graphics) and practical DRAM architectures, we demonstrate that both data movement and row activation energies must be reduced to meet the energy target of future memories.
• We propose a new DRAM architecture, FGDRAM, which provides both 4× more bandwidth and 51% lower energy per access than HBM2, the highest bandwidth and most efficient contemporary DRAM.
• We develop an evolutionary approach to HBM2 which also provides 4× more bandwidth, but show FGDRAM is 49% lower energy than this iso-bandwidth baseline.
• The additional concurrency in our proposed FGDRAM architecture can be easily exploited by a GPU to improve the performance of a wide range of GPU compute workloads by 19% on average over the iso-bandwidth baseline.
• We also consider the iso-bandwidth baseline enhanced with two prior proposed techniques to improve DRAM performance and energy. We show that FGDRAM requires 34% less energy and 1.5% less area, and is within 1.3% of the performance of this enhanced baseline.

Figure 1: GPU Memory Power and Energy. (a) Maximum DRAM access energy for a given peak bandwidth within a 60W DRAM power budget (GDDR5 at 14.0 pJ/bit supports 536 GB/s; HBM2 at 3.9 pJ/bit supports 1.9 TB/s). (b) HBM2 energy consumption per bit, broken into activation, on-die data movement, and I/O energy.
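The arithmetic behind Figure 1a is straightforward. The following Python sketch is our own illustration (the names and constants are ours, not from the paper's infrastructure); it reproduces the 60 W budget curve and the roughly 120 W figure quoted above for a 4 TB/s system at HBM2's access energy.

```python
# Illustrative sketch: maximum tolerable DRAM energy per bit for a given peak
# bandwidth under a fixed DRAM power budget, and the DRAM power implied by a
# given access energy (the relationship plotted in Figure 1a).

DRAM_POWER_BUDGET_W = 60.0          # ~20% of a 300 W GPU board budget

def max_energy_pj_per_bit(bandwidth_tb_per_s: float) -> float:
    """Maximum DRAM access energy (pJ/bit) that stays within the power budget."""
    bits_per_s = bandwidth_tb_per_s * 1e12 * 8
    return DRAM_POWER_BUDGET_W / bits_per_s * 1e12   # J/bit -> pJ/bit

def dram_power_w(bandwidth_tb_per_s: float, energy_pj_per_bit: float) -> float:
    """DRAM power (W) at peak bandwidth for a given access energy."""
    bits_per_s = bandwidth_tb_per_s * 1e12 * 8
    return energy_pj_per_bit * 1e-12 * bits_per_s

print(max_energy_pj_per_bit(4.0))   # ~1.9 pJ/bit allowed at 4 TB/s
print(dram_power_w(4.0, 3.97))      # ~127 W at 4 TB/s with HBM2's ~3.97 pJ/bit
```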
2 BANDWIDTH SCALING CHALLENGES

This section examines the main challenges faced by conventional bandwidth scaling techniques when applied to high bandwidth DRAMs. We use the key insights gained from this analysis to guide the design of our proposed FGDRAM architecture.

2.1 DRAM Energy

As shown in Figure 1a, reducing DRAM energy is required to enable increased bandwidth without exceeding a reasonable energy budget for the DRAM in a system. The energy consumed by the DRAM is a function of the microarchitecture of the DRAM and the memory request pattern generated by the processor. Previous work [7] demonstrated that GPU workloads, both from the graphics and compute domains, incur high activation energy due to low row-access locality arising from irregular accesses and/or row-buffer interference between the thousands of concurrently executing threads. However, as shown in Figure 1b, the movement of data from the sense-amplifiers to the DRAM I/Os is another, more significant contributor to overall energy. Figure 2 shows that in a stacked DRAM, the data first travels on the DRAM die from the bank's sense-amplifier to the central stripe, then down to the base layer using through-silicon vias (TSVs), and a short distance over the base-layer to reach the I/O bumps that connect the HBM2 stack to the GPU (approximately 9.9 mm). Using a detailed energy model based on the physical floorplan and DRAM process characteristics (Section 4) and the actual data toggle rate of different applications, we found that switching the capacitance on this datapath requires 2.24 pJ/bit of energy on average. In contrast, transferring the data over the I/O wires on the silicon interposer requires an additional 0.3 pJ/bit, considering actual application data toggle rates. In total, with the average of 1.21 pJ/bit of activation energy, each HBM2 access incurs 3.92 pJ/bit of energy (including ECC overhead), which is a major impediment to increasing the aggregate bandwidth for a GPU.

Figure 2: HBM2 access energy components: ❶ row activation, ❷ data transfer within a chip, ❸ data transfer within a stack, ❹ data transfer on an interposer. Reproduced from [6].

Figure 3: High Bandwidth Memory microarchitecture. Reproduced from [6].

Clearly, to reach a target of 2 pJ/bit of overall energy consumption, future high-bandwidth DRAMs must reduce internal data movement energy. Opportunities to reduce activation energy should also be considered, as it is still a significant component of overall energy consumption.

2.2 Increasing Per-bank Bandwidth

Traditionally, bandwidth scaling of DRAM devices has been achieved by improvements in the I/O signaling frequency. However, while signal speed either on a PCB, an organic package substrate, or a silicon interposer can be increased with relative ease, scaling the frequency of the DRAM core and storage arrays at a commensurate pace is extremely difficult. Consequently, with internal DRAM frequencies remaining fairly similar over several product generations, DRAM vendors have turned to increasing the prefetch width to scale the internal DRAM bandwidth and match it with the bandwidth at the I/O. However, because current high-end GPU DRAMs are at the very limit of this scaling technique, continuing down this path will require either high energy or high area overhead.

In an HBM2 channel, the 64-bit I/O interface is operated at a 1GHz frequency, providing 16 GB/s of bandwidth on the DDR interface. The DRAM atom size (the size of a single DRAM request) is 32 Bytes, which is transferred with a burst length of four over the 64-bit DDR I/O interface. Throughout this paper, we consider an HBM2 stack to be operating in pseudochannel mode [41] with sixteen 64-bit wide channels per stack rather than legacy mode with eight 128-bit wide channels.

To support the 16 GB/s bandwidth per channel, an HBM2 bank operating at 500MHz outputs 128 bits in each internal clock; two banks operate in parallel to provide the required bandwidth (as described below in Section 2.3). Creating a future HBM stack with 4× the bandwidth at the same internal frequency with the traditional prefetch scaling method would require two changes. First, the internal prefetch from each bank must be increased to 512 bits and the DRAM atom size (the size of each DRAM request) must be correspondingly increased to 128 Bytes.

Increasing the prefetch width essentially requires a wider datapath from the DRAM bank to the I/O circuitry of the DRAM device. As shown in Figure 3, a DRAM bank is composed of several subarrays, with each subarray containing a subset of the rows in the bank and a set of sense-amplifiers that constitute the row-buffer of the subarray. As a typical example, the 16K rows in an HBM2 bank are organized as 32 subarrays, each containing 512 rows of DRAM cells. Each subarray is not monolithic, but consists of several 512×512 arrays of DRAM cells which are typically referred to as mats. When a row is activated in a subarray, each mat activates a section of the row into its local set of sense-amplifiers. The subarray performs this operation by first driving a Master Wordline (MWL) in a high level metal across the subarray, which in turn drives a Local Wordline (LWL) in every constituent mat. Subsequently, when a DRAM atom is read, every mat of the subarray outputs a few bits of the DRAM atom. In HBM2, a 1 KB wide row is split across 16 mats, each 512 bits wide; on a read command, each mat provides 16 bits over two internal cycles to return the 32 Byte DRAM atom. The read command drives a set of Column Select Lines (CSLs) in each mat which work as mux select signals, driving data from the target sense-amplifiers to the Master Datalines (MDLs) via the Local Data Lines (LDLs) [13, 25, 44]. Quadrupling the prefetch width of the bank requires either increasing the number of mats in a subarray to 64 or quadrupling the bandwidth per mat so that each mat outputs 64 bits over two cycles instead of 16 bits. The first option increases the row-size and consequently the activation energy. Given the need to reduce energy in high bandwidth memories, this increase is a step in the wrong direction.
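To make the prefetch-scaling constraint concrete, the short sketch below (our own, using only the figures quoted above; all names are ours) derives the 512-bit prefetch and 128-Byte atom that a 4× channel would require at an unchanged 500 MHz core clock.

```python
# Illustrative sketch: why quadrupling channel bandwidth at a fixed 500 MHz DRAM
# core clock forces a 4x wider prefetch and a 4x larger DRAM atom.

CORE_CLK_HZ = 500e6            # internal DRAM clock
CYCLES_PER_ACCESS = 2          # each bank spends two internal cycles per atom
BANKS_OVERLAPPED = 2           # two bank groups cover one channel (Section 2.3)

def prefetch_bits(channel_gbps: float) -> float:
    """Per-bank prefetch width (bits per internal clock) needed for a channel."""
    per_bank_bytes_per_s = channel_gbps * 1e9 / BANKS_OVERLAPPED
    return per_bank_bytes_per_s * 8 / CORE_CLK_HZ

def atom_bytes(channel_gbps: float) -> float:
    """DRAM atom size implied by that prefetch width."""
    return prefetch_bits(channel_gbps) * CYCLES_PER_ACCESS / 8

print(prefetch_bits(16), atom_bytes(16))    # 128.0 bits, 32.0 B  (HBM2 today)
print(prefetch_bits(64), atom_bytes(64))    # 512.0 bits, 128.0 B (4x channel)
```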

On the other hand, increasing the mat bandwidth requires large area overheads. The 3 metal layers in a typical DRAM process require a metal layer for the vertical bitlines in the mat, one for the MWLs and LDLs in the horizontal direction, and a third for the CSLs and MDLs in the vertical direction. The MWL and LDL metal layer has 4× the pitch of the LWL, which is built using silicided polysilicon [16, 25, 36, 44]. Likewise, the CSLs and MDLs have 4× the pitch of a bitline. A DRAM mat's area is dictated by the number of wiring tracks required in these coarse-pitch metal layers. Quadrupling the mat bandwidth requires increasing the number of both the LDLs and MDLs from 16 to 64 (each signal being differential). This approach leads to a 77% increase in mat area due to increases in the wiring tracks in both the vertical and horizontal directions. While some of that area can be saved by trading off CSL count for increased MDL count, additional area is required to increase the width of the global inter-bank I/O bus that connects the banks to the I/O circuitry. Increasing the prefetch width thus significantly increases the cost of the DRAM device.

Furthermore, increasing the DRAM atom size is undesirable for multiple reasons. First, previous work has shown that GPU compute applications benefit from 32-byte sector sizes, and memory hierarchies designed to support such request sizes boost both performance and energy by avoiding data overfetch [35]. Second, graphics pipelines compress render surface tiles into 32-byte units to save DRAM bandwidth and amplify L2 capacity [31]. Increasing the atom size from 32 Bytes to 128 Bytes defeats the benefit of this important optimization and leads to a 17% degradation in performance for the graphics workloads we simulated. DRAM bandwidth scaling techniques must avoid increasing the DRAM mat bandwidth or the DRAM atom size.

2.3 Overlapping Accesses in Different Banks

Due to increasing I/O frequencies and stagnating DRAM internal frequencies in modern high-bandwidth DRAM devices, the time to transfer a single DRAM atom on the I/O bus (tBURST) is smaller than the minimum time between successive read requests to one DRAM bank. Thus, successive read accesses to the same DRAM bank cannot saturate the DRAM data interface.

To address this issue, recent DRAM standards, such as DDR4 [19], GDDR5, and HBM/HBM2, support bank grouping. The banks in a given DRAM channel are partitioned into several bank groups, typically with 4 banks per group. Accesses to different bank groups can be issued closely together, regardless of the cycle time of a single bank. This short delay between successive column commands (i.e. reads and writes) to different bank groups is the tCCDS timing parameter. This tCCDS parameter is equal to the tBURST time, ensuring "gapless" transmission on the data bus across successive accesses. The cycle time of a given bank (and possibly some structures shared with other banks in the same group) determines the rate at which successive column commands to the same group can be issued. This longer delay is the tCCDL timing parameter. To make efficient use of the full DRAM bandwidth, requests must alternate among different bank groups.

Figure 4: Overlapping multi-cycle accesses among bank groups. Each command requires two clock cycles to access the data for a 32B burst. Commands to the different bank groups can be separated by tCCDS = 2 ns, while commands to the same bank group must be separated by tCCDL = 4 ns.

Using this bank grouping approach, a DRAM can support higher bandwidth from a channel than a single bank can provide. In particular, if each bank accumulates the required data for a request over multiple internal cycles, the DRAM can support channel bandwidths many times that of a single bank's bandwidth. Figure 4 illustrates how a single 256-bit access is split over two 128-bit accesses to a single bank. Since a request to a different bank-group can overlap with the second phase access of the first bank, the total bandwidth available on the interface is twice that of a single bank. In fact, HBM2 employs this technique [23], allowing the bandwidth to double relative to first-generation HBM without requiring additional per-mat bandwidth. This approach can be further extended with more bank groups and more internal access cycles to enable a larger ratio between the bank and channel bandwidth.

While this approach enables higher channel bandwidths, it requires very fast switching on the internal DRAM global data bus that interconnects all the banks, particularly for high channel bandwidths. Furthermore, the high ratios required for a 4× bandwidth HBM2 derivative would require rotating read and write commands among at least 8 bank groups in each channel. Unfortunately, back-to-back accesses to the same bank would be very slow in this case, as each bank would require multiple cycles for a single access. In this example, tCCDL is 16 ns, instead of 4 ns as it is today. We found that performance degrades by an average of 10.6% compared to an iso-bandwidth system with conventional inter- and intra-bank-group timings.

2.4 Additional Parallel Channels

The complexities of increasing the bandwidth of a single channel can be avoided by simply increasing the number of channels in the device. Each channel retains the same baseline bandwidth, though possibly using a narrower, higher-speed I/O channel. All of the channel timing parameters remain the same.

Unfortunately, a straightforward increase in the number of channels is area intensive. Even if the total storage capacity remains the same, the number of independent DRAM banks increases in proportion to the number of additional channels – each bank simply has proportionally fewer rows. However, increasing the channel count requires the replication of the row and column address decoders and the global sense-amplifiers, leading to 36% higher area.
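A minimal sketch of the scheduling constraint described above (our own illustration, using the tCCDS/tCCDL values from Figure 4; all identifiers are ours):

```python
# Illustrative sketch: how bank grouping recovers data-bus utilization.
# With tBURST = tCCDS = 2 ns, alternating bank groups keeps the bus fully busy,
# while issuing every access to the same bank group is limited by tCCDL = 4 ns.

T_BURST = 2.0   # ns to transfer one 32B atom on the channel
T_CCD_S = 2.0   # ns between column commands to *different* bank groups
T_CCD_L = 4.0   # ns between column commands to the *same* bank group

def bus_utilization(num_accesses: int, alternate_groups: bool) -> float:
    gap = T_CCD_S if alternate_groups else T_CCD_L
    issue_times = [i * gap for i in range(num_accesses)]
    total_time = issue_times[-1] + T_BURST          # last burst finishes
    busy_time = num_accesses * T_BURST
    return busy_time / total_time

print(bus_utilization(1000, alternate_groups=True))   # ~1.00
print(bus_utilization(1000, alternate_groups=False))  # ~0.50
```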

Alternatively, the number of banks per channel can be proportionally reduced to avoid most of this additional area penalty. For instance, a current 16-channel HBM2 stack with 16 banks per channel evolves into a 4× bandwidth 64-channel stack with 4 banks per channel. Each channel has the same bandwidth as a current HBM2 channel, using one quarter the number of 8 Gb/s I/O signals. The total number of banks remains constant. This evolutionary approach to scaling bandwidth has the fewest downsides, and we use this quad-bandwidth HBM stack (QB-HBM) as a baseline for comparison to our FGDRAM proposal.

By reducing the number of banks available in each channel, it becomes more difficult for the memory controller to always find work for all four banks so that it can hide activate/precharge latencies under data transfers from the other banks. Also, while this QB-HBM architecture addresses the target bandwidth demand for future systems, there is no significant reduction in the energy per access. As a result, we also consider an enhanced alternative to the QB-HBM baseline which incorporates prior published approaches to increase bank-level parallelism and reduce DRAM activation energy.

A potential solution to the reduction in exploitable bank-level parallelism can be found in a technique called Subarray Level Parallelism (SALP) [26]. This approach enables subarrays within a DRAM bank to be independently activated, effectively creating additional opportunities to perform activates/precharges while accesses to other subarrays take place. In effect, it creates a number of smaller semi-independent banks within a single bank structure. When SALP is enabled, bank-level parallelism is recovered, and the performance of the baseline 16-bank configuration is restored.

As shown in Figure 1b, the energy due to DRAM row activations is a significant portion of DRAM energy. The subchannels architecture [6] partitions each bank and the associated DRAM channel into narrow partitions, reducing the effective DRAM row size and DRAM activation energy. We apply both SALP and subchannels to create an enhanced baseline quad-bandwidth HBM design. We will compare these baseline alternatives to our proposed architecture in energy, area, and performance.

3 FINE-GRAINED DRAM

Based on the challenges faced in scaling bandwidth and reducing DRAM energy, three key objectives shape our DRAM architecture proposal:

(1) Additional bandwidth must be exposed via additional parallel channels.
(2) Data movement energy must be reduced by limiting the distance between banks and I/Os.
(3) Activation energy must be reduced by limiting the effective row-size of each activate.

Our goal is to architect a DRAM stack with 4× the bandwidth of a current HBM2 stack while simultaneously reducing the energy per access by a factor of two. This 1 TB/s, 2 pJ/bit DRAM will enable future multi-TB/s GPU memory systems. Our proposed Fine-Grained DRAM (FGDRAM) stack architecture (Figure 5) partitions the DRAM into a number of small, independent grains. Each grain is essentially a narrow slice of a DRAM bank along with an adjacent, associated, and dedicated I/O interface. This architectural approach achieves the energy and bandwidth objectives by simultaneously addressing data movement energy and providing direct parallel access to every DRAM bank. Furthermore, we apply an area-efficient technique to reduce the effective row-size, addressing activation energy. The finely partitioned FGDRAM architecture requires changes to the interface organization, the bank (grain) architecture, and the memory controller architecture. In contrast to Figure 2, the row activation (❶) and within-chip data transfer (❷) components of the energy are significantly reduced.

Figure 5: FGDRAM Die Stack Architecture.

3.1 Interface Architecture

Parallel narrow channels. The proposed 1 TB/s FGDRAM stack architecture provides equivalent bandwidth to the proposed quad-bandwidth HBM (QB-HBM) baseline design. The QB-HBM design has 64 channels, each providing 16 GB/s of bandwidth. The FGDRAM architecture provides 512 grains in each stack, each providing 2 GB/s of bandwidth. The access granularity of each read or write request is still a 32 byte atom in both architectures. In FGDRAM, one request must be serialized over the narrower bus, resulting in a tBURST of 16 ns. While this increases the minimum latency of each read request by several nanoseconds, this modest increase has a negligible impact on performance in highly threaded, bandwidth-oriented systems.

Using many narrow, relatively low-bandwidth channels allows the FGDRAM architecture to provide direct connections to each bank in the stack. Figure 6 shows one QB-HBM channel with its 4 physical banks, and the equivalent 16 grains in the FGDRAM proposal. One key aspect of the FGDRAM architecture is unlocking bandwidth by allowing all banks to be accessed in parallel and eliminating the bottleneck of a shared bus interconnecting several banks. This approach reduces energy via the direct connection to nearby I/O, as well as provides the necessary bandwidth without an area penalty. Partitioning the interface into a large number of low bandwidth channels provides a number of other opportunities to optimize energy and simplify the memory controller, as discussed in Section 3.3.
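The bandwidth partitioning above works out as follows; this is a small sketch of our own (the constants and names are ours), with the per-grain pin count taken from the I/O signaling discussion later in this section.

```python
# Illustrative sketch: per-grain bandwidth and burst time implied by the
# FGDRAM interface partitioning.

STACK_BW_GBPS = 1024        # 1 TB/s stack = 64 channels x 16 GB/s
GRAINS_PER_STACK = 512
ATOM_BYTES = 32
IO_PINS_PER_GRAIN = 2       # two data signals per grain (see "I/O signaling")
PIN_RATE_GBPS = 8           # 8 Gb/s PODL signaling (Section 3.5)

grain_bw_gbps = STACK_BW_GBPS / GRAINS_PER_STACK        # 2 GB/s per grain
t_burst_ns = ATOM_BYTES / grain_bw_gbps                 # 16 ns per 32B atom
pin_bw_gbps = IO_PINS_PER_GRAIN * PIN_RATE_GBPS / 8     # 2 GB/s from the PHY

print(grain_bw_gbps, t_burst_ns, pin_bw_gbps)           # 2.0, 16.0, 2.0
```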

Figure 6: QB-HBM Channel and FGDRAM Grain architecture. Arrows illustrate data movement within the two architectures.

Address/command interface. One command channel provides the commands to eight grains; there are 64 command channels for a 512-grain FGDRAM stack. As shown in Figure 6, the shared command logic sits between two physical banks to control the eight grains. Each command specifies the grain it targets. The command protocol is similar to a DDR/HBM2 interface with explicit activates, reads, and writes sent to the DRAM. Sharing a single command interface among eight grains does not degrade performance. The long tBURST required from a grain allows commands for each of the other grains to be sent before a given grain needs another command. The overall ratio of command to data bandwidth remains the same as HBM2 and the QB-HBM baseline.
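A quick accounting (our own sketch, anticipating the command-bandwidth numbers in Section 3.3) of why the eight-way shared command channel has exactly enough slots:

```python
# Illustrative sketch: column-command capacity of one shared command channel.
# Each grain needs at most one read/write per 16 ns burst, and one command can
# be accepted every 2 ns at the 500 MHz core clock.

T_BURST_NS = 16
GRAINS_PER_CMD_CHANNEL = 8
CMD_SLOT_NS = 2

col_cmds_needed = GRAINS_PER_CMD_CHANNEL      # one per grain per 16 ns window
col_cmd_slots = T_BURST_NS // CMD_SLOT_NS     # command slots in that window

print(col_cmds_needed, col_cmd_slots)         # 8 8 -> exactly balanced
```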

I/O signaling. As in the QB-HBM baseline, we are assuming a straightforward evolution of the existing HBM2 PHY technology to an 8 Gb/s Pseudo-Open Drain Logic (PODL) PHY operating at 1.2V, similar to those in GDDR5 [18]. This provides 4× the bandwidth of the existing 2 Gb/s HBM2 PHY with a similar signal count. As a result, each grain transfers data over just two data signals.

Figure 7: FGDRAM Grain. A bank is split into 2 grains, each with 2 pseudobanks.

3.2 Grain Architecture

One key objective guiding the architecture of a grain is reduced activation energy. Each grain in the FGDRAM architecture is the equivalent of a bank in HBM2 except that it has a private, serial I/O connection to the GPU. This section demonstrates how the existing HBM2 bank architecture can be modified to create pseudobanks in a grain, reducing the effective row-activation granularity.

Reducing row size. To reduce row size and the corresponding activation energy overhead, we leverage the bank architecture of the "subchannels" architecture described in [6]. This scheme partitions a bank into a number of semi-independent "subchannel" slices, each with a fraction of the original bank bandwidth. Importantly, each of these subchannels also can semi-independently activate a row that is a fraction of the baseline row size of the entire bank. We only use the bank architecture proposed in the subchannels paper, as the other aspects of the subchannels proposal are pertinent just in the context of a conventional HBM2 architecture. The interface partitioning and command bandwidth issues are not applicable to FGDRAM.

The subchannels architecture exploits the existing hierarchical composition of a DRAM row assert signal. In the HBM2 baseline (Figure 3(b)), on an activate, a MWL is driven across the entire subarray. The MWL, in turn, drives a local wordline (LWL) in each of the mats in the subarray by turning on each mat's local wordline driver (LWD). In the FGDRAM architecture, the assertion of the MWL leads to assertion of the LWLs in only a subset of the mats. As a result, the effective row size of an activation is reduced. Since only a subset of mats contain the active row, the bandwidth for a subchannel is correspondingly limited to a fraction of the original bank bandwidth.

FGDRAM pseudobanks. Like the QB-HBM baseline, the FGDRAM architecture delivers the required bandwidth by allowing many banks to be simultaneously active. There are 256 banks in a 64-channel QB-HBM stack, and each bank is capable of delivering 128 bits every 2 ns (8 GB/sec). Like current HBM2, the QB-HBM architecture keeps two banks active in each channel (via the bank-grouping technique described in Section 2.3). In contrast, the FGDRAM architecture allows all banks to be active simultaneously, transferring different DRAM atoms, since each bank has access to its own dedicated I/O channel.
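The effect of asserting only a subset of mats can be sketched as follows (our own illustration; the 4-mat slice comes from the description in the next paragraphs, and the assumption that activation energy scales with the number of asserted mats is ours, roughly consistent with Tables 2 and 3):

```python
# Illustrative sketch: effective row size and approximate activation energy
# when only a subset of a subarray's mats assert their LWLs on an activate.

HBM2_ROW_BYTES = 1024        # 1KB row activated across 16 mats
MATS_PER_SUBARRAY = 16
MATS_ASSERTED_FGDRAM = 4     # only 4 mats participate in an FGDRAM activation
HBM2_ACT_ENERGY_PJ = 909     # per 1KB activation (Table 3)

fraction = MATS_ASSERTED_FGDRAM / MATS_PER_SUBARRAY
effective_row_bytes = HBM2_ROW_BYTES * fraction              # 256 B
approx_act_energy_pj = HBM2_ACT_ENERGY_PJ * fraction         # ~227 pJ (approx.)

print(effective_row_bytes, round(approx_act_energy_pj))      # 256.0 227
```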

To match the 2 GB/s bandwidth of a grain, it is sufficient to involve only 4 mats in each grain, with each mat providing 8 bits per internal 500 MHz bank cycle. This is well matched to partitioning each bank using the subchannels bank architecture into 4 slices. Data is brought out of one of these slices in each grain over the 32-bit wide connection to the GSAs of the grain and pushed to the serialization I/O buffer over 8 internal cycles. The datapath is pipelined such that the external burst to the GPU can begin as soon as the first 32-bit section of the DRAM atom is received at the I/O buffer. For a single DRAM atom (32 Byte) read in the FGDRAM architecture, multiple column-select lines (CSLs) must be asserted sequentially, to select consecutive columns that store the entire DRAM atom within a slice. Instead of sending multiple column commands from the memory controller, the column address sent with the read command is incremented using a small 3-bit ripple carry adder inside the DRAM in successive cycles of a burst.

We call each such division of a grain a pseudobank, as shown in Figure 7. A pseudobank is somewhat different from a subchannel because all of the subchannels in a bank are potentially active at once, and interleaving of requests among traditional banks is required for good performance. Instead, the FGDRAM architecture divides a bank into 4 pseudobanks, but there are only 2 grains per bank. It is here that we take advantage of the fact that the total bank bandwidth in a baseline QB-HBM stack is twice what we need to deliver in our 1 TB/s FGDRAM stack. Each grain therefore consists of two pseudobanks. "Bank-level parallelism" (BLP) within a grain is achieved by allowing activates and precharges in one pseudobank, while a read or write operation is accessing the other. Thus, the FGDRAM architecture leverages the same technique to both reduce the effective row size and provide (pseudo)bank-level parallelism within a single bank.

Reducing data movement energy. As shown in Section 2.1, a large fraction of the HBM2 access energy in many applications is expended in moving data from the global sense amplifiers associated with a DRAM bank to the I/O pads. This energy is proportional to the distance the data must travel. Therefore, the FGDRAM grain architecture reduces this distance by providing each DRAM bank with adjacent local I/O. Figure 6 illustrates the reduction in data movement distance with FGDRAM. Data moves directly to an adjacent set of global sense amplifiers, and through the small, narrow multiplexer that selects which of two adjacent pseudobanks is providing data to be sent through the TSVs to the I/O PHYs. This is in contrast to the traditional QB-HBM approach, where the data from all four banks must be multiplexed onto a central shared bus before being sent to the appropriate TSVs down to the base-layer PHYs. Also, rather than a central TSV area in the middle of the die as QB-HBM requires, the TSV array is partitioned into two strips in the FGDRAM architecture. The QB-HBM design can't easily partition the TSV array in a similar manner because all the data from 4 banks must be muxed to a central shared bus, and splitting the group of 4 banks with a TSV array would force half of the per-bank data buses to traverse the TSV array to the mux, and then have the shared data bus deliver the selected data back to the TSVs. In FGDRAM, the data buses from a bank are directly routed to the immediately adjacent TSV area to be routed to the base layer. Adjacent to the TSV area on the base layer are the PHYs which connect to the host processor within the package.

Each grain is essentially a pair of DRAM pseudobanks along with an associated private serial I/O interface, and the grains can independently process different DRAM operations in parallel. By subdividing the DRAM die and constraining data movement (and to a certain extent command signals) to these small subdivisions, on-die data movement energy is significantly reduced.

3.3 Memory Controller Architecture

The FGDRAM architecture requires the memory controller to manage and schedule requests for a large number of independent grains. Rather than a single HBM2 command interface supporting two 16 GB/s channels, FGDRAM distributes this bandwidth among 16 grains. Because a group of 8 grains shares a command channel, there are twice as many command interfaces onto which the memory controller must schedule requests. The request rate on each command channel is one half that of the baseline HBM2 command interface. Thus, the memory controller must perform essentially the same amount of work per unit of bandwidth. Splitting the command interfaces allows a more energy efficient implementation in the DRAM. The commands are delivered near where they are needed and can be handled at the core 500 MHz clock rate. This architecture saves a small amount of data movement energy and clock power in the DRAM.

Command interface. The shared command interface is used to send commands to the eight associated grains. Just as in HBM2, separate row and column commands can be sent simultaneously. Each command specifies the 3-bit grain identifier it targets. Each command requires at least 2 ns to be transmitted across the interface. This latency allows each command to be processed at the low-speed 500 MHz internal DRAM core clock, thereby saving clock power in the DRAM command processing logic. The commands mirror those found in conventional HBM2. The commonplace commands like refresh, activate, precharge, read, and write (with optional auto-precharge) all specify which of the eight associated grains are targeted by the command. Some commands apply to all eight grains associated with a command channel; these commands are primarily associated with configuration or transitioning between low-power sleep/self-refresh modes.

Just as in a conventional DRAM, a number of timing parameters govern when certain commands can be sent relative to earlier commands. In FGDRAM, some of these timing parameters differ from the HBM2 baseline. For instance, because FGDRAM has a smaller DRAM row size and lower peak command issue rate, the tFAW (Four Activate Window) parameter due to power delivery constraints is effectively eliminated. Each command interface issuing an activate of a 256B row every 2 ns roughly matches the aggregate activation rate in number of bytes that an HBM2 die can sustain under the constraints of its tFAW parameter. The long 16 ns tBURST associated with each grain also simplifies the scheduling in the memory controller. With two pseudobanks per grain, and just two requests per activate, FGDRAM can hide all activation/precharge latencies and keep its interfaces 100% busy. While the memory controller must manage tracking requests to a number of grains and pseudobanks, these are relatively simple, directly indexed structures. Deep associative queues required to find multiple row-buffer hits are much less important in the FGDRAM architecture, saving memory controller complexity.

One complexity FGDRAM introduces to the memory controller is the need to ensure that two different rows in different pseudobanks within the same subarray are never simultaneously activated. The logic to track this is straightforward, but is new relative to the baseline HBM2 controller. Careful memory address layout and address swizzling makes this situation relatively unlikely to occur, but it must be prevented by delaying the activate or precharging the other pseudobank.

Command bandwidth. While the FGDRAM architecture has more independent command interfaces, the total aggregate bandwidth, both in command rate and raw I/O bandwidth of these interfaces, is the same as HBM2 relative to the total data bandwidth. In other words, a 1 TB/s FGDRAM stack has 4× more command I/O bandwidth and the memory controller must support 4× the command issue rate of an HBM2 stack. A new read or write command can only be sent to a grain once every 16 ns due to the length of the data burst on the per-grain data lines. Because eight grains share a command channel, a total of 8 read or write commands can be sent on this command channel every 16 ns. The worst-case scenario for command bandwidth is the case of one read or write command per activate. The row-command interface requires more than 2 ns to send an activate command due to the long row address. The 8:1 ratio of grains to command channels creates a bottleneck in this situation, but it is similar to the difficulty of keeping the bus utilized in HBM2 with only one access per activate. The FGDRAM architecture is balanced to sustain 100% utilization with two 32B accesses per activated 256B row, as long as each grain has requests distributed among both pseudobanks.

3.4 ECC

The current HBM2 architecture supports ECC bits stored within the DRAM array and sent across the interface. Each 32B DRAM burst is accompanied by 32 bits of additional ECC data sent on 16 additional data pins per channel. The QB-HBM architecture assumes a similar arrangement, sending the 32 additional bits per burst on 4 additional data signals per channel. This enables single-error correct, double-error detect (SECDED) on each of the two 64-bit words of a burst. ECC generation and checking is handled by the host processor.

With FGDRAM, storing the ECC data in the array is straightforward. As pointed out in [6], in structures like the narrow pseudobanks, additional ECC data is stored in slightly wider mats with 576 columns. Sending additional data in FGDRAM on extra pins is difficult, since adding a single pin per grain would be a 50% overhead. As a result, two options are possible. First, the ECC data can be sent over additional cycles on the existing data bus. This would require the data I/Os to operate at 9 Gb/s rather than 8 Gb/s to compensate for having fewer data pins than an ECC-enabled HBM2-style interface. The energy overheads of these approaches are roughly equivalent (though there is somewhat lower I/O energy if the data+ECC burst is sent on fewer pins at a slightly faster rate). Alternatively, if operating the data I/Os significantly faster is not a possibility, moving to in-DRAM ECC generation/checking like that employed in LPDDR4 [5] would provide protection within the DRAM array without the necessity to transfer the ECC data across the interface. The long burst lengths in FGDRAM are well suited to this approach by providing a minimum access granularity to prevent the need for read-modify-writes to update the ECC data. A short 4-bit CRC code is then applied to each data transfer to detect errors on the I/O interface for retry. This 4-bit CRC can detect all 3-bit or fewer errors, and requires the I/O to operate only slightly faster at 8.125 Gb/s to transmit this incremental CRC data. The energy overheads of this approach are expected to be minor, with the I/O energy savings compensating for the ECC generation and checking being performed in a less power-efficient DRAM process than the host processor.
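The two I/O data rates quoted above follow directly from the extra bits added to each burst; the sketch below is our own derivation (names are ours), not part of the proposal itself.

```python
# Illustrative sketch: pin data rates needed to carry ECC or CRC overhead over
# the grain's data pins within the same 16 ns burst window.

BASE_RATE_GBPS = 8.0
DATA_BITS_PER_BURST = 32 * 8          # 32B atom = 256 bits

def rate_with_overhead(extra_bits: int) -> float:
    """Pin rate needed to move data plus overhead in the same burst time."""
    return BASE_RATE_GBPS * (DATA_BITS_PER_BURST + extra_bits) / DATA_BITS_PER_BURST

print(rate_with_overhead(32))   # 9.0   Gb/s: 32 ECC bits sent inline with the data
print(rate_with_overhead(4))    # 8.125 Gb/s: 4-bit CRC per transfer for link retry
```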
3.5 I/O Alternatives

HBM2 relies on simple unterminated signaling across a silicon interposer with small-geometry, high-density wiring. While these I/Os require very little energy, they have very short reach; the small-geometry wiring exhibits a physical bandwidth that is inversely proportional to the square of the length, similar to on-chip wiring. The HBM2 PHYs are located in a stripe near the middle of the base layer die and must be placed within a few millimeters of the processor. In practice, the I/O currently employed by HBM2 can travel roughly 5 to 7 mm on an interposer while maintaining unterminated signaling at 2 Gb/s. This distance depends strongly on the thickness, width and spacing of the wires. For example, a copper wire that is 1.5 µm thick, 1.0 µm wide, and spaced 2.5 µm from neighboring wires is limited to about 5 mm for 2 Gb/s unterminated signaling. Thicker and wider wires that use increased spacing can reach longer distances, but the achievable bandwidth density along the edge of the processor will decrease proportionally.

Since a key premise of the FGDRAM architecture is the placement of the DRAM PHYs close to the banks spread across the DRAM die, any practical I/O signaling technology must efficiently move data at least a centimeter from the far side of a DRAM die. Furthermore, this signaling technology should be more efficient over this distance than the on-die data transport energy; otherwise, any savings of on-die data transport energy simply becomes increased I/O energy.

Pseudo-Open Drain Signaling. In the baseline evaluation, we assume a conservative design similar to high-speed GDDR5 1.2V terminated pseudo-open drain logic (PODL) signaling technology. These existing DRAMs support 8 Gb/s data rates over the required distances, so it is a straightforward baseline. Since this paper focuses primarily on the DRAM architecture, we use this conservative I/O as the baseline in our evaluations.

Ground-Referenced Signaling. A promising alternative signaling technique is Ground-Referenced Signaling (GRS) [34]. This type of single-ended signaling has several advantages over PODL. For example, the current consumed by the line driver is essentially constant, except for a small ripple at the data rate. Since this ripple is at very high frequencies, it is easily filtered by on-chip bypass capacitance. The fact that this current is not data-dependent helps to mitigate simultaneous switching output (SSO) noise, which is often a limiting factor for single-ended memory interfaces [37]. This eliminates the need to use DBI coding [39] and the associated signals on the data buses. In addition, GRS only uses the ground network to complete the signal return path by signaling "about" ground, which forces the signaling current to flow in the tight loop created by the lane wire and the nearest ground conductor. This also helps to reduce common-impedance return-path noise since the return path is well-defined and ground is almost always the lowest impedance network. Also, the ground network is easily the best reference for a single-ended interface because it is low impedance and common to all. Another important advantage of the GRS transceiver system is that the entire path for strobe and data are matched to have identical delay and delay-sensitivities. This ensures the relative timing between strobe and data is preserved in the presence of process, temperature, and voltage variation. It also means that the relative timing between strobe and data tracks together in the presence of power supply noise; this has been demonstrated at signaling rates up to 20 Gb/s with PHYs built in a logic process.

Higher Bandwidth Signaling. These signaling technologies may enable faster data rates than the 8 Gb/s assumed in this paper. Supporting higher data rates would allow fewer PHYs and signals between the processor and the DRAM. Ideally, the wire bandwidth would be sufficient to allow adequate memory bandwidth to be routed directly over an organic package substrate, enabling more traditional (less expensive) multi-chip module (MCM) assembly and obviating the requirement for assembly on a silicon interposer. Because the wires on an organic package have virtually distance-independent physical bandwidth (up to several centimeters), this is a viable approach. In other words, when using high-speed signaling over a terminated link on an organic package, energy consumption does not depend on wire length. This is due to the fact that the wire is relatively short (e.g., less than 40 mm) and frequency-dependent attenuation is primarily due only to the parasitic capacitances of the micro-bump pads, ESD protection devices, and I/O devices at each end of the link. The additional high-frequency attenuation from the losses in the longest package traces is less than 1 dB (i.e., 10%) for signaling rates up to 20 Gb/s, even when using low-cost packaging technology.

This can enable systems that require many DRAM devices (or stacks) co-populated with a processor in a package. Currently, it is impossible to use unterminated 2 Gb/s signaling over a silicon interposer and reach a second DRAM device on the far side of a closer DRAM device. The problem is that the bandwidth of the wires that connect the distant DRAM devices is significantly lower because they must be approximately 3× longer than the wires that connect the adjacent DRAM devices. Signaling over an organic substrate enables larger packages with more stacks of DRAM than current HBM2 systems.

The benefits of these alternative signaling technologies could apply to the alternative QB-HBM baseline as well. Thus, we consider only the conservative PODL signaling in our subsequent evaluation of the proposed FGDRAM architecture.

3.6 Design Assumptions

While we describe a 1 TB/s FGDRAM design with certain assumptions regarding the underlying process technology, the architectural approach is generally applicable under a range of different technology scaling assumptions. In general, we compare to a QB-HBM baseline that assumes identical aspects of a contemporary DRAM process. If different assumptions are made, they generally have a similar consequence in both our proposed FGDRAM design and the QB-HBM baseline.

TSV scaling. We assume that we are able to drive signals through the TSVs at 4× the data rate of contemporary HBM2 DRAMs. As a result, we need the same number of signal TSVs as current HBM2 designs for both the QB-HBM and FGDRAM architectures. This assumption may be aggressive, but if additional TSVs are required, the impact would affect QB-HBM and FGDRAM equally. Future process technologies may reduce the TSV pitch, enabling more TSVs in a smaller area. Again, any area benefit of this approach would apply equally to FGDRAM and QB-HBM.

Stack height. We assume a 4-high DRAM stack in our analysis. It may be practical and economical to construct stacks with 8 or more devices going forward. Adding additional bandwidth to a stack via additional dies requires more TSVs and additional PHYs on the base layer. If additional dies in a stack are added only to provide additional capacity, then an approach similar to placing multiple devices on a shared DRAM bus can be used. Additional dies in a stack act like another rank, with multiple dies sharing the TSVs much like a conventional DRAM bus. There may be some impact on peak data rates in this scheme, as there will be additional capacitive loading on the TSV channels. This may push tall stacks towards having more TSVs running at lower data rates to compensate.

Non-stacked DRAMs. All of the advantages of the FGDRAM architecture would apply to a non-stacked DRAM as well. A single DRAM die built using the FGDRAM design could have the DRAM PHYs in the strips where the TSV arrays are located in the proposed stacked design. This approach can allow further efficient bandwidth scaling of traditional GDDR-class memories. Coupled with high-speed, distance-efficient I/O technologies like GRS, it could enable systems with large package substrates to simply use many non-stacked DRAMs to provide the bandwidth required, eliminating the overheads of stacked DRAMs.

4 METHODOLOGY

4.1 Simulation Details

GPU Model: We simulate a modern GPU system based on the P100 chip [30] (configured as shown in Table 1) on a detailed GPU and memory simulator. We model the details of the compute and graphics pipelines, as well as the interconnect and caches. The caches are sectored (32B DRAM atom) for higher efficiency [35].

Table 1: GPU Configuration
#SMs, Warps/SM, Threads/Warp:    60, 64, 32
L2 (size, assoc., line, sector): 4MB, 16-way, 128B, 32B
DRAM:                            QB-HBM or FGDRAM

Memory Controller and DRAM: The baseline memory controller model is optimized for harvesting maximum bandwidth and thus deploys deep request buffers and aggressive request reordering to find row hits, batched write draining based on watermarks to reduce write-to-read and read-to-write turnarounds, and an address mapping policy designed to eliminate camping on banks and channels due to pathological access strides (similar to the baseline in [7]). The structural differences between QB-HBM and FGDRAM are reflected in the hierarchical composition of their building blocks and in the timing parameters in Table 2. The memory controller's internal state machine and timing manager model these changes, and also model the fact that the FGDRAM command interface is arbitrated between eight grains.
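As one hedged illustration of what such an address mapping can look like, the sketch below shows a generic XOR-folded channel/bank index; it is not the paper's actual policy, and the bit positions and function names are purely our own choices.

```python
# Illustrative sketch (not the paper's policy): XOR-folding higher address bits
# into the channel and bank index spreads power-of-two strides, avoiding
# "camping" of all accesses on one channel or bank.

def map_address(addr: int, num_channels: int = 64, num_banks: int = 4,
                atom_bytes: int = 32):
    """Return (channel, bank, upper-address remainder) for a physical address."""
    atom = addr // atom_bytes
    channel = (atom ^ (atom >> 7) ^ (atom >> 13)) % num_channels
    bank = ((atom // num_channels) ^ (atom >> 11)) % num_banks
    upper = atom // (num_channels * num_banks)
    return channel, bank, upper

# A 64KB stride no longer lands every access on the same (channel, bank) pair.
print({map_address(i * 65536)[:2] for i in range(8)})
```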

Workloads: We evaluate memory intensive regions of 26 CUDA applications from the Rodinia [8] and Lonestar [4] suites, exascale workloads [1, 10, 14, 27, 29, 43] (CoMD, HPGMG, lulesh, MCB, MiniAMR, Nekbone), a hidden convolution layer from GoogLeNet [40], as well as two well-known memory-bound applications with disparate access patterns, STREAM and GUPS [11], to show the effect of our proposals on the spectrum of applications executed by a GPU. We also present results for 80 graphics workloads representing modern game engines and rendering pipelines (both shader- and render-backend-generated traffic is simulated). These applications are from the domains of professional graphics and games.

Table 2: DRAM Configurations
Category                        HBM2        QB-HBM      FGDRAM
channels/die (4-die stack)      4 (16)      16 (64)     128 (512)
banks/channel                   16          4           2 pseudobanks
pseudobanks/bank                N/A         N/A         4
row-size/activate               1KB         1KB         256B
data I/Os/die (4-die stack)     256 (1024)  256 (1024)  256 (1024)
datarate/pin                    2 Gb/s      8 Gb/s      8 Gb/s
bandwidth/channel or grain      16 GB/s     16 GB/s     2 GB/s
bandwidth/die                   64 GB/s     256 GB/s    256 GB/s
bandwidth/4-die stack           256 GB/s    1 TB/s      1 TB/s
common timing parameters        tRC=45, tRCD=16, tRP=16, tRAS=29, tCL=16,
(in ns unless specified)        tRRD=2, tWR=16, tFAW=12, tWTRl=8, tWTRs=3, tWL=2 clks
tBURST                          2           2           16
tCCDL                           4           4           16
tCCDS                           2           2           2
activates in tFAW               8           8           32

Table 3: DRAM Energy
Component                       HBM2        QB-HBM      FGDRAM
Row activation (pJ)             909         909         227
Pre-GSA data movement (pJ/b)    1.51        1.51        0.98
Post-GSA data movement (pJ/b)*  1.17        1.02        0.40
I/O (pJ/b)*                     0.80        0.77        0.77
* at 50% activity
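A simplified way to combine the Table 3 components into an energy-per-bit estimate is sketched below. This is our own rough model, not the detailed methodology of Section 4.2 below: the 96B-per-activate figure and linear toggle-rate scaling are illustrative assumptions, and the printed values are examples rather than the paper's measured application averages.

```python
# Illustrative sketch: rolling Table 3 components into pJ/bit. Activation energy
# is amortized over the bits read from each activated row; post-GSA and I/O
# terms (listed at 50% activity) are scaled linearly with the toggle rate.

def energy_pj_per_bit(act_pj, pre_gsa, post_gsa, io,
                      bytes_accessed_per_activate, toggle_rate=0.5):
    act_per_bit = act_pj / (bytes_accessed_per_activate * 8)
    return act_per_bit + pre_gsa + (post_gsa + io) * (toggle_rate / 0.5)

# Example: ~3 DRAM atoms (96B) used per activated row, 50% toggle rate.
print(energy_pj_per_bit(909, 1.51, 1.02, 0.77, 96))   # QB-HBM, ~4.5 pJ/bit
print(energy_pj_per_bit(227, 0.98, 0.40, 0.77, 96))   # FGDRAM, ~2.4 pJ/bit
```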
4.2 Energy and Area Model

For estimating the energy and area of HBM2, QB-HBM and FGDRAM devices, we use the methodology of previous work on subchannels [6], which was based on the model from [44]. In essence, we use detailed floorplans of DRAM dies and devices to estimate the areas of different blocks and the lengths of various datapath components traversed by bits from a DRAM cell to the GPU's pins, estimate the capacitive loading on these components using DRAM technology parameters for a 28nm node (scaled from 55nm [44]), and use appropriate switching activities to obtain the energy consumed to precharge a bank and activate a row, and in performing reads and writes. Table 3 enumerates the energy consumed for different operations for the different DRAM types. The row-activation energy, which is the sum of precharge and activate energies, differs based on the DRAM row activation granularity. The datapath energy has three components: i) traversing the LDLs and MDLs from the row-buffer to the GSAs (pre-GSA), ii) traversing the path from the GSAs to the DRAM I/Os (post-GSA), and iii) traversing the I/O channel between the DRAM and GPU. The first component is not data dependent as the LDLs and MDLs are precharged to a middle voltage before every bit transfer, the second depends on the data toggling rate (Table 3 shows the value at 50% toggle rate), and their sum is the on-die data movement energy. Note that while FGDRAM has much lower overall data movement energy compared to HBM2 by design, even QB-HBM has more efficient data movement than HBM2 due to the reduced distance between the arrays and the I/O blocks. The final component is the I/O energy. In HBM2 it is influenced by the data switching activity, but in QB-HBM and FGDRAM the energy of the high-speed I/Os is determined by the termination energy and thus depends primarily on the number of 1 values in the data (Table 3 shows the values at 50% 1s).

5 RESULTS

This section compares the energy-efficiency and performance of two iso-bandwidth DRAM systems: a QB-HBM stack and an FGDRAM stack. We also analyze the incremental area that would be required by these designs over current HBM2 devices.

5.1 Energy Improvement

Figure 8 shows the total DRAM energy consumed per bit by GPU compute applications using QB-HBM and FGDRAM. The compute workloads are grouped into two categories, with the first set (dmr to pathfinder) containing applications that use a small fraction (less than 60%) of the aggregate DRAM bandwidth, and the second consisting of memory intensive applications (GUPS to STREAM) which are more likely to be DRAM power (= energy-per-bit × bandwidth) limited. We find that the FGDRAM architecture significantly reduces energy across all the workloads by simultaneously reducing both the activation and data movement energies.

Naturally, the reduction in activation energy is most beneficial for applications with low row locality (sssp, MCB, dmr, GUPS, nw, bfs, sp, kmeans, MiniAMR). Similar to classic GUPS, applications dmr, sssp, sp, bfs and MCB have low intrinsic row-buffer locality as they perform many sparse data-dependent loads - i.e., pointer chasing. On the other hand, kmeans, nw, and MiniAMR suffer from inter-thread interference at the row-buffers, and therefore have low effective row locality as a natural consequence of highly threaded processors [6]. The FGDRAM architecture reduces the fundamental activation granularity to 256 Bytes from 1 KB in QB-HBM. This reduces row-overfetch, and activation energy across the entire benchmark suite by 65% on average.

On the other hand, data movement energy is also either the largest or, at the least, a very significant fraction of overall energy in QB-HBM for many of the workloads. While data movement energy varies between applications depending on their data toggle rates (e.g., HPGMG has higher data movement energy compared to srad_v2 in QB-HBM), it is primarily determined by the distance the data must travel between the sense-amplifiers and the I/O pads. By reducing this distance, FGDRAM reduces the average data movement energy by 48%. This key benefit of the architecture is effective across the board for all applications, but is most significant for those memory intensive applications where data movement energy dwarfs the activation energy due to high row locality (streamcluster, mst, HPGMG, STREAM, LULESH). Previous work that focused solely on activation energy reduction ([6, 9, 12, 32, 42, 45]) is ineffective at addressing this important energy bottleneck.

Figure 8: DRAM Access Energy Per Bit (lower is better). Applications are divided into two groups based on their bandwidth utilization in the baseline QB-HBM system. Within each group, the applications are sorted in descending order of energy-per-bit in the QB-HBM baseline.

On the other hand, data movement energy is either the largest or, at the least, a very significant fraction of overall energy in QB-HBM for many of the workloads. While data movement energy varies between applications depending on their data toggle rates (e.g., HPGMG has higher data movement energy than srad_v2 in QB-HBM), it is primarily determined by the distance the data must travel between the sense-amplifiers and the I/O pads. By reducing this distance, FGDRAM reduces the average data movement energy by 48%. This key benefit of the architecture is effective across the board for all applications, but is most significant for those memory intensive applications where data movement energy dwarfs the activation energy due to high row locality (streamcluster, mst, HPGMG, STREAM, LULESH). Previous work that focused solely on activation energy reduction ([6, 9, 12, 32, 42, 45]) is ineffective at addressing this important energy bottleneck.

The I/O energies are the same for both the QB-HBM and FGDRAM architectures (as explained in Section 4). Using GRS technology for the I/O in FGDRAM and QB-HBM would slightly increase the I/O energy component from 0.43 pJ/bit to 0.54 pJ/bit, but would enable the benefits offered by GRS as outlined in Section 3.5.

Overall, FGDRAM consumes on average only 1.95 pJ/bit, a 49% improvement over QB-HBM, which consumes 3.83 pJ/bit. Notably, FGDRAM is able to reduce the DRAM energy consumption of applications at both ends of the row locality spectrum by addressing all sources of energy consumption in a holistic manner. Consequently, both GUPS and STREAM are able to meet the energy target of 2 pJ/bit needed for future high-bandwidth systems.

Figure 9 shows the per-access energy for graphics applications. Graphics applications tend to have higher row-buffer locality than compute applications, and thus lower energy in the QB-HBM baseline. Overall, FGDRAM reduces the baseline DRAM energy by 35%, primarily by reducing the data movement energy.

Figure 9: DRAM energy consumed by graphics applications (gaming, rendering, and professional graphics) (lower is better). Applications are sorted in descending order of energy consumption in the QB-HBM baseline system.

5.2 Performance

Figure 10 demonstrates the performance of a GPU with FGDRAM normalized to a baseline GPU with an iso-bandwidth QB-HBM system. Depending on access characteristics, applications either see improved or unchanged performance with FGDRAM. Memory intensive benchmarks that suffer from frequent row conflicts in QB-HBM and access only a few bytes per activated row (GUPS, nw, bfs, sp, kmeans and MiniAMR) are helped by FGDRAM's i) increased concurrency, which allows more requests to be overlapped in time across different pseudobanks in a grain and across the grains on a die, and ii) higher row activate rates made possible by the smaller activation granularity, i.e., more activates in the same tFAW period. These two factors significantly increase the performance of such irregular memory-intensive applications: GUPS (3.4×), nw (2.1×), bfs (2.1×), sp (1.6×), kmeans (1.6×) and MiniAMR (1.5×).
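The activate-rate argument above can be sanity-checked directly against the Table 2 parameters; the arithmetic below is ours, not simulator output.

```python
# Activate-rate arithmetic from the Table 2 parameters.
tfaw_ns = 12
configs = {
    "QB-HBM": dict(activates_per_tfaw=8,  row_bytes=1024),
    "FGDRAM": dict(activates_per_tfaw=32, row_bytes=256),
}
for name, c in configs.items():
    rate = c["activates_per_tfaw"] / tfaw_ns   # activate commands per ns
    act_bw = rate * c["row_bytes"]             # bytes of row activated per ns
    print(f"{name}: {rate:.2f} activates/ns, {act_bw:.0f} B of row activated per ns")

# Both devices can activate ~683 GB/s worth of row data, but FGDRAM issues
# 4x as many activate commands in the same tFAW window.  For workloads that
# touch only a few bytes per row, the activate *command* rate, not the
# activated data footprint, is the limiter, hence the speedups listed above.
```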
On the other hand, memory-intensive applications with regular access patterns (STREAM, streamcluster, LULESH) that utilized a large fraction of the QB-HBM bandwidth see very little change in performance when using the iso-bandwidth FGDRAM system, as they continue to enjoy high utilization of the same available bandwidth. The small benefits observed in these applications with FGDRAM are due to a secondary benefit of the highly partitioned FGDRAM architecture, which allows each grain to independently process reads and writes. This means that the write-to-read turnaround penalty on a grain is overlapped by data transfers from other grains, leading to more effective utilization of the same aggregate bandwidth compared to QB-HBM.

In general, applications that are not memory intensive remain unaffected by the change in memory technology. One exception is MCB, which is heavily bank-limited in the QB-HBM baseline. FGDRAM alleviates this bottleneck by using area-efficient pseudobanks, which improves bank-level parallelism, and hence performance, compared to the iso-bandwidth QB-HBM. Across the simulated CUDA workloads, FGDRAM improves performance by 19% over an iso-bandwidth future QB-HBM system.

This outcome also shows that the increase in burst latency with FGDRAM (16 ns) over that of QB-HBM (2 ns), and the consequent increase in the empty-pipe memory access latency, has no impact on the performance of GPU applications. In theory, for a 1 TB/s memory system, we need an extra 109 128-byte full cache line requests to cover the additional bandwidth-delay product resulting from the 14 ns increase in the unloaded memory latency. This would mean an additional 2.8% more warps for an NVIDIA P100 machine that already has 3840 warps. In practice, however, by increasing parallelism through finer-grained channels that alleviate bank contention, FGDRAM reduces the queuing delay encountered by memory intensive applications in the QB-HBM architecture. This lowers the average DRAM access latency by 40% across the simulated workloads. In fact, a GPU already spends several hundred nanoseconds to fetch a request from DRAM [3]. The minor increase in burst latency to achieve better bandwidth utilization is easily justifiable in a throughput architecture like a GPU, where most of the memory latency comes from queuing delay rather than the unloaded DRAM latency.
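The extra-request figure quoted above follows from a straightforward bandwidth-delay-product calculation, reproduced here (assuming, as the text does, one outstanding 128-byte request per additional warp).

```python
# Reproducing the bandwidth-delay-product arithmetic quoted above.
bandwidth_bytes_per_s = 1e12     # 1 TB/s memory system
extra_latency_s = 14e-9          # 14 ns higher unloaded latency (16 ns - 2 ns burst)
request_bytes = 128              # full cache line request

extra_in_flight_bytes = bandwidth_bytes_per_s * extra_latency_s   # 14,000 B
extra_requests = extra_in_flight_bytes / request_bytes            # ~109 requests

warps_p100 = 3840                # concurrently resident warps on a P100-class GPU
extra_warp_fraction = extra_requests / warps_p100                 # ~2.8%

print(round(extra_requests), f"{extra_warp_fraction:.1%}")        # 109, 2.8%
```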


Figure 10: Performance normalized to QB-HBM baseline (higher is better). Applications are divided into two groups based on their bandwidth utilization in the baseline QB-HBM system.

Graphics applications running on GPUs leverage tiled accesses to improve cache utilization and compression to reduce DRAM bandwidth demand. As a result, most graphics applications are unable to fully utilize the bandwidth of the baseline QB-HBM system. In fact, even applications like raytracing, which one could presume to have sparse, random access patterns owing to incoherent rays, are heavily optimized to maximally utilize the on-die SRAM memory structures and produce sequential DRAM access streams that have high row-buffer locality [2]. Consequently, none of the graphics applications are activation rate limited, and they perform similarly on the two iso-bandwidth systems, with less than 1% difference between QB-HBM and FGDRAM.

5.3 DRAM Area

To analyze the area overhead of the QB-HBM and FGDRAM stacks relative to an HBM2 stack, we use the detailed area model outlined in Section 4.

QB-HBM overhead: The QB-HBM (and FGDRAM) I/Os are operated at 4× the datarate of the HBM2 I/Os, and as a result there is no additional TSV area overhead. Also, the total number of banks on a die remains unchanged between QB-HBM and HBM2 (64), but the banks are rearranged so that each QB-HBM channel has 4 banks. To provide 4× the bandwidth of HBM2, 4× more banks have to be accessed in parallel in QB-HBM, which requires a proportional increase in GSA count and in the wiring that connects the GSAs to the I/O buffers. These two factors increase the HBM2 area by 3.20% and 5.11%, respectively. The additional control logic needed to manage the increased number of channels can be placed under the global wires, and only a small (0.26%) area is needed for additional decoding logic. Overall, the QB-HBM die is 8.57% larger than an HBM2 die.

FGDRAM overhead: The FGDRAM stack has the same bandwidth as QB-HBM and, thus, has an equal number of TSVs and GSAs. Like QB-HBM, FGDRAM has 3.20% more area than HBM2 due to the GSAs alone. However, since the data wires from the pseudobanks only have to be routed to local data TSVs, the routing is considerably simpler and thus the routing channel area overhead for data wires is almost negligible. A consequence of the reduced routing area is that the area for the additional control logic is not overlapped with these wires and increases the die size by 3.41% relative to HBM2. The next set of overheads for FGDRAM comes from the pseudobank organization. These overheads (3.47% over HBM2) are similar to those required for subchannels [6], and are due to extra LWD stripes, address decoupling latches, and control routing from the grain periphery to the pseudobanks. Overall, the FGDRAM stack is 10.36% larger than an HBM2 stack, and only 1.65% larger than the iso-bandwidth QB-HBM baseline.

The area overheads are derived with the assumption that the I/O and TSV frequency can be scaled to provide higher per-pin bandwidth in QB-HBM and FGDRAM compared to HBM2. If the TSV frequency cannot be increased, then both the baseline QB-HBM and FGDRAM would require 4× the number of TSVs to deliver the data from the banks to the I/O circuitry on the base die. Without any improvements in TSV pitch, this would make a QB-HBM die 23.69% larger than an HBM2 die, and our proposed FGDRAM die would be 1.45% larger than QB-HBM.
Figure 11: Average DRAM access energy per bit of the baseline QB-HBM, the enhanced QB-HBM+SALP+SC, and the FGDRAM architectures.

5.4 Comparison to Prior Work

Two significant drawbacks of the QB-HBM baseline system are the limited number of banks per channel (required to limit area overhead), which restricts bank-level parallelism, and the relatively large 1 KB row-size. Two previous proposals, subarray-level parallelism (SALP) [26] and subchannels (SC) [6], can address these shortcomings. We modify a bank in QB-HBM to allow access to all row buffers belonging to the 32 constituent subarrays. Note that adjacent subarrays share a sense-amplifier stripe, which limits parallelism somewhat. We also vertically segment the row and bank datapath into subchannels, so that the minimum activation granularity is 256 Bytes. This configuration, QB-HBM+SALP+SC, thus incorporates the best of prior proposals with the most feasible future scaling of the HBM architecture. The QB-HBM+SALP+SC approach has the area overheads of both QB-HBM (4× data wires and GSAs) as well as the overhead for maintaining access to each subchannel and subarray in a bank. Consequently, QB-HBM+SALP+SC is 3.2% larger than the QB-HBM baseline and 1.54% larger than our FGDRAM proposal.

Through our simulations, we found that the overlapping of activations and precharges to different subarrays from SALP, together with the increased activation rate possible from the smaller rows due to subchannels, improves the performance of the baseline QB-HBM system to nearly identical levels as FGDRAM for all benchmarks. A few benchmarks, like streamcluster, have slightly improved performance over FGDRAM due to the slightly higher available bank-level parallelism from SALP. Overall, QB-HBM+SALP+SC has 1.3% better performance than FGDRAM. However, the energy efficiency of QB-HBM+SALP+SC is worse than that of FGDRAM. As shown in Figure 11, it fails to meet the desired energy target of 2 pJ/b necessary to enable a 4 TB/s memory system. The subchannel technique reduces the minimum activation granularity of QB-HBM+SALP+SC to 256 bytes, and SALP additionally reduces the row conflict rate by keeping multiple rows open in a bank simultaneously. These mechanisms lead to 74% lower activation energy than QB-HBM (compared to a 65% reduction with FGDRAM). Without any reduction in data movement energy, however, QB-HBM+SALP+SC is only able to improve energy by 23% over QB-HBM. This result falls well short of the 49% savings provided by FGDRAM. The benefits of SALP are orthogonal to FGDRAM and could be applied to our proposed architecture as well if the incremental benefits warranted the additional area and complexity. To our knowledge, no prior work has focused on reducing the data movement energy in DRAM devices.
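One way to read this comparison is as an Amdahl-style limit on activation-only optimizations. The sketch below infers the average activation share of QB-HBM energy from the reported 74%/23% result; that share is our inference, not a number stated in the text.

```python
# Amdahl-style view of why activation-only optimizations plateau.
# The 74%/23% and 65%/49% figures are reported in this section; the
# activation share of total QB-HBM energy is *inferred* from them.

salp_sc_act_reduction, salp_sc_total_saving = 0.74, 0.23
fgdram_act_reduction, fgdram_total_saving = 0.65, 0.49

# QB-HBM+SALP+SC only reduces activation energy, so:
#   total_saving = activation_share * act_reduction
activation_share = salp_sc_total_saving / salp_sc_act_reduction   # ~0.31

# Even eliminating activation energy entirely would cap savings near this share.
ceiling = activation_share * 1.0

# FGDRAM's remaining savings must therefore come from the non-activation
# (data movement + I/O) portion of the energy.
non_activation_saving = fgdram_total_saving - activation_share * fgdram_act_reduction

print(f"activation share ~{activation_share:.0%}, "
      f"activation-only ceiling ~{ceiling:.0%}, "
      f"data-movement/I/O contribution to FGDRAM savings ~{non_activation_saving:.0%}")
```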
6 RELATED WORK

Previous work on DRAM energy reduction [9, 12, 28, 38, 42, 45] focused solely on reducing activation energy. Since, on average, more than 50% of the DRAM access energy is attributable to on-die data movement, these techniques achieve only a small fraction of the energy efficiency of FGDRAM. The FGDRAM architecture readily lends itself to reducing the row-size by allocating lower-bandwidth I/Os to individual banks, which allows using the recently proposed area-efficient subchannels technique [6] to complement FGDRAM's low data movement energy.

Some DRAMs do have somewhat reduced data movement energy. The LPDDR4 die architecture [22], where the I/O pads for the two channels on the die are placed at opposite edges, sees some reduction in data movement energy as a side-effect of the split interface. The Hybrid Memory Cube (HMC) [17] splits the DRAM die into several vaults, with the banks stacked above each other within the die stack. This technique reduces the data movement within a vault since the data traveling from the far banks must only move through a few short TSVs. Overall, however, there is little savings since the data from a vault must travel through several buffers in the vault controller on the base layer and is then routed through a network on the base layer to the appropriate I/O interface, finally traversing the serial channel on the PCB to the processor. These overheads make the HMC solution (10 pJ/b [33]) less energy-efficient than FGDRAM. More importantly, scaling the bandwidth of HMC beyond its current capability will require addressing the same set of challenges as faced by HBM today. The FGDRAM architecture provides a roadmap for bandwidth scaling that can thus benefit the HMC as well.

7 CONCLUSION

Future GPUs and throughput CPUs that demand multiple TB/s of bandwidth require higher bandwidth and more energy-efficient DRAMs. Our proposed FGDRAM architecture addresses bandwidth scalability in a performant and area-efficient manner by unlocking the internal bandwidth of each bank in the stack. At the same time, we address the most significant sources of DRAM energy consumption: data movement energy and activation energy. Partitioning the DRAM into independent grains, each with adjacent local I/O, significantly reduces data movement energy within the DRAM die. Furthermore, we leverage an area-efficient technique to reduce the DRAM row size, saving activation energy. Synergistically, this technique also allows us to provide a mechanism to overlap activations and accesses within a grain inside a single DRAM bank. Overall, we reduce DRAM energy to 1.97 pJ/bit, a 49% improvement over an improved 4× bandwidth HBM2 variant. We show that the increase in concurrency within FGDRAM devices can improve performance relative to an iso-bandwidth QB-HBM system, particularly for activate-rate limited workloads. Overall, the proposed FGDRAM architecture provides a solution for the energy-efficient, high-bandwidth DRAM required for exascale systems and a range of other applications.
REFERENCES
[1] M. F. Adams, J. Brown, J. Shalf, B. V. Straalen, E. Strohmaier, and S. Williams. 2014. HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems. Technical Report. Lawrence Berkeley National Laboratory. LBNL-6630E.
[2] T. Aila and T. Karras. 2010. Architecture Considerations for Tracing Incoherent Rays. In Proceedings of High Performance Graphics.
[3] M. Andersch, J. Lucas, M. Alvarez-Mesa, and B. Juurlink. 2015. On Latency in GPU Throughput Microarchitectures. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). 169–170.
[4] M. Burtscher, R. Nasre, and K. Pingali. 2012. A Quantitative Study of Irregular Programs on GPUs. In Proceedings of the International Symposium on Workload Characterization (IISWC). 141–151.
[5] S. Cha, S. O, H. Shin, S. Hwang, K. Park, S. J. Jang, J. S. Choi, G. Y. Jin, Y. H. Son, H. Cho, J. H. Ahn, and N. S. Kim. 2017. Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA).
[6] N. Chatterjee, M. O'Connor, D. Lee, D. R. Johnson, S. W. Keckler, M. Rhu, and W. J. Dally. 2017. Architecting an Energy-Efficient DRAM System For GPUs. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA).
[7] N. Chatterjee, M. O'Connor, G. H. Loh, N. Jayasena, and R. Balasubramonian. 2014. Managing DRAM Latency Divergence in Irregular GPGPU Applications. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC).
[8] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. 2009. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proceedings of the International Symposium on Workload Characterization (IISWC). 44–54.

[9] E. Cooper-Balis and B. Jacob. 2010. Fine-Grained Activation for Power Reduction in DRAM. IEEE Micro 30, 3 (May/June 2010), 34–47.
[10] Coral 2014. CORAL Benchmarks. https://asc.llnl.gov/CORAL-benchmarks/. (2014).
[11] J. Dongarra and P. Luszczek. 2005. Introduction to the HPC Challenge Benchmark Suite. ICL Technical Report ICL-UT-05-01. (2005).
[12] H. Ha, A. Pedram, S. Richardson, S. Kvatinsky, and M. Horowitz. 2016. Improving Energy Efficiency of DRAM by Exploiting Half Page Row Access. In Proceedings of the International Symposium on Microarchitecture (MICRO).
[13] Q. Harvard and R. J. Baker. 2011. A Scalable I/O Architecture for Wide I/O DRAM. In Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS).
[14] M. A. Heroux, D. W. Doerfler, Paul S. Crozier, J. M. Wilenbring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter, H. K. Thornquist, and R. W. Numrich. 2009. Improving Performance via Mini-applications. Sandia Report SAND2008-5574. (2009).
[15] Intel. 2016. An Intro to MCDRAM (High Bandwidth Memory) on Knights Landing. (2016). https://software.intel.com/en-us/blogs/2016/01/20/an-intro-to-mcdram-high-bandwidth-memory-on-knights-landing.
[16] D. James. 2010. Recent Advances in DRAM Manufacturing. In Proceedings of the SEMI Advanced Semiconductor Manufacturing Conference. 264–269.
[17] J. Jeddeloh and B. Keeth. 2012. Hybrid Memory Cube – New DRAM Architecture Increases Density and Performance. In Symposium on VLSI Technology.
[18] JEDEC. 2009. JEDEC Standard JESD212: GDDR5 SGRAM. JEDEC Solid State Technology Association, Virginia, USA.
[19] JEDEC. 2012. JESD79-4: JEDEC Standard DDR4 SDRAM. JEDEC Solid State Technology Association, Virginia, USA.
[20] JEDEC. 2013. JEDEC Standard JESD235: High Bandwidth Memory (HBM) DRAM. JEDEC Solid State Technology Association, Virginia, USA.
[21] JEDEC. 2014. GDDR3 Specific SGRAM Functions in JEDEC Standard JESD21-C: JEDEC Configurations for Solid State Memories. JEDEC Solid State Technology Association, Virginia, USA.
[22] JEDEC. 2014. JESD209-4: Low Power Double Data Rate 4 (LPDDR4). JEDEC Solid State Technology Association, Virginia, USA.
[23] JEDEC. 2015. JEDEC Standard JESD235A: High Bandwidth Memory (HBM) DRAM. JEDEC Solid State Technology Association, Virginia, USA.
[24] JEDEC. 2016. JEDEC Standard JESD232A: Graphics Double Data Rate (GDDR5X) SGRAM Standard. JEDEC Solid State Technology Association, Virginia, USA.
[25] B. Keeth, R. J. Baker, B. Johnson, and F. Lin. 2008. DRAM Circuit Design - Fundamental and High-Speed Topics. IEEE Press.
[26] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu. 2012. A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM. In Proceedings of the International Symposium on Computer Architecture (ISCA). 368–379.
[27] S. Layton, N. Sakharnykh, and K. Clark. 2015. GPU Implementation of HPGMG-FV. In HPGMG BoF, Supercomputing.
[28] Y. Lee, H. Kim, S. Hong, S. Hong, and S. Kim. 2017. Partial Row Activation for Low-Power DRAM System. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA).
[29] J. Mohd-Yusof and N. Sakharnykh. 2014. Optimizing CoMD: A Molecular Dynamics Proxy Application Study. In GPU Technology Conference (GTC).
[30] NVIDIA. 2016. NVIDIA Tesla P100 Whitepaper. (2016). https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf.
[31] NVIDIA. 2017. NVIDIA GeForce GTX 1080: Gaming Perfected. (2017). http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf.
[32] S. O, Y. H. Son, N. S. Kim, and J. H. Ahn. 2014. Row-Buffer Decoupling: A Case for Low-Latency DRAM Microarchitecture. In Proceedings of the International Symposium on Computer Architecture (ISCA). 337–348.
[33] T. Pawlowski. 2011. Hybrid Memory Cube (HMC). In HotChips 23.
[34] J. Poulton, W. Dally, X. Chen, J. Eyles, T. Greer, S. Tell, J. Wilson, and T. Gray. 2013. A 0.54pJ/b 20Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28nm CMOS for Advanced Packaging Applications. IEEE Journal of Solid-State Circuits 48, 12 (December 2013), 3206–3218.
[35] M. Rhu, M. Sullivan, J. Leng, and M. Erez. 2013. A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures. In Proceedings of the International Symposium on Microarchitecture (MICRO). 86–98.
[36] T. Schloesser, F. Jakubowski, J. v. Kluge, A. Graham, S. Selsazeck, M. Popp, P. Baars, K. Muemmler, P. Moll, K. Wilson, A. Buerke, D. Koehler, J. Radecker, E. Erben, U. Zimmerman, T. Vorrath, B. Fischer, G. Aichmayr, R. Agaiby, W. Pamler, and T. Scheuster. 2008. A 6f2 Buried Wordline DRAM Cell for 40nm and Beyond. In Proceedings of the International Electron Devices Meeting (IEDM). 1–4.
[37] R. Schmitt, J.-H. Kim, W. Kim, D. Oh, J. Feng, C. Yuan, L. Luo, and J. Wilson. 2008. Analyzing the Impact of Simultaneous Switching Noise on System Margin in Gigabit Single-Ended Memory Systems. In DesignCon.
[38] Y. H. Son, S. O, H. Yang, D. Jung, J. H. Ahn, J. Kim, J. Kim, and J. W. Lee. 2014. Microbank: Architecting Through-Silicon Interposer-Based Main Memory Systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC).
[39] M. R. Stan and W. P. Burleson. 1995. Bus-Invert Coding for Low-Power I/O. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 3, 1 (March 1995), 49–58.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going Deeper With Convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).
[41] K. Tran and J. Ahn. 2014. HBM: Memory Solution for High Performance Processors. In Proceedings of MemCon.
[42] A. N. Udipi, N. Muralimanohar, N. Chatterjee, R. Balasubramonian, A. Davis, and N. Jouppi. 2010. Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores. In Proceedings of the International Symposium on Computer Architecture (ISCA). 175–186.
[43] O. Villa, D. R. Johnson, M. O'Connor, E. Bolotin, D. Nellans, J. Luitjens, N. Sakharnykh, P. Wang, P. Micikevicius, A. Scudiero, S. W. Keckler, and W. J. Dally. 2014. Scaling the Power Wall: A Path to Exascale. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC).
[44] T. Vogelsang. 2010. Understanding the Energy Consumption of Dynamic Random Access Memories. In Proceedings of the International Symposium on Microarchitecture (MICRO). 363–374.
[45] T. Zhang, K. Chen, C. Xu, G. Sun, T. Wang, and Y. Xie. 2014. Half-DRAM: A High-Bandwidth and Low-Power DRAM System from the Rethinking of Fine-Grained Activation. In Proceedings of the International Symposium on Computer Architecture (ISCA). 349–360.