EFFICIENT DESIGN AND COMMUNICATION FOR

3D STACKED DYNAMIC MEMORY

An Undergraduate Research Scholars Thesis

by

ANDREW DOUGLASS

Submitted to the Undergraduate Research Scholars program at Texas A&M University in partial fulfillment of the requirements for the designation as an

UNDERGRADUATE RESEARCH SCHOLAR

Approved by Research Advisor: Dr. Sunil Khatri

May 2017

Major: Computer Engineering

TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGMENTS

NOMENCLATURE

CHAPTER

I. INTRODUCTION

History of DRAM Architectures
Rambus DRAM
Graphics DRAM
Current DIMM Issues
3D Stacked Memory

II. RING BASED MEMORY

Skew Cancellation for HBM
Massed Refresh
Simultaneous Multi Layer Access
Ring-Based Interconnect

III. SIMULATION RESULTS

16nm Process
3D Ring Clock
Current DRAM Speeds
Low Power Ring Memory
High Performance Ring Memory

IV. CONCLUSION

Importance
Future Research

REFERENCES

APPENDIX: MEMORY SIMULATION TABLES

ABSTRACT

Efficient Design and Communication for 3D Stacked Dynamic Memory

Andrew Douglass Department of Electrical and Computer Engineering Texas A&M University

Research Advisor: Dr. Sunil Khatri Department of Electrical and Computer Engineering Texas A&M University

As memory sizes increase and processors continue to get faster, memory becomes an increasing bottleneck to system performance. To mitigate slow DRAM memory chip speeds, a new generation of 3D stacked DRAM will allow lower power consumption and higher bandwidth. To communicate between these chips, this paper proposes the use of ring based standing wave oscillators for fast data transfer. With a fast clocking scheme, multiple channels can share the same bus to reduce TSVs while maintaining similar memory latencies.

Simulations with the new clocking scheme and data transfers are performed to show the improvements that can be made in memory communication. Experimental results show that a ring based clocking scheme can achieve two times the speed of current stacked memory chips.

Variations of this clocking scheme can also provide half the power consumption at comparable speeds. These ring-based architectures allow higher memory speeds without a significant increase in hardware complexity. This allows the ring-based memory architecture to trade off power, throughput, and latency to improve system performance for different applications.


ACKNOWLEDGEMENTS

I would like to thank my mentor, Dr. Sunil Khatri, for his guidance and continual support throughout the course of this research project. I would also like to thank Abbas Fairouz for his help with SPICE simulations and the time he dedicated to my questions.

Thanks also go to my friends, colleagues, and the faculty and staff for making my time at Texas A&M University a great experience. I also want to extend my gratitude to the staff of the Undergraduate Research Scholars program, which provided many training sessions and assistance to all students involved in the program.

Finally, I want to thank my mother and father for being role models to me and always supporting my goals.

NOMENCLATURE

DRAM Dynamic Random Access Memory (DRAM) is a form of volatile memory that uses a transistor and a capacitor to store bits of information.

JEDEC The Joint Electron Device Engineering Council (JEDEC) is an organization made up of many industry leaders who create the standards for DRAM memory.

DIMM Dual In-line Memory Module (DIMM) is a printed circuit board that contains multiple random access memory chips connected to a large bus with separate electrical contacts on each side of the module.

SO-DIMM Small Outline Dual In-line Memory Module (SO-DIMM) is a smaller version of DIMM that is used for devices where space is limited (e.g. Laptops, Tablets, and Routers).

DMC Dynamic Memory Controller (DMC) is the hardware device that translates addresses, queues requests, and communicates with memory modules.

MC Memory Chips (MCs) are integrated circuits found on the DIMM printed circuit board that contain memory arrays to store data as bits.

Channel An independent bus that communicates between the DMC and the DIMM, which often consists of a data portion, a control portion, and an address portion.

Figure 1: Multi-channel DRAM architecture.

Rank A set of DRAM memory chips that share a control bus and operate simultaneously to provide the total bits for the data bus (shown in Figure 2).


Figure 2: Both sides of a DIMM with ranks labeled.

Bank A subset of memory arrays that operate independently to allow pipelining of requests to different banks.

Bit-line Bit-lines represent columns in a memory array that fill with the value from a capacitor during reading and drive to the desired value when writing (shown in Figure 3).

Word-line Word-lines represent rows in a memory array that drive to a high voltage when reading from or writing to the specified row (shown in Figure 3).

Figure 3: DRAM array bit-lines and word-lines.

DDR Double Data Rate (DDR) is a computer bus that transfers data on both the positive edge and negative edge of the clock signal.

Sense Amps A circuit used to detect small voltage changes on the bit-lines when reading data to determine what state (1 or 0) the specified capacitor was storing.

Refresh The process of occasionally reading data into the sense amps and immediately writing that data back to the capacitors to prevent the loss of information.

Skew Skew is the maximum time difference that it takes for a signal to arrive at one endpoint compared to another endpoint in the same distribution network.

H-Tree Named for the shape it resembles, it is a clock distribution scheme used to minimize clock skew between its end points by maintaining similar delays through each signal path.

I/O Bus Input/Output (I/O) bus is the set of pins on the DIMM that carry the data to the module during writes and from the module during reads.

PLL Phase-locked loop (PLL) is a control system that outputs a signal whose phase is the same as the input reference signal.

SWO Standing Wave Oscillator (SWO) is a type of oscillator circuit with an inverter pair on one end and a short on the other to create a clock signal with the same phase at every point along its resonant wires.

CHAPTER I

INTRODUCTION

Though cheaper and larger, forms of non-volatile memory have long latencies. This can severely slow down the time it takes for a system to execute a desired program. To overcome this issue, forms of volatile memory store data temporarily so that the CPU can access it quickly. As processors improve, memory has continued to lag behind and cause bottlenecks to system performance. To alleviate this issue, clocking for the I/O data bus has increased with each generation of new DDR memories to obtain larger memory transfers in the same time frame.

Though this proves effective for smaller burst lengths, larger burst lengths waste bandwidth by transferring data that will not be used. This likelihood increases as memory sizes continue to grow and multi-core systems become more prevalent. As a result, different computer system applications use different DRAM architectures. These designs specialize and vary in power consumption, throughput, and cost. The following sections explain the history of these designs and their weaknesses. Using this knowledge, later sections will look at solutions to the issues that currently plague memory designs.

History of DRAM Architectures

DRAM has been one of the fastest growing aspects of modern computing devices since its first forms in the 1960s. One of the first mainstream designs was the Fast Page Mode DRAM (FPM DRAM). This architecture involved keeping the row access strobe (RAS) active while continuously signaling the column access strobe (CAS) to read a series of data from the same row (shown in Figure 4). By keeping a row active in the MC, the latency needed to open that row is incurred only once for every set of consecutive column addresses. Though simple in nature, this design provided significant speed increases with almost no added cost to the devices.

Figure 4: FPM DRAM read operation [1].

Following the era of FPM DRAM came Extended Data Out DRAM (EDO DRAM), which added a series of latches following each column multiplexer. This allowed the memory module to hold the data at the output pins while a concurrent column address travels across the address pins (shown in Figure 5). This slightly faster version started to take over FPM DRAM in 1995 and quickly led to the introduction of the Burst EDO DRAM (BEDO DRAM).

Similar to EDO DRAM, BEDO DRAM would burst four consecutive column addresses worth of data given the starting column (shown in Figure 6). This again increased the data transfer rate by incurring the CAS latency only once before data became available for each strobe of the CAS line. This method relies on the use of spatial locality to increase overall throughput of a particular process.


Figure 5: EDO DRAM read operation [1].

Figure 6: BEDO DRAM read operation [1].

As seen from the developments in DRAM that have been discussed so far, each new design is inexpensive to implement but yields tremendous performance advantages [1]. These low-cost changes have become increasingly rare since the introduction of Synchronous DRAM (SDRAM), which was found in most modern computing systems by the turn of the 21st century. Unlike its predecessors, SDRAM used a single-edged clock to synchronize the information that travels between the DIMM and the DMC (shown in Figure 7). Initial designs of SDRAM were slower than BEDO DRAM, but the introduction of a clock allowed reliable communication with the DMC. This led to the modern DDR DRAM architectures that currently dominate the market.

Figure 7: SDRAM read operation [1].

As the name suggests, DDR SDRAM differs from SDRAM by passing data on both the positive and negative edges of the clock. This effectively doubled the rate of data transfer without having to double the clock speed. By using both edges, DDR avoids placing higher clock-frequency requirements on the connection between the DMC and the DIMM.

Since the introduction of DDR SDRAM in 2000, three more generations of DDR have each shown improvements in the data transfer rate. To achieve these improvements, engineers have created a standard architectural layout with specified timing parameters for communication.

Using techniques like phase-locked loops (PLLs), engineers are able to ensure faster timing with reliable accuracy. By clocking the I/O bus faster, and making the prefetch larger in each generation, engineers have been able to increase memory speeds despite traditionally slow MCs.

Figure 8: DDR SDRAM read operation [1].

Rambus DRAM

One of the main competitors to SDRAM in the 1990s was Rambus DRAM (RDRAM).

Developed by Rambus Inc. for almost two decades, these memory modules had a unique narrow bus compared to the traditional large dedicated busses in SDRAM [2]. The simple RDRAM interconnect, between the DMC and the DIMM, communicated at significantly faster speeds.

The smaller data bus, however, meant that address, data, and control information would all need to share the same lines. To accomplish this, RDRAM uses a protocol similar to network packets, where DRAM chips are responsible for deciphering each packet. These complex DRAM chips not only increase cost but also create more complexities in the internal DIMM hardware.

Although certain system controllers, such as the i850, used the Rambus architecture, it eventually faded out of production in favor of the lower cost DDR architecture [2, 3].

Graphics DRAM

Graphics DDR SDRAM (GDDR SDRAM) is a memory architecture similar to DDR, but specifically designed to communicate with the GPU. Instead of placing MCs on a DIMM, MCs are placed directly on the PCB surrounding a graphics processor. With a larger data bus than DDR interfaces, GDDR achieves higher throughput using a multi-channel architecture. Each modern GDDR5 DRAM MC provides up to 8Gbps [4]. By placing 12 MCs around the GPU, a traditional 6-channel GDDR5 configuration can achieve upwards of 84Gbps. Some graphics cards, such as the AMD R9 290 series, have up to 16 MCs placed around the GPU but only maintain about 80Gbps due to their slower 5Gbps MCs [5]. By using a wider bus, GDDR5 is able to alleviate the processing bottleneck for higher end graphics programs. Figure 9 shows a modern GDDR5 layout where each MC provides 32 bits to form 6 independent channels.

With the success of GDDR5 and the growing concerns over memory bottlenecks, JEDEC finalized a new standard called GDDR5X [6]. This memory standard doubled the prefetch size from 8n to 16n and effectively allowed 512-bit transfers per memory access. GDDR5X also contains MCs with speeds up to 12Gbps [7]. Provided as a cheaper alternative to 3D stacked memories, GDDR5X can be found in graphics cards like the Nvidia GeForce GTX 1080 Ti. To improve on this technology, manufacturers have proposed GDDR6, with up to 14Gbps per MC [8]. Though projected for a 2018 release date, new graphics cards often take longer than expected to reach the market.


Figure 9: GDDR5 top view based on Nvidia's GeForce GTX Titan Black [9].

Current DIMM Issues

Scalability

Since the introduction of the first DDR memory module, engineers have increased transfer speeds by increasing the data rate of a DIMM's I/O bus. As a result, the internal DRAM hardware has seen few changes with new generations of DDR SDRAM (shown in Table 1). To hide the slow internal DRAM core, each generation of DDR has increased the prefetch length. This means more data returns for each column and row address that travels to the DIMM. DDR4 modules use bank groups to divide the prefetch into two groups of eight. These bank groups act independently to achieve faster data rates than DDR3 while using the same slow DRAM cores [10, 11]. Though this technique has been successful in the past, larger prefetch architectures will not only be difficult to achieve, but will likely transfer data that goes unused.

Table 1: DDR MC clock versus I/O data bus clock [11, 12, 13, 14].

DDR Generation Internal Memory Chip Clock (MHz) I/O Data Bus Clock (MHz)

DDR 100 – 200 100 – 200

DDR2 100 – 266 200 – 533

DDR3 100 – 266 400 – 1066

DDR4 133 – 266 1066 – 2133

Pin Count

DRAM cells have decreased in price since the introduction of DDR SDRAM. This allows engineers to pack more memory cells into a module without creating large price tags. Packaging costs, on the other hand, are not decreasing at similar rates. This means that pin count has become a critical contributor to the manufacturing costs of DIMMs [1]. In current architectures, reducing pin count to reduce costs would decrease memory bandwidth. However, pin count also plays a large role in power consumption and heat dissipation.

Heat and Power Consumption

DIMMs are not designed to consume significant amounts of power or dissipate a lot of heat. In fact, they are considered relatively simple hardware devices because they rely heavily on the DMC to tell them what actions to perform. DIMMs will consume more power and dissipate more heat as they become larger and their hardware becomes more complex. Additionally, Fully Buffered DIMMs (FB DIMMs) also increase heat dissipation due to the extra registers required to buffer incoming data [1]. With larger DIMM sizes, more FB DIMMs, and faster I/O clocking, DRAM systems become large contributors to power consumption and heat dissipation. Growing power and heat issues threaten system designs like blade servers, which rely on shared cooling and power devices within an enclosure.

Refresh Rate

As DIMMs continue to grow in size, they are capable of storing significant amounts of memory. According to JEDEC standards, rows containing memory need to be refreshed every 64ms to prevent data loss (32ms for temperatures over 85°C) [15, 16]. That means that larger memories will need to spend more time refreshing rows in the same time frame. With increasing chip density and more rows in a refresh bundle, the time spent refreshing cells becomes a significant problem for memory performance. This is counterproductive to the advancements made in data transfer rates with future DDR generations. Figure 10 shows the proportional relationship between the capacity of the memory and the time spent refreshing cells. Note that these values are doubled when the temperature is above 85°C.

Figure 10: Time spent refreshing memory for temperatures below 85°C, growing from roughly 1.4% of the time for a 1Gb chip to approximately 18% for a 32Gb chip [16].
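To make this trend concrete, the refresh overhead can be approximated as tRFC/tREFI, where tREFI is the 64ms retention window divided by the 8192 refresh commands required by JEDEC and tRFC is the time each refresh command occupies the chip. The sketch below uses illustrative DDR4-class tRFC values (assumptions for this example, not the exact data behind Figure 10):

```python
# Rough refresh-overhead estimate: the chip is unavailable for tRFC out of
# every tREFI interval. tREFI = 64 ms / 8192 refresh commands (halved above
# 85 degrees C). The tRFC values below are illustrative DDR4-class
# assumptions, not the exact data behind Figure 10.

T_REFI_NS = 64_000_000 / 8192          # ~7812.5 ns between refresh commands

ASSUMED_TRFC_NS = {"4Gb": 260, "8Gb": 350, "16Gb": 550}   # assumed refresh command times

for density, trfc_ns in ASSUMED_TRFC_NS.items():
    overhead = trfc_ns / T_REFI_NS
    print(f"{density}: ~{overhead:.1%} of time refreshing (~{2 * overhead:.1%} above 85C)")
```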

Commodity Status

DIMMs have been marked as a commodity compared to other components within a computer system. As a commodity item, cost plays a major role when considering different DRAM architectural decisions. This can affect memory speeds as engineers trade off complexity, power consumption, and the resulting cost. In order to make a new architectural design competitive in today's market, the designer needs to prove that the faster speeds are worth the higher price tag.

3D Stacked Memory

To alleviate many of the issues discussed above, industry leaders have developed new DRAM architectures involving stacked MCs. Unlike traditional DDR memories, where each MC connects directly to the PCB, 3D memories involve placing MCs on top of each other to create a "stack". This allows similar memory sizes to occupy a significantly smaller footprint and to communicate using through-silicon vias (TSVs). By using TSVs instead of the traditional pin connections, 3D stacked memories have larger bus widths, and higher channel counts, to achieve higher throughput.

High Bandwidth Memory (HBM) is a version of 3D stacked memory developed by AMD and Hynix. Designed to communicate with GPUs, it targets high throughput by using a 1024-bit wide data bus spread across 8 channels (128 bits per channel). By stacking the chips, 1GB of HBM takes approximately 35mm² of space compared to 672mm² in a traditional GDDR5 architecture [17]. With a smaller footprint, these 3D stacks are placed on an interposer with the processor (shown in Figure 11). The close proximity, higher chip count, and wide data bus allow HBM to achieve over 100GB/s per stack [17, 18, 19].


Figure 11: HBM cross sectional view [17].

HBM had limited shipping with AMD's Radeon R9 Fury, Nano, and Fury X graphics cards before development quickly began on its replacement, HBM2. Unlike its predecessor, HBM2 can stack up to 8 dies as opposed to HBM's 4. HBM2 also introduced pseudo channels to break individual channels into two sets of sub-channels (shown in Figure 12). In what is called legacy mode (the mode used in HBM), each read/write command transfers 256 bits over 2 cycles. Using pseudo channel mode (the mode used in HBM2), the "128-bit bus is split into 2 individual 64-bit segments" [20] that still transfer 256 bits but reduce the four-bank activation window time [1, 20]. In summary, HBM2 provides up to 256GB/s per stack compared to HBM's 128GB/s and can offer up to 8GB per stack compared to HBM's 4GB [21, 22]. The recent Nvidia Quadro GP100 graphics card incorporated 4 stacks of HBM2 for a total memory size of 16GB. AMD's Vega graphics cards are expected to use HBM2 as well when they are released later in 2017.


Figure 12: HBM2 pseudo channels overview [21, 22].

Continuing their push forward, industry leaders plan to release HBM3 between 2019 and 2020. With up to 16 stacked dies and 16GB per stack, HBM3 promises to double the data transfer rate to 512GB/s per stack [23]. These advancements in 3D stacked memory pose future power issues as stacks grow in size and more TSVs are used to increase bandwidth. To avoid the potential power and bandwidth walls, this paper proposes the use of 3D ring based SWOs to achieve an efficient communication scheme. The results show that using this scheme can reduce TSVs and power consumption while maintaining speeds similar to current HBM standards.

Wide I/O Memory

Wide I/O Memory is a 3D stacked memory backed by Samsung that specializes in low power applications. Designed with the mobile market in mind, Wide I/O originally promised stacks of DRAM placed directly on top of the processor. This creates significant thermal and power issues, since the heat from the processor must not cause memory failure. Instead, most architectural layouts still use an interposer like HBM. Wide I/O promised up to 17GB/s, but failed to make it into production [24].

Wide I/O 2 Memory builds on its predecessor by improving bandwidth to 68GB/s and minimizing power consumption. Unlike the 512-bit data bus used by Wide I/O, Wide I/O 2 increases the bus width to match HBM at 1024 bits [25]. Targeting high-end low power applications, Wide I/O 2 became a JEDEC standard in 2014. To date, this 3D stacked memory has yet to make it into production, partly due to its high price tag for a mobile market.

CHAPTER II

RING BASED MEMORY

Over the past two decades, researchers have proposed many solutions to the current memory bottlenecks. The developments of HBM and WideIO promise significant upgrades over current DIMM speeds but engender new issues. One issue is distributing a low skew clock to all DRAM dies to increase speeds with minimal power consumption. Another problem is that refreshing memory continues to increase latencies and causes longer queue lengths in the DMC. Finally, TSVs have limited clocking speeds, so 3D memories must rely on higher channel counts to achieve higher throughput. In order to increase speeds in future 3D memory generations, research papers have explored solutions to these issues.

Skew Cancellation for HBM

A paper released last year by Ahn and Yoo, from Hanyang University, explored a solution to reduce skew between 32 DQ lines used for the data bus [26]. Increases in skew will reduce the window for valid data and result in decreased bandwidth. To achieve this lower read skew rate, they use buffers on the logic die that are adjusted by successive approximation register logic. For write skew, they form loopback paths on the logic die by controlling the buffer delays.

After cancellation, they are able to achieve 15 times less skew with a difference of 18ps still left between the DQ lines [26]. This is an important aspect of HBM as less skew will result in faster clocking and therefore faster memory speeds.

Massed Refresh

Another paper, published by Thakkar and Pasricha from Colorado State University, proposes a technique to reduce refresh time for Hybrid Memory Cubes. This is another form of 3D stacked memory, developed by Intel and other memory-based companies. The paper suggests using subarray-level and bank-level concurrency to reduce the time spent refreshing cells [27]. By mapping multiple DRAM rows to subgroups, which are refreshed concurrently, significant time is saved refreshing the memory. This reduces power consumption by 5.8% and improves throughput by 6.3% [27].

Simultaneous Multi Layer Access

A technical report published by professors at Carnegie Mellon University proposed an IO organization to increase memory bandwidth [28]. They used the internal bandwidth of multiple layers to increase IO speeds. By time-multiplexing across DRAM dies, the authors claim up to 55% improvements in performance for multi-programmed workloads. The most important factor, however, is that it maintains low costs by leveraging internal global bit-lines instead of adding external global bit-lines [28]. The report was published in 2015, and performance measurements are based on HBM standards instead of the higher bandwidth HBM2.

Ring-Based Interconnect

As mentioned in the previous summaries on 3D stacked memory research, there are a number of proposed techniques to improve the communication between DRAM chips. In this paper, the focus is on ring-based SWOs to provide a low skew, fast clock to all memory layers.

The SWO uses two clock wires connected on one side by a Mobius Crossing and an inverter pair (shown in Figure 13). Extraction points along the ring contain differential amplifiers (clock recovery circuits) to obtain a clock from the standing wave. Because of the Mobius Crossing, a virtual crossing exists at the midpoint of the ring opposite the inverter pair. The differential signal seen near this virtual crossing is not large enough to extract a fully differential clock. This creates a "dead-zone" where clock recovery circuits cannot exist. As the resonant ring gets faster, the size of the "dead-zone" increases. The virtual crossing also creates a phase change, where the positive and negative inputs to the differential amplifier flip to ensure all DRAM dies see the same rising edge.

Figure 13: Standing wave ring clock [29, 30].

The memory architecture using a ring-based topology (MART) utilizes both a 3D resonant SWO and a series of data lines. These wires intersect each channel to form insertion-extraction stations (IE stations), where data is injected or removed from the rings. To ensure that all IE stations receive a full amplitude clock, and no station falls in the "dead-zone", longer traces are placed at the top of the stack (shown in Figure 14 right). This increases the total length of the SWO, but maintains minimal skew compared to traditional clocking schemes. Because of the increased SWO length, the TSVs used for the ring need to be wider to reduce resistance. This allows the ring to oscillate at faster speeds despite the increased length. Figure 14 shows a traditional HBM stack (left) compared to a MART stack (right) where the dotted line shows the division between channels on the same DRAM die. The 212-bit bus for the traditional HBM stack represents all lines for a single channel [20].

Figure 14: Cross sectional view of HBM (left) and MART (right) for a 4GB stack.

With each rotation of the SWO, addresses and data travel between IE stations. Because there are 9 total IE stations, the clock needs to oscillate at 18GHz to provide a round trip time similar to traditional HBM TSVs (0.5 nanoseconds). The IE stations are equidistant from one another to ensure that information arrives within the desired period. The stations contain a series of flip-flops to buffer the data and address lines between DRAM chips. The differential amplifiers that extract the full amplitude clock from the ring drive the clock input to these flip-flops. A series of multiplexers and asynchronous FIFOs are used in each IE station to communicate between the channels and the 3D data rings. This is an area of ongoing research, and this paper assumes that the IE stations operate at the desired ring frequency.
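As a quick sanity check of this requirement (a minimal sketch using only the numbers quoted above, assuming one IE-station hop per clock cycle), the following computes the ring frequency needed to match the 0.5 nanosecond round trip time:

```python
# Minimal sketch: the ring advances data by one IE station per clock cycle,
# so a full revolution takes num_ie_stations cycles. Values are taken from
# the text; the helper itself is illustrative, not part of the thesis tools.

def required_ring_frequency_ghz(num_ie_stations: int, round_trip_ns: float) -> float:
    cycle_time_ns = round_trip_ns / num_ie_stations   # time available per hop
    return 1.0 / cycle_time_ns                        # GHz, since time is in ns

print(required_ring_frequency_ghz(num_ie_stations=9, round_trip_ns=0.5))  # ~18 GHz
```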

Control lines and the address bus also utilize the ring and the IE stations. This allows the address and die selection to move consistently with the data during write commands. The data is then stored temporarily in the IE station while the bit-lines and corresponding row activate to write the data. Three control lines are added to select which of the 8 channels the address/data corresponds to. By sharing the address and control lines, the number of TSVs decreases to reduce power consumption and manufacturing costs.

By using the energy recycling capabilities of the ring, this high frequency clock consumes less power while maintaining high transfer rates. The ring relies on parasitic capacitance and inductance to operate. To create a predictable environment for the clock, shield planes are used to provide the dominant ground capacitances in relation to the clock wires. It is important to note that the 3D ring is still a TSV, where the cylindrical clock wires travel through the silicon of each die to connect each section together. The Mobius Crossing and inverter pair are located on the logic die to allow the longer traces at the top of the stack to accommodate the "dead-zone". Figure 15 shows the dimensions of the ground plane and clock TSVs needed to achieve speeds of 18GHz.


Figure 15: Cross-sectional view of 3D clock wires.

For 8GB HBM stacks, manufacturers grind the substrate down to maintain heights similar to 4GB HBM stacks. Each DRAM die in an 8GB stack represents an independent channel to create an 8-channel architecture similar to the 4GB stacks. The channels, however, are larger and consist of 1GB each compared to 512MB. These larger stacks typically only exist in servers and high performance computers. Table 2 shows the heights for all HBM stacks according to the JEDEC standard, while Figure 16 shows the traditional HBM configuration and the MART configuration for 8GB stacks.

Table 2: HBM stack heights [19].

Stack Size Minimum (μm) Typical (μm) Maximum (μm)

2GB / 4GB / 8GB 695 720 745


Figure 16: Cross sectional view of HBM (left) and MART (right) for an 8GB stack.

CHAPTER III

SIMULATION RESULTS

The memory architecture using a ring-based topology (MART) was tested using three main simulation tools. HSPICE provided electronic simulations to ensure the clock operated at the desired frequency. Raphael provided the capacitance, inductance, and resistance extraction for the clock wires. Finally, using the timing parameters obtained from the previous two simulation tools, Ramulator provided DRAM simulations using CPU traces [31]. Ramulator is a C/C++ open source program provided by the SAFARI group at Carnegie Mellon University. It has pre-built DRAM models for DDR3, LPDDR3, DDR4, LPDDR4, GDDR5, WideIO, WideIO2, and HBM. Though this program gives only an estimate of memory performance, it presents useful insight into the operation and utility of different memory architectures.

16nm Process

The simulations performed in this paper used the 16nm PTM fabrication process provided by Arizona State University [32]. Before simulating the ring, the high-to-low and low-to-high propagation delays were measured for a minimum size inverter in an inverter chain. By varying the width of the PMOS in the inverter, a width of 44nm was chosen to equalize these delays. This was critical to ensure that the inverter pair in the ring switches at the same rate for each clock rotation. Figure 17 shows the HSPICE results with the NMOS length at 16nm, the NMOS width at 32nm, the PMOS length at 16nm, and the PMOS width varied from 20nm to 80nm.


Figure 17: 16nm process delay equalization.

3D Ring Clock

In order for the ring to operate, the capacitance, resistance, and inductance of the clock wires were extracted using Raphael. After obtaining the desired values, a distributed model of the clock wires was constructed in HSPICE. This distributed model utilized 24 individual T-sections that evenly distributed the resistance, capacitance, and inductance across the length of the ring. As mentioned in Table 2, the typical height of a 2GB, 4GB, and 8GB HBM stack is 720μm [19]. Thus, the length of the ring must be at least 2200μm for two reasons. First, the ring is double the height of the stack because it must connect to all 8 channels. Second, the extra trace at the top of the stack must be at least 33% of the total ring length to accommodate the "dead-zone". This leaves 1474μm of the ring capable of extracting a full differential clock, which meets the requirement for the stack height.
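The ring-length constraint can be checked the same way; the sketch below (an illustrative calculation, assuming the 33% dead-zone fraction and 720μm stack height quoted above) recovers both the minimum total length and the usable portion of a 2200μm ring:

```python
# Illustrative sizing check for the SWO length. The usable (full-amplitude)
# portion must span the stack twice (up one side, down the other), and the
# dead-zone consumes 33% of the total ring length.

STACK_HEIGHT_UM = 720.0      # typical HBM stack height (Table 2)
DEAD_ZONE_FRACTION = 0.33    # fraction of the ring unusable for clock recovery

usable_needed_um = 2.0 * STACK_HEIGHT_UM                        # 1440 um
minimum_length_um = usable_needed_um / (1.0 - DEAD_ZONE_FRACTION)
chosen_length_um = 2200.0                                       # value used in the thesis
usable_at_chosen_um = chosen_length_um * (1.0 - DEAD_ZONE_FRACTION)

print(f"Minimum ring length: {minimum_length_um:.0f} um")       # ~2149 um, rounded to 2200 um
print(f"Usable length of a 2200 um ring: {usable_at_chosen_um:.0f} um")  # ~1474 um
```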

Because 9 IE stations exist, the clock simulated in HSPICE was adjusted to run at 18GHz. Clock extraction points were placed at the 16 T-sections closest to the inverter pair. These 16 T-sections also included differential amplifiers to extract a full differential clock signal with minimal skew. These differential amplifiers use a DC bias transistor and a current mirror active load to provide a single-ended output driven by two inverters (shown in Figure 18). The size and number of these inverters can vary based on the number of flip-flops they drive in each IE station.

Figure 19 shows the 16 extracted points along the SWO that receive a full differential clock. The period of the extracted clock is 55.79ps, which equates to 17.92GHz. The rising skew between the 16 points is 2.54ps, while the falling skew is 3.55ps. This is 5 times less than the residual skew achieved by the cancellation technique used by Ahn and Yoo [26]. Based on these results, the SWO can deliver a clock to all IE stations and maintain a round trip time of 0.5 nanoseconds. This allows speeds comparable to current HBM standards while utilizing fewer TSVs.

Figure 18: Differential amplifier circuit to extract clock signal [29, 30].


Figure 19: 18GHz extracted ring clock signal at 16 different locations.
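For reference, converting the measured period to a frequency and comparing the worst-case skew against the 18ps residual skew reported by Ahn and Yoo [26] is simple arithmetic (shown below using only the values quoted above):

```python
# Simple arithmetic check of the extracted clock frequency and skew improvement.
period_ps = 55.79                          # extracted clock period from HSPICE
frequency_ghz = 1000.0 / period_ps         # 1 / 55.79 ps, expressed in GHz
print(f"Extracted frequency: {frequency_ghz:.2f} GHz")        # ~17.92 GHz

worst_case_skew_ps = 3.55                  # falling skew between the 16 extraction points
ahn_yoo_residual_skew_ps = 18.0            # residual skew after cancellation in [26]
print(f"Skew reduction: {ahn_yoo_residual_skew_ps / worst_case_skew_ps:.1f}x")  # ~5x
```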

Current DRAM Speeds

Using the built-in models from Ramulator [31], modern DRAM architectures were simulated with 9 different CPU traces. Each DRAM architecture was run using multiple channels to accurately represent the memory in its fastest configuration. DDR memories had a maximum of 8 ranks per channel, since the capacitive loading of more ranks is not feasible [33]. The 3D stacked memories, HBM and WideIO2, had 8 channels based on the JEDEC standards [19, 25]. Every DRAM chip was run with the maximum clock speed available to show the fastest speeds the memory configuration was capable of. HBM was run with only 4Gb since 8Gb was not available in the Ramulator package. Table 3 shows an overview of the configuration settings for each DRAM architecture.

Every DRAM configuration was run twice, once with the cache enabled (to simulate an accurate computer system) and once with the cache disabled (to exercise the DRAM architecture). The CPU traces used came from the SPEC CPU 2006 benchmarks and contained a variety of trace sizes and memory requests. Table 4 shows the total number of memory requests as well as the percentage of those requests that were write commands. The final column marks the percentage of the read and write requests that were caught by the cache.

Table 3: Ramulator DRAM simulation configurations.

DRAM Architecture # of Channels Ranks per Channel I/O Bus Speed (MHz) Memory Size

DDR3 2 8 2133 8Gb

LPDDR3 2 2 2133 8Gb

DDR4 4 8 2400 8Gb

LPDDR4 2 2 3200 8Gb

GDDR5 6 1 7000 8Gb

WideIO2 8 2 1066 8Gb

HBM 8 2 1000 4Gb

Table 4: SPEC CPU 2006 trace statistics.

CPU Trace Total Number of Requests % Write Commands % Cache Hits

429.mcf 16965722 9.93 32.8

470.lbm 9988010 42.8 41.9

450.soplex 8748061 35.1 73.7

459.gemsfdtd 6918110 26.5 18.4

462.libquantum 6447884 16.3 3.1

437.leslie3d 4563215 25.2 11.2

433.milc 4333369 26.1 3.06

471.omnetpp 3984337 2.89 97.9

483.xalancbmk 3772390 2.55 95.8

Note: Cache size was 1MB, with a block size of 64 and an associativity of 8.

Based on the configurations shown in Table 3, Ramulator provides the theoretical maximum bandwidth that is achievable. This parameter is typically marketed by memory manufacturers to show the potential speeds of different DRAM architectures. Though a configuration may have a higher theoretical bandwidth, it does not guarantee better performance for all applications. Table 5 shows the theoretical maximum bandwidth for each configuration that was run with Ramulator. Though GDDR5 has the highest theoretical bandwidth, the results show that it is not necessarily the fastest architecture when running different CPU traces.

Table 5: Ramulator theoretical maximum bandwidth.

DRAM Architecture Maximum Bandwidth (GBps)

GDDR5 336

WideIO2 136

HBM 128

DDR4 76.8

DDR3 34.1

LPDDR3 34.1

LPDDR4 25.6

The different DRAM architectures were tested against the 9 different CPU traces using the configurations in Table 3. The first important statistic examined from the results is the latency seen by each command that travels from the DMC to the DRAM. Though this latency measures the average time a single command takes to execute in memory, it does not accurately represent the parallelization of multi-channel architectures. Instead, a clear line separates the low power memory options (LPDDR3, LPDDR4, and WideIO2) from the traditional DRAM architectures (DDR3, DDR4, HBM, GDDR5). This shows that the low power memory options use slower DRAM cores to reduce power consumption. Despite the separation between low power and traditional memories, there is no clear distinction between the memory architectures for different traces. Figures 20 and 21 show the average latency per command without and with a cache.

Figure 20: Average latency per command (ns) without cache for LPDDR4, LPDDR3, DDR3, DDR4, GDDR5, WideIO2, and HBM (Appendix, Table 9).

Figure 21: Average latency per command (ns) with cache for LPDDR4, LPDDR3, DDR3, DDR4, GDDR5, WideIO2, and HBM (Appendix, Table 10).

Though the average latency per command is a useful statistic to measure the speed of the DRAM core, it does not accurately represent the parallelization utilized by multiple channels. Measuring the total latency of all read commands over a CPU trace shows the differences between the different memory configurations. The configurations that utilize more channels and ranks resulted in significant speed increases over the lifetime of the program. Those that had at least 4 channels were approximately 50% faster than those with 2 channels. Traces 471.omnetpp and 483.xalancbmk had very small latencies because of their high cache hit rate. The results in Figures 22 and 23 show the total latency in milliseconds. It is important to notice that HBM performed better than all other memory architectures for every CPU trace.

Figure 22: Total read latency (ms) without cache for LPDDR4, LPDDR3, DDR3, DDR4, GDDR5, WideIO2, and HBM (Appendix, Table 11).

Figure 23: Total read latency (ms) with cache for LPDDR4, LPDDR3, DDR3, DDR4, GDDR5, WideIO2, and HBM (Appendix, Table 12).

An accurate representation of the parallelization achieved by using multiple channels is the average queue length in the DMC. This statistic shows that the average queue length per channel decreases as the channel count increases, since commands are divided amongst the different channels when possible. The longer the average queue length, the longer the latency for a command to execute in memory. Figures 24 and 25 show the average number of commands waiting in the queue per channel. The number of commands waiting idle grows as more commands are issued to the DRAM. This explains the higher average queue length for traces that have more memory commands compared to those with a higher cache hit rate (471.omnetpp and 483.xalancbmk).

Figure 24: Average queue length per channel without cache for LPDDR4, LPDDR3, DDR3, DDR4, GDDR5, WideIO2, and HBM (Appendix, Table 13).

Figure 25: Average queue length per channel with cache for LPDDR4, LPDDR3, DDR3, DDR4, GDDR5, WideIO2, and HBM (Appendix, Table 14).

Consistently, in the results presented, HBM was faster than the other DRAM architectures for all CPU traces. The speed increases were greater for larger programs, such as 429.mcf, compared to smaller programs, like 471.omnetpp. The results show that the higher channel counts of 3D stacked memories like HBM and WideIO provide larger throughput compared to traditional DDR architectures. Because HBM proved faster than any other memory configuration, it is the basis for comparison with the MART.

Low Power Ring Memory

Using the HBM model as a template, the MART model was adjusted to incorporate the shared data bus. In the low power MART (LPMART) scheme, all 8 channels share the same 64-bit data bus and 14-bit address bus (this is based on a WideIO type interface). This makes a total of 156 TSVs for the address/data bus in Figures 14 and 16 (78 on each side of the dotted line). Adding the clock wires and the control lines, the total number of TSVs needed for the LPMART is 296. Table 6 shows the TSV comparison between a traditional HBM and the LPMART scheme.

Table 6: Total number of TSVs for HBM and LPMART (assuming 8 channels) [20].

TSV Function HBM LPMART

Data 1024 128

Column Address 64 16

Row Address 48 12

Clock 16 4

Control Lines 544 136

Total 1696 296

HBM TSVs have an average height that passes through 2.5 DRAM/logic dies for each clock cycle (for a 4GB stack). The LPMART only passes through 1 die per clock cycle because there are registers in every IE station. This gives an approximation of the capacitance each TSV experiences per clock cycle, which can be applied to the following equation:

$$P_{dynamic} = C \, V_{DD}^{2} \, f_{sw}$$

Assuming the same V_DD, the frequency of the TSVs in HBM is 1GHz compared to 18GHz for the LPMART. HBM, however, utilizes both edges of the clock and therefore doubles the switching frequency. Despite the faster clocking speed of the TSVs in the LPMART, the reduction in the number of TSVs and the smaller capacitance cause the LPMART interface to consume approximately 40% less power than HBM (shown in Table 7). These power savings, however, only account for the TSVs in the DRAM stacks and do not include the power consumption of the DRAM cores. This is an area of ongoing research that will provide more definitive results for these claims.

Table 7: Dynamic power estimate for HBM and LPMART.

                 HBM     LPMART

C ~ Length       2.5     1

f_sw             2       18

Number of TSVs   1696    296

~P_dynamic       8480    5328
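The relative numbers in Table 7 follow from the dynamic power equation under the stated assumptions (equal V_DD, capacitance proportional to the number of dies each TSV spans, and the TSV counts from Table 6); a minimal sketch of that calculation is shown below:

```python
# Back-of-the-envelope relative dynamic power for the TSV interfaces, following
# P_dynamic = C * V_DD^2 * f_sw summed over all TSVs. V_DD is assumed identical
# for both designs and is dropped; C is approximated as proportional to the
# number of dies a TSV spans per clock cycle.

def relative_tsv_power(dies_spanned: float, f_sw: float, num_tsvs: int) -> float:
    return dies_spanned * f_sw * num_tsvs   # unitless, relative comparison only

hbm = relative_tsv_power(dies_spanned=2.5, f_sw=2.0, num_tsvs=1696)    # both clock edges -> 2
lpmart = relative_tsv_power(dies_spanned=1.0, f_sw=18.0, num_tsvs=296)

print(f"HBM    ~ {hbm:.0f}")                           # 8480
print(f"LPMART ~ {lpmart:.0f}")                        # 5328
print(f"LPMART interface power savings: {1 - lpmart / hbm:.0%}")   # ~37%, roughly 40%
```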

After calculating the power estimates, Ramulator simulations show the speed comparison between the LPMART scheme and the other 3D stacked memories. Because the IE stations may have conflicts, where they cannot insert data onto the ring because other data already occupies that slot, the LPMART will have delayed latencies during busy bus periods. These latencies are minimal, however, since the activity of the data bus on a traditional HBM architecture is below 5% and a conflict will only delay data by one clock cycle (less than 0.06ns for an 18GHz clock). The smaller data bus of the LPMART, compared to HBM, causes decreased speeds but remains faster than the WideIO2 architecture. Unlike HBM, the LPMART will not have overhead between read and write commands because the ring is unidirectional. This reduces the column write latency and results in better performance for traces with more write requests. When looking at the performance results of the LPMART, it has speeds roughly in between WideIO2 and HBM. With projected decreases in power consumption and manufacturing costs over HBM, the LPMART provides a new alternative that achieves faster speeds than WideIO2.
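To illustrate why these insertion conflicts contribute so little latency, the toy model below (an illustrative sketch only, not the simulator used in this thesis; the 5% bus utilization and one-cycle penalty are taken from the text, and slot occupancy is assumed independent each cycle) estimates how often an IE station finds its slot occupied and the resulting average added delay:

```python
# Toy model of insertion conflicts on the shared data ring: each cycle a slot
# passes an IE station, and the station can only insert if the slot is empty.
# Slot occupancy is modeled as independent with ~5% utilization (from the text).
import random

def conflict_probability(utilization: float, cycles: int = 1_000_000, seed: int = 1) -> float:
    rng = random.Random(seed)
    attempts = 0
    conflicts = 0
    for _ in range(cycles):
        wants_to_insert = rng.random() < utilization   # station has data ready
        slot_occupied = rng.random() < utilization     # slot already carries data
        if wants_to_insert:
            attempts += 1
            if slot_occupied:
                conflicts += 1
    return conflicts / attempts if attempts else 0.0

p_conflict = conflict_probability(utilization=0.05)
cycle_ns = 1.0 / 18.0                                  # one 18 GHz cycle, ~0.056 ns
print(f"Probability an insertion waits: {p_conflict:.1%}")                    # ~5%
print(f"Average added delay per insertion: {p_conflict * cycle_ns * 1000:.1f} ps")
```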

Figures 26 and 27 show the average latency incurred per command sent to the DRAM stack. As expected, HBM outperforms the other 3D stacked memories in all traces. LPMART and WideIO2 follow a similar trend because both utilize the same DRAM dies. LPMART increases speeds over WideIO2 by 11%, but is slower than HBM by as much as 24%. For both the cache and no-cache configurations, the average latency per command maintained similar differences across all traces. The slower speeds of the LPMART and WideIO2 DRAM cores cause higher latencies per command, similar to the results shown for the low power options in Figures 20 and 21.

Figure 26: Average latency per command (ns) without cache for WideIO2, HBM, and LPMART (Appendix, Table 15).

Figure 27: Average latency per command (ns) with cache for WideIO2, HBM, and LPMART (Appendix, Table 16).

To see a better comparison of the performance between the different 3D stacked memories, it is best to look at the total read latency. Here we see that, similar to the per command basis, HBM is the fastest in all cases while LPMART falls between HBM and WideIO2. LPMART is up to 17% faster than WideIO2, but can be up to 50% slower than HBM (Figures 28 and 29). Though this is a conservative range, LPMART is typically slower than HBM because of its narrow data bus. These statistics are similar for the cache configuration, but an overall reduction in latency occurs because of the reduction in DRAM requests. With a cache enabled, LPMART can be 5% slower than HBM and achieve speeds upwards of 20% faster than WideIO2.

Figure 28: Total read latency (ms) without cache for WideIO2, HBM, and LPMART (Appendix, Table 17).

Figure 29: Total read latency (ms) with cache for WideIO2, HBM, and LPMART (Appendix, Table 18).

The final statistic measured between the 3D stacked memories is the average queue length for each channel. Because every 3D stacked memory used in these tests has 8 channels, the comparison shows the explicit speed seen by the communication scheme and the DRAM cores. With the cache disabled, the LPMART has the highest average queue length compared to WideIO2 and HBM (Figure 30). This is a result of the shared TSVs between all 8 channels and the conflicts that occur when the data bus is busy. Compared to WideIO2, LPMART has 23% larger queue lengths on average. When the cache is enabled, the difference seen between the 3D stacked memories is significantly reduced and the average queue lengths per channel are similar (Figure 31). Instead of a 23% increase, the queue length of the LPMART is only 4% larger than WideIO2.

Figure 30: Average queue length per channel without cache for WideIO2, HBM, and LPMART (Appendix, Table 19).

Figure 31: Average queue length per channel with cache for WideIO2, HBM, and LPMART (Appendix, Table 20).

High Performance Ring Memory

By utilizing the ring in a 3D stacked memory, TSVs are reduced without compromising performance. If the MART communication scheme shown in Figure 16 is duplicated, the number of channels in the stack can be doubled from 8 to 16 (shown in Figure 32). With the data bus being 128 bits wide, the total number of TSVs used in a 16-channel architecture would be 848 (including address and control lines). Though this increases power consumption by approximately 80% over HBM, it still reduces manufacturing costs by utilizing fewer TSVs. With 16 channels, the parallelization for commands issued to the DRAM dies doubles and causes significant increases in speed over current HBM standards. By doubling the MART from 2 to 4 rings, four sets of 128-bit wide data busses can be used to create 32 independent channels. This would again double the power consumption but would only utilize 1696 TSVs, similar to HBM stacks. Table 8 shows the power estimates for the different channel architectures.


Figure 32: Side view of 8-channel (left) and 16-channel (right) MART for an 8GB stack.

Table 8: Dynamic power estimate for different channel counts.

8-channel MART 16-channel MART 32-channel MART

Number of TSVs 424 848 1696

~P_dynamic 7632 15264 30528
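The scaling in Table 8 can be generated from the same back-of-the-envelope model (a sketch assuming each additional ring interface duplicates the 424 TSVs of the 8-channel MART and that every TSV switches at 18GHz; the HBM baseline values come from Table 7):

```python
# Scaling sketch for the MART variants in Table 8: each additional ring
# interface duplicates the 424 TSVs of the 8-channel MART (128-bit data bus
# plus address, control, and clock), with every TSV switching at 18 GHz and
# spanning roughly one die (C ~ 1). HBM baseline values are from Table 7.

TSVS_PER_RING_INTERFACE = 424
HBM_TSVS = 1696
HBM_RELATIVE_POWER = 8480

for channels in (8, 16, 32):
    interfaces = channels // 8
    tsvs = TSVS_PER_RING_INTERFACE * interfaces
    relative_power = 1.0 * 18.0 * tsvs          # C ~ 1 die, f_sw = 18 GHz
    print(f"{channels:2d} channels: {tsvs:4d} TSVs ({tsvs / HBM_TSVS:.2f}x HBM), "
          f"~P_dynamic = {relative_power:.0f} ({relative_power / HBM_RELATIVE_POWER:.2f}x HBM)")
```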

By utilizing more channels, the DMC requests are evenly distributed across different sections of the DRAM dies. This reduces the time spent refreshing DRAM and allows concurrent commands to operate in parallel. To measure the performance for scalable channel counts, the MART was run in Ramulator with the same CPU traces. The average latency per command for the different channel counts saw few changes (Figure 33). These results show that more channels increase parallelization rather than speeding up individual commands. With the cache enabled, the latencies remained constant, but the difference between the different architectures decreased (Figure 34).

Figure 33: Average latency per command (ns) without cache for 8, 16, and 32 channels (Appendix, Table 21).

Figure 34: Average latency per command (ns) with cache for 8, 16, and 32 channels (Appendix, Table 22).

Each time the channel count is doubled, the total read latency for all requests in a CPU trace is roughly cut in half. This is a result of increased concurrency in the DRAM dies and the reduction in the refresh time needed to ensure no data is lost. Figure 35 shows that for almost every CPU trace, the latency is cut in half. The exception comes from the 433.milc trace, which has a low cache hit rate, which in turn means that commands are not made to consecutive or similar locations in memory. This CPU trace also has a high write command percentage, which increases overhead in the internal DRAM cores. Figure 36 shows the same measurement with the cache enabled. Despite decreased latencies, doubling the channel count continues to roughly double performance across most CPU traces.

Figure 35: Total read latency (ms) without cache for 8, 16, and 32 channels (Appendix, Table 23).

Figure 36: Total read latency (ms) with cache for 8, 16, and 32 channels (Appendix, Table 24).

The final statistic measured was the average queue length for each channel configuration. It is important to note that these queue lengths are on a per channel basis; therefore, increasing the channel count divides the commands evenly amongst those channels. This reduces the total number of commands a channel encounters, which decreases congestion in the memory queues. There is a larger discrepancy between the 8-channel and the 16-channel configurations in terms of queue length. The decrease in queue length is not as significant between the 16-channel and the 32-channel architectures. The reason for this is that a limit exists on how small the queue length can become for each CPU trace. As the channel count increases, the queue length approaches values close to zero.

Figure 37: Average queue length per channel without cache for 8, 16, and 32 channels (Appendix, Table 25).

Figure 38: Average queue length per channel with cache for 8, 16, and 32 channels (Appendix, Table 26).

CHAPTER IV

CONCLUSION

As memory continues to bottleneck system performance, 3D stacked memory can provide significant increases in throughput. With computer programs demanding more memory, fast data access is becoming necessary for high performance computing devices. These devices rely on large amounts of data for applications such as DNA modelling. If these programs have quick access to their data, they can process more commands in a given period. This can have massive implications for fields like medical research that rely on data computations to study cells and the symptoms of disease.

Importance

By using a 3D resonant ring, the memory architecture using a ring-based topology (MART) can trade off power, throughput, and latency to match different application requirements. By using a narrow bus and connecting it to all channels, the MART can provide low power consumption with speeds up to 20% faster than WideIO2. With multiple ring topologies in the same stack, the channel count can double from 8 to 16, and then to 32. This provides speeds up to 4 times faster than traditional HBM. This scalable architecture allows higher throughput and faster system performance for increasingly large programs. With all computing systems relying on memory to operate, the topology proposed in this paper can have a significant impact on systems ranging from lightweight IoT devices to large-scale data centers.

Future Research

The memory architecture using a ring-based topology provides a platform for future research on 3D memory communication schemes. By varying the number of 3D resonant rings, the MART can accommodate different computer system requirements. To verify the low power options presented in this paper, ongoing research is using simulation tools such as DRAMPower [34]. Utilizing HSPICE for the communication scheme and DRAMPower for the DRAM cores, researchers can verify the power consumption of different ring architectures, rather than relying on the assumptions made in this paper. Research has also begun on verifying the operation of the IE stations, to ensure they can maintain the desired clock frequencies. Finally, research can look at new methods of incorporating the ring into other memory architectures. This includes 3D stacks based on resistive random-access memory, such as 3D XPoint, and 3D SSDs. With more memory-based architectures becoming 3D, the ring-based architecture proposed in this paper can provide a fast and efficient communication scheme to improve system performance.

REFERENCES

[1] Jacob, Bruce, Spencer W. Ng, and David T. Wang. Memory Systems: Cache, DRAM, Disk. Burlington: Morgan Kaufmann Publishers, 2008. Print.

[2] Crisp, Richard. “Direct RAMbus technology: the new main memory standard.” IEEE Micro 17.6 (1997). Web.

[3] Przybylski, Steven. “Intel gambles on a sure DRAM thing.” Electronic Engineering Times 45.947 (1997). Web.

[4] “GDDR5 Part Catalog.” Inc., Feb. 2017. Web.

[5] Smith, Ryan. “The AMD Radeon R9 290 Review.” ANANDTECH, 5 Nov. 2013. Web.

[6] JEDEC Standard: GDDR5X SGRAM (JESD232A). JEDEC Solid State Technology Association. August 2016.

[7] Shilov, Anton. “GDDR5X Standard Finalized by JEDEC: New Graphics Memory up to 14 Gbps.” ANANDTECH, 22 Jan. 2016. Web.

[8] Parrish, Kevin. “The Sixth Generation of On-Board Memory For GPU Cards Won’t Arrive Until 2018.” Digital Trends, 22 Aug. 2016. Web.

[9] Chester, Edward. “Nvidia GTX Titan Black launched, set to be new performance king.” Bit-tech, 18 Feb. 2014. Web.

[10] Allan, Graham. “DDR4 Bank Groups in Embedded Applications.” DesignWare Technical Bulletin. Synopsys, 22 Dec. 2016. Web.

[11] JEDEC Standard: DDR4 SDRAM (JESD79-4). JEDEC Solid State Technology Association. September 2012.

[12] JEDEC Standard: DOUBLE DATA RATE SDRAM (JESD79F). JEDEC Solid State Technology Association. February 2008.

[13] JEDEC Standard: DDR2 SDRAM (JESD79-2F). JEDEC Solid State Technology Association. November 2009.

[14] JEDEC Standard: DDR3 SDRAM (JESD79-3F). JEDEC Solid State Technology Association. July 2012.

[15] JEDEC Standard No. 21-C. Page 3.11.5.2. JEDEC Solid State Technology Association.

[16] Nair, Prashant, Chia-Chen Chou, and Moinuddin Qureshi. “A Case for Refresh Pausing in DRAM Memory Systems.” IEEE High Performance Computer Architecture 19 (2013): 627-38. Web.

[17] “High-Bandwidth Memory, Reinventing Memory Technology.” , Inc., 2013. Web.

[18] Wasson, Scott. “AMD’s high-bandwidth memory explained.” The Tech Report: PC Hardware Explored. The Tech Report, 19 May 2015. Web.

[19] JEDEC Standard: HBM (JESD235A). JEDEC Solid State Technology Association. November 2015.

[20] “HBM2 Deep Dive.” Monitor Insider, 2 Feb. 2016. Web.

[21] Jun, Hongshin, Dr. “HBM (High Bandwidth Memory) for 2.5D.” Semicon Taiwan 2015. SK Hynix Inc., Sept. 2015. Web.

[22] Shilov, Anton. “JEDEC Publishes HBM2 Specification as Samsung Begins Mass Production of Chips.” ANANDTECH, 20 Jan. 2016. Web.

[23] Walton, Mark. “HBM3: Cheaper, up to 64GB on-package, and terabytes-per-second bandwidth.” ARS Technica, 23 Aug. 2016. Web.

[24] JEDEC Standard: WIDE I/O SINGLE DATA RATE (JESD229). JEDEC Solid State Technology Association. December 2011.

[25] JEDEC Standard: WIDE I/O 2 (JESD229-2). JEDEC Solid State Technology Association. August 2014.

[26] Ahn, K., and C. Yoo. “Skew cancellation technique for >256-Gbyte/s high-bandwidth memory (HBM).” Electronics Letters 52.13 (2016): 1155-157. Web.

[27] Thakkar, Ishan, and Sudeep Pasricha. “Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures.” IEEE (2016): n. pag. Web.

[28] Lee, Donghyuk, Gennady Pekhimenko, Samira Khan, Saugata Ghose, and Onur Mutlu. “Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface.” SAFARI Technical Report 2015.008 (2015). Web.

[29] Mandal, Ayan. Efficient Design and Clocking for a Network-on-Chip. Diss. Texas A&M University, 2013. Web. 8 Aug. 2016.

[30] Cordero, Victor. Clock Distribution Scheme using Coplanar Transmission Lines. Diss. Texas A&M University, 2008. Web. 12 Aug. 2016.


[31] Kim, Yoongu, Weikun Yang, and Onur Mutlu. “Ramulator: A Fast and Extensible DRAM Simulator.” IEEE Computer Architecture Letters 15.1 (2015): 45-49. Web.

[32] Cao, Yu. “Latest Models.” Predictive Technology Model. Nanoscale Integration and Modeling (NIMO) Group, ASU., 30 Sept. 2008. Web.

[33] Denneman, Frank. “Memory Deep Dive: Memory Subsystem Organization.” Frankdenneman.nl. WordPress, 18 Feb. 2015. Web.

[34] Chandrasekar, Karthik, Christian Weis, Yonghui Li, Sven Goossens, Matthias Jung, Omar Naji, Benny Akesson, Norbert Wehn, and Kees Goossens. DRAMPower: Open-source DRAM Power & Energy Estimation Tool. Web.

APPENDIX

MEMORY SIMULATION DATA TABLES

Current DRAM Speeds

Table 9: Average latency per command without cache in nanoseconds.

CPU Trace        LPDDR4   LPDDR3   DDR3     DDR4     GDDR5    WideIO2  HBM
429.mcf          87.5     86.5     51.2     43.6     48.9     69.0     48.6
470.lbm          69.1     62.7     31.3     26.2     39.6     62.4     36.3
450.soplex       56.7     50.4     28.6     24.7     30.1     47.4     31.4
459.gemsfdtd     58.3     54.7     27.8     25.4     38.2     62.7     43.0
462.libquantum   36.5     30.2     20.0     19.8     20.3     31.1     24.6
437.leslie3d     54.2     49.0     28.2     27.3     36.4     59.1     36.2
433.milc         52.5     46.8     28.3     26.9     29.1     44.1     31.0
471.omnetpp      69.8     67.7     40.0     39.5     44.1     64.7     45.0
483.xalancbmk    56.3     53.0     31.5     33.3     38.8     58.3     41.3

Note: See page 32.
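
Each entry in these tables summarizes the read commands issued while replaying one SPEC CPU trace: the "average latency per command" tables report the mean time from command issue to completion, and the "total read latency" tables report the sum of those times over the whole trace. The short Python sketch below shows one way such summaries could be produced from a per-command latency log; the log format, field names, and clock period here are illustrative assumptions and do not describe the actual output of the simulator used in Chapter III.

# Minimal sketch (assumed log format): each line of the log holds the issue
# and completion cycle of one read command, separated by a comma.
CLOCK_PERIOD_NS = 0.75  # illustrative clock period, not a measured value


def summarize(log_path):
    latencies_ns = []
    with open(log_path) as log:
        for line in log:
            line = line.strip()
            if not line:
                continue
            issue_cycle, complete_cycle = (int(f) for f in line.split(","))
            latencies_ns.append((complete_cycle - issue_cycle) * CLOCK_PERIOD_NS)

    average_ns = sum(latencies_ns) / len(latencies_ns)  # Tables 9-10, 15-16, 21-22
    total_ms = sum(latencies_ns) / 1e6                  # Tables 11-12, 17-18, 23-24
    return average_ns, total_ms


if __name__ == "__main__":
    average_ns, total_ms = summarize("read_commands.log")
    print("Average latency per command: %.1f ns" % average_ns)
    print("Total read latency: %.1f ms" % total_ms)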

Table 10: Average latency per command with cache in nanoseconds.

CPU Trace        LPDDR4   LPDDR3   DDR3     DDR4     GDDR5    WideIO2  HBM
429.mcf          80.7     79.5     45.1     40.8     47.7     65.3     46.5
470.lbm          73.1     68.7     34.3     27.4     42.6     60.3     44.1
450.soplex       76.7     70.5     36.9     29.7     39.7     56.6     42.5
459.gemsfdtd     63.0     59.0     29.8     28.0     42.9     62.5     35.9
462.libquantum   47.1     42.2     25.7     23.2     27.1     38.2     30.6
437.leslie3d     66.3     60.8     32.6     29.0     42.4     62.0     42.0
433.milc         62.4     56.9     32.9     28.6     34.1     47.5     36.5
471.omnetpp      50.9     48.0     29.6     31.2     36.8     47.1     36.1
483.xalancbmk    60.0     56.9     34.5     33.4     43.1     54.6     40.6

Note: See page 33.

Table 11: Total read latency without cache in milliseconds.

CPU Trace        LPDDR4   LPDDR3   DDR3     DDR4     GDDR5    WideIO2  HBM
429.mcf          667      659      390      166      143      132      92.7
470.lbm          197      179      89.4     37.3     42.7     44.7     25.9
450.soplex       161      143      81.3     35.2     32.3     33.8     22.4
459.gemsfdtd     148      139      70.7     32.2     36.3     39.8     20.8
462.libquantum   98.5     81.5     54.1     26.8     20.4     21.1     16.6
437.leslie3d     92.6     86.7     48.2     23.3     23.2     25.2     15.4
433.milc         63.3     56.4     34.1     26.9     24.7     30.3     21.3
471.omnetpp      136      132      77.8     38.8     32.8     31.1     21.7
483.xalancbmk    104      97.7     58.2     30.8     26.9     26.4     18.7

Note: See page 34.

Table 12: Total read latency with cache in milliseconds.

CPU Trace        LPDDR4   LPDDR3   DDR3     DDR4     GDDR5    WideIO2  HBM
429.mcf          413      407      232      104      93.5     83.7     59.5
470.lbm          134      126      62.7     25.1     29.9     27.5     20.1
450.soplex       57.8     53.1     27.8     11.2     11.6     10.6     7.98
459.gemsfdtd     121.9    114.2    57.6     27.1     31.6     30.3     22.8
462.libquantum   127      114      69.3     31.3     28.1     25.8     20.6
437.leslie3d     98.9     90.8     48.7     21.7     24.0     23.2     15.7
433.milc         74.0     67.6     39.0     28.4     28.9     32.4     24.9
471.omnetpp      2.16     2.07     1.28     0.76     0.64     0.51     0.39
483.xalancbmk    4.56     4.27     2.59     1.24     1.19     0.97     0.73

Note: See page 34.

Table 13: Average queue length per channel without cache.

CPU Trace        LPDDR4   LPDDR3   DDR3     DDR4     GDDR5    WideIO2  HBM
429.mcf          7.18     7.15     7.75     3.57     2.48     1.63     1.51
470.lbm          7.88     8.18     5.89     2.19     2.12     1.58     1.21
450.soplex       4.91     5.06     4.95     1.85     1.47     0.96     0.82
459.gemsfdtd     3.14     3.29     2.72     1.20     1.02     0.74     0.57
462.libquantum   3.63     3.59     3.41     1.23     1.03     0.65     0.59
437.leslie3d     3.19     3.26     2.52     1.05     0.92     0.66     0.56
433.milc         1.83     1.83     2.06     1.12     0.98     0.81     0.67
471.omnetpp      3.57     3.56     3.24     1.34     1.06     0.76     0.65
483.xalancbmk    2.07     2.02     1.90     0.91     0.71     0.51     0.47

Note: See page 35.
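
The queue-length tables (Tables 13, 14, 19, 20, 25, and 26) report the average number of requests waiting in each channel's command queue. The sketch below shows one common way to compute such a time-weighted average from enqueue and dequeue events; the event-list format is an assumption for illustration, not the bookkeeping the simulator actually performs.

# Minimal sketch (assumed event format): each record is (cycle, channel, delta),
# where delta is +1 when a request enters a channel queue and -1 when it leaves.
def average_queue_length(events, num_channels, total_cycles):
    depth = [0] * num_channels        # current queue depth per channel
    area = [0] * num_channels         # accumulated depth x cycles per channel
    last_cycle = [0] * num_channels

    for cycle, channel, delta in sorted(events):
        area[channel] += depth[channel] * (cycle - last_cycle[channel])
        depth[channel] += delta
        last_cycle[channel] = cycle

    for channel in range(num_channels):  # account for the tail of the run
        area[channel] += depth[channel] * (total_cycles - last_cycle[channel])

    per_channel = [a / total_cycles for a in area]
    return sum(per_channel) / num_channels


if __name__ == "__main__":
    demo_events = [(0, 0, +1), (10, 0, +1), (25, 0, -1), (40, 0, -1)]
    print(average_queue_length(demo_events, num_channels=1, total_cycles=100))  # 0.55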

Table 14: Average queue length per channel with cache.

CPU Trace        LPDDR4   LPDDR3   DDR3     DDR4     GDDR5    WideIO2  HBM
429.mcf          5.01     5.09     4.84     1.72     1.43     1.01     0.93
470.lbm          5.65     5.73     4.30     1.19     1.30     0.97     0.88
450.soplex       3.60     3.65     2.46     0.69     0.80     0.57     0.48
459.gemsfdtd     2.88     2.93     2.16     0.84     0.85     0.61     0.63
462.libquantum   3.92     4.02     2.97     1.02     1.11     0.59     0.58
437.leslie3d     4.05     4.07     2.91     0.78     0.81     0.58     0.52
433.milc         1.71     1.69     1.45     0.76     0.75     0.66     0.65
471.omnetpp      0.18     0.17     0.11     0.04     0.04     0.03     0.02
483.xalancbmk    0.33     0.31     0.19     0.07     0.07     0.05     0.04

Note: See page 36.

Low Power Ring Memory

Table 15: Average latency per command without cache in nanoseconds.

CPU Trace        WideIO2  HBM      LPRM
429.mcf          69.0     48.6     61.8
470.lbm          62.4     36.3     55.1
450.soplex       47.4     31.4     41.5
459.gemsfdtd     62.7     43.0     55.8
462.libquantum   31.1     24.6     26.4
437.leslie3d     59.1     36.2     51.9
433.milc         44.1     31.0     38.4
471.omnetpp      64.7     45.0     57.9
483.xalancbmk    58.3     41.3     52.2

Note: See page 39.

Table 16: Average latency per command with cache in nanoseconds.

CPU Trace        WideIO2  HBM      LPRM
429.mcf          65.3     46.5     58.5
470.lbm          60.3     44.1     53.3
450.soplex       56.6     42.5     49.6
459.gemsfdtd     62.5     35.9     55.6
462.libquantum   38.2     30.6     32.1
437.leslie3d     62.0     42.0     54.9
433.milc         47.5     36.5     41.5
471.omnetpp      47.1     36.1     41.9
483.xalancbmk    54.6     40.6     48.8

Note: See page 40.

Table 17: Total read latency without cache in milliseconds.

CPU Trace        WideIO2  HBM      LPRM
429.mcf          132      92.7     118
470.lbm          44.7     25.9     39.4
450.soplex       33.8     22.4     29.6
459.gemsfdtd     39.8     20.8     35.4
462.libquantum   21.1     16.6     17.8
437.leslie3d     25.2     15.4     22.1
433.milc         30.3     21.3     26.4
471.omnetpp      31.1     21.7     27.9
483.xalancbmk    26.4     18.7     23.6

Note: See page 41.

Table 18: Total read latency with cache in milliseconds.

CPU Trace        WideIO2  HBM      LPRM
429.mcf          83.7     59.5     74.9
470.lbm          27.5     20.1     24.4
450.soplex       10.6     7.98     9.33
459.gemsfdtd     30.3     22.8     26.9
462.libquantum   25.8     20.6     21.6
437.leslie3d     23.2     15.7     20.5
433.milc         32.4     24.9     28.4
471.omnetpp      0.51     0.39     0.46
483.xalancbmk    0.97     0.73     0.87

Note: See page 41.

Table 19: Average queue length per channel without cache.

CPU Trace        WideIO2  HBM      LPRM
429.mcf          1.63     1.51     1.74
470.lbm          1.58     1.21     1.62
450.soplex       0.96     0.82     1.10
459.gemsfdtd     0.74     0.57     0.76
462.libquantum   0.65     0.59     0.82
437.leslie3d     0.66     0.56     0.67
433.milc         0.81     0.67     0.91
471.omnetpp      0.76     0.65     0.77
483.xalancbmk    0.51     0.47     0.51

Note: See page 42.

Table 20: Average queue length per channel with cache.

CPU Trace        WideIO2  HBM      LPRM
429.mcf          1.01     0.93     1.04
470.lbm          0.97     0.88     0.96
450.soplex       0.57     0.48     0.57
459.gemsfdtd     0.61     0.63     0.61
462.libquantum   0.59     0.58     0.61
437.leslie3d     0.58     0.52     0.58
433.milc         0.66     0.65     0.66
471.omnetpp      0.03     0.02     0.03
483.xalancbmk    0.05     0.04     0.05

Note: See page 43.

High Performance Ring Memory

Table 21: Average latency per command without cache in nanoseconds.

CPU Trace        8 Channels   16 Channels  32 Channels
429.mcf          52.9         52.4         46.2
470.lbm          40.6         42.8         42.7
450.soplex       35.8         36.9         39.7
459.gemsfdtd     40.5         42.5         45.8
462.libquantum   28.9         31.0         35.3
437.leslie3d     40.8         43.7         44.9
433.milc         34.9         36.1         36.6
471.omnetpp      49.3         46.5         44.4
483.xalancbmk    45.5         44.9         43.9

Note: See page 45.

Table 22: Average latency per command with cache in nanoseconds.

CPU Trace        8 Channels   16 Channels  32 Channels
429.mcf          50.7         47.3         44.9
470.lbm          48.5         47.1         43.7
450.soplex       46.8         45.3         42.1
459.gemsfdtd     47.6         47.2         46.9
462.libquantum   35.7         35.7         36.5
437.leslie3d     46.7         47.5         45.9
433.milc         40.3         49.4         43.3
471.omnetpp      40.0         39.0         38.8
483.xalancbmk    44.8         44.7         44.6

Note: See page 45.

Table 23: Total read latency without cache in milliseconds.

CPU Trace        8 Channels   16 Channels  32 Channels
429.mcf          100.8        49.7         21.8
470.lbm          29.0         15.3         7.61
450.soplex       25.5         13.1         7.01
459.gemsfdtd     25.7         13.5         7.28
462.libquantum   19.5         10.5         5.96
437.leslie3d     17.4         9.31         4.79
433.milc         23.9         7.64         6.80
471.omnetpp      23.7         11.3         5.45
483.xalancbmk    20.6         10.2         5.05

Note: See page 46.
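
For most traces in Table 23, doubling the channel count roughly halves the total read latency, since the reads are spread across more independent channels; 433.milc is the main exception, gaining little between 16 and 32 channels. The short check below works out the scaling for the 429.mcf row, using the values taken directly from the table.

# Scaling check for the 429.mcf row of Table 23 (total read latency in ms).
total_read_latency_ms = {8: 100.8, 16: 49.7, 32: 21.8}

for channels in (16, 32):
    speedup = total_read_latency_ms[8] / total_read_latency_ms[channels]
    print("%d channels: %.2fx lower total read latency than 8 channels" % (channels, speedup))
# Prints roughly 2.03x and 4.62x, close to the ideal 2x and 4x scaling.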

Table 24: Total read latency with cache in milliseconds.

CPU Trace        8 Channels   16 Channels  32 Channels
429.mcf          65.0         30.3         14.4
470.lbm          22.2         10.8         5.0
450.soplex       8.80         4.23         1.94
459.gemsfdtd     23.0         11.4         5.67
462.libquantum   24.1         12.1         6.15
437.leslie3d     17.4         8.87         4.29
433.milc         27.5         10.4         8.03
471.omnetpp      0.44         0.19         0.09
483.xalancbmk    0.80         0.41         0.19

Note: See page 47.

Table 25: Average queue length per channel without cache.

CPU Trace        8 Channels   16 Channels  32 Channels
429.mcf          1.44         0.69         0.32
470.lbm          1.22         0.61         0.31
450.soplex       0.81         0.38         0.19
459.gemsfdtd     0.64         0.33         0.17
462.libquantum   0.51         0.26         0.14
437.leslie3d     0.56         0.28         0.14
433.milc         0.68         0.29         0.26
471.omnetpp      0.65         0.32         0.16
483.xalancbmk    0.47         0.24         0.12

Note: See page 48.

Table 26: Average queue length per channel with cache.

CPU Trace        8 Channels   16 Channels  32 Channels
429.mcf          0.92         0.43         0.20
470.lbm          0.89         0.43         0.19
450.soplex       0.50         0.22         0.09
459.gemsfdtd     0.58         0.29         0.14
462.libquantum   0.67         0.27         0.12
437.leslie3d     0.53         0.26         0.12
433.milc         0.64         0.39         0.29
471.omnetpp      0.03         0.01         0.01
483.xalancbmk    0.04         0.02         0.01

Note: See page 48.
