Efficient Design and Communication for 3D Stacked Dynamic Memory


EFFICIENT DESIGN AND COMMUNICATION FOR 3D STACKED DYNAMIC MEMORY

An Undergraduate Research Scholars Thesis by ANDREW DOUGLASS

Submitted to the Undergraduate Research Scholars program at Texas A&M University in partial fulfillment of the requirements for the designation as an UNDERGRADUATE RESEARCH SCHOLAR

Approved by Research Advisor: Dr. Sunil Khatri

May 2017

Major: Computer Engineering

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
NOMENCLATURE

CHAPTER
I. INTRODUCTION
   History of DRAM Architectures
   Rambus DRAM
   Graphics DRAM
   Current DIMM Issues
   3D Stacked Memory
II. RING BASED MEMORY
   Skew Cancellation for HBM
   Massed Refresh
   Simultaneous Multi Layer Access
   Ring-Based Interconnect
III. SIMULATION RESULTS
   16nm Process
   3D Clock Ring
   Current DRAM Speeds
   Low Power Ring Memory
   High Performance Ring Memory
IV. CONCLUSION
   Importance
   Future Research

REFERENCES
APPENDIX: MEMORY SIMULATION DATA TABLES
ABSTRACT

Efficient Design and Communication for 3D Stacked Dynamic Memory

Andrew Douglass, Department of Electrical and Computer Engineering, Texas A&M University
Research Advisor: Dr. Sunil Khatri, Department of Electrical and Computer Engineering, Texas A&M University

As computer memory increases in size and processors continue to get faster, memory becomes an increasing bottleneck to system performance. To mitigate slow DRAM chip speeds, a new generation of 3D stacked DRAM allows lower power consumption and higher bandwidth. To communicate between these chips, this paper proposes the use of ring-based standing wave oscillators for fast data transfer. With a fast clocking scheme, multiple channels can share the same bus to reduce TSVs while maintaining similar memory latencies. Simulations of the new clocking scheme and data transfers are performed to show the improvements that can be made in memory communication. Experimental results show that a ring-based clocking scheme can obtain a two-times speedup over current stacked memory chips. Variations of this clocking scheme can also provide half the power consumption at comparable speeds. These ring-based architectures allow higher memory speeds without adding significant hardware complexity, which lets the ring-based memory architecture trade off power, throughput, and latency to improve system performance for different applications.

ACKNOWLEDGMENTS

I would like to thank my mentor, Dr. Sunil Khatri, for his guidance and continual support throughout the course of this research project. I would also like to thank Abbas Fairouz for his help with SPICE simulations and the time he dedicated to my questions. Thanks also go to my friends, colleagues, and the faculty and staff for making my time at Texas A&M University a great experience. I also want to extend my gratitude to the staff of the Undergraduate Research Scholars program, which provided many training sessions and assistance to all students involved in the program. Finally, I want to thank my mother and father for being role models to me and always supporting my goals.

NOMENCLATURE

DRAM: Dynamic Random Access Memory (DRAM) is a form of volatile memory that uses a transistor and a capacitor to store each bit of information.

JEDEC: The Joint Electron Device Engineering Council (JEDEC) is an organization made up of many industry leaders who create the standards for DRAM memory.

DIMM: A Dual In-line Memory Module (DIMM) is a printed circuit board that contains multiple random access memory chips connected to a large bus, with separate electrical contacts on each side of the module.

SO-DIMM: A Small Outline Dual In-line Memory Module (SO-DIMM) is a smaller version of the DIMM used in devices where space is limited (e.g. laptops, tablets, and routers).

DMC: The Dynamic Memory Controller (DMC) is the hardware device that translates addresses, queues requests, and communicates with memory modules.

MC: Memory Chips (MCs) are integrated circuits on the DIMM printed circuit board that contain memory arrays to store data as bits.

Channel: An independent bus that communicates between the DMC and the DIMM, which often consists of a data portion, a control portion, and an address portion.

[Figure 1: Multi-channel DRAM architecture.]

Rank: A set of DRAM memory chips that share a control bus and operate simultaneously to provide the total bits for the data bus (shown in Figure 2).

[Figure 2: Both sides of a DIMM with ranks labeled.]
Bank: A subset of memory arrays that operate independently, allowing requests to different banks to be pipelined.

Bit-line: Bit-lines are the columns of a memory array; they fill with the value from a memory cell during a read and are driven to the desired value during a write (shown in Figure 3).

Word-line: Word-lines are the rows of a memory array; the selected row is driven to a high voltage when reading from or writing to that row (shown in Figure 3).

[Figure 3: DRAM array bit-lines and word-lines.]

DDR: Double Data Rate (DDR) is a computer bus that transfers data on both the positive edge and the negative edge of the clock signal.

Sense Amps: A circuit used to detect small voltage changes on the bit-lines during a read to determine which state (1 or 0) the specified capacitor was storing.

Refresh: The process of periodically reading data into the sense amps and immediately writing that data back to the capacitors to prevent the loss of information.

Skew: The maximum difference in the time it takes a signal to arrive at one endpoint compared to another endpoint in the same distribution network.

H-Tree: Named for the shape it resembles, an H-tree is a clock distribution scheme that minimizes clock skew between its endpoints by maintaining similar delays through each signal path.

I/O Bus: The Input/Output (I/O) bus is the set of pins on the DIMM that carry data to the module during writes and from the module during reads.

PLL: A Phase-Locked Loop (PLL) is a control system that outputs a signal whose phase matches that of an input reference signal.

SWO: A Standing Wave Oscillator (SWO) is an oscillator circuit with an inverter pair on one end and a short on the other, creating a clock signal with the same phase at every point along its resonant wires.

CHAPTER I
INTRODUCTION

Though cheaper and larger, forms of non-volatile memory have long latencies, which can severely slow the execution of a program. To overcome this issue, volatile memory stores data temporarily so that the CPU can access it quickly. As processors improve, memory has continued to lag behind and bottleneck system performance. To alleviate this issue, the clock rate of the I/O data bus has increased with each generation of DDR memory so that larger transfers complete in the same time frame. Though this proves effective for small burst lengths, larger burst lengths waste bandwidth by transferring data to the cache that will never be used. This likelihood increases as memory sizes continue to grow and multi-core systems become more prevalent. As a result, different computer system applications use different DRAM architectures. These designs specialize and vary in power consumption, throughput, and cost.
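To make this burst-length overhead concrete, here is a minimal back-of-the-envelope sketch (not from the thesis; the 64-bit bus width, 1600 MHz I/O clock, and 64-byte cache line are assumed example values):

```python
# Rough model of DDR I/O bandwidth and wasted burst data.
# Assumed parameters (not from the thesis): 64-bit data bus,
# 1600 MHz I/O clock, 64-byte cache line.

BUS_BYTES = 8          # 64-bit data bus
IO_CLOCK_HZ = 1600e6   # I/O bus clock
CACHE_LINE = 64        # bytes actually consumed by the CPU

# DDR transfers data on both clock edges.
peak_bw = 2 * BUS_BYTES * IO_CLOCK_HZ
print(f"Peak bandwidth: {peak_bw / 1e9:.1f} GB/s")

for burst_length in (4, 8, 16, 32):
    burst_bytes = burst_length * BUS_BYTES
    wasted = max(burst_bytes - CACHE_LINE, 0) / burst_bytes
    print(f"BL={burst_length:2d}: {burst_bytes:4d} B per burst, "
          f"{wasted:5.1%} unused if only one cache line is needed")
```

With these assumed numbers, bursts longer than eight transfers move more data than a single cache line can use, which is the overhead the paragraph above refers to.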
Recommended publications
  • Performance Impact of Memory Channels on Sparse and Irregular Algorithms
Performance Impact of Memory Channels on Sparse and Irregular Algorithms
Oded Green (1,2), James Fox (2), Jeffrey Young (2), Jun Shirako (2), and David Bader (3). (1) NVIDIA Corporation, (2) Georgia Institute of Technology, (3) New Jersey Institute of Technology

Abstract: Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we demonstrate that the key factor in the utilization of the memory system for graph algorithms is not necessarily the raw bandwidth or even the latency of memory requests. Instead, we show that performance is proportional to the number of memory channels available to handle small data transfers with limited spatial locality. Using several widely used graph frameworks, including Gunrock (on the GPU) and GAPBS & Ligra (for CPUs), we evaluate key graph analytics kernels using two unique memory hierarchies, DDR-based and HBM/MCDRAM. Our results show that the differences in the peak bandwidths [...]

Graph algorithms are typically latency-bound if there is not enough parallelism to saturate the memory subsystem. However, the growth in parallelism for shared-memory systems brings into question whether graph algorithms are still primarily latency-bound. Getting peak or near-peak bandwidth of current memory subsystems requires highly parallel applications as well as good spatial and temporal memory reuse. The lack of spatial locality in sparse and irregular algorithms means that prefetched cache lines have poor data reuse and that the effective bandwidth is fairly low. The introduction of high-bandwidth memories like HBM and HMC have not yet closed this inefficiency gap for latency-sensitive accesses [19].
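As a toy illustration of the claim above that performance scales with the number of channels for small, low-locality transfers (this model and its request size, per-request service time, and channel counts are invented for illustration and are not from the paper):

```python
# Toy model: small random requests interleaved across memory channels.
# All numbers below are assumed, for illustration only.

REQUESTS = 1_000_000        # independent 64 B requests with no spatial locality
SERVICE_NS = 50             # time one channel needs to service one small request

for channels in (1, 2, 4, 8):
    # Requests spread evenly over the channels, which work in parallel,
    # so total time shrinks with the channel count.
    time_ns = (REQUESTS / channels) * SERVICE_NS
    gbytes = REQUESTS * 64 / 1e9
    print(f"{channels} channel(s): {gbytes / (time_ns / 1e9):6.2f} GB/s effective")
```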
  • DDR4 SDRAM Memory Product Guide (Samsung)
May. 2018. DDR4 SDRAM Memory Product Guide. SAMSUNG ELECTRONICS RESERVES THE RIGHT TO CHANGE PRODUCTS, INFORMATION AND SPECIFICATIONS WITHOUT NOTICE. Products and specifications discussed herein are for reference purposes only. All information discussed herein is provided on an "AS IS" basis, without warranties of any kind. This document and all information discussed herein remain the sole and exclusive property of Samsung Electronics. No license of any patent, copyright, mask work, trademark or any other intellectual property right is granted by one party to the other party under this document, by implication, estoppel or otherwise. Samsung products are not intended for use in life support, critical care, medical, safety equipment, or similar applications where product failure could result in loss of life or personal or physical harm, or any military or defense application, or any governmental procurement to which special terms or provisions may apply. For updates or additional information about Samsung products, contact your nearest Samsung office. All brand names, trademarks and registered trademarks belong to their respective owners. © 2018 Samsung Electronics Co., Ltd. All rights reserved.

1. DDR4 SDRAM MEMORY ORDERING INFORMATION
Part numbers take the form K 4 A X X X X X X X - X X X X, with eleven labelled fields: SAMSUNG Memory, DRAM, DRAM Type, Density, Bit Organization, # of Internal Banks, Interface (VDD, VDDQ), Revision, Package Type, Temp & Power, and Speed.
1. SAMSUNG Memory: K
2. DRAM: 4
8. Revision: M (1st Gen.), A (2nd Gen.), B (3rd Gen.), C (4th Gen.), D (5th Gen.)
  • You Need to Know About DDR4
Overcoming DDR Challenges in High-Performance Designs
Mazyar Razzaz, Applications Engineering; Jeff Steinheider, Product Marketing. September 2018 | AMF-NET-T3267
Company Public. NXP, the NXP logo, and NXP secure connections for a smarter world are trademarks of NXP B.V. All other product or service names are the property of their respective owners. © 2018 NXP B.V.

Agenda
• Basic DDR SDRAM Structure
• DDR3 vs. DDR4 SDRAM Differences
• DDR Bring up Issues
• Configurations and Validation via QCVS Tool

BASIC DDR SDRAM STRUCTURE

Single Transistor Memory Cell
[Figure: single-transistor DRAM cell, with the access transistor connecting the storage capacitor (Cbit) to the column (bit) line, the row (word) line driving the gate, "1" stored as Vcc and "0" as Gnd, the bit-line precharged to Vcc/2, and parasitic line capacitance (Ccol).]

Memory Arrays
[Figure: memory array with bit-lines B0-B7 and word-lines W0-W2, a row address decoder, a column address decoder, and sense amps & write drivers.]

Internal Memory Banks
• Multiple arrays organized into banks
• Multiple banks per memory device
  − DDR3: 8 banks, and 3 bank address (BA) bits
  − DDR4: 16 banks, with 4 banks in each of 4 sub bank groups
  − Can have one active row in each bank at any given time
• Concurrency
  − Can be opening or closing a row in one bank while accessing another bank
[Figure: Bank 0 through Bank 3, each with rows 0, 1, 2, 3, ..., and per-bank row buffers.]

Memory Access
• A requested row is ACTIVATED and made accessible through the bank's row buffers
• READ and/or WRITE are issued to the active row in the row buffers
• The row is PRECHARGED and is no longer
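The ACTIVATE / READ-WRITE / PRECHARGE sequence and per-bank row buffer described above can be mirrored in a small model (a simplification for illustration; the timing values are placeholders, not figures from the slides):

```python
# Toy model of a DRAM bank with a single row buffer, illustrating the
# ACTIVATE / READ-WRITE / PRECHARGE sequence. Timing values are assumed.

T_ACTIVATE, T_RW, T_PRECHARGE = 15, 15, 15   # nanoseconds (placeholders)

class Bank:
    def __init__(self):
        self.open_row = None        # row currently held in the row buffer

    def access(self, row: int) -> int:
        """Return the latency of one access to `row` in this bank."""
        latency = 0
        if self.open_row != row:
            if self.open_row is not None:
                latency += T_PRECHARGE   # close the currently open row
            latency += T_ACTIVATE        # open the requested row
            self.open_row = row
        latency += T_RW                  # column read or write
        return latency

bank = Bank()
for row in (3, 3, 7, 7, 3):
    print(f"access row {row}: {bank.access(row)} ns")
```

Back-to-back accesses to the same open row pay only the column access time, while switching rows pays the precharge and activate costs, which is the concurrency motivation behind multiple banks.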
  • Improving DRAM Performance by Parallelizing Refreshes
Improving DRAM Performance by Parallelizing Refreshes with Accesses
Kevin Kai-Wei Chang, Donghyuk Lee, Zeshan Chishti†, Alaa R. Alameldeen†, Chris Wilkerson†, Yoongu Kim, Onur Mutlu. Carnegie Mellon University, †Intel Labs

Abstract: Modern DRAM cells are periodically refreshed to prevent data loss due to leakage. Commodity DDR (double data rate) DRAM refreshes cells at the rank level. This degrades performance significantly because it prevents an entire DRAM rank from serving memory requests while being refreshed. DRAM designed for mobile platforms, LPDDR (low power DDR) DRAM, supports an enhanced mode, called per-bank refresh, that refreshes cells at the bank level. This enables a bank to be accessed while another in the same rank is being refreshed, alleviating part of the negative performance impact of refreshes.

Each DRAM cell must be refreshed periodically every refresh interval as specified by the DRAM standards [11, 14]. The exact refresh interval time depends on the DRAM type (e.g., DDR or LPDDR) and the operating temperature. While DRAM is being refreshed, it becomes unavailable to serve memory requests. As a result, refresh latency significantly degrades system performance [24, 31, 33, 41] by delaying in-flight memory requests. This problem will become more prevalent as DRAM density increases, leading to more DRAM rows to be refreshed within the same refresh interval. DRAM chip density is expected to increase from 8Gb to 32Gb by 2020 as
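A rough back-of-the-envelope comparison of rank-level versus per-bank refresh, in the spirit of the excerpt above (the tREFI, tRFC, and bank-count values are illustrative assumptions, not numbers from the paper):

```python
# Rough comparison of rank-level vs per-bank refresh.
# Timing values are illustrative assumptions only.

T_REFI = 7800    # ns between refresh commands (typical order of magnitude)
T_RFC = 350      # ns a refresh keeps the refreshed unit busy (assumed)
BANKS = 8

# Rank-level refresh: the whole rank (all banks) is unavailable for tRFC.
rank_unavailable = T_RFC / T_REFI

# Per-bank refresh: only one bank is blocked at a time, so the other banks
# can keep serving requests (a per-bank refresh may be shorter in practice;
# reusing tRFC keeps the comparison conservative).
per_bank_unavailable = (T_RFC / T_REFI) / BANKS

print(f"rank-level refresh: {rank_unavailable:.1%} of time the rank is blocked")
print(f"per-bank refresh:  {per_bank_unavailable:.1%} of bank-time blocked on average")
```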
  • Datasheet DDR4 SDRAM Revision History
Rev. 1.4, Apr. 2018. M378A5244CB0 / M378A1K43CB2 / M378A2K43CB1. 288pin Unbuffered DIMM based on 8Gb C-die, 78FBGA with Lead-Free & Halogen-Free (RoHS compliant) datasheet.

SAMSUNG ELECTRONICS RESERVES THE RIGHT TO CHANGE PRODUCTS, INFORMATION AND SPECIFICATIONS WITHOUT NOTICE. Products and specifications discussed herein are for reference purposes only. All information discussed herein is provided on an "AS IS" basis, without warranties of any kind. This document and all information discussed herein remain the sole and exclusive property of Samsung Electronics. No license of any patent, copyright, mask work, trademark or any other intellectual property right is granted by one party to the other party under this document, by implication, estoppel or otherwise. Samsung products are not intended for use in life support, critical care, medical, safety equipment, or similar applications where product failure could result in loss of life or personal or physical harm, or any military or defense application, or any governmental procurement to which special terms or provisions may apply. For updates or additional information about Samsung products, contact your nearest Samsung office. All brand names, trademarks and registered trademarks belong to their respective owners. © 2018 Samsung Electronics Co., Ltd. All rights reserved.

Unbuffered DIMM datasheet, DDR4 SDRAM Revision History
• 1.0 (27th Jun. 2016, J.Y.Lee): First SPEC. release
• 1.1 (29th Jun. 2016, J.Y.Lee): Deletion of Function Block Diagram [M378A1K43CB2] on page 11; change of Physical Dimensions [M378A1K43CB1] on page 41
• 1.11 (7th Mar. 2017, J.Y.Lee): Correction of typo
• 1.2 (23rd Mar.): Change of Physical Dimensions [M378A1K43CB1] on page 41
  • Case Study on Integrated Architecture for In-Memory and In-Storage Computing
Case Study on Integrated Architecture for In-Memory and In-Storage Computing (Electronics journal article)
Manho Kim (1), Sung-Ho Kim (1), Hyuk-Jae Lee (1) and Chae-Eun Rhee (2,*)
(1) Department of Electrical Engineering, Seoul National University, Seoul 08826, Korea
(2) Department of Information and Communication Engineering, Inha University, Incheon 22212, Korea
(*) Correspondence: Tel.: +82-32-860-7429

Abstract: Since the advent of computers, computing performance has been steadily increasing. Moreover, recent technologies are mostly based on massive data, and the development of artificial intelligence is accelerating it. Accordingly, various studies are being conducted to increase the performance of computing and data access while reducing energy consumption. In-memory computing (IMC) and in-storage computing (ISC) are currently the most actively studied architectures to deal with the challenges of recent technologies. Since IMC performs operations in memory, there is a chance to overcome the memory bandwidth limit. ISC can reduce energy by using a low power processor inside storage without an expensive IO interface. To integrate the host CPU, IMC and ISC harmoniously, appropriate workload allocation that reflects the characteristics of the target application is required. In this paper, the energy and processing speed are evaluated according to the workload allocation and system conditions. The proof-of-concept prototyping system is implemented for the integrated architecture. The simulation results show that IMC improves the performance by 4.4 times and reduces total energy by 4.6 times over the baseline host CPU.
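A minimal sketch of the kind of workload-allocation trade-off the abstract describes (not the paper's model; the throughput and power numbers are made-up placeholders):

```python
# Toy workload-allocation comparison between host CPU, IMC and ISC.
# Throughput and power numbers are placeholders, not from the paper.

UNITS = {
    #            ops/s     watts
    "host_cpu": (1.0e9,    60.0),
    "imc":      (4.0e9,    20.0),   # fast: sidesteps the memory-bandwidth wall
    "isc":      (0.5e9,     5.0),   # slower but very low power, near the data
}

def evaluate(allocation, total_ops=1e10):
    """allocation maps unit name -> fraction of the work it receives."""
    # Units work in parallel; total time is set by the slowest share.
    times = {u: allocation.get(u, 0) * total_ops / UNITS[u][0] for u in UNITS}
    time = max(times.values())
    energy = sum(times[u] * UNITS[u][1] for u in UNITS)
    return time, energy

for alloc in ({"host_cpu": 1.0},
              {"host_cpu": 0.3, "imc": 0.6, "isc": 0.1}):
    t, e = evaluate(alloc)
    print(f"{alloc}: {t:.2f} s, {e:.1f} J")
```

Even with made-up numbers, the sketch shows why the allocation itself, not just the fastest unit, determines the overall time and energy.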
  • Computer Architectures
Computer Architectures: Processor + Memory
Richard Šusta, Pavel Píša. Czech Technical University in Prague, Faculty of Electrical Engineering. AE0B36APO Computer Architectures, Ver. 1.00, 2019

The main instruction cycle of the CPU
1. Initial setup/reset: set initial PC value, PSW, etc.
2. Read the instruction from the memory: PC → to the address bus; read the memory contents (machine instruction) and transfer it to the IR; PC + l → PC, where l is the length of the instruction
3. Decode the operation code (opcode)
4. Execute the operation: compute the effective address, select registers, read operands, pass them through the ALU and store the result
5. Check for exceptions/interrupts (and service them)
6. Repeat from step 2

Single cycle CPU: implementation of the load instruction
lw: type I; rs is the base address register, imm the offset, rt the register where the fetched data is stored.
I-format fields: opcode(6) bits 31:26, rs(5) bits 25:21, rt(5) bits 20:16, immediate(16) bits 15:0
[Figure: single-cycle datapath for lw, showing the instruction memory, register file (rs read out as SrcA), sign-extended immediate (SignImm as SrcB), the ALU producing AluOut as the data-memory address, and ReadData written back to rt; the register file write (WE3, RegWrite = 1) occurs on the rising edge of CLK.]
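The lw walk-through above can be mirrored in a few lines of code (a behavioural sketch of the single-cycle datapath, not material from the course; the register and memory contents are example values):

```python
# Behavioural sketch of the single-cycle lw datapath: decode an I-type MIPS
# instruction, form the effective address with a sign-extended immediate,
# read data memory, and write the result to rt.

def sign_extend16(value: int) -> int:
    return value - 0x10000 if value & 0x8000 else value

def execute_lw(instr: int, regs: list, data_mem: dict) -> int:
    opcode = (instr >> 26) & 0x3F
    rs = (instr >> 21) & 0x1F
    rt = (instr >> 16) & 0x1F
    imm = instr & 0xFFFF
    assert opcode == 0x23, "not a lw instruction"
    addr = regs[rs] + sign_extend16(imm)      # SrcA + SignImm through the ALU
    regs[rt] = data_mem[addr]                 # ReadData written back to rt
    return addr

regs = [0] * 32
regs[8] = 0x1000                              # $t0 holds the base address
data_mem = {0x1004: 42}
# lw $t1, 4($t0)  ->  opcode 0x23, rs=8, rt=9, imm=4
instr = (0x23 << 26) | (8 << 21) | (9 << 16) | 4
execute_lw(instr, regs, data_mem)
print(hex(regs[9]))                           # 0x2a
```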
  • ECE 571 – Advanced Microprocessor-Based Design Lecture 17
ECE 571 - Advanced Microprocessor-Based Design, Lecture 17
Vince Weaver, http://web.eece.maine.edu/~vweaver
3 April 2018

Announcements
• HW8 is readings

More DRAM

ECC Memory
• There's debate about how many errors can happen, anywhere from 10^-10 errors/bit·h (roughly one bit error per hour per gigabyte of memory) to 10^-17 errors/bit·h (roughly one bit error per millennium per gigabyte of memory); see the arithmetic sketch after this list
• Google did a study and they found more toward the high end
• Would you notice if you had a bit flipped?
• Scrubbing: only notice a flip once you read out a value

Registered Memory
• Registered vs Unregistered
• Registered has a buffer on board. More expensive but can have more DIMMs on a channel
• Registered may be slower (if it buffers for a cycle)
• RDIMM/UDIMM

Bandwidth/Latency Issues
• Truly random access? No, burst speed fast, random speed not.
• Is that a problem? Mostly filling cache lines?

Memory Controller
• Can we have full random access to memory? Why not just pass on CPU mem requests unchanged?
• What might have higher priority?
• Why might re-ordering the accesses help performance (back and forth between two pages)?

Reducing Refresh
• DRAM Refresh Mechanisms, Penalties, and Trade-Offs by Bhati et al.
• Refresh hurts performance:
  ◦ Memory controller stalls access to memory being refreshed
  ◦ Refresh takes energy (read/write); on a 32Gb device, up to 20% of energy consumption and 30% of performance

Async vs Sync Refresh
• Traditional refresh rates
  ◦ Async Standard (15.6us)
  ◦ Async Extended (125us)
  ◦ SDRAM -
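A quick arithmetic check of the quoted error-rate range (plain unit conversion on the numbers in the slide; the hours-per-year constant is the only added figure):

```python
# Convert the quoted per-bit error rates into errors per gigabyte.
# 1e-10 err/bit-h is roughly one flip per hour per GB;
# 1e-17 err/bit-h is roughly one flip per millennium per GB.

BITS_PER_GB = 8 * 2**30
HOURS_PER_YEAR = 8766

for rate in (1e-10, 1e-17):
    errors_per_hour = rate * BITS_PER_GB
    hours_per_error = 1 / errors_per_hour
    print(f"{rate:.0e} err/bit-h -> {errors_per_hour:.3g} errors/h per GB, "
          f"one error every {hours_per_error:.3g} hours "
          f"({hours_per_error / HOURS_PER_YEAR:.3g} years)")
```

The first rate works out to about one flip every 1.2 hours per gigabyte, and the second to roughly one flip every 1,300 years, matching the slide's hour-versus-millennium framing.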
  • ADM-PCIE-9H3 V1.5
ADM-PCIE-9H3 Datasheet, 10th September 2019, Revision 1.5, AD01365

Applications
• High-Performance Network Accelerator
• Data Center Data Processor
• High Performance Computing (HPC)
• System Modelling
• Market Analysis

Board Features
• 1x OpenCAPI Interface
• 1x QSFP-DD Cage
• Shrouded heatsink with passive and fan cooling options

FPGA Features
• 2x 4GB HBM Gen2 memory (32 AXI ports provide 460GB/s access bandwidth)
• 3x 100G Ethernet MACs (incl. KR4 RS-FEC)
• 3x 150G Interlaken cores
• 2x PCI Express x16 Gen3 / x8 Gen4 cores

Summary
The ADM-PCIE-9H3 is a high-performance FPGA processing card intended for data center applications using Virtex UltraScale+ High Bandwidth Memory FPGAs from Xilinx. The ADM-PCIE-9H3 utilises the Xilinx Virtex UltraScale Plus FPGA family, which includes on-substrate High Bandwidth Memory (HBM Gen2). This provides exceptional memory read/write performance while reducing the overall power consumption of the board by negating the need for external SDRAM devices. There are also a number of high speed interface options available, including 100G Ethernet MACs and OpenCAPI connectivity; to make the most of these interfaces the ADM-PCIE-9H3 is fitted with a QSFP-DD cage (8x 28Gbps lanes) and one OpenCAPI interface for ultra low latency communications.

Target Device: Xilinx Virtex UltraScale Plus XCVU33P-2E (FSVH2104); LUTs = 440k (872k); FFs = 879k (1743k); DSPs = 2880 (5952); BRAM = 23.6Mb (47.3Mb); URAM = 90.0Mb (180.0Mb)
Host Interface: 1x PCI Express Gen3 x16, or 1x/2x* PCI Express Gen4 x8, or OpenCAPI
Board Format: 1/2-length, low-profile x16 PCIe form factor; WxHxD = 19.7mm x 80.1mm x 181.5mm; Weight = TBC
Memory: 2x 4GB HBM Gen2 (32 AXI ports provide 460GB/s access bandwidth)
Communications Interfaces: 1x QSFP-DD 8x28Gbps - 10/25/40/100G; 3x 100G Ethernet MACs (incl.
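A trivial sanity check on the HBM figures quoted in the feature list (plain arithmetic on the stated numbers, nothing further from the datasheet):

```python
# Per-port share of the stated 460 GB/s HBM access bandwidth,
# and total HBM capacity from the two 4GB stacks.
total_bw_gbs = 460
axi_ports = 32
hbm_stacks, gb_per_stack = 2, 4

print(f"{total_bw_gbs / axi_ports:.2f} GB/s per AXI port")   # ~14.38 GB/s
print(f"{hbm_stacks * gb_per_stack} GB of HBM Gen2 in total")
```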
  • 256M(16Mx16) GDDR SDRAM HY5DU561622CT
HY5DU561622CT: 256M (16Mx16) GDDR SDRAM

This document is a general product description and is subject to change without notice. Hynix Electronics does not assume any responsibility for use of circuits described. No patent licenses are implied. Rev. 1.3 / Oct. 2004

Revision History
• 0.1 (Dec. 2002): Defined preliminary specification
• 0.2 (Feb. 2003): Defined IDD spec.
• 0.3 (Mar. 2003): Improvement of VDD from 2.8V to 2.5V at 300MHz
• 0.4 (Apr. 2003): Changed IDD spec.
• 0.5 (May 2003): 166MHz speed added
• 0.6 (June 2003): 1) Changed VDD value of HY5DU561622CT-33/36 from 2.5V to 2.6V; 2) Added tRC@Auto Precharge parameter in AC CHARACTERISTICS - I
• 0.7 (June 2003): Added 350MHz speed
• 0.8 (July 2003): Changed VDD/VDDQ min/max range of HY5DU561622CT-33/36
• 0.9 (Aug. 2003): 1) Changed tRAS_max value from 120K to 100K at all frequencies; 2) Changed IDD6 value from 3mA to 4mA at all frequencies; 3) Changed refresh time from 64ms to 32ms at all frequencies
• 1.0 (Jun. 2004): Refresh time restored to 64ms from 32ms
• 1.1 (Aug. 2004): Inserted AC overshoot comment
• 1.2 (Sep. 2004): tRAS_max change
• 1.3 (Oct. 2004): tRAS_min & tQHS change

PRELIMINARY DESCRIPTION
The Hynix HY5DU561622CT is a 268,435,456-bit CMOS Double Data Rate (DDR) Synchronous DRAM, ideally suited for point-to-point applications which require high bandwidth. The Hynix 16Mx16 DDR SDRAMs offer fully synchronous operation referenced to both rising and falling edges of the clock.
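A quick sanity check of the quoted organization and capacity (the 350 MHz clock used for the bandwidth estimate is one of the speed grades listed in the revision history above; treating it as the I/O clock here is an assumption of this sketch):

```python
# Check the 16Mx16 organization against the quoted 268,435,456-bit capacity,
# and estimate peak bandwidth for a DDR x16 part at a 350 MHz clock.

words = 16 * 2**20          # 16M addressable locations
width = 16                  # bits per location
print(words * width)        # 268435456 bits

clock_hz = 350e6            # one of the listed speed grades (assumed I/O clock)
peak_bytes = 2 * clock_hz * width / 8   # DDR: two transfers per clock
print(f"{peak_bytes / 1e9:.2f} GB/s peak")   # ~1.40 GB/s
```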
  • High Bandwidth Memory for Graphics Applications Contents
High Bandwidth Memory for Graphics Applications

Contents
• Differences in Requirements: System Memory vs. Graphics Memory
• Timeline of Graphics Memory Standards
• GDDR2
• GDDR3
• GDDR4
• GDDR5 SGRAM
• Problems with GDDR
• Solution: Introduction to HBM
• Performance comparisons with GDDR5
• Benchmarks
• Hybrid Memory Cube

Differences in Requirements
• System Memory: optimized for low latency; short burst vector loads; equal read/write latency ratio; very general solutions and designs
• Graphics Memory: optimized for high bandwidth; long burst vector loads; low read/write latency ratio; designs can be very haphazard

Brief History of Graphics Memory Types
• Ancient History: VRAM, WRAM, MDRAM, SGRAM
• Bridge to modern times: GDDR2
• The first modern standard: GDDR3
• Rapidly outclassed: GDDR4
• Current state: GDDR5

GDDR2
• First implemented with Nvidia GeForce FX 5800 (2003)
• Midway point between DDR and 'true' DDR2
• Stepping stone towards DDR-based graphics memory
• Second-generation GDDR2 based on DDR2

GDDR3
• Designed by ATI Technologies, first used by Nvidia GeForce FX 5700 (2004)
• Based off of the same technological base as DDR2
• Lower heat and power consumption
• Uses internal terminators and a 32-bit bus

GDDR4
• Based on DDR3, designed by Samsung from 2005-2007
• Introduced Data Bus Inversion (DBI), illustrated in the sketch after this list
• Doubled prefetch size to 8n
• Used on ATI Radeon 2xxx and 3xxx, never became commercially viable

GDDR5 SGRAM
• Based on DDR3 SDRAM memory
• Inherits benefits of GDDR4
• First used in AMD Radeon HD 4870 video cards (2008)
• Current
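Data Bus Inversion, introduced with GDDR4 above, can be shown with a short sketch (a generic illustration of the idea rather than the exact GDDR4 encoding rule): if sending a byte as-is would drive more than half of the lines to the costly level, the byte is inverted and a separate DBI bit tells the receiver to undo it.

```python
# Generic illustration of Data Bus Inversion (DBI): invert a byte when more
# than half of its bits would otherwise be driven to the costly level
# (here, logic 0), and signal the inversion on a separate DBI bit.

def dbi_encode(byte):
    """Return (byte_on_bus, dbi_flag)."""
    zeros = 8 - bin(byte & 0xFF).count("1")
    if zeros > 4:                      # inverting reduces the number of 0s
        return (~byte) & 0xFF, 1
    return byte & 0xFF, 0

def dbi_decode(byte, dbi):
    return (~byte) & 0xFF if dbi else byte

for value in (0x00, 0x0F, 0xF8):
    encoded, flag = dbi_encode(value)
    assert dbi_decode(encoded, flag) == value
    print(f"data {value:02x} -> bus {encoded:02x}, DBI={flag}")
```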
  • Architectures for High Performance Computing and Data Systems Using Byte-Addressable Persistent Memory
Architectures for High Performance Computing and Data Systems using Byte-Addressable Persistent Memory
Adrian Jackson, Mark Parsons, Michèle Weiland (EPCC, The University of Edinburgh, Edinburgh, United Kingdom); Bernhard Homölle (SVA System Vertrieb Alexander GmbH, Paderborn, Germany)

Abstract: Non-volatile, byte addressable, memory technology with performance close to main memory promises to revolutionise computing systems in the near future. Such memory technology provides the potential for extremely large memory regions (i.e. > 3TB per server), very high performance I/O, and new ways of storing and sharing data for applications and workflows. This paper outlines an architecture that has been designed to exploit such memory for High Performance Computing and High Performance Data Analytics systems, along with descriptions of how applications could benefit from such hardware.
Index Terms: Non-volatile memory, persistent memory, storage class memory, system architecture, systemware, NVRAM, SCM, B-APM

[...] example of such hardware, used for data storage. More recently, flash memory has been used for high performance I/O in the form of Solid State Disk (SSD) drives, providing higher bandwidth and lower latency than traditional Hard Disk Drives (HDD). Whilst flash memory can provide fast input/output (I/O) performance for computer systems, there are some drawbacks. It has limited endurance when compared to HDD technology, restricted by the number of modifications a memory cell can undertake and thus the effective lifetime of the flash storage [29]. It is often also more expensive than other storage technologies.