Efficient Design and Communication for 3D Stacked Dynamic Memory
Total Page:16
File Type:pdf, Size:1020Kb

Load more
Recommended publications
-
Performance Impact of Memory Channels on Sparse and Irregular Algorithms
Performance Impact of Memory Channels on Sparse and Irregular Algorithms Oded Green1,2, James Fox2, Jeffrey Young2, Jun Shirako2, and David Bader3 1NVIDIA Corporation — 2Georgia Institute of Technology — 3New Jersey Institute of Technology Abstract— Graph processing is typically considered to be a Graph algorithms are typically latency-bound if there is memory-bound rather than compute-bound problem. One com- not enough parallelism to saturate the memory subsystem. mon line of thought is that more available memory bandwidth However, the growth in parallelism for shared-memory corresponds to better graph processing performance. However, in this work we demonstrate that the key factor in the utilization systems brings into question whether graph algorithms are of the memory system for graph algorithms is not necessarily still primarily latency-bound. Getting peak or near-peak the raw bandwidth or even the latency of memory requests. bandwidth of current memory subsystems requires highly Instead, we show that performance is proportional to the parallel applications as well as good spatial and temporal number of memory channels available to handle small data memory reuse. The lack of spatial locality in sparse and transfers with limited spatial locality. Using several widely used graph frameworks, including irregular algorithms means that prefetched cache lines have Gunrock (on the GPU) and GAPBS & Ligra (for CPUs), poor data reuse and that the effective bandwidth is fairly low. we evaluate key graph analytics kernels using two unique The introduction of high-bandwidth memories like HBM and memory hierarchies, DDR-based and HBM/MCDRAM. Our HMC have not yet closed this inefficiency gap for latency- results show that the differences in the peak bandwidths sensitive accesses [19]. -
Product Guide SAMSUNG ELECTRONICS RESERVES the RIGHT to CHANGE PRODUCTS, INFORMATION and SPECIFICATIONS WITHOUT NOTICE
May. 2018 DDR4 SDRAM Memory Product Guide SAMSUNG ELECTRONICS RESERVES THE RIGHT TO CHANGE PRODUCTS, INFORMATION AND SPECIFICATIONS WITHOUT NOTICE. Products and specifications discussed herein are for reference purposes only. All information discussed herein is provided on an "AS IS" basis, without warranties of any kind. This document and all information discussed herein remain the sole and exclusive property of Samsung Electronics. No license of any patent, copyright, mask work, trademark or any other intellectual property right is granted by one party to the other party under this document, by implication, estoppel or other- wise. Samsung products are not intended for use in life support, critical care, medical, safety equipment, or similar applications where product failure could result in loss of life or personal or physical harm, or any military or defense application, or any governmental procurement to which special terms or provisions may apply. For updates or additional information about Samsung products, contact your nearest Samsung office. All brand names, trademarks and registered trademarks belong to their respective owners. © 2018 Samsung Electronics Co., Ltd. All rights reserved. - 1 - May. 2018 Product Guide DDR4 SDRAM Memory 1. DDR4 SDRAM MEMORY ORDERING INFORMATION 1 2 3 4 5 6 7 8 9 10 11 K 4 A X X X X X X X - X X X X SAMSUNG Memory Speed DRAM Temp & Power DRAM Type Package Type Density Revision Bit Organization Interface (VDD, VDDQ) # of Internal Banks 1. SAMSUNG Memory : K 8. Revision M: 1st Gen. A: 2nd Gen. 2. DRAM : 4 B: 3rd Gen. C: 4th Gen. D: 5th Gen. -
You Need to Know About Ddr4
Overcoming DDR Challenges in High-Performance Designs Mazyar Razzaz, Applications Engineering Jeff Steinheider, Product Marketing September 2018 | AMF-NET-T3267 Company Public – NXP, the NXP logo, and NXP secure connections for a smarter world are trademarks of NXP B.V. All other product or service names are the property of their respective owners. © 2018 NXP B.V. Agenda • Basic DDR SDRAM Structure • DDR3 vs. DDR4 SDRAM Differences • DDR Bring up Issues • Configurations and Validation via QCVS Tool COMPANY PUBLIC 1 BASIC DDR SDRAM STRUCTURE COMPANY PUBLIC 2 Single Transistor Memory Cell Access Transistor Column (bit) line Row (word) line G S D “1” => Vcc “0” => Gnd “precharged” to Vcc/2 Cbit Ccol Storage Parasitic Line Capacitor Vcc/2 Capacitance COMPANY PUBLIC 3 Memory Arrays B0 B1 B2 B3 B4 B5 B6 B7 ROW ADDRESS DECODER ADDRESS ROW W0 W1 W2 SENSE AMPS & WRITE DRIVERS COLUMN ADDRESS DECODER COMPANY PUBLIC 4 Internal Memory Banks • Multiple arrays organized into banks • Multiple banks per memory device − DDR3 – 8 banks, and 3 bank address (BA) bits − DDR4 – 16 banks with 4 banks in each of 4 sub bank groups − Can have one active row in each bank at any given time • Concurrency − Can be opening or closing a row in one bank while accessing another bank Bank 0 Bank 1 Bank 2 Bank 3 Row 0 Row 1 Row 2 Row 3 Row … Row Buffers COMPANY PUBLIC 5 Memory Access • A requested row is ACTIVATED and made accessible through the bank’s row buffers • READ and/or WRITE are issued to the active row in the row buffers • The row is PRECHARGED and is no longer -
Improving DRAM Performance by Parallelizing Refreshes
Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Kai-Wei Chang Donghyuk Lee Zeshan Chishti† [email protected] [email protected] [email protected] Alaa R. Alameldeen† Chris Wilkerson† Yoongu Kim Onur Mutlu [email protected] [email protected] [email protected] [email protected] Carnegie Mellon University †Intel Labs Abstract Each DRAM cell must be refreshed periodically every re- fresh interval as specified by the DRAM standards [11, 14]. Modern DRAM cells are periodically refreshed to prevent The exact refresh interval time depends on the DRAM type data loss due to leakage. Commodity DDR (double data rate) (e.g., DDR or LPDDR) and the operating temperature. While DRAM refreshes cells at the rank level. This degrades perfor- DRAM is being refreshed, it becomes unavailable to serve mance significantly because it prevents an entire DRAM rank memory requests. As a result, refresh latency significantly de- from serving memory requests while being refreshed. DRAM de- grades system performance [24, 31, 33, 41] by delaying in- signed for mobile platforms, LPDDR (low power DDR) DRAM, flight memory requests. This problem will become more preva- supports an enhanced mode, called per-bank refresh, that re- lent as DRAM density increases, leading to more DRAM rows freshes cells at the bank level. This enables a bank to be ac- to be refreshed within the same refresh interval. DRAM chip cessed while another in the same rank is being refreshed, alle- density is expected to increase from 8Gb to 32Gb by 2020 as viating part of the negative performance impact of refreshes. -
Datasheet DDR4 SDRAM Revision History
Rev. 1.4, Apr. 2018 M378A5244CB0 M378A1K43CB2 M378A2K43CB1 288pin Unbuffered DIMM based on 8Gb C-die 78FBGA with Lead-Free & Halogen-Free (RoHS compliant) datasheet SAMSUNG ELECTRONICS RESERVES THE RIGHT TO CHANGE PRODUCTS, INFORMATION AND SPECIFICATIONS WITHOUT NOTICE. Products and specifications discussed herein are for reference purposes only. All information discussed herein is provided on an "AS IS" basis, without warranties of any kind. This document and all information discussed herein remain the sole and exclusive property of Samsung Electronics. No license of any patent, copyright, mask work, trademark or any other intellectual property right is granted by one party to the other party under this document, by implication, estoppel or other- wise. Samsung products are not intended for use in life support, critical care, medical, safety equipment, or similar applications where product failure could result in loss of life or personal or physical harm, or any military or defense application, or any governmental procurement to which special terms or provisions may apply. For updates or additional information about Samsung products, contact your nearest Samsung office. All brand names, trademarks and registered trademarks belong to their respective owners. © 2018 Samsung Electronics Co., Ltd.GG All rights reserved. - 1 - Rev. 1.4 Unbuffered DIMM datasheet DDR4 SDRAM Revision History Revision No. History Draft Date Remark Editor 1.0 - First SPEC. Release 27th Jun. 2016 - J.Y.Lee 1.1 - Deletion of Function Block Diagram [M378A1K43CB2] on page 11 29th Jun. 2016 - J.Y.Lee - Change of Physical Dimensions [M378A1K43CB1] on page 41 1.11 - Correction of Typo 7th Mar. 2017 - J.Y.Lee 1.2 - Change of Physical Dimensions [M378A1K43CB1] on page 41 23h Mar. -
Case Study on Integrated Architecture for In-Memory and In-Storage Computing
electronics Article Case Study on Integrated Architecture for In-Memory and In-Storage Computing Manho Kim 1, Sung-Ho Kim 1, Hyuk-Jae Lee 1 and Chae-Eun Rhee 2,* 1 Department of Electrical Engineering, Seoul National University, Seoul 08826, Korea; [email protected] (M.K.); [email protected] (S.-H.K.); [email protected] (H.-J.L.) 2 Department of Information and Communication Engineering, Inha University, Incheon 22212, Korea * Correspondence: [email protected]; Tel.: +82-32-860-7429 Abstract: Since the advent of computers, computing performance has been steadily increasing. Moreover, recent technologies are mostly based on massive data, and the development of artificial intelligence is accelerating it. Accordingly, various studies are being conducted to increase the performance and computing and data access, together reducing energy consumption. In-memory computing (IMC) and in-storage computing (ISC) are currently the most actively studied architectures to deal with the challenges of recent technologies. Since IMC performs operations in memory, there is a chance to overcome the memory bandwidth limit. ISC can reduce energy by using a low power processor inside storage without an expensive IO interface. To integrate the host CPU, IMC and ISC harmoniously, appropriate workload allocation that reflects the characteristics of the target application is required. In this paper, the energy and processing speed are evaluated according to the workload allocation and system conditions. The proof-of-concept prototyping system is implemented for the integrated architecture. The simulation results show that IMC improves the performance by 4.4 times and reduces total energy by 4.6 times over the baseline host CPU. -
Computer Architectures
Computer Architectures Processor+Memory Richard Šusta, Pavel Píša Czech Technical University in Prague, Faculty of Electrical Engineering AE0B36APO Computer Architectures Ver.1.00 2019 1 The main instruction cycle of the CPU 1. Initial setup/reset – set initial PC value, PSW, etc. 2. Read the instruction from the memory PC → to the address bus Read the memory contents (machine instruction) and transfer it to the IR PC+l → PC, where l is length of the instruction 3. Decode operation code (opcode) 4. Execute the operation compute effective address, select registers, read operands, pass them through ALU and store result 5. Check for exceptions/interrupts (and service them) 6. Repeat from the step 2 AE0B36APO Computer Architectures 2 Single cycle CPU – implementation of the load instruction lw: type I, rs – base address, imm – offset, rt – register where to store fetched data I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0 ALUControl 25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU A RD Instr. A2 RD2 SrcB AluOut Data ReadData Memory Memory A3 Reg. WD3 File WD 15:0 Sign Ext SignImm B35APO Computer Architectures 3 Single cycle CPU – implementation of the load instruction lw: type I, rs – base address, imm – offset, rt – register where to store fetched data I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0 Write on rising edge of CLK RegWrite = 1 ALUControl 25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU A RD Instr. -
ECE 571 – Advanced Microprocessor-Based Design Lecture 17
ECE 571 { Advanced Microprocessor-Based Design Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver [email protected] 3 April 2018 Announcements • HW8 is readings 1 More DRAM 2 ECC Memory • There's debate about how many errors can happen, anywhere from 10−10 error/bit*h (roughly one bit error per hour per gigabyte of memory) to 10−17 error/bit*h (roughly one bit error per millennium per gigabyte of memory • Google did a study and they found more toward the high end • Would you notice if you had a bit flipped? • Scrubbing { only notice a flip once you read out a value 3 Registered Memory • Registered vs Unregistered • Registered has a buffer on board. More expensive but can have more DIMMs on a channel • Registered may be slower (if it buffers for a cycle) • RDIMM/UDIMM 4 Bandwidth/Latency Issues • Truly random access? No, burst speed fast, random speed not. • Is that a problem? Mostly filling cache lines? 5 Memory Controller • Can we have full random access to memory? Why not just pass on CPU mem requests unchanged? • What might have higher priority? • Why might re-ordering the accesses help performance (back and forth between two pages) 6 Reducing Refresh • DRAM Refresh Mechanisms, Penalties, and Trade-Offs by Bhati et al. • Refresh hurts performance: ◦ Memory controller stalls access to memory being refreshed ◦ Refresh takes energy (read/write) On 32Gb device, up to 20% of energy consumption and 30% of performance 7 Async vs Sync Refresh • Traditional refresh rates ◦ Async Standard (15.6us) ◦ Async Extended (125us) ◦ SDRAM - -
Adm-Pcie-9H3 V1.5
ADM-PCIE-9H3 10th September 2019 Datasheet Revision: 1.5 AD01365 Applications Board Features • High-Performance Network Accelerator • 1x OpenCAPI Interface • Data CenterData Processor • 1x QSFP-DD Cages • High Performance Computing (HPC) • Shrouded heatsink with passive and fan cooling • System Modelling options • Market Analysis FPGA Features • 2x 4GB HBM Gen2 memory (32 AXI Ports provide 460GB/s Access Bandwidth) • 3x 100G Ethernet MACs (incl. KR4 RS-FEC) • 3x 150G Interlaken cores • 2x PCI Express x16 Gen3 / x8 Gen4 cores Summary The ADM-PCIE-9H3 is a high-performance FPGA processing card intended for data center applications using Virtex UltraScale+ High Bandwidth Memory FPGAs from Xilinx. The ADM-PCIE-9H3 utilises the Xilinx Virtex Ultrascale Plus FPGA family that includes on substrate High Bandwidth Memory (HBM Gen2). This provides exceptional memory Read/Write performance while reducing the overall power consumption of the board by negating the need for external SDRAM devices. There are also a number of high speed interface options available including 100G Ethernet MACs and OpenCAPI connectivity, to make the most of these interfaces the ADM-PCIE-9H3 is fitted with a QSFP-DD Cage (8x28Gbps lanes) and one OpenCAPI interface for ultra low latency communications. Target Device Host Interface Xilinx Virtex UltraScale Plus: XCVU33P-2E 1x PCI Express Gen3 x16 or 1x/2x* PCI Express (FSVH2104) Gen4 x8 or OpenCAPI LUTs = 440k(872k) Board Format FFs = 879k(1743k) DSPs = 2880(5952) 1/2 Length low profile x16 PCIe form Factor BRAM = 23.6Mb(47.3Mb) WxHxD = 19.7mm x 80.1mm x 181.5mm URAM = 90.0Mb(180.0Mb) Weight = TBCg 2x 4GB HBM Gen2 memory (32 AXI Ports Communications Interfaces provide 460GB/s Access Bandwidth) 1x QSFP-DD 8x28Gbps - 10/25/40/100G 3x 100G Ethernet MACs (incl. -
256M(16Mx16) GDDR SDRAM HY5DU561622CT
HY5DU561622CT 256M(16Mx16) GDDR SDRAM HY5DU561622CT This document is a general product description and is subject to change without notice. Hynix Electronics does not assume any responsibility for use of circuits described. No patent licenses are implied. Rev. 1.3 / Oct. 2004 HY5DU561622CT Revision History Revision History Draft Date Remark No. 0.1 Defined Preliminary Specification Dec. 2002 0.2 Defined IDD Spec. Feb. 2003 0.3 Improvement of VDD from 2.8V to 2.5V in 300MHz Mar. 2003 0.4 Changed IDD Spec. Apr. 2003 0.5 166MHz Speed added. May. 2003 1) Changed VDD Value of HY5DU561622CT-33/36 from 2.5V to 2.6V 0.6 June 2003 2) Added tRC@Auto Precharge Parameter in AC CHARACTERISTICS - I 0.7 Added 350MHz Speed June 2003 0.8 Changed VDD/VDDQ min/max range of HY5DU561622CT- 33 /36 July 2003 1) Changed tRAS_max Value from 120K to 100K in All Frequency 0.9 2) Changed IDD6 value from 3mA to 4mA in All Frequency Aug. 2003 3) Changed Refresh Time from 64ms to 32ms in All Frequency 1.0 Refresh Time restore to 64ms from 32ms Jun. 2004 1.1 Insert AC Overshoot comment Aug. 2004 1.2 tRAS_max change Sep. 2004 1.3 tRAS_min & tQHS change Oct. 2004 Rev. 1.3 / Oct. 2004 2 HY5DU561622CT PRELIMINARY DESCRIPTION The Hynix HY5DU561622CT is a 268,435,456-bit CMOS Double Data Rate(DDR) Synchronous DRAM, ideally suited for the point-to-point applications which requires high bandwidth. The Hynix 16Mx16 DDR SDRAMs offer fully synchronous operations referenced to both rising and falling edges of the clock. -
High Bandwidth Memory for Graphics Applications Contents
High Bandwidth Memory for Graphics Applications Contents • Differences in Requirements: System Memory vs. Graphics Memory • Timeline of Graphics Memory Standards • GDDR2 • GDDR3 • GDDR4 • GDDR5 SGRAM • Problems with GDDR • Solution ‐ Introduction to HBM • Performance comparisons with GDDR5 • Benchmarks • Hybrid Memory Cube Differences in Requirements System Memory Graphics Memory • Optimized for low latency • Optimized for high bandwidth • Short burst vector loads • Long burst vector loads • Equal read/write latency ratio • Low read/write latency ratio • Very general solutions and designs • Designs can be very haphazard Brief History of Graphics Memory Types • Ancient History: VRAM, WRAM, MDRAM, SGRAM • Bridge to modern times: GDDR2 • The first modern standard: GDDR4 • Rapidly outclassed: GDDR4 • Current state: GDDR5 GDDR2 • First implemented with Nvidia GeForce FX 5800 (2003) • Midway point between DDR and ‘true’ DDR2 • Stepping stone towards DDR‐based graphics memory • Second‐generation GDDR2 based on DDR2 GDDR3 • Designed by ATI Technologies , first used by Nvidia GeForce FX 5700 (2004) • Based off of the same technological base as DDR2 • Lower heat and power consumption • Uses internal terminators and a 32‐bit bus GDDR4 • Based on DDR3, designed by Samsung from 2005‐2007 • Introduced Data Bus Inversion (DBI) • Doubled prefetch size to 8n • Used on ATI Radeon 2xxx and 3xxx, never became commercially viable GDDR5 SGRAM • Based on DDR3 SDRAM memory • Inherits benefits of GDDR4 • First used in AMD Radeon HD 4870 video cards (2008) • Current -
Architectures for High Performance Computing and Data Systems Using Byte-Addressable Persistent Memory
Architectures for High Performance Computing and Data Systems using Byte-Addressable Persistent Memory Adrian Jackson∗, Mark Parsonsy, Michele` Weilandz EPCC, The University of Edinburgh Edinburgh, United Kingdom ∗[email protected], [email protected], [email protected] Bernhard Homolle¨ x SVA System Vertrieb Alexander GmbH Paderborn, Germany x [email protected] Abstract—Non-volatile, byte addressable, memory technology example of such hardware, used for data storage. More re- with performance close to main memory promises to revolutionise cently, flash memory has been used for high performance I/O computing systems in the near future. Such memory technology in the form of Solid State Disk (SSD) drives, providing higher provides the potential for extremely large memory regions (i.e. > 3TB per server), very high performance I/O, and new ways bandwidth and lower latency than traditional Hard Disk Drives of storing and sharing data for applications and workflows. (HDD). This paper outlines an architecture that has been designed to Whilst flash memory can provide fast input/output (I/O) exploit such memory for High Performance Computing and High performance for computer systems, there are some draw backs. Performance Data Analytics systems, along with descriptions of It has limited endurance when compare to HDD technology, how applications could benefit from such hardware. Index Terms—Non-volatile memory, persistent memory, stor- restricted by the number of modifications a memory cell age class memory, system architecture, systemware, NVRAM, can undertake and thus the effective lifetime of the flash SCM, B-APM storage [29]. It is often also more expensive than other storage technologies.