High-Bandwidth Memory Interface Design

Chulwoo Kim
Dept. of Electrical Engineering, Korea University, Seoul, Korea
February 17, 2013
(Slide 1 of 86)

Slides 2-3: Outline
- Introduction
  - DRAM 101
  - Simplified DRAM Architecture and Operation
  - Differences of DRAM (DDRx, GDDRx, LPDDRx)
  - Trend
  - Memory Interface: Differences and Issues
- Clock Generation and Distribution
- Transceiver Design
- TSV Interface for DRAM
- Summary
- References

Slide 4: DRAM 101
SDRAM (Synchronous Dynamic Random Access Memory)
- SDR (Single Data Rate): one data item (D) on DQ per CLK cycle
- DDR (Double Data Rate): two data items (D D) on DQ per CLK cycle
  - Main memory: DDRx (PC, notebook, server)
  - Graphics memory: GDDRx (graphic card, console)
  - Mobile memory: LPDDRx (phone, tablet PC)
[Figure: the MCU sends command, CLK, and data to the SDRAM; read data appear on DQ after the CAS* latency as a burst (burst length of eight D shown)]
* CAS: Column Address Strobe

Slide 5: DRAM DDR4 Die Photo
[Figure: die photo with Bank 0 to Bank 15]
  Supply voltage   VDD = 1.2 V, VPP = 2.5 V
  Process          38 nm CMOS / 3-metal
  Banks            4 bank groups, 16 banks
  Data rate        2400 Mbps
  Number of IOs    x4 / x8
[1] K. B. Koo et al., ISSCC 2012, pp. 40-41

Slide 6: Simplified DRAM Architecture
[Figure: block diagram of four banks around a peripheral circuit]
- Bank: cell array (word line WL, bit lines BLT/BLB), row decoder, word line driver, row repair fuse, column decoder, column repair fuse, BLSA*, write driver / read amplifier
- Peripheral circuit: generator, serial-to-parallel and parallel-to-serial converters, DLL (DCLK, ICLK), CMD controller, DQ RX, DQ TX, CLK/ADD/CMD buffer
* BLSA: Bit line sense amplifier

Slide 7: Concept of DRAM Operation
- WRITE: serial to parallel (DQ → GIO)
- READ: parallel to serial (GIO → DQ)
- DQ RX and DQ TX each carry Ndq bits; the GIO carries Np×Ndq bits between the peripheral circuit and the banks
* BLSA: Bit line sense amplifier
* Np: Number of pre-fetch
* Ndq: Number of DQ
* GIO: Global I/O

Slide 8: Pre-fetch Timing (DDR1, BL* = 2)
[Timing diagram: two RD commands with tCCD* = 1; each GIO access is output after CL* as a BL = 2 burst (bits 0, 1) on DQ, aligned with DQS]
Number of GIO channels = Np×Ndq = 2×8 = 16 (DDR1 x8)
* tCCD: CAS to CAS delay
* CL: CAS latency
* BL: Burst length
[2] JEDEC, JESD79F, pp. 24-29

Slide 9: Pre-fetch Diagram (DDR1)
- Number of GIO channels = 2×Ndq
- Pre-fetch operation: 2-bit pre-fetch, [2×Ndq] data access
- If the output data rate is 400 Mbps, the internal data rate is 200 Mbps.

Slide 10: Pre-fetch Timing (DDR2, BL = 4)
[Timing diagram: two RD commands with tCCD = 2; each GIO access is output after RL* as a BL = 4 burst (bits 0 to 3) on DQ, aligned with DQS]
Number of GIO channels = Np×Ndq = 4×8 = 32 (DDR2 x8)
* RL: READ latency
[3] JEDEC, JESD79-2F, pp. 35

Slide 11: Pre-fetch Diagram (DDR2)
- Number of GIO channels = 4×Ndq
- Pre-fetch operation: 4-bit pre-fetch, [4×Ndq] data access
- If the output data rate is 800 Mbps, the internal data rate is 200 Mbps, the same as DDR1.

Slide 12: Pre-fetch Timing (DDR3, BL = 8)
[Timing diagram: two RD commands with tCCD = 4; each GIO access is output after RL as a BL = 8 burst (bits 0 to 7) on DQ, aligned with DQS]
Number of GIO channels = Np×Ndq = 8×8 = 64 (DDR3 x8)
[4] JEDEC, JESD79-3F, pp. 62

Slide 13: Pre-fetch Diagram (DDR3)
- Number of GIO channels = 8×Ndq
- Pre-fetch operation: 8-bit pre-fetch, [8×Ndq] data access
- If the output data rate is 1.6 Gbps, the internal data rate is 200 Mbps, the same as DDR1.

Slide 14: Bank Grouping Timing (DDR4, BL = 8)
[Timing diagram: three RD commands to bank groups G0, G1, G1 with tCCD_S = 4 and tCCD_L = 5; per-group GIO buses GIO_BG0 to GIO_BG3; each read is output after RL as a BL = 8 burst (bits 0 to 7) on DQ, aligned with DQS]
Number of GIO channels = Np×Ndq×Ngroup = 8×8×4 = 256 (DDR4 x8)
[5] JEDEC, JESD79-4, pp. 77-78
[6] T. Y. Oh et al., ISSCC 2010, pp. 434-435

Slide 15: Pre-fetch & Bank Grouping (DDR4)
[Figure: bank groups 0 to 3 connected through a GIO MUX]
- Number of GIO channels = 8×Ndq
- Pre-fetch operation: 8-bit pre-fetch
- Bank grouping
[1] K. B. Koo et al., ISSCC 2012, pp. 40-41

Slide 16: Differences of DDRx, GDDRx, LPDDRx
                   DDRx            GDDRx                     LPDDRx
  Architecture     [Figure: bank and pad placement for each type]
  Application      PC/Server       Graphic card              Mobile/Consumer
  Socket           DIMM            On board                  MCP*/PoP*/SiP*
  IO               x4/x8           x16/x32                   x16/x32
  Unique function  (none listed)   Single uni-directional    No DLL
                                   WDQS, RDQS                DPD*
                                   VDDQ termination          PASR*
                                   CRC, DBI                  TCSR*
                                   ABI
* MCP: Multi chip package
* PoP: Package on package
* SiP: System in package
* DPD: Deep power down
* PASR: Partial array self refresh
* TCSR: Temperature compensated self refresh

Slide 17: DDR Comparison
                       DDR1         DDR2         DDR3         DDR4
  VDD [V]              2.5          1.8          1.5          1.2
  Data rate [bps/pin]  200M~400M    400M~800M    800M~2.1G    1.6G~3.2G
  Pre-fetch            2 bit        4 bit        8 bit        8 bit
  Strobe               Single DQS   Differential DQS, DQSB (DDR2 to DDR4)
  Interface            SSTL_2       SSTL_18      SSTL_15      POD_12
New features:
- DDR2: OCD calibration, ODT
- DDR3: Dynamic ODT, ZQ calibration, write leveling
- DDR4: CA parity, DBI*, CRC*, gear down, CAL*, PDA*, FGREF*, TCAR*, bank grouping
* DBI: Data bus inversion
* CRC: Cyclic redundancy check
* CAL: Command address latency
* PDA: Per DRAM addressability
* FGREF: Fine granularity refresh
* TCAR: Temperature controlled array refresh

Slide 18: GDDR Comparison
                       GDDR1      gDDR2     GDDR3       GDDR4       GDDR5
  VDD [V]              2.5        1.8       1.5         1.5         1.5/1.35
  Data rate [bps/pin]  300~900M   800M~1G   700M~2.6G   2.0G~3.0G   3.6G~7.0G
  Pre-fetch            2 bit      4 bit     4 bit       8 bit       8 bit
  Interface            SSTL_2     SSTL_2    POD-18      POD-15      POD-15
Strobe (oldest to newest): single DQS*; differential bi-directional DQS, DQSB; single uni-directional WDQS, RDQS
New features (oldest to newest): OCD* calibration, ODT*; ZQ; DBI, parity (optional); no DLL, PLL (optional), WCK/WCKB, CRC, ABI*, RDQS (optional), bank grouping
* DQS: DQ strobe signal (DQ is the data I/O pin)
* ODT: On die termination
* OCD: Off chip driver
* ABI: Address bus inversion

Slide 19: LPDDR Comparison
                       LPDDR1      LPDDR2         LPDDR3
  VDD [V]              1.8         1.2            1.2
  Data rate [bps/pin]  200M~400M   200M~1066M     333M~1600M
  Pre-fetch            2 bit       4 bit          8 bit
  Strobe               DQS         DQS_T, DQS_C   DQS_T, DQS_C
  Interface            SSTL_18*    HSUL_12*       HSUL_12*
  DLL                  X           X              X
  New feature                      CA pin         ODT (high tapped termination)
* SSTL: Stub series terminated logic
* HSUL: High speed un-terminated logic

Slide 20: Trend
Although all types of DRAM are reaching their limits in supply voltage, the demand for high-bandwidth memory keeps increasing.
[Figure: VDD [V] (2.5, 1.8, 1.5, 1.2) versus data rate [Gbps] (0.2 to 7.0) for DDR1 to DDR4, GDDR1, gDDR2, GDDR3 to GDDR5, and LPDDR1 to LPDDR3]

Slide 21: Memory Interface
[Figure: a CPU system and a GPU system, each connected to many DRAM devices]
Features:
- Single-ended, high speed
- Many channels (weak against coupling effects)
- DDR: multi-drop (multi rank, multi DIMM)
- GDDR: point to point
- Poor transistor performance
- Impedance discontinuities (stubs, connector, via, etc.)
Issues:
- Reflection
- Inter-symbol interference
- Simultaneous switching output noise
- Pin to pin skew

Slide 22: Outline
- Introduction
- Clock Generation and Distribution
  - Delay-locked loop (DLL)
  - Duty cycle corrector (DCC)
  - Clock distribution
- Transceiver Design
- TSV
- Conclusions
- References

Slide 23: Basic DLL Architecture
[Figure: the external clock passes the clock input path (tD1) to become I_CLK, then the variable delay line (tDVDL) to become O_CLK, which clocks the data from the memory core out through the data output path (tD2). The replica delay (tDREP) produces FB_CLK, and the phase detector (PD) and controller adjust the variable delay line.]
  tCK ∙ N = tDVDL + tDREP
  tDREP ≈ tD1 + tD2
  tCK ∙ N = tDVDL + tD1 + tD2 + γ
  γ = tDREP - (tD1 + tD2)

Slide 24: Replica Delay Mismatch
[Figure: γ variation [ps] versus supply voltage [V], and the valid data window within tCK at HVDD, VDD, and LVDD for γ ≈ 0, γ < 0 (long tDQSCK*), and γ > 0 (short tDQSCK)]
* tDQSCK (or tAC): DQS output access time for CK/CKb

Slide 25: Locking Range Considerations
[Figure: I_CLK and FB_CLK waveforms showing tDREQUIRED against tDINIT + tDREP for the long and short cases, the "bird's beak", and tDQSCK (or tAC)]
  tDINIT = tDVDL(0) + tDREP
  N×tCK > tDVDL(0) + tDREP
  tCK = tDVDL + tDREP + t∆
[7] H.-W. Lee et al., submitted to TVLSI

Slide 26: Synchronous Mirror Delay (SMD)
[Figure: I_CLK passes the replica delay (tD1 + tD2) and a measure delay line (tD3); a replicate delay line reproduces tD3 to generate OUT; tD1 and tD2 are the clock input and output path delays]
Basic operation:
- Measure and replicate the delay
- No feedback
- Match delay in two cycles
[8] T.
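
The DLL lock equations on the Basic DLL Architecture and Locking Range slides can be checked numerically. The sketch below is only an illustration under stated assumptions: the function name `dll_lock`, the integer-picosecond units, and all delay values in the example are hypothetical and do not come from the slides. It picks the smallest N that satisfies N×tCK > tDVDL(0) + tDREP, solves tCK ∙ N = tDVDL + tDREP for tDVDL, and reports γ = tDREP - (tD1 + tD2) together with the resulting data-to-clock offset.

```python
def dll_lock(t_ck, t_d1, t_d2, t_drep, t_dvdl_min, t_dvdl_max):
    """Solve the DLL lock condition N*tCK = tDVDL + tDREP.

    All arguments are integer picoseconds (hypothetical values, chosen by
    the caller). Returns (n, t_dvdl, gamma, skew).
    """
    # Smallest N with N*tCK > tDVDL(0) + tDREP, where tDVDL(0) is the
    # minimum (initial) delay of the variable delay line.
    n = (t_dvdl_min + t_drep) // t_ck + 1

    # Variable delay the loop must settle to: tDVDL = N*tCK - tDREP.
    t_dvdl = n * t_ck - t_drep
    if t_dvdl > t_dvdl_max:
        raise ValueError("required tDVDL exceeds the delay line range")

    # Replica mismatch: gamma = tDREP - (tD1 + tD2).
    gamma = t_drep - (t_d1 + t_d2)

    # Real path from external clock to data is tD1 + tDVDL + tD2.
    # Its offset from N whole clock periods equals -gamma.
    skew = (t_d1 + t_dvdl + t_d2) - n * t_ck
    return n, t_dvdl, gamma, skew


if __name__ == "__main__":
    # Hypothetical example: 1250 ps clock, replica 20 ps longer than tD1 + tD2.
    print(dll_lock(t_ck=1250, t_d1=300, t_d2=500, t_drep=820,
                   t_dvdl_min=200, t_dvdl_max=1500))
```

With a positive γ the data arrive early by γ (the short tDQSCK case on the Replica Delay Mismatch slide), and with a negative γ they arrive late. Integer picoseconds are used here only to keep the arithmetic exact.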

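
The pre-fetch arithmetic from the Introduction slides (number of GIO channels = Np×Ndq×Ngroup, and internal data rate = output data rate / Np) can be reproduced with a few lines of Python. This is a minimal sketch: the function and variable names are illustrative assumptions, and only figures stated on the slides are used as inputs. The `burst_clocks` helper relies on the fact that a double-data-rate bus moves two bits per pin per clock, so a burst of length BL occupies BL/2 clock cycles, which matches the tCCD values of 1, 2, and 4 shown for DDR1 to DDR3 and tCCD_S = 4 for DDR4.

```python
def gio_channels(n_prefetch, n_dq, n_group=1):
    """Number of GIO channels: Np x Ndq x Ngroup."""
    return n_prefetch * n_dq * n_group


def internal_rate_mbps(output_rate_mbps, n_prefetch):
    """Internal (core) data rate when Np bits are fetched in parallel."""
    return output_rate_mbps / n_prefetch


def burst_clocks(burst_length):
    """Clock cycles a burst occupies on a double-data-rate bus."""
    return burst_length // 2


# (name, Np, Ngroup, BL, output rate in Mbps as quoted on the slides, or None)
GENERATIONS = [
    ("DDR1", 2, 1, 2, 400),
    ("DDR2", 4, 1, 4, 800),
    ("DDR3", 8, 1, 8, 1600),
    ("DDR4", 8, 4, 8, None),  # the slides quote no internal rate for DDR4
]

if __name__ == "__main__":
    n_dq = 8  # x8 device, as in the slide examples
    for name, n_prefetch, n_group, bl, rate in GENERATIONS:
        line = (f"{name}: GIO channels = {gio_channels(n_prefetch, n_dq, n_group)}, "
                f"burst = {burst_clocks(bl)} clocks")
        if rate is not None:
            line += f", internal rate = {internal_rate_mbps(rate, n_prefetch):.0f} Mbps"
        print(line)
```

The point the slides make is visible in the numbers: each generation doubles the pre-fetch depth so that the core keeps running at about 200 Mbps while the pin rate doubles, at the cost of a wider GIO bus.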