Wide I/O DRAM Architecture Utilizing Proximity Communication

Wide I/O DRAM Architecture Utilizing Proximity Communication by Qawi Harvard Thesis Defense – October 8th, 2009 Introduction Bandwidth and power consumption of dynamic random access memory stifles computer performance scaling Background Status of Proximity Communication DRAM Market Analysis 4 Gb DRAM Architecture Wide I/O DRAM Architecture Utilizing Proximity Communication Qawi Harvard – Oct. 8th,2009 Thesis Defense 2 Background Memory Gap Main memory does not scale with processor performance Power Current consumption is rising Bandwidth increases power Voltage scaling masks the issue Density Memory channel loading Limits bandwidth Proximity Communication Proposed by Ivan Sutherland – US Patent #6,500,696 Promises to reduce power and increase bandwidth Qawi Harvard – Oct. 8th,2009 Thesis Defense 3 Proximity Communication Chip 1 Chip 2 Transmit Receive Chip 1 Chip 2 Receive Transmit Capacitive Coupled Proximity Communication Top metal forms the parallel plates Chip-to-chip communication through coupling capacitor Ref:[1] Qawi Harvard – Oct. 8th,2009 Thesis Defense 4 Proximity Communication Benefits Increased I/O density Avoids on/off chip wires Eases chip replacement at the system level Enhances system level testability Enables smaller chip sizes Removes the need for ESD protection Challenges Mechanical misalignment Applying power to the chips Thermal solution Ref:[1-5] Qawi Harvard – Oct. 8th,2009 Thesis Defense 5 Proximity Communication Parallel Plate Capacitance A 0 aF C 8.9 m d 0 2 10 pF/mm 4000 2 1000 Chip-to-chip separation Proximity Communication d = 1 µm Area Ball Bonding 100 One channel I/O Density per per mm Density I/O 50 fF 10 2 2003 2004 2005 2006 2007 2008 2009 2010 200 signals/mm [1] Ref:[1] Qawi Harvard – Oct. 8th,2009 Thesis Defense 6 Proximity Communication Mechanical Misalignment Six axis Multiple sources z θx θy Separation Tilt θz y x Translation Rotation Ref:[5] Qawi Harvard – Oct. 8th,2009 Thesis Defense 7 Proximity Communication Electronic Sensors Chip-to-chip separation sensors (0.2 µm resolution) Vernier scale incorporated on chip (1.0 µm resolution) 1.0 Electrical Re-Alignment 0.9 0.8 Receive array 0.7 0.6 0.5 Micro-transmit array 0.4 0.3 Electronic steering circuit 0.2 0.1 Normalized Received SignalReceived Normalized 0 0 20 40 60 80 100 120 Misalignment (µm) Transmit Chip Inactive Transmit Pad 1 0 1 0 1 0 1 0 1 0 1 Active Transmit Pad Receiver Pad Ref:[1-5] 1 0 1 0 1 ?? 0 1 0 1 Receive Chip Qawi Harvard – Oct. 8th,2009 Thesis Defense 8 DRAM Market Analysis Revisit the Memory Gap “Performance” becomes a relative term Dichotomy in scaling Why Density? Why Not Latency? 100000 100000 10000 10000 Processor IPS DRAM Density 1000 1000 100 100 Processor IPS 10 DRAM Latency 10 Relative Performance (%) Performance Relative Relative Performance (%) Performance Relative 1 1 1980 1985 1990 1995 2000 2005 2010 1980 1985 1990 1995 2000 2005 2010 Ref:[6] Qawi Harvard – Oct. 8th,2009 Thesis Defense 9 DRAM Market Analysis Moore’s Law 41% increase in transistor count per year Selling Price 36% historic decline per year Putting it into Perspective 1 0 0 0 . 0 0 0 0 100,000.00 1 0 0 . 0 0 0 0 1 9 71975 9 Historically the price per bit has 1 Gb 2009 → $2.00 1977 10,000.00 1974 1980 1979 declined by 9% every quarter 1976 1 0 . 0 0 0 0 1981 1981 1984 (1974 – 2008). 1978 1983 2 Gb 2011 → $1.64 1,000.00 1 9 8 2 19801 9 8 3 1988 1985 1987 19821985 1989 1 . 0 0 0 0 1984 1989 100.00 19901993 1 9 8 6 1 9 8 7 1 9 9 5 H i s t o r i c a l 1991 1993 Density or Bust!! 1986 19881991 1994 1995 price•per•bit decline has 1 9 9 6 0 . 1 0 0 0 1992 averaged 35.5% 1990 10.00 1992 19941 9 9 7 (1978 - 2002) 1997 2 0 0 0 Price per Bit (Millicents) 1998 0 . 0 1 0 0 1996 1999 1.00 1 9 9 9 2 0 0 3 Price per Bit (Milicents) 1998 2 0 0 1 20012 0 0 4 F 2000 22003 0 0 5 F 2002 2 02005 0 6 F 0 .0.10 0 0 1 0 20022 0 0 7 F 2 0 0 8 F 2004 0 .0.01 0 0 0 1 1 100 1 0 , 0 0 0 12 Cumulative Bit Volume (10 ) 1 , 0 0 0 , 0 0 0 Ref:[7-8] 100,000,000 Qawi Harvard – Oct. 8th,2009 Thesis Defense 10 DRAM Market Analysis Cost Low cost manufacturing process Generations o 3 metal layers Features • Increased usage of each level PM Simple RAS/CAS o Small chip size FPM Fast CAS Access • Limits I/O count Latched Output Moore’s Law EDO Programmable Burst & Synchronous w/Clock SDR Latency Multi-Bank 41% scaling per year LVTTL Interface o Wordline cross sectional area DDR Data Clocked on Both Clock Data Strobe Edges SSTL 2.5 Interface o Tight metal pitch DDR2 ODT Posted CAS o Contact resistance OCD SSTL 1.8 Interface DDR3 Standard Low Voltage Option Drive/ODT Calibration Physics of Scaling Dynamic ODT Write Leveling DDR4 Faster Latency must increase Lower Power Ref:[7-8,13] Qawi Harvard – Oct. 8th,2009 Thesis Defense 11 DRAM Market Analysis 700 ) Home PC 600 mA Plugged into a wall 500 400 ♦ DDR ● DDR4 ▲ DDR3 Mobile 300 Projections ■ DDR2 Battery life 200 100 Current Consumption ( Consumption Current Server 0 266 333 400 533 666 800 1066 1333 1600 2133 2666 3200 Power consumption Data Rate (MHz) Cooling Memory Mezzanine Tray (Sun SPARC Enterprise T5240 Server Only) Trending Up Poor Efficiency UltraSPARC UltraSPARC Bandwidth Driven Ref:[14-17] Qawi Harvard – Oct. 8th,2009 Thesis Defense 12 DRAM Market Analysis 4000 Interface versus Core 3500 SDR DDR DDR2 DDR3 DDR4 (1n) (2n) (4n) (8n) (16n) 3000 Interface bares the burden 2666 2666 Core cycles 2500 ■ Interface 2133 2000 ● Core 1600 1600 DRAM Pre-fetch 1500 1333 Frequency (MHz) Frequency 800 1000 1066 Doubles at each generation 667 667 400 533 333 500 267 100 200 200 Density limited by 67 67 100 133 167 167 200 133 167 133 167 167 bandwidth 0 133 167200 SSTL loading in memory channel Increase chip count per module Channel DIMMS or Devices per per Devices or DIMMS 400 800 1066 1333 1600 Ref:[14-17] Data Rate (MHz) Qawi Harvard – Oct. 8th,2009 Thesis Defense 13 4 Gb DRAM Architecture Possible Architecture Compared to ITRS 2012 Production Release 74 mm2 56 % Array Efficiency COLUMN 40 nm SPINE ROW Ref:[10,12,20-24] Qawi Harvard – Oct. 8th,2009 Thesis Defense 14 4 Gb DRAM Architecture 6F2 Memory Cell 6 F 2 Feature Size = 40 nm F Gate Isolation 2 Cell Area = 0.0096 µm UNITCELL 3F Pitch per Wordline Bitline Memory Wordline 2F Pitch per Bitline Capacitor 256 kb Array Macro Core Array 512 bitlines ≈ 43.6 µm 512 wordlines ≈ 65.4 µm UNITCELL UNITCELL Periphery Circuitry UNITCELL 4 µm space allocated Ref:[24] Qawi Harvard – Oct. 8th,2009 Thesis Defense 15 4 Gb DRAM Architecture Ref:[25] Qawi Harvard – Oct. 8th,2009 Thesis Defense 16 4 Gb DRAM Architecture 3.5 mm 256 Mb Array 1.9 mm 32 x 32 256 kb macros 1.6 mm 32 43 .6 m 4 m 1.532 mm 256M Array 256M Array 2.3 mm 2.3 ROW 1 Gb Array mm 2.5 Multiple implementations COLUMN COLUMN 4.9 mm 4.9 256M Array 256M Array ROW GLOBAL CORNER COLUMN Qawi Harvard – Oct. 8th,2009 Thesis Defense 17 4 Gb DRAM Architecture Chip Size = 71.4 mm2 ITRS Array Efficiency = 57.7% 74 mm2 7.0 mm 56% Array Efficiency 256M 256M 256M 256M Row Row Wide I/O Architecture Array Array Array Array Moving the pads Column Column Column Column Centralized Row 256M 256M 256M 256M Row Row Array Array Array Array Centralized Column SPINE 0.4 mm 10.2 mm 10.2 256M 256M 256M 256M Row Row Array Array Array Array Column Column Column Column 256M 256M 256M 256M Row Row Array Array Array Array Ref:[29] Qawi Harvard – Oct. 8th,2009 Thesis Defense 18 Wide I/O Chip Architecture 64 Bytes per Chip Chip Size = 67.0 mm2 6.2% Chip Size Reduction Array Efficiency = 61.5% 6.7 mm 6.6% Increase in Array RO 256M Array W 256M Array Efficiency 256M 256M 256M 256M Array Array RORow Array Array Challenges 256M Array W 256M Array RO W 256M Array mm 4.6 Routing from the edge 256M 256M 256M 256M Array Array RORow Array Array Array I/O route increase W 256M Array o 2.3 mm → 4.6 mm RO 256M Array W 256M Array 10.0 mm 10.0 256M 256M 256M 256M Additional row decode Array Array RORow Array Array 256M Array W 256M Array Create Eight Internal Banks 256M Array 256M 256M 256M 256M Array Array RORow Array Array W 256M Array Proximity Interface Qawi Harvard – Oct. 8th,2009 Thesis Defense 19 4 Gb DRAM Architecture 1 mm Allocation for Proximity Chip Size = 69.68 mm2 Array Efficiency = 59.2% Channel 6.7 mm Buffers at the Center 3.2 mm Increase global I/O metal usage 512M Bank 512M Bank ROW Array I/O Routing Reduced to mm 2.3 2.3 mm 512M Bank 512M Bank Architecture NOT Efficient for ROW Proximity Communication 10.4 mm 10.4 512M Bank 512M Bank ROW 6.7 mm versus 10.4 mm mm 7.0 Buffers required 512M Bank 512M Bank Large metal usage ROW Proximity Interface Qawi Harvard – Oct.

Wide I/O DRAM Architecture Utilizing Proximity Communication

Computational Science and Engineering İSTANBUL TECHNICAL

Wide I O DRAM Architecture Utilizing Proximity Communication

Multi-Threading in Electrictm, a Java VLSI CAD Tool

Session the Use of Robots and Gamification in Education

Tightly-Coupled and Fault-Tolerant Communication in Parallel Systems

Multi-Threading in VLSI CAD

How Collaboration Aids Innovation

Final Year Project Thesis Mo S H a Di