IRAM Microprocessor Talk

An Introduction to Intelligent RAM (IRAM) David Patterson, Krste Asanovic, Aaron Brown, Ben Gribstad, Richard Fromm, Jason Golbus, Kimberly Keeton, Christoforos Kozyrakis, Stelianos Perissakis, Randi Thomas, Noah Treuhaft, Tom Anderson, John Wawrzynek, and Katherine Yelick [email protected] http://iram.cs.berkeley.edu/ EECS, University of California Berkeley, CA 94720-1776 1 IRAM Vision Statement Proc L Microprocessor & DRAM $ $ o f on a single chip: I/O I/O L2$ g a Bus Bus i b – on-chip memory latency c 5-10X, bandwidth 50-100X DRAM – improve energy efficiency I/O 2X-4X (no off-chip bus) I/O Proc – serial I/O 5-10X v. buses D f – smaller board area/volume Bus R A a – adjustable memory size/width M b DRAM 2 Outline Today’s Situation: Microprocessor & DRAM Potential of IRAM Applications of IRAM Grading New Instruction Set Architectures Berkeley IRAM Instruction Set Overview Berkeley IRAM Project Plans Related Work and Why Now? IRAM Challenges & Industrial Impact 3 Performance 1000 100 10 Processor-DRAM Gap(latency) 1 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 “Moore’s Law” Time 1990 1991 1992 1993 1994 1995 1996 1997 (grows 50%/year) Performance Gap: Processor-Memory 1998 1999 DRAM 2000 CPU 60%/yr. µProc 7%/yr. DRAM 4 Processor-Memory Performance Gap “Tax” Processor % Area %Transistors (≈cost) (≈power) Alpha 21164 37% 77% StrongArm SA110 61% 94% Pentium Pro 64% 88% – 2 dies per package: Proc/I$/D$ + L2$ Caches have no inherent value, only try to close performance gap 5 Today’s Situation: Microprocessor MIPS MPUs R5000 R10000 10k/5k Clock Rate 200 MHz 195 MHz 1.0x On-Chip Caches 32K/32K 32K/32K 1.0x Instructions/Cycle 1(+ FP) 4 4.0x Pipe stages 5 5-7 1.2x Model In-order Out-of-order --- Die Size (mm2) 84 298 3.5x – without cache, TLB 32 205 6.3x Development (man yr.) 60 300 5.0x SPECint_base95 5.7 8.8 1.6x 6 Today’s Situation: Microprocessor Microprocessor-DRAM performance gap – time of a full cache miss in instructions executed 1st Alpha (7000): 340 ns/5.0 ns = 68 clks x 2 or 136 2nd Alpha (8400): 266 ns/3.3 ns = 80 clks x 4 or 320 3rd Alpha (t.b.d.): 180 ns/1.7 ns =108 clks x 6 or 648 – 1/2X latency x 3X clock rate x 3X Instr/clock ⇒ ≈5X Power limits performance (battery, cooling) Shrinking number of desktop MPUs? PA-RISCPA-RISC PowerPC MIPSMIPS AlphaAlpha SPARC IA-64 7 Today’s Situation: DRAM DRAM Revenue per Quarter $20,000 $16B $15,000 $10,000 $7B (Miillions) $5,000 $0 1Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q 1Q 94 94 94 94 95 95 95 95 96 96 96 96 97 • Intel: 30%/year since 1987; 1/3 income profit 8 Today’s Situation: DRAM Commodity, second source industry ⇒ high volume, low profit, conservative – Little organization innovation (vs. processors) in 20 years: page mode, EDO, Synch DRAM DRAM industry at a crossroads: – Fewer DRAMs per computer over time » Growth bits/chip DRAM : 50%-60%/yr » Nathan Myhrvold M/S: mature software growth (33%/yr for NT) ≈ growth MB/$ of DRAM (25%-30%/yr) – Starting to question buying larger DRAMs? 9 Fewer DRAMs/System over Time (from Pete DRAM Generation MacWilliams, Intel) ‘86 ‘89 ‘92 ‘96 ‘99 ‘02 1 Mb 4 Mb 16 Mb 64 Mb 256 Mb 1 Gb 4 MB 32 8 Memory per 16 4 DRAM growth 8 MB @ 60% / year 16 MB 8 2 32 MB 4 1 Memory per 8 2 64 MB System growth 128 MB @ 25%-30% / year 4 1 Minimum Memory Size 256 MB 8 2 10 Multiple Motivations for IRAM Some apps: energy, board area, memory size Gap means performance challenge is memory DRAM companies at crossroads? – Dramatic price drop since January 1996 – Dwindling interest in future DRAM? » Too much memory per chip? Alternatives to IRAM: fix capacity but shrink DRAM die, packaging breakthrough, ... 11 Potential IRAM Latency: 5 - 10X No parallel DRAMs, memory controller, bus to turn around, SIMM module, pins… New focus: Latency oriented DRAM? – Dominant delay = RC of the word lines – keep wire length short & block sizes small? 10-30 ns for 64b-256b IRAM “RAS/CAS”? AlphaSta. 600: 180 ns=128b, 270 ns= 512b Next generation (21264): 180 ns for 512b? 12 Potential IRAM Bandwidth: 100X 1024 1Mbit modules(1Gb), each 256b wide – 20% @ 20 ns RAS/CAS = 320 GBytes/sec If cross bar switch delivers 1/3 to 2/3 of BW of 20% of modules ⇒ 100 - 200 GBytes/sec FYI: AlphaServer 8400 = 1.2 GBytes/sec – 75 MHz, 256-bit memory bus, 4 banks 13 Potential Energy Efficiency: 2X-4X Case study of StrongARM memory hierarchy vs. IRAM memory hierarchy – cell size advantages ⇒ much larger cache ⇒ fewer off-chip references ⇒ up to 2X-4X energy efficiency for memory – less energy per bit access for DRAM Memory cell area ratio/process: P6, α ‘164,SArm cache/logic : SRAM/SRAM : DRAM/DRAM 20-50 : 8-11 : 1 14 Potential Innovation in Standard DRAM Interfaces Optimizations when chip is a system vs. chip is a memory component – Lower power via on-demand memory module activation? – “Map out” bad memory modules to improve yield? – Improve yield with variable refresh rate? – Reduce test cases/testing time during manufacturing? IRAM advantages even greater if innovate inside DRAM memory interface? 15 Commercial IRAM highway is governed by memory per IRAM? Laptop 32 MB Network Computer Super PDA/Phone 8 MB Video Games Graphics 2 MB Acc. 16 Near-term IRAM Applications “Intelligent” Set-top – 2.6M Nintendo 64 (≈ $150) sold in 1st year – 4-chip Nintendo ⇒ 1-chip: 3D graphics, sound, fun! “Intelligent” Personal Digital Assistant – 0.6M PalmPilots (≈ $300) sold in 1st 6 months – Handwriting + learn new alphabet (α = K, = T, = 4) v. Speech input 17 App #1: PDA of 2003? Pilot PDA (calendar, notes, address book, calculator, memo, ...) + Gameboy + Nikon Coolpix (camera, tape recorder, notes ...) + Cell Phone,Pager, GPS – Vision to see surroundings, + Speech, vision scan documents recognition – Voice output for conversations + wireless data (WWW) – Play chess with PDA on plane? 18 Revolutionary App: Decision Support? 4 address buses Sun 10000 (Oracle 8): data crossbar switch – TPC-D (1TB) leader ProcProc Xbar ProcProc Xbar ProcProc 12.4 ProcProc – SMP 64 CPUs, Mem Mem s GB/s s 64GB dram, 603 disks bridge 1 … bridge 16 Disks,encl. $2,348k s s s s DRAM $2,328k c c 2.6 c c s s s s Boards,encl. $983k GB/s i i i i CPUs $912k bus bridge bus bridge Cables,I/O $139k s s s 6.0 s s s c c c … GB/s c c c … Misc. $65k s s s … s … s … s HW total $6,775k … i … i … i i i i … 1 23 19 IRAM Application Inspiration: Database Demand vs. Processor/DRAM speed Database demand: 2X / 9 months 100 Database-Proc. “Greg’s Law” Performance Gap: µProc speed 10 2X / 18 months “Moore’s Law” Processor-Memory Performance Gap: DRAM speed 1 2X /120 months 1996 1997 1998 1999 2000 20 App #2: “Intelligent Disk”(IDISK): Scaleable Decision Support? 1 IRAM/disk + xbar cross bar + fast serial link v. cross bar … cross bar conventional SMP 75.0 Network latency = GB/s f(SW overhead), cross bar cross bar not link distance Move function to IRAM IRAM IRAM IRAM 6.0 data v. data to CPU … … (scan, sort, join,...) … … GB/s … … Cheaper, faster, IRAM IRAM IRAM IRAM … … … more scalable ≈ ( 1/3 $, 3X perf) 21 “Vanilla” Approach to IRAM Estimate performance IRAM version of Alpha (same caches, benchmarks, standard DRAM) – Used optimistic and pessimistic factors for logic (1.3-2.0 slower), SRAM (1.1-1.3 slower), DRAM speed (5X-10X faster) for standard DRAM – SPEC92 benchmark ⇒ 1.2 to 1.8 times slower – Database ⇒ 1.1 times slower to 1.1 times faster – Sparse matrix ⇒ 1.2 to 1.8 times faster Conventional architecture/benchmarks/DRAM not exciting performance; energy,board area only 22 “Vanilla” IRAM - Performance Conclusions IRAM systems with existing architectures provide moderate performance benefits High bandwidth / low latency used to speed up memory accesses, not computation Reason: existing architectures developed under assumption of low bandwidth memory system – Need something better than “build a bigger cache” – Important to investigate alternative architectures that better utilize high bandwidth and low latency of IRAM 23 A More Revolutionary Approach: DRAM Faster logic in DRAM process – DRAM vendors offer faster transistors + same number metal layers as good logic process? @ ≈ 20% higher cost per wafer? – As die cost ≈ f(die area4), 4% die shrink ⇒ equal cost 24 A More Revolutionary Approach: New Architecture Directions “...wires are not keeping pace with scaling of other features. … In fact, for CMOS processes below 0.25 micron ... an unacceptably small percentage of the die will be reachable during a single clock cycle.” “Architectures that require long-distance, rapid interaction will not scale well ...” – “Will Physical Scalability Sabotage Performance Gains?” Matzke, IEEE Computer (9/97) 25 New Architecture Directions “…media processing will become the dominant force in computer arch. & microprocessor design.” “... new media-rich applications... involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, and 32-bit integer and Fl. Pt.” Needs include high memory BW, high network BW, continuous media data types, real-time response, fine grain parallelism – “How Multimedia Workloads Will Change Processor Design”, Diefendorff & Dubey, IEEE Computer (9/97) 26 Grading Architecture Options OOO/SS++ IA-64 microSMP Technology scaling C C+ A Fine grain parallelism A A A Coarse grain (n chips) A A B Compiler maturity B C B MIPS/transistor (cost) C B– B Programmer model D B B Energy efficiency D C A Real time performance C B– B Grade Point Average C+ B– B+ 27 Which is Faster? Statistical v.

IRAM Microprocessor Talk

Mipspro C++ Programmer's Guide

MIPS® Architecture for Programmers Volume I-B: Introduction to the Micromips32™ Architecture, Revision 5.03

Mipspro™ Compiling and Performance Tuning Guide

Pluggable Interface Relays CR-M Miniature Relays

64-Bit Superscaler Microprocessor ACT5270

Computer Architectures an Overview

Electronic Products and Relays Selection Table Interface Relays CR-Range and R600 / R500 Range Pluggable Interface Relays

Sony's Emotionally Charged Chip

Pluggable Interface Relays CR-M Miniature Relays Data Sheet

MIPS R5000 Microprocessor Technical Backgrounder

Selection Guides and Order References for Pluggable Interface Relays CR-P

Ilore: Discovering a Lineage of Microprocessors