DRAM TUTORIAL

ISCA 2002

Bruce Jacob David Wang

University of Maryland

DRAM: Architectures, Interfaces, and Systems
A Tutorial

Bruce Jacob and David Wang
Electrical & Computer Engineering Dept.
University of Maryland at College Park
http://www.ece.umd.edu/~blj/DRAM/

DRAM: why bother? (I mean, besides the "memory wall" thing ... is it just a performance issue?) Think about embedded systems: cellphones, printers, switches ... nearly every embedded product that used to be expensive is now cheap. Why? For one thing, rapid turnover from high performance to obsolescence guarantees a generous supply of cheap, high-performance embedded processors to suit nearly any design need. What does the "memory wall" mean in this context? Perhaps it will take longer for a high-performance design to become obsolete.


Outline

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics

• DRAM Evolution: Interface Path

• Future Interface Trends & Research Areas

• Performance Modeling: Architectures, Systems, Embedded

Break at 10 a.m. — Stop us or starve

Basics: DRAM ORGANIZATION

[Figure: DRAM array with row decoder, word lines, bit lines, storage element (capacitor), switching element (transistor), sense amps, column decoder, and data in/out buffers]

First off, what is DRAM? An array of storage elements, each a capacitor-transistor pair. "DRAM" is an acronym for Dynamic Random Access Memory. Why "dynamic"? Capacitors are not perfect and need periodic recharging. These are very dense parts with very small cells, so the capacitors hold very little charge. The bit lines are therefore precharged to half the voltage level, and the sense amps detect the minute change on the lines when a cell is read, then recover the full signal.

Basics: TRANSMISSION

[Figure: CPU connected over a bus to a memory controller, which connects to the DRAM (row decoder, memory array, sense amps, column decoder, data in/out buffers)]

So how do you interact with this DRAM thing? Let's look at a traditional organization first: the CPU connects over a bus to a memory controller, which connects to the DRAM itself. Let's walk through a read operation.

Basics: [PRECHARGE and] ROW ACCESS

[Figure: same CPU / memory controller / DRAM diagram; the row decoder drives a word line and the sense amps recover the data]

At this point, all the bit lines are at the half-voltage level. The row access discharges the capacitors onto the bit lines, pulling each line just a little bit high or a little bit low; the sense amps detect the change and recover the full signal. The read is destructive: the capacitors have been discharged. However, when the sense amps pull the lines to the full logic level (either high or low), the access transistors are kept open and so allow their attached capacitors to become recharged (if they hold a '1' value).

AKA: OPEN a DRAM page/row, or ACT (Activate a DRAM page/row), or RAS (Row Address Strobe).

Basics: COLUMN ACCESS

[Figure: same diagram; the column decoder selects a subset of the sense-amp outputs and sends it to the data in/out buffers]

Once the data is valid on ALL of the bit lines, you can select a subset of the bits and send them to the output buffers; the CAS picks which bits. Big point: you cannot do another RAS or precharge of the lines until you are finished reading the column data. You can't change the values on the bit lines or the output of the sense amps until the data has been read by the memory controller.

READ command, or CAS: Column Address Strobe.

Basics: DATA TRANSFER

[Figure: same diagram; data flows from the data in/out buffers out onto the bus]

Then the data is valid on the data bus. Depending on what you are using for in/out buffers, you might be able to overlap a little or a lot of the data transfer with the next CAS to the same page (this is PAGE MODE).

... with optional additional CAS: Column Address Strobe

note: page mode enables overlap with CAS

Basics: BUS TRANSMISSION

[Figure: same diagram; the data finally crosses the bus from the memory controller back to the CPU]

Basics: DRAM LATENCY

[Figure: CPU, memory controller, and DRAM, with latency components A through F marked along the request/response path]

DRAM "latency" isn't deterministic, because an access may need only a CAS or a full RAS + CAS, and there may be significant queuing delays within the CPU and the memory controller. Each transaction has some overhead, and some types of overhead cannot be pipelined. This means that, in general, longer bursts are more efficient.

A: Transaction request may be delayed in the CPU's queue
B: Transaction request sent to the memory controller
C: Transaction converted to command sequences (may be queued)
D: Command(s) sent to the DRAM
E1: Requires only a CAS, or E2: requires RAS + CAS, or E3: requires PRE + RAS + CAS
F: Data sent back to the CPU

"DRAM latency" = A + B + C + D + E + F
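The decomposition above is just a sum of per-stage delays. A minimal sketch in Python, assuming purely illustrative placeholder numbers for each stage (not measurements from any real part):

```python
# Minimal sketch of the slide's latency decomposition; per-stage values below
# are illustrative placeholders only.
STAGES_NS = {
    "A_queue_in_cpu":        10,  # transaction waits in the CPU's queue
    "B_send_to_controller":   5,  # request crosses to the memory controller
    "C_convert_to_commands": 10,  # controller builds the DRAM command sequence
    "D_send_commands":        5,  # commands driven onto the DRAM bus
    "E_dram_access":         45,  # CAS, RAS+CAS, or PRE+RAS+CAS depending on page state
    "F_return_to_cpu":       10,  # data travels back to the CPU
}

def dram_latency_ns(stages=STAGES_NS):
    """'DRAM latency' as seen by the CPU = A + B + C + D + E + F."""
    return sum(stages.values())

if __name__ == "__main__":
    print(f"total latency: {dram_latency_ns()} ns")
```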

Basics: PHYSICAL ORGANIZATION

[Figure: three DRAM datapaths side by side (x2, x4, and x8 parts), each with its own row decoder, memory array, sense amps, column decoder, and data buffers; the xN width is the number of data pins per chip]

This is per bank … Typical DRAMs have 2+ banks
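The per-chip width determines how many chips gang together to fill a wider data bus. A small arithmetic sketch (the 64-bit bus width is just an example, not taken from the slide):

```python
# How many xN DRAM chips are needed to fill a data bus of a given width.
def chips_per_rank(bus_width_bits: int, chip_width_bits: int) -> int:
    assert bus_width_bits % chip_width_bits == 0
    return bus_width_bits // chip_width_bits

for width in (2, 4, 8, 16):
    print(f"x{width} parts on a 64-bit bus: {chips_per_rank(64, width)} chips")
```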

Basics: Read Timing for Conventional DRAM

[Timing diagram: RAS and CAS strobes, the multiplexed address bus carrying a row address then a column address, and the DQ pins returning valid data; phases labeled Row Access, Column Access, Data Transfer]

Let's look at the interface another way, the way the data sheets portray it. Main point: the RAS# and CAS# signals directly control the latches that hold the row and column addresses.

DRAM Evolutionary Tree

[Figure: an evolutionary tree rooted at conventional DRAM. One branch, "(mostly) structural modifications targeting throughput," runs through FPM, EDO, P/BEDO, and SDRAM. Structural modifications targeting latency lead to ESDRAM, VCDRAM, FCRAM, and MoSys. Interface modifications targeting throughput lead to Rambus and DDR/2, and on to future trends.]

Since DRAM's inception there has been a stream of changes to the design, from FPM to EDO to Burst EDO to SDRAM. The changes are largely structural modifications, relatively minor, that target THROUGHPUT.

Everything up to and including SDRAM has been relatively inexpensive, especially considering the pay-off: FPM was essentially free, EDO cost a latch, PBEDO cost a counter, SDRAM cost a slight re-design. However, we've run out of "free" ideas, and now all changes are considered expensive; thus there is no consensus on new directions, and a myriad of choices has appeared.

[Discuss FPM up to SDRAM, then the LATENCY modifications starting with ESDRAM, and then the INTERFACE modifications.]

DRAM Evolution: Read Timing for Conventional DRAM

[Timing diagram: RAS, CAS, address, and DQ, with phases labeled Row Access, Column Access, Transfer Overlap, and Data Transfer; each access sends a row address and a column address and returns one valid data word]

DRAM Evolution: Read Timing for Fast Page Mode

[Timing diagram: one RAS assertion followed by multiple CAS assertions; the address bus carries one row address and then several column addresses, and DQ returns a valid data word per CAS]

FPM allows you to keep the sense amps active for multiple CAS commands, which gives much better throughput. Problem: you cannot latch a new value in the column address buffer until the read-out of the data is complete.

DRAM Evolution: Read Timing for Extended Data Out

[Timing diagram: as for FPM, but the data remains valid on DQ while the next column address is being presented, so column access overlaps data transfer more aggressively]

Solution to that problem: instead of simple tri-state buffers, use a latch as well. By putting a latch after the column mux, the next column address command can begin sooner.

DRAM Evolution: Read Timing for Burst EDO

[Timing diagram: one row address and one column address are sent; subsequent CAS toggles burst out consecutive data words without new column addresses]

By driving the column-address latch from an internal counter rather than an external signal, the minimum cycle time for driving the output bus was reduced by roughly 30%.

DRAM Evolution: Read Timing for Pipelined Burst EDO

[Timing diagram: as for Burst EDO; the first CAS# toggle latches the column address, and each following toggle drives another data word onto the bus]

"Pipelined" refers to the setting up of the read pipeline: the first CAS# toggle latches the column address, and all following CAS# toggles drive data out onto the bus. Therefore data stops coming when the memory controller stops toggling CAS#.

DRAM Evolution: Read Timing for Synchronous DRAM

[Timing diagram: a free-running clock; the command bus carries ACT then READ, the address bus carries the row and column addresses, and DQ returns a burst of valid data]

Main benefit: it frees the CPU or memory controller from having to control the DRAM's internal latches directly. The controller/CPU can go off and do other things during the idle cycles instead of waiting. Even though the time-to-first-word latency actually gets worse, the scheme increases system throughput.

(RAS + CAS + OE ... become a command bus.)

DRAM Evolution: Inter-Row Read Timing for ESDRAM

[Timing diagrams: "Regular CAS-2 SDRAM, read/read to same bank" vs. "ESDRAM, read/read to same bank." In both, the command bus carries ACT, READ, PRE, ACT, READ with the corresponding row/column/bank addresses; in the ESDRAM case the PRE and second ACT are issued earlier, so the second data burst starts sooner]

The output latch in EDO allowed you to start the next CAS sooner (to the same row). ESDRAM latches the whole row, which allows you to start the precharge and the RAS for the next page access sooner, hiding the precharge overhead.

DRAM Evolution: Write-Around in ESDRAM

[Timing diagrams: "Regular CAS-2 SDRAM, read/write/read to same bank, rows 0/1/0" vs. "ESDRAM, read/write/read to same bank, rows 0/1/0." The SDRAM case needs ACT, READ, PRE, ACT, WRITE, PRE, ACT, READ; the ESDRAM case needs only ACT, READ, PRE, ACT, WRITE, READ, because the write goes around the latched row]

Neat feature of this type of buffering: write-around. (Can the second READ really be this aggressive?)

DRAM Evolution: Internal Structure of Virtual Channel DRAM

[Figure: DRAM banks A and B whose sense amps feed 16 channel segments of 2 Kbit each, which in turn feed the input/output buffer and the DQs; the operations are Activate, Prefetch, Read, Restore, and Write]

Main thing: it is like having a bunch of open row buffers (a la Rambus), but the catch is that you must deal with the 2 Kbit segment directly (move data into and out of it), not the DRAM banks. This adds an extra couple of cycles of latency. However, you get good performance if the data you want is in the segment cache, and you can "prefetch" 2 Kbit into the cache ahead of when you want it. It was originally targeted at reducing latency; now that SDRAM is CAS-2 and RCD-2, it makes sense only in a throughput way.

Segment cache is software-managed, reduces energy

DRAM Evolution: Internal Structure of Fast Cycle RAM

[Figure: an SDRAM 8M array (8K rows x 1K bits) with a 13-bit row decoder and tRCD = 15 ns (two clocks), next to an FCRAM 8M array whose row decoder takes 15 bits (?) and achieves tRCD = 5 ns (one clock)]

FCRAM opts to break up the data array and only activate a portion of the word line. 8K rows requires 13 bits to select; FCRAM uses 15 (assuming the array is 8K x 1K ... the data sheet does not specify).

Reduces access time and energy/access
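The row-address arithmetic is simple enough to show directly; a small sketch, assuming (as the slide does) that the two extra FCRAM address bits select a sub-array so only part of the word line is activated:

```python
import math

# Row-address bits needed to select one of n_rows rows.
def row_addr_bits(n_rows: int) -> int:
    return math.ceil(math.log2(n_rows))

print(row_addr_bits(8 * 1024))      # 13 bits for a conventional 8K-row array
# FCRAM reportedly takes 15 row-address bits; under the slide's assumption the
# 2 extra bits pick one of four sub-arrays, activating only part of the word line.
print(row_addr_bits(8 * 1024) + 2)  # 15
```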

DRAM Evolution: Internal Structure of MoSys 1T-SRAM

[Figure: address and bank-select logic in front of 72 small banks, an auto-refresh engine that walks through the banks, a bank-sized cache, and the DQs]

MoSys takes this one step further: DRAM with an SRAM interface and SRAM speed, but DRAM energy. Physical partitioning: 72 banks. Auto refresh: how do you do this transparently? The refresh logic moves through the arrays, refreshing them when they are not active. But what if one bank gets repeated accesses for a long duration? All other banks will be refreshed, but that one will not. Solution: they have a bank-sized CACHE of lines, so in theory you should never have a problem (magic).

DRAM Evolution: Comparison of Low-Latency DRAM Cores

Here's an idea of how the designs compare. Bus speed is the CAS-to-CAS rate; RAS-CAS (tRCD) is the time to read data from the capacitors into the sense amps; RAS-DQ (tRAC) is the time from RAS to valid data.

DRAM Type      Data Bus Speed   Bus Width (per chip)   Peak BW (per chip)   RAS-CAS (tRCD)   RAS-DQ (tRAC)
PC133 SDRAM    133              16                     266 MB/s             15 ns            30 ns
VCDRAM         133              16                     266 MB/s             30 ns            45 ns
FCRAM          200 * 2          16                     800 MB/s             5 ns             22 ns
1T-SRAM        200              32                     800 MB/s             n/a              10 ns
DDR 266        133 * 2          16                     532 MB/s             20 ns            45 ns
DRDRAM         400 * 2          16                     1.6 GB/s             22.5 ns          60 ns
RLDRAM         300 * 2          32                     2.4 GB/s             ???              25 ns

Outline

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
  • Memory System Details (Lots)
• DRAM Evolution: Interface Path

• Future Interface Trends & Research Areas

• Performance Modeling: Architectures, Systems, Embedded

What Does This All Mean?

[Figure: a scatter of the technologies just discussed: FPM, EDO, BEDO, SDRAM, ESDRAM, SLDRAM, FCRAM, RLDRAM, D-RDRAM, DDR, DDR II, xDDR II, netDRAM]

Some technologies have legs, some do not, and some have gone belly up. We'll start by examining the fundamental technologies (I/O, packaging, etc.), then explore some of these technologies in more depth a bit later.

Cost-Benefit Criterion

What is a "good" system?

[Figure: "DRAM System Design" at the center of a ring of cost/benefit factors: package cost, interconnect cost, bandwidth, latency, power consumption, test and implementation, and DRAM logic overhead]

It's all about the cost of a system. This is a multi-dimensional tradeoff problem, and it is especially tough because the relative cost factors of pins and die area, and the demands of bandwidth and latency, keep changing. Good decisions for one generation may not be good for future generations. This is why we don't keep a DRAM protocol for long: FPM lasted a while, but we've quickly progressed through EDO, SDRAM, DDR/RDRAM, and now DDR II and whatever else is on the horizon.

Memory System Design

[Figure: "DRAM Memory System" at the center, surrounded by its interacting pieces: clock network, I/O technology, chip packaging, topology, DRAM chip architecture, pin count, access protocol, address mapping, and row buffer management]

Now we'll really get our hands dirty and try to become DRAM designers. That is, we want to understand the tradeoffs and design our own memory system with DRAM cells. By doing this we can gain some insight into the basis of claims made by proponents of various DRAM memory systems. A memory system has many parts; it is a set of technologies and design decisions. All of the parts are inter-related, but for the sake of discussion we'll split the components into the ovals seen here and try to examine each part of a DRAM system separately.

DRAM Interfaces: The Digital Fantasy

[Timing diagram: an Activate (a0) on the row bus, two reads (r0, r1) on the column bus, and two pipelined data bursts on the data bus, with RAS latency, CAS latency, and the pipelined access marked. Caption: "Pretend that the world looks like this."]

Professor Jacob has shown you some nice timing diagrams; I too will show you some nice timing diagrams, but they are a simplification that hides the details of implementation. Why don't they just run the system at XXX MHz like the other guy? Then the latency would be much better and the bandwidth would be extreme. Perhaps they can't, and we'll explain why. To understand why some systems can operate at XXX MHz while others cannot, we must dig past the nice timing diagrams and the architectural diagrams and see what turns up underneath. So underneath the timing diagram, we find this... But...

The Real World: Read from FCRAM at 400 MHz DDR (non-termination case)

[Oscilloscope captures: on the FCRAM side, VDDQ (pad), DQS (pin), DQ0-15 (pin), and VSSQ (pad); on the controller side, DQ0-15 (pin) and DQS (pin); measured skew of 158 ps and 102 ps across the parallel data channels]

We don't get nice square or even nicely shaped waveforms: jitter, skew, etc. VDDQ and VSSQ are the voltage supplies to the I/O pads on the DRAM chips. The signal bounces around and is very non-ideal. So what are the problems, and what solutions are used to solve them? (Note the 158 ps skew on the parallel data channels. If your cycle time is 10 ns, or 10,000 ps, a skew of 158 ps is no big deal, but if your cycle time is 1 ns, or 1,000 ps, then a skew of 158 ps is a big deal.) Already we see hints of problems as we try to push systems to higher and higher clock frequencies.

*Toshiba Presentation, Denali MemCon 2002


Signal Propagation

[Figure: a signal traveling from point A to point B over an ideal transmission line at roughly 0.66c = 20 cm/ns; a real PC board with module connectors and varying electrical loads is a rather non-ideal transmission line]

First, we have to introduce the idea that signal propagation takes finite time. We are limited by the speed of light; on an ideal transmission line we get approximately 2/3 the speed of light, or 20 cm/ns. All signals, including system-wide clock signals, have to travel across the system board, so if you send a clock signal from point A to point B on an ideal line, point B cannot tell that the clock has changed until, at the earliest, (1/20 ns/cm) x distance later. Then again, PC boards are not exactly ideal transmission lines (ringing, drive strength, etc.). The concept of "synchronous" breaks down when different parts of the system observe different clocks. Kind of like relativity.
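The 20 cm/ns figure turns directly into a flight-time budget; a minimal sketch, with the 10 cm trace length chosen purely as an example:

```python
# Propagation delay over an ideal PCB trace at ~2/3 the speed of light (20 cm/ns).
PROPAGATION_CM_PER_NS = 20.0

def flight_time_ns(trace_length_cm: float) -> float:
    return trace_length_cm / PROPAGATION_CM_PER_NS

# A 10 cm run from the controller to the far DIMM costs ~0.5 ns each way,
# already half a cycle at 1 GHz, before any non-ideal loading is considered.
print(f"{flight_time_ns(10.0):.2f} ns")
```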

Clocking Issues

[Figure 1: a clock source driving modules 0 through N along a single trace, so each module sees the clock at a slightly different ("sliding") time. Figure 2: an H-tree clock distribution that tries to equalize the path length to every module]

When we build a "synchronous" system on a PCB, how do we distribute the clock signal? Do we accept a sliding time domain? Is an H-tree doable to N modules in parallel? Do we add skew compensation? What kind of clocking system?

Clocking Issues (continued)

[Figure 1: write data flowing from the controller to chips 0..N alongside a clock traveling in the same direction. Figure 2: read data flowing from the chips back to the controller alongside a clock traveling in that direction]

We would like the chips to be on a "global clock," with everyone perfectly synchronous, but since clock signals are delivered through wires, different chips in the system see the rising edge of a clock a little earlier or later than other chips. While an H-tree may work for a low-frequency system, we really need one clock for sending (writing) signals from the controller to the chips, and another for sending signals from the chips back to the controller (reading). We need different "clocks" for reads and writes.

Path Length Differential

[Figure: a controller driving two bus signals through intermodule connectors to loads A and B; path #2 is routed longer than path #1, and path #3 longer still]

We purposely routed path #2 to be a bit longer than path #1 to illustrate the path-length differential. As illustrated, signals reach load B later than load A simply because B is farther from the controller. It is also difficult to do path-length and impedance matching on a system board; sometimes heroic efforts must be made to get a nice "parallel" bus.

High Frequency AND Wide Parallel Busses are Difficult to Implement

Subdividing Wide Busses

[Figure: a wide parallel bus from point A to point B routed around an obstruction as three smaller signal groups (1, 2, 3), each accompanied by its own source-synchronous local clock]

It's hard to bring a wide parallel bus from point A to point B, but it's easier to bring smaller groups of signals from A to B. To ensure proper timing, we also send along a source-synchronous clock signal that is path-length matched with the signal group it covers. In this figure, signal groups 1, 2, and 3 may have some timing skew with respect to each other, but within each group the signals have minimal skew. (A smaller channel can be clocked higher.)

Narrow channels, source-synchronous local clock signals.

Why Subdivision Helps

[Figure: sub-channel 1 and sub-channel 2, each with its own worst-case skew, compared against the worst-case skew of the combined {channel 1 + channel 2}]

An analogy: it's a lot harder to schedule 8 people for one meeting than to schedule two meetings of 4 people each; the results of the two meetings can be correlated later.

Worst Case Skew must be Considered in System Timing

Timing Variations

[Figure: a controller driving 4 loads vs. a controller driving 1 load; against the clock, "Cmd to 1 load" and "Cmd to 4 loads" arrive with different delays]

A "system" is a hard thing to design, especially one that lets end users perform configurations that impact timing. To guarantee functional correctness of the system, all corner cases of variance in loading and timing must be accounted for.

How many DIMMs in the system? How many devices on each DIMM? Who built the DIMMs? Infinite variations on timing!

Loading Balance

[Figure: one approach duplicates signal lines so the controller drives each memory module with its own copy; the other uses a single signal line with variable controller drive strength]

To ensure that a lightly loaded system and a fully loaded system do not differ significantly in timing, we either duplicate the signals sent to different memory modules, or we keep a single signal line but have the controller drive the I/O pads with variable strength, depending on whether the system has 1, 2, 3, or 4 loads.

Topology

[Figure: a controller facing several candidate arrangements of DRAM chips: all chips hanging off one shared bus, or chips split across multiple branches/channels]

Topology determines electrical loading and signal propagation lengths.

DRAM System Topology Determines Electrical Loading Conditions and Signal Propagation Lengths

SDRAM Topology Example

[Figure: a single-channel SDRAM controller driving a 64-bit data bus split across four pairs of x16 DRAM chips (16 bits per pair), with a shared command & address bus; the clock signal routes out along the chips and turns around]

Very simple topology. The clock signal that turns around is very nice: it solves the problem of needing multiple clocks. The drawback:

Loading Imbalance

RDRAM Topology Example

[Figure: an RDRAM controller at the head of a narrow channel that snakes through every device; the chip clock travels down the channel and turns around, so packets travel on parallel, path-length-matched paths]

All signals in this topology (address, command, data, clock) are sent point to point on channels that are path-length matched by definition. Packets travel down parallel paths; skew is minimal by design.

I/O Technology

[Figure: a signal swinging between "logic low" and "logic high" by a voltage Δv over a time Δt]

RSL vs. SSTL-2, etc. (like ECL vs. TTL of another era). What is "logic low," and what is "logic high"? Different electrical signaling protocols differ on voltage swing, high/low levels, etc. Today Δt is on the order of ns; we want it on the order of ps.

Slew rate = Δv / Δt

A smaller Δv means a smaller Δt at the same slew rate, which increases the rate of bits/s/pin.
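A quick numeric illustration of that relationship; the swings and the 2 V/ns slew rate below are illustrative values, not taken from any particular signaling standard:

```python
# Slew rate = dv/dt, so for a fixed slew rate the transition time scales with
# the voltage swing.
def transition_time_ps(swing_v: float, slew_v_per_ns: float) -> float:
    return swing_v / slew_v_per_ns * 1000.0  # convert ns to ps

for swing in (3.3, 2.5, 0.8, 0.2):  # full-rail LVTTL-like swings down to RSL-like swings
    print(f"swing {swing:>4} V -> {transition_time_ps(swing, 2.0):6.0f} ps at 2 V/ns")
```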

I/O: Differential Pair

[Figure: a single-ended transmission line vs. a differential-pair transmission line]

Used in clocking systems, e.g. RDRAM (all clock signals are pairs, clk and clk#). Differential signaling has the highest noise tolerance and does not need as many ground signals, whereas single-ended signals need many ground connections. Differential-pair signals can also be clocked higher, so the pin-bandwidth disadvantage is not nearly the 2:1 implied by the diagram.

Increase the rate of bits/s/pin? At what cost per pin? At what pin count?

I/O: Multi-Level Logic

[Figure: a signal whose voltage, compared against references Vref_0, Vref_1, and Vref_2, falls into four ranges encoding logic 00, 01, 10, and 11 over time]

One of several ways on the table to further increase the bit rate of the interconnects: encode two bits per symbol using four voltage levels. Increase the rate of bits/s/pin.

Packaging

[Figure: package types through the years: DIP ("the good old days"), SOJ (Small Outline J-lead), TSOP (Thin Small Outline Package), LQFP (Low Profile Quad Flat Package), FBGA (Fine Ball Grid Array)]

[Table: memory roadmap for Hynix NetDDR II, listing package (FBGA), speed (800 Mbps target / 550 Mbps specification), Vdd/Vddq (2.5 V / 2.5 V, moving to 1.8 V), interface (SSTL_2), and row cycle time tRC (35 ns)]

Different packaging types impact cost and speed. Slow parts can use the cheapest packaging available; faster parts may have to use more expensive packaging. This has long been accepted in the higher-margin processor world, but in DRAM every cent has to be hard fought for. To some extent, the demand for higher performance is pushing memory makers to use more expensive packaging to accommodate higher-frequency parts. When RAMBUS first spec'ed FBGA, module makers complained, since they had to purchase expensive equipment to validate that chips were properly soldered to the module board, whereas something like TSOP can be checked with visual inspection.

Access Protocol

[Timing diagrams: a single-cycle command (the full command r0 sent in one cycle, followed by the data burst d0) vs. a multiple-cycle command (r0 transmitted over several cycles before the data burst)]

If I have a 16-bit-wide command to send from A to B, I need 16 pins; if I have fewer than 16, I need multiple cycles. How many bits do I need to send from point A to point B? How many pins do I get? Cycles = Bits / Pins.
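The "Cycles = Bits / Pins" rule as a one-liner, with a rounding-up step for the narrow-bus case (the 16-bit command width is the slide's example; the 4-pin case is an assumed illustration):

```python
import math

# Cycles needed to move a command of a given width over a given number of pins.
def command_cycles(command_bits: int, pins: int) -> int:
    return math.ceil(command_bits / pins)

print(command_cycles(16, 16))  # 1 cycle: full-width command bus
print(command_cycles(16, 4))   # 4 cycles: narrow, multiplexed command bus
```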

Access Protocol (read/read)

[Timing diagram: consecutive cache-line read requests to the same DRAM row: one Activate (a0) on the row bus, two reads (r0, r1) on the column bus, and two back-to-back data bursts on the data bus, with RAS latency and CAS latency marked. Legend: a = Activate (open page), r = Read (column read), d = data chunk]

There is inherent latency between the issuance of a read command and the chip's response with data. To increase efficiency, a pipelined structure is necessary to obtain full utilization of the command, address, and data busses. Unlike an "ordinary" processor pipeline, a memory pipeline has data flowing in both directions. Architecture-wise we should be concerned with full utilization everywhere, so we can use the fewest pins for the greatest benefit, but in practice we are usually concerned with full utilization of the data bus.

Access Protocol (read/write)

[Figure: one datapath (data in/out buffers, column decoder, sense amps), two commands. Timing diagrams: Case 1, a read following a write command to different DRAM devices (pipelined); Case 2, a read following a write command to the same DRAM device (not pipelined); solution, delay the write data to match the read latency]

The DRAM chips determine the latency of data after a read command is received, but the controller determines the timing relationship between the write command and the data being written to the DRAM chips. (If the DRAM device cannot handle pipelined read/write, then...) Case 1: the controller sends write data at the same time as the write command, and the following read goes to a different device (pipelined). Case 2: the controller sends write data at the same time as the write command, and the following read goes to the same device (not pipelined). Solution: delay the data of the write command to match the read latency.

Access Protocol (pipelines)

[Timing diagrams: three back-to-back pipelined read commands (r0, r1, r2) with their data bursts, CAS latency marked; then the "same" latency at 2x the pin frequency, requiring a deeper pipeline]

To increase "efficiency," pipelining of commands is required. How many commands must one device support concurrently? 2? 3? 4? (It depends on what?) Imagine we must increase the data rate (higher pin frequency) but allow the DRAM core to operate slightly slower: 2x pin frequency, same core latency. This issue ties the access protocol to internal DRAM architecture issues.

When pin frequency increases, chips must either reduce “real latency”, or support longer bursts, or pipeline more commands.

Outline

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
  • SDRAM, DDR SDRAM, RDRAM Memory System Comparisons
  • Processor-Memory System Trends
  • RLDRAM, FCRAM, DDR II Memory Systems Summary
• Future Interface Trends & Research Areas

The 440BX used 132 pins to control a single SDRAM channel, not counting power and ground; the 845 now uses only 102. There are also slower versions (66/100 MHz), and a page-burst mode that bursts an entire page. Burst length is programmed to match the cache-line size (e.g. a 32-byte line = 256 bits = 4 cycles of 64 bits). Latency as seen by the controller is really CAS + 1 cycles.

• Performance Modeling: Architectures, Systems, Embedded

SDRAM System In Detail

[Figure: a single-channel SDRAM controller driving DIMMs 1 through 4 in a "mesh topology": shared address & command bus, shared data bus, and per-DIMM chip (DIMM) selects]

SDRAM Chip

• 133 MHz (7.5 ns cycle time), 256 Mbit, 54-pin TSOP
• Multiplexed command/address bus
• Programmable burst length: 1, 2, 4, or 8
• Quad banks internally
• Supply voltage of 3.3 V
• Low latency: CAS = 2, 3
• LVTTL signaling (0.8 V to 2.0 V thresholds; 0 to 3.3 V rail to rail)
• Pins: 16 data, 15 address, 7 command, 14 power/ground, 1 clock, 1 NC

SDRAM: inexpensive packaging, lowest cost (LVTTL signaling), standard 3.3 V supply voltage. The DRAM core and I/O share the same power supply. The "power cost" is between 560 mW and 1 W per chip for the duration of the cache-line burst. (Note: it costs power just to keep lines stored in the sense amps/row buffers, something for row-buffer management policies to think about.) About 2/7 of the pins are used for address, 2/7 for data, 2/7 for power/ground, and 1/7 for command signals. Row commands and column commands are sent on the same bus, so they have to be demultiplexed inside the DRAM and decoded.

Condition               Specification         Current   Power
Operating (active)      Burst = continuous    300 mA    1 W
Operating (active)      Burst = 2             170 mA    560 mW
Standby (active)        All banks active      60 mA     200 mW
Standby (powerdown)     All banks inactive    2 mA      6.6 mW

SDRAM Access Protocol (read/read)

[Figure and timing diagram: the memory controller reads from SDRAM chip #0 and then chip #1; steps 1-4 mark the read command/address assertions and the data-bus utilization (CAS latency 3). Along the clock, chip #0 drives its data past the rising edge (hold time), the bus goes idle, then chip #1 begins driving at least a setup time before the next rising edge. Caption: back-to-back memory read accesses to different chips in SDRAM; clock cycles are still long enough to allow pipelined back-to-back reads]

We've spent some time discussing pipelined back-to-back read commands sent to the same chip; now let's try to pipeline commands to different chips. For the memory controller to latch data on the data bus on consecutive cycles, chip #0 has to hold its data value past the rising edge of the clock to satisfy the hold-time requirement, then stop and let the bus go "quiet," and then chip #1 can start to drive the data bus at least some setup time ahead of the rising edge of the next clock. Clock cycles have to be long enough to tolerate all of these timing requirements.

SDRAM Access Protocol (write/read)

[Figure 1: consecutive reads to chips 0 through N; worst-case skew = Dist(N) - Dist(0). Figure 2: a read to chip N right after a write to chip (N-1); worst-case skew = Dist(N) + Dist(N-1), plus the bus turn-around]

I show different paths, but these signals share the bi-directional data bus. For a read following a write to a different chip, the worst case is when we write to the (N-1)th chip and then expect to pipeline a read command in the next cycle right behind it: the worst-case signal path skew is the sum of the two distances. Isn't going from chip N back to chip N even worse? No, SDRAM does not support pipelining a read behind a write on the same chip. Also, it's not as bad as I project here, since read cycles are center-aligned and writes are edge-aligned, so in essence we get 1.5 cycles to pipeline this case instead of just 1. Still, this problem limits the frequency scalability of SDRAM; idle cycles may have to be inserted to meet timing.

SDRAM Access Protocol (write/read)

[Figure and timing diagram: a write (w0) to chip #0 followed by a read (r1); the write data burst must complete and the bus must turn around before the read data burst starts]

Timing bubbles: more dead cycles.

Read Following a Write Command to Same SDRAM Device

DDR SDRAM System

[Figure: a single-channel DDR SDRAM controller driving DIMMs 1 through 3 with the same topology as SDRAM: shared address & command bus, data bus, per-DIMM chip selects, plus DQS (data strobe) signals routed alongside the data]

Since the data bus has a much lighter load, if we can use better signaling technology, perhaps we can run just the data bus at a higher frequency. At the higher frequency, the skews we talked about would be terrible across a 64-bit-wide data bus, so we use source-synchronous strobe signals (called DQS) routed in parallel with each 8-bit-wide sub-channel. DDR is newer, so it uses a lower core voltage, which saves on power too.

DDR SDRAM Chip

• 133 MHz (7.5 ns cycle time), 256 Mbit, 66-pin TSOP
• Multiplexed command/address bus
• Programmable burst lengths: 2, 4, or 8*
• Quad banks internally
• Supply voltage of 2.5 V*
• Low latency: CAS = 2, 2.5, 3*
• SSTL-2 signaling (Vref +/- 0.15 V; 0 to 2.5 V rail to rail)*
• Pins: 16 data, 15 address, 7 command, 16 power/ground*, 2 clock*, 2 DQS*, 1 Vref*, 7 NC*
(* differences from SDRAM)

[Timing diagram: a READ command with CAS latency 2; DQS pre-amble, data returned aligned to DQS, DQS post-amble]

Slightly larger package, same widths for address and data; the new pins are the 2 DQS strobes, Vref, and now differential clocks, at a lower supply voltage. The voltage swing is low and referenced to Vref instead of the fixed 0.8 V to 2.0 V thresholds. No power discussion here because the (Micron) data sheet is incomplete. Read data returned from the DRAM chips is now captured with respect to the timing of the DQS signals, which the DRAM chips send in parallel with the data itself. The use of DQS introduces "bubbles" between bursts from different chips and reduces bandwidth efficiency.

DDR SDRAM Protocol (read/read)

[Figure and timing diagram: reads r0 and r1 to chips #0 and #1 on the same DDR channel; the DQS pre-amble and post-amble force one idle cycle on the data bus between the two bursts (CAS latency 2). Caption: back-to-back memory read accesses to different chips in DDR SDRAM]

Here we see that two consecutive column read commands to different chips on the DDR memory channel cannot be placed back to back on the data bus, due to the DQS hand-off issue. They may be pipelined with one idle cycle between bursts. This is true for all consecutive accesses to different chips: read/read, read/write, write/read (except write/write, where the controller keeps control of the DQS signal and just changes target chips). Because of this overhead, short bursts are inefficient on DDR and longer bursts are more efficient (32-byte cache line = burst of 4, 64-byte line = burst of 8).
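The burst-length rule mentioned here (and on the earlier outline slide) is just line size over bus width; a minimal sketch:

```python
# Burst length chosen so that one burst carries exactly one cache line.
def burst_length(cacheline_bytes: int, data_bus_bits: int) -> int:
    return (cacheline_bytes * 8) // data_bus_bits

print(burst_length(32, 64))  # 4-beat burst for a 32-byte line on a 64-bit bus
print(burst_length(64, 64))  # 8-beat burst for a 64-byte line
```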

RDRAM System

[Figure and timing diagram: an RDRAM controller at the head of the channel; against the bus clock, two write commands (w0, w1) and a read (r2) on the column command bus, with their data packets on the data bus; tCWD is the write delay and tCAC the CAS access delay. Caption: two write commands followed by a read command]

Very different from SDRAM: everything is sent around in 8-half-cycle packets. Most systems now run at 400 MHz, but since everything is DDR it's called "800 MHz"; the only difference is that packets can only be initiated at the rising edge of the clock. Very clean topology, very clever clocking scheme: no clock hand-off issue, high efficiency. The write delay improves matching with the read latency (not perfect, tCAC - tCWD as shown). Since the data bus is 16 bits wide, each read command gets 16 x 8 = 128 bits back, so each cache-line fetch takes multiple packets. Up to 32 devices per channel.

Packet protocol: everything in 8 (half-) cycle packets.

Direct RDRAM Chip

• 400 MHz (2.5 ns cycle time), 256 Mbit, 86-pin FBGA
• Separate row and column command busses
• Burst length = 8*
• 4/16/32 banks internally*
• Supply voltage of 2.5 V*
• Low latency: CAS = 4 to 6 full cycles*
• RSL signaling (Vref +/- 0.2 V; 800 mV rail to rail)
• Pins: 16 data, 8 address/command, 4 clock*, 6 CTL*, 1 Vref*, 49 power/ground*, 2 NC

[Timing diagram: Activate, then back-to-back read packets, then data packets, then precharge and the next Activate; all packets are 8 (half-) cycles in length, so the protocol allows near-100% bandwidth utilization on all channels (address/command/data)]

RDRAM packets do not re-order the data inside the packet. To compute RDRAM latency we must add in the command-packet transmission time as well as the data-packet transmission time. RDRAM relies on its multitude of banks to try to ensure that a high percentage of requests hit open pages and only incur the cost of a CAS, instead of a RAS + CAS.

RDRAM Drawbacks

[Figure: the cost adders on a first-generation RDRAM chip: high-frequency I/O test and package cost, a separate RSL power plane, roughly 30% of die cost going to logic at the 64 Mbit node, active decode logic in the control path, and open row buffers that provide all data bits for each packet from a single chip (high power even in the "quiet" state)]

RDRAM provides high bandwidth, but what are the costs? RAMBUS pushed in many different areas simultaneously. The drawback was that, with a whole new set of infrastructure, the costs for first-generation products were exorbitant. Significant cost delta for the first generation.

System Comparison

                                    SDRAM     DDR       RDRAM
Frequency (MHz)                     133       133 * 2   400 * 2
Pin count (data bus)                64        64        16
Pin count (controller)              102       101       33
Theoretical bandwidth (MB/s)        1064      2128      1600
Theoretical efficiency
  (data bits/cycle/pin)             0.63      0.63      0.48
Sustained BW (MB/s)*                655       986       1072
Sustained efficiency*
  (data bits/cycle/pin)             0.39      0.29      0.32
RAS + CAS (tRAC) (ns)               45 ~ 50   45 ~ 50   57 ~ 67
CAS latency (ns)**                  22 ~ 30   22 ~ 30   40 ~ 50

*StreamAdd   **Load-to-use latency   (133 MHz P6 chipset + SDRAM: CAS latency ~ 80 ns)

Low pin count, higher latency. In general terms, the comparison simply points out the areas where RDRAM excels, i.e. high bandwidth and low pin count; but it also has longer latency, since it takes 10 ns just to move the command from the controller onto the DRAM chip and another 10 ns just to get the data from the DRAM chips back onto the controller interface.
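The efficiency rows in the table appear to be data bits per transfer cycle per controller pin; a small sketch that reproduces them from the other rows (the interpretation of the denominator as transfer rate times controller pin count is an assumption that happens to match the printed values):

```python
# Efficiency = (bandwidth in bits/s) / (transfer rate in Hz * controller pins).
def efficiency(bw_mb_per_s: float, transfer_mhz: float, controller_pins: int) -> float:
    return (bw_mb_per_s * 8) / (transfer_mhz * controller_pins)

print(f"SDRAM theoretical : {efficiency(1064, 133,     102):.2f}")  # ~0.63
print(f"DDR   theoretical : {efficiency(2128, 133 * 2, 101):.2f}")  # ~0.63
print(f"RDRAM theoretical : {efficiency(1600, 400 * 2,  33):.2f}")  # ~0.48
print(f"RDRAM sustained   : {efficiency(1072, 400 * 2,  33):.2f}")  # ~0.32
```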

Differences of Philosophy

[Figure: SDRAM variants pair a complex controller with inexpensive interconnect and simple interface logic on the DRAM chips; RDRAM variants pair a simplified controller with expensive interconnect and complex interface logic, i.e. the complexity is moved onto the DRAM chips]

RDRAM moves complexity from the controller interface into the DRAM chips. Is this a good trade-off? What does the future look like?

Technology Roadmap (ITRS)

                                 2004        2007        2010        2013        2016
Semiconductor generation (nm)    90          65          45          32          22
CPU MHz                          3990        6740        12000       19000       29000
MLogic transistors / cm^2        77.2        154.3       309         617         1235
High-perf chip pin count         2263        3012        4009        5335        7100
High-perf chip pin cost
  (cents/pin)                    1.88        1.61        1.68        1.44        1.22
Memory pin cost (cents/pin)      0.34-1.39   0.27-0.84   0.22-0.34   0.19-0.39   0.19-0.33
Memory pin count                 48-160      48-160      62-208      81-270      105-351

To begin with, we look into a crystal ball for trends that will cause changes or limit scalability in the areas we are interested in. ITRS = International Technology Roadmap for Semiconductors. Transistor frequencies are supposed to nearly double every generation, and the transistor budget (as indicated by million logic transistors per cm^2) is projected to double. Interconnects between chips are a different story: measured in cents/pin, pin cost decreases only slowly, and the pin budget grows slowly each generation.

Punchline: in the future, free transistors and costly interconnects.

Choices for the Future

[Figure: four options. Direct connection of custom DRAM to the CPU: highest bandwidth, low latency. Direct connection of semi-commodity DRAM: high bandwidth, low/moderate latency. Direct connection of commodity DRAM: low bandwidth, low latency. Indirect connection through an external memory controller to many commodity DRAMs: highest bandwidth of the inexpensive options, highest latency]

So we have some choices to make. Integration will move the memory controller on die, where its frequency will be much higher, and the command/data path will cross chip boundaries only twice instead of four times. But interfacing with memory chips directly means being limited by the lowest common denominator. To get the highest bandwidth (for a given number of pins) AND the lowest latency, we would need custom RAM; it might as well be SRAM, but that would be prohibitively expensive.

EV7 + RDRAM (Compaq/HP)

• RDRAM memory (2 controllers)
• Direct connection to the processor
• 75 ns load-to-use latency
• 12.8 GB/s peak bandwidth
• 6 GB/s read or write bandwidth
• 2048 open pages (2 x 32 x 32)

[Figure: two on-die memory controllers (MC), each 64 bits wide, fanning out to 16-bit RDRAM channels; each column read fetches 128 x 4 = 512 bits of data]

Two RDRAM controllers means two independent channels. Only one packet has to be generated for each 64-byte cache-line transaction request. (The extra channel stores directory data, i.e. "I belong to CPU #2, exclusively.") Very aggressive use of the available open pages in RDRAM memory.
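The headline numbers on this slide are simple products; a quick check:

```python
# Open-page count and per-read fetch size quoted on the EV7 slide.
controllers, devices_per_channel, banks_per_device = 2, 32, 32
print(controllers * devices_per_channel * banks_per_device)  # 2048 open pages

bits_per_column_read, ganged_channels = 128, 4
print(bits_per_column_read * ganged_channels)                # 512 bits = one 64-byte line
```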

What if EV7 Used DDR?

• Peak bandwidth of 12.8 GB/s requires:
• 6 channels of 133 * 2 MHz DDR SDRAM, i.e.
• 6 controllers with 64-bit-wide channels, or
• 3 controllers with 128-bit-wide channels

The EV7 cache line is 64 bytes, so each 4-channel ganged RDRAM group can fetch 64 bytes with one single packet, and each DDR SDRAM channel can fetch 64 bytes by itself; so we need 6 controllers. If we gang two DDR SDRAM channels into one, we have to reduce the burst length from 8 to 4; shorter bursts are less efficient, and sustainable bandwidth drops.

System             EV7 + RDRAM        EV7 + 6-controller     EV7 + 3-controller
                                      DDR SDRAM              DDR SDRAM
Latency            75 ns              ~50 ns*                ~50 ns*
Pin count          ~265** + Pwr/Gnd   ~600** + Pwr/Gnd       ~600** + Pwr/Gnd
Controller count   2                  6***                   3***
Open pages         2048               144                    72

* Page-hit CAS + memory controller latency.
** Including all signals (address, command, data, clock), not including ECC or parity.
*** The 3-controller design is less bandwidth efficient.

What's Next?

• DDR II
• FCRAM
• RLDRAM
• RDRAM (Yellowstone, etc.)
• Kentron QBM

DDR SDRAM was an advancement over SDRAM, with lowered Vdd, a new electrical signal interface (SSTL), and a new protocol, but fundamentally the same tRC of roughly 60 ns; RDRAM has a tRC of roughly 70 ns. All are comparable in row recovery time. So what's next? What's on the horizon? DDR II / FCRAM / RLDRAM / RDRAM-next-gen / Kentron? What are they, and what do they bring to the table?

DDR II: DDR Next Generation

• Lower I/O voltage (1.8 V)
• DRAM core operates at 1:4 the data-bus frequency (SDRAM 1:1, DDR 1:2)
• Backward compatible with DDR (common command sets)
• 400 Mbps per pin with multi-drop modules; 800 Mbps point-to-point
• No more page-transfer-until-interrupted commands
• Burst length == 4 only
• FBGA package (removes speed paths)
• 4 banks internally (same as SDRAM and DDR)
• Write latency = CAS - 1 (increased bus utilization)

DDR II is a follow-on to DDR; its command set is a superset of the DDR SDRAM commands. Lower I/O voltage means lower I/O power and possibly faster signal switching due to the lower rail-to-rail swing. The DRAM core now operates at 1:4 of the data-bus frequency. A valid command may be latched on any given rising edge of the clock, but may be delayed a cycle, since the command bus runs at half the data-bus frequency. In a memory system it can run at 400 Mbps per pin, while it can be cranked up to 800 Mbps per pin in an embedded system without connectors. DDR II eliminates the transfer-until-interrupted commands and limits the burst length to 4 only (simpler to test).

DDR II, Continued: Posted Commands

[Timing diagrams: SDRAM and DDR issue Activate (RAS), wait tRCD, then Read (CAS), then get data; with DDR II's posted CAS, the Read can be issued immediately after the Activate, and an internal counter delays it so the chip issues the "real" column command after tRCD for lowest latency]

Instead of a controller that keeps track of cycles, we can now have a "dumber" controller. Control is now simple, kind of like SRAM: part I of the address one cycle, part II the next. SDRAM and DDR rely on the memory controller to know tRCD and issue the CAS after tRCD for lowest latency; with DDR II's posted CAS, an internal counter delays the CAS command and the DRAM chip issues the "real" command after tRCD.

FCRAM: Fast Cycle RAM (aka Network DRAM)

FCRAM is a trademark of Fujitsu; Toshiba manufactures FCRAM/Network DRAM under this trademark and sells it as Network DRAM. Same thing. Extra die area is devoted to circuits that lower the row cycle time to half that of DDR, and the random access latency (tRAC) down to 22 to 26 ns. Writes are delay-matched with the CAS latency, for better bus utilization.

Features               DDR SDRAM        FCRAM / Network DRAM
Vdd, Vddq              2.5 +/- 0.2 V    2.5 +/- 0.15 V
Electrical interface   SSTL-2           SSTL-2
Clock frequency        100~167 MHz      154~200 MHz
tRAC                   ~40 ns           22~26 ns
tRC                    ~60 ns           25~30 ns
# Banks                4                4
Burst length           2, 4, 8          2, 4
Write latency          1 clock          CASL - 1

FCRAM/Network DRAM looks like DDR+.

FCRAM, Continued

[Figure: bus-utilization comparison presented at Denali MemCon 2002, showing higher sustained efficiency for FCRAM under random read/write accesses thanks to its faster tRC]

With a faster DRAM turn-around time (tRC), a stream of random accesses that keeps hitting the same page over and over again achieves the highest bus utilization (with random read/write accesses). This is also why peak bandwidth != sustained bandwidth: deviations from peak bandwidth can be due to architecture-level constraints such as tRC (you cannot re-cycle a DRAM array to grab more data out of the same array and re-use the sense amps until tRC has elapsed).

The faster tRC allows Samsung to claim higher bus efficiency. (*Presentation at Denali MemCon 2002)

RLDRAM

Another variant, but RLDRAM is targeted toward embedded systems. There are no connector specifications, so it can target a higher frequency off the bat.

DRAM Type      Frequency   Bus Width    Peak Bandwidth   Random Access      Row Cycle
                           (per chip)   (per chip)       Time (tRAC)        Time (tRC)
PC133 SDRAM    133         16           200 MB/s         45 ns              60 ns
DDR 266        133 * 2     16           532 MB/s         45 ns              60 ns
PC800 RDRAM    400 * 2     16           1.6 GB/s         60 ns              70 ns
FCRAM          200 * 2     16           0.8 GB/s         25 ns              25 ns
RLDRAM         300 * 2     32           2.4 GB/s         25 ns              25 ns

Comparable to FCRAM in latency; higher frequency (no connectors); non-multiplexed address (SRAM-like).

RLDRAM, Continued: RLDRAM Applications (II), L3 Cache for High-End PC and Server

[Figure: a processor and memory controller each connected over 2.4 GB/s x32 channels to 256 Mb RLDRAM chips serving as L3 cache. Caption: "RLDRAM is a great replacement for SRAM in L3 cache applications because of its high density, low power and low cost."]

Infineon proposes that RLDRAM could be integrated onto the motherboard as an L3 cache: 64 MB of L3, shaving 25 ns off the tRAC of going to SDRAM or DDR SDRAM. It is not to be used as main memory, due to capacity constraints.

*Infineon Presentation, Denali MemCon 2002

RAMBUS Yellowstone

• Bi-directional differential signals
• Ultra-low 200 mV peak-to-peak signal swings
• 8 data bits transferred per clock: Octal Data Rate (ODR) signaling
• 400 MHz system clock, 3.2 GHz effective data frequency
• Cheap 4-layer PCB
• Commodity packaging

[Figure: the system clock and the data eye swinging between 1.0 V and 1.2 V]

Unlike the other DRAMs, Yellowstone is only a voltage and I/O specification; there is no DRAM, as far as I know. RAMBUS has learned its lesson: they previously used expensive packaging, 8-layer motherboards, and added cost everywhere. Now the new pitch is "higher performance with the same infrastructure."

Kentron QBM (Quad Band Memory)

[Figure: DDR chips A and B behind FET switches on each module; the controller clock alternately selects A and B, so the output interleaves the d0 and d1 bursts at 4x the clock frequency. Caption: "wrapper electronics around DDR memory" generates 4 data bits per pin per cycle instead of 2]

QBM uses FET switches to control which DIMM drives the output; two DDR memory chips are interleaved to get quad-rate memory. Advantages: it uses standard DDR chips, and the extra cost is low, only the wrapper electronics. A modification to the memory controller is required, but it is minimal: the controller has to understand that data is being burst back at 4x the clock frequency. It does not improve efficiency, but it is cheap bandwidth. It supports more loads than "ordinary" DDR, so more capacity.

A Different Perspective: Everything Is Bandwidth

[Figure: the clock, row command/address, column command/address, write data, and read data busses each viewed as a bandwidth resource]

Instead of thinking about things from a strict latency-bandwidth perspective, it might be more helpful to think in terms of latency versus pin-transition efficiency: not just latency and bandwidth, but pin bandwidth and pin-transition efficiency (data bits/cycle/pin).

Research Areas: Topology

[Figure: a unidirectional loop: command and write packets flow from the controller out along one bus through the DRAM chips, and read data returns to the controller on another unidirectional bus, like a slotted-ring network]

A DRAM system is basically a networking system with a smart master controller and a large number of "dumb" slave devices. If we are concerned about "efficiency" at the bit/pin/sec level, it might behoove us to draw inspiration from network interfaces and design something like this: unidirectional command and write packets from the controller to the DRAM chips, and a unidirectional bus from the DRAM chips back to the controller. Then it looks like a network system with a slotted-ring interface, with no need to deal with bus turn-around issues.

Unidirectional topology:

• Write Packets sent on Command Bus

• Pins used for Command/Address/Data

• Further Increase of Logic on DRAM chips

Memory Commands?

[Figure: instead of the CPU writing zeros element by element (Act, Write 0, 0, 0, ...), the controller could issue a "write 0" command so A[ ] = 0 happens inside the DRAM. Similarly: why do A[ ] = B[ ] in the CPU? Move data inside the DRAM or between DRAMs. Why do STREAMadd (A[ ] = B[ ] + C[ ]) in the CPU?]

Certain things simply do not make sense to do in the CPU, such as the various STREAM components: moving multi-megabyte arrays from DRAM to the CPU just to perform a simple add, then moving those multi-megabyte arrays right back. In such extremely bandwidth-constrained applications it would be beneficial to have some logic or hardware on the DRAM chips that can perform simple computation. This is tricky, since we do not want to add so much logic that the DRAM chips become prohibitively expensive to manufacture (logic overhead decreases with each generation, so adding logic is not an impossible dream). Also, we do not want to add logic to the critical path of a DRAM access; that would slow down a general access in terms of "real latency" in ns.

Active Pages (Chong et al., ISCA '98)

Address Mapping

[Figure: a physical address split into fields: device ID, row address, bank ID, column address]

For a given physical address, there are a number of ways to map the bits of the physical address onto the "memory address," i.e. device ID, row/column address, and bank ID. The mapping policy can impact performance, since a badly mapped system can cause bank conflicts on consecutive accesses. Mapping policies must now also take temperature control into account, as consecutive accesses that hit the same DRAM chip can create undesirable hot spots; one reason for RDRAM's additional initial cost was the use of heat spreaders on the memory modules to prevent hotspots from building up. Goals: distribute accesses for temperature control, avoid bank conflicts, and reorder accesses for performance.
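A minimal sketch of one possible bit-slicing policy. The field widths and their order (column bits low, then bank, row, device) are an assumption for illustration only; real controllers expose this as a configurable policy, and the decode below will not reproduce the worked example on the next slide, whose exact layout is not given:

```python
# Assumed field widths: 10 column bits, 2 bank bits, 13 row bits, 3 device bits.
COL_BITS, BANK_BITS, ROW_BITS, DEV_BITS = 10, 2, 13, 3

def map_address(paddr: int):
    """Split a physical address into (device, row, bank, col) under the assumed layout."""
    col = paddr & ((1 << COL_BITS) - 1)
    paddr >>= COL_BITS
    bank = paddr & ((1 << BANK_BITS) - 1)
    paddr >>= BANK_BITS
    row = paddr & ((1 << ROW_BITS) - 1)
    paddr >>= ROW_BITS
    device = paddr & ((1 << DEV_BITS) - 1)
    return device, row, bank, col

# Decoded fields under the assumed layout (will differ from a real controller's mapping).
print(map_address(0x05AE5700))
```

Swapping which physical-address bits feed the bank field is exactly the knob that turns the bank conflicts on the next slide into pipelined accesses.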

Example: Bank Conflicts

[Figure: several DRAM chips, each containing multiple banks (row decoder, memory array, sense amps, column decoder); multiple banks reduce access conflicts]

Each memory system consists of one or more memory chips, and most of the time accesses to different chips can be pipelined. Each chip also has multiple banks, and most of the time accesses to different banks can also be pipelined (the key to efficiency is pipelining commands).

Read 05AE5700   Device id 3, Row id 266, Bank id 0
Read 023BB880   Device id 3, Row id 1BA, Bank id 0
Read 05AE5780   Device id 3, Row id 266, Bank id 0
Read 00CBA2C0   Device id 3, Row id 052, Bank id 1

More Banks per Chip == Performance == Logic Overhead

Example: Access Reordering

1 Read 05AE5700   Device id 3, Row id 266, Bank id 0
2 Read 023BB880   Device id 3, Row id 1BA, Bank id 0
3 Read 05AE5780   Device id 3, Row id 266, Bank id 0
4 Read 00CBA2C0   Device id 1, Row id 052, Bank id 1

[Timing diagrams: with strict ordering, the sequence Act 1, Read, Prec, Act 2, Read, Prec, Act 3, Read ... must respect tRC between row activations to the same bank. With memory-access re-ordering (Act 1, Act 4, reads 1, 3, 4, then Prec, Act 2, Prec ...), the same set of requests completes sooner]

Each load command is translated into a row command and a column command. If two commands are mapped to the same bank, one must be completed before the other can start. If we can re-order the sequence, the entire sequence can be completed faster. By allowing Read 3 to bypass Read 2, we do not need to generate another row activation; Read 4 may also bypass Read 2, since it operates on a different device/bank entirely. DRAM can now do auto-precharge, but I show the precharge explicitly to emphasize that two rows cannot be active in the same bank within tRC (a DRAM architecture constraint).

Act = Activate page (data moved from DRAM cells to row buffer)
Read = Read data (data moved from row buffer to memory controller)
Prec = Precharge (close page / evict data in row buffer and sense amps)
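A toy version of this reordering pass, assuming a simple open-row policy: a later read may be hoisted ahead of an earlier one if it targets a different (device, bank) or hits the row that bank already has open. This is only a sketch of the idea on the slide, not any specific controller's scheduler:

```python
requests = [  # (name, device, row, bank) from the slide's example
    ("rd1", 3, 0x266, 0),
    ("rd2", 3, 0x1BA, 0),
    ("rd3", 3, 0x266, 0),
    ("rd4", 1, 0x052, 1),
]

def reorder(reqs):
    scheduled, open_row = [], {}          # (device, bank) -> currently open row
    pending = list(reqs)
    while pending:
        # Prefer a request that hits an open row or an idle bank; otherwise take the oldest.
        pick = next((r for r in pending
                     if open_row.get((r[1], r[3])) in (None, r[2])), pending[0])
        pending.remove(pick)
        open_row[(pick[1], pick[3])] = pick[2]
        scheduled.append(pick[0])
    return scheduled

print(reorder(requests))  # ['rd1', 'rd3', 'rd4', 'rd2']: reads 3 and 4 bypass read 2
```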

Outline

Now: performance issues.

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics

• DRAM Evolution: Interface Path

• Future Interface Trends & Research Areas

• Performance Modeling: Architectures, Systems, Embedded

Simulator Overview

CPU: SimpleScalar v3.0a

• 8-way out-of-order
• L1 cache: split 64K/64K, lockup free x32

• L2 cache: unified 1MB, lockup free x1

• L2 blocksize: 128 bytes

Main Memory: 8 64Mb DRAMs

• 100MHz/128-bit memory bus

• Optimistic open-page policy

Benchmarks: SPEC ’95
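For reference, the simulated system above collected into one configuration object. This is only a sketch: the key names are mine, and reading the "x32"/"x1" annotations as numbers of outstanding misses (MSHRs) is an assumption.

```python
# The simulated system from the slide, written out in one place. Key names
# are illustrative; values are the ones listed above.
sim_config = {
    "cpu":         {"model": "SimpleScalar v3.0a", "issue": "8-way out-of-order"},
    "l1_cache":    {"split_kb": (64, 64), "lockup_free": True, "mshrs": 32},
    "l2_cache":    {"size_mb": 1, "unified": True, "block_bytes": 128, "mshrs": 1},
    "main_memory": {"dram_chips": 8, "chip_density": "64Mb",
                    "bus": "100MHz / 128-bit", "page_policy": "open (optimistic)"},
    "benchmarks":  "SPEC '95",
}
```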

DRAM TUTORIAL

ISCA 2002 DRAM Configurations Bruce Jacob David Wang University of Maryland
FPM, EDO, SDRAM, ESDRAM, DDR: eight x16 DRAM parts ganged on a DIMM, reached through the memory controller over a 128-bit 100MHz bus from the CPU and caches.
Rambus, Direct Rambus, SLDRAM: a chain of DRAMs on a fast, narrow channel behind the memory controller, with the same 128-bit 100MHz bus between the CPU/caches and the controller.

Note: TRANSFER WIDTH of Direct Rambus Channel • equals that of ganged FPM, EDO, etc. • is 2x that of Rambus & SLDRAM
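The note can be checked with simple arithmetic. The sketch below uses only the figures implied by the slide itself (a 128-bit, 100 MHz ganged bus and the 2x relationship); treat the per-channel numbers as the study's modeling assumptions rather than datasheet quotes.

```python
# Effective transfer bandwidth per channel, as implied by the note above.
ganged_bus_gbs = 128 / 8 * 100e6 / 1e9     # 128-bit x 100 MHz ganged bus = 1.6 GB/s
drdram_gbs     = ganged_bus_gbs            # Direct Rambus channel: equal, 1.6 GB/s
rdram_gbs = sldram_gbs = drdram_gbs / 2    # Rambus, SLDRAM: half, 0.8 GB/s
print(ganged_bus_gbs, drdram_gbs, rdram_gbs)
```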

DRAM TUTORIAL ISCA 2002 DRAM Configurations Bruce Jacob David Wang University of Maryland
Rambus & SLDRAM dual-channel: two fast, narrow channels of DRAMs behind the memory controller, with the 128-bit 100MHz bus between the CPU/caches and the controller.
Strawman (Rambus, etc.): multiple parallel channels of DRAMs behind the memory controller, again behind the 128-bit 100MHz CPU bus.

DRAM TUTORIAL ISCA 2002 First … Refresh Matters Bruce Jacob David Wang

University of Maryland
(Chart: time per access in ns for the compress benchmark, broken into bus wait time, refresh time, data transfer time, data transfer time overlap, column access time, row access time, and bus transmission time, across the DRAM configurations FPM1, FPM2, FPM3, EDO1, EDO2, SDRAM1, ESDRAM, SLDRAM, RDRAM, and DRDRAM.)

Assumes refresh of each bank every 64ms
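A back-of-the-envelope view of what that refresh requirement costs. The row count and row-cycle time in the sketch are illustrative assumptions, not parameters of the simulated DRAMs; only the 64 ms window comes from the slide.

```python
# Rough refresh overhead: every row of a bank must be refreshed within 64 ms,
# and each refresh busies the bank for roughly one row cycle.
rows_per_bank     = 4096      # assumption
t_rc_ns           = 100       # assumption: one row cycle per refresh
refresh_window_ms = 64        # from the slide

busy_ns  = rows_per_bank * t_rc_ns
fraction = busy_ns / (refresh_window_ms * 1e6)
print(f"bank busy refreshing {fraction:.2%} of the time")   # ~0.64%
```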

DRAM TUTORIAL ISCA 2002 Overhead: Memory vs. CPU Bruce Jacob David Wang Total Execution Time in CPI — SDRAM

University of Maryland
(Chart: clocks per instruction on SDRAM for Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, and Vortex, broken into stalls due to memory access time, overlap between execution & memory, and processor execution (includes caches), shown for yesterday's, today's, and tomorrow's CPU speeds.)

Variable: speed of processor & caches

DRAM TUTORIAL ISCA 2002 Definitions (var. on Burger, et al) Bruce Jacob David Wang • tPROC — processor with perfect memory University of Maryland • tREAL — realistic configuration

NOTE • tBW — CPU with wide memory paths

• tDRAM — time seen by DRAM system

Bar-chart breakdown (total = tREAL):
  Stalls Due to BANDWIDTH = tREAL - tBW
  Stalls Due to LATENCY   = tBW - tPROC
  CPU-Memory OVERLAP      = tPROC - (tREAL - tDRAM)
  CPU+L1+L2 Execution     = tREAL - tDRAM
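Written out as a small function (a sketch; the argument names are just the four measured execution times), the breakdown is:

```python
# The four bar segments defined above, from the four measured execution times.
def breakdown(t_proc, t_real, t_bw, t_dram):
    return {
        "stalls_bandwidth": t_real - t_bw,               # tREAL - tBW
        "stalls_latency":   t_bw - t_proc,               # tBW - tPROC
        "cpu_mem_overlap":  t_proc - (t_real - t_dram),  # tPROC - (tREAL - tDRAM)
        "cpu_l1_l2_exec":   t_real - t_dram,             # tREAL - tDRAM
    }
# The segments sum to tREAL, so the stacked bar has the height of the
# realistic configuration's execution time.
```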

DRAM TUTORIAL ISCA 2002 Memory & CPU — PERL Bruce Jacob David Wang

University of Maryland
Bandwidth-Enhancing Techniques I:
(Chart: cycles per instruction for PERL across the DRAM configurations FPM, EDO, SDRAM, ESDRAM, DDR, SLDRAM, RDRAM, and DRDRAM (newer DRAMs toward the right), broken into stalls due to memory bandwidth, stalls due to memory latency, overlap between execution & memory, and processor execution, for yesterday's, today's, and tomorrow's CPU speeds.)

DRAM TUTORIAL ISCA 2002 Memory & CPU — PERL Bruce Jacob David Wang

University of Maryland
Bandwidth-Enhancing Techniques II:
(Chart: execution time in CPI for PERL with 10GHz CPUs, across FPM/interleaved, EDO/interleaved, SDRAM & DDR, SLDRAM x1/x2, and RDRAM x1/x2, broken into stalls due to memory bandwidth, stalls due to memory latency, overlap between execution & memory, and processor execution.)

DRAM TUTORIAL ISCA 2002 Average Latency of DRAMs Bruce Jacob David Wang University of Maryland
(Chart: average time per access in ns for FPM, EDO, SDRAM, ESDRAM, DDR, SLDRAM, RDRAM, and DRDRAM, broken into bus wait time, refresh time, data transfer time, data transfer time overlap, column access time, row access time, and bus transmission time.)

note: SLDRAM & RDRAM 2x data transfers


DRAM TUTORIAL ISCA 2002 DDR2 Study Results Bruce Jacob David Wang

University of Maryland
Architectural Comparison
(Chart: normalized execution time for pc100, ddr133, drd, ddr2, ddr2ems, and ddr2vc across the benchmarks cc1, compress, go, ijpeg, li, linear_walk, mpeg2dec, mpeg2enc, pegwit, perl, random_walk, stream, and stream_no_unroll.)

DRAM TUTORIAL ISCA 2002 DDR2 Study Results Bruce Jacob David Wang

University of Maryland
Perl Runtime
(Chart: execution time in seconds for pc100, ddr133, drd, ddr2, ddr2ems, and ddr2vc at processor frequencies of 1 GHz, 5 GHz, and 10 GHz.)

DRAM TUTORIAL ISCA 2002 Row-Buffer Hit Rates Bruce Jacob David Wang University of Maryland
(Charts: hit rate in row buffers for FPM DRAM, EDO DRAM, SDRAM, ESDRAM, DDR SDRAM, SLDRAM, RDRAM, and DRDRAM, shown with no L2 cache, a 1MB L2 cache, and a 4MB L2 cache; left panels cover the SPEC INT 95 benchmarks (Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, Vortex), right panels cover the ETCH traces (Acroread, Compress, Gcc, Go, Netscape, Perl, Photoshop, Powerpoint, Winword).)

DRAM TUTORIAL ISCA 2002 Row-Buffer Hit Rates Bruce Jacob David Wang University of Maryland
(Charts: hits vs. depth in a victim-row FIFO buffer (depths 0-10) for Go, Li, Vortex, Compress, Perl, and Ijpeg; plus histograms of request inter-arrival time in CPU clocks for Compress and Vortex.)

DRAM TUTORIAL ISCA 2002 Row Buffers as L2 Cache Bruce Jacob David Wang

University of Maryland
(Chart: clocks per instruction for Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, and Vortex, broken into stalls due to memory bandwidth, stalls due to memory latency, overlap between execution & memory, and processor execution.)

DRAM TUTORIAL ISCA 2002 Row Buffer Management Bruce Jacob David Wang University of Maryland
(Figure: the ROW ACCESS (RAS) and COLUMN ACCESS (CAS) halves of an access, each showing the data in/out buffers, column decoder, sense amps, bit lines, memory array, and row decoder.)
Each memory transaction breaks down into a two-part access: a row access and a column access. In essence, the row buffer / sense amps act as a cache: a page is brought in from the memory array and stored in the buffer, then the second step moves that data from the row buffer back to the memory controller. From a certain perspective, it makes sense to speculatively move pages from the memory arrays into the row buffers to maximize the page-hit rate of column accesses and reduce latency. The cost of a speculative row-activation command is only the ~20 bits of bandwidth it consumes on the command channel from controller to DRAM. Instead of prefetching into the DRAM, we are prefetching inside the DRAM. Row-buffer hit rates are 40-90%, depending on the application, and *could* be near 100% if the memory system received speculative row-buffer management commands. (This only makes sense if the memory controller is integrated.)
RAS is like Cache Access. Why not Speculate?
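A minimal sketch of the row buffer acting as a one-entry cache per bank under an open-page policy. The hit-rate bookkeeping is illustrative, and the speculative-activation policy itself (which the slide leaves open) is not modeled.

```python
def track_hits(accesses):
    """accesses: iterable of (bank, row). Open-page policy: a row stays in the
    sense amps until a different row in the same bank is needed."""
    open_row, hits, total = {}, 0, 0
    for bank, row in accesses:
        total += 1
        if open_row.get(bank) == row:
            hits += 1                 # column access only -- no new ACT
        else:
            open_row[bank] = row      # precharge + activate the new row
    return hits / total if total else 0.0

print(track_hits([(0, 0x266), (0, 0x266), (0, 0x1BA), (0, 0x1BA)]))   # 0.5
```

A speculative policy would insert (bank, predicted_row) activations ahead of the real accesses; its benefit is exactly the fraction of misses it converts into hits, at the cost of the extra command-channel traffic described above.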

DRAM TUTORIAL ISCA 2002 Cost-Performance Bruce Jacob David Wang

University of FPM, EDO, SDRAM, ESDRAM: Maryland • Lower Latency => Wide/Fast Bus NOTE • Increase Capacity => Decrease Latency

• Low System Cost

Rambus, Direct Rambus, SLDRAM:

• Lower Latency => Multiple Channels

• Increase Capacity => Increase Capacity

• High System Cost

However, 1 DRDRAM = Multiple SDRAM

DRAM TUTORIAL ISCA 2002 Conclusions Bruce Jacob David Wang

University of Maryland
100MHz/128-bit Bus is Current Bottleneck
• Solution: Fast Bus/es & MC on CPU (e.g. Alpha 21364, …)

Current DRAMs Solving Bandwidth Problem (but not Latency Problem)

• Solution: New cores with on-chip SRAM (e.g. ESDRAM, VCDRAM, …)

• Solution: New cores with smaller banks (e.g. MoSys “SRAM”, FCRAM, …)

Direct Rambus seems to scale best for future high-speed CPUs

DRAM TUTORIAL ISCA 2002 Outline Bruce Jacob David Wang University of Maryland • Basics • DRAM Evolution: Structural Path • Advanced Basics • DRAM Evolution: Interface Path • Future Interface Trends & Research Areas • Performance Modeling: Architectures, Systems, Embedded
now -- let's talk about DRAM performance at the SYSTEM level. previous studies show the MEMORY BUS is a significant bottleneck in today's high-performance systems:
- Schumann reports that in Alpha workstations, 30-60% of PRIMARY MEMORY LATENCY is due to SYSTEM OVERHEAD other than DRAM latency
- a Harvard study cites BUS TURNAROUND as responsible for a factor-of-two difference between PREDICTED and MEASURED performance in P6 systems

- our previous work shows today’s busses (1999’s busses) are bottlenecks for tomorrow’s DRAMs so -- look at bus, model system overhead

DRAM TUTORIAL ISCA 2002 Motivation Bruce Jacob David Wang

University of Maryland
Even when we restrict our focus to the SYSTEM LEVEL -- i.e. outside the CPU -- we find a large number of parameters.
SYSTEM-LEVEL PARAMETERS (this study only VARIES a handful, but that still yields a fairly large space; the parameters we VARY are shown in blue & green):
• Number of channels / Width of channels
• Channel latency / Channel bandwidth
• Banks per channel / Turnaround time
• Request-queue size / Request reordering
• Row-access / Column-access
• DRAM precharge / CAS-to-CAS latency
• DRAM buffering / L2 cache blocksize
• Number of MSHRs / Bus protocol
Fully | partially | not independent (this study). By "partially" independent, I mean that we looked at a small number of possibilities: turnaround is 0/1 cycle on an 800MHz bus; request ordering is INTERLEAVED or NOT.
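Enumerating just the parameters this study varies gives a sense of the size of the space. The value sets below are the ones listed on the Channels & Banks slide later in this section, plus the turnaround and queue-size options mentioned above; combining them is only a sketch of the space, not the exact set of simulated configurations.

```python
from itertools import product

channels   = [1, 2, 4]                  # independent 800 MHz channels
width_bits = [8, 16, 32, 64]            # data bits per channel
banks      = [1, 2, 4, 8]               # independent banks per channel
burst_B    = [32, 64, 128]              # bytes per burst
turnaround = [0, 1]                     # bus cycles
queue_size = [0, 1, 2, 4, 8, 16, 32, float("inf")]   # requests per channel

space = list(product(channels, width_bits, banks, burst_B, turnaround, queue_size))
print(len(space), "configurations before pruning")    # 2304
```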

DRAM TUTORIAL ISCA 2002 Motivation Bruce Jacob David Wang

University of Maryland
... the design space is highly non-linear ...
and yet, even in this restricted design space, we find EXTREMELY COMPLEX results. The SYSTEM is very SENSITIVE to CHANGES in these parameters: if you hold all else constant and vary one parameter, you can see extremely large changes in end performance ... up to a 40% difference by changing ONE PARAMETER by a FACTOR OF TWO (e.g. doubling the number of banks, doubling the size of the burst, doubling the number of channels, etc.).
(Chart: cycles per instruction for GCC with 32-, 64-, and 128-byte bursts, at 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 GB/s, over configurations from 1 chan x 1 byte up to 4 chan x 8 bytes. System Bandwidth (GB/s = Channels * Width * 800MHz).)
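The x-axis values follow directly from the bandwidth formula in the axis label; a quick sketch:

```python
# System bandwidth = channels x width x 800 MHz, for the plotted configurations.
configs = [(1, 1), (1, 2), (2, 1), (1, 4), (2, 2), (4, 1),
           (1, 8), (2, 4), (4, 2), (2, 8), (4, 4), (4, 8)]   # (channels, bytes)
for ch, width in configs:
    print(f"{ch} chan x {width} B -> {ch * width * 0.8:.1f} GB/s")
# Output runs from 0.8 GB/s (1 chan x 1 byte) up to 25.6 GB/s (4 chan x 8 bytes).
```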

DRAM TUTORIAL ISCA 2002 Motivation Bruce Jacob David Wang

University of Maryland
... and the cost of poor judgment is high.
so -- we have the worst possible scenario: a design space that is very sensitive to changes in parameters, and execution times that can vary by a FACTOR OF THREE from worst case to best. clearly, we would be well served to understand this design space.
(Chart: cycles per instruction for the worst, average, and best organization on the SPEC 2000 benchmarks bzip, gcc, mcf, parser, perl, and vpr, plus their average.)

DRAM TUTORIAL ISCA 2002 System-Level Model Bruce Jacob David Wang

University of Maryland
SDRAM Timing
so by now we're very familiar with this picture ... we cannot use it in this study, because it represents the interface between the DRAM and the MEMORY CONTROLLER. typically, the CPU's interface is much simpler: the CPU sends all of the address bits at once with CONTROL INFO (read/write), and the memory controller handles the bit addressing and the RAS/CAS timing.
(Timing diagram: clock; ACT then READ commands; row address then column address; valid data on DQ; shaded regions mark row access, column access, command transfer, overlap, and data transfer.)

DRAM TUTORIAL ISCA 2002 System-Level Model Bruce Jacob David Wang

University of Maryland
Timing diagrams are at the DRAM level … not the system level.
this gives the picture of what is happening at the SYSTEM LEVEL: the CPU-to-memory-controller traffic is shown as "ABUS Active", the memory-controller-to-DRAM activity is shown as "DRAM Bank Active", and the data read-out is shown as "DBUS Active".
(Timing diagram: the same SDRAM read as before, with the ABUS Active, DRAM Bank Active, and DBUS Active windows marked beneath it.)
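A minimal occupancy model of those three windows, as a sketch: the function and its parameters are mine, the example durations match the request shapes shown a few slides later, and the exact alignment of the windows is an assumption.

```python
# Occupancy intervals for one read at the system level. Alignment (bank starts
# after the address transfer; data appears t_dbus_setup later) is assumed.
def read_timeline(t0, t_abus, t_bank, t_dbus_setup, t_burst):
    abus = (t0, t0 + t_abus)                       # CPU -> controller address
    bank = (abus[1], abus[1] + t_bank)             # controller -> DRAM: ACT + CAS
    dbus = (abus[1] + t_dbus_setup,
            abus[1] + t_dbus_setup + t_burst)      # data back to the controller
    return {"ABUS Active": abus, "DRAM Bank Active": bank, "DBUS Active": dbus}

print(read_timeline(0, 10, 90, 70, 10))   # ns; e.g. a 32-byte burst on a 4-byte bus
```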

DRAM TUTORIAL ISCA 2002 System-Level Model Bruce Jacob David Wang

University of Maryland
Timing diagrams are at the DRAM level … not the system level.
if the DRAM's pins do not connect directly to the CPU (e.g. in a hierarchical bus organization, or if the data is funnelled through the memory controller, like a northbridge chipset), then there is yet another DBUS Active timing slot that follows below and to the right ... this can continue to extend to any number of hierarchical levels, as seen in huge server systems with hundreds of GB of DRAM.
(Timing diagram: the same read, now with DBUS1 Active followed by DBUS2 Active beneath the DRAM Bank Active window.)

DRAM TUTORIAL ISCA 2002 Request Timing Bruce Jacob David Wang

University of Maryland
so let's formalize this system-level interface. here is the request timing in a slightly different way, as well as an example system model taken from Schumann's paper describing the 21174 memory controller. the DRAM's data pins are connected directly to the CPU (the simplest possible model), the memory controller handles the RAS/CAS timing, and the CPU and memory controller talk only in terms of addresses and control information.
(Figure: backside bus to the cache; frontside address and data buses (800 MHz) from the CPU to the memory controller (MC) and DRAMs; the MC drives row/column addresses & control to the DRAMs.)
READ REQUEST TIMING: (diagram: starting at t0, the ADDRESS BUS is busy briefly, then the DRAM BANK, then the DATA BUS.)

DRAM TUTORIAL ISCA 2002 Read/Write Request Shapes Bruce Jacob David Wang University of Maryland
such a model gives us these types of request shapes for reads and writes. this shows a few example bus/burst configurations, in particular a 4-byte bus with burst sizes of 32, 64, and 128 bytes per burst.
READ REQUESTS (starting at t0):
  32-byte burst: ADDRESS BUS 10ns; DRAM BANK 90ns; DATA BUS 70ns + 10ns
  64-byte burst: ADDRESS BUS 10ns; DRAM BANK 90ns; DATA BUS 70ns + 20ns
  128-byte burst: ADDRESS BUS 10ns; DRAM BANK 100ns; DATA BUS 70ns + 40ns

WRITE REQUESTS (starting at t0):
  32-byte burst: ADDRESS BUS 10ns; DRAM BANK 90ns; DATA BUS 40ns + 10ns
  64-byte burst: ADDRESS BUS 10ns; DRAM BANK 90ns; DATA BUS 40ns + 20ns

  128-byte burst: ADDRESS BUS 10ns; DRAM BANK 90ns; DATA BUS 40ns + 40ns

DRAM TUTORIAL ISCA 2002 Pipelined/Split Transactions Bruce Jacob David Wang University of Maryland
the bus is PIPELINED and supports SPLIT TRANSACTIONS, where one request can be NESTLED inside of another if the timing is right.
(a) Legal if R/R to different banks: (diagram: two 90ns read shapes overlapped, the second request's bank activity beginning while the first is still in flight so the two data bursts follow one another on the data bus)

[explain examples]
(b) Nestling of writes inside reads is legal if R/W to different banks: (diagrams: two nestlings of a 90ns write inside a 90ns read, one legal if turnaround <= 10ns, the other legal if there is no turnaround). what we're trying to do is to fit these 2D puzzle pieces together in TIME.
(c) Back-to-back R/W pair that cannot be nestled: (diagram: a read (DRAM bank 100ns, data bus 70ns + 40ns) followed by a write (DRAM bank 90ns, data bus 40ns + 40ns) whose shapes cannot be overlapped).

DRAM TUTORIAL ISCA 2002 Channels & Banks Bruce Jacob David Wang University of Maryland
as for physical connections, here are the ways we modeled independent DRAM channels and independent BANKS per CHANNEL. the figure shows a few of the parameters that we study; in addition, we look at turnaround time (0 or 1 cycle) and queue size (0, 1, 2, 4, 8, 16, 32, or infinite requests per channel).
(Figure: one, two, and four independent channels of DRAMs behind their controllers, each with banking degrees of 1, 2, 4, ...)

1, 2, 4 800 MHz Channels; 8, 16, 32, 64 Data Bits per Channel; 1, 2, 4, 8 Banks per Channel (Indep.); 32, 64, 128 Bytes per Burst

DRAM TUTORIAL ISCA 2002 Burst Scheduling Bruce Jacob David Wang (Back-to-Back Read Requests) University of Maryland
how do you chunk up a cache block? (L2 caches use 128-byte blocks) [read the bullets]
(Figure: back-to-back read requests scheduled as 128-byte, 64-byte, and 32-byte bursts.)
LONGER BURSTS amortize the cost of activating and precharging the row over more data transferred. SHORTER BURSTS allow the critical word of a FOLLOWING REQUEST to be serviced sooner.
• Critical-burst-first
• Non-critical bursts are promoted
• Writes have lowest priority (tend to back up in the request queue …)
• Tension between large & small bursts: amortization vs. faster time to data
so -- this is not novel, but it is fairly aggressive. NOTE: we use a close-page, autoprecharge policy with ESDRAM-style buffering of the ROW in SRAM. result: we get the best possible precharge overlap, AND multiple burst requests to the same row will not re-invoke a RAS cycle unless an intervening READ request goes to a different ROW in the same BANK.

DRAM TUTORIAL ISCA 2002 New Bar-Chart Definition Bruce Jacob David Wang University of Maryland
we run a series of different simulations to get break-downs: CPU activity; memory activity overlapped with the CPU; non-overlapped activity due to the SYSTEM (queue, bus, ...); and non-overlapped activity due to the DRAM. so the top two bars are MEMORY STALL CYCLES and the bottom two are PERFECT MEMORY. Note: MEMORY LATENCY is not further divided into latency/bandwidth/etc.
Bar-chart breakdown (total = tREAL):
  Stalls Due to DRAM Latency = tREAL - tSYS
  Stalls Due to SYSTEM       = tSYS - tPROC
  CPU-Memory OVERLAP         = tPROC - (tREAL - tDRAM)
  CPU+L1+L2 Execution        = tREAL - tDRAM
• tPROC — CPU with 1-cycle L2 miss

• tREAL — realistic CPU/DRAM config

• tSYS — CPU with 1-cycle DRAM latency

• tDRAM — time seen by DRAM system DRAM TUTORIAL ISCA 2002 System Overhead Bruce Jacob David Wang

University of Maryland
Regular Bus Organization / 0-Cycle Bus Turnaround / Perfect Memory
so -- we're modeling a memory system that is fairly aggressive in terms of scheduling policies and support for concurrency, and we're trying to find which of the following is to blame for the most overhead: concurrency, latency, or the system (queueing, precharge, chunks, etc).
(Chart: cycles per instruction for 1, 2, 4, and 8 banks per channel at system bandwidths of 1.6, 3.2, and 6.4 GB/s (GB/s = Channels * Width * Speed), comparing the regular bus organization, 0-cycle bus turnaround, and perfect memory. Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus.)

DRAM TUTORIAL ISCA 2002 System Overhead Bruce Jacob David Wang

University of Maryland
Regular Bus Organization / 0-Cycle Bus Turnaround
the figure shows:
- system overhead is significant (usually 20-40% of the total memory overhead); system overhead is 10–100% over perfect memory
- the most significant overhead tends to be the DRAM latency
- turnaround is relatively insignificant (however, remember that this is an 800MHz bus system ...)
(Chart: cycles per instruction for 1, 2, 4, and 8 banks per channel at 1.6, 3.2, and 6.4 GB/s system bandwidth. Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus.)

DRAM TUTORIAL ISCA 2002 Concurrency Effects Bruce Jacob David Wang University of Maryland
1 Bank per Channel / 2 Banks per Channel / 4 Banks per Channel / 8 Banks per Channel
the figure also shows that increasing BANKS PER CHANNEL gives you almost as much benefit as increasing CHANNEL BANDWIDTH, which is much more costly to implement

=> clearly, there are some concurrency effects going on, and we'd like to quantify and better understand them.
(Chart: cycles per instruction vs. system bandwidth (1.6, 3.2, 6.4 GB/s = channels * width * speed) for 1, 2, 4, and 8 banks per channel. Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus.)
Banks/channel as significant as channel BW

DRAM TUTORIAL ISCA 2002 Bandwidth vs. Burst Width Bruce Jacob David Wang

University of Maryland
32-Byte Burst / 64-Byte Burst / 128-Byte Burst
take a look at a larger portion of the design space. x-axis: SYSTEM BANDWIDTH, which == channels x channel width x 800MHz; y-axis: execution time of the application on the given configuration; the different colored bars represent different burst widths.
MEMORY OVERHEAD is substantial. obvious trend: more bandwidth is better. another obvious trend: more bandwidth is NOT NECESSARILY better ...
(Chart: cycles per instruction at 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 GB/s, over configurations from 1 chan x 1 byte up to 4 chan x 8 bytes. System Bandwidth (GB/s = Channels * Width * 800MHz). Benchmark = GCC (SPEC 2000), 2 banks/channel.)

DRAM TUTORIAL ISCA 2002 Bandwidth vs. Burst Width Bruce Jacob David Wang

University of Maryland
32-Byte Burst / 64-Byte Burst / 128-Byte Burst
so -- there are some obvious trade-offs related to BURST SIZE, which can affect the TOTAL EXECUTION TIME by 30% or more, keeping all else constant.
(Chart: same axes and configurations as the previous slide. Benchmark = GCC (SPEC 2000), 2 banks/channel.)

DRAM TUTORIAL ISCA 2002 Bandwidth vs. Burst Width Bruce Jacob David Wang

University of Maryland
32-Byte Burst / 64-Byte Burst / 128-Byte Burst
if we look more closely at individual system organizations, there are some clear RULES OF THUMB that appear ...
(Chart: same axes and configurations as the previous slide. Benchmark = GCC (SPEC 2000), 2 banks/channel.)

DRAM TUTORIAL ISCA 2002 Bandwidth vs. Burst Width Bruce Jacob David Wang

University of Maryland
32-Byte Burst / 64-Byte Burst / 128-Byte Burst
for one, LARGE BURSTS are optimal for WIDER CHANNELS: wide channels (32/64-bit) want large bursts.
(Chart: same axes and configurations as the previous slide. Benchmark = GCC (SPEC 2000), 2 banks/channel.)

DRAM TUTORIAL ISCA 2002 Bandwidth vs. Burst Width Bruce Jacob David Wang

University of Maryland
32-Byte Burst / 64-Byte Burst / 128-Byte Burst
for another, SMALL BURSTS are optimal for NARROW CHANNELS: narrow channels (8-bit) want small bursts.
(Chart: same axes and configurations as the previous slide. Benchmark = GCC (SPEC 2000), 2 banks/channel.)

DRAM TUTORIAL ISCA 2002 Bandwidth vs. Burst Width Bruce Jacob David Wang

University of Maryland
32-Byte Burst / 64-Byte Burst / 128-Byte Burst
and MEDIUM BURSTS are optimal for MEDIUM CHANNELS: medium channels (16-bit) want medium bursts.
so -- if CONCURRENCY were all-important, we would expect small bursts to be best, because they would allow a LOWER AVERAGE TIME-TO-CRITICAL-WORD for a larger number of simultaneous requests. what we actually see is that the optimal burst width scales with the bus width, suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle. i'll illustrate that ...
(Chart: same axes and configurations as the previous slide. Benchmark = GCC (SPEC 2000), 2 banks/channel.)

DRAM TUTORIAL ISCA 2002 Burst Width Scales with Bus Bruce Jacob David Wang

University of Maryland
Range of Burst-Widths Modeled
this figure shows the entire range of burst widths that we modeled. note that some rows of the figure represent several different combinations ... for example, the one THIRD DOWN FROM TOP is a 2-byte channel + 32-byte burst, a 4-byte channel + 64-byte burst, or an 8-byte channel + 128-byte burst.
  ADDRESS BUS 10ns; DRAM BANK 90ns;  DATA BUS 70ns + 5ns   -> 64-bit channel x 32-byte burst
  ADDRESS BUS 10ns; DRAM BANK 90ns;  DATA BUS 70ns + 10ns  -> 64-bit x 64-byte, 32-bit x 32-byte
  ADDRESS BUS 10ns; DRAM BANK 90ns;  DATA BUS 70ns + 20ns  -> 64-bit x 128-byte, 32-bit x 64-byte, 16-bit x 32-byte
  ADDRESS BUS 10ns; DRAM BANK 100ns; DATA BUS 70ns + 40ns  -> 32-bit x 128-byte, 16-bit x 64-byte, 8-bit x 32-byte
  ADDRESS BUS 10ns; DRAM BANK 140ns; DATA BUS 70ns + 80ns  -> 16-bit x 128-byte, 8-bit x 64-byte
  ADDRESS BUS 10ns; DRAM BANK 220ns; DATA BUS 70ns + 160ns -> 8-bit x 128-byte

DRAM TUTORIAL ISCA 2002 Burst Width Scales with Bus Bruce Jacob David Wang

University of Maryland
Range of Burst-Widths Modeled
the optimal configurations are in the middle of the range, suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle. [BOTTOM] -- too many transfers per burst crowds out other requests; [TOP] -- too few transfers per request lets the bank overhead (the activation/precharge cycle) dominate. however, though this tells us how best to organize a channel with a given bandwidth, these rules of thumb do not say anything about how the different configurations (wide/narrow/medium channels) compare to each other ... so let's focus on multiple configurations for ONE BANDWIDTH CLASS ...
(Figure: the same range of request shapes as the previous slide, with the OPTIMAL BURST WIDTHS in the middle of the range.)

DRAM TUTORIAL ISCA 2002 Focus on 3.2 GB/s — MCF Bruce Jacob David Wang University of Maryland
1 Bank per Channel / 2 Banks per Channel / 4 Banks per Channel / 8 Banks per Channel
this is like saying i have 4 800MHz 8-bit Rambus channels ... what should i do? gang them together? keep them independent? something in between? like before, we see that more banks is better, but not always by much.
(Chart: cycles per instruction for 32-, 64-, and 128-byte bursts over 1 chan x 4 bytes, 2 chan x 2 bytes, and 4 chan x 1 byte, all at 3.2 GB/s system bandwidth (channels x width x speed).)

DRAM TUTORIAL ISCA 2002 Focus on 3.2 GB/s — MCF Bruce Jacob David Wang 1 Bank per Channel 2 Banks per Channel

4

3

2 Cycles per Instruction (CPI)

1

#Banks not particularly important given large burst sizes ... 0 32-Byte Burst 64-Byte Burst 128-Byte Burst

1 chan x 42 bytes chan x 2 bytes4 chan x 1 byte1 chan x 42 bytes chan x 2 bytes4 chan x 1 byte1 chan x 4 bytes2 chan x 2 bytes4 chan x 1 byte 3.2 GB/s System Bandwidth (channels x width x speed) DRAM TUTORIAL ISCA 2002 Focus on 3.2 GB/s — MCF Bruce Jacob David Wang 1 Bank per Channel 2 Banks per Channel University of 7 Maryland 4 Banks per Channel 8 Banks per Channel 6 with multiple independent channels, you have a degree of concurrency that, to some extent, OBVIATES THE NEED 5 for BANKING

4

3

2 Cycles per Instruction (CPI)

1

... even less so with multi-channel systems 0 32-Byte Burst 64-Byte Burst 128-Byte Burst

1 chan x 42 bytes chan x 2 bytes4 chan x 1 byte1 chan x 42 bytes chan x 2 bytes4 chan x 1 byte1 chan x 4 bytes2 chan x 2 bytes4 chan x 1 byte 3.2 GB/s System Bandwidth (channels x width x speed) DRAM TUTORIAL ISCA 2002 Focus on 3.2 GB/s — MCF Bruce Jacob David Wang 1 Bank per Channel 2 Banks per Channel University of 7 Maryland 4 Banks per Channel 8 Banks per Channel 6 however, trying to reduce execution time via multiple channels is a bit risky: 5 it is very SENSITIVE to BURST SIZE, becaus emultiple channels givees the longest 4 latency possible to EVERY REQUEST in the system, and having LONG BURSTS exacerbates that problem -- byt 3 increasing the length of time that a request can be delayed by waiting for another ahead of it in the queue 2 Cycles per Instruction (CPI)

1

Multi-channel systems sometimes (but not always) a good idea 0 32-Byte Burst 64-Byte Burst 128-Byte Burst

1 chan x 42 bytes chan x 2 bytes4 chan x 1 byte1 chan x 42 bytes chan x 2 bytes4 chan x 1 byte1 chan x 4 bytes2 chan x 2 bytes4 chan x 1 byte 3.2 GB/s System Bandwidth (channels x width x speed) DRAM TUTORIAL ISCA 2002 Focus on 3.2 GB/s — MCF Bruce Jacob David Wang 1 Bank per Channel 2 Banks per Channel University of 7 Maryland 4 Banks per Channel 8 Banks per Channel 6 another way to look at it: how does the choice of burst size affect the NUMBER or 5 WIDTH of channels?

=> WIDE CHANNELS: an 4 improvement is seen by increasing the burst size

=> NARROW CHANNELS: 3 either no improvement is seen, or a slight degradation is seen by increasing burst size 2 Cycles per Instruction (CPI)

1 4x 1-byte channels 2x 2-byte channels 0 32-Byte Burst 64-Byte Burst 128-Byte Burst 1x 4-byte channels

(x-axis: 1 chan x 4 bytes, 2 chan x 2 bytes, 4 chan x 1 byte under each burst size) 3.2 GB/s System Bandwidth (channels x width x speed)

DRAM TUTORIAL ISCA 2002 Focus on 3.2 GB/s — BZIP Bruce Jacob David Wang 1 Bank per Channel 2 Banks per Channel University of Maryland 4 Banks per Channel 8 Banks per Channel
we see the same trends in all the benchmarks surveyed; the only thing that changes is some of the relations between parameters ...

1 Cycles per Instruction (CPI)

0 32-Byte Burst 64-Byte Burst 128-Byte Burst

(x-axis: 1 chan x 4 bytes, 2 chan x 2 bytes, 4 chan x 1 byte under each burst size) 3.2 GB/s System Bandwidth (channels x width x speed)

DRAM TUTORIAL ISCA 2002 Focus on 3.2 GB/s — BZIP Bruce Jacob David Wang 1 Bank per Channel 2 Banks per Channel University of Maryland 4 Banks per Channel 8 Banks per Channel
for example, in BZIP, the best configurations are at smaller burst sizes than for MCF. however, though THE OPTIMAL CONFIG changes from benchmark to benchmark, there are always several configs that are within 5-10% of the optimal config -- IN ALL BENCHMARKS

BEST CONFIGS are at SMALLER BURST SIZES 0 32-Byte Burst 64-Byte Burst 128-Byte Burst

1 chan x 42 bytes chan x 2 bytes4 chan x 1 byte1 chan x 42 bytes chan x 2 bytes4 chan x 1 byte1 chan x 4 bytes2 chan x 2 bytes4 chan x 1 byte 3.2 GB/s System Bandwidth (channels x width x speed) DRAM TUTORIAL ISCA 2002 Queue Size & Reordering Bruce Jacob David Wang BZIP: 1.6 GB/s (1 channel) University of Maryland 3 Infinite Queue 32-Entry Queue 1-Entry Queue No Queue 2

1 Cycles per Instruction (CPI)

0

1 bank/channel 1 bank/channel 1 bank/channel 2 banks/channel4 banks/channel8 banks/channel 2 banks/channel4 banks/channel8 banks/channel 2 banks/channel4 banks/channel8 banks/channel 32-byte burst 64-byte burst 128-byte burst DRAM TUTORIAL ISCA 2002 Conclusions Bruce Jacob David Wang

University of Maryland
DESIGN SPACE is NON-LINEAR, COST of MISJUDGING is HIGH: we have a complex design space where neighboring designs differ significantly.
CAREFUL TUNING YIELDS 30–40% GAIN: if you are careful, you can beat the performance of the average organization by 30-40%.
MORE CONCURRENCY == BETTER (but not at the expense of LATENCY): supporting memory concurrency improves system performance, as long as it is not done at the expense of memory latency:
• Via Channels → NOT w/ LARGE BURSTS: using MULTIPLE CHANNELS is good, but not the best solution
• Via Banks → ALWAYS SAFE: multiple banks/channel is always a good idea
• Via Bursts → DOESN’T PAY OFF: trying to interleave small bursts is intuitively appealing, but it doesn’t work
• Via MSHRs → NECESSARY: MSHRs are always a good idea

BURSTS AMORTIZE COST OF PRECHARGE: in general, bursts should be large enough to amortize the precharge cost. Direct Rambus = 16 bytes; DDR2 = 16/32 bytes.
• Typical Systems: 32 bytes (even DDR2) → THIS IS NOT ENOUGH

DRAM TUTORIAL ISCA 2002 Outline Bruce Jacob David Wang • University of Basics Maryland • DRAM Evolution: Structural Path NOTE • Advanced Basics

• DRAM Evolution: Interface Path

• Future Interface Trends & Research Areas

• Performance Modeling: Architectures, Systems, Embedded DRAM TUTORIAL ISCA 2002 Embedded DRAM Primer Bruce Jacob David Wang

University of Maryland

NOTE CPU DRAM Core Array

Embedded

CPU DRAM Core Array

Not Embedded DRAM TUTORIAL ISCA 2002 Whither Embedded DRAM? Bruce Jacob David Wang

University of Report, August 1996: “[Five] Maryland Architects Look to Processors of Future”

NOTE • Two predict imminent merger of CPU and DRAM

• Another states we cannot keep cramming more data over the pins at faster rates (implication: embedded DRAM)

• A fourth wants gigantic on-chip L3 cache (perhaps DRAM L3 implementation?)

SO WHAT HAPPENED? DRAM TUTORIAL ISCA 2002 Embedded DRAM for DSPs Bruce Jacob David Wang

University of Maryland
MOTIVATION
TAGLESS SRAM: SOFTWARE manages the movement of data in the space. The address space includes both "cache" and primary memory (and memory-mapped I/O). Moving data from memory space to "cache" space creates a new, equivalent data object, not a mere copy of the original.
TRADITIONAL CACHE (hardware-managed): the cache "covers" the entire address space: any datum may be cached. HARDWARE manages the movement of data. The address space includes only primary memory (and memory-mapped I/O). Copying from memory to cache creates a subordinate copy of the datum that is kept consistent with the datum still in memory; hardware ensures consistency.
Tagless SRAM: NON-TRANSPARENT addressing, EXPLICITLY MANAGED contents. Traditional cache: TRANSPARENT addressing, TRANSPARENTLY MANAGED contents.

DSP Compilers => Transparent Cache Model DRAM TUTORIAL ISCA 2002 DSP Buffer Organization Bruce Jacob David Wang Used for Study University of Maryland Fully Assoc 4-Block Cache

victim-0 victim-1 NOTE S0 S1

buffer-0 buffer-1 LdSt0 LdSt1

DSP

LdSt0 LdSt1

DSP

Bandwidth vs. Die-Area Trade-Off for DSP Performance DRAM TUTORIAL ISCA 2002 E-DRAM Performance Bruce Jacob David Wang Embedded Networking Benchmark - Patricia University of Maryland 200MHz C6000 DSP : 50, 100, 200 MHz Memory 10 Cache Line Size NOTE # 32 bytes 64 bytes 8 128 bytes 256 bytes # # 512 bytes 6 1024 bytes

#

CPI 4

# 2 # # # # # # # #

0

0.4 GB/s 0.8 GB/s 1.6 GB/s 3.2 GB/s 6.4 GB/s 12.8 GB/s 25.6 GB/s Bandwidth DRAM TUTORIAL ISCA 2002 E-DRAM Performance Bruce Jacob David Wang Embedded Networking Benchmark - Patricia University of Maryland 200MHz C6000 DSP : 50MHz Memory 10 Cache Line Size NOTE # 32 bytes 64 bytes 8 128 bytes 256 bytes # # 512 bytes 6 1024 bytes

#

CPI 4

# 2 # # # # # Increasing bus width # # # 0

0.4 GB/s 0.8 GB/s 1.6 GB/s 3.2 GB/s 6.4 GB/s 12.8 GB/s 25.6 GB/s Bandwidth DRAM TUTORIAL ISCA 2002 E-DRAM Performance Bruce Jacob David Wang Embedded Networking Benchmark - Patricia University of Maryland 200MHz C6000 DSP : 100MHz Memory 10 Cache Line Size NOTE # 32 bytes 64 bytes 8 128 bytes 256 bytes # # 512 bytes 6 1024 bytes

#

CPI 4

# 2 # # # # # Increasing bus width # # # 0

0.4 GB/s 0.8 GB/s 1.6 GB/s 3.2 GB/s 6.4 GB/s 12.8 GB/s 25.6 GB/s Bandwidth DRAM TUTORIAL ISCA 2002 E-DRAM Performance Bruce Jacob David Wang Embedded Networking Benchmark - Patricia University of Maryland 200MHz C6000 DSP: 200MHz Memory 10 Cache Line Size NOTE # 32 bytes 64 bytes 8 128 bytes 256 bytes # # 512 bytes 6 1024 bytes

#

CPI 4

# 2 # # # # # # # # Increasing bus width 0

0.4 GB/s 0.8 GB/s 1.6 GB/s 3.2 GB/s 6.4 GB/s 12.8 GB/s 25.6 GB/s Bandwidth DRAM TUTORIAL ISCA 2002 Performance-Data Sources Bruce Jacob David Wang “A Performance Study of Contemporary DRAM Architectures,” University of Proc. ISCA ’99. V. Cuppu, B. Jacob, B. Davis, and T. Mudge. Maryland “Organizational Design Trade-Offs at the DRAM, Memory Bus, and Memory Controller Level: Initial Results,” University of Maryland Technical Report UMD-SCA-TR-1999-2. V. Cuppu and B. Jacob.

“DDR2 and Low Latency Variants,” Memory Wall Workshop 2000, in conjunction w/ ISCA ’00. B. Davis, T. Mudge, V. Cuppu, and B. Jacob. “Concurrency, Latency, or System Overhead: Which Has the Largest Impact on DRAM-System Performance?” Proc. ISCA ’01. V. Cuppu and B. Jacob.

“Transparent Data-Memory Organizations for Digital Signal Processors,” Proc. CASES ’01. S. Srinivasan, V. Cuppu, and B. Jacob.

“High Performance DRAMs in Workstation Environments,” IEEE Transactions on Computers, November 2001. V. Cuppu, B. Jacob, B. Davis, and T. Mudge.

Recent experiments by Sadagopan Srinivasan, Ph.D. student at University of Maryland. DRAM TUTORIAL ISCA 2002 Acknowledgments Bruce Jacob David Wang

University of The preceding work was supported Maryland in part by the following sources:

• NSF CAREER Award CCR-9983618

• NSF grant EIA-9806645

• NSF grant EIA-0000439

• DOD award AFOSR-F496200110374

• … and by Compaq and IBM. DRAM TUTORIAL ISCA 2002 CONTACT INFO Bruce Jacob David Wang

University of Bruce Jacob Maryland Electrical & Computer Engineering University of Maryland, College Park http://www.ece.umd.edu/~blj/ [email protected]

Dave Wang

Electrical & Computer Engineering University of Maryland, College Park http://www.wam.umd.edu/~davewang/ [email protected]

UNIVERSITY OF MARYLAND