why memory is a hard problem in modern computer architectures

Felix Held

what this talk is (not) about

• very brief and high-level overview
• challenges of scaling the memory subsystem
• a lot of details omitted
• not about the physical memory interface or its training

memory subsystem: challenges

• throughput
• latency
• physical memory interfaces

throughput

• relatively easy to increase:
  • more data lanes
  • higher symbol rate on lanes
  • multiple bits per symbol (not used for DRAM yet)

latency

• hard to decrease, involves fighting physics:
  • speed of light in a medium: interconnect
  • RC time constants
    • C: MOSFET gates, storage capacitors
    • R: interconnect
  • doesn't really decrease with die shrinks

memory subsystem: challenges

• data throughput steadily increases
• time to get a data word stays about the same

DDRx DRAM chips

• optimized for:
  • capacity
  • cost
• architecture:
  • command bus
  • bidirectional data bus
  • multiple banks
  • organized in rows and columns

[Figures: "512 Meg x 8 Functional Block Diagram" and "256 Meg x 16 Functional Block Diagram" from the Micron 4Gb DDR3 SDRAM data sheet. Image source: https://www.micron.com/~/media/documents/products/data-sheet/dram/ddr3/4gb_ddr3_sdram.pdf]

DDRx DRAM chips

• data rate is higher than the command rate
  • double data rate applies only to data
  • two times the command rate for 1T command rate (usually used)
  • four times for 2T command rate
• burst data transfer with a typical length of 8 data words (worked numbers below)
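
To put numbers on these ratios, a back-of-the-envelope sketch in C, assuming a hypothetical DDR3-1600 interface (800 MHz clock; all figures are illustrative):

```c
#include <stdio.h>

/* Back-of-the-envelope numbers for a hypothetical DDR3-1600 interface:
 * 800 MHz clock, data transferred on both clock edges (DDR). */
int main(void) {
    double clk_mhz     = 800.0;
    double data_rate   = 2.0 * clk_mhz;   /* 1600 MT/s */
    double cmd_rate_1t = clk_mhz;         /* 1T: one command slot per clock */
    double cmd_rate_2t = clk_mhz / 2.0;   /* 2T: one command slot per two clocks */

    printf("data rate: %.0f MT/s\n", data_rate);
    printf("data rate / command rate: %.0fx (1T), %.0fx (2T)\n",
           data_rate / cmd_rate_1t, data_rate / cmd_rate_2t);
    /* a burst of 8 data words occupies 8 / 2 = 4 clock cycles on the bus */
    printf("burst of 8 words: %.0f clock cycles\n", 8.0 / 2.0);
    return 0;
}
```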

DDR3 DRAM

state machine (simplified): IDLE → ACTIVATE → READ/WRITE → PRECHARGE → IDLE

• ACTIVATE: open row
  • tRCD delay until read/write
• READ/WRITE: access column
  • READ: CL delay until the first data word is on the bus
  • WRITE: CWL delay until the first data word is read from the bus
• PRECHARGE: close row
  • tRP delay until the next ACTIVATE on the same bank (worked latency numbers below)
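
How these delays add up, as a minimal sketch assuming a hypothetical DDR3-1600 part with 11-11-11 (tRCD-CL-tRP) timings; the numbers are only for illustration:

```c
#include <stdio.h>

int main(void) {
    double clk_mhz      = 800.0;            /* DDR3-1600: 800 MHz clock */
    double ns_per_cycle = 1000.0 / clk_mhz; /* 1.25 ns per cycle */
    int tRCD = 11, CL = 11, tRP = 11;       /* timings in clock cycles */

    /* row already open: only the CAS latency applies */
    printf("row hit:      %.2f ns\n", CL * ns_per_cycle);
    /* row closed: ACTIVATE, then READ */
    printf("row miss:     %.2f ns\n", (tRCD + CL) * ns_per_cycle);
    /* another row open in the same bank: PRECHARGE + ACTIVATE + READ */
    printf("row conflict: %.2f ns\n", (tRP + tRCD + CL) * ns_per_cycle);
    return 0;
}
```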

memory controller

• process load/store requests from CPU/GPU/DMA/…
• prevent collisions on the data bus
• RAM refresh
• make sure the RAM stays within its thermal budget
• map physical addresses to memory locations
  • channel, rank, bank, row addresses
• schedule memory accesses
  • higher memory bandwidth utilization

memory controller: address mapping

• channels
  • typically interleaved
    • higher bandwidth in one memory region
    • a bit more power consumption
• ranks
  • command and data bus shared, internal state independent
  • mostly used for more memory capacity

memory controller: address mapping

• banks
  • typically interleaved
    • works around tRP restrictions (delay before opening a new row in the same bank)
    • a bit more power consumption
• row addresses
  • often not contiguous; bank interleaving instead

memory controller: address mapping

• column addresses
  • direct mapping of the least significant bits to the column address (decoding sketch below)
  • accessing different columns in one open row: low latency
  • memory access patterns are typically contiguous blocks
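
As an illustration of such a mapping, a minimal sketch in C; the bit positions, field widths, and the `decode` helper are all invented for this example, since real controllers pick them based on DIMM geometry and interleaving policy:

```c
#include <stdint.h>
#include <stdio.h>

/* DRAM coordinates for a made-up mapping:
 * 2 channels, 2 ranks, 8 banks, 128 columns, 64K rows. */
typedef struct {
    unsigned channel, rank, bank, row, column;
} dram_addr;

static dram_addr decode(uint64_t pa) {
    dram_addr d;
    /* bits [5:0]: byte offset inside a 64-byte burst (ignored here) */
    d.channel = (pa >> 6)  & 0x1;    /* cache-line interleaved channels  */
    d.column  = (pa >> 7)  & 0x7F;   /* low bits -> column: sequential   */
                                     /* accesses stay in the open row    */
    d.bank    = (pa >> 14) & 0x7;    /* banks interleaved to hide tRP    */
    d.rank    = (pa >> 17) & 0x1;
    d.row     = (pa >> 18) & 0xFFFF; /* high bits select the row         */
    return d;
}

int main(void) {
    dram_addr d = decode(0x12345678);
    printf("ch %u rank %u bank %u row 0x%x col 0x%x\n",
           d.channel, d.rank, d.bank, d.row, d.column);
    return 0;
}
```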

caches

• help solve part of the latency problem
• physics TL;DR:
  • large memory → high latency
  • small memory → lower latency
• memory accesses often exhibit temporal and spatial locality
• keep a small region of memory in a fast memory
• (more or less) transparent

caches

• address bits, except the offset within the cache line, get hashed
• the hash is used as the address of the cache line in the cache
• higher address bits are stored as the tag (index/tag sketch below)
• associativity: number of possible locations in the cache for one hash value
• replacement strategies:
  • least recently used
  • least frequently used
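
A minimal sketch of the index/tag split, assuming a hypothetical 32 KiB, 8-way set-associative cache with 64-byte lines (so 64 sets); the simplest possible "hash" is just taking the next few address bits:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BITS 6   /* 64-byte line -> 6 offset bits */
#define SET_BITS  6   /* 32768 / (8 ways * 64 bytes) = 64 sets -> 6 bits */

int main(void) {
    uint64_t addr = 0x7ffe12345678;   /* arbitrary example address */

    uint64_t offset = addr & ((1u << LINE_BITS) - 1);
    /* the "hash": the address bits directly above the line offset */
    uint64_t set    = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    /* everything above index + offset must be stored as the tag */
    uint64_t tag    = addr >> (LINE_BITS + SET_BITS);

    /* with 8-way associativity, the line may sit in any of the
     * 8 ways of its set; the tag decides whether it is a hit */
    printf("offset %llu, set %llu, tag 0x%llx\n",
           (unsigned long long)offset, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}
```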

caches

• common cache line size on modern x86: 64 bytes
• DDRx DIMMs have 64 usable data bits
• DDRx burst length is usually set to 8
• → 64 bytes per burst, matching the cache line size

cache hierarchy

• bigger cache → slower cache
• combine bigger, slower caches with smaller, but faster caches
• inclusive or exclusive

cache write strategies

• write through:
  • every write is propagated to the next cache level or main memory
  • less state tracking needed
  • more write bandwidth used
• write back: (toy sketch after this slide)
  • contents are only written back when the line gets evicted
  • fewer writes to the higher cache level or main memory
  • more state needs to be tracked and propagated

CPU microarchitecture: load buffer, out-of-order execution

• load buffers:
  • memory reads have poorly predictable latency
  • waiting for data shouldn't stall the pipeline
• single-thread performance: out-of-order execution
  • execute instructions that don't depend on the pending read
  • try to keep the processor busy
• multi-thread performance: (possibly) hyper-threading
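
Back to the write strategies above: a toy single-line cache in C that shows the extra state (the dirty bit) write-back needs; every name here is invented for the sketch:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum policy { WRITE_THROUGH, WRITE_BACK };

static uint8_t main_memory[1024];

struct line {
    bool     valid;
    bool     dirty;   /* only meaningful for write-back */
    uint32_t addr;    /* which memory byte this line caches */
    uint8_t  data;
};

static void cache_write(struct line *l, enum policy p,
                        uint32_t addr, uint8_t value) {
    if (l->valid && l->dirty && l->addr != addr)
        main_memory[l->addr] = l->data;  /* write-back: evict old contents */
    l->valid = true;
    l->addr  = addr;
    l->data  = value;
    if (p == WRITE_THROUGH) {
        main_memory[addr] = value;       /* every write reaches memory */
        l->dirty = false;
    } else {
        l->dirty = true;                 /* memory updated only on eviction */
    }
}

int main(void) {
    struct line l = {0};
    cache_write(&l, WRITE_BACK, 42, 7);  /* memory[42] is still stale */
    printf("after WB write: mem[42]=%u (dirty=%d)\n",
           (unsigned)main_memory[42], l.dirty);
    cache_write(&l, WRITE_BACK, 100, 9); /* eviction writes mem[42]=7 */
    printf("after eviction: mem[42]=%u\n", (unsigned)main_memory[42]);
    return 0;
}
```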

cache prefetching

• speculative loading of memory contents into cache lines
  • minimizes read latency
• instruction fetch: branch prediction
• data reads:
  • access pattern prediction
  • cache hinting instructions

cache prefetching

• right data prefetched:
  • latency reduction
• wrong data prefetched:
  • unnecessary memory bandwidth usage
  • eviction of other cache lines
• prefetch instructions let software give the processor hints (example below)
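
One way to hand such a hint to the CPU from C, using the GCC/Clang `__builtin_prefetch` builtin; the list walk is an invented example, and whether the hint pays off depends on how much work hides the prefetch latency:

```c
#include <stddef.h>

struct node {
    long value;
    struct node *next;
};

/* Walk a linked list, hinting the cache about the next node while the
 * current one is being processed. Pointer chasing is a case hardware
 * prefetchers handle poorly, so an explicit hint can help here. */
long list_sum(const struct node *n) {
    long s = 0;
    while (n != NULL) {
        if (n->next != NULL)
            __builtin_prefetch(n->next, 0 /* for read */, 3 /* keep cached */);
        s += n->value;
        n = n->next;
    }
    return s;
}
```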

cache coherency

• a problem in symmetric multiprocessing (SMP) systems
• different cores should always see the same data at one location at any given time
• programming models for SMP assume cache coherency
• doesn't scale well with the number of cores
• bus snooping vs. directory-based
• write-invalidate vs. write-update
• different cache coherency protocols

bonus slide: memory reordering

• under certain conditions reads and writes can be reordered
• which reorderings are allowed is specified in the instruction set architecture
• needs to be taken into account when building synchronization primitives
• insert (expensive) memory barriers where necessary (sketch below)
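
A minimal sketch of such a primitive with C11 atomics; `producer`, `consumer`, and the flag are invented for the example. The release/acquire pair is where the compiler and CPU are told not to reorder:

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;
static atomic_bool ready = false;

void producer(void) {
    payload = 42;                    /* plain store */
    /* release: the payload write may not be reordered after this store */
    atomic_store_explicit(&ready, true, memory_order_release);
}

int consumer(void) {
    /* acquire: later reads may not be reordered before this load */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                            /* spin until the flag is set */
    return payload;                  /* guaranteed to observe 42 */
}
```

Thank you for your attention. Questions?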