Felix Held What This Talk Is (Not) About

why memory is a hard problem in modern computer architectures
Felix Held

what this talk is (not) about
• very brief and high-level overview
• challenges of scaling the memory subsystem
• a lot of details omitted
• not about the physical memory interface or its training

memory subsystem: challenges
• throughput
• latency
• physical memory interfaces

throughput
• relatively easy to increase:
  • more data lanes
  • higher symbol rate on the lanes
  • multiple bits per symbol (not used for DRAM yet)

latency
• hard to decrease; involves fighting physics:
  • speed of light in a medium: interconnect
  • RC time constants
    • C: MOSFET gates, storage capacitors
    • R: interconnect
• doesn't really decrease with die shrinks

memory subsystem: challenges
• data throughput steadily increases
• time to get a data word stays about the same

DDRx DRAM chips
• optimized for capacity and cost
• architecture:
  • command bus
  • bidirectional data bus
  • multiple banks, organized in rows and columns

[Figure: functional block diagrams of the Micron 4Gb DDR3 SDRAM (512 Meg x8: 8 banks, each 65,536 rows x 128 columns x 64 bits; 256 Meg x16: 8 banks, each 32,768 rows x 128 columns x 128 bits). Image source: https://www.micron.com/~/media/documents/products/data-sheet/dram/ddr3/4gb_ddr3_sdram.pdf]
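The bank/row/column organization in the block diagram above can be made concrete with a small address split. A minimal sketch, assuming the x8 geometry from the figure (8 banks, 65,536 rows, 128 columns of 64 bits each) and an illustrative column-bank-row field order — real memory controllers choose their own mapping:

```c
#include <assert.h>
#include <stdint.h>

/* Geometry of the 512 Meg x8 part from the figure above. */
#define COL_BITS  7   /* 128 columns   */
#define BANK_BITS 3   /* 8 banks       */
#define ROW_BITS  16  /* 65,536 rows   */

/* Total capacity in bits: banks * rows * columns * 64 bits per column. */
static inline uint64_t chip_capacity_bits(void) {
    return (uint64_t)8 * 65536 * 128 * 64;
}

/* Illustrative field order (low to high): column | bank | row. */
static inline uint32_t col_of(uint32_t a)  { return a & ((1u << COL_BITS) - 1); }
static inline uint32_t bank_of(uint32_t a) { return (a >> COL_BITS) & ((1u << BANK_BITS) - 1); }
static inline uint32_t row_of(uint32_t a)  { return (a >> (COL_BITS + BANK_BITS)) & ((1u << ROW_BITS) - 1); }
```

Multiplying the fields out reproduces the 4Gb capacity stated in the datasheet title, which is a quick sanity check on the geometry.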
DDRx DRAM chips
• data rate is higher than the command rate:
  • double data rate applies only to data
  • two times the command rate for 1T command rate (usually used)
  • four times for 2T command rate
• burst data transfers with a length of typically 8 data words

DDR3 DRAM state machine
• IDLE → ACTIVATE: open row; tRCD delay until READ/WRITE
• READ/WRITE: access column
  • READ: CL delay until the first data word is on the bus
  • WRITE: CWL delay until the first data word is read from the bus
• PRECHARGE: close row; tRP delay until the next ACTIVATE on the same bank

memory controller
• processes load/store requests from CPU/GPU/DMA/…
• preempts collisions on the data bus
• RAM refresh
• makes sure the RAM stays within its thermal budget
• maps physical addresses to memory locations: channel, rank, bank, row addresses
• schedules memory accesses for higher memory bandwidth utilization

memory controller: address mapping
• channels
  • typically interleaved: higher bandwidth in one memory region, a bit more power consumption
• ranks
  • command and data bus shared, internal state independent
  • mostly for more memory capacity

memory controller: address mapping
• banks
  • typically interleaved: works around tRP restrictions (the delay to open a new row in the same bank), a bit more power consumption
• row addresses
  • often not contiguous; instead, bank interleaving

memory controller: address mapping
• column addresses
  • direct mapping of the least significant bits to the column address
  • accessing different columns in one row: low latency
  • memory access patterns: typically contiguous blocks

caches
• help solve part of the latency problem
• physics TL;DR: large memory → high latency; small memory → lower latency
• memory accesses often show temporal and spatial locality
• keep a small region of memory in a fast memory
• (more or less) transparent

caches
• the address bits except the offset within the cache line get hashed
• the hash is used as the address of the cache line in the cache
• the higher address bits are stored as a tag
• associativity: number of possible locations in the cache for one hash value
• replacement strategies: least recently used, least frequently used

caches
• common cache line size on modern x86: 64 bytes
• DDRx DIMMs have 64 usable data bits
• DDRx burst length is usually set to 8
• → 64 bytes, matches the cache line size

cache hierarchy
• bigger cache → slower cache
• combination of bigger, slower caches and smaller, but faster caches
• inclusive or exclusive

cache write strategies
• write through:
  • every write is propagated to the next cache level or main memory
  • less state tracking needed
  • more write bandwidth used
• write back:
  • contents are only written back when the line gets evicted
  • fewer writes to the next cache level or main memory
  • more state needs to be tracked and propagated

CPU microarchitecture: load buffers, out-of-order execution
• load buffers:
  • memory reads have a latency that is not very predictable
  • waiting for data shouldn't stall the pipeline
• single-thread performance: out-of-order execution
  • execute instructions that don't depend on this read
  • try to keep the processor busy
• multi-thread performance: (possibly) hyper-threading

cache prefetching
• speculative loading of memory contents into cache lines to minimize read latency
• instruction fetch: branch prediction
• data reads: access pattern prediction, cache hinting instructions

cache prefetching
• right data prefetched: latency reduction
• wrong data prefetched: unnecessary memory bandwidth usage, eviction of other cache lines
• prefetch instructions to give the processor hints

cache coherency
• a problem in symmetric multiprocessing (SMP) systems
• different cores should always see the same data in one location at a time
• programming models for SMP assume cache coherency
• doesn't scale well with the number of cores
• bus snooping vs. directory-based
• write-invalidate vs. write-update
• different cache coherency protocols

bonus slide: memory ordering
• under certain conditions, reads and writes can be reordered
• specified in the instruction set architecture
• needs to be taken into account when building synchronization primitives
• insert (expensive) memory barriers where necessary

Thank you for your attention. Questions?
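The hash/tag split described for caches is usually a plain bit-field extraction. A minimal sketch, assuming a 32 KiB, 8-way, 64-byte-line geometry (an assumed example; it matches many x86 L1 data caches, but the constants are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed example geometry: 32 KiB, 8-way set-associative, 64-byte lines. */
#define LINE_SIZE   64
#define WAYS        8
#define CACHE_SIZE  (32 * 1024)
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * WAYS))  /* 64 sets */

#define OFFSET_BITS 6   /* log2(LINE_SIZE) */
#define INDEX_BITS  6   /* log2(NUM_SETS)  */

/* Offset within the cache line (not hashed). */
static inline uint64_t cache_offset(uint64_t addr) { return addr & (LINE_SIZE - 1); }

/* The "hash": here simply the low address bits above the offset. */
static inline uint64_t cache_index(uint64_t addr)  { return (addr >> OFFSET_BITS) & (NUM_SETS - 1); }

/* Higher address bits, stored alongside the line to identify it. */
static inline uint64_t cache_tag(uint64_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }
```

With 8-way associativity, a line whose index is 8 may sit in any of the 8 ways of set 8; the stored tag decides whether a lookup is a hit.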
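The spatial-locality argument above — contiguous access patterns mostly hit in the cache — can be demonstrated with a toy model. A minimal sketch of a direct-mapped cache (sizes are illustrative assumptions, not any real CPU's):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_LINE  64   /* bytes per line, matching the common x86 size */
#define SIM_SETS  64   /* assumed number of sets */

struct sim_cache {
    uint64_t tag[SIM_SETS];
    int      valid[SIM_SETS];
};

/* Returns 1 on hit, 0 on miss (and fills the line on a miss). */
static int sim_access(struct sim_cache *c, uint64_t addr) {
    uint64_t line = addr / SIM_LINE;
    unsigned set  = line % SIM_SETS;
    uint64_t tag  = line / SIM_SETS;
    if (c->valid[set] && c->tag[set] == tag)
        return 1;
    c->valid[set] = 1;
    c->tag[set]   = tag;
    return 0;
}

/* Sequential 4-byte loads over n bytes; returns the number of misses. */
static int sim_sequential_misses(uint64_t n) {
    struct sim_cache c;
    memset(&c, 0, sizeof c);
    int misses = 0;
    for (uint64_t a = 0; a < n; a += 4)
        misses += !sim_access(&c, a);
    return misses;
}
```

Sequentially reading 1024 bytes in 4-byte words is 256 accesses but only 16 misses: one per 64-byte line, with the other 15 accesses to each line hitting. A prefetcher that recognizes the sequential pattern can hide even those misses.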
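The bonus slide's point — synchronization primitives need explicit ordering — can be sketched with C11 atomics. A minimal message-passing example, assuming POSIX threads; the names and the payload value 42 are illustrative:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;
static atomic_int ready;

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                           /* plain store */
    /* Release: orders the payload store before the flag becomes visible. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Spins until the flag is set, then reads the payload. */
static int consume(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    /* Acquire: orders the payload read after the flag is observed set. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    int v = payload;
    pthread_join(t, NULL);
    return v;
}
```

Without the release/acquire pair, the compiler or a weakly ordered CPU could make the flag visible before the payload, and the consumer could read a stale value; the barrier is exactly the "expensive" insurance the slide mentions.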
