Lecture 13 and 14: Memory Hierarchy
CS422, Spring 2020
Biswa@CSE-IITK

The Ideal World
• Instruction supply: zero-cycle latency, infinite capacity, zero cost, perfect control flow
• Pipeline (instruction execution): no pipeline stalls, perfect data flow (register/memory dependencies), zero-cycle interconnect (operand communication), enough functional units, zero-latency compute
• Data supply: zero-cycle latency, infinite capacity, infinite bandwidth, zero cost

Semiconductor Memory
• Semiconductor memory began to be competitive in the early 1970s; Intel was formed to exploit the market for it
• Early semiconductor memory was Static RAM (SRAM); SRAM cell internals are similar to a latch (cross-coupled inverters)
• The first commercial Dynamic RAM (DRAM) was the Intel 1103
  • 1 Kbit of storage on a single chip
  • charge on a capacitor used to hold each value
• Semiconductor memory quickly replaced magnetic core memory in the '70s

One-Transistor DRAM
[Figure: 1-T DRAM cell. The word line gates an access transistor between the bit line and a storage capacitor (FET gate, trench, or stacked) referenced to VREF.]

DRAM vs. SRAM
• DRAM
  • slower access (capacitor)
  • higher density (1T-1C cell)
  • lower cost
  • requires refresh (power, performance, circuitry)
• SRAM
  • faster access (no capacitor)
  • lower density (6T cell)
  • higher cost
  • no need for refresh

The Problem?
• Bigger is slower
  • SRAM, 512 bytes, sub-nanosecond
  • SRAM, KByte to MByte, ~nanoseconds
  • DRAM, gigabytes, ~50 nanoseconds
  • Hard disk, terabytes, ~10 milliseconds
• Faster is more expensive (dollars and chip area)
  • SRAM, < $10 per megabyte
  • DRAM, < $1 per megabyte
  • Hard disk, < $1 per gigabyte
  • These sample values scale with time
• Other technologies have their place as well: Flash memory, PC-RAM, MRAM, RRAM (not mature yet)

Why Memory Hierarchy?
• We want memory that is both fast and large
• But we cannot achieve both with a single level of memory
• Idea: have multiple levels of storage (progressively bigger and slower as the levels get farther from the processor) and ensure that most of the data the processor needs is kept in the fast(er) level(s)

Memory Wall Problem
[Figure: processor vs. DRAM performance, 1980 to 2000, log scale. Processor performance grew at ~60%/year while DRAM performance grew at ~7%/year, so the processor-memory gap grew at ~50%/year.]

DRAM Size Affects Latency
▪ In a big memory, signals have further to travel
▪ And they fan out to more locations
▪ Motivates 3D stacking

Memory Hierarchy
CPU <-> small, fast memory (RF, SRAM), which holds frequently used data <-> big, slow memory (DRAM)
• capacity: register << SRAM << DRAM
• latency: register << SRAM << DRAM
• bandwidth: on-chip >> off-chip
On a data access:
• if the data is in fast memory: low-latency access (SRAM)
• if the data is not in fast memory: high-latency access (DRAM)

Access Patterns
[Figure: memory address vs. time, one dot per access. From Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]

Examples
[Figure: address vs. time for typical access streams: instruction fetches sweeping through n loop iterations, stack accesses around subroutine calls/returns and argument accesses, and scalar data accesses.]

Locality of Reference
• Temporal locality: if a location is referenced, it is likely to be referenced again in the near future.
• Spatial locality: if a location is referenced, it is likely that locations near it will be referenced in the near future.
The sketch below illustrates both.
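To make the two kinds of locality concrete, here is a minimal C sketch (the array, its size, and the loop are illustrative assumptions, not taken from the lecture): the scalar sum is referenced on every iteration, while the array is walked through sequentially.

    #include <stdio.h>

    #define N 1024   /* illustrative array size */

    int main(void) {
        static int a[N];   /* zero-initialized static array */
        int sum = 0;
        /* sum and i live at the same locations throughout the loop
         * (temporal locality); a[0], a[1], ... sit at adjacent
         * addresses, so one fetched cache block serves several
         * iterations (spatial locality). */
        for (int i = 0; i < N; i++)
            sum += a[i];
        printf("sum = %d\n", sum);
        return 0;
    }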
Locality, Again
[Figure: the same address-vs-time plot, with regions of temporal locality (repeated dots at one address) and spatial locality (dots at neighbouring addresses) marked.]

Inside a Cache
[Figure: the processor sends an address to the cache, which sits in front of main memory. The cache holds copies of main-memory locations (e.g., 100 and 101); each cache line pairs an address tag (e.g., 416) with a data block.]

Cache Events
Look at the processor address and search the cache tags for a match. Then either:
• Found in cache, a.k.a. HIT: return the copy of the data from the cache.
• Not in cache, a.k.a. MISS: read the block of data from main memory, wait..., then return the data to the processor and update the cache.
Q: Which line do we replace?

Placement Policy
Where in an 8-block cache can memory block 12 (of 32 memory blocks) be placed?
• Fully associative: block 12 can be placed anywhere
• (2-way) set associative: anywhere in set 0 (12 mod 4)
• Direct mapped: only into block 4 (12 mod 8)

Direct-Mapped
• The address is split into Tag (t bits), Index (k bits), and Block Offset (b bits)
• The index selects one of 2^k lines; each line holds a valid bit (V), a tag, and a data block
• The stored tag is compared with the address tag; on a match (with V set), HIT, and the offset selects the word or byte
• In reality, the tag store is placed separately from the data store
• The index can be taken from the high bits or the low bits of the block address (Index | Tag | Offset vs. Tag | Index | Offset)
(Code sketches of the address split and of cache lookups appear at the end of these notes.)

Set-Associative
• Same Tag | Index | Block Offset split
• The index selects a set; the tags of all ways in the set are compared in parallel, and a match in any way is a HIT
• The offset then selects the word or byte

Fully-Associative
• No index bits: the address tag is compared against the tag of every line in parallel; any match is a HIT
• The block offset selects the word or byte

Block (Line) Size?
• Example line: Tag | Word0 | Word1 | Word2 | Word3, a 4-word block with b = 2
• Split the CPU address into a block address (32 - b bits) and a block offset (b bits); 2^b = block size, a.k.a. line size (in bytes)
• Larger block size has distinct hardware advantages
  • less tag overhead
  • exploits fast burst transfers from DRAM
  • exploits fast burst transfers over wide busses
• What are the disadvantages of increasing block size? Fewer blocks => more conflicts; can waste bandwidth.

Block Size?
◼ Block size is the amount of data associated with an address tag
❑ not necessarily the unit of transfer between hierarchy levels
◼ Sub-blocking: a block divided into multiple pieces, each with its own valid bit
❑ can improve "write" performance
◼ Too-small blocks
❑ don't exploit spatial locality well
❑ have larger tag overhead
◼ Too-large blocks
❑ too few total # of blocks
❑ likely-useless data transferred
❑ extra bandwidth/energy consumed
[Plot: hit rate vs. block size, peaking at an intermediate block size.]

Cache Size
◼ Cache size: total data capacity (not including tags)
❑ bigger can exploit temporal locality better
❑ not ALWAYS better
◼ Too large a cache adversely affects hit and miss latency
❑ smaller is faster => bigger is slower
❑ access time may degrade the critical path
◼ Too small a cache
❑ doesn't exploit temporal locality well
❑ useful data replaced too often
◼ Working set: the whole set of data the executing application references within a time interval
[Plot: hit rate vs. cache size, saturating once the cache covers the "working set" size.]

Associativity
◼ How many blocks can map to the same index (or set)?
◼ Larger associativity
❑ lower miss rate, less variation among programs
❑ diminishing returns, higher hit latency
◼ Smaller associativity
❑ lower cost
❑ lower hit latency (especially important for L1 caches)
◼ Must associativity be a power of 2?
[Plot: hit rate vs. associativity, with diminishing returns.]

CPU-Cache Interaction
[Figure: a 5-stage pipeline with the primary instruction cache in the fetch stage and the primary data cache in the memory stage; each cache produces a hit? signal. On a data-cache miss, the entire CPU is stalled while the cache is refilled with data from lower levels of the memory hierarchy.]
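As a concrete version of the Tag | Index | Offset split described under "Direct-Mapped" above, here is a small C sketch. The geometry is an illustrative assumption (k = 6, so 64 lines; b = 5, so 32-byte blocks), not a parameter from the lecture.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative geometry, not from the lecture:
     * b = 5 offset bits (2^5 = 32-byte blocks)
     * k = 6 index bits  (2^6 = 64 lines)
     * t = 32 - k - b = 21 tag bits */
    #define OFFSET_BITS 5
    #define INDEX_BITS  6

    int main(void) {
        uint32_t addr   = 0x0001A2C4;   /* an arbitrary example address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("addr=0x%08x  tag=0x%x  index=%u  offset=%u\n",
               addr, tag, index, offset);
        return 0;
    }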
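The "Cache Events" flow for a direct-mapped cache can be simulated with a tag store alone; a minimal sketch follows (the 8-line size matches the Placement Policy example, the trace is made up). Block 20 maps to the same line as block 12 (20 mod 8 = 12 mod 8 = 4), so it evicts block 12 and turns the last access into a conflict miss.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINES 8   /* 8-line direct-mapped tag store, as in the placement example */

    typedef struct { bool valid; uint32_t tag; } Line;

    /* Search the tag store as in "Cache Events": a valid, matching tag
     * is a HIT; otherwise record a MISS and refill the line. Direct
     * mapping leaves no choice of victim, so "which line do we
     * replace?" does not arise here. */
    static bool cache_access(Line cache[], uint32_t block_addr) {
        uint32_t index = block_addr % LINES;   /* placement: block mod 8 */
        uint32_t tag   = block_addr / LINES;
        if (cache[index].valid && cache[index].tag == tag)
            return true;                       /* HIT */
        cache[index].valid = true;             /* MISS: refill this line */
        cache[index].tag   = tag;
        return false;
    }

    int main(void) {
        Line cache[LINES] = {{0}};
        uint32_t trace[] = {12, 13, 12, 20, 12};
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
            printf("block %2u -> %s\n", trace[i],
                   cache_access(cache, trace[i]) ? "HIT" : "MISS");
        return 0;   /* prints MISS, MISS, HIT, MISS, MISS */
    }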
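For comparison, a 2-way set-associative sketch of the same idea (again illustrative, with the trivial LRU that two ways allow): blocks 12, 20, and 28 all map to set 0 (mod 4), yet block 12 now survives the conflicts that evicted it in the direct-mapped sketch, showing why larger associativity lowers the miss rate.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SETS 4   /* the same 8 blocks arranged as 4 sets x 2 ways */
    #define WAYS 2

    typedef struct { bool valid; uint32_t tag; } Way;

    static Way cache[SETS][WAYS];
    static unsigned lru[SETS];   /* per set: index of the way to evict next */

    /* Hardware compares all tags of the indexed set in parallel; the loop
     * below stands in for that. On a miss, the LRU way is the victim. */
    static bool sa_access(uint32_t block_addr) {
        uint32_t set = block_addr % SETS;
        uint32_t tag = block_addr / SETS;
        for (unsigned w = 0; w < WAYS; w++) {
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                lru[set] = 1 - w;    /* with 2 ways, the other way is now LRU */
                return true;         /* HIT */
            }
        }
        unsigned victim = lru[set];  /* MISS: replace the LRU way */
        cache[set][victim].valid = true;
        cache[set][victim].tag   = tag;
        lru[set] = 1 - victim;
        return false;
    }

    int main(void) {
        uint32_t trace[] = {12, 20, 12, 28, 12};   /* all map to set 0 */
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
            printf("block %2u -> %s\n", trace[i],
                   sa_access(trace[i]) ? "HIT" : "MISS");
        return 0;   /* prints MISS, MISS, HIT, MISS, HIT */
    }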