Caches & Memory

Hakim Weatherspoon
CS 3410
Computer Science
Cornell University
[Weatherspoon, Bala, Bracy, McKee, and Sirer]

Programs 101

C Code:

    int main (int argc, char* argv[ ]) {
        int i;
        int m = n;
        int sum = 0;

        for (i = 1; i <= m; i++) {
            sum += i;
        }
        printf ("...", n, sum);
    }

RISC-V Assembly:

    main: addi sp,sp,-48
          sw   ra,44(sp)
          sw   fp,40(sp)
          mv   fp,sp
          sw   a0,-36(fp)
          sw   a1,-40(fp)
          la   a5,n
          lw   a5,0(a5)
          sw   a5,-28(fp)
          sw   x0,-24(fp)
          li   a5,1
          sw   a5,-20(fp)
    L2:   lw   a4,-20(fp)
          lw   a5,-28(fp)
          blt  a5,a4,L3
          ...

The same listing with numeric register names uses x1 for ra, x10 for a0, x11 for a1, x14 for a4, and x15 for a5.

Load/Store Architectures:
• Read data from memory (put in registers)
• Manipulate it
• Store it back to memory

Instructions that read from or write to memory…

1 Cycle Per Stage: the Biggest Lie (So Far)

[Figure: five-stage pipeline datapath (Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Callouts: "Code Stored in Memory (also, data and stack)" and "Stack, Data, Code Stored in Memory".]

What's the problem?

CPU
Main Memory
+ big
– slow
– far away

[Figure: SandyBridge Motherboard, 2011, http://news.softpedia.com]

The Need for Speed

CPU Pipeline

Instruction speeds:
• add, sub, shift: 1 cycle
• mult: 3 cycles
• load/store: 100 cycles

Off-chip memory takes 50(-70) ns. A 2(-3) GHz processor has a 0.5 ns clock, so a 50 ns access costs 50 / 0.5 = 100 cycles.

What's the solution? Caches!
[Figure: Intel Pentium 3, 1999, with the Level 1 Insn $, Level 1 Data $, and Level 2 $ labeled]

Aside
• Go back to 04-state and 05-memory and look at how registers, SRAM and DRAM are built.

What lucky data gets to go in the caches?

Locality Locality Locality

If you ask for something, you're likely to ask for:
• the same thing again soon: Temporal Locality
• something near that thing, soon: Spatial Locality

    total = 0;
    for (i = 0; i < n; i++)
        total += a[i];
    return total;

Your life is full of Locality
• Last Called
• Speed Dial
• Favorites
• Contacts
• Google/Facebook/email

The Memory Hierarchy

Intel Haswell Processor, 2013, from small and fast (top) to big and slow (bottom):

    Level         Access time   Capacity
    Registers     1 cycle       128 bytes
    L1 Caches     4 cycles      64 KB
    L2 Cache      12 cycles     256 KB
    L3 Cache      36 cycles     2-20 MB
    Main Memory   50-70 ns      512 MB – 4 GB
    Disk          5-20 ms       16 GB – 4 TB

Some Terminology

Cache hit
• data is in the Cache
• t_hit: time it takes to access the cache
• Hit rate (%hit): # cache hits / # cache accesses

Cache miss
• data is not in the Cache
• t_miss: time it takes to get the data from below the $
• Miss rate (%miss): # cache misses / # cache accesses

Cacheline or cacheblock or simply line or block
• Minimum unit of info that is present/or not in the cache

The Memory Hierarchy: average access time

    t_avg = t_hit + %miss * t_miss
          = 4 + 5% x 100
          = 9 cycles

Here t_hit is the 4-cycle L1 access time, the miss rate is 5%, and a miss costs 100 cycles.

Single Core Memory Hierarchy

[Figure: on chip, one processor with Regs, L1 I$ and D$, L2 Cache, and L3 Cache; below them Main Memory and Disk]

Multi-Core Memory Hierarchy

[Figure: on chip, four processors, each with its own Regs, I$, D$, and L2, sharing one L3; below them Main Memory and Disk]

Memory Hierarchy by the Numbers

CPU clock rates ~0.33 ns – 2 ns (3 GHz – 500 MHz)

    Memory technology   Transistor count*   Access time   Access time in cycles   $ per GiB in 2012   Capacity
    SRAM (on chip)      6-8 transistors     0.5-2.5 ns    1-3 cycles              $4k                 256 KB
    SRAM (off chip)                         1.5-30 ns     5-15 cycles             $4k                 32 MB
    DRAM                1 transistor        50-70 ns      150-200 cycles          $10-$20             8 GB
                        (needs refresh)
    SSD (Flash)                             5k-50k ns     Tens of thousands       $0.75-$1            512 GB
    Disk                                    5M-20M ns     Millions                $0.05-$0.1          4 TB

*Registers, D-Flip Flops: 10-100's of registers

Basic Cache Design

Direct Mapped Caches

16 Byte Memory

    addr  data      addr  data
    0000  A         1000  J
    0001  B         1001  K
    0010  C         1010  L
    0011  D         1011  M
    0100  E         1100  N
    0101  F         1101  O
    0110  G         1110  P
    0111  H         1111  Q

load 1100 → r1

• Byte-addressable memory
• 4 address bits → 16 bytes total
• b addr bits → 2^b bytes in memory

4-Byte, Direct Mapped Cache

MEMORY: as in the 16 Byte Memory slide.

CACHE (the index is the low 2 bits of the address XXXX):

    index  data
    00     A
    01     B
    10     C
    11     D

Cache entry = row = (cache) line = (cache) block
Block Size: 1 byte

Direct mapped:
• Each address maps to 1 cache block
• 4 entries → 2 index bits (2^n entries → n bits)

Index with LSB:
• Supports spatial locality

Analogy to a Spice Rack

Spice Rack (Cache) and Spice Wall (Memory). The rack is indexed by letter: A B C D E F … Z

• Compared to your spice wall
  • Smaller
  • Faster
  • More costly (per oz.)

[Figure credit: http://www.bedbathandbeyond.com]

Analogy to a Spice Rack

The rack now has three columns: index, tag, spice.

    index  tag      spice
    C      innamon  Cinnamon

• How do you know what's in the jar?
• Need labels

Tag = Ultra-minimalist label

4-Byte, Direct Mapped Cache

MEMORY: as in the 16 Byte Memory slide.

CACHE (address XXXX splits into tag|index, 2 bits each):

    index  tag  data
    00     00   A
    01     00   B
    10     00   C
    11     00   D

Tag: minimalist label/address
address = tag + index

4-Byte, Direct Mapped Cache

CACHE:

    index  V  tag  data
    00     0  00   X
    01     0  00   X
    10     0  00   X
    11     0  00   X

One last tweak: valid bit

Simulation #1 of a 4-byte, DM Cache

CACHE (address XXXX splits into tag|index):

    index  V  tag  data
    00     0  11   X
    01     0  11   X
    10     0  11   X
    11     0  11   X

load 1100

Lookup:
• Index into $
• Check tag
• Check valid bit

Block Diagram

4-entry, direct mapped Cache. The address 1101 splits into tag|index = 11|01 (2 bits each); each line holds 8 bits of data.

    index  V  tag  data
    00     1  00   1111 0000
    01     1  11   1010 0101
    10     0  01   1010 1010
    11     1  11   0000 0000

Index 01 selects a valid line whose tag (11) equals the address tag, so the cache outputs data 1010 0101: Hit!

Great! Are we done?

Simulation #2: 4-byte, DM Cache

CACHE after the first load:

    index  V  tag  data
    00     1  11   N
    01     0  11   X
    10     0  11   X
    11     0  11   X

load 1100   Miss
load 1101
load 0100
load 1100

Lookup:
• Index into $
• Check tag
• Check valid bit

Reducing Cold Misses by Increasing Block Size
• Leveraging Spatial Locality

Increasing Block Size

CACHE (the low bit of the address XXXX is the block offset):

    index  V  tag  data
    00     0  x    A | B
    01     0  x    C | D
    10     0  x    E | F
    11     0  x    G | H

• Block Size: 2 bytes
• Block Offset: least significant bits indicate where you live in the block
• Which bits are the index? tag?

Simulation #3: 8-byte, DM Cache

CACHE (address XXXX splits into tag|index|offset):

    index  V  tag  data
    00     0  x    X | X
    01     0  x    X | X
    10     0  x    X | X
    11     0  x    X | X

load 1100
load 1101
load 0100
load 1100

Lookup:
• Index into $
• Check tag
• Check valid bit

Removing Conflict Misses with Fully-Associative Caches

Simulation #4: 8-byte, FA Cache

CACHE (address XXXX splits into tag|offset; an LRU pointer tracks which entry to replace):

    V  tag  data
    0  xxx  X | X
    0  xxx  X | X
    0  xxx  X | X
    0  xxx  X | X

load 1100   Miss
load 1101
load 0100
load 1100

Lookup:
• Index into $
• Check tags
• Check valid bits

Pros and Cons of Full Associativity

+ No more conflicts!
+ Excellent utilization!

But either:
• Parallel Reads: lots of reading!
• Serial Reads: lots of waiting

    t_avg = t_hit + %miss * t_miss
          = 4 + 5% x 100 = 9 cycles
          = 6 + 3% x 100 = 9 cycles

Pros & Cons

                          Direct Mapped   Fully Associative
    Tag Size              Smaller         Larger
    SRAM Overhead         Less            More
    Controller Logic      Less            More
    Speed                 Faster          Slower
    Price                 Less            More
    Scalability           Very            Not Very
    # of conflict misses  Lots            Zero
    Hit Rate              Low             High
    Pathological Cases    Common          ?

Reducing Conflict Misses with Set-Associative Caches

Not too conflict-y.
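
The direct-mapped lookup steps (index into the cache, check the tag, check the valid bit) can be tried out with a small simulator. The following is a minimal sketch, not code from the course: the class name `DirectMappedCache`, the helper `run_trace`, and the `MEMORY` string are illustrative assumptions. It models the 16-byte memory from the slides and replays the load sequence used in Simulations #2 and #3.

```python
# Illustrative sketch: all names here are assumptions, not part of the slides.
# Memory contents from the slides: addresses 0000..1111 hold A..Q (no letter I).
MEMORY = "ABCDEFGHJKLMNOPQ"


class DirectMappedCache:
    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.valid = [False] * num_lines   # valid bit per line
        self.tags = [0] * num_lines        # tag per line
        self.data = [""] * num_lines       # cached block per line

    def split(self, addr):
        """Split an address into (tag, index, offset), low bits last."""
        block, offset = divmod(addr, self.block_size)
        tag, index = divmod(block, self.num_lines)
        return tag, index, offset

    def load(self, addr):
        """Return (byte, hit). On a miss, fetch the whole block from memory."""
        tag, index, offset = self.split(addr)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            base = addr - offset           # first address of the block
            self.valid[index] = True
            self.tags[index] = tag
            self.data[index] = MEMORY[base:base + self.block_size]
        return self.data[index][offset], hit


def run_trace(cache, trace):
    """Replay a list of load addresses and report hit or miss for each."""
    return ["hit" if cache.load(addr)[1] else "miss" for addr in trace]


# The load sequence from the simulation slides.
TRACE = [0b1100, 0b1101, 0b0100, 0b1100]

if __name__ == "__main__":
    print(run_trace(DirectMappedCache(4, 1), TRACE))  # 4-byte cache, 1-byte blocks
    print(run_trace(DirectMappedCache(4, 2), TRACE))  # 8-byte cache, 2-byte blocks
```

Under these assumptions, the 4-byte cache misses on all four loads: 1100 and 0100 share index 00 and keep evicting each other. With 2-byte blocks, load 1101 becomes a hit because it was fetched along with 1100, but the conflict between 0100 and 1100 remains.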

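
The fully-associative case and the average access time formula can be sketched the same way. This is an illustrative example under stated assumptions, not the course's implementation: `FullyAssociativeCache` and `average_access_time` are hypothetical names, and an `OrderedDict` stands in for the LRU pointer shown on the slide. The memory contents and load sequence are the ones from the simulation slides.

```python
# Illustrative sketch: all names here are assumptions, not part of the slides.
from collections import OrderedDict

# Memory contents from the slides: addresses 0000..1111 hold A..Q (no letter I).
MEMORY = "ABCDEFGHJKLMNOPQ"


class FullyAssociativeCache:
    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        # Maps tag -> block data, ordered from least to most recently used.
        self.lines = OrderedDict()

    def load(self, addr):
        """Return (byte, hit). Any block may live in any line; evict LRU when full."""
        tag, offset = divmod(addr, self.block_size)   # no index bits
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)               # mark as most recently used
        else:
            if len(self.lines) == self.num_lines:
                self.lines.popitem(last=False)        # evict least recently used
            base = tag * self.block_size
            self.lines[tag] = MEMORY[base:base + self.block_size]
        return self.lines[tag][offset], hit


def average_access_time(t_hit, miss_rate, t_miss):
    """t_avg = t_hit + %miss * t_miss, with miss_rate as a fraction."""
    return t_hit + miss_rate * t_miss


if __name__ == "__main__":
    cache = FullyAssociativeCache(num_lines=4, block_size=2)
    for addr in [0b1100, 0b1101, 0b0100, 0b1100]:
        byte, hit = cache.load(addr)
        print(f"load {addr:04b}: {byte} ({'hit' if hit else 'miss'})")
    print(average_access_time(4, 0.05, 100))
    print(average_access_time(6, 0.03, 100))
```

In this sketch the final load of 1100 hits, because the block holding 0100 takes a free line rather than displacing the block holding 1100. The two `average_access_time` calls use the figures from the Pros and Cons slide, where a slower hit time paired with a lower miss rate gives the same average.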