Caches & Memory
Hakim Weatherspoon
CS 3410, Computer Science, Cornell University
[Weatherspoon, Bala, Bracy, McKee, and Sirer]

Programs 101

C Code:

    int main (int argc, char* argv[]) {
        int i;
        int m = n;
        int sum = 0;
        for (i = 1; i <= m; i++) {
            sum += i;
        }
        printf("...", n, sum);
    }

RISC-V Assembly (numeric register names):

    main:   addi sp,sp,-48
            sw   x1,44(sp)
            sw   fp,40(sp)
            move fp,sp
            sw   x10,-36(fp)
            sw   x11,-40(fp)
            la   x15,n
            lw   x15,0(x15)
            sw   x15,-28(fp)
            sw   x0,-24(fp)
            li   x15,1
            sw   x15,-20(fp)
    L2:     lw   x14,-20(fp)
            lw   x15,-28(fp)
            blt  x15,x14,L3
            ...

The same assembly with ABI register names:

    main:   addi sp,sp,-48
            sw   ra,44(sp)
            sw   fp,40(sp)
            move fp,sp
            sw   a0,-36(fp)
            sw   a1,-40(fp)
            la   a5,n
            lw   a5,0(a5)
            sw   a5,-28(fp)
            sw   x0,-24(fp)
            li   a5,1
            sw   a5,-20(fp)
    L2:     lw   a4,-20(fp)
            lw   a5,-28(fp)
            blt  a5,a4,L3
            ...

Load/Store Architectures:
• Read data from memory (put it in registers)
• Manipulate it
• Store it back to memory
The lw and sw instructions are the ones that read from or write to memory.

1 Cycle Per Stage: the Biggest Lie (So Far)
[Figure: the 5-stage pipeline (Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back) with the IF/ID, ID/EX, EX/MEM, and MEM/WB registers. Code, data, and the stack are all stored in memory, reached through the instruction memory (fetch) and data memory (din/dout) ports.]

What's the problem?
CPU vs. Main Memory: main memory is
+ big
– slow
– far away
[Photo: SandyBridge motherboard, 2011. http://news.softpedia.com]

The Need for Speed
[Figure: CPU pipeline]
Instruction speeds:
• add, sub, shift: 1 cycle
• mult: 3 cycles
• load/store: 100 cycles, off-chip, 50(-70) ns
A 2(-3) GHz processor has a ~0.5 ns clock, so a single off-chip access costs on the order of 100 processor cycles.

What's the solution? Caches!
[Die photo: Intel Pentium 3, 1999, showing the Level 1 Insn $, Level 1 Data $, and Level 2 $.]

Aside
• Go back to 04-state and 05-memory and look at how registers, SRAM and DRAM are built.

What's the solution? Caches!
[Same Pentium 3 die photo (Intel Pentium 3, 1999).]
What lucky data gets to go here?
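Before answering that, here is a quick sanity check on how bad the gap is: a minimal C sketch (my own illustration, not from the slides) that converts the off-chip access time quoted in "The Need for Speed" into processor cycles, assuming the ~0.5 ns clock period (2 GHz) and 50-70 ns DRAM latency given there.

    #include <stdio.h>

    int main(void) {
        /* Figures quoted on the slide: a ~2 GHz core (0.5 ns per cycle)
         * and an off-chip access time of roughly 50-70 ns. */
        double cycle_ns   = 0.5;
        double dram_lo_ns = 50.0, dram_hi_ns = 70.0;

        printf("one off-chip load costs about %.0f-%.0f cycles\n",
               dram_lo_ns / cycle_ns, dram_hi_ns / cycle_ns);   /* ~100-140 */
        return 0;
    }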
Locality, Locality, Locality
If you ask for something, you're likely to ask for:
• the same thing again, soon → Temporal Locality
• something near that thing, soon → Spatial Locality

    total = 0;
    for (i = 0; i < n; i++)
        total += a[i];
    return total;

(total and i are reused on every iteration — temporal locality; a[i] walks through consecutive elements — spatial locality.)

Your life is full of Locality
Last Called, Speed Dial, Favorites, Contacts, Google/Facebook/email
[Photos: everyday examples of locality]

The Memory Hierarchy
Small, Fast
• Registers: 1 cycle, 128 bytes
• L1 Caches: 4 cycles, 64 KB
• L2 Cache: 12 cycles, 256 KB
• L3 Cache: 36 cycles, 2-20 MB
• Main Memory: 50-70 ns, 512 MB – 4 GB
• Disk: 5-20 ms, 16 GB – 4 TB
Big, Slow
(Intel Haswell Processor, 2013)

Some Terminology
Cache hit
• data is in the Cache
• thit: time it takes to access the cache
• Hit rate (%hit): # cache hits / # cache accesses
Cache miss
• data is not in the Cache
• tmiss: time it takes to get the data from below the $
• Miss rate (%miss): # cache misses / # cache accesses
Cacheline or cacheblock (or simply line or block)
• Minimum unit of info that is present (or not) in the cache

The Memory Hierarchy: average access time
For the hierarchy above, e.g. an L1 access:
    tavg = thit + %miss × tmiss
         = 4 + 5% × 100
         = 9 cycles

Single Core Memory Hierarchy
[Diagram: the hierarchy (registers, L1 caches, L2 cache, L3 cache, main memory, disk) mapped onto a single core: the registers, L1 I$/D$, and L2 sit on chip, with main memory and disk below.]

Multi-Core Memory Hierarchy
[Diagram: four processors on one chip, each with its own registers, L1 I$/D$, and L2, sharing an L3; main memory and disk below.]

Memory Hierarchy by the Numbers
CPU clock rates ~0.33 ns – 2 ns (3 GHz – 500 MHz)

    Memory technology     Transistor count*  Access time  Access time in cycles  $ per GiB in 2012  Capacity
    SRAM (on chip)        6-8 transistors    0.5-2.5 ns   1-3 cycles             $4k                256 KB
    SRAM (off chip)                          1.5-30 ns    5-15 cycles            $4k                32 MB
    DRAM (needs refresh)  1 transistor       50-70 ns     150-200 cycles         $10-$20            8 GB
    SSD (Flash)                              5k-50k ns    Tens of thousands      $0.75-$1           512 GB
    Disk                                     5M-20M ns    Millions               $0.05-$0.1         4 TB

*Registers, D-Flip Flops: 10-100's of registers

Basic Cache Design: Direct Mapped Caches

16 Byte Memory

    addr  data        addr  data
    0000  A           1000  J
    0001  B           1001  K
    0010  C           1010  L
    0011  D           1011  M
    0100  E           1100  N
    0101  F           1101  O
    0110  G           1110  P
    0111  H           1111  Q

Example: load 1100 → r1 (read the byte at address 1100 into a register)
• Byte-addressable memory
• 4 address bits → 16 bytes total
• b address bits → 2^b bytes in memory
(The same 16-byte memory is used in all of the examples that follow.)

4-Byte, Direct Mapped Cache
Block Size: 1 byte. A cache entry = row = (cache) line = (cache) block.

    index  data
    00     A
    01     B
    10     C
    11     D

Direct mapped:
• Each address maps to 1 cache block
• 4 entries → 2 index bits (2^n entries → n index bits)
Index with the LSBs:
• Supports spatial locality

Analogy to a Spice Rack
Spice Rack (Cache) with a few indexed slots vs. Spice Wall (Memory) holding everything from A to Z.
• Compared to your spice wall, the rack is
  • Smaller
  • Faster
  • More costly (per oz.)
[Photo: http://www.bedbathandbeyond.com]

Analogy to a Spice Rack: adding tags
The rack holds only a few of the spices (e.g. Cinnamon), so:
• How do you know what's in the jar?
• Need labels
Tag = Ultra-minimalist label
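Before turning to tags and valid bits in detail, here is a minimal C sketch (my own illustration, not from the slides) of the average-access-time formula above, tavg = thit + %miss × tmiss, plugging in the slide's example numbers: a 4-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty.

    #include <stdio.h>

    /* Average memory access time, as defined on the terminology slide:
     *   tavg = thit + %miss * tmiss
     * The numbers in main() are the slide's illustrative example,
     * not measurements. */
    static double tavg(double t_hit, double miss_rate, double t_miss) {
        return t_hit + miss_rate * t_miss;
    }

    int main(void) {
        /* 4-cycle hit, 5% miss rate, 100-cycle miss penalty -> 9 cycles */
        printf("tavg = %.1f cycles\n", tavg(4.0, 0.05, 100.0));
        return 0;
    }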
4-Byte, Direct Mapped Cache: adding tags
The address is split as tag | index. Tag: minimalist label/address; address = tag + index.

    index  tag  data
    00     00   A
    01     00   B
    10     00   C
    11     00   D

4-Byte, Direct Mapped Cache: one last tweak, the valid bit

    index  V  tag  data
    00     0  00   X
    01     0  00   X
    10     0  00   X
    11     0  00   X

Simulation #1 of a 4-byte, DM Cache
Access: load 1100 (address split as tag | index)
Lookup:
• Index into $
• Check tag
• Check valid bit

    index  V  tag  data
    00     0  11   X
    01     0  11   X
    10     0  11   X
    11     0  11   X

Block Diagram: 4-entry, direct mapped Cache
Address 1101 → tag = 11 (2 bits), index = 01 (2 bits); each line holds 8 bits of data.

    index  V  tag  data
    00     1  00   1111 0000
    01     1  11   1010 0101
    10     0  01   1010 1010
    11     1  11   0000 0000

Line 01 is valid and its tag matches → Hit! data = 1010 0101
Great! Are we done?

Simulation #2: 4-byte, DM Cache
Access sequence: load 1100 (Miss), load 1101, load 0100, load 1100.
Lookup: index into $, check tag, check valid bit.

    index  V  tag  data
    00     1  11   N
    01     0  11   X
    10     0  11   X
    11     0  11   X

Reducing Cold Misses by Increasing Block Size
• Leveraging Spatial Locality

Increasing Block Size
The least significant bits of the address become the block offset; each cache line now holds one multi-byte block.

    index  V  tag  data
    00     0  x    A, B
    01     0  x    C, D
    10     0  x    E, F
    11     0  x    G, H

• Block Size: 2 bytes
• Block Offset: the least significant bits indicate where you live in the block
• Which bits are the index? The tag?

Simulation #3: 8-byte, DM Cache
Address split as tag | index | offset. All four lines start invalid (V = 0).
Access sequence: load 1100, load 1101, load 0100, load 1100.
Lookup:
• Index into $
• Check tag
• Check valid bit
(Small C sketches of this lookup, and of the fully associative lookup in Simulation #4, appear at the end of this section.)

Removing Conflict Misses with Fully-Associative Caches

Simulation #4: 8-byte, FA Cache
Address split as tag | offset (no index bits). Four ways, each with its own valid bit and tag; all start invalid. An LRU pointer selects the way to replace on a miss.
Access sequence: load 1100 (Miss), load 1101, load 0100, load 1100.
Lookup:
• Check every tag
• Check the valid bits

Pros and Cons of Full Associativity
+ No more conflicts!
+ Excellent utilization!
But either:
• Parallel Reads – lots of reading!
• Serial Reads – lots of waiting
    tavg = thit + %miss × tmiss
         = 4 + 5% × 100 = 9 cycles
         = 6 + 3% × 100 = 9 cycles

Pros & Cons

                          Direct Mapped   Fully Associative
    Tag Size              Smaller         Larger
    SRAM Overhead         Less            More
    Controller Logic      Less            More
    Speed                 Faster          Slower
    Price                 Less            More
    Scalability           Very            Not Very
    # of conflict misses  Lots            Zero
    Hit Rate              Low             High
    Pathological Cases    Common          ?

Reducing Conflict Misses with Set-Associative Caches
Not too conflict-y.
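The following is a small, self-contained C sketch (my own illustration, not code from the slides) of the 8-byte direct-mapped cache used in Simulation #3: 4-bit addresses and 2-byte blocks, so an address splits into 1 tag bit, 2 index bits, and 1 offset bit. It replays the slide's access sequence (1100, 1101, 0100, 1100) and prints hit or miss at each step.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* One line of the 8-byte direct-mapped cache from Simulation #3. */
    struct line {
        bool    valid;
        uint8_t tag;        /* 1 bit */
    };

    static struct line cache[4];        /* 4 lines x 2-byte blocks = 8 bytes */

    /* Lookup as described on the slide: index into the cache, check the
     * valid bit, check the tag; on a miss, fill the line. */
    static bool dm_access(uint8_t addr) {
        uint8_t index = (addr >> 1) & 0x3;   /* address bits [2:1] */
        uint8_t tag   = (addr >> 3) & 0x1;   /* address bit  [3]   */

        struct line *l = &cache[index];
        if (l->valid && l->tag == tag)
            return true;                     /* hit */

        l->valid = true;                     /* miss: fetch block, record tag */
        l->tag   = tag;
        return false;
    }

    int main(void) {
        /* The access sequence from Simulation #3. */
        const char *text[] = { "1100", "1101", "0100", "1100" };
        uint8_t     addr[] = { 0xC,    0xD,    0x4,    0xC    };

        for (int i = 0; i < 4; i++)
            printf("load %s -> %s\n", text[i],
                   dm_access(addr[i]) ? "hit" : "miss");
        return 0;
        /* prints: miss, hit, miss, miss
         * (a cold miss, a spatial-locality hit, then two conflict misses,
         * since 0100 and 1100 map to the same line) */
    }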
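For contrast, here is a matching hedged sketch of the 8-byte fully associative cache from Simulation #4: every tag is compared on each access, so the 0100/1100 conflict from the direct-mapped case disappears. The rotating victim pointer below is my simplification of the slide's "LRU Pointer", not necessarily the slides' exact replacement policy.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* The 8-byte fully associative cache of Simulation #4: 4 ways holding
     * 2-byte blocks, so an address splits into 3 tag bits and 1 offset bit;
     * there are no index bits. */
    struct way {
        bool    valid;
        uint8_t tag;        /* 3 bits */
    };

    static struct way ways[4];
    static unsigned  victim;    /* rotating replacement pointer (simplified) */

    static bool fa_access(uint8_t addr) {
        uint8_t tag = (addr >> 1) & 0x7;     /* address bits [3:1] */

        for (unsigned w = 0; w < 4; w++)     /* compare against every tag */
            if (ways[w].valid && ways[w].tag == tag)
                return true;                 /* hit */

        ways[victim].valid = true;           /* miss: fill the chosen way */
        ways[victim].tag   = tag;
        victim = (victim + 1) % 4;
        return false;
    }

    int main(void) {
        const char *text[] = { "1100", "1101", "0100", "1100" };
        uint8_t     addr[] = { 0xC,    0xD,    0x4,    0xC    };

        for (int i = 0; i < 4; i++)
            printf("load %s -> %s\n", text[i],
                   fa_access(addr[i]) ? "hit" : "miss");
        return 0;
        /* prints: miss, hit, miss, hit -- the final conflict miss is gone */
    }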