L17: Caches I
CSE 410, Winter 2017

Instructor: Justin Hsia
Teaching Assistants: Kathryn Chan, Kevin Bi, Ryan Wong, Waylon Huang, Xinyu Sui

Administrivia

 Midterm re‐grade requests due Thursday (2/16)
 Lab 3 due next Thursday (2/23)

 Optional section: Arrays, Structs, Buffer Overflow

 Midterm Clobber Policy
  . Final will be cumulative (half midterm, half post‐midterm)
  . If you perform better on the midterm portion of the final, you replace your midterm score!
  . Replacement score = (Final score / Final average) × Midterm mean

Roadmap

[Course roadmap diagram: C code (car *c = malloc(sizeof(car)); c->miles = 100; c->gals = 17; float mpg = get_mpg(c); free(c);) and Java code (Car c = new Car(); c.setMiles(100); c.setGals(17); float mpg = c.getMPG();) compile down to assembly language (get_mpg: pushq %rbp / movq %rsp, %rbp / ... / popq %rbp / ret) and machine code (0111010000011000 ...), running on the OS and the computer system.]

Topics: Memory & data; Integers & floats; Assembly; Procedures & stacks; Executables; Arrays & structs; Memory & caches; Processes; Virtual memory; Operating Systems.

How does execution time grow with SIZE?

    int array[SIZE];
    int sum = 0;
    for (int i = 0; i < 200000; i++) {
      for (int j = 0; j < SIZE; j++) {
        sum += array[j];
      }
    }

[Plot: expected execution time (Time) vs. SIZE]
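To reproduce this experiment yourself, here is a minimal, self‐contained C sketch; the clock() timing, calloc, and the particular SIZE values are my own scaffolding, not from the slides.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Times the nested loops from the slide for one value of SIZE.
       Returns seconds of CPU time. */
    static double time_sum(int size, int reps) {
        int *array = calloc(size, sizeof(int));
        volatile long sum = 0;           /* volatile so the loop isn't optimized away */
        clock_t start = clock();
        for (int i = 0; i < reps; i++)
            for (int j = 0; j < size; j++)
                sum += array[j];
        clock_t end = clock();
        free(array);
        return (double)(end - start) / CLOCKS_PER_SEC;
    }

    int main(void) {
        /* 200000 repetitions as in the slide; reduce for a quicker run */
        for (int size = 1000; size <= 10000; size += 1000)
            printf("SIZE = %5d: %.3f s\n", size, time_sum(size, 200000));
        return 0;
    }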

Actual Data

[Plot: measured execution time ("Time", 0‐45) vs. SIZE (0‐10000).]

Making memory accesses fast!

 Cache basics
 Principle of locality
 Memory hierarchies
 Cache organization
 Program optimizations that consider caches

Processor‐Memory Gap

1989: first Intel CPU with cache on chip
1998: Pentium III has two cache levels on chip

[Plot: Performance (log scale, 1 to 10000) vs. Year. µProc performance grows 55%/year (2X/1.5yr); DRAM performance grows 7%/year (2X/10yrs). The widening Processor‐Memory Performance Gap ("Moore's Law") grows 50%/year.]

Problem: Processor‐Memory Bottleneck

Processor performance doubled about every 18 months; bus latency / bandwidth evolved much slower.

[Diagram: CPU and registers connected over a bus to Main Memory.]

Core 2 Duo: can process at least 256 Bytes/cycle.
Core 2 Duo: main memory bandwidth 2 Bytes/cycle, latency 100‐200 cycles (30‐60 ns).

Problem: lots of waiting on memory
(cycle: single machine step, fixed‐time)

Problem: Processor‐Memory Bottleneck

Processor performance doubled about every 18 months; bus latency / bandwidth evolved much slower.

[Diagram: CPU and registers, now with a Cache between the registers and Main Memory.]

Core 2 Duo: can process at least 256 Bytes/cycle.
Core 2 Duo: main memory bandwidth 2 Bytes/cycle, latency 100‐200 cycles (30‐60 ns).

Solution: caches
(cycle: single machine step, fixed‐time)

Cache 💰

 Pronunciation: “cash”
  . We abbreviate this as “$”

 English: A hidden storage space for provisions, weapons, and/or treasures

 Computer: Memory with short access time used for the storage of frequently or recently used instructions (i‐cache/I$) or data (d‐cache/D$)
  . More generally: used to optimize data transfers between any system elements with different characteristics (network interface cache, I/O cache, etc.)

General Cache Mechanics

Cache: smaller, faster, more expensive memory. Caches a subset of the blocks (a.k.a. lines).
[Diagram: cache currently holding blocks 7, 9, 14, 3.]

Data is copied in block‐sized transfer units.

Memory: larger, slower, cheaper memory. Viewed as partitioned into “blocks” or “lines”.
[Diagram: memory blocks numbered 0 through 15.]

General Cache Concepts: Hit

Request: 14 (data in block b is needed).
Block b is in the cache (cache holds 7, 9, 14, 3): Cache Hit!
Data is returned to the CPU.
[Diagram: memory blocks 0 through 15.]

General Cache Concepts: Miss

Request: 12 (data in block b is needed).
Block b is not in the cache: Cache Miss!
Block b is fetched from memory (Request: 12).
Block b is stored in the cache (here 12 replaces 9, so the cache now holds 7, 12, 14, 3):
• Placement policy: determines where b goes
• Replacement policy: determines which block gets evicted (victim)
Data is returned to the CPU.
[Diagram: memory blocks 0 through 15.]
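To make hits, misses, placement, and replacement concrete, here is a toy sketch of the mechanics in C: a fully associative 4‐block "cache" with FIFO replacement. The sizes and policies are illustrative assumptions of mine, not how real hardware is organized and not from the slides.

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_BLOCKS 4            /* toy cache: 4 block slots, fully associative */

    static int cache[NUM_BLOCKS] = {-1, -1, -1, -1};   /* -1 means "empty" */
    static int next_victim = 0;                         /* FIFO replacement pointer */

    /* Returns true on a hit, false on a miss (after "fetching" the block). */
    static bool access_block(int b) {
        for (int i = 0; i < NUM_BLOCKS; i++) {
            if (cache[i] == b) {
                printf("block %2d: hit\n", b);
                return true;
            }
        }
        /* Miss: placement/replacement policy picks a slot, evicting the victim. */
        printf("block %2d: miss (evicts %d)\n", b, cache[next_victim]);
        cache[next_victim] = b;
        next_victim = (next_victim + 1) % NUM_BLOCKS;
        return false;
    }

    int main(void) {
        int requests[] = {7, 9, 14, 3, 14, 12, 14};   /* 14 hits twice; 12 misses */
        for (int i = 0; i < 7; i++)
            access_block(requests[i]);
        return 0;
    }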

Why Caches Work

 Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently

 Temporal locality:
  . Recently referenced items are likely to be referenced again in the near future

 Spatial locality:
  . Items with nearby addresses tend to be referenced close together in time

 How do caches take advantage of this?

Example: Any Locality?

    sum = 0;
    for (i = 0; i < n; i++) {
      sum += a[i];
    }
    return sum;

 Data:
  . Temporal: sum referenced in each iteration
  . Spatial: array a[] accessed in stride‐1 pattern

 Instructions:
  . Temporal: cycle through loop repeatedly
  . Spatial: reference instructions in sequence

Locality Example #1

    int sum_array_rows(int a[M][N])
    {
      int i, j, sum = 0;

      for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
          sum += a[i][j];

      return sum;
    }


M = 3, N = 4:

  a[0][0] a[0][1] a[0][2] a[0][3]
  a[1][0] a[1][1] a[1][2] a[1][3]
  a[2][0] a[2][1] a[2][2] a[2][3]

Access Pattern (stride = ?):
  1) a[0][0]   2) a[0][1]   3) a[0][2]   4) a[0][3]
  5) a[1][0]   6) a[1][1]   7) a[1][2]   8) a[1][3]
  9) a[2][0]  10) a[2][1]  11) a[2][2]  12) a[2][3]

Layout in Memory: row‐major; rows a[0][*], a[1][*], a[2][*] are contiguous, starting at addresses 76, 92, 108.
Note: 76 is just one possible starting address of array a.
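A quick way to see why this access pattern is stride‐1: C stores 2‐D arrays in row‐major order, so &a[i][j] is base + (i*N + j)*sizeof(int). A small sketch (the printing loop is mine, not from the slides) that verifies the offsets:

    #include <stdio.h>

    #define M 3
    #define N 4

    int main(void) {
        int a[M][N];
        /* &a[i][j] == (char *)a + (i*N + j) * sizeof(int): row-major layout */
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                printf("&a[%d][%d] = base + %2ld bytes\n",
                       i, j, (long)((char *)&a[i][j] - (char *)a));
        return 0;
    }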

Locality Example #2

    int sum_array_cols(int a[M][N])
    {
      int i, j, sum = 0;

      for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
          sum += a[i][j];

      return sum;
    }
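Before looking at the access pattern, here is one way to measure the difference between the two traversals; a minimal sketch under my own assumptions (array dimensions, clock() timing), not code from the slides:

    #include <stdio.h>
    #include <time.h>

    enum { R = 4000, C = 4000 };          /* big enough that the array doesn't fit in cache */
    static int a[R][C];

    static long sum_rows(void) {          /* row order: stride-1 accesses */
        long sum = 0;
        for (int i = 0; i < R; i++)
            for (int j = 0; j < C; j++)
                sum += a[i][j];
        return sum;
    }

    static long sum_cols(void) {          /* column order: stride-C accesses */
        long sum = 0;
        for (int j = 0; j < C; j++)
            for (int i = 0; i < R; i++)
                sum += a[i][j];
        return sum;
    }

    static double time_it(long (*f)(void)) {
        clock_t t0 = clock();
        volatile long s = f();            /* volatile: keep the call from being optimized away */
        (void)s;
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void) {
        printf("rows: %.3f s\n", time_it(sum_rows));
        printf("cols: %.3f s\n", time_it(sum_cols));
        return 0;
    }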


M = 3, N = 4:

  a[0][0] a[0][1] a[0][2] a[0][3]
  a[1][0] a[1][1] a[1][2] a[1][3]
  a[2][0] a[2][1] a[2][2] a[2][3]

Access Pattern (stride = ?):
  1) a[0][0]   2) a[1][0]   3) a[2][0]   4) a[0][1]
  5) a[1][1]   6) a[2][1]   7) a[0][2]   8) a[1][2]
  9) a[2][2]  10) a[0][3]  11) a[1][3]  12) a[2][3]

Layout in Memory: row‐major, same as before; rows a[0][*], a[1][*], a[2][*] start at addresses 76, 92, 108.

Locality Example #3

    int sum_array_3D(int a[M][N][L])
    {
      int i, j, k, sum = 0;

      for (i = 0; i < N; i++)
        for (j = 0; j < L; j++)
          for (k = 0; k < M; k++)
            sum += a[k][i][j];

      return sum;
    }

 What is wrong with this code?
 How can it be fixed?

[Diagram: a[M][N][L] drawn as stacked planes m = 0, 1, 2; plane m holds rows a[m][0][0..3], a[m][1][0..3], a[m][2][0..3].]

Layout in Memory (M = ?, N = 3, L = 4):
  a[0][0][0..3], a[0][1][0..3], a[0][2][0..3], a[1][0][0..3], a[1][1][0..3], a[1][2][0..3], ...
  starting at addresses 76, 92, 108, 124, 140, 156, 172 (the rightmost index varies fastest).
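The loops visit every element, but the innermost loop varies k, the leftmost (slowest‐varying) index, so consecutive accesses land far apart in memory. One possible fix, sketched in the spirit of the slide's question (not text from the slides): reorder the loops so the rightmost index varies fastest. As in the slide's snippet, M, N, and L are symbolic dimensions.

    // One way to fix sum_array_3D: make the loop nest match the row-major
    // layout, so the innermost loop walks consecutive addresses (stride-1).
    int sum_array_3D_fixed(int a[M][N][L])
    {
        int i, j, k, sum = 0;

        for (k = 0; k < M; k++)          /* slowest-varying index outermost */
            for (i = 0; i < N; i++)
                for (j = 0; j < L; j++)  /* rightmost index innermost: stride-1 */
                    sum += a[k][i][j];

        return sum;
    }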

Cache Performance Metrics

 Huge difference between a cache hit and a cache miss
  . Could be 100x speed difference between accessing cache and main memory (measured in clock cycles)
 Miss Rate (MR)
  . Fraction of memory references not found in cache (misses / accesses) = 1 − Hit Rate
 Hit Time (HT)
  . Time to deliver a block in the cache to the processor
    • Includes time to determine whether the block is in the cache
 Miss Penalty (MP)
  . Additional time required because of a miss

Cache Performance

 Two things hurt the performance of a cache:
  . Miss rate and miss penalty

 Average Memory Access Time (AMAT): average time to access memory considering both hits and misses

      AMAT = Hit time + Miss rate × Miss penalty
      (abbreviated AMAT = HT + MR × MP)

 99% hit rate is twice as good as 97% hit rate!
  . Assume HT of 1 clock cycle and MP of 100 clock cycles
  . 97%: AMAT =
  . 99%: AMAT =
    (worked numbers below)
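Filling in the blanks with the formula above (the worked numbers are mine, not from the slides):

    97% hit rate: AMAT = 1 + 0.03 × 100 = 4 clock cycles
    99% hit rate: AMAT = 1 + 0.01 × 100 = 2 clock cycles

So the 99% cache is indeed twice as fast on average, even though the hit rate only changed by two percentage points.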

Peer Instruction Question

 Processor specs: 200 ps clock, MP of 50 clock cycles, MR of 0.02 misses/instruction, and HT of 1 clock cycle
    AMAT =

 Which improvement would be best?
  . Vote at http://PollEv.com/justinh

A. 190 ps clock

B. Miss penalty of 40 clock cycles

C. MR of 0.015 misses/instruction
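One way to check the options numerically: a small C sketch that applies AMAT = HT + MR × MP and converts cycles to picoseconds. The helper name and printed comparison are my own scaffolding; the poll answer is not stated in the slides.

    #include <stdio.h>

    /* AMAT in picoseconds: (HT + MR * MP) cycles, times the clock period. */
    static double amat_ps(double clock_ps, double ht_cycles,
                          double mr, double mp_cycles) {
        return (ht_cycles + mr * mp_cycles) * clock_ps;
    }

    int main(void) {
        printf("baseline:         %.0f ps\n", amat_ps(200, 1, 0.02,  50));
        printf("A (190 ps clock): %.0f ps\n", amat_ps(190, 1, 0.02,  50));
        printf("B (MP = 40):      %.0f ps\n", amat_ps(200, 1, 0.02,  40));
        printf("C (MR = 0.015):   %.0f ps\n", amat_ps(200, 1, 0.015, 50));
        return 0;
    }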

Can we have more than one cache?

 Why would we want to do that?
  . Avoid going to memory!
 Typical performance numbers:
  . Miss Rate
    • L1 MR = 3‐10%
    • L2 MR = quite small (e.g. < 1%), depending on parameters, etc.
  . Hit Time
    • L1 HT = 4 clock cycles
    • L2 HT = 10 clock cycles
  . Miss Penalty
    • MP = 50‐200 cycles for missing in L2 and going to main memory
    • Trend: increasing!
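With two cache levels, the AMAT formula nests: an L1 miss pays the L2 hit time, and only an L2 miss pays the trip to main memory. A hedged worked example using round numbers in the ranges above (the 5% L1 miss rate, 20% local L2 miss rate, and 100‐cycle memory penalty are my assumptions, not figures from the slides):

    AMAT = L1 HT + L1 MR × (L2 HT + L2 MR_local × MP)
         = 4 + 0.05 × (10 + 0.20 × 100)
         = 4 + 0.05 × 30 = 5.5 cycles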

Memory Hierarchies

 Some fundamental and enduring properties of hardware and software systems:
  . Faster storage technologies almost always cost more per byte and have lower capacity
  . The gaps between memory technology speeds are widening
    • True for: registers ↔ cache, cache ↔ DRAM, DRAM ↔ disk, etc.
  . Well‐written programs tend to exhibit good locality

 These properties complement each other beautifully
  . They suggest an approach for organizing memory and storage systems known as a memory hierarchy

An Example Memory Hierarchy

[Pyramid diagram: smaller, faster, costlier per byte at the top; larger, slower, cheaper per byte at the bottom.]

  Level                                    Latency              (human scale)
  registers                                <1 ns                5‐10 s
  on‐chip L1 cache (SRAM)                  1 ns
  off‐chip L2 cache (SRAM)                 5‐10 ns              1‐2 min
  main memory (DRAM)                       100 ns               15‐30 min
  SSD                                      150,000 ns           31 days
  local secondary storage (local disks)    10,000,000 ns (10 ms)  66 months
  remote secondary storage (distributed
  file systems, web servers)               1‐150 ms             1‐15 years

An Example Memory Hierarchy

  registers: CPU registers hold words retrieved from the L1 cache
  on‐chip L1 cache (SRAM): holds cache lines retrieved from the L2 cache
  off‐chip L2 cache (SRAM): holds cache lines retrieved from main memory
  main memory (DRAM): holds disk blocks retrieved from local disks
  local secondary storage (local disks): holds files retrieved from disks on remote network servers
  remote secondary storage (distributed file systems, web servers)

An Example Memory Hierarchy

  registers: explicitly program‐controlled (e.g. refer to exactly %rax, %rbx)
  caches and main memory: the program sees “memory”; hardware manages caching transparently
  [Pyramid repeats: on‐chip L1 cache (SRAM), off‐chip L2 cache (SRAM), main memory (DRAM), local secondary storage (local disks), remote secondary storage (distributed file systems, web servers)]

Memory Hierarchies

 Fundamental idea of a memory hierarchy:
  . For each level k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1
 Why do memory hierarchies work?
  . Because of locality, programs tend to access the data at level k more often than they access the data at level k+1
  . Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit
 Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top

i7 Cache Hierarchy

[Diagram: processor package with Core 0 … Core 3, each with its own registers, L1 d‐cache/i‐cache, and L2 unified cache; a single L3 unified cache is shared by all cores, in front of main memory.]

  Block size: 64 bytes for all caches
  L1 i‐cache and d‐cache: 32 KiB, 8‐way, access: 4 cycles
  L2 unified cache: 256 KiB, 8‐way, access: 11 cycles
  L3 unified cache (shared by all cores): 8 MiB, 16‐way, access: 30‐40 cycles

Summary

 Memory Hierarchy
  . Successively higher levels contain “most used” data from lower levels
  . Exploits temporal and spatial locality
  . Caches are intermediate storage levels used to optimize data transfers between any system elements with different characteristics
 Cache Performance
  . Ideal case: found in cache (hit)
  . Bad case: not found in cache (miss), search in next level
  . Average Memory Access Time (AMAT) = HT + MR × MP
    • Hurt by Miss Rate and Miss Penalty
