L17: Caches I
CSE 410, Winter 2017
Instructor: Justin Hsia
Teaching Assistants: Kathryn Chan, Kevin Bi, Ryan Wong, Waylon Huang, Xinyu Sui

Administrivia
• Midterm re-grade requests due Thursday (2/16)
• Lab 3 due next Thursday (2/23)
• Optional section: Arrays, Structs, Buffer Overflow
• Midterm Clobber Policy
  • Final will be cumulative (half midterm, half post-midterm)
  • If you perform better on the midterm portion of the final, you replace your midterm score!
  • Replacement score = (F score / F avg) × MT mean

Roadmap
[Course roadmap figure: C code (car *c = malloc(sizeof(car)); c->miles = 100; c->gals = 17; float mpg = get_mpg(c); free(c);) and the equivalent Java (Car c = new Car(); c.setMiles(100); c.setGals(17); float mpg = c.getMPG();) compile down to x86 assembly (get_mpg: pushq %rbp / movq %rsp, %rbp / ... / popq %rbp / ret) and then to machine code running on the computer system. Course topics: memory & data; integers & floats; x86 assembly; procedures & stacks; executables; arrays & structs; memory & caches; processes; virtual memory; operating systems. This lecture: memory & caches.]

How does execution time grow with SIZE?

    int array[SIZE];
    int sum = 0;

    for (int i = 0; i < 200000; i++) {
        for (int j = 0; j < SIZE; j++) {
            sum += array[j];
        }
    }

[Sketch plot: expected Time as a function of SIZE]

Actual Data
[Plot: measured Time (0-45) versus SIZE (0-10000)]
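The deck shows the measured curve only as a plot. Below is a minimal sketch (not the course's code) of how such data could be collected: it fills an array, runs the nested loop above for several values of SIZE, and prints the time per access, which should rise once the array stops fitting in a given cache level. MAX_SIZE, the step of 1000, the use of clock(), and the output format are assumptions for illustration.

    /* Sketch of a measurement harness for "time vs. SIZE" (assumed code,
     * not from the slides). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MAX_SIZE 10000
    #define ITERS    200000

    static int array[MAX_SIZE];              /* working set */

    int main(void) {
        for (int j = 0; j < MAX_SIZE; j++)
            array[j] = rand();               /* fill at runtime so the loads can't be folded away */

        for (int size = 1000; size <= MAX_SIZE; size += 1000) {
            volatile long sum = 0;           /* volatile so the loop isn't optimized away */
            clock_t start = clock();
            for (int i = 0; i < ITERS; i++)
                for (int j = 0; j < size; j++)
                    sum += array[j];
            double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
            printf("SIZE = %5d   time = %6.2f s   ns/access = %.3f\n",
                   size, secs, secs * 1e9 / ((double)ITERS * size));
        }
        return 0;
    }

Plotting the per-access time against SIZE gives a curve in the spirit of the "Actual Data" slide.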
Making memory accesses fast!
• Cache basics
• Principle of locality
• Memory hierarchies
• Cache organization
• Program optimizations that consider caches

Processor-Memory Gap
• 1989: first Intel CPU with cache on chip
• 1998: Pentium III has two cache levels on chip
[Chart: Performance (log scale, 1 to 10000) versus Year. µProc performance grows 55%/year (2x every 1.5 years); DRAM performance grows 7%/year (2x every 10 years); the "Moore's Law" Processor-Memory Performance Gap grows about 50%/year.]

Problem: Processor-Memory Bottleneck
• Processor performance doubled about every 18 months; bus latency/bandwidth evolved much slower
[Diagram: CPU and registers connected to main memory over a bus.]
• Example, Core 2 Duo: the CPU can process at least 256 bytes/cycle, but memory bandwidth is only 2 bytes/cycle and memory latency is 100-200 cycles (30-60 ns)
• Problem: lots of waiting on memory
  (cycle: single machine step, fixed time)

Problem: Processor-Memory Bottleneck
[Same diagram, now with a cache between the CPU's registers and main memory.]
• Solution: caches

Cache
• Pronunciation: "cash"
  • We abbreviate this as "$"
• English: a hidden storage space for provisions, weapons, and/or treasures
• Computer: memory with short access time used for the storage of frequently or recently used instructions (i-cache/I$) or data (d-cache/D$)
  • More generally: used to optimize data transfers between any system elements with different characteristics (network interface cache, I/O cache, etc.)

General Cache Mechanics
[Diagram: a small cache holding blocks 7, 9, 14, 3, above a memory of blocks 0-15.]
• Cache: smaller, faster, more expensive memory; holds a subset of the blocks (a.k.a. lines)
• Data is copied between levels in block-sized transfer units
• Memory: larger, slower, cheaper memory; viewed as partitioned into "blocks" or "lines"

General Cache Concepts: Hit
• Request: 14 (data in block b is needed)
• Block b is in the cache: Hit!
• Data is returned to the CPU

General Cache Concepts: Miss
• Request: 12 (data in block b is needed)
• Block b is not in the cache: Miss!
• Block b is fetched from memory
• Block b is stored in the cache
  • Placement policy: determines where b goes
  • Replacement policy: determines which block gets evicted (the victim)
• Data is returned to the CPU
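To make hits, misses, placement, and replacement concrete, here is a toy simulation (a sketch, not anything from the slides): a 4-slot cache in front of a 16-block memory, with "any free slot" placement and least-recently-used replacement. The slot count, request sequence, and LRU policy are assumptions chosen to mirror the pictures above; real caches constrain placement, which the cache-organization material covers later.

    /* Toy cache simulator (assumed example code). */
    #include <stdio.h>

    #define SLOTS 4

    static int slot_block[SLOTS];   /* which memory block each slot holds; -1 = empty */
    static int slot_age[SLOTS];     /* larger = used longer ago (for LRU) */

    static void access_block(int b) {
        int i, victim = -1;

        for (i = 0; i < SLOTS; i++)          /* every access ages every slot */
            slot_age[i]++;

        for (i = 0; i < SLOTS; i++) {
            if (slot_block[i] == b) {        /* hit: block already in the cache */
                slot_age[i] = 0;
                printf("request %2d: hit  (slot %d)\n", b, i);
                return;
            }
        }

        for (i = 0; i < SLOTS; i++)          /* miss: placement policy = any empty slot */
            if (slot_block[i] == -1) { victim = i; break; }

        if (victim < 0) {                    /* replacement policy = evict the LRU block */
            victim = 0;
            for (i = 1; i < SLOTS; i++)
                if (slot_age[i] > slot_age[victim]) victim = i;
        }

        if (slot_block[victim] == -1)
            printf("request %2d: miss (placed in empty slot %d)\n", b, victim);
        else
            printf("request %2d: miss (evicted block %d from slot %d)\n",
                   b, slot_block[victim], victim);

        slot_block[victim] = b;              /* block fetched from memory, stored in cache */
        slot_age[victim] = 0;
    }

    int main(void) {
        /* Warm the cache with blocks 7, 9, 14, 3 (the slide's picture),
           then request 14 (a hit) and 12 (a miss that evicts a victim). */
        int requests[] = {7, 9, 14, 3, 14, 12};
        int n = (int)(sizeof requests / sizeof requests[0]);

        for (int i = 0; i < SLOTS; i++)
            slot_block[i] = -1;

        for (int i = 0; i < n; i++)
            access_block(requests[i]);
        return 0;
    }

Running it prints four cold misses while the cache fills, a hit on 14, and then a miss on 12 that evicts a victim, matching the two slides above.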
Why Caches Work
• Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently
• Temporal locality: recently referenced items are likely to be referenced again in the near future
• Spatial locality: items with nearby addresses tend to be referenced close together in time
• How do caches take advantage of this?

Example: Any Locality?

    sum = 0;
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;

• Data:
  • Temporal: sum referenced in each iteration
  • Spatial: array a[] accessed in stride-1 pattern
• Instructions:
  • Temporal: cycle through the loop repeatedly
  • Spatial: reference instructions in sequence

Locality Example #1

    int sum_array_rows(int a[M][N])
    {
        int i, j, sum = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

• With M = 3, N = 4, the rows are laid out one after another in memory: a[0][0..3], then a[1][0..3], then a[2][0..3] (e.g., at addresses 76, 92, 108; 76 is just one possible starting address of array a)
• Access pattern: a[0][0], a[0][1], a[0][2], a[0][3], a[1][0], a[1][1], a[1][2], a[1][3], a[2][0], a[2][1], a[2][2], a[2][3]
• Stride = ?

Locality Example #2

    int sum_array_cols(int a[M][N])
    {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

• Same memory layout as Example #1 (M = 3, N = 4; rows stored consecutively starting at, e.g., 76, 92, 108)
• Access pattern: a[0][0], a[1][0], a[2][0], a[0][1], a[1][1], a[2][1], a[0][2], a[1][2], a[2][2], a[0][3], a[1][3], a[2][3]
• Stride = ?

Locality Example #3

    int sum_array_3D(int a[M][N][L])
    {
        int i, j, k, sum = 0;
        for (i = 0; i < N; i++)
            for (j = 0; j < L; j++)
                for (k = 0; k < M; k++)
                    sum += a[k][i][j];
        return sum;
    }

• What is wrong with this code? How can it be fixed? (One possible fix is sketched below.)
• Layout in memory (M = ?, N = 3, L = 4): each a[k] is an N × L plane stored row by row, and the planes follow one another in memory (e.g., a[0] starting at address 76, a[1] at 124, and so on)

Cache Performance Metrics
• Huge difference between a cache hit and a cache miss
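Returning to Locality Example #3: one possible fix (a sketch, assuming M, N, and L are compile-time constants as on the slide) is to reorder the loops so the rightmost index varies fastest, so that successive accesses walk memory with stride 1 and match the row-major layout.

    /* One possible fix for Locality Example #3 (assumed, not the official solution):
     * same bounds, but the loops now follow the storage order of a[M][N][L]. */
    int sum_array_3D(int a[M][N][L])
    {
        int i, j, k, sum = 0;
        for (k = 0; k < M; k++)              /* leftmost index: slowest            */
            for (i = 0; i < N; i++)          /* middle index                       */
                for (j = 0; j < L; j++)      /* rightmost index: fastest, stride-1 */
                    sum += a[k][i][j];
        return sum;
    }

Every element is still visited exactly once, so the sum is unchanged; only the visiting order changes, turning the scattered accesses into a sequential memory walk.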
