IC220 Caching 2 (more from Chapter 5, specifically 5.7 and 5.8)

Cache design overview

ANY cache can be viewed as k-way set associative, where k is the number of blocks per set. What are the pros and cons of each?

• Fully associative: k = N/B (N = cache size in bytes, B = block size in bytes)

• 4-way set associative: k = 4

• Direct-mapped: k = 1
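To make the k-way view concrete, here is a minimal sketch of how one address splits into tag, set index, and byte offset as k varies. The cache parameters N and B are illustrative assumptions, not values from this lecture.

    #include <stdio.h>
    #include <stdint.h>

    #define N 32768u   /* assumed cache size in bytes */
    #define B 64u      /* assumed block size in bytes */

    /* A k-way cache has N/(B*k) sets; an address is split into
       tag | set index | byte offset. */
    static void split_address(uint32_t addr, unsigned k) {
        unsigned sets   = N / (B * k);
        uint32_t offset = addr % B;
        uint32_t index  = (addr / B) % sets;   /* fully associative: 1 set, index always 0 */
        uint32_t tag    = (addr / B) / sets;
        printf("k=%4u: tag=0x%x index=%u offset=%u\n", k, tag, index, offset);
    }

    int main(void) {
        uint32_t addr = 0x12345678u;
        split_address(addr, 1);       /* direct-mapped         */
        split_address(addr, 4);       /* 4-way set associative */
        split_address(addr, N / B);   /* fully associative     */
        return 0;
    }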


Improving Cache Performance

Remember the key metrics: miss rate, hit time, miss penalty. What happens if we:

• Increase the cache size (N)?

• Increase the block size (keeping N the same)?

• Increase associativity (keeping N the same)?

Cache performance key tradeoff

Inherent conflict: HIT TIME vs. MISS RATE


More hierarchy – an L2 cache?

• Problem: CPUs get faster, DRAM gets bigger
  – Must keep the hit time small (1 or 2 cycles)
  – But then the cache must be small too (fast SRAM is expensive)
  – So the miss rate gets higher...
• Solution: add another level of cache:
  – optimize the hit time on the 1st-level (L1) cache
  – optimize the miss rate on the 2nd-level (L2) cache

Memory Hierarchy

[Figure: levels of the memory hierarchy, from small, fast caches down to large, slow memory and disk]
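This division of labor follows from the average memory access time: AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × memory penalty). A quick worked sketch; all latencies and miss rates below are illustrative assumptions, not measured values.

    #include <stdio.h>

    int main(void) {
        double hit_l1      = 1.0;    /* cycles: L1 must stay small to stay fast */
        double miss_l1     = 0.05;   /* assumed L1 miss rate                    */
        double hit_l2      = 10.0;   /* assumed cycles to hit in L2             */
        double miss_l2     = 0.10;   /* assumed local L2 miss rate              */
        double penalty_mem = 100.0;  /* assumed cycles to reach DRAM            */

        /* Two-level AMAT: the L2 turns most 100-cycle misses into 10-cycle ones. */
        double amat = hit_l1 + miss_l1 * (hit_l2 + miss_l2 * penalty_mem);
        printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05*(10 + 0.10*100) = 2.00 */
        return 0;
    }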


Split Caches

• Instructions and data have different properties
  – May benefit from different cache organizations (block size, associativity, …)

[Figure: CPU feeding a split L1, an ICache (L1) and a DCache (L1), backed by a unified L2 cache (L3, L4, …?), then main memory]

Questions

• Will the miss rate of an L2 cache be higher or lower than for the L1 cache?

• Claim: “Main memory is really the lowest-level cache.” What are reasons in favor of and against this statement?


What does an address refer to?

The old way:
• An address refers to a specific location in main memory (DRAM).
• This is called a physical address.

[Figure: CPU → physical address → Cache → Memory]

Virtual Memory: Main idea

• The CPU works with (fake) virtual addresses; the operating system translates them to physical addresses.
• Advantages:
• Problems with this (new challenge):

[Figure: CPU → virtual address → OS translation → physical address → Cache → Memory]

Pages and virtual address translation

• Virtual AND physical addresses are divided into blocks called pages.
• A typical page size is 4 KiB (meaning 12 bits for the page offset).

Page Tables

• The translation from virtual to physical pages is stored in the page table.

[Figure: a virtual page number indexing the page table, yielding a physical page in memory or a location on disk]
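A minimal sketch of the split and lookup. The page-table contents here are made-up illustration values; a real page table lives in memory and is managed by the operating system.

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_BITS 12u                 /* 4 KiB pages -> 12-bit page offset */
    #define PAGE_SIZE (1u << PAGE_BITS)

    /* Toy page table: index = virtual page number (VPN),
       value = physical page number (PPN). Made-up mappings. */
    static const uint32_t page_table[8] = {5, 2, 7, 0, 3, 6, 1, 4};

    int main(void) {
        uint32_t vaddr  = 0x00002ABCu;
        uint32_t vpn    = vaddr >> PAGE_BITS;          /* VPN = 2          */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);     /* offset = 0xABC   */
        uint32_t ppn    = page_table[vpn];             /* translate: PPN 7 */
        uint32_t paddr  = (ppn << PAGE_BITS) | offset; /* offset unchanged */
        printf("vaddr=0x%05x -> vpn=%u, offset=0x%03x -> paddr=0x%05x\n",
               vaddr, vpn, offset, paddr);
        return 0;
    }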

Pages: virtual memory blocks

• Page faults: the data is not in memory, so retrieve it from disk
  – huge miss penalty (disk is slow), thus
  – pages should be fairly large
• Replacement strategy:
  – the huge penalty means faults can be handled in software (by the OS) instead of hardware
• Writeback or write-through? (Write-back: writing every store through to disk would be far too slow.)

Address Translation

Terminology, by analogy with caches:
• Cache block ↔ page
• Cache miss ↔ page fault
• Cache tag ↔ virtual page number
• Byte offset ↔ page offset


Making Address Translation Fast

• A cache for address translations: the translation lookaside buffer (TLB).
• Typical values: 16–512 PTEs (page table entries); miss rate: 0.01%–1%; miss penalty: 10–100 cycles.

Virtual Memory Take-Aways

• CPU/programs deal with virtual addresses (virtual page number + page offset).
• These are translated to physical addresses (physical page number + page offset) between the CPU and the cache.
• Memory is divided into blocks called pages, commonly 4 KiB (therefore 12 bits for the page offset).
• Page tables, managed by the operating system for each process, store the virtual-to-physical page number mapping, as well as that process’s permissions (read/write).
• The TLB is a special CPU cache for page table lookups.
• Physical addresses can reside in DRAM (typical), or be stored on disk (making RAM “look” larger to the CPU), or can even refer to other devices (memory-mapped I/O).
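A minimal sketch of the TLB lookup logic, assuming a tiny fully associative TLB with made-up entries (a real TLB is a hardware structure; this only shows the behavior):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_BITS   12u
    #define TLB_ENTRIES 4u   /* real TLBs: roughly 16-512 entries */

    struct tlb_entry {
        bool     valid;
        uint32_t vpn;   /* virtual page number acts as the tag */
        uint32_t ppn;   /* the cached translation              */
    };

    /* Made-up contents for illustration. */
    static struct tlb_entry tlb[TLB_ENTRIES] = {
        {true, 0x12345, 0x00042}, {true, 0x00007, 0x00310},
        {false, 0, 0},            {true, 0x0BEEF, 0x00001},
    };

    /* Hit: translate directly. Miss: we would walk the page table in
       memory (the 10-100 cycle miss penalty) and refill an entry. */
    static bool tlb_lookup(uint32_t vaddr, uint32_t *paddr) {
        uint32_t vpn = vaddr >> PAGE_BITS;
        for (unsigned i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *paddr = (tlb[i].ppn << PAGE_BITS)
                       | (vaddr & ((1u << PAGE_BITS) - 1));
                return true;
            }
        }
        return false;
    }

    int main(void) {
        uint32_t paddr;
        if (tlb_lookup(0x12345ABCu, &paddr))
            printf("hit:  paddr=0x%08x\n", paddr);  /* 0x00042abc */
        if (!tlb_lookup(0x00001000u, &paddr))
            printf("miss: would walk the page table\n");
        return 0;
    }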

Modern Systems

Program Design: 2D array layout

• Consider this C declaration:

    int A[4][3] = { {10, 11, 12},
                    {20, 21, 22},
                    {30, 31, 32},
                    {40, 41, 42} };

• How is this array stored in memory?
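C stores 2D arrays in row-major order: the rows sit one after another, so A[i][j] lives at byte offset (i*3 + j) * sizeof(int) from the start of A. A small program to confirm this (the printed addresses vary from run to run, but they increase by sizeof(int) in exactly this order):

    #include <stdio.h>

    int main(void) {
        int A[4][3] = { {10, 11, 12}, {20, 21, 22}, {30, 31, 32}, {40, 41, 42} };

        /* Walk in row-major order: addresses are consecutive, with no
           gap between the end of one row and the start of the next. */
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 3; j++)
                printf("A[%d][%d] at %p = %d\n", i, j, (void *)&A[i][j], A[i][j]);
        return 0;
    }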


Program Design for Caches – Example 1

• Option #1:
    for (j = 0; j < 20; j++)
        for (i = 0; i < 200; i++)
            x[i][j] = x[i][j] + 1;

• Option #2:
    for (i = 0; i < 200; i++)
        for (j = 0; j < 20; j++)
            x[i][j] = x[i][j] + 1;

Program Design for Caches – Example 2

• Why might this code be problematic?
    int A[1024][1024];
    int B[1024][1024];
    for (i = 0; i < 1024; i++)
        for (j = 0; j < 1024; j++)
            A[i][j] += B[i][j];

• How to fix it?
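An answer sketch for both examples, given the row-major layout above. In Example 1, Option #2 walks x row by row through consecutive addresses, so it uses every word of each cache block before moving on; Option #1 strides down columns and can miss on nearly every access. Example 2 already has the good loop order, but each array is exactly 4 MiB, so when the two arrays end up a power-of-two distance apart in memory, A[i][j] and B[i][j] can map to the same set of a direct-mapped cache and evict each other on every iteration (conflict misses). One common fix is to pad the rows so the strides are no longer powers of two; the pad of 8 ints below is an illustrative choice, not a tuned value.

    #define N   1024
    #define PAD 8   /* illustrative padding: 32 extra bytes per row */

    /* Padded declarations: each row is now 4128 bytes instead of 4096,
       so corresponding elements of A and B no longer collide in the
       same cache set. */
    int A[N][N + PAD];
    int B[N][N + PAD];

    void add_arrays(void) {
        for (int i = 0; i < N; i++)       /* row-major order, as in Option #2 */
            for (int j = 0; j < N; j++)   /* touch only the N real columns    */
                A[i][j] += B[i][j];
    }

    int main(void) {
        add_arrays();
        return 0;
    }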


Concluding Remarks

• Fast memories are small, large memories are slow
  – We really want fast, large memories
  – Caching gives this illusion
• Principle of locality
  – Programs use a small part of their memory space frequently
• Memory hierarchy
  – L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk
• Memory system design is critical for multiprocessors

