Cache Design
Chapter 5 and Appendix

[Figure: CPU (400 MHz), cache, and main memory (10 MHz) connected by 66 MHz buses]

[Figure: data objects transfer between the CPU and the cache; blocks transfer between the cache and main memory]

Overview

• Problem
  – CPU vs. memory performance imbalance
• Solution
  – Driven by temporal and spatial locality
  – Memory hierarchies
    • Fast L1, L2, L3 caches
    • Larger but slower memories
    • Even larger but even slower secondary storage
  – Keep most of the action in the higher levels

Example: Two-level Hierarchy

[Figure: average access time versus hit ratio; access time falls from T1 + T2 at a hit ratio of 0 toward T1 at a hit ratio of 1]

Locality of Reference

• Temporal and spatial
• Sequential access to memory
• Unit-stride loop (cache lines = 256 bits)
    for (i = 1; i < 100000; i++)
      sum = sum + a[i];
• Non-unit stride loop (cache lines = 256 bits)
    for (i = 0; i <= 100000; i = i+8)
      sum = sum + a[i];

Basic Cache Read Operation

• CPU requests contents of memory location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read required block from main memory to cache
• Then deliver from cache to CPU
• Cache includes tags to identify which block of main memory is in each cache slot

Elements of Cache Design

• Cache size
• Line (block) size
• Number of caches
• Mapping function
  – Block placement
  – Block identification
• Replacement algorithm
• Write policy

Number of Caches

• Increased logic density => on-chip cache
  – Internal cache: level 1 (L1)
  – External cache: level 2 (L2)
• Unified cache
  – Balances the load between instruction and data fetches
  – Only one cache needs to be designed / implemented
• Split caches (data and instruction)
  – Pipelined, parallel architectures

Cache Size

• Cache size << main memory size
• Small enough
  – Minimize cost
  – Speed up access (fewer gates to address the cache)
  – Keep cache on chip
• Large enough
  – Minimize average access time
• Optimum size depends on the workload
• Practical size?

Mapping Function

• Cache lines << main memory blocks
• Direct mapping
  – Maps each block into only one possible line
  – (block address) MOD (number of lines)
• Fully associative
  – Block can be placed anywhere in the cache
• Set associative
  – Block can be placed in a restricted set of lines
  – (block address) MOD (number of sets in cache)

Line Size

• Optimum size depends on workload
• Small blocks do not use the locality of reference principle
• Larger blocks reduce the number of blocks
  – Replacement overhead
• Practical sizes?

Cache Addressing

• Address = block address (tag + index) + block offset
• Block offset – selects the data object from the block
• Index – selects the block set
• Tag – used to detect a hit

[Figure: tag, index, and block offset fields of the address indexing the cache and main memory]
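As a minimal sketch of how these fields can be pulled out of a byte address, the following C fragment assumes a hypothetical cache with 32-byte blocks and 256 sets; neither parameter comes from the slides.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed, illustrative parameters (not from the slides). */
    #define BLOCK_SIZE   32u     /* bytes per block   */
    #define NUM_SETS     256u    /* sets in the cache */
    #define OFFSET_BITS  5u      /* log2(BLOCK_SIZE)  */
    #define INDEX_BITS   8u      /* log2(NUM_SETS)    */

    int main(void) {
        uint32_t addr = 0x12345678u;

        /* Block offset: selects the data object within the block. */
        uint32_t offset = addr & (BLOCK_SIZE - 1);

        /* Index: (block address) MOD (number of sets) selects the set. */
        uint32_t block_addr = addr >> OFFSET_BITS;
        uint32_t index = block_addr % NUM_SETS;

        /* Tag: the remaining high-order bits, compared to detect a hit. */
        uint32_t tag = block_addr >> INDEX_BITS;

        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }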

Direct Mapping

[Figure: direct-mapped cache organization]

Replacement Algorithm

• Simple for direct-mapped: no choice
• Random
  – Simple to build in hardware
• LRU

Miss rates for LRU vs. random replacement:

                Two-way           Four-way          Eight-way
    Size     LRU      Random    LRU      Random    LRU      Random
    16 KB    5.18%    5.69%     4.67%    5.29%     4.39%    4.96%
    64 KB    1.88%    2.01%     1.54%    1.66%     1.39%    1.53%
    256 KB   1.15%    1.17%     1.13%    1.13%     1.12%    1.12%
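As a rough illustration of LRU in a set-associative cache, the sketch below picks a victim way using per-line age counters; the 4-way organization and the counter scheme are assumptions made for the example, not something the slides specify.

    #include <stdint.h>

    #define WAYS 4   /* assumed associativity for the sketch */

    struct line {
        int      valid;
        uint32_t tag;
        uint32_t last_used;   /* global access counter value at last touch */
    };

    /* Returns the way to evict from one set: an invalid line if any,
     * otherwise the least-recently-used (smallest last_used) line. */
    int lru_victim(const struct line set[WAYS]) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid)
                return w;                    /* free slot: no replacement needed */
            if (set[w].last_used < set[victim].last_used)
                victim = w;                  /* older line becomes the candidate */
        }
        return victim;
    }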

Associative Mapping

[Figure: fully associative cache organization]

Write Policy

• Write is more complex than read
  – Write and tag comparison cannot proceed simultaneously
  – Only a portion of the line has to be updated
• Write policies
  – Write through – write to the cache and memory
  – Write back – write only to the cache (dirty bit)
• Write miss
  – Write allocate – load block on a write miss
  – No-write allocate – update directly in memory
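The sketch below contrasts the two write-hit policies; the line structure and the memory_write stand-in are hypothetical, used only to make the control flow concrete.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 32   /* assumed block size for the sketch */

    struct line {
        int      valid;
        int      dirty;               /* meaningful only for write-back */
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    };

    /* Hypothetical stand-in for a word write to main memory. */
    static void memory_write(uint32_t addr, uint32_t value) { (void)addr; (void)value; }

    /* Write hit handled under the two policies from the slide. */
    void write_hit(struct line *l, uint32_t addr, uint32_t offset,
                   uint32_t value, int write_through) {
        memcpy(&l->data[offset], &value, sizeof value);   /* update the cache copy */
        if (write_through)
            memory_write(addr, value);   /* write through: memory updated as well    */
        else
            l->dirty = 1;                /* write back: mark line, flush on eviction */
    }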

K-Way Set Associative Mapping

[Figure: k-way set associative cache organization]

Alpha AXP 21064 Cache

[Figure: the CPU address is split into a 21-bit tag, 8-bit index, and 5-bit offset; each cache entry holds a valid bit, a tag, and 256 bits of data; a tag comparator (=?) detects a hit, and a write buffer sits between the cache and the lower-level memory]

Write Merging

[Figure: a four-entry write buffer, each entry holding a write address plus four valid bits. Without merging, sequential writes to addresses 100, 104, 108, and 112 occupy four entries with one valid word each; with merging, all four words collapse into the single entry for address 100 with all four valid bits set]

Cache Performance Improvements

• Average memory-access time
  = Hit time + Miss rate x Miss penalty
• Cache optimizations
  – Reducing the miss rate
  – Reducing the miss penalty
  – Reducing the hit time
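A compact sketch of the merging idea: before allocating a new write-buffer entry, check whether the word falls inside a block that is already buffered. The entry layout and sizes are assumptions for illustration only.

    #include <stdint.h>

    #define ENTRIES         4
    #define WORDS_PER_ENTRY 4
    #define WORD_BYTES      4

    struct wb_entry {
        int      used;
        uint32_t block_addr;                 /* address of the first word in the entry */
        uint32_t word[WORDS_PER_ENTRY];
        int      valid[WORDS_PER_ENTRY];
    };

    /* Try to merge a word write into an existing entry; else allocate a new one. */
    int buffer_write(struct wb_entry buf[ENTRIES], uint32_t addr, uint32_t value) {
        uint32_t block = addr & ~(uint32_t)(WORDS_PER_ENTRY * WORD_BYTES - 1);
        int slot = (addr - block) / WORD_BYTES;

        for (int i = 0; i < ENTRIES; i++)
            if (buf[i].used && buf[i].block_addr == block) {   /* merge into entry */
                buf[i].word[slot] = value;
                buf[i].valid[slot] = 1;
                return 1;
            }

        for (int i = 0; i < ENTRIES; i++)
            if (!buf[i].used) {                                /* allocate new entry */
                buf[i].used = 1;
                buf[i].block_addr = block;
                buf[i].word[slot] = value;
                buf[i].valid[slot] = 1;
                return 1;
            }
        return 0;   /* buffer full: caller must stall until an entry drains */
    }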

DECstation 5000 Miss Rates

[Figure: miss rate (%) versus cache size from 1 KB to 128 KB for the instruction cache, data cache, and unified cache; direct-mapped caches with 32-byte blocks]
• Percentage of instruction references is 75%

Example

Which has the lower average memory access time:
• a 16-KB instruction cache with a 16-KB data cache, or
• a 32-KB unified cache?

Hit time = 1 cycle
Miss penalty = 50 cycles
Load/store hit = 2 cycles on a unified cache
Given: 75% of memory accesses are instruction references.

Overall miss rate for split caches = 0.75*0.64% + 0.25*6.47% = 2.10%
Miss rate for unified cache = 1.99%

Average memory access times:
Split = 0.75 * (1 + 0.0064 * 50) + 0.25 * (1 + 0.0647 * 50) = 2.05
Unified = 0.75 * (1 + 0.0199 * 50) + 0.25 * (2 + 0.0199 * 50) = 2.24

Cache Performance Measures

• Hit rate: fraction found in that level
  – So high that we usually talk about the miss rate
  – Miss rate can be misleading, just as MIPS is for CPU performance
• Average memory-access time
  = Hit time + Miss rate x Miss penalty (ns)
• Miss penalty: time to replace a block from the lower level, including time to replace in the CPU
  – Access time to lower level = f(latency to lower level)
  – Transfer time: time to transfer the block = f(bandwidth)

Cache Performance Equations

CPU time = (CPU execution cycles + Memory stall cycles) * Cycle time
Memory stall cycles = Memory accesses * Miss rate * Miss penalty
CPU time = IC * (CPI_execution + Memory accesses per instruction * Miss rate * Miss penalty) * Cycle time
Misses per instruction = Memory accesses per instruction * Miss rate
CPU time = IC * (CPI_execution + Misses per instruction * Miss penalty) * Cycle time
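These equations translate directly into a small helper; the example call below re-derives the split-cache number from the earlier example (hit time 1 cycle, miss rates 0.64% and 6.47%, miss penalty 50 cycles).

    #include <stdio.h>

    /* Average memory-access time = Hit time + Miss rate * Miss penalty */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    /* CPU time = IC * (CPI_execution + Misses per instr * Miss penalty) * Cycle time */
    double cpu_time(double ic, double cpi_exec,
                    double misses_per_instr, double miss_penalty,
                    double cycle_time) {
        return ic * (cpi_exec + misses_per_instr * miss_penalty) * cycle_time;
    }

    int main(void) {
        /* Split-cache case from the example: 75% instruction references. */
        double split = 0.75 * amat(1, 0.0064, 50) + 0.25 * amat(1, 0.0647, 50);
        printf("split AMAT = %.2f cycles\n", split);   /* about 2.05 */
        return 0;
    }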

Reducing Miss Penalty

• Multi-level caches
• Critical word first and early restart
• Priority to read misses over writes
• Merging write buffers
• Victim caches

Critical Word First and Early Restart

• Critical word first: request the missed word first from memory
• Early restart: fetch in normal order, but as soon as the requested word arrives, send it to the CPU

Multi-Level Caches

• Avg memory access time = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)
• Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)
• Avg memory access time = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))
• Local miss rate: number of misses in a cache divided by the total number of accesses to that cache
• Global miss rate: number of misses in a cache divided by the total number of memory accesses generated by the CPU

Giving Priority to Read Misses over Writes

    SW R3, 512(R0)
    LW R1, 1024(R0)
    LW R2, 512(R0)

• Direct-mapped, write-through cache mapping 512 and 1024 to the same block, and a four-word write buffer
• Will R2 = R3?
• Priority for read miss?
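To make the two-level formula concrete, here is a tiny calculation with made-up numbers (L1 hit 1 cycle, 4% L1 miss rate, L2 hit 10 cycles, 20% L2 local miss rate, 100-cycle L2 miss penalty); none of these values come from the slides.

    #include <stdio.h>

    int main(void) {
        /* Assumed, illustrative parameters (not from the slides). */
        double l1_hit = 1.0, l1_miss_rate = 0.04;
        double l2_hit = 10.0, l2_local_miss_rate = 0.20, l2_miss_penalty = 100.0;

        /* Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2) */
        double l1_miss_penalty = l2_hit + l2_local_miss_rate * l2_miss_penalty;

        /* Avg memory access time = Hit time(L1) + Miss rate(L1) * Miss penalty(L1) */
        double amat = l1_hit + l1_miss_rate * l1_miss_penalty;

        printf("L1 miss penalty = %.1f cycles, AMAT = %.2f cycles\n",
               l1_miss_penalty, amat);   /* 30.0 cycles, 2.20 cycles */
        return 0;
    }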

Performance of Multi-Level Caches

Victim Caches

• A small, fully associative cache that holds blocks recently evicted from the cache

Types of Cache Misses

• Compulsory
  – First reference or cold start misses
• Capacity
  – The working set is too big for the cache
  – Occur even in fully associative caches
• Conflict (collision)
  – Many blocks map to the same block frame (line)
  – Affects
    • Set associative caches
    • Direct mapped caches

Reducing Miss Rates: 1. Larger Block Size

• Effects of larger block sizes
  – Reduction of compulsory misses
    • Spatial locality
  – Increase of miss penalty (transfer time)
  – Reduction of the number of blocks
    • Potential increase of conflict misses
• Latency and bandwidth of lower-level memory
  – High latency and bandwidth => large block size
    • Small increase in miss penalty

Miss Rates: Absolute and Distribution

Example

Reducing the Miss Rates

1. Larger block size
2. Larger caches
3. Higher associativity
4. Pseudo-associative caches
5. Compiler optimizations

2. Larger Caches

• More blocks
• Higher probability of getting the data
• Longer hit time and higher cost
• Primarily used in 2nd-level caches

3. Higher Associativity

• Eight-way set associative is good enough
• 2:1 Cache Rule:
  – Miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2
• Higher associativity can increase
  – Clock cycle time
  – Hit time for 2-way vs. 1-way: external cache +10%, internal +2%

5. Compiler Optimizations

• Avoid hardware changes
• Instructions
  – Profiling to look at conflicts between groups of instructions
• Data
  – Loop interchange: change nesting of loops to access data in the order stored in memory
  – Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

4. Pseudo-Associative Caches

• Fast hit time of direct mapped and lower conflict misses of 2-way set-associative cache?
• Divide cache: on a miss, check the other half of the cache to see if the block is there; if so, have a pseudo-hit (slow hit)

    Hit time | Pseudo hit time | Miss penalty

• Drawback:
  – CPU pipeline design is hard if a hit takes 1 or 2 cycles
  – Better for caches not tied directly to the processor (L2)
  – Used in the MIPS R10000 L2 cache; similar in UltraSPARC

Loop Interchange

    /* Before */
    for (j = 0; j < 100; j = j+1)
      for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

    /* After */
    for (i = 0; i < 5000; i = i+1)
      for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];

• Sequential accesses instead of striding through memory every 100 words; improved spatial locality
• Same number of executed instructions

Pseudo Associative Cache

[Figure: CPU address and data paths into the cache; on a miss in the primary location (1), the alternative location is checked (2), and only then is the lower-level memory accessed (3); a write buffer sits between the cache and the lower-level memory]

Blocking (1/2)

    /* Before */
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = r;
      };

• Two inner loops:
  – Read all NxN elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity misses are a function of N and cache size:
  – 3 NxN 4-byte matrices fit => no capacity misses
  – Idea: compute on a BxB submatrix that fits

Blocking (2/2)

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
      for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
          for (j = jj; j < min(jj+B-1,N); j = j+1) {
            r = 0;
            for (k = kk; k < min(kk+B-1,N); k = k+1)
              r = r + y[i][k]*z[k][j];
            x[i][j] = x[i][j] + r;
          };

2. Hardware Prefetching

• Instruction prefetching
  – Alpha 21064 fetches 2 blocks on a miss
  – Extra block placed in a stream buffer
  – On a miss, check the stream buffer

Reducing Cache Miss Penalty or Miss Rate via Parallelism

1. Nonblocking caches
2. Hardware prefetching
3. Compiler-controlled prefetching

3. Compiler-Controlled Prefetching

• Compiler inserts data prefetch instructions
  – Load data into register (HP PA-RISC loads)
  – Cache prefetch: load into cache (MIPS IV, PowerPC)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
• Nonblocking cache: overlap execution with prefetch
• Issuing prefetch instructions takes time
  – Is the cost of prefetch issues < the savings in reduced misses?
  – Higher superscalar reduces difficulty of issue bandwidth
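As an illustration of compiler-visible prefetching (not something shown on the slides), GCC and Clang expose a software prefetch through the __builtin_prefetch intrinsic; a sketch follows, with the 16-element prefetch distance chosen arbitrarily.

    #include <stddef.h>

    #define PREFETCH_DISTANCE 16   /* assumed look-ahead; would be tuned per machine */

    double sum_array(const double *a, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DISTANCE < n)
                /* Hint: bring a[i+16] toward the cache before it is needed.
                   0 = read access, 3 = high temporal locality. */
                __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
            sum += a[i];
        }
        return sum;
    }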

1. Nonblocking Cache

• Out-of-order execution
  – Proceeds with the next fetches while waiting for data to come

Reducing Hit Time

1. Small and simple caches
2. Avoiding address translation during indexing of the cache

Small and Simple Caches

Main Memory Background

• Performance of main memory:
  – Latency: cache miss penalty
    • Access time: time between request and word arrival
    • Cycle time: time between requests
  – Bandwidth: I/O & large block miss penalty (L2)
• Main memory is DRAM: Dynamic Random Access Memory
  – Dynamic since it needs to be refreshed periodically
  – Addresses divided into 2 halves (memory as a 2D matrix):
    • RAS or Row Access Strobe
    • CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor/bit, area is 10X)
  – Address not divided: full address
• Size: DRAM/SRAM is 4-8; cost & cycle time: SRAM/DRAM is 8-16

Avoiding Address Translation during Indexing

• Virtual vs. physical cache
• What happens when address translation is done
• Page offset bits to index the cache

Main Memory Organizations

• Simple: CPU – cache – bus (32/64 bits) – memory
• Wide: CPU – multiplexor – cache – bus (256/512 bits) – memory
• Interleaved: CPU – cache – bus – memory banks 0 to 3

TLB and Cache Operation

[Figure: the virtual address (page # + offset) indexes the TLB; on a TLB hit the real address (tag + remainder) is used for the cache lookup, on a TLB miss the page table supplies the translation; a cache hit returns the value, a cache miss goes to main memory]

Interleaved Memory
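A minimal sketch of word-level interleaving across banks, assuming four banks and 4-byte words; the four-bank count matches the timing example on the next slide, while the address mapping itself is an assumption.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BANKS  4
    #define WORD_BYTES 4

    int main(void) {
        /* Consecutive words fall into consecutive banks, so a 4-word block
           can be fetched with the bank accesses overlapped. */
        for (uint32_t addr = 0; addr < 8 * WORD_BYTES; addr += WORD_BYTES) {
            uint32_t word = addr / WORD_BYTES;
            uint32_t bank = word % NUM_BANKS;   /* which bank holds the word  */
            uint32_t row  = word / NUM_BANKS;   /* word index within the bank */
            printf("addr %2u -> bank %u, row %u\n", addr, bank, row);
        }
        return 0;
    }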

Performance

• Timing model (word size is 32 bits)
  – 1 to send the address
  – 6 access time, 1 to send data
  – Cache block is 4 words
• Simple miss penalty = 4 x (1 + 6 + 1) = 32
• Wide miss penalty = 1 + 6 + 1 = 8
• Interleaved miss penalty = 1 + 6 + 4 x 1 = 11