
ECE 4750, Fall 2020: T04 Fundamental Memory Microarchitecture

School of Electrical and Computer Engineering, Cornell University

revision: 2020-10-28-14-41

1. Memory Microarchitectural Design Patterns
   1.1. Transactions and Steps
   1.2. Microarchitecture Overview

2. FSM Cache
   2.1. High-Level Idea for FSM Cache
   2.2. FSM Cache Datapath
   2.3. FSM Cache Control Unit
   2.4. Analyzing Performance

3. Pipelined Cache
   3.1. High-Level Idea for Pipelined Cache
   3.2. Pipelined Cache Datapath and Control Unit
   3.3. Analyzing Performance
   3.4. Pipelined Cache with TLB

4. Cache Microarchitecture Optimizations
   4.1. Reduce Hit Time
   4.2. Reduce Miss Rate
   4.3. Reduce Miss Penalty
   4.4. Cache Optimization Summary

5. Case Study: ARM Cortex A8 and Intel Core i7
   5.1. ARM Cortex A8
   5.2. Intel Core i7

Copyright © 2018 Christopher Batten, Christina Delimitrou. All rights reserved. This handout was originally prepared by Prof. Christopher Batten at Cornell University for ECE 4750 / CS 4420. It has since been updated by Prof. Christina Delimitrou in 2017-2020. Download and use of this handout is permitted for individual educational non-commercial purposes only. Redistribution either in part or in whole by either commercial or non-commercial means requires written permission.


1. Memory Microarchitectural Design Patterns

\[
\frac{\text{Time}}{\text{Sequence}} =
\frac{\text{Mem Accesses}}{\text{Sequence}} \times
\frac{\text{Avg Cycles}}{\text{Mem Access}} \times
\frac{\text{Time}}{\text{Cycle}}
\]

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} =
\frac{\text{Avg Cycles}}{\text{Hit}} +
\frac{\text{Num Misses}}{\text{Num Accesses}} \times
\frac{\text{Avg Extra Cycles}}{\text{Miss}}
\]

Microarchitecture        Hit Latency   Extra Accesses for Translation
FSM Cache                >1            1+
Pipelined Cache          ≈1            1+
Pipelined Cache + TLB    ≈1            ≈0
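As a quick worked example with made-up numbers (not from this handout): if a hit takes 1 cycle, 1 in 10 accesses misses, and each miss costs an extra 20 cycles, then

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} = 1 + \frac{1}{10} \times 20 = 3 ,
\]

and a sequence of 100 accesses with a 2 ns cycle time takes 100 × 3 × 2 ns = 600 ns.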

1.1. Transactions and Steps

• We can think of each memory access as a transaction
• Executing a memory access involves a sequence of steps
  – Check Tag: Check one or more tags in cache
  – Select Victim: Select victim line from cache using replacement policy
  – Evict Victim: Evict victim line from cache and write victim to memory
  – Refill: Refill requested line by reading line from memory
  – Write Mem: Write requested word to memory
  – Access Data: Read or write requested word in cache
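To make the steps concrete, here is a small C sketch (illustrative only; the enum and array names are mine) of the step sequences a write-through, no-write-allocate cache might follow for each kind of transaction:

#include <stdio.h>

// Steps from Section 1.1, encoded as an enum (names invented for this sketch).
typedef enum { CHECK_TAG, SELECT_VICTIM, EVICT_VICTIM, REFILL,
               WRITE_MEM, ACCESS_DATA, DONE } step_t;

// One plausible set of step sequences for a write-through, no-write-allocate
// cache (no eviction is ever needed since lines are never dirty).
static const step_t read_hit[]   = { CHECK_TAG, ACCESS_DATA, DONE };
static const step_t read_miss[]  = { CHECK_TAG, SELECT_VICTIM, REFILL,
                                     ACCESS_DATA, DONE };
static const step_t write_hit[]  = { CHECK_TAG, ACCESS_DATA, WRITE_MEM, DONE };
static const step_t write_miss[] = { CHECK_TAG, WRITE_MEM, DONE }; // no allocate

int main( void ) {
  static const char *names[]  = { "Check Tag", "Select Victim", "Evict Victim",
                                  "Refill", "Write Mem", "Access Data" };
  const step_t      *seqs[]   = { read_hit, read_miss, write_hit, write_miss };
  const char        *labels[] = { "read hit", "read miss",
                                  "write hit", "write miss" };
  for ( int i = 0; i < 4; i++ ) {
    printf( "%s:", labels[i] );
    for ( const step_t *s = seqs[i]; *s != DONE; s++ )
      printf( " [%s]", names[*s] );   // print each step of this transaction
    printf( "\n" );
  }
  return 0;
}

A write-back, write-allocate cache would additionally perform Evict Victim before Refill whenever the victim line is dirty, for both read and write misses.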


Steps for Write-Through with No Write Allocate

Steps for Write-Back with Write Allocate

1.2. Microarchitecture Overview

[Figure: overall memory system organization. The processor issues 32b mmureq/mmuresp messages to a Memory Management Unit, which issues 32b cachereq/cacheresp messages to the cache. The cache contains a control unit and a datapath (a tag array and a data array) connected by control/status signals, and communicates with main memory over 128b mreq/mresp messages. Main memory takes >1 cycle and is modeled as combinational.]


2. FSM Cache

\[
\frac{\text{Time}}{\text{Sequence}} =
\frac{\text{Mem Accesses}}{\text{Sequence}} \times
\frac{\text{Avg Cycles}}{\text{Mem Access}} \times
\frac{\text{Time}}{\text{Cycle}}
\]

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} =
\frac{\text{Avg Cycles}}{\text{Hit}} +
\frac{\text{Num Misses}}{\text{Num Accesses}} \times
\frac{\text{Avg Extra Cycles}}{\text{Miss}}
\]

Microarchitecture        Hit Latency   Extra Accesses for Translation
FSM Cache                >1            1+
Pipelined Cache          ≈1            1+
Pipelined Cache + TLB    ≈1            ≈0

Assumptions
• Page-based translation, no TLB, physically addressed cache
• Single-ported combinational SRAMs for tag and data storage
• Unrealistic combinational main memory
• Cache requests are 4 B

Configuration
• Four 16 B cache lines
• Two-way set-associative
• Replacement policy: LRU
• Write policy: write-through, no write allocate

[Figure: the same block diagram as in Section 1.2: processor/MMU interface (mmureq/mmuresp, 32b), cache with control unit and tag/data arrays (cachereq/cacheresp, 32b), and combinational, >1 cycle main memory (mreq/mresp, 128b).]


2.1. High-Level Idea for FSM Cache

[Figure: transaction steps as a flow chart. Check Tag leads to Access Data on a read hit; on a read miss it leads to Select Victim, then Refill, then Access Data; a write goes to Write Mem (and also accesses the data array on a write hit).]

[Figure: cycle-by-cycle view of the FSM cache executing a sequence of transactions (read hits, a write hit, and a read miss). Each transaction occupies the cache for several cycles, e.g. a read hit takes Check Tag then Access Data, and a read miss takes Check Tag, Select Victim, Refill, then Access Data; no new transaction starts until the previous one finishes.]


2.2. FSM Cache Datapath

As with processors, we design our cache datapath by incrementally adding support for each transaction and resolving conflicts with muxes.

Implementing READ transactions that hit

[Figure: FSM cache datapath for read hits. The cachereq address register feeds a tag/index/offset split (27b tag, 1b index, 2b word offset, 2b byte offset of 00). The index reads both tag arrays (Way 0 and Way 1); each stored tag is compared against the request tag to produce tarray0_match/tarray1_match and the hit signal. The data array is read and the offset selects one 32b word out of the 128b line ([127:96], [95:64], [63:32], [31:0]) for cacheresp.data. Control signals: tarray0_en, tarray1_en, darray_en.]

MT: Check tag
MRD: Read data array, return cacheresp

Implementing READ transactions that miss

[Figure: FSM cache datapath extended for read misses. Adds to the read-hit datapath: a victim mux (tarray_sel, driven by the victim signal) to choose which way to refill, tag array write enables (tarray0_wen, tarray1_wen), a path that forms memreq.addr from the request tag and index, and a path that writes the 128b memresp.data into the data array (darray_wen, word enable 1111, z4b/z4b_sel).]

MT: Check tag
MRD: Read data array, return cacheresp
R0: Send refill memreq, get memresp
R1: Write data array with refill cache line


Implementing WRITE transactions that miss

[Figure: FSM cache datapath extended for write misses. Adds a path from cachereq.data (32b, zero-extended to 128b) to memreq.data so the write can be sent to memory alongside the refill path; control signals darray_en, darray_wen, z4b_sel, word enable.]

MT: Check tag
MRD: Read data array, return cacheresp
R0: Send refill memreq, get memresp
R1: Write data array with refill cache line
MWD: Send write memreq, write data array

Implementing WRITE transactions that hit

[Figure: FSM cache datapath extended for write hits. Adds a repl unit that replicates the 32b cachereq.data across the 128b line, a word_write_sel / darray_write_en_sel mux for the per-word write enables, and the data-array write path so that a write hit can update the data array in the MWD state as well as send the write memreq.]

MT: Check tag
MRD: Read data array, return cacheresp
R0: Send refill memreq, get memresp
R1: Write data array with refill cache line
MWD: Send write memreq, write data array


2.3. FSM Cache Control Unit

[Figure: FSM state diagram. From MT, a read hit goes to MRD; a read miss goes to R0, then R1, then MRD; a write (hit or miss) goes to MWD; MRD and MWD return to MT.]

We will need to keep valid bits in the control unit, with one valid bit for every cache line. We will also need to keep use bits, updated on every access, to indicate which way was the last "used" line. Assume we create the following two control signals generated by the FSM control unit:

hit    = ( tarray0_match && valid0[idx] ) || ( tarray1_match && valid1[idx] )

victim = !use[idx]
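To tie the state diagram and these signals together, here is a rough C model of the next-state logic (a sketch only: the types and function are invented for illustration, and the real control unit is hardware, not software):

// Sketch of the FSM cache control unit's next-state logic.
// State names and the hit/victim definitions follow the handout;
// everything else is invented for this sketch.
typedef enum { MT, MRD, R0, R1, MWD } state_t;

typedef struct {
  int valid0[2], valid1[2];  // one valid bit per line (2 sets x 2 ways)
  int use[2];                // last-used way per set, for LRU victim selection
} ctrl_state_t;

state_t next_state( state_t state, int cachereq_val, int is_read,
                    int tarray0_match, int tarray1_match,
                    const ctrl_state_t *cs, int idx )
{
  int hit    = ( tarray0_match && cs->valid0[idx] )
            || ( tarray1_match && cs->valid1[idx] );
  int victim = !cs->use[idx];   // way to refill on a miss (used in R0/R1)
  (void) victim;

  switch ( state ) {
    case MT:                              // check tag
      if ( !cachereq_val ) return MT;     // no request: stay in MT
      if ( is_read )       return hit ? MRD : R0;
      return MWD;                         // write-through: writes go to MWD
    case R0:  return R1;                  // send refill memreq, get memresp
    case R1:  return MRD;                 // write refilled line into data array
    case MRD: return MT;                  // read data array, send cacheresp
    case MWD: return MT;                  // send write memreq (update darray on hit)
    default:  return MT;
  }
}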

The control signal table, to be filled in for each state:

        tarray0    tarray1    darray           worden   z4b   memreq      cacheresp
        en   wen   en   wen   en   wen   sel   sel      sel   val   op    val
MT
MRD
R0
R1
MWD


2.4. Analyzing Performance

\[
\frac{\text{Time}}{\text{Sequence}} =
\frac{\text{Mem Accesses}}{\text{Sequence}} \times
\frac{\text{Avg Cycles}}{\text{Mem Access}} \times
\frac{\text{Time}}{\text{Cycle}}
\]

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} =
\frac{\text{Avg Cycles}}{\text{Hit}} +
\frac{\text{Num Misses}}{\text{Num Accesses}} \times
\frac{\text{Avg Extra Cycles}}{\text{Miss}}
\]

Estimating cycle time

[Figure: the complete FSM cache datapath from Section 2.2, repeated here for reference when estimating the cycle time.]

• register read/write = 1τ
• tag array read/write = 10τ
• data array read/write = 10τ
• mem read/write = 20τ
• decoder = 3τ
• comparator = 10τ
• mux = 3τ
• repl unit = 0τ
• z4b = 0τ
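One hedged way to combine these numbers (which components sit on each state's critical path is my assumption, not given in the handout): the FSM cache's cycle time is set by its slowest state, for example

\[
T_{\text{cycle}} = \max_{\text{state}} T_{\text{critical path}},\qquad
T_{\text{MT}} \approx 1\tau_{\text{reg}} + 10\tau_{\text{tag}} + 10\tau_{\text{cmp}} \approx 21\tau,\qquad
T_{\text{R0}} \approx 20\tau_{\text{mem}} + \text{register/mux delays},
\]

so under these assumptions the cycle time is roughly 20-25τ, set by whichever of the tag-check or memory-access states has the longer path.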


Estimating AMAL

Consider the following sequence of memory accesses which might correspond to copying 4 B elements from a source array to a destination array. Each array contains 64 elements. What is the AMAL?

rd 0x1000  wr 0x2000
rd 0x1004  wr 0x2004
rd 0x1008  wr 0x2008
...
rd 0x1040  wr 0x2040

Consider the following sequence of memory accesses which might correspond to incrementing 4 B elements in an array. The array contains 64 elements. What is the AMAL?

rd 0x1000  wr 0x1000
rd 0x1004  wr 0x1004
rd 0x1008  wr 0x1008
...
rd 0x1040  wr 0x1040


3. Pipelined Cache

\[
\frac{\text{Time}}{\text{Sequence}} =
\frac{\text{Mem Accesses}}{\text{Sequence}} \times
\frac{\text{Avg Cycles}}{\text{Mem Access}} \times
\frac{\text{Time}}{\text{Cycle}}
\]

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} =
\frac{\text{Avg Cycles}}{\text{Hit}} +
\frac{\text{Num Misses}}{\text{Num Accesses}} \times
\frac{\text{Avg Extra Cycles}}{\text{Miss}}
\]

Microarchitecture        Hit Latency   Extra Accesses for Translation
FSM Cache                >1            1+
Pipelined Cache          ≈1            1+
Pipelined Cache + TLB    ≈1            ≈0

Assumptions
• Page-based translation, no TLB, physically addressed cache
• Single-ported combinational SRAMs for tag and data storage
• Unrealistic combinational main memory
• Cache requests are 4 B

Configuration
• Four 16 B cache lines
• Direct-mapped
• Replacement policy: LRU
• Write policy: write-through, no write allocate

[Figure: the same block diagram as in Section 1.2: memory management unit (mmureq/mmuresp, 32b), cache with control unit and tag/data arrays (cachereq/cacheresp, 32b), and combinational, >1 cycle main memory (mreq/mresp, 128b).]


3.1. High-Level Idea for Pipelined Cache

[Figure: transaction steps as a flow chart, as in Section 2.1. Check Tag leads to Access Data on a read hit; on a read miss it leads to Select Victim, then Refill, then Access Data; a write goes to Write Mem (and also accesses the data array on a write hit).]

[Figure: cycle-by-cycle comparison of the FSM cache and the pipelined cache on the same sequence of transactions. In the FSM cache each transaction occupies the cache for its full sequence of steps before the next one starts. In the pipelined cache, Check Tag and Access Data (or Write Mem) occur in back-to-back pipeline stages, so independent hits overlap; a read miss still expands into Check Tag, Select Victim, Refill, Access Data and stalls later transactions.]


3.2. Pipelined Cache Datapath and Control Unit

As with processors, we incrementally add support for each transaction and resolve conflicts using muxes.

Implementing READ transactions that hit

[Figure: pipelined cache datapath for read hits. In the M0 stage the cachereq address (26b tag, 2b index, 2b word offset) indexes the tag array and the comparison produces tarray_match. In the M1 stage the data array is read and the offset selects one 32b word of the 128b line ([127:96], [95:64], [63:32], [31:0]) for cacheresp.data. Control signals: tarray_en, darray_en.]

Implementing WRITE transactions that hit

[Figure: pipelined cache datapath extended for write hits. The 32b cachereq.data is replicated across the 128b line by a repl unit and written into the data array in M1 (darray_wen), and the write is also sent to memory through memreq.addr/memreq.data (write-through).]


Implementing transactions that miss

[Figure: pipelined cache datapath extended for misses. The M0 stage gains a small FSM (pipe, R0, R1) that stalls the pipeline on a miss: the tag array gets a write enable (tarray_wen), the refill memreq is sent, and the 128b memresp.data is written into the data array (word enable 1111, z4b/z4b_sel, darray_write_mux_sel) before the transaction proceeds to M1.]

• Hybrid pipeline/FSM design pattern
• Hit path is pipelined with two-cycle hit latency
• Miss path stalls in M0 stage to refill cache line

Pipeline diagram for pipelined cache with 2-cycle hit latency

rd (hit) wr (hit) rd (miss) rd (miss) wr (miss) rd (hit)


Parallel read with pipelined write datapath

[Figure: parallel-read datapath. A parallel_read_sel mux allows a read to access the data array in the M0 stage, in parallel with the tag check, while writes still access the data array in M1; the rest of the datapath (repl unit, miss FSM with R0/R1, refill path) is unchanged.]

Pipeline diagram for parallel read with pipelined write

rd (hit) rd (hit) rd (hit) wr (hit) wr (hit) rd (hit)

• Achieves single-cycle hit latency for reads
• Two-cycle hit latency for writes, but is this latency observable?
• With write acks, send write-ack back in M0 stage
• How do we resolve structural hazards?


Resolving structural hazard by exposing in ISA

rd (hit) wr (hit) nop rd (hit)

Resolving structural hazard with hardware stalling

rd (hit) wr (hit) rd (hit)

ostall_M0 = val_M0 && ( type_M0 == RD ) && val_M1 && ( type_M1 == WR )

Resolving structural hazard with hardware duplication

rd (hit) wr (hit) rd (hit)

Resolving RAW data hazard with software scheduling

Software scheduling: the hazard depends on the memory address, which is difficult to know at compile time!

Resolving RAW data hazard with hardware stalling

ostall_M0 = ...


Resolving RAW data hazard with hardware bypassing

We could use the previous stall signal as our bypass signal, but we will also need a new bypass path in our datapath. Draw this new bypass path on the following datapath diagram.

[Figure: the parallel-read/pipelined-write datapath, annotated with the M0 read address and the M1 write address used to detect the RAW hazard; the new bypass path is to be drawn on this diagram.]


Parallel read and pipelined write in set associative caches

To implement parallel read in set-associative caches, we must speculatively read a line from each way in parallel with tag check. This can be expensive in terms of latency, motivating two-cycle hit latencies for highly associative caches.
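A small C model of that idea (illustrative only; the array shapes and names are mine), for a two-way set-associative cache:

#include <stdint.h>

#define NSETS          2   // four 16 B lines, two-way set-associative
#define WORDS_PER_LINE 4   // 16 B lines of 4 B words

// Both ways' data words are read speculatively while the tags are compared;
// the matching way then steers the data through a late mux (way_sel).
// The miss case is omitted to keep the sketch short.
uint32_t parallel_read( uint32_t tag, int idx, int off,
                        const uint32_t tarray[2][NSETS],
                        const uint32_t darray[2][NSETS][WORDS_PER_LINE] )
{
  uint32_t d0      = darray[0][idx][off];        // speculative read, way 0
  uint32_t d1      = darray[1][idx][off];        // speculative read, way 1
  int      way_sel = ( tarray[1][idx] == tag );  // tag check in parallel
  return way_sel ? d1 : d0;                      // late select after tag check
}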

[Figure: parallel-read/pipelined-write datapath for a two-way set-associative cache. Both tag arrays (Way 0, Way 1) are checked in M0 (tarray0_match, tarray1_match), both data-array ways are read speculatively, and a way_sel mux selects the matching way's data; the repl unit, refill FSM (R0/R1), and write path are as before.]


3.3. Analyzing Performance

\[
\frac{\text{Time}}{\text{Sequence}} =
\frac{\text{Mem Accesses}}{\text{Sequence}} \times
\frac{\text{Avg Cycles}}{\text{Mem Access}} \times
\frac{\text{Time}}{\text{Cycle}}
\]

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} =
\frac{\text{Avg Cycles}}{\text{Hit}} +
\frac{\text{Num Misses}}{\text{Num Accesses}} \times
\frac{\text{Avg Extra Cycles}}{\text{Miss}}
\]

Estimating cycle time

[Figure: the parallel-read/pipelined-write datapath from earlier in this section, repeated here for reference when estimating the cycle time.]

• register read/write = 1τ
• tag array read/write = 10τ
• data array read/write = 10τ
• mem read/write = 20τ
• decoder = 3τ
• comparator = 10τ
• mux = 3τ
• repl unit = 0τ
• z4b = 0τ


Estimating AMAL

Assume a parallel-read/pipelined-write microarchitecture with hardware duplication to resolve the structural hazard and stalling to resolve RAW hazards. Consider the following sequence of memory accesses which might correspond to copying 4 B elements from a source array to a destination array. Each array contains 64 elements. What is the AMAL?

rd 0x1000  wr 0x2000
rd 0x1004  wr 0x2004
rd 0x1008  wr 0x2008
rd 0x100c  wr 0x200c
rd 0x1010


3.4. Pipelined Cache with TLB

How should we integrate an MMU (TLB) into a pipelined cache?

Physically Addressed Caches

Perform memory translation before cache access

[Figure: the processor sends a virtual address (VA) to the TLB; the resulting physical address (PA) is used to access the cache and then main memory.]

• Advantages:
  – Physical addresses are unique, so cache entries are unique
  – Updating memory translation simply requires changing TLB
• Disadvantages:
  – Increases hit latency

Virtually Addressed Caches

Perform memory translation after or in parallel with cache access

[Figure: two arrangements for a virtually addressed cache. In the first, the processor's virtual address accesses the cache directly, and only on the way to main memory is it translated by the TLB into a physical address. In the second, the TLB lookup happens in parallel with the cache access.]


• Advantages:
  – Simple, one-step lookup for hits
• Disadvantages:
  – Intra-program protection (store protection bits in cache?)
  – I/O uses physical addresses (map into virtual address space?)
  – Virtual address homonyms
  – Virtual address synonyms (aliases)

Virtual Address Homonyms

A single virtual address points to two different physical addresses.

[Figure: homonym example. Program 1's page table maps VA 0x3000 to physical page 0x7000 (PA0), while Program 2's page table maps the same VA 0x3000 to physical page 0x5000 (PA1). The virtually addressed cache holds a line with V=1, tag 0x3000, and data 0xcafe that belongs to Program 1; legend distinguishes physical pages present in physical memory from pages not allocated.]

• Example scenario
  – Program 1 brings VA 0x3000 into cache
  – Program 1 is context swapped for program 2
  – Program 2 hits in cache, but gets incorrect data!
• Potential solutions
  – Flush cache on context swap
  – Store program IDs (address space IDs) in cache


Virtual Address Synonyms (Aliases)

[Figure: synonym example. Program 1's page table maps VA 0x3000 and Program 2's page table maps VA 0x1000 to the same physical page 0x4000 (PA0). The virtually addressed cache ends up holding the same physical data under two different tags: Way 0 has V=1, tag 0x3000, data 0xcafe and Way 1 has V=1, tag 0x1000, data 0xbeef.]

• Example scenarios
  – User and OS can potentially have different VAs point to same PA
  – Memory map the same file (via mmap) in two different programs
• Potential solutions
  – Hardware checks all ways (and potentially different sets) on a miss to ensure that a given physical address can only live in one location in the cache
  – Software forces aliases to share some address bits (page coloring), reducing the number of sets we need to check on a miss (or removing the need to check any other locations for a direct-mapped cache)


Virtually Indexed and Physically Tagged Caches

[Figure: virtually indexed, physically tagged lookup. The virtual address is split into a virtual page number (v bits) and page offset (m bits); the cache index (n bits) and block offset (b bits) are taken from the low-order bits. The TLB translates the VPN into a physical page number while a direct-mapped cache with 2^n lines of 2^b-byte cache lines is indexed in parallel; the physical tag (k bits) from the TLB is then compared against the tag read from the cache to produce hit?.]

• Page offset is the same in VA and PA
• Up to n bits of physical address available without translation
• If index bits + cache offset bits (n+b) ≤ page offset bits (m), we can do translation in parallel with reading out the physical tag and data
• Complete (physical) tag check once tag/data access is complete
• With 4 KB pages, a direct-mapped cache must be <= 4 KB
• Larger page sizes (decrease k) or higher associativity (decrease n) enable larger virtually indexed, physically tagged caches
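A quick check of this constraint with concrete numbers (mine, just for illustration): with 4 KB pages and 64 B cache lines,

\[
m = 12,\quad b = 6,\quad
\text{a 4 KB direct-mapped cache has } n = \log_2 (4096/64) = 6,\quad
n + b = 12 \le m = 12 ,
\]

so the index and block offset fit entirely within the page offset and the cache can be indexed in parallel with the TLB lookup. An 8 KB direct-mapped cache (n = 7) would need one translated bit, but making it two-way set-associative brings n back to 6.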


4. Cache Microarchitecture Optimizations

[Figure: Processor, MMU, Cache, and Main Memory in series, with the hit latency measured at the cache and the miss penalty measured at main memory.]

AMAL = Hit Latency + ( Miss Rate × Miss Penalty )

• Reduce hit time
  – Small and simple caches
• Reduce miss rate
  – Large block size
  – Large cache size
  – High associativity
  – Hardware prefetching
  – Compiler optimizations
• Reduce miss penalty
  – Multi-level caches
  – Prioritize reads

4.1. Reduce Hit Time


[Figure 2.4 from Hennessy & Patterson: energy per read (in nanojoules, y-axis from 0 to 0.5 nJ) versus cache size (16 KB, 32 KB, 64 KB, 128 KB, 256 KB) for 1-way, 2-way, 4-way, and 8-way set-associative caches. Caption: "Energy consumption per read increases as cache size and associativity are increased. As in the previous figure, CACTI is used for the modeling with the same technology parameters. The large penalty for eight-way set associative caches is due to the cost of reading out eight tags and the corresponding data in parallel."]

Second Optimization: Way Prediction to Reduce Hit Time

Another approach reduces conflict misses and yet maintains the hit speed of a direct-mapped cache. In way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. This prediction means the multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle in parallel with reading the cache data. A miss results in checking the other blocks for matches in the next clock cycle. Added to each block of a cache are block predictor bits. The bits select which of the blocks to try on the next cache access. If the predictor is correct, the cache access latency is the fast hit time. If not, it tries the other block, changes the way predictor, and has a latency of one extra clock cycle. Simulations suggest that set prediction accuracy is in excess of 90% for a two-way set associative cache and 80% for a four-way set associative cache, with better accuracy on I-caches than D-caches. Way prediction yields lower average memory access time for a two-way set associative cache if it is at least 10% faster, which is quite likely. Way prediction was first used in the MIPS R10000 in the mid-1990s. It is popular in processors that use two-way set associativity and is used in the ARM Cortex-A8 with four-way set associative caches. For very fast processors, it may be challenging to implement the one-cycle stall that is critical to keeping the way prediction penalty small.

4.2. Reduce Miss Rate

Large Block Size

• Less tag overhead
• Exploit fast burst transfers from DRAM and over wide on-chip busses
• Can waste bandwidth if data is not used
• Fewer blocks → more conflicts

Large Cache Size or High Associativity

• If cache size is doubled, the miss rate usually drops by about a factor of √2
• A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
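For example (hypothetical miss rates, purely to illustrate the rule of thumb): a cache with a 4% miss rate at 32 KB would be expected to have roughly

\[
\frac{4\%}{\sqrt{2}} \approx 2.8\%
\]

at 64 KB, and roughly 2% at 128 KB.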


Hardware Prefetching

[Figure: Processor, L1 D$, and Main Memory in series, with a prefetch buffer (holding tag + cache line pairs) sitting next to the L1 data cache and filled by the prefetcher.]

• Previous techniques only help capacity and conflict misses
• Hardware prefetcher looks for patterns in the miss address stream
• Attempts to predict what the next miss might be
• Prefetches this next miss into a prefetch buffer
• Very effective in reducing compulsory misses for streaming accesses
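A minimal sketch of the idea in C (a simple stride predictor; the names are invented and real prefetchers are considerably more sophisticated):

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u

// Watch the miss-address stream; when two consecutive misses are separated by
// the same stride, predict the next miss and prefetch that line into the buffer.
typedef struct {
  uint32_t last_miss_line;
  int32_t  last_stride;
} prefetcher_t;

// Stand-in for issuing a refill into the prefetch buffer.
static void prefetch_into_buffer( uint32_t line_addr ) {
  printf( "prefetch line 0x%08x\n", line_addr );
}

static void on_cache_miss( prefetcher_t *p, uint32_t miss_addr ) {
  uint32_t miss_line = miss_addr & ~(LINE_SIZE - 1u);
  int32_t  stride    = (int32_t)( miss_line - p->last_miss_line );
  if ( stride != 0 && stride == p->last_stride )
    prefetch_into_buffer( miss_line + (uint32_t)stride );  // pattern detected
  p->last_stride    = stride;
  p->last_miss_line = miss_line;
}

int main( void ) {
  prefetcher_t p = { 0, 0 };
  on_cache_miss( &p, 0x1000 );  // first miss: no prediction yet
  on_cache_miss( &p, 0x1040 );  // stride of 64 seen once
  on_cache_miss( &p, 0x1080 );  // stride repeats: prefetch 0x10c0
  return 0;
}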


Compiler Optimizations

• Restructuring code affects the data block access sequence
  – Group data accesses together to improve spatial locality
  – Re-order data accesses to improve temporal locality
• Prevent data from entering the cache
  – Useful for variables that will only be accessed once before eviction
  – Needs a mechanism for software to tell hardware not to cache data ("no-allocate" instruction hints or page table bits)
• Kill data that will never be used again
  – Streaming data exploits spatial locality but not temporal locality
  – Replace into dead-cache locations

Loop Interchange and Fusion

What type of locality does each optimization improve?

Loop interchange:

// before
for(j=0; j < N; j++) {
  for(i=0; i < M; i++) {
    x[i][j] = 2 * x[i][j];
  }
}

// after
for(i=0; i < M; i++) {
  for(j=0; j < N; j++) {
    x[i][j] = 2 * x[i][j];
  }
}

What type of locality does this improve?

Loop fusion:

// before
for(i=0; i < N; i++)
  a[i] = b[i] * c[i];
for(i=0; i < N; i++)
  d[i] = a[i] * c[i];

// after
for(i=0; i < N; i++) {
  a[i] = b[i] * c[i];
  d[i] = a[i] * c[i];
}

What type of locality does this improve?


Matrix Multiply with Naive Code

for(i=0; i < N; i++)
  for(j=0; j < N; j++) {
    r = 0;
    for(k=0; k < N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

[Figure: access patterns of the x, y, and z matrices for one (i,j) iteration; legend: not touched, old access, new access.]

Matrix Multiply with Cache Tiling

for(jj=0; jj < N; jj=jj+B)
  for(kk=0; kk < N; kk=kk+B)
    for(i=0; i < N; i++)
      for(j=jj; j < min(jj+B,N); j++) {
        r = 0;
        for(k=kk; k < min(kk+B,N); k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }

[Figure: access patterns of the x, y, and z matrices with B×B tiling; legend: not touched, old access, new access.]


4.3. Reduce Miss Penalty

Multi-Level Caches

[Figure: Processor, L1 Cache, L2 Cache, and Main Memory in series; an L1 hit is served by the L1, and an L1 miss that hits in the L2 is served by the L2.]

\[
\text{AMAL}_{L1} = \text{Hit Latency of L1} + ( \text{Miss Rate of L1} \times \text{AMAL}_{L2} )
\]

\[
\text{AMAL}_{L2} = \text{Hit Latency of L2} + ( \text{Miss Rate of L2} \times \text{Miss Penalty of L2} )
\]
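Plugging in illustrative numbers (mine, not from the handout): with an L1 hit latency of 1 cycle and a 10% L1 miss rate, an L2 hit latency of 10 cycles, a 20% L2 local miss rate, and a 100-cycle L2 miss penalty,

\[
\text{AMAL}_{L2} = 10 + 0.2 \times 100 = 30 ,\qquad
\text{AMAL}_{L1} = 1 + 0.1 \times 30 = 4 \text{ cycles per access.}
\]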

• Local miss rate = misses in cache / accesses to cache
• Global miss rate = misses in cache / processor memory accesses
• Misses per instruction = misses in cache / number of instructions

• Use a smaller L1 if there is also an L2
  – Trade increased L1 miss rate for reduced L1 hit time & L1 miss penalty
  – Reduces average access energy
• Use a simpler write-through L1 with on-chip L2
  – Write-back L2 cache absorbs write traffic, doesn't go off-chip
  – Simplifies processor pipeline
  – Simplifies on-chip coherence issues
• Inclusive Multilevel Cache
  – Inner cache holds copy of data in outer cache
  – External coherence is simpler
• Exclusive Multilevel Cache
  – Inner cache may hold data not in outer cache
  – Swap lines between inner/outer cache on miss


Prioritize Reads

[Figure: Processor, Cache, and Main Memory, with a write buffer between the cache and main memory; hits are served by the cache, while misses and buffered writes go to main memory.]

• Processor is not stalled on writes, and read misses can go ahead of writes to main memory
• Write buffer may hold the updated value of a location needed by a read miss
  – On a read miss, wait for the write buffer to be empty
  – Or check write buffer addresses and bypass
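A sketch of that address check in C (the structure and names are mine; in hardware this is an associative search over the buffer entries):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 8

// One buffered write that has left the cache but not yet reached main memory.
typedef struct {
  bool     valid;
  uint32_t addr;   // word-aligned address of the buffered write
  uint32_t data;
} wb_entry_t;

typedef struct { wb_entry_t entry[WB_ENTRIES]; } write_buffer_t;

// On a read miss, check every buffered write. If one matches the read address,
// bypass its data instead of waiting for the buffer to drain; otherwise the
// read miss can safely be sent to memory ahead of the buffered writes.
// (Simplification: assumes at most one buffered write per address.)
bool read_miss_bypass( const write_buffer_t *wb, uint32_t read_addr,
                       uint32_t *data_out )
{
  for ( int i = 0; i < WB_ENTRIES; i++ ) {
    if ( wb->entry[i].valid && wb->entry[i].addr == read_addr ) {
      *data_out = wb->entry[i].data;   // bypass the buffered value
      return true;
    }
  }
  return false;                        // no conflict: prioritize the read
}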

4.4. Cache Optimization Summary

Technique                    Hit Lat   Miss Rate   Miss Penalty   BW   HW
Smaller caches               −         +                               0
Avoid TLB before indexing    −                                         1
Large block size                       −           +                   0
Large cache size             +         −                               1
High associativity           +         −                               1
Hardware prefetching                   −                               2
Compiler optimizations                 −                               0
Multi-level cache                                   −                  2
Prioritize reads                                    −                  1
Pipelining                   +                                    +    1

(− means the technique decreases that term, + means it increases it; HW is relative hardware cost/complexity.)


5. Case Study: ARM Cortex A8 and Intel Core i7

5.1. ARM Cortex A8

[Excerpt from Hennessy & Patterson, Section 2.6. The text notes that Figure 2.16 shows how the 32-bit virtual address is used to index the TLB and the caches, assuming 32 KB primary caches and a 512 KB secondary cache with a 16 KB page size, and that the memory hierarchy of the Cortex-A8 was simulated with 32 KB primary caches and a 1 MB eight-way set associative L2 cache using the integer Minnespec benchmarks (KleinOsowski and Lilja [2002]), a version of the SPEC2000 benchmarks with inputs reduced to cut running times by several orders of magnitude without changing the instruction mix.]

[Figure 2.16: the virtual address, physical address, indexes, tags, and data blocks for the ARM Cortex-A8 data caches and data TLB. Since the instruction and data hierarchies are symmetric, only one is shown. The TLB (instruction or data) is fully associative with 32 entries. The L1 cache is four-way set associative with 64-byte blocks and 32 KB capacity. The L2 cache is eight-way set associative with 64-byte blocks and 1 MB capacity. The figure does not show the valid bits and protection bits for the caches and TLB, nor the way prediction bits that would dictate the predicted bank of the L1 cache. Field widths: virtual address <32>, virtual page number <18>, page offset <14>, L1 cache index <7>, block offset <6>, TLB tag <19>, TLB data <19>, L1 cache tag <19>, L1 data <64>, physical address <32>, L2 tag compare address <15>, L2 cache index <11>, L2 cache tag <15>, L2 data <512>.]

• L1 data cache
  – 32 KB, 64 B cache lines, 4-way set-associative with random replacement
  – 32K/64 = 512 cache lines, 128 sets (set index = 7 bits)
  – Virtually indexed, physically tagged with single-cycle hit latency
• Memory management unit
  – TLB with multi-level page tables in physical memory
  – TLB has 32 entries, fully associative, hardware TLB miss handler
  – Variable page size: 4 KB, 16 KB, 64 KB, 1 MB, 16 MB (figure shows 16 KB)
  – TLB tag entries can have wildcards to support multiple page sizes
• L2 cache
  – 1 MB, 64 B cache lines, 8-way set-associative
  – 1M/64 = 16K cache lines, 2K sets (set index = 11 bits)
  – Physically addressed with multi-cycle hit latency
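These numbers line up with the field widths in Figure 2.16:

\[
\frac{32\,\text{KB}}{64\,\text{B}} = 512 \text{ lines},\quad
\frac{512}{4} = 128 \text{ sets} \Rightarrow 7 \text{ index bits},\quad
\log_2 64 = 6 \text{ offset bits},\quad
32 - 7 - 6 = 19 \text{ tag bits},
\]

matching the L1 cache index <7>, block offset <6>, and L1 cache tag <19>. Similarly, 1 MB / 64 B = 16K L2 lines and 16K/8 = 2K sets give the 11-bit L2 cache index, leaving a 32 - 11 - 6 = 15-bit L2 tag compare address.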


5.2. Intel Core i7

[Excerpt from Hennessy & Patterson, Chapter 2. In 2010 the fastest i7 had a clock rate of 3.3 GHz, which yields a peak instruction execution rate of 13.2 billion instructions per second, or over 50 billion instructions per second for the four-core design. The i7 can support up to three memory channels, each consisting of a separate set of DIMMs, and each of which can transfer in parallel. Using DDR3-1066 (DIMM PC8500), the i7 has a peak memory bandwidth of just over 25 GB/sec. The i7 uses 48-bit virtual addresses and 36-bit physical addresses. Memory management is handled with a two-level TLB, summarized in Figure 2.19. Figure 2.20 summarizes the i7's three-level cache hierarchy. The first-level caches are virtually indexed and physically tagged, while the L2 and L3 caches are physically indexed.]

Figure 2.19: Characteristics of the i7's TLB structure, which has separate first-level instruction and data TLBs, both backed by a joint second-level TLB. The first-level TLBs support the standard 4 KB page size, as well as having a limited number of entries of large 2 to 4 MB pages; only 4 KB pages are supported in the second-level TLB.

Characteristic    Instruction TLB   Data TLB      Second-level TLB
Size              128               64            512
Associativity     4-way             4-way         4-way
Replacement       Pseudo-LRU        Pseudo-LRU    Pseudo-LRU
Access latency    1 cycle           1 cycle       6 cycles
Miss              7 cycles          7 cycles      Hundreds of cycles to access page table

Figure 2.20: Characteristics of the three-level cache hierarchy in the i7. All three caches use write-back and a block size of 64 bytes. The L1 and L2 caches are separate for each core, while the L3 cache is shared among the cores on a chip and is a total of 2 MB per core. All three caches are nonblocking and allow multiple outstanding writes. A merging write buffer is used for the L1 cache, which holds data in the event that the line is not present in L1 when it is written. (That is, an L1 write miss does not cause the line to be allocated.) L3 is inclusive of L1 and L2; we explore this property in further detail when we explain multiprocessor caches. Replacement is by a variant on pseudo-LRU; in the case of L3 the block replaced is always the lowest numbered way whose access bit is turned off. This is not quite random but is easy to compute.

Characteristic        L1                    L2           L3
Size                  32 KB I / 32 KB D     256 KB       2 MB per core
Associativity         4-way I / 8-way D     8-way        16-way
Access latency        4 cycles, pipelined   10 cycles    35 cycles
Replacement scheme    Pseudo-LRU            Pseudo-LRU   Pseudo-LRU with an ordered selection algorithm

• 48-bit virtual addresses and 36-bit physical addresses (2^36 = 64 GB of physical memory)
• 4 KB pages except for a few large 2-4 MB pages in the L1 TLBs
• Write-back with merging write buffer (more like no write allocate)
• L3 is inclusive of L1/L2
• Hardware prefetching from L2 into L1, and from L3 into L2
• Virtually indexed, physically tagged L1 caches
• Physically addressed L2/L3 caches