EEC 581 Computer Architecture

Memory Hierarchy Design (II)

Department of Electrical Engineering and Computer Science, Cleveland State University

Topics to be covered

• Cache Penalty Reduction Techniques
  – Victim cache
  – Assist cache
  – Non-blocking cache
  – Data prefetch mechanisms

2

3Cs Absolute Miss Rate (SPEC92)

• Compulsory misses are a tiny fraction of the overall misses
• Capacity misses reduce with increasing cache size
• Conflict misses reduce with increasing associativity

[Figure: miss rate per type (compulsory, capacity, conflict) vs. cache size from 1 KB to 128 KB, for 1-way, 2-way, 4-way, and 8-way associativity]

3

2:1 Cache Rule

• Miss rate of a direct-mapped cache of size X ≈ miss rate of a 2-way set-associative cache of size X/2

4

3Cs Relative Miss Rate

[Figure: relative share of the miss rate per type (compulsory, capacity, conflict) vs. cache size from 1 KB to 128 KB, for 1-way, 2-way, 4-way, and 8-way associativity]

• Caveat: fixed block size

5

Victim Caching [Jouppi’90]

• Victim cache (VC)
  – A small, fully associative structure between L1 and memory
  – Effective for direct-mapped L1 caches
• Whenever a line is displaced from the L1 cache, it is loaded into the VC
• The processor checks both L1 and the VC simultaneously
• Data is swapped between the VC and L1 if L1 misses and the VC hits
• When data has to be evicted from the VC, it is written back to memory
• A minimal lookup sketch follows below

[Figure: victim cache organization — processor, L1 with VC alongside, memory]
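To make the L1/VC protocol above concrete, here is a minimal, tag-only sketch in C (not from the slides): the sizes, names, and the simple rotating replacement in the VC are illustrative assumptions, and writeback of dirty victims is reduced to a comment.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define L1_SETS    64          /* direct-mapped L1, one line per set */
#define VC_ENTRIES 4           /* small, fully associative victim cache */

typedef struct { bool valid; uint64_t tag; } line_t;

static line_t l1[L1_SETS];
static line_t vc[VC_ENTRIES];
static int vc_next;            /* rotating replacement pointer for the VC */

/* Returns 1 on L1 hit, 2 on VC hit (with swap), 0 on a miss serviced by memory. */
int cache_access(uint64_t line_addr)
{
    int set = (int)(line_addr % L1_SETS);

    if (l1[set].valid && l1[set].tag == line_addr)
        return 1;                              /* L1 hit */

    for (int i = 0; i < VC_ENTRIES; i++) {
        if (vc[i].valid && vc[i].tag == line_addr) {
            line_t displaced = l1[set];        /* swap: VC line moves to L1 ... */
            l1[set].valid = true;
            l1[set].tag   = line_addr;
            vc[i] = displaced;                 /* ... and the old L1 line becomes the victim */
            return 2;                          /* L1 miss, VC hit */
        }
    }

    if (l1[set].valid) {                       /* displaced L1 line drops into the VC */
        vc[vc_next] = l1[set];                 /* a dirty VC victim would be written back to memory here */
        vc_next = (vc_next + 1) % VC_ENTRIES;
    }
    l1[set].valid = true;
    l1[set].tag   = line_addr;
    return 0;                                  /* miss serviced by memory */
}

int main(void)
{
    uint64_t a = 0x100, b = 0x100 + L1_SETS;   /* same L1 set, different tags */
    const char *kind[] = { "memory miss", "L1 hit", "VC hit (swap)" };
    for (int i = 0; i < 6; i++) {
        uint64_t addr = (i & 1) ? b : a;
        printf("access %#llx -> %s\n", (unsigned long long)addr, kind[cache_access(addr)]);
    }
    return 0;
}

After the first round trip, the a/b ping-pong that would thrash a direct-mapped L1 is served entirely by swaps with the VC.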

6

% of Conflict Misses Removed

[Figure: fraction of conflict misses removed by victim caching, shown separately for the D-cache and the I-cache]

7

Assist Cache [Chan et al. ‘96]

• The Assist Cache (on-chip) avoids thrashing in the main (off-chip) L1 cache; both run at full speed
• 64 × 32-byte fully associative CAM
• Data enters the Assist Cache on a miss (FIFO replacement policy in the Assist Cache)
• Data is conditionally moved to L1 or back to memory during eviction
  – Flushed back to memory when brought in by "spatial locality hint" instructions
  – Reduces pollution

[Figure: assist cache organization — processor, L1 with AC alongside, memory]

8

Multi-lateral Cache Architecture

[Figure: processor core connected to two first-level cache structures, A and B, both backed by memory]

• A fully connected multi-lateral cache architecture
• Most cache architectures can be generalized into this form

9

Cache Architecture Taxonomy

[Figure: taxonomy of cache organizations, each drawn as processor / cache structure(s) A, B / memory — single-level cache, two-level cache, the general multi-lateral description, victim cache, assist cache, and NTS and PCS caches]

10

Non-blocking (Lockup-Free) Cache [Kroft ‘81]

• Prevents the pipeline from stalling due to cache misses (continues to provide hits to other lines while servicing a miss on one or more lines)
• Uses Miss Status Handling Registers (MSHRs)
  – Track cache misses; one entry is allocated per outstanding miss (called a fill buffer in the Intel P6 family)
  – Each new cache miss is checked against the MSHRs
  – The pipeline stalls on a cache miss only when the MSHRs are full
• Carefully choose the number of MSHR entries to match the sustainable memory bandwidth (see the sketch below)
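A simplified check/allocate model of the MSHRs described above, in C; it tracks primary misses only (no per-entry lists of secondary targets), and all names and sizes are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHR   4
#define LINE_SHIFT 5                    /* 32-byte lines */

typedef struct {
    bool     busy;
    uint64_t line_addr;                 /* line-aligned address of the outstanding miss */
} mshr_t;

static mshr_t mshr[NUM_MSHR];

typedef enum { SECONDARY_MISS, PRIMARY_MISS, STALL } miss_action;

/* Called when the cache detects a miss on 'addr'; decides whether the
   pipeline may continue or must stall. */
miss_action handle_miss(uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;

    /* Each new miss is checked against the existing MSHR entries. */
    for (int i = 0; i < NUM_MSHR; i++)
        if (mshr[i].busy && mshr[i].line_addr == line)
            return SECONDARY_MISS;      /* merge with an already outstanding miss */

    /* Allocate one entry per new miss, if a free one exists. */
    for (int i = 0; i < NUM_MSHR; i++) {
        if (!mshr[i].busy) {
            mshr[i].busy = true;
            mshr[i].line_addr = line;   /* the memory request would be issued here */
            return PRIMARY_MISS;
        }
    }
    return STALL;                       /* all MSHRs busy: pipeline stalls until a fill returns */
}

/* Called when the fill for 'line' comes back from memory. */
void fill_complete(uint64_t line)
{
    for (int i = 0; i < NUM_MSHR; i++)
        if (mshr[i].busy && mshr[i].line_addr == line)
            mshr[i].busy = false;
}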

11

Bus Utilization (MSHR = 2)

[Figure: memory bus timeline with MSHR = 2 — each miss (m1–m5) pays a lead-off latency and then transfers four data chunks at the initiation interval; with only two MSHRs the processor stalls waiting for a free entry, leaving the data bus idle between transfers]

12

Bus Utilization (MSHR = 4)

[Figure: the same memory bus timeline with MSHR = 4 — more misses can be outstanding at once, the stall shrinks, and memory bus utilization improves]

13

Sample question

• What is the major difference between the CDC 6600's scoreboarding algorithm and the IBM 360/91's Tomasulo algorithm? (One sentence)

• Why did the IBM 360/91 implement Tomasulo's algorithm only in the floating-point unit and not in the integer unit? (One sentence)

• What are the "two main functions" of a ReOrder Buffer (ROB)?

14

Sample question

• What is the major difference between the CDC 6600's scoreboarding algorithm and the IBM 360/91's Tomasulo algorithm? (One sentence)
  – The Tomasulo algorithm performs register renaming.
• Why did the IBM 360/91 implement Tomasulo's algorithm only in the floating-point unit and not in the integer unit? (One sentence)
  – Because of the long latency of the FPU, and because the FPU has only 4 registers.
• What are the "two main functions" of a ReOrder Buffer (ROB)?
  – To support (i) precise exceptions and (ii) branch misprediction recovery.

15

Sample question

• What is the main responsibility of the Load Store Queue?

• Given 4 architectural registers (R0 to R3) and 16 physical registers (T0 to T15), the current RAT content is indicated in the leftmost table below. Note that the physical registers are allocated in circular numbering order (i.e., T0, T1, …, T15, then back to T0). Assume the renaming logic can rename 4 instructions per clock cycle. For the following instruction sequence, fill in the RAT content one cycle later. (The destination register of arithmetic instructions is on the left-hand side.)

• What are the "two main functions" of a ReOrder Buffer (ROB)?

[Tables: current RAT, instruction sequence, and RAT after one cycle — not reproduced here]

16

Sample question

• What is the main responsibility of the Load Store Queue?
  – To perform memory address disambiguation and maintain memory ordering.
• Given 4 architectural registers (R0 to R3) and 16 physical registers (T0 to T15), the current RAT content is indicated in the leftmost table below. Note that the physical registers are allocated in circular numbering order (i.e., T0, T1, …, T15, then back to T0). Assume the renaming logic can rename 4 instructions per clock cycle. For the following instruction sequence, fill in the RAT content one cycle later. (The destination register of arithmetic instructions is on the left-hand side.)
• What are the "two main functions" of a ReOrder Buffer (ROB)?

17

Sample question

• Caches and main memory are sub-divided into multiple banks in order to allow parallel access. What is an alternative way of allowing parallel access?

• What do we call a cache that allows multiple cache misses to be outstanding to main memory at the same time, without stalling the pipeline?

• While cache misses are outstanding to main memory, what is the structure that keeps bookkeeping information about the outstanding cache misses? This structure often augments the cache.

18

Sample question

• Caches and main memory are sub-divided into multiple banks in order to allow parallel access. What is an alternative way of allowing parallel access?
  – Multiporting or duplicating.
• What do we call a cache that allows multiple cache misses to be outstanding to main memory at the same time, without stalling the pipeline?
  – A non-blocking (or lockup-free) cache.
• While cache misses are outstanding to main memory, what is the structure that keeps bookkeeping information about the outstanding cache misses? This structure often augments the cache.
  – Miss status handling registers (MSHRs).

19

Sample question

• Consider a processor with separate instruction and data caches (and no L2 cache). We are focusing on improving the data cache performance, since our instruction cache achieves a 100% hit rate with various optimizations. The data cache is 4 KB, direct-mapped, and has single-cycle access latency. The processor supports a 64-bit virtual address space, 8 KB pages, and no more than 16 GB of physical memory. The cache block size is 32 bytes. The data cache is virtually indexed and physically tagged. Assume that the data TLB hit rate is 100%.
  – The miss rate of the data cache is measured to be 10%. The miss penalty is 20 cycles. Compute the average memory access latency (in number of cycles) for data accesses.
  – To improve the overall memory access latency, we decided to introduce a victim cache. It is fully associative and has eight entries. Its access latency is one cycle. To save power and energy consumption, we decided to access the victim cache only after we detect a miss from the data cache. The victim cache hit rate is measured to be 50% (i.e., the probability of finding data in the victim cache given that the data cache doesn't have it). Further, only after we detect a miss from the victim cache do we start miss handling. Compute the average memory access latency for data accesses. (A worked sketch follows below.)
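One way to work the arithmetic (an assumption of this sketch, not stated on the slide: the 1-cycle hit latency is paid on every access, and the 20-cycle penalty is added only once miss handling starts):

\[
\mathrm{AMAT}_{\text{no VC}} = 1 + 0.10 \times 20 = 3 \text{ cycles}
\]

With the victim cache, every data-cache miss first pays the 1-cycle victim-cache probe; half of those misses hit there, and the other half go on to pay the 20-cycle penalty:

\[
\mathrm{AMAT}_{\text{VC}} = 1 + 0.10 \times \bigl(1 + 0.5 \times 20\bigr) = 1 + 0.10 \times 11 = 2.1 \text{ cycles}
\]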

20

Prefetch (Data/Instruction)

• Predict what data will be needed in the future
• Pollution vs. latency reduction
  – If you correctly predict the data that will be required in the future, you reduce latency; if you mispredict, you bring in unwanted data and pollute the cache
• To determine the effectiveness, ask:
  – When to initiate a prefetch? (timeliness)
  – Which lines to prefetch?
  – How big a line to prefetch? (note that the cache mechanism already performs prefetching within a block)
  – What to replace?
• Software (data) prefetching vs. hardware prefetching

21

Software-controlled Prefetching

• Implemented through instructions
  – Existing instructions
    · Alpha's load to R31 (hardwired to 0)
  – Specialized instructions and hints
    · Intel SSE: prefetchnta, prefetcht0/t1/t2
    · MIPS32: PREF
    · PowerPC: dcbt (data cache block touch), dcbtst (data cache block touch for store)
• Compiler-inserted or hand-inserted prefetch instructions (see the example below)
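As a concrete illustration (not from the slides): GCC and Clang expose software prefetching portably through the __builtin_prefetch(addr, rw, locality) built-in, which the compiler lowers to instructions such as the SSE hints listed above on x86. The prefetch distance of 16 elements is an arbitrary tuning choice for this sketch.

#include <stddef.h>

/* Dot product with software prefetching of both operand arrays.
   __builtin_prefetch(addr, rw, locality): rw = 0 means read,
   locality = 3 means keep in all cache levels (roughly prefetcht0 on x86). */
double dot(const double *a, const double *b, size_t n)
{
    const size_t dist = 16;            /* prefetch distance in elements (tuning knob) */
    double sop = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + dist < n) {
            __builtin_prefetch(&a[i + dist], 0, 3);
            __builtin_prefetch(&b[i + dist], 0, 3);
        }
        sop += a[i] * b[i];
    }
    return sop;
}

On targets without a prefetch instruction the built-in simply emits nothing, so the code stays portable.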

22

Alpha

• The Alpha architecture supports data prefetch via load instructions with a destination of register R31 or F31.

  LDBU, LDF, LDG, LDL, LDT, LDWU — normal cache line prefetch
  LDS — prefetch with modify intent; sets the dirty and modified bits
  LDQ — prefetch, evict next; no temporal locality

• The Alpha architecture also defines the following instructions.

  FETCH — prefetch data
  FETCH_M — prefetch data, modify intent

PowerPC

  dcbt — data cache block touch
  dcbtst — data cache block touch for store

Intel SSE

• The SSE prefetch instruction has the following variants:

  prefetcht0 — temporal data; prefetch data into all cache levels
  prefetcht1 — temporal with respect to the first-level cache; prefetch data into all cache levels except the 0th level
  prefetcht2 — temporal with respect to the second-level cache; prefetch data into all cache levels except the 0th and 1st levels
  prefetchnta — non-temporal with respect to all cache levels; prefetch data into a non-temporal cache structure, with minimal cache pollution

Software-controlled Prefetching

/* Original loop with prefetching */
for (i = 0; i < N; i++) {
    prefetch(&a[i+1]);
    prefetch(&b[i+1]);
    sop = sop + a[i]*b[i];
}

/* Unroll loop 4 times */
for (i = 0; i < N-4; i += 4) {
    prefetch(&a[i+4]);
    prefetch(&b[i+4]);
    sop = sop + a[i]*b[i];
    sop = sop + a[i+1]*b[i+1];
    sop = sop + a[i+2]*b[i+2];
    sop = sop + a[i+3]*b[i+3];
}
sop = sop + a[N-4]*b[N-4];
sop = sop + a[N-3]*b[N-3];
sop = sop + a[N-2]*b[N-2];
sop = sop + a[N-1]*b[N-1];

• Prefetch latency <= computational time

25

Hardware-based Prefetching

• Sequential prefetching
  – Prefetch on miss
  – Tagged prefetch
• Both techniques are based on "One Block Lookahead (OBL)" prefetching: prefetch line (L+1) when line L is accessed, based on some criterion

26

Sequential Prefetching

• Prefetch on miss
  – Initiate a prefetch of (L+1) whenever an access to L results in a miss
  – The Alpha 21064 does this for instructions (prefetched instructions are stored in a separate structure called a stream buffer)

• Tagged prefetch
  – Idea: whenever there is a "first use" of a line (demand-fetched or previously prefetched), prefetch the next one
  – One additional "tag bit" per cache line
  – Tag the prefetched, not-yet-used line (tag = 1)
  – Tag bit = 0: the line was demand-fetched, or a prefetched line has been referenced for the first time
  – Prefetch (L+1) only if the tag bit of L is 1 (see the sketch below)
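A minimal, tag-only sketch of the tagged scheme, assuming (as in the figure on the next slide) that a demand miss both fetches L and prefetches L+1; the direct-mapped organization and names are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define SETS 256

typedef struct {
    bool     valid;
    uint64_t line;          /* line number cached in this set */
    bool     tag;           /* 1 = prefetched but not yet referenced */
} block_t;

static block_t cache[SETS];

static void fill(uint64_t line, bool prefetched)
{
    block_t *b = &cache[line % SETS];
    b->valid = true;
    b->line  = line;
    b->tag   = prefetched;  /* only prefetched, not-yet-used lines carry tag = 1 */
}

/* Reference line L under one-block-lookahead tagged prefetching. */
void reference(uint64_t L)
{
    block_t *b = &cache[L % SETS];

    if (b->valid && b->line == L) {
        if (b->tag) {       /* first use of a prefetched line */
            b->tag = false;
            fill(L + 1, true);
        }
        /* ordinary hit on an already-used line: no prefetch */
    } else {
        fill(L, false);     /* demand fetch, tag = 0 */
        fill(L + 1, true);  /* prefetch the next line, tag = 1 */
    }
}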

27

Sequential Prefetching

[Figure: prefetch-on-miss when accessing contiguous lines — the initial miss demand-fetches a line and prefetches the next; the following access hits the prefetched line but triggers no new prefetch, so the access after that misses again (miss, hit, miss)]

[Figure: tagged prefetch when accessing contiguous lines — the initial miss demand-fetches a line (tag 0) and prefetches the next (tag 1); each first use of a prefetched line clears its tag and prefetches one more line, so the following accesses keep hitting (miss, hit, hit)]

28


Virtual Memory

• Virtual memory – separation of logical memory from physical memory
  – Only a part of the program needs to be in memory for execution; hence, the logical address space can be much larger than the physical address space
  – Main memory acts like a cache for the hard disk
  – Allows address spaces to be shared by several processes (or threads)
  – Allows for more efficient process creation

• Virtual memory can be implemented via:
  – Demand paging
  – Demand segmentation

30

Virtual Address

• The concept of a virtual (or logical) address space that is bound to a separate physical address space is central to memory management
  – Virtual address – generated by the CPU
  – Physical address – seen by the memory

• Virtual and physical addresses are the same in compile-time and load-time address-binding schemes; virtual and physical addresses differ in execution-time address-binding schemes

31

Advantages of Virtual Memory

• Translation:
  – A program can be given a consistent view of memory, even though physical memory is scrambled
  – Only the most important part of the program (the "working set") must be in physical memory
  – Contiguous structures (like stacks) use only as much physical memory as necessary, yet can grow later
• Protection:
  – Different threads (or processes) are protected from each other
  – Different pages can be given special behavior (read-only, invisible to user programs, etc.)
  – Kernel data is protected from user programs
  – Very important for protection from malicious programs => far more "viruses" under Microsoft Windows
• Sharing:
  – The same physical page can be mapped into multiple processes ("shared memory")

32

Use of Virtual Memory

[Figure: address-space layouts of Process A and Process B — stack, shared libraries, heap, static data, and code — with the shared-library regions of both processes mapped to the same shared physical page]

33

Virtual vs. Physical Address Space

[Figure: a 4 GB virtual address space whose pages (A, B, C, D) map either to frames of a much smaller physical main memory or out to disk]

34

Paging

• Divide physical memory into fixed-size blocks (e.g., 4 KB) called frames
• Divide logical memory into blocks of the same size (4 KB) called pages
• To run a program of size n pages, find n free frames and load the program
• Set up a page table to map page addresses to frame addresses (the operating system sets up the page table)

35

Page Table and Address Translation

[Figure: the virtual address is split into a virtual page number (VPN) and a page offset; the VPN indexes the page table in main memory to obtain the physical page number (PPN), which is concatenated with the unchanged page offset to form the physical address — a code sketch follows below]
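A one-level version of that translation, sketched in C; a real page-table entry also carries protection and dirty bits, which are collapsed into a single valid flag here, and the sizes are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                       /* 4 KB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  1024                     /* tiny virtual space, just for the sketch */

typedef struct {
    bool     valid;
    uint64_t ppn;                           /* physical page number */
} pte_t;

static pte_t page_table[NUM_PAGES];

/* Translate a virtual address to a physical address; returns false on a page fault. */
bool translate(uint64_t va, uint64_t *pa)
{
    uint64_t vpn    = va >> PAGE_SHIFT;     /* virtual page number */
    uint64_t offset = va & (PAGE_SIZE - 1); /* page offset is copied unchanged */

    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                       /* page fault: the OS would handle it */

    *pa = (page_table[vpn].ppn << PAGE_SHIFT) | offset;
    return true;
}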

36

Page Table Structure Examples

• One-to-one mapping — how much space does it take?
  – Large pages → internal fragmentation (similar to having large line sizes in caches)
  – Small pages → page table size issues
• Multi-level paging
• Inverted page table

Example: 64-bit address space, 4 KB pages (12 bits), 512 MB (29 bits) of RAM

• Number of pages = 2^64 / 2^12 = 2^52 (the page table has as many entries)
• At ~4 bytes per entry, the page table is 2^54 bytes = 16 petabytes!
• It can't fit in the 512 MB of RAM!

37

Multi-level (Hierarchical) Page Table

• Divide the virtual address into multiple levels: P1 | P2 | page offset
  – P1 indexes the level-1 page directory (a pointer array), which is stored in main memory
  – P2 indexes the level-2 page table, which stores the PPN
  – The PPN is concatenated with the page offset to form the physical address
  – A code sketch follows below

[Figure: two-level page-table walk driven by the P1 and P2 fields of the virtual address]
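Extending the one-level sketch above to two levels; the field widths and names are arbitrary choices for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12                       /* 4 KB pages */
#define L2_BITS    10                       /* P2: index into a level-2 page table */
#define L1_BITS    10                       /* P1: index into the level-1 directory */

typedef struct {
    bool     valid;
    uint64_t ppn;
} pte_t;

typedef struct {
    pte_t entries[1u << L2_BITS];
} l2_table_t;

/* Level-1 page directory: pointers to level-2 tables (NULL = not present). */
static l2_table_t *page_directory[1u << L1_BITS];

bool translate2(uint64_t va, uint64_t *pa)
{
    uint64_t offset = va & ((1u << PAGE_SHIFT) - 1);
    uint64_t p2     = (va >> PAGE_SHIFT) & ((1u << L2_BITS) - 1);
    uint64_t p1     = (va >> (PAGE_SHIFT + L2_BITS)) & ((1u << L1_BITS) - 1);

    l2_table_t *l2 = page_directory[p1];    /* first memory reference of the walk */
    if (l2 == NULL || !l2->entries[p2].valid)
        return false;                       /* page fault */

    /* second reference yields the PPN; concatenate with the page offset */
    *pa = (l2->entries[p2].ppn << PAGE_SHIFT) | offset;
    return true;
}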

38

Inverted Page Table

• One entry for each real page (frame) of memory
• Shared by all active processes
• Each entry holds the virtual address of the page stored in that real memory location, along with process ID information
• Decreases the memory needed to store the page tables, but increases the time needed to search the table when a page reference occurs

39

Linear Inverted Page Table

• Contains one entry per physical page, in a linear array (array size = size of physical memory in pages)
• Need to traverse the array sequentially to find a match
• Can be time consuming

[Figure: lookup example — for a virtual address with PID = 8 and VPN = 0x2AA70, the table is scanned by index until entry 0x120D matches (PID 8, VPN 0x2AA70); the matching index becomes the PPN (0x120D), which is concatenated with the offset to form the physical address]

40

Hashed Inverted Page Table

• Use a hash table to limit the search to a smaller number of page-table entries (see the sketch below)

[Figure: lookup example — the (PID = 8, VPN = 0x2AA70) pair is hashed into the hash anchor table, which points at the head of a chain of inverted-page-table entries linked by a "next" field; the chain is followed until the matching entry is found at index 0x120D]
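A sketch of that chained lookup in C; the hash function, table sizes, and names are placeholders, and the anchor table must be filled with NO_ENTRY before use.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define NUM_FRAMES   4096
#define HASH_BUCKETS 1024
#define NO_ENTRY     UINT32_MAX

typedef struct {
    bool     valid;
    uint32_t pid;
    uint64_t vpn;
    uint32_t next;                       /* index of the next entry in this hash chain */
} hipte_t;

static hipte_t  inverted_table[NUM_FRAMES];
static uint32_t hash_anchor[HASH_BUCKETS];   /* fill with NO_ENTRY before use */

static uint32_t hash_fn(uint32_t pid, uint64_t vpn)
{
    return (uint32_t)((vpn ^ pid) % HASH_BUCKETS);   /* placeholder hash */
}

bool hipt_translate(uint32_t pid, uint64_t va, uint64_t *pa)
{
    uint64_t vpn    = va >> PAGE_SHIFT;
    uint64_t offset = va & ((1u << PAGE_SHIFT) - 1);

    /* Follow the chain rooted in the hash anchor table. */
    for (uint32_t i = hash_anchor[hash_fn(pid, vpn)]; i != NO_ENTRY;
         i = inverted_table[i].next) {
        if (inverted_table[i].valid &&
            inverted_table[i].pid == pid &&
            inverted_table[i].vpn == vpn) {
            *pa = ((uint64_t)i << PAGE_SHIFT) | offset;  /* entry index = PPN */
            return true;
        }
    }
    return false;                        /* not resident: page fault */
}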

41

Fast Address Translation

• How often does address translation occur? Where is the page table kept?
• Keep translations in hardware: use a Translation Lookaside Buffer (TLB)
  – Instruction TLB and data TLB
  – Essentially a cache (tag array = VPN, data array = PPN)
  – Small (32 to 256 entries are typical)
  – Typically fully associative (implemented as a content-addressable memory, CAM) or highly associative, to minimize conflicts
  – A lookup sketch follows below
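A fully associative TLB probe, sketched in C; hardware compares the VPN against all entries in parallel, whereas this sketch iterates, and replacement, ASNs, and permission bits are omitted.

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12

typedef struct {
    bool     valid;
    uint64_t vpn;                         /* tag */
    uint64_t ppn;                         /* data */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a TLB hit; on a miss the page-table walker would refill the TLB. */
bool tlb_lookup(uint64_t va, uint64_t *pa)
{
    uint64_t vpn    = va >> PAGE_SHIFT;
    uint64_t offset = va & ((1u << PAGE_SHIFT) - 1);

    /* Hardware: the VPN is matched against every entry at once (CAM). */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].ppn << PAGE_SHIFT) | offset;
            return true;
        }
    }
    return false;                         /* TLB miss */
}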

42


Example: data TLB

[Figure: data TLB entry format — the virtual address is split into a 35-bit VPN and a 13-bit page offset; each TLB entry holds an 8-bit address space number (ASN), 4 protection bits, a valid bit, the 35-bit tag, and a 31-bit PPN; a 128:1 multiplexer selects the matching entry, and the PPN plus the offset form the 44-bit physical address]

44

TLB and Caches

• Several design alternatives
  – VIVT: virtually indexed, virtually tagged cache
  – VIPT: virtually indexed, physically tagged cache
  – PIVT: physically indexed, virtually tagged cache
    · Not outright useful; the MIPS R6000 is the only design that used it
  – PIPT: physically indexed, physically tagged cache

45


Virtually-Indexed Virtually-Tagged (VIVT)

[Figure: VIVT organization — the processor core indexes the VIVT cache directly with the virtual address; the TLB and main memory are consulted only on a miss, and the cache line is returned on the fill]

• Fast cache access
• Address translation is required only when going to memory (on a miss)
• Issues?

47

VIVT Cache Issues – Aliasing

• Homonym
  – The same VA maps to different PAs
  – Occurs when there is a context switch
  – Solutions
    · Include the process ID (PID) in the cache tag, or
    · Flush the cache upon context switches
• Synonym (also a problem in VIPT)
  – Different VAs map to the same PA
  – Occurs when data is shared by multiple processes
  – Leads to duplicated cache lines in a VIPT cache, and in a VIVT cache with PIDs
  – Data becomes inconsistent due to the duplicated locations
  – Solutions
    · Can write-through solve the problem?
    · Flush the cache upon context switch
    · If (index + offset) < page offset, can the problem be solved? (discussed later for VIPT)

48


Physically-Indexed Physically-Tagged (PIPT)

[Figure: PIPT organization — the virtual address is first translated by the TLB; the resulting physical address indexes the PIPT cache, and main memory is accessed on a miss]

• Slower: the address is always translated before accessing the cache
• Simpler for data coherence

50

Virtually-Indexed Physically-Tagged (VIPT)

[Figure: VIPT organization — the TLB translation and the VIPT cache index proceed in parallel from the virtual address; the physical tag from the TLB is compared against the cache tags, and main memory is accessed on a miss]

• Gains the benefits of both VIVT and PIPT
• Parallel access to the TLB and the VIPT cache
• No homonym problem
• How about synonyms?

51

Deal w/ Synonym in VIPT Cache

[Figure: process A's VPN A and process B's VPN B point to the same location within a shared physical page; since VPN A != VPN B, the two virtual addresses can index different sets of the cache, duplicating the line in the tag and data arrays]

• How to eliminate the duplication? Make cache index A == cache index B?

52

Synonym in VIPT Cache

• Address bit fields: VPN | page offset, and cache tag | set index | line offset
  – "a" = the high-order set-index bits that fall inside the VPN (above the page-offset boundary)
• If two VPNs do not differ in "a", there is no synonym problem, since they will be indexed to the same set of a VIPT cache
• This implies the number of sets cannot be too big
  – Max number of sets = page size / cache line size
  – Example: 4 KB page, 32 B line → max 128 sets (worked out below)
• A more complicated solution is used in the MIPS R10000

53

R10000’s Solution to Synonym

• 32 KB, 2-way, virtually indexed L1
  – [Figure: address fields — 12-bit page offset, with the L1 using a 10-bit set index and a 4-bit line offset, so a = VPN[1:0]]
• Direct-mapped physical L2
  – a = VPN[1:0] is stored as part of the L2 cache tag
  – L2 is inclusive of L1
  – VPN[1:0] is appended to the "tag" of L2
• Given two virtual addresses VA1 and VA2 that differ in VPN[1:0] and both map to the same physical address PA:
  – Suppose VA1 is accessed first, so blocks are allocated in L1 and L2
  – What happens when VA2 is referenced?
    1. VA2 indexes to a different block in L1 and misses
    2. VA2 translates to PA and goes to the same block as VA1 in L2
    3. The tag comparison fails (since VA1[1:0] != VA2[1:0])
    4. It is treated just like an L2 conflict miss: VA1's entry in L1 is evicted (written back if dirty) due to the inclusion policy
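As a quick check of where the two overlap bits come from (using the 10-bit index and 4-bit line offset shown above):

\[
\underbrace{10}_{\text{set index}} + \underbrace{4}_{\text{line offset}} = 14 \text{ bits used to index L1},
\qquad
14 - \underbrace{12}_{\text{page offset}} = 2 \;\Rightarrow\; a = \mathrm{VPN}[1{:}0]
\]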

54

Deal w/ Synonym in MIPS R10000

[Figure: VA1's copy is resident in the virtually indexed L1; VA2, which maps to the same physical page but has different overlap bits (a2 != a1), misses in L1; the L2 lookup with the physical index reaches VA1's block, and the mismatch between the stored a1 and a2 reveals that a synonym copy exists in L1]

55

Deal w/ Synonym in MIPS R10000

[Figure: the L2 conflict-miss handling evicts VA1's copy from L1; the data is then returned from L2 and allocated under index a2, so only one copy is present in L1]

56
