EEC 581 Architecture

Multiprocessor and

Department of Electrical Engineering and Computer Science Cleveland State University

Memory Hierarchy in a Multiprocessor

[Figure: three shared-memory organizations — (1) shared bus-based: processors with private caches ($) and memory on a single shared bus; (2) fully-connected shared memory ("dancehall"): processors with caches on one side of an interconnection network and memory modules on the other; (3) distributed shared memory: each node pairs a processor, a cache, and local memory, with nodes connected through an interconnection network.]

2

Cache Coherency

 The cache level closest to the processor is private
 Multiple copies of a cache line can be present across different processor nodes
 Local updates → lead to an incoherent state
 The problem occurs with both write-through and writeback caches
 Bus-based interconnect → writes are globally visible
 Point-to-point interconnect → writes are visible only to the communicating processor nodes

3

Example (Writeback Cache)

[Figure: writeback example — two processors have read X = -100 into their caches; a third processor then writes X = 505 in its own cache. Memory and the other caches still hold X = -100, so later reads return stale data.]

4

Example (Write-through Cache)

[Figure: write-through example — one processor writes X = 505 and the write goes through to memory (memory changes from -100 to 505). Another processor's cached copy still holds the stale value -100, so its later read returns stale data.]

5

Defining Coherence

 An MP is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order

Implicit definition of coherence
 Write propagation
  Writes are visible to other processes
 Write serialization
  All writes to the same location are seen in the same order by all processes (when this holds for writes to all locations, it is called write atomicity)

 E.g., if a read from P1 sees w1 followed by w2, then reads by every other processor Pi see w1 and w2 in the same order
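A minimal C sketch (not from the slides; thread and variable names are illustrative) of what write propagation and write serialization promise at the program level. C11 relaxed atomics guarantee a single modification order per object, which mirrors write serialization: once any reader has observed w2, no later read may return w1.

/* Build with: cc -std=c11 -pthread example.c */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static _Atomic int x = 0;          /* one shared memory location */

static void *writer(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);   /* w1 */
    atomic_store_explicit(&x, 2, memory_order_relaxed);   /* w2 */
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    int first  = atomic_load_explicit(&x, memory_order_relaxed);
    int second = atomic_load_explicit(&x, memory_order_relaxed);
    /* Write serialization: seeing w2 (x == 2) and then w1 (x == 1)
     * would mean the writes to x were observed in different orders. */
    if (first == 2 && second == 1)
        printf("coherence violated\n");
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}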

6

Sounds Easy?

[Figure: initially A = 0 and B = 0. At time T1, P0 writes A = 1 and P1 writes B = 2. As the updates propagate through the machine at T2 and T3, some processors see A's update before B's while others see B's update before A's.]

7

Bus Snooping based on Write-Through Cache

 All writes appear as transactions on the shared bus to memory

 Two protocols
  Update-based protocol
  Invalidation-based protocol

8

Bus Snooping (Update-based Protocol on Write-Through cache)

[Figure: one processor writes X = 505; the write appears as a bus transaction and goes through to memory (-100 → 505). The snooping cache holding X updates its copy from -100 to 505.]

 Each processor's cache controller constantly snoops on the bus
 Update local copies upon a snoop hit

9

Bus Snooping (Invalidation-based Protocol on Write-Through cache)

[Figure: one processor writes X = 505; the write goes through to memory (-100 → 505) and the snooping caches invalidate their copies of X. A later Load X by another processor misses and fetches the new value from memory.]

 Each processor's cache controller constantly snoops on the bus
 Invalidate local copies upon a snoop hit

10

A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache

[State diagram: two states, Valid and Invalid. Edge labels are Observed event / Generated bus transaction; transitions are either processor-initiated or bus-snooper-initiated.]

 Valid: PrRd / --- and PrWr / BusWr (line stays Valid)
 Valid → Invalid: BusWr / --- (another processor's write is snooped)
 Invalid → Valid: PrRd / BusRd
 Invalid: PrWr / BusWr (line stays Invalid; no write-allocate)

11
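A minimal sketch of the two-state controller above for a single cache line, assuming a write-through, no-write-allocate cache. The enum and function names are illustrative, not taken from the slides.

#include <stdio.h>

typedef enum { INVALID, VALID } LineState;
typedef enum { BUS_NONE, BUS_RD, BUS_WR } BusTxn;

/* Processor-initiated side: returns the bus transaction to issue. */
BusTxn on_processor(LineState *s, int is_write) {
    if (is_write)                 /* PrWr: write through, no allocate   */
        return BUS_WR;            /* state unchanged (Valid or Invalid) */
    if (*s == INVALID) {          /* PrRd miss                          */
        *s = VALID;
        return BUS_RD;            /* fetch the line from memory         */
    }
    return BUS_NONE;              /* PrRd hit                           */
}

/* Bus-snooper-initiated side: a remote write invalidates our copy. */
void on_snoop(LineState *s, BusTxn observed) {
    if (observed == BUS_WR && *s == VALID)
        *s = INVALID;             /* BusWr / ---                        */
}

int main(void) {
    LineState line = INVALID;
    on_processor(&line, 0);       /* PrRd miss -> BusRd, line Valid     */
    on_snoop(&line, BUS_WR);      /* remote write -> line Invalid       */
    printf("state = %s\n", line == VALID ? "Valid" : "Invalid");
    return 0;
}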

Cache Coherence Protocols for WB caches

 A cache has an exclusive copy of a line if
  It is the only cache holding a valid copy
  Memory may or may not be up to date
 Modified (dirty) cache line
  The cache holding the line is the owner of the line, because it must supply the block

12

Cache Coherence Protocol (Update-based Protocol on Writeback cache)

[Figure: one processor stores X = 505; the new value is broadcast as a bus transaction and every cache sharing X updates its copy from -100 to 505.]

• Update the data in all processor nodes that share the same data
• If a processor node keeps updating the same memory location, a lot of traffic will be incurred

13

Cache Coherence Protocol (Update-based Protocol on Writeback cache)

[Figure: one processor stores X = 333; the update is broadcast and all sharing caches change their copies from 505 to 333. A later Load X by another processor hits in its own cache.]

• Update the data in all processor nodes that share the same data
• If a processor node keeps updating the same memory location, a lot of traffic will be incurred

14

Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache)

[Figure: one processor stores X = 505; an invalidation is broadcast on the bus and the other caches invalidate their copies of X (-100). Only the writer's cache now holds the line.]

• Invalidate the data copies in the sharing processor nodes
• Traffic is reduced when a processor node keeps updating the same memory location

15

Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache)

[Figure: a processor whose copy was invalidated later loads X and misses; the snoop hits in the writer's cache, which supplies X = 505.]

• Invalidate the data copies in the sharing processor nodes
• Traffic is reduced when a processor node keeps updating the same memory location

16

Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache)

[Figure: one processor keeps storing to X (505 → 333 → 987 → 444) in its own cache. After the other copy is invalidated once, the repeated stores cause no further bus transactions.]

• Invalidate the data copies in the sharing processor nodes
• Traffic is reduced when a processor node keeps updating the same memory location

17

MSI Writeback Invalidation Protocol

 Modified
  Dirty
  Only this cache has a valid copy
 Shared
  Memory is consistent
  One or more caches have a valid copy
 Invalid

 Writeback protocol: A cache line can be written multiple times before the memory is updated.

18

MSI Writeback Invalidation Protocol

 Two types of requests from the processor
  PrRd
  PrWr

 Three types of bus transactions posted by the cache controller
  BusRd
   PrRd misses the cache
   Memory or another cache supplies the line
  BusRdX (BusRd eXclusive, read-to-own)
   PrWr is issued to a line that is not in the Modified state
  BusWB
   Writeback due to replacement
   The processor is not directly involved in initiating this operation
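A minimal C sketch of the MSI transitions for one cache line, split into the processor-request side and the bus-snoop side. The names and the flush flag are illustrative assumptions; the small main() replays the first steps of the P1/P3 example on the following slides.

#include <stdio.h>

typedef enum { I, S, M } MsiState;
typedef enum { NONE, BUS_RD, BUS_RDX, BUS_WB } BusTxn;

/* Processor side: PrRd or PrWr; returns the bus transaction generated. */
BusTxn msi_processor(MsiState *st, int is_write) {
    switch (*st) {
    case M:                        /* read and write hits, no bus traffic */
        return NONE;
    case S:
        if (is_write) { *st = M; return BUS_RDX; }   /* read-to-own       */
        return NONE;                                  /* read hit          */
    case I:
    default:                                          /* miss              */
        *st = is_write ? M : S;
        return is_write ? BUS_RDX : BUS_RD;
    }
}

/* Snoop side: reacts to BusRd/BusRdX from other processors.
 * *flush is set when this cache must supply (and write back) the line. */
void msi_snoop(MsiState *st, BusTxn observed, int *flush) {
    *flush = 0;
    if (*st == M && (observed == BUS_RD || observed == BUS_RDX))
        *flush = 1;                                   /* owner supplies data */
    if (observed == BUS_RDX)                *st = I;  /* another cache owns  */
    else if (observed == BUS_RD && *st == M) *st = S;
}

int main(void) {                   /* P1 reads X, P3 reads X, P3 writes X */
    MsiState p1 = I, p3 = I; int flush;
    msi_processor(&p1, 0);                            /* P1 -> S           */
    msi_processor(&p3, 0);                            /* P3 -> S           */
    msi_processor(&p3, 1); msi_snoop(&p1, BUS_RDX, &flush);
    printf("P1=%d P3=%d\n", p1, p3);                  /* P1=I, P3=M        */
    return 0;
}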

19

MSI Example

[Figure: P1 reads X; memory supplies X = 10 over the bus via BusRd, and P1's cache holds X = 10 in state S. Memory holds X = 10.]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory

20

MSI Example

[Figure: P3 also reads X; memory supplies X = 10 via BusRd, and both P1 and P3 hold X = 10 in state S.]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory

21

MSI Example

[Figure: P3 writes X = -25. A BusRdX invalidates P1's copy (S → I); P3's line goes S → M with X = -25. Memory still holds X = 10.]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory
P3 writes X      | I           | ---         | M           | BusRdX          | Memory

The data would not come from memory if the protocol had a "BusUpgrade" transaction

22

MSI Example

[Figure: P1 reads X again and misses. P3's cache supplies X = -25 via BusRd, its line goes M → S, and memory is updated from 10 to -25. P1 now holds X = -25 in state S.]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory
P3 writes X      | I           | ---         | M           | BusRdX          | Memory
P1 reads X       | S           | ---         | S           | BusRd           | P3 Cache

23

MSI Example

[Figure: P2 reads X; memory (now holding -25) supplies the data via BusRd, and P1, P2, and P3 all hold X = -25 in state S.]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory
P3 writes X      | I           | ---         | M           | BusRdX          | Memory
P1 reads X       | S           | ---         | S           | BusRd           | P3 Cache
P2 reads X       | S           | S           | S           | BusRd           | Memory

24

MESI Writeback Invalidation Protocol

 To reduce two types of unnecessary bus transactions
  BusRdX that converts the block from S to M when this cache is the sole holder of the block
  BusRd that fetches the line in the S state when there are no other sharers (which leads to the overhead above)

 Introduce the Exclusive state
  One can write to the copy without generating a BusRdX

 Illinois Protocol: proposed by Papamarcos and Patel in 1984
  Employed in Intel, PowerPC, and MIPS processors
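A minimal sketch of how the Exclusive state removes the two unnecessary transactions: a read miss samples the bus "shared" signal to choose between E and S, and a later write to an E line upgrades silently. Names and the function split are illustrative assumptions.

#include <stdio.h>

typedef enum { INV, SHARED, EXCL, MOD } MesiState;

/* Read miss: the shared (S) signal tells us whether any other cache
 * asserted that it holds the line. */
void mesi_read_miss(MesiState *st, int shared_signal) {
    *st = shared_signal ? SHARED : EXCL;  /* PrRd/BusRd(S) vs BusRd(not-S) */
}

/* Write: E -> M silently; S or I -> M needs a BusRdX.
 * Returns 1 if a BusRdX must be issued. */
int mesi_write(MesiState *st) {
    int need_busrdx = (*st == SHARED || *st == INV);
    *st = MOD;
    return need_busrdx;
}

int main(void) {
    MesiState line = INV;
    mesi_read_miss(&line, 0);             /* no sharers -> Exclusive       */
    printf("BusRdX needed on write: %d\n", mesi_write(&line));   /* 0      */
    return 0;
}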

25

MESI Writeback Invalidation Protocol: Processor Requests (Illinois Protocol)

[State diagram, processor-initiated transitions; S = shared signal on the bus]
 Modified: PrRd / --- and PrWr / --- (line stays Modified)
 Exclusive: PrRd / --- (stays Exclusive); PrWr / --- moves the line to Modified
 Shared: PrRd / --- (stays Shared); PrWr / BusRdX moves the line to Modified
 Invalid: PrRd / BusRd(S) moves the line to Shared; PrRd / BusRd(not-S) moves the line to Exclusive; PrWr / BusRdX moves the line to Modified

26

MESI Writeback Invalidation Protocol: Bus Transactions (Illinois Protocol)

• Whenever possible, the Illinois protocol performs a cache-to-cache ($-to-$) transfer rather than having memory supply the data
• Use a selection algorithm if there are multiple possible suppliers (alternative: add an O state, or force an update of memory)
• Most MESI implementations simply write back to memory

[State diagram, bus-snooper-initiated transitions]
 Modified: BusRd / Flush moves the line to Shared; BusRdX / Flush moves the line to Invalid
 Exclusive: BusRd / Flush (or ---) moves the line to Shared; BusRdX / --- moves the line to Invalid
 Shared: BusRd / Flush* keeps the line Shared; BusRdX / Flush* moves the line to Invalid
 Flush*: flush by the data supplier; no action by the other sharers

27

MESI Writeback Invalidation Protocol (Illinois Protocol)

[State diagram combining the processor-initiated transitions of slide 26 and the bus-snooper-initiated transitions of slide 27 in one figure. S: shared signal; Flush*: flush by the data supplier, no action by the other sharers.]

28

MOESI Protocol

 Add one additional state: the Owner (O) state
  Similar to the Shared state
  The processor holding the line in the O state is responsible for supplying the data (the copy in memory may be stale)
 Employed by
  Sun UltraSPARC
  AMD Opteron
   In the dual-core Opteron, cache-to-cache transfer is done through a system request interface (SRI) running at full CPU speed

[Figure: dual-core Opteron block diagram — CPU0 and CPU1 with private L2 caches, connected through the System Request Interface and crossbar to the memory controller and HyperTransport.]

29
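A minimal sketch of the snoop response with an Owner state, assuming the owner keeps supplying the line instead of writing it back to memory; the exact supplier rules (e.g. whether an Exclusive line is supplied by the cache or by memory) vary by implementation, so this is only illustrative.

#include <stdio.h>

typedef enum { M_I, M_S, M_E, M_O, M_M } MoesiState;

/* Returns 1 if this cache must supply the data for an observed BusRd. */
int moesi_snoop_busrd(MoesiState *st) {
    switch (*st) {
    case M_M: *st = M_O; return 1;   /* keep ownership; memory stays stale */
    case M_O:            return 1;   /* owner keeps supplying the line     */
    case M_E: *st = M_S; return 1;   /* clean copy supplied (assumed)      */
    default:             return 0;   /* Shared/Invalid: someone else does  */
    }
}

int main(void) {
    MoesiState line = M_M;
    int supply = moesi_snoop_busrd(&line);
    printf("supply=%d state=%d (3 = Owner)\n", supply, line);
    return 0;
}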

Implications for Multi-Level Caches

 How to guarantee coherence in a multi-level cache hierarchy
  Snoop all cache levels?
   Intel's 8870 chipset has a "snoop filter" for quad-core

 Maintaining the inclusion property
  Ensure that data present in the inner level (closer to the processor) is also present in the outer level
  Only snoop the outermost level (e.g. L2)
  L2 needs to know when L1 has write hits
   Use a write-through L1 cache
   Use a writeback L1 but maintain an additional "modified-but-stale" bit in L2

30

Inclusion Property

 Not so easy …
  Replacement: different levels observe different access activities, e.g. L2 may replace a line that is frequently accessed in L1
  Split L1 caches: imagine all caches are direct-mapped
  Different cache line sizes

31

Inclusion Property

 Use specific cache configurations
  E.g., a direct-mapped (DM) L1 plus a bigger DM or set-associative L2 with the same cache line size

 Explicitly propagate L2 actions to L1 (see the sketch below)
  An L2 replacement will flush the corresponding L1 line
  An observed BusRdX bus transaction will invalidate the corresponding L1 line
  To avoid excess traffic, L2 maintains an inclusion bit per line for filtering (to indicate whether the line is also in L1)
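A minimal sketch of the back-invalidation path just described, with an assumed per-line inclusion bit and a placeholder L1 hook; a real controller would also pull the dirty data out of L1 first when the modified-but-stale bit is set.

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool valid;
    bool in_l1;                /* inclusion bit: line also present in L1 */
    bool modified_but_stale;   /* L1 holds a newer copy than L2          */
} L2Line;

void l1_invalidate(unsigned long addr) {       /* placeholder for L1 hook */
    printf("L1: invalidate block 0x%lx\n", addr);
}

/* Called when L2 evicts the line or snoops a BusRdX for it. */
void l2_back_invalidate(L2Line *line, unsigned long addr) {
    if (line->valid && line->in_l1)            /* inclusion bit filters    */
        l1_invalidate(addr);                   /* keep L1 a subset of L2   */
        /* if modified_but_stale, the L1 data must be retrieved first      */
    line->valid = false;
    line->in_l1 = false;
    line->modified_but_stale = false;
}

int main(void) {
    L2Line line = { .valid = true, .in_l1 = true, .modified_but_stale = true };
    l2_back_invalidate(&line, 0x1000);         /* e.g. an L2 replacement   */
    return 0;
}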

32

Directory-based Coherence Protocol

[Figure: processors with private caches connected through an interconnection network to memory; the directory next to memory holds, for each memory block, a modified bit and presence bits, one for each node.]

 Snooping-based protocol
  N transactions for an N-node MP
  All caches need to watch every memory request from each processor
  Not a scalable solution for maintaining coherence in large shared-memory systems
 Directory protocol
  Directory-based control of who has what
  HW overhead to keep the directory (~ # lines * # processors)

33

Directory-based Coherence Protocol

[Figure: processors with caches connected through an interconnection network to memory. Each cache block in memory has a directory entry of bits, e.g. C(k) = 0 1 0 0 0 1 1 0, C(k+1) = 1 0 1 0 0 0 0 0, C(k+j) = 0 0 1 0 0 0 0 1.]

1 modified bit for each cache block in memory
1 presence bit for each processor, for each cache block in memory
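A minimal sketch of a full-map directory entry as described above: one modified bit plus one presence bit per node for each memory block. The node count and the overhead calculation (#blocks x (#nodes + 1) bits) are for illustration only.

#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 8                       /* assumed system size           */

typedef struct {
    uint8_t  modified;                    /* 1 bit: some cache holds it dirty */
    uint32_t presence;                    /* 1 presence bit per node        */
} DirEntry;

/* Storage overhead grows as (#memory blocks) x (#nodes + 1) bits. */
static unsigned long dir_bits(unsigned long num_blocks, unsigned nodes) {
    return num_blocks * (nodes + 1UL);
}

int main(void) {
    DirEntry e = { .modified = 0, .presence = 0 };
    e.presence |= 1u << 3;                /* node 3 caches the block        */
    e.presence |= 1u << 5;                /* node 5 caches the block        */
    printf("sharers mask = 0x%x, dir bits for 1M blocks = %lu\n",
           (unsigned)e.presence, dir_bits(1UL << 20, NUM_NODES));
    return 0;
}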

34

Directory-based Coherence Protocol (Limited Directory)

[Figure: 16 processors (P0, P1, …, P13, P14, P15) with caches connected through an interconnection network to memory. Each directory entry holds a modified bit and a small number of encoded node pointers instead of a full presence-bit vector.]

1 modified bit for each cache block in memory
Presence encoding indicates whether each pointer is NULL or not
Encoded presence pointers (log2 N bits each); in this example each cache line can reside in at most 2 processors' caches

35
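A minimal sketch of a limited directory entry that stores at most two encoded node pointers, matching the example above. How pointer overflow is handled (broadcast, evicting a pointer, or a software trap) is a design choice and is only signaled here.

#include <stdbool.h>
#include <stdio.h>

#define MAX_PTRS 2                        /* at most 2 sharers per line    */

typedef struct {
    bool modified;
    bool ptr_valid[MAX_PTRS];             /* presence encoding: NULL or not */
    unsigned ptr[MAX_PTRS];               /* log2(N)-bit node numbers       */
} LimitedDirEntry;

/* Returns 0 on success, -1 on pointer overflow (protocol must then
 * broadcast, evict a pointer, or fall back to software). */
int add_sharer(LimitedDirEntry *e, unsigned node) {
    for (int i = 0; i < MAX_PTRS; i++)
        if (e->ptr_valid[i] && e->ptr[i] == node)
            return 0;                     /* already recorded              */
    for (int i = 0; i < MAX_PTRS; i++)
        if (!e->ptr_valid[i]) {
            e->ptr_valid[i] = true;
            e->ptr[i] = node;
            return 0;
        }
    return -1;                            /* pointer overflow              */
}

int main(void) {
    LimitedDirEntry e = {0};
    add_sharer(&e, 1);
    add_sharer(&e, 14);
    printf("overflow on 3rd sharer: %d\n", add_sharer(&e, 7));   /* -1     */
    return 0;
}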

Distributed Directory Coherence Protocol

[Figure: each node consists of a processor, a cache, local memory, and a slice of the directory; the nodes are connected through an interconnection network.]

 A centralized directory is less scalable (contention)
 Distributed shared memory (DSM) for a large MP system
  The interconnection network is no longer a shared bus
  Maintain cache coherence (CC-NUMA)
  Each address has a "home" node (see the sketch below)

36
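A minimal sketch of one common way "each address has a home": interleave memory blocks across nodes by block address. The block size and node count are assumptions, not values from the slides.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6                      /* 64-byte cache blocks (assumed) */
#define NUM_NODES  16                     /* assumed system size            */

/* The node whose local memory and directory slice own this address. */
static unsigned home_node(uint64_t paddr) {
    return (unsigned)((paddr >> BLOCK_BITS) % NUM_NODES);
}

int main(void) {
    printf("home of 0x12345 = node %u\n", home_node(0x12345));
    return 0;
}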

Distributed Directory Coherence Protocol

[Figure: two clusters, each with processors and caches on a snoop bus plus local memory and a directory; the clusters are connected through an interconnection network.]

 Stanford DASH (4 CPUs in each cluster, 16 clusters in total)
  Invalidation-based cache coherence
  The directory keeps one of 3 states for each cache block at its home node
   Uncached
   Shared (unmodified state)
   Dirty

37

DASH

[Figure: clusters of processors with caches on a snoop bus, local memory, and a directory, connected through an interconnection network.]

 Processor level
 Local cluster level
 Home cluster level (the address is at its home); if the block is dirty, it must be fetched from the remote node that owns it
 Remote cluster level

38

Directory Coherence Protocol: Read Miss

[Figure: a processor misses on a read of Z and sends a request to Z's home node. The home directory shows Z is shared (clean), so the home memory supplies the data and sets the requester's presence bit.]

Data Z is shared (clean)

39

Directory Coherence Protocol: Read Miss

[Figure: a processor misses on a read of Z and goes to Z's home node. Z is dirty at a remote owner, so the home responds with the owner information; the requester sends a data request to the owner, which supplies Z. Afterwards Z is clean and shared.]

40
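A minimal sketch of the home node's handling of a read miss for the two cases on these slides (block clean at home vs. dirty at a remote owner). The message functions are hypothetical placeholders, not a real API, and the exact state the owner keeps differs between protocols.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    int      modified;        /* block dirty at exactly one owner          */
    unsigned owner;           /* valid only when modified                  */
    uint32_t presence;        /* one presence bit per node                 */
} DirEntry;

/* Hypothetical message hooks. */
void send_data_from_memory(unsigned dst)            { printf("memory data -> node %u\n", dst); }
void forward_to_owner(unsigned owner, unsigned dst) { printf("node %u supplies node %u\n", owner, dst); }

void home_read_miss(DirEntry *d, unsigned requester) {
    if (!d->modified) {
        send_data_from_memory(requester);     /* clean: memory supplies     */
    } else {
        /* dirty: respond with owner info; the owner supplies the data and
         * the block becomes clean and shared (as in the slide).            */
        forward_to_owner(d->owner, requester);
        d->presence |= 1u << d->owner;        /* owner keeps a shared copy  */
        d->modified = 0;
    }
    d->presence |= 1u << requester;           /* record the new sharer      */
}

int main(void) {
    DirEntry d = { .modified = 1, .owner = 2, .presence = 1u << 2 };
    home_read_miss(&d, 0);
    printf("presence = 0x%x, modified = %d\n", (unsigned)d.presence, d.modified);
    return 0;
}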

Directory Coherence Protocol: Write Miss

[Figure: P0 misses on a write of Z and goes to Z's home node. The home responds with the list of sharers; invalidations are sent to the sharing nodes, which acknowledge (ACK). Once the ACKs are collected, the write can proceed in P0.]

Write Z can proceed in P0

41
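A minimal sketch of the home node's handling of the write miss above: look up the sharers, send invalidations, collect the ACKs, then let the requester's write proceed with the block recorded as modified. The message functions are hypothetical placeholders, not a real API.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    int      modified;
    unsigned owner;
    uint32_t presence;        /* one presence bit per node                 */
} DirEntry;

/* Hypothetical message hooks. */
void send_invalidate(unsigned node) { printf("invalidate -> node %u\n", node); }
void wait_for_acks(unsigned count)  { printf("waiting for %u ACKs\n", count); }
void grant_write(unsigned node)     { printf("write may proceed in node %u\n", node); }

void home_write_miss(DirEntry *d, unsigned requester, unsigned num_nodes) {
    unsigned acks = 0;
    for (unsigned n = 0; n < num_nodes; n++) {
        if (n != requester && (d->presence & (1u << n))) {
            send_invalidate(n);               /* each sharer drops its copy */
            acks++;
        }
    }
    wait_for_acks(acks);                      /* all other copies are gone  */
    d->presence = 1u << requester;            /* requester is sole holder   */
    d->modified = 1;
    d->owner    = requester;
    grant_write(requester);                   /* "Write Z can proceed in P0" */
}

int main(void) {
    DirEntry d = { .modified = 0, .presence = (1u << 1) | (1u << 3) };
    home_write_miss(&d, 0, 4);
    return 0;
}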
