EEC 581 Computer Architecture
Multiprocessor and Memory Coherence
Department of Electrical Engineering and Computer Science Cleveland State University
Memory Hierarchy in a Multiprocessor
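[Figure: four multiprocessor memory organizations (P = processor, $ = cache)]
• Shared cache: processors share one cache in front of memory
• Bus-based shared memory: private caches connected to memory by a shared bus
• Fully-connected shared memory (Dancehall): private caches on one side of an interconnection network, memories on the other
• Distributed shared memory: each processor node has its own cache and local memory; nodes are connected by an interconnection network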
Cache Coherency
• Closest cache level is private
• Multiple copies of a cache line can be present across different processor nodes
• Local updates lead to an incoherent state
  - The problem arises in both write-through and writeback caches
• Bus-based: updates are globally visible
• Point-to-point interconnect: updates are visible only to the communicating processor nodes
Example (Writeback Cache)
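[Figure: three processors with private writeback caches. Memory and all three caches start with X = -100. One processor writes X = 505 in its own cache only; memory still holds X = -100. Reads (Rd?) by the other two processors return the stale value -100.]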
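Example (Write-through Cache)
[Figure: three processors with private write-through caches. Memory and the caches start with X = -100. One processor writes X = 505, which updates its own cache and memory. A cache that loads X afterwards gets 505, but a cache that still holds X = -100 returns the stale value on a read (Rd?).]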
Defining Coherence
• An MP is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order
• Implicit definition of coherence
  - Write propagation: writes are visible to other processes
  - Write serialization: all writes to the same location are seen in the same order by all processes (extended to “all” locations, this is called write atomicity)
  - E.g., if w1 followed by w2 is seen by a read from P1, the two writes will be seen in the same order by all reads from other processors Pi
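The write-serialization condition can be checked mechanically. The following is a minimal sketch, assuming each processor's observations of writes to one location are recorded as a list of write labels; the function name `writes_serialized` and the labels `w1`, `w2` are hypothetical and chosen only for illustration.

```python
def writes_serialized(observations):
    """Return True if every pair of writes appears in the same relative
    order in all observers' lists (writes to a single location)."""
    seen_order = {}  # (w_a, w_b) -> True means some observer saw w_a before w_b
    for obs in observations:
        for i, first in enumerate(obs):
            for second in obs[i + 1:]:
                if seen_order.get((second, first)):
                    return False  # another observer saw the opposite order
                seen_order[(first, second)] = True
    return True


# P1 and P2 both see w1 before w2: serialized.
print(writes_serialized([["w1", "w2"], ["w1", "w2"]]))  # True
# P2 sees w2 before w1: write serialization is violated.
print(writes_serialized([["w1", "w2"], ["w2", "w1"]]))  # False
```

An observer that misses a write (for example `["w1", "w3"]`) does not violate the check as long as the writes it does see keep their relative order.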
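Sounds Easy?
Initially A=0 and B=0. At T1, P0 writes A=1 and P3 writes B=2. The table shows which updates each processor has seen, in the order seen.

Time   P0          P1          P2          P3
T1     A=1                                 B=2
T2     A=1         A=1         B=2         B=2
T3     A=1         A=1, B=2    B=2, A=1    B=2
T3     A=1, B=2    A=1, B=2    B=2, A=1    B=2, A=1

• P0 and P1 see A’s update before B’s
• P2 and P3 see B’s update before A’s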
Bus Snooping based on Write-Through Cache
• Every write appears as a transaction on the shared bus to memory
• Two protocols
  - Update-based Protocol
  - Invalidation-based Protocol
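Bus Snooping (Update-based Protocol on Write-Through cache)
[Figure: three processors with private caches on a shared bus. One processor writes X = 505 over X = -100; the write-through appears as a bus transaction that updates memory to X = 505, and a snooping cache that holds X updates its copy to 505.]
• Each processor’s cache controller constantly snoops on the bus
• Update local copies upon snoop hit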
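Bus Snooping (Invalidation-based Protocol on Write-Through cache)
[Figure: three processors with private caches on a shared bus. One processor's write of X = 505 appears as a bus transaction that updates memory; a snooping cache holding X = -100 invalidates its copy, and its next Load X fetches X = 505.]
• Each processor’s cache controller constantly snoops on the bus
• Invalidate local copies upon snoop hit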
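A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache
[State diagram; each edge is labeled Observed / Transaction]

State     Observed / Transaction   Next state   Initiated by
Valid     PrRd / ---               Valid        Processor
Valid     PrWr / BusWr             Valid        Processor
Valid     BusWr / ---              Invalid      Bus snooper
Invalid   PrRd / BusRd             Valid        Processor
Invalid   PrWr / BusWr             Invalid      Processor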
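The two-state diagram above can be written as a lookup table. This is a minimal sketch for a single cache line; the names `TRANSITIONS`, `step`, and `run` are hypothetical, and the entry for a snooped BusWr in the Invalid state (no action) is an assumption, since the diagram does not draw it.

```python
# (state, event) -> (next_state, bus_transaction) for one cache line.
# PrRd/PrWr come from the local processor; BusWr is snooped from the bus.
TRANSITIONS = {
    ("V", "PrRd"):  ("V", None),      # read hit, no bus traffic
    ("V", "PrWr"):  ("V", "BusWr"),   # write-through: every write goes to the bus
    ("V", "BusWr"): ("I", None),      # another cache wrote: invalidate local copy
    ("I", "PrRd"):  ("V", "BusRd"),   # read miss: fetch the line
    ("I", "PrWr"):  ("I", "BusWr"),   # no write-allocate: line stays Invalid
    ("I", "BusWr"): ("I", None),      # assumption: nothing to invalidate
}


def step(state, event):
    """Apply one event and return (next_state, bus_transaction)."""
    return TRANSITIONS[(state, event)]


def run(events, state="I"):
    """Replay a list of events and return (event, state_after, bus) tuples."""
    trace = []
    for ev in events:
        state, bus = step(state, ev)
        trace.append((ev, state, bus))
    return trace


for row in run(["PrRd", "PrWr", "BusWr", "PrWr"]):
    print(row)
```

The last event in the usage example shows the no-write-allocate behavior: the write goes to the bus, but the line remains Invalid.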
Cache Coherence Protocols for WB caches
• A cache has an exclusive copy of a line if
  - It is the only cache having a valid copy
  - Memory may or may not have it
• Modified (dirty) cache line
  - The cache having the line is the owner of the line, because it must supply the block
Cache Coherence Protocol (Update-based Protocol on Writeback cache)
[Figure: a Store X by one processor changes X from -100 to 505 in its cache; the bus transaction updates the copies in the other two caches to 505.]
• Update the data in all processor nodes that share it
• If a processor node keeps updating the same memory location, a lot of traffic will be incurred
Cache Coherence Protocol (Update-based Protocol on Writeback cache), continued
[Figure: a second Store X changes X from 505 to 333 and again updates both other caches, so a subsequent Load X on another processor hits.]
Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache)
[Figure: a Store X by one processor changes X from -100 to 505 in its cache; the bus transaction invalidates the copies of X = -100 in the other two caches.]
• Invalidate the data copies in the sharing processor nodes
• Reduced traffic when a processor node keeps updating the same memory location
Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache), continued
[Figure: a Load X by another processor misses; the resulting bus transaction is a snoop hit in the cache holding X = 505, and afterwards both caches hold X = 505.]
Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache), continued
[Figure: repeated Store X operations change X from 505 to 333, 987, and 444 in the writing cache, while another cache shows X = 505; bus transactions are snooped by the other caches.]
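The traffic difference between the two writeback protocols can be illustrated by counting bus transactions for a sequence of accesses to one line. This is a rough sketch under simplifying assumptions that are not from the slides: one line only, no replacements, a read miss or write miss costs one transaction, and an update-based write costs one transaction whenever other sharers exist. The function name `count_bus_transactions` and the trace format are hypothetical.

```python
def count_bus_transactions(trace, protocol):
    """trace: list of (processor, 'R' or 'W') accesses to a single line X.
    protocol: 'update' or 'invalidate'. Returns the bus transaction count."""
    if protocol not in ("update", "invalidate"):
        raise ValueError(f"unknown protocol: {protocol}")
    valid = set()   # processors currently holding a valid copy
    owner = None    # processor holding the line exclusively (invalidate only)
    count = 0
    for proc, op in trace:
        if op == "R":
            if proc not in valid:        # read miss
                count += 1
                valid.add(proc)
                owner = None             # line becomes shared
        elif protocol == "invalidate":
            if owner != proc:            # first write: invalidate other copies
                count += 1
                valid = {proc}
                owner = proc
            # later writes by the owner hit locally: no bus traffic
        else:                            # update-based write
            missed = proc not in valid
            valid.add(proc)
            if missed or len(valid) > 1:  # broadcast the new value to sharers
                count += 1
    return count


# Three processors read X, then P3 stores to X four times in a row.
trace = [("P1", "R"), ("P2", "R"), ("P3", "R")] + [("P3", "W")] * 4
print(count_bus_transactions(trace, "update"))      # 7
print(count_bus_transactions(trace, "invalidate"))  # 4
```

Under the same assumptions, a later read by P1 is free with the update-based protocol but costs one miss with the invalidation-based protocol, which is the trade-off between the two.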
MSI Writeback Invalidation Protocol
• Modified
  - Dirty
  - Only this cache has a valid copy
• Shared
  - Memory is consistent
  - One or more caches have a valid copy
• Invalid
• Writeback protocol: a cache line can be written multiple times before the memory is updated
MSI Writeback Invalidation Protocol
• Two types of request from the processor
  - PrRd
  - PrWr
• Three types of bus transactions posted by the cache controller
  - BusRd
    - PrRd misses the cache
    - Memory or another cache supplies the line
  - BusRd eXclusive (Read-to-own)
    - PrWr is issued to a line which is not in the Modified state
  - BusWB
    - Writeback due to replacement
    - The processor is not directly involved in initiating this operation
MSI Example
[Figure: processors P1, P2, P3 with private caches on a shared bus; memory initially holds X=10]

Processor Action   State in P1   State in P2   State in P3   Bus Transaction   Data Supplier
P1 reads X         S             ---           ---           BusRd             Memory
P3 reads X         S             ---           S             BusRd             Memory
P3 writes X        I             ---           M             BusRdX            Memory
P1 reads X         S             ---           S             BusRd             P3 Cache
P2 reads X         S             S             S             BusRd             Memory

• P3 writes X=-25: P1's copy goes from S to I and P3's copy from S to M; memory still holds X=10
• With a “BusUpgrade” transaction, the data for P3's write does not come from memory
• P1 reads X: P3's cache supplies X=-25 and goes from M to S; memory is updated to X=-25
• P2 reads X: memory supplies X=-25
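The MSI example can be replayed with a small model. This is a minimal sketch for a single location X, assuming that a Modified copy is flushed to memory whenever another cache's BusRd or BusRdX hits it, and that there is no BusUpgrade transaction; the class name `MSIBus` and its methods are hypothetical.

```python
class MSIBus:
    """Toy MSI model for one memory location X shared by several caches."""

    def __init__(self, procs, value):
        self.state = {p: "I" for p in procs}   # per-cache line state
        self.cache = {p: None for p in procs}  # per-cache copy of X
        self.memory = value

    def _owner(self):
        # The cache holding X in Modified state, if any.
        return next((p for p, s in self.state.items() if s == "M"), None)

    def read(self, p):
        """PrRd by p. Returns (bus_transaction, data_supplier)."""
        if self.state[p] != "I":
            return (None, None)                # hit: no bus transaction
        owner, supplier = self._owner(), "Memory"
        if owner is not None:                  # owner flushes and drops to S
            self.memory = self.cache[owner]
            self.state[owner] = "S"
            supplier = f"{owner} Cache"
        self.cache[p], self.state[p] = self.memory, "S"
        return ("BusRd", supplier)

    def write(self, p, value):
        """PrWr by p. Returns (bus_transaction, data_supplier)."""
        if self.state[p] == "M":
            self.cache[p] = value
            return (None, None)                # hit in M: no bus transaction
        owner, supplier = self._owner(), "Memory"
        if owner is not None:                  # previous owner flushes first
            self.memory = self.cache[owner]
            supplier = f"{owner} Cache"
        for q in self.state:                   # BusRdX invalidates other copies
            if q != p:
                self.state[q], self.cache[q] = "I", None
        self.state[p], self.cache[p] = "M", value
        return ("BusRdX", supplier)


bus = MSIBus(["P1", "P2", "P3"], 10)
print(bus.read("P1"))        # ('BusRd', 'Memory')
print(bus.read("P3"))        # ('BusRd', 'Memory')
print(bus.write("P3", -25))  # ('BusRdX', 'Memory')
print(bus.read("P1"))        # ('BusRd', 'P3 Cache')
print(bus.read("P2"))        # ('BusRd', 'Memory')
print(bus.state)             # {'P1': 'S', 'P2': 'S', 'P3': 'S'}
```

The five calls correspond to the five rows of the example table, and the final state matches its last row.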
MESI Writeback Invalidation Protocol
• To reduce two types of unnecessary bus transactions
  - BusRdX that snoops and converts the block from S to M when you are the sole owner of the block
  - BusRd that gets the line in S state when there are no sharers (which leads to the overhead above)
• Introduce the Exclusive state
  - One can write to the copy without generating BusRdX
• Illinois Protocol: proposed by Papamarcos and Patel in 1984
  - Employed in Intel, PowerPC, MIPS
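MESI Writeback Invalidation Protocol: Processor Request (Illinois Protocol)
[State diagram, processor-initiated transitions; S: Shared Signal]

State       Observed / Transaction   Next state
Invalid     PrRd / BusRd (S)         Shared
Invalid     PrRd / BusRd (not-S)     Exclusive
Invalid     PrWr / BusRdX            Modified
Shared      PrRd / ---               Shared
Shared      PrWr / BusRdX            Modified
Exclusive   PrRd / ---               Exclusive
Exclusive   PrWr / ---               Modified
Modified    PrRd, PrWr / ---         Modified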
MESI Writeback Invalidation Protocol: Bus Transactions (Illinois Protocol)
• Whenever possible, the Illinois protocol performs a $-to-$ transfer rather than having memory supply the data
• Use a selection algorithm if there are multiple suppliers (alternative: add an O state or force a memory update)
• Most MESI implementations simply write to memory

[State diagram, bus-snooper-initiated transitions]

State       Observed / Transaction      Next state
Modified    BusRd / Flush               Shared
Modified    BusRdX / Flush              Invalid
Exclusive   BusRd / Flush (or ---)      Shared
Exclusive   BusRdX / ---                Invalid
Shared      BusRd / Flush*              Shared
Shared      BusRdX / Flush*             Invalid

Flush*: Flush for data supplier; no action for other sharers
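MESI Writeback Invalidation Protocol (Illinois Protocol)
[Combined state diagram: processor-initiated and bus-snooper-initiated transitions; S: Shared Signal]

State       Processor-initiated                           Bus-snooper-initiated
Invalid     PrRd / BusRd (S) to Shared;                   (none)
            PrRd / BusRd (not-S) to Exclusive;
            PrWr / BusRdX to Modified
Shared      PrRd / ---; PrWr / BusRdX to Modified         BusRd / Flush*; BusRdX / Flush* to Invalid
Exclusive   PrRd / ---; PrWr / --- to Modified            BusRd / Flush (or ---) to Shared; BusRdX / --- to Invalid
Modified    PrRd, PrWr / ---                              BusRd / Flush to Shared; BusRdX / Flush to Invalid

Flush*: Flush for data supplier; no action for other sharers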
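The MESI transitions can be expressed as two small functions, one for processor requests and one for snooped bus transactions. This is a sketch of the diagram as read here, not a definitive implementation; the function names are hypothetical, the Exclusive-state BusRd case uses the Flush variant of "Flush (or ---)", and the no-action entries for snoops in the Invalid state are an assumption.

```python
def processor_event(state, event, shared_signal=False):
    """Return (next_state, bus_transaction) for a PrRd or PrWr."""
    if event == "PrRd":
        if state == "I":
            # The shared signal tells the requester whether other caches hold the line.
            return ("S" if shared_signal else "E", "BusRd")
        return (state, None)       # read hit in S, E, or M
    if event == "PrWr":
        if state in ("I", "S"):
            return ("M", "BusRdX")
        return ("M", None)         # E -> M without bus traffic; M stays M
    raise ValueError(f"unknown processor event: {event}")


SNOOP = {
    ("M", "BusRd"): ("S", "Flush"),  ("M", "BusRdX"): ("I", "Flush"),
    ("E", "BusRd"): ("S", "Flush"),  ("E", "BusRdX"): ("I", None),
    ("S", "BusRd"): ("S", "Flush*"), ("S", "BusRdX"): ("I", "Flush*"),
    ("I", "BusRd"): ("I", None),     ("I", "BusRdX"): ("I", None),  # assumption
}


def snoop_event(state, event):
    """Return (next_state, action) for a snooped BusRd or BusRdX."""
    return SNOOP[(state, event)]


# A read with no other sharers lands in E, and the following write is silent.
state, bus = processor_event("I", "PrRd", shared_signal=False)
print(state, bus)                       # E BusRd
print(processor_event(state, "PrWr"))   # ('M', None)
```

The usage lines show the saving MESI is designed for: the write after an unshared read needs no BusRdX.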
MOESI Protocol
• Add one additional state: Owner state
  - Similar to the Shared state
  - The processor holding the line in O state is responsible for supplying data (the copy in memory may be stale)
• Employed by
  - Sun UltraSparc
  - AMD Opteron
• In the dual-core Opteron, cache-to-cache transfer is done through a system request interface (SRI) running at full CPU speed
[Figure: dual-core Opteron. CPU0 and CPU1, each with an L2, connect through the System Request Interface and Crossbar to the memory controller and HyperTransport.]
Implication on Multi-Level Caches
• How to guarantee coherence in a multi-level cache hierarchy
  - Snoop all cache levels?
  - Intel’s 8870 chipset has a “snoop filter” for quad-core
• Maintaining the inclusion property
  - Ensure data in the inner level (e.g. L1) is also present in the outer level (e.g. L2)
  - Only snoop the outermost level (e.g. L2)
  - L2 needs to know when L1 has write hits
    - Use a Write-Through cache
    - Use Write-back but maintain another “modified-but-stale” bit in L2
Inclusion Property
• Not so easy …
  - Replacement: different cache levels observe different access activities, e.g. L2 may replace a line frequently accessed in L1
  - Split L1 caches: imagine all caches are direct-mapped; an instruction line and a data line that coexist in the split L1s can conflict in a unified L2
  - Different cache line sizes
Inclusion Property
• Use specific cache configurations
  - E.g., DM L1 + bigger DM or set-associative L2 with the same cache line size
• Explicitly propagate L2 actions to L1
  - An L2 replacement will flush the corresponding L1 line
  - An observed BusRdX bus transaction will invalidate the corresponding L1 line
  - To avoid excess traffic, L2 maintains an Inclusion bit for filtering (to indicate whether the line is in L1 or not)
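The filtering role of the Inclusion bit can be sketched as follows. This is a toy model under assumptions that are not from the slides: cache lines are identified by a tag only, capacity and replacement are ignored, and L1 tells L2 when it evicts a line so the bit can be cleared. The class name `InclusiveL2` and its methods are hypothetical.

```python
class InclusiveL2:
    """Toy L2 that records, per line, whether L1 also holds a copy."""

    def __init__(self):
        self.in_l1 = {}            # L2 tag -> inclusion bit
        self.l1 = set()            # tags present in L1 (always a subset of L2)
        self.l1_invalidations = 0  # how often a snoop had to disturb L1

    def fill(self, tag, into_l1=True):
        # Inclusion: a line enters L1 only if L2 holds it as well.
        self.in_l1[tag] = self.in_l1.get(tag, False) or into_l1
        if into_l1:
            self.l1.add(tag)

    def l1_evict(self, tag):
        # Assumption: L1 notifies L2 on eviction, so the bit stays exact.
        self.l1.discard(tag)
        if tag in self.in_l1:
            self.in_l1[tag] = False

    def snoop_busrdx(self, tag):
        # Remote write: drop the L2 copy; forward to L1 only if the bit is set.
        if self.in_l1.pop(tag, False):
            self.l1.discard(tag)
            self.l1_invalidations += 1


c = InclusiveL2()
c.fill("A")                  # in L2 and L1
c.fill("B", into_l1=False)   # in L2 only
c.snoop_busrdx("B")          # filtered: L1 is not disturbed
c.snoop_busrdx("A")          # forwarded: L1 copy is invalidated
print(c.l1_invalidations)    # 1
```

Only the snoop for line A reaches L1; the snoop for B is absorbed by L2, which is the traffic saving the Inclusion bit is meant to provide.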
Directory-based Coherence Protocol
[Figure: four processors (P), each with a cache ($), connected through an interconnection network to memory. The directory beside memory holds, for each block, a modified bit and presence bits, one for each node.]
• Snooping-based protocol
  - N transactions for an N-node MP
  - All caches need to watch every memory request from each processor
  - Not a scalable solution for maintaining coherence in large shared memory systems
• Directory protocol
  - Directory-based control of who has what
  - HW overheads to keep the directory (~ # lines * # processors)
Directory-based Coherence Protocol
[Figure: processors (P) with caches ($) connected through an interconnection network to memory; each cache block in memory has a directory entry]

Directory entry     Memory block
0 1 0 0 0 1 1 0     C(k)
1 0 1 0 0 0 0 0     C(k+1)
0 0 1 0 0 0 0 1     C(k+j)

• 1 modified bit for each cache block in memory
• 1 presence bit for each processor, for each cache block in memory
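The directory cost stated above (one modified bit plus one presence bit per processor, for every block) is easy to quantify. The following sketch uses a hypothetical system of 1 GiB of memory, 64-byte blocks, and 16 processors; these numbers and the function name are assumptions for illustration only.

```python
def full_map_directory_bits(num_blocks, num_procs):
    """Directory storage in bits: per block, 1 modified bit plus
    1 presence bit for each processor."""
    return num_blocks * (1 + num_procs)


# Hypothetical system: 1 GiB of memory, 64-byte blocks, 16 processors.
blocks = (1 << 30) // 64
bits = full_map_directory_bits(blocks, 16)
print(bits // 8 // (1 << 20), "MiB of directory state")  # 34 MiB of directory state
```

Each 512-bit block carries 17 bits of directory state in this configuration, and the per-block cost grows linearly with the number of processors.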
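Directory-based Coherence Protocol (Limited Dir)
[Figure: processors P0, P1, ..., P13, P14, P15 with caches ($) connected through an interconnection network to memory; each cache block in memory has a limited directory entry]

Modified   Pointer 1 (NULL bit, ID)   Pointer 2 (NULL bit, ID)
0          1  0 0 0 0                 1  1 1 1 0
1          1  0 0 0 1                 0  - - - -
0          0  - - - -                 0  - - - -

• 1 modified bit for each cache block in memory
• 1 bit per pointer to indicate whether the presence encoding is NULL or not
• Encoded presence bits (log2 N per pointer); each cache line can reside in 2 processors in this example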
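The limited-directory entry size follows from the encoding described above. This is a minimal sketch, assuming each entry holds 1 modified bit plus, per pointer, 1 NULL/valid bit and a processor ID of ceil(log2 N) bits; the function names are hypothetical.

```python
def limited_entry_bits(num_procs, pointers):
    """Bits per limited-directory entry: 1 modified bit, plus for each
    pointer 1 NULL/valid bit and a ceil(log2 N)-bit processor ID."""
    id_bits = (num_procs - 1).bit_length()  # ceil(log2 N) for N >= 1
    return 1 + pointers * (1 + id_bits)


def full_map_entry_bits(num_procs):
    """Bits per full-map entry: 1 modified bit plus 1 presence bit per processor."""
    return 1 + num_procs


# 16 processors with 2 pointers per entry, as in the example above.
print(limited_entry_bits(16, 2))    # 11
print(full_map_entry_bits(16))      # 17
# The gap widens with more processors.
print(limited_entry_bits(1024, 2))  # 23
print(full_map_entry_bits(1024))    # 1025
```

The price of the smaller entry is that a block can be tracked in at most `pointers` caches at once; what happens when more sharers appear is a design decision this sketch does not model.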
Distributed Directory Coherence Protocol
[Figure: nodes on both sides of an interconnection network; each node has a processor (P), a cache ($), a memory, and a directory]
• Centralized directory is less scalable (contention)
• Distributed shared memory (DSM) for a large MP system
• Interconnection network is no longer a shared bus
• Maintain cache coherence (CC-NUMA)
• Each address has a “home”
Distributed Directory Coherence Protocol
[Figure: clusters of processors (P) with caches ($) and memories on a snoop bus; each cluster has a directory, and the directories are connected by an interconnection network]
• Stanford DASH (4 CPUs in each cluster, total 16 clusters)
  - Invalidation-based cache coherence
• Directory keeps one of 3 states for a cache block at its home node
  - Uncached
  - Shared (unmodified state)
  - Dirty
DASH Memory Hierarchy
[Figure: clusters of processors (P) with caches ($) and memories on a snoop bus; each cluster has a directory, and the directories are connected by an interconnection network]
• Processor Level
• Local Cluster Level
• Home Cluster Level (address is at home)
  - If dirty, needs to get it from the remote node which owns it
• Remote Cluster Level
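Directory Coherence Protocol: Read Miss
[Figure: three nodes, each with a processor, cache, memory, and directory, on an interconnection network. The first processor misses on a read of Z, whose home is another node.]
• Miss Z (read): go to the home node of Z
• Data Z is shared (clean)
• The directory at the home node sets the presence bit of the requester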
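Directory Coherence Protocol: Read Miss
[Figure: three nodes, each with a processor, cache, memory, and directory, on an interconnection network. The first processor misses on a read of Z, which is dirty in another node's cache.]
• Miss Z (read): go to the home node of Z
• Data Z is dirty: the home node responds with owner info
• The requester sends a data request to the owner
• Afterwards, data Z is clean and shared by 3 nodes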
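Directory Coherence Protocol: Write Miss
[Figure: three nodes, each with a processor, cache, memory, and directory, on an interconnection network. P0 misses on a write of Z, which is cached by other nodes.]
• Miss Z (write): go to the home node of Z
• The home node responds with the sharers
• Invalidate messages are sent to the sharers, and each sharer replies with an ACK
• Write Z can proceed in P0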
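The read-miss and write-miss cases can be sketched with a directory entry that holds a modified bit and presence bits. This is a simplified model, not the DASH implementation: messages are returned as tuples instead of being sent over a network, ACK collection is not modeled, and the names `DirectoryEntry`, `read_miss`, and `write_miss` are hypothetical.

```python
class DirectoryEntry:
    """Home-node directory state for one block: modified bit + presence bits."""

    def __init__(self, num_nodes):
        self.modified = False
        self.presence = [False] * num_nodes


def read_miss(entry, requester):
    """Return the messages produced at the home node for a read miss."""
    if entry.modified:
        owner = entry.presence.index(True)      # exactly one holder when dirty
        msgs = [("owner_info", requester, owner)]  # requester asks the owner for data
        entry.modified = False                  # block becomes clean and shared
    else:
        msgs = [("data", requester)]            # clean: home supplies the block
    entry.presence[requester] = True
    return msgs


def write_miss(entry, requester):
    """Invalidate every other sharer, then record the requester as owner."""
    msgs = [("invalidate", node)
            for node, present in enumerate(entry.presence)
            if present and node != requester]
    entry.presence = [node == requester for node in range(len(entry.presence))]
    entry.modified = True
    return msgs


z = DirectoryEntry(num_nodes=3)
read_miss(z, 1)
read_miss(z, 2)
print(write_miss(z, 0))   # [('invalidate', 1), ('invalidate', 2)]
print(read_miss(z, 1))    # [('owner_info', 1, 0)]
print(z.presence)         # [True, True, False]
```

In the usage example, node 0's write proceeds once nodes 1 and 2 have been invalidated, and the later read by node 1 is redirected to node 0 because the directory marks the block dirty.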