18-742 Parallel Computer Architecture Lecture 5: Cache Coherence
Chris Craik (TA) Carnegie Mellon University Readings: Coherence
Required for Review
Papamarcos and Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” ISCA 1984.
Kelm et al, “Cohesion: A Hybrid Memory Model for Accelerators”, ISCA 2010.
Required
Censier and Feautrier, “A new solution to coherence problems in multicache systems,” IEEE Trans. Comput., 1978.
Goodman, “Using cache memory to reduce processor-memory traffic,” ISCA 1983.
Laudon and Lenoski, “The SGI Origin: a ccNUMA highly scalable server,” ISCA 1997.
Lenoski et al, “The Stanford DASH Multiprocessor,” IEEE Computer, 25(3):63-79, 1992.
Martin et al, “Token coherence: decoupling performance and correctness,” ISCA 2003.
Recommended
Baer and Wang, “On the inclusion properties for multi-level cache hierarchies,” ISCA 1988.
Lamport, “How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs”, IEEE Trans. Comput., Sept 1979, pp 690-691.
Culler and Singh, Parallel Computer Architecture, Chapters 5 and 8.
2 Shared Memory Model
Many parallel programs communicate through shared memory
Proc 0 writes to an address, followed by Proc 1 reading
This implies communication between the two
Proc 0: Mem[A] = 1
Proc 1: … Print Mem[A]
Each read should receive the value last written by anyone
This requires synchronization (what does last written mean?)
What if Mem[A] is cached (at either end)?
3 Cache Coherence
Basic question: If multiple processors cache the same block, how do they ensure they all see a consistent state?
[Figure: P1 and P2, each with a private cache, connected by an interconnection network to main memory; location x initially holds 1000]
4 The Cache Coherence Problem
1. P1: ld r2, x (P1's cache now holds 1000)
2. P2: ld r2, x (both caches now hold 1000)
3. P1: add r1, r2, r4; st x, r1 (P1's cache holds 2000; P2's copy and main memory still hold 1000)
4. P2: ld r5, x (should NOT load the stale 1000)
8 Cache Coherence: Whose Responsibility?
Software
Can the programmer ensure coherence if caches are invisible to software?
What if the ISA provided a cache-flush instruction?
What needs to be flushed (what lines, and which caches)? When does this need to be done?
Hardware
Simplifies software’s job
Knows sharer set
Doesn’t need to be conservative in synchronization
9 Coherence: Guarantees
Writes to location A by P0 should be seen by P1 (eventually), and all writes to A should appear in some order
Coherence needs to provide:
Write propagation: guarantee that updates will propagate
Write serialization: provide a consistent global order seen by all processors
Need a global point of serialization for this store ordering
Ordering between writes to different locations is a memory consistency model problem: separate issue
10 Coherence: Update vs. Invalidate
How can we safely update replicated data?
Option 1: push updates to all copies
Option 2: ensure there is only one copy (local), update it
On a Read:
If local copy isn’t valid, put out request
(If another node has a copy, it returns it, otherwise memory does)
11 Coherence: Update vs. Invalidate
On a Write:
Read block into cache as before
Update protocol:
Write to block, and simultaneously broadcast written data to sharers
(Other nodes update their caches if data was present)
Invalidate protocol:
Write to block, and simultaneously broadcast invalidation of address to sharers
(Other nodes clear the block from their caches)
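The two write policies above can be contrasted in a small sketch (the cache structure and names are invented for illustration):

```python
# Hypothetical sketch: contrasting update and invalidate broadcasts on a
# write. Each cache maps address -> value; a missing key models an
# invalidated or absent line.

class Cache:
    def __init__(self):
        self.lines = {}  # addr -> value

def write(caches, writer, addr, value, protocol):
    """Writer updates its own copy, then broadcasts to the other caches."""
    caches[writer].lines[addr] = value
    for i, c in enumerate(caches):
        if i == writer or addr not in c.lines:
            continue  # only current sharers react to the broadcast
        if protocol == "update":
            c.lines[addr] = value      # push the new data to the sharer
        else:  # "invalidate"
            del c.lines[addr]          # sharer drops its copy

caches = [Cache(), Cache(), Cache()]
for c in caches:
    c.lines[0x40] = 1000               # all three share block 0x40

write(caches, 0, 0x40, 2000, "update")
assert all(c.lines[0x40] == 2000 for c in caches)   # every sharer updated

for c in caches:
    c.lines[0x40] = 1000
write(caches, 0, 0x40, 2000, "invalidate")
assert caches[0].lines[0x40] == 2000                # writer keeps new copy
assert 0x40 not in caches[1].lines                  # sharers invalidated
```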
12 Update vs. Invalidate
Which do we want?
Write frequency and sharing behavior are critical
Update:
+ If the sharer set is constant and updates are infrequent, avoids the cost of the invalidate-reacquire (broadcast update) pattern
- If data is rewritten without intervening reads by other cores, updates were useless
- Write-through cache policy: the bus becomes a bottleneck
Invalidate:
+ After an invalidation broadcast, the core has exclusive access rights
+ Only cores that keep reading after each write retain a copy
- If write contention is high, leads to ping-ponging (rapid mutual invalidation and reacquisition)
13 Cache Coherence Methods
How do we ensure that the proper caches are updated?
Snoopy Bus [Goodman83, Papamarcos84]
Bus-based, single point of serialization
Processors observe other processors' actions and infer ownership
E.g.: P1 makes a "read-exclusive" request for A on the bus; P0 sees this and invalidates its own copy of A
Directory [Censier78, Lenoski92, Laudon97]
single point of serialization per block, distributed among nodes
Processors make explicit requests for blocks
Directory tracks ownership (sharer set) for each block
Directory coordinates invalidation appropriately
E.g.: P1 asks the directory for an exclusive copy; the directory asks P0 to invalidate, waits for the ACK, then responds to P1
14 Snoopy Bus vs. Directory Coherence
Snoopy:
+ Critical path is short: miss → bus transaction → memory
+ Global serialization is easy: the bus provides this already (arbitration)
+ Simple: adapt bus-based uniprocessors easily
- Requires a single point of serialization (the bus): not scalable
(not quite true that snoopy needs bus: recent work on this later)
Directory:
- Requires extra storage space to track sharer sets (can be approximate: false positives are OK)
- Adds indirection to the critical path: request → directory → memory
- Protocols and race conditions are more complex
+ Exactly as scalable as the interconnect and directory storage (much more scalable than a bus)
15 Snoopy-Bus Coherence
16 Snoopy Cache Coherence
Caches “snoop” (observe) each other’s write/read operations
A simple protocol:
A simple protocol: write-through, no-write-allocate cache with two states per block:
Valid: PrRd/--; PrWr/BusWr (write-through); observing another cache's BusWr → Invalid
Invalid: PrRd/BusRd → Valid; PrWr/BusWr (stays Invalid: no write-allocate)
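The two-state protocol above can be written out as a transition function; this is a minimal sketch, assuming only the states and bus actions named on the slide:

```python
# Sketch of the two-state write-through, no-write-allocate snoopy
# protocol. Returns (next_state, bus_action) for one cache block.

def vi_transition(state, event):
    if state == "V":
        if event == "PrRd":  return ("V", None)      # hit, no bus traffic
        if event == "PrWr":  return ("V", "BusWr")   # write-through
        if event == "BusWr": return ("I", None)      # another core wrote: invalidate
        if event == "BusRd": return ("V", None)      # other reads are harmless
    else:  # "I"
        if event == "PrRd":  return ("V", "BusRd")   # read miss: fetch block
        if event == "PrWr":  return ("I", "BusWr")   # no-write-allocate
        return ("I", None)                           # snooped traffic: ignore

assert vi_transition("V", "BusWr") == ("I", None)
assert vi_transition("I", "PrWr") == ("I", "BusWr")
```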
17 Snoopy Invalidation Protocol: MSI
Extend single valid bit per block to three states:
M(odified): cache line is only copy and is dirty
S(hared): cache line is one of several copies
I(nvalid): not present
Read miss makes a Read request on the bus, installs the block in S state
Write miss makes a ReadEx request, installs the block in M state
When a processor snoops a ReadEx from another writer, it must invalidate its own copy (if any)
S → M upgrade can be made without re-reading data from memory (via Invl)
18 MSI State Machine
State transitions (Observed Event / Action) [Culler/Singh96]:
M: PrRd/--; PrWr/--; BusRd/Flush → S; BusRdX/Flush → I
S: PrRd/--; PrWr/BusRdX → M; BusRd/--; BusRdX/-- → I
I: PrRd/BusRd → S; PrWr/BusRdX → M
19 MSI Correctness
Write propagation
Immediately after write:
New value will exist only in writer’s cache
Block will be in M state
Transition into M state ensures all other caches are in I state
Upon read in another thread, that cache will miss → BusRd
BusRd causes a flush (writeback), and the read will see the new value
Write serialization:
Only one cache can be in M state at a time
Entering M state generates BusRdX
BusRdX causes transition out of M state in other caches
Order of block ownership (M state) defines write ordering
This order is global by virtue of central bus
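The MSI transitions and the correctness argument above can be exercised with a small simulator (the table-driven structure is an illustration, not a hardware description):

```python
# Sketch of the MSI protocol: each call applies a processor-side event at
# one cache, then snoops the induced bus transaction at every other cache.

MSI_PROC = {  # (state, proc_event) -> (next_state, bus_transaction)
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}
MSI_SNOOP = {  # (state, bus_transaction) -> (next_state, action)
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdX"): ("I", None),
}

def access(states, who, event):
    states[who], bus = MSI_PROC[(states[who], event)]
    if bus:  # the central bus serializes: all others snoop this transaction
        for i in range(len(states)):
            if i != who:
                states[i], _ = MSI_SNOOP[(states[i], bus)]

states = ["I", "I"]        # one block's state in P0 and P1
access(states, 0, "PrRd")  # P0 reads: installs in S
access(states, 1, "PrRd")  # P1 reads: both S
access(states, 0, "PrWr")  # P0 writes: BusRdX invalidates P1
assert states == ["M", "I"]       # only one cache can be in M
access(states, 1, "PrRd")  # P1 reads: P0 flushes and downgrades
assert states == ["S", "S"]       # P1 sees the new (flushed) value
```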
20 More States: MESI
Invalid → Shared → Modified sequence takes two bus ops
What if data is not shared? Unnecessary broadcasts
Exclusive state: this is the only copy, and it is clean
Block is exclusive if, during BusRd, no other cache had it
Wired-OR "shared" signal on the bus can determine this: snooping caches assert the signal if they also have a copy
Another BusRd also causes a transition into Shared
Silent transition Exclusive → Modified is possible on write!
MESI is also called the Illinois protocol [Papamarcos84]
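The E-state optimization can be sketched as follows, assuming a wired-OR shared signal sampled on the read miss (the function names are invented):

```python
# Sketch of the MESI refinement: a read miss with no other sharers
# installs the block in E, and the later E -> M upgrade is silent
# (no bus transaction), unlike the S -> M upgrade in plain MSI.

def mesi_read_miss(other_states):
    """State installed on a PrRd miss, given the block's state elsewhere.
    Models the wired-OR "shared" signal on the bus."""
    shared = any(s in ("M", "E", "S") for s in other_states)
    return "S" if shared else "E"

def mesi_write(state):
    """Returns (next_state, bus_transaction) for a PrWr in `state`."""
    if state in ("M", "E"):
        return ("M", None)        # E -> M is silent: we hold the only copy
    return ("M", "BusRdX")        # S or I must broadcast first

assert mesi_read_miss(["I", "I"]) == "E"   # no sharers: exclusive, clean
assert mesi_write("E") == ("M", None)      # no broadcast needed
assert mesi_write("S") == ("M", "BusRdX")  # MSI-style upgrade
```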
State transitions (Observed Event / Action) [Culler/Singh96]:
M: PrRd/--; PrWr/--; BusRd/Flush → S; BusRdX/Flush → I
E: PrRd/--; PrWr/-- → M (silent); BusRd/$ Transfer → S; BusRdX/$ Transfer → I
S: PrRd/--; PrWr/BusRdX → M; BusRd/--; BusRdX/-- → I
I: PrRd (shared line not asserted)/BusRd → E; PrRd (shared line asserted)/BusRd → S; PrWr/BusRdX → M
22 Snoopy Invalidation Tradeoffs
Should a downgrade from M go to S or I?
S: if data is likely to be reused
I: if data is likely to be migratory
Cache-to-cache transfer: on a BusRd, should data come from another cache or from memory?
Another cache may be faster, if memory is slow or highly contended
Memory is simpler: memory doesn't need to wait to see if a cache has the data first, and there is less contention at the other caches; but this requires writeback on an M downgrade
Writeback on Modified → Shared: necessary?
One possibility: Owner (O) state (MOESI system)
One cache owns the latest data (memory is not updated)
Memory writeback happens when all caches evict their copies
23 Update-Based Coherence: Dragon
Four states:
(E)xclusive: Only copy, clean
(Sm): shared, modified (Owner state in MOESI)
(Sc): shared, clean (with respect to Sm, not memory)
(M)odified: only copy, dirty
Use of updates allows multiple copies of dirty data
No I state: there is no invalidate (only ordinary evictions)
Invariant: at most one Sm in a sharer set
If a cache is Sm, it is authoritative, and Sc caches are clean relative to this, not clean to memory
McCreight, E. “The Dragon computer system: an early overview.” Tech Report, Xerox Corp., Sept 1984. (cited in Culler/Singh96)
24 Dragon State Machine
State transitions (Observed Event / Action) [Culler/Singh96]:
Miss: PrRdMiss (shared line not asserted)/BusRd → E; PrRdMiss (shared)/BusRd → Sc
Miss: PrWrMiss (shared line not asserted)/BusRd → M; PrWrMiss (shared)/BusRd; BusUpd → Sm
E: PrRd/--; PrWr/-- → M; BusRd/-- → Sc
Sc: PrRd/--; BusUpd/Update; PrWr/BusUpd (shared) → Sm; PrWr/BusUpd (not shared) → M
Sm: PrRd/--; PrWr/BusUpd (shared) stays Sm; PrWr/BusUpd (not shared) → M; BusRd/Flush; BusUpd/Update → Sc
M: PrRd/--; PrWr/--; BusRd/Flush → Sm
25 Update-Protocol Tradeoffs
Shared-Modified state vs. keeping memory up-to-date
Equivalent to write-through cache when there are multiple sharers
Immediate vs. lazy notification of Sc-block eviction
Immediate: if there is only one sharer left, it can go to E or M and save a BusUpd later (only one saved)
Lazy: no extra traffic required upon eviction
26 Directory-Based Coherence
27 Directory-Based Protocols
Required when scaling past the capacity of a single bus
Distributed, but:
Coherence still requires single point of serialization (for write serialization)
This can be different for every block (striped across nodes)
We can reason about the protocol for a single block: one server (directory node), many clients (private caches)
Directory receives Read and ReadEx requests, and sends Invl requests: invalidation is explicit (as opposed to snoopy buses)
28 Directory: Data Structures
Address  Directory entry
0x00     Shared: {P0, P1, P2}
0x04     ---
0x08     Exclusive: P2
0x0C     ---
…        ---
Key operation to support is set inclusion test
False positives are OK: want to know which caches may contain a copy of a block, and spurious invals are ignored
False positive rate determines performance
Most accurate (and expensive): full bit-vector
Compressed representation, linked list, Bloom filter [Zebchuk09] are all possible
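A sketch contrasting a full bit-vector with a coarse (grouped) vector; the class names and group size are invented, and the coarse scheme illustrates why false positives are tolerable while false negatives are not:

```python
# Sketch: full bit-vector vs. coarse sharer tracking. The coarse vector
# keeps one bit per group of processors, so it may report false positives
# (spurious invalidates, which are ignored) but never false negatives.

GROUP = 4  # processors per coarse-vector bit (invented for illustration)

class FullVector:
    def __init__(self): self.bits = 0
    def add(self, p): self.bits |= 1 << p
    def may_share(self, p): return bool(self.bits & (1 << p))

class CoarseVector:
    def __init__(self): self.bits = 0
    def add(self, p): self.bits |= 1 << (p // GROUP)
    def may_share(self, p): return bool(self.bits & (1 << (p // GROUP)))

full, coarse = FullVector(), CoarseVector()
for p in (0, 5):            # P0 and P5 actually cache the block
    full.add(p); coarse.add(p)

assert full.may_share(5) and coarse.may_share(5)  # true sharers found
assert not full.may_share(6)
assert coarse.may_share(6)  # false positive: P6 shares a group with P5
```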
Here, we will assume the directory has perfect knowledge
29 Directory: Basic Operations
Follow semantics of snoop-based system
but with explicit request, reply messages
Directory:
Receives Read, ReadEx, Upgrade requests from nodes
Sends Inval/Downgrade messages to sharers if needed
Forwards request to memory if needed
Replies to requestor and updates sharing state
Protocol design is flexible
Exact forwarding paths depend on implementation
For example, do cache-to-cache transfer?
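The directory's handling of Read and ReadEx can be sketched as follows (the message names follow the slides; the code structure is invented, and data movement is elided):

```python
# Sketch of a directory serving Read and ReadEx requests for one block.
# Messages are recorded rather than sent; replies and sharer-set updates
# follow the semantics described above.

class Directory:
    def __init__(self):
        self.state, self.sharers = "Uncached", set()
        self.messages = []   # (message, destination) pairs "sent"

    def read(self, requestor):
        if self.state == "Exclusive":
            owner = next(iter(self.sharers))
            self.messages.append(("Downgrade", owner))  # recall dirty data
        self.state = "Shared"
        self.sharers.add(requestor)
        self.messages.append(("DatShr", requestor))

    def read_ex(self, requestor):
        for s in self.sharers - {requestor}:
            self.messages.append(("Invl", s))           # explicit invalidates
        self.state, self.sharers = "Exclusive", {requestor}
        self.messages.append(("DatEx", requestor))

d = Directory()
d.read("P0"); d.read("P1")   # two readers: block becomes Shared {P0, P1}
d.read_ex("P2")              # P2 wants to write: invalidate both sharers
assert ("Invl", "P0") in d.messages and ("Invl", "P1") in d.messages
assert d.state == "Exclusive" and d.sharers == {"P2"}
```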
30 MESI Directory Transaction: Read
P0 acquires an address for reading:
1. P0 sends Read to the Home (directory) node
2. Home replies with DatEx (or DatShr if the block is already shared)
Culler/Singh Fig. 8.16
31 RdEx with Former Owner
1. P0 sends RdEx to Home
2. Home sends Invl to the former Owner
3a. Owner sends Rev (revision/writeback) to Home
3b. Owner sends DatEx to P0
32 Contention Resolution (for Write)
1a. P0 sends RdEx to Home; 1b. P1 sends RdEx to Home
2a. Home replies DatEx to P0; 2b. Home NACKs P1
3. P1 retries RdEx
4. Home sends Invl to P0
5a. P0 sends Rev to Home
5b. Home sends DatEx to P1
33 Issues with Contention Resolution
Need to escape race conditions by:
NACKing requests to busy (pending invalidate) entries
Original requestor retries
OR, queuing requests and granting in sequence
(Or some combination thereof)
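The NACK-and-retry option can be sketched as a simple loop (entirely illustrative; `directory_busy_for` stands in for how long the entry stays pending):

```python
# Sketch of the NACK-and-retry escape from races: a request hitting a
# busy directory entry (invalidation pending) is NACKed, and the
# requestor retries until the entry is free.

def request_with_retry(directory_busy_for, max_tries=10):
    """directory_busy_for: number of attempts that will find a busy entry."""
    for attempt in range(1, max_tries + 1):
        if attempt > directory_busy_for:
            return ("DatEx", attempt)   # entry free: request granted
        # entry busy: directory NACKs, requestor retries
    return ("Fail", max_tries)          # bounded retries in this sketch

reply, attempts = request_with_retry(directory_busy_for=3)
assert reply == "DatEx" and attempts == 4   # granted on the 4th attempt
```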
Fairness
Which requestor should be preferred in a conflict?
Interconnect delivery order, and distance, both matter
We guarantee that some node will make forward progress
Ping-ponging is a higher-level issue
With solutions like combining trees (for locks/barriers) and better shared-data-structure design
34 Protocol and Directory Tradeoffs
Forwarding vs. strict request-reply
Increases complexity
Shortens the critical path by creating a chain of request forwarding
Speculative replies from memory/directory node
Decreases critical path length in best case
More complex implementation (and potentially more network traffic)
Directory storage can imply protocol design
E.g., linked list for sharer set
1A. Hansson et al., "Avoiding message-dependent deadlock in network-based systems on chip," VLSI Design 2007.
35 Other Issues / Backup
36 Memory Consistency (Briefly)
We consider only sequential consistency [Lamport79] here
Sequential Consistency gives the appearance that:
All operations (R and W) happen atomically in a global order
Operations from a single thread occur in order in this stream
Proc 0: A = 1;
Proc 1: while (A == 0) ; B = 1;
Proc 2: while (B == 0) ; print A
Must the print show A = 1?
Thus, ordering between different mem locations exists
More relaxed models exist; usually require memory barriers when synchronizing
37 Correctness Issue: Inclusion
What happens with multilevel caches?
Snooping level (say, L2) must know about all data in private hierarchy above it (L1)
What about directories?
Inclusive cache is one solution [Baer88]
L2 must contain all data that L1 contains
Must propagate invalidates upward to L1
Other options:
Non-inclusive: inclusion property is optional
Why would L2 evict if it has more sets and more ways than L1?
Prefetching!
Exclusive: line is in L1 xor L2 (AMD K7)
38