18-742 Parallel Lecture 5: Coherence

Chris Craik (TA), Carnegie Mellon University

Readings: Coherence

 Required for Review

 Papamarcos and Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” ISCA 1984.

 Kelm et al, “Cohesion: A Hybrid Memory Model for Accelerators”, ISCA 2010.

 Required

 Censier and Feautrier, “A new solution to coherence problems in multicache systems,” IEEE Trans. Comput., 1978.

 Goodman, “Using cache memory to reduce processor-memory traffic,” ISCA 1983.

 Laudon and Lenoski, “The SGI Origin: a ccNUMA highly scalable server,” ISCA 1997.

 Lenoski et al, “The Stanford DASH Multiprocessor,” IEEE Computer, 25(3):63-79, 1992.

 Martin et al, “Token coherence: decoupling performance and correctness,” ISCA 2003.

 Recommended

 Baer and Wang, “On the inclusion properties for multi-level cache hierarchies,” ISCA 1988.

 Lamport, “How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs”, IEEE Trans. Comput., Sept 1979, pp 690-691.

 Culler and Singh, Parallel Computer Architecture, Chapters 5 and 8.

2 Model

 Many parallel programs communicate through shared memory

 Proc 0 writes to an address, followed by Proc 1 reading

 This implies communication between the two

Proc 0:  Mem[A] = 1
Proc 1:  …  Print Mem[A]

 Each read should receive the value last written by anyone

 This requires synchronization (what does “last written” mean?)

 What if Mem[A] is cached (at either end)?
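To make the model concrete, here is a minimal C11-threads sketch of the Proc 0 / Proc 1 exchange above. The flag variable (ready) and the function names are illustrative additions, not from the slides; the atomic flag stands in for the synchronization that defines “last written”.

#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>

int shared_a = 0;                 /* Mem[A] in the slide                          */
atomic_int ready = 0;             /* synchronization flag: defines "last written" */

int proc0(void *arg) {
    shared_a = 1;                                            /* Mem[A] = 1        */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish the write */
    return 0;
}

int proc1(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* wait for Proc 0   */
    printf("%d\n", shared_a);                                /* Print Mem[A] -> 1 */
    return 0;
}

int main(void) {
    thrd_t t0, t1;
    thrd_create(&t0, proc0, NULL);
    thrd_create(&t1, proc1, NULL);
    thrd_join(t0, NULL);
    thrd_join(t1, NULL);
    return 0;
}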

3 Cache Coherence

 Basic question: If multiple processors cache the same block, how do they ensure they all see a consistent state?

[Figure: P1 and P2, each with a private cache, connected through an interconnection network to main memory, which holds x = 1000]

4 The Cache Coherence Problem

[Figure: P1 executes ld r2, x; its cache now holds x = 1000, matching main memory]

5 The Cache Coherence Problem

[Figure: P2 also executes ld r2, x; both caches now hold x = 1000]

6 The Cache Coherence Problem

[Figure: P1 executes add r1, r2, r4 and st x, r1; P1's cache now holds x = 2000, while P2's cache and main memory still hold 1000]

7 The Cache Coherence Problem

[Figure: P2 now executes ld r5, x; it should NOT load the stale value 1000, yet its own cache and main memory both still hold 1000]

8 Cache Coherence: Whose Responsibility?

 Software

 Can the programmer ensure coherence if caches are invisible to software?

 What if the ISA provided a cache-flush instruction?

 What needs to be flushed (which lines, and which caches)?
 When does this need to be done?

 Hardware

 Simplifies software’s job

 Knows sharer set

 Doesn’t need to be conservative in synchronization

9 Coherence: Guarantees

 Writes to location A by P0 should be seen by P1 (eventually), and all writes to A should appear in some order

 Coherence needs to provide:

 Write propagation: guarantee that updates will propagate

 Write serialization: provide a consistent global order seen by all processors
 Need a global point of serialization for this store ordering

 Ordering between writes to different locations is a memory consistency problem: a separate issue

10 Coherence: Update vs. Invalidate

 How can we safely update replicated data?

 Option 1: push updates to all copies

 Option 2: ensure there is only one copy (local), update it

 On a Read:

 If local copy isn’t valid, put out request

 (If another node has a copy, it returns it, otherwise memory does)

11 Coherence: Update vs. Invalidate

 On a Write:

 Read block into cache as before
 Update Protocol:

 Write to block, and simultaneously broadcast written data to sharers

 (Other nodes update their caches if data was present)
 Invalidate Protocol:

 Write to block, and simultaneously broadcast invalidation of address to sharers

 (Other nodes clear block from cache)

12 Update vs. Invalidate

 Which do we want?

 Write frequency and sharing behavior are critical
 Update
 + If the sharer set is constant and updates are infrequent, avoids the cost of the invalidate-reacquire pattern (updates are broadcast instead)
 - If data is rewritten without intervening reads by other cores, the updates were useless
 - Write-through cache policy → bus becomes a bottleneck

 Invalidate
 + After the invalidation broadcast, the writing core has exclusive access rights
 + Only cores that keep reading after each write retain a copy
 - If write contention is high, leads to ping-ponging (rapid mutual invalidation and reacquisition)

13 Cache Coherence Methods

 How do we ensure that the proper caches are updated?

 Snoopy Bus [Goodman83, Papamarcos84]

 Bus-based, single point of serialization

 Processors observe other processors’ actions and infer ownership
 E.g.: P1 makes a “read-exclusive” request for A on the bus; P0 sees this and invalidates its own copy of A

 Directory [Censier78, Lenoski92, Laudon97]

 Single point of serialization per block, distributed among nodes

 Processors make explicit requests for blocks

 Directory tracks ownership (sharer set) for each block

 Directory coordinates invalidation appropriately
 E.g.: P1 asks the directory for an exclusive copy; the directory asks P0 to invalidate, waits for the ACK, then responds to P1

14 Snoopy Bus vs. Directory Coherence

 Snoopy
 + Critical path is short: miss → bus transaction to memory
 + Global serialization is easy: the bus already provides this (arbitration)
 + Simple: adapt bus-based uniprocessors easily
 - Requires a single point of serialization (the bus): not scalable

 (not quite true that snoopy needs bus: recent work on this later)

 Directory
 - Requires extra storage space to track sharer sets
 Can be approximate (false positives are OK)
 - Adds indirection to the critical path: request → directory → memory
 - Protocols and race conditions are more complex
 + Exactly as scalable as the interconnect and directory storage (much more scalable than a bus)

15 Snoopy-Bus Coherence

16 Snoopy Cache Coherence

 Caches “snoop” (observe) each other’s write/read operations

 A simple protocol:

 Write-through, no-write-allocate cache
 Actions: PrRd, PrWr, BusRd, BusWr

[State diagram: two states, Valid and Invalid.
Valid: PrRd/--, PrWr/BusWr (stay Valid); snooped BusWr/-- → Invalid.
Invalid: PrRd/BusRd → Valid; PrWr/BusWr (stays Invalid, no write-allocate).]
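A minimal C sketch of this two-state protocol, assuming hypothetical bus_read()/bus_write() hooks for the bus transactions (these names are not from the readings):

typedef enum { INVALID, VALID } vi_state_t;

void bus_read(void);    /* assumed hooks that issue BusRd / BusWr transactions */
void bus_write(void);

/* Processor-side events (PrRd / PrWr) */
vi_state_t vi_processor(vi_state_t s, int is_write) {
    if (is_write) {                 /* PrWr / BusWr: write-through, no write-allocate */
        bus_write();
        return s;                   /* Valid stays Valid, Invalid stays Invalid       */
    }
    if (s == INVALID)               /* PrRd / BusRd on a miss                         */
        bus_read();
    return VALID;
}

/* Snoop-side event: another cache's BusWr invalidates our copy */
vi_state_t vi_snoop_buswrite(vi_state_t s) {
    (void)s;
    return INVALID;
}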

17 Snoopy Invalidation Protocol: MSI

 Extend single valid bit per block to three states:

 M(odified): cache line is only copy and is dirty

 S(hared): cache line is one of several copies

 I(nvalid): not present

 Read miss makes a Read request on the bus, saves the block in S state
 Write miss makes a ReadEx request, saves the block in M state

 When a processor snoops a ReadEx from another writer, it must invalidate its own copy (if any)
 S→M upgrade can be made without re-reading data from memory (via Invl)
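The processor-side and snoop-side behavior described above can be sketched in C roughly as follows. The enum and function names are mine, and the S→M upgrade is shown here as a separate Invl transaction even though the state diagram on the next slide draws it as BusRdX:

typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_INVL } bus_op_t;

/* Processor-side events: returns the new state; *op is the bus transaction to issue. */
msi_state_t msi_processor(msi_state_t s, int is_write, bus_op_t *op) {
    *op = BUS_NONE;
    if (!is_write) {                                    /* PrRd                         */
        if (s == MSI_I) { *op = BUS_RD; return MSI_S; } /* read miss -> BusRd -> S      */
        return s;                                       /* hit in S or M                */
    }
    if (s == MSI_M) return MSI_M;                       /* PrWr hit in M: no bus op     */
    if (s == MSI_S) { *op = BUS_INVL; return MSI_M; }   /* S -> M upgrade, no data read */
    *op = BUS_RDX; return MSI_M;                        /* write miss -> ReadEx -> M    */
}

/* Snoop-side events: react to another cache's bus transaction. */
msi_state_t msi_snoop(msi_state_t s, bus_op_t seen, int *must_flush) {
    *must_flush = (s == MSI_M && (seen == BUS_RD || seen == BUS_RDX)); /* write back dirty data */
    if (seen == BUS_RDX || seen == BUS_INVL) return MSI_I;  /* another writer: invalidate */
    if (seen == BUS_RD && s == MSI_M)        return MSI_S;  /* downgrade after flush      */
    return s;
}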

18 MSI State Machine

[MSI state diagram (ObservedEvent/Action), after Culler/Singh96:
M: PrRd/--, PrWr/-- (stay in M); BusRd/Flush → S; BusRdX/Flush → I
S: PrRd/--, BusRd/-- (stay in S); PrWr/BusRdX → M; BusRdX/-- → I
I: PrRd/BusRd → S; PrWr/BusRdX → M]

19 MSI Correctness

 Write propagation

 Immediately after write:

 New value will exist only in writer’s cache

 Block will be in M state

 Transition into M state ensures all other caches are in I state

 Upon a read in another cache, that cache will miss → BusRd

 BusRd causes a flush (writeback), and the read will see the new value

 Write serialization

 Only one cache can be in M state at a time

 Entering M state generates BusRdX

 BusRdX causes transition out of M state in other caches

 Order of block ownership (M state) defines write ordering

 This order is global by virtue of central bus

20 More States: MESI

 Invalid→Shared→Modified sequence takes two bus ops

 What if data is not shared? Unnecessary broadcasts

 Exclusive state: this is the only copy, and it is clean

 Block is exclusive if, during BusRd, no other cache had it

 Wired-OR “shared” signal on the bus can determine this: snooping caches assert the signal if they also have a copy
 Another BusRd also causes a transition into Shared

 Silent transition Exclusive→Modified is possible on write!

 MESI is also called the Illinois protocol [Papamarcos84]
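A small C sketch of what the E state adds, assuming a wired-OR shared-signal input sampled during the BusRd (all names here are illustrative):

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state_t;

/* Read miss: issue BusRd; if no other cache asserts the shared signal,
 * install the block in E (exclusive, clean) instead of S.               */
mesi_state_t mesi_read_miss(int shared_wire_asserted) {
    return shared_wire_asserted ? MESI_S : MESI_E;
}

/* Write hit: only S needs a bus transaction (BusRdX/upgrade);
 * E -> M is the silent transition, and M stays M.                       */
mesi_state_t mesi_write_hit(mesi_state_t s, int *needs_bus_rdx) {
    *needs_bus_rdx = (s == MESI_S);
    return MESI_M;
}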

21 MESI State Machine

[MESI state diagram, after Culler/Singh96. PrRd(S): read miss with the shared signal asserted; PrRd(S'): read miss with it deasserted.
M: PrRd/--, PrWr/-- (stay in M); BusRd/Flush → S
E: PrRd/-- (stay in E); PrWr/-- → M (silent); BusRd/$ Transfer → S
S: PrRd/--, BusRd/-- (stay in S); PrWr/BusRdX → M
I: PrRd(S')/BusRd → E; PrRd(S)/BusRd → S; PrWr/BusRdX → M
BusRdX (all incoming) → I, flushing data as needed]

22 Snoopy Invalidation Tradeoffs

 Should a downgrade from M go to S or I?
 S: if data is likely to be reused
 I: if data is likely to be migratory
 Cache-to-cache transfer: on a BusRd, should data come from another cache, or from memory?
 Another cache: may be faster, if memory is slow or highly contended
 Memory: simpler, since memory doesn’t need to wait to see if a cache has the data first; less contention at the other caches; but requires writeback on M downgrade
 Writeback on Modified→Shared: necessary?
 One possibility: Owner (O) state (MOESI system)
 One cache owns the latest data (memory is not updated)
 Memory writeback happens when all caches evict their copies

23 Update-Based Coherence: Dragon

 Four states:

 (E)xclusive: Only copy, clean

 (Sm): shared, modified (Owner state in MOESI)

 (Sc): shared, clean (with respect to Sm, not memory)

 (M)odified: only copy, dirty
 Use of updates allows multiple copies of dirty data

 No I state: there is no invalidate (only ordinary evictions)

 Invariant: at most one Sm in a sharer set

 If a cache is Sm, it is authoritative, and Sc caches are clean relative to this, not clean to memory

McCreight, E. “The Dragon computer system: an early overview.” Tech Report, Xerox Corp., Sept 1984. (cited in Culler/Singh96)
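As a rough C sketch of the write-hit behavior implied above (names are mine; the full transitions are on the next slide): a write in a shared state broadcasts a BusUpd, and the shared wire then tells the writer whether any sharers remain, so it can drop to an exclusive state.

typedef enum { DR_E, DR_SC, DR_SM, DR_M } dragon_state_t;

dragon_state_t dragon_write_hit(dragon_state_t s, int *send_bus_upd,
                                int shared_wire /* asserted by other sharers */) {
    if (s == DR_E || s == DR_M) {         /* only copy: silent transition to M    */
        *send_bus_upd = 0;
        return DR_M;
    }
    *send_bus_upd = 1;                    /* Sc or Sm: push the update to sharers */
    return shared_wire ? DR_SM : DR_M;    /* no sharers left -> exclusive M       */
}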

24 Dragon State Machine

[Dragon state diagram, after Culler/Singh96 (S = shared signal asserted, S' = not asserted):
PrRdMiss/BusRd(S') → E;  PrRdMiss/BusRd(S) → Sc
PrWrMiss/BusRd(S') → M;  PrWrMiss/(BusRd(S); BusUpd) → Sm
E:  PrRd/--;  PrWr/-- → M;  BusRd/-- → Sc
Sc: PrRd/--, BusRd/--, BusUpd/Update;  PrWr/BusUpd(S) → Sm;  PrWr/BusUpd(S') → M
Sm: PrRd/--, BusRd/Flush;  BusUpd/Update → Sc;  PrWr/BusUpd(S) (stay);  PrWr/BusUpd(S') → M
M:  PrRd/--, PrWr/--;  BusRd/Flush → Sm]

25 Update-Protocol Tradeoffs

 Shared-Modified state vs. keeping memory up-to-date

 Equivalent to write-through cache when there are multiple sharers

 Immediate vs. lazy notification of Sc-block eviction

 Immediate: if there is only one sharer left, it can go to E or M and save a BusUpd later (only one saved)

 Lazy: no extra traffic required upon eviction

26 Directory-Based Coherence

27 Directory-Based Protocols

 Required when scaling past the capacity of a single bus

 Distributed, but:

 Coherence still requires single point of serialization (for write serialization)

 This can be different for every block (striped across nodes)

 We can reason about the protocol for a single block: one server (directory node), many clients (private caches)

 Directory receives Read and ReadEx requests, and sends Invl requests: invalidation is explicit (as opposed to snoopy buses)

28 Directory: Data Structures

0x00  Shared: {P0, P1, P2}
0x04  ---
0x08  Exclusive: P2
0x0C  ---
…     ---

 Key operation to support is set inclusion test

 False positives are OK: want to know which caches may contain a copy of a block, and spurious invals are ignored

 False positive rate determines performance
 Most accurate (and expensive): full bit-vector

 Compressed representation, linked list, Bloom filter [Zebchuk09] are all possible

 Here, we will assume the directory has perfect knowledge

29 Directory: Basic Operations

 Follow semantics of snoop-based system

 but with explicit request, reply messages

 Directory:

 Receives Read, ReadEx, Upgrade requests from nodes

 Sends Inval/Downgrade messages to sharers if needed

 Forwards request to memory if needed

 Replies to requestor and updates sharing state

 Protocol design is flexible

 Exact forwarding paths depend on implementation

 For example, do cache-to-cache transfer?
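The sketch below combines the full-bit-vector sharer set from the previous slide with the ReadEx handling described here. The send_invl/send_datex helpers and NUM_NODES are assumptions for illustration, not a real API, and a real protocol would wait for ACKs/revision messages before replying:

#include <stdint.h>
#include <stdbool.h>

#define NUM_NODES 64

typedef struct {
    bool     exclusive;        /* one owner holds the block in M/E            */
    int      owner;            /* valid only when exclusive                   */
    uint64_t sharers;          /* bit i set => node i may hold a copy         */
} dir_entry_t;

void send_invl(int node, uint64_t addr);    /* assumed network primitives */
void send_datex(int node, uint64_t addr);

/* Node 'req' asks for an exclusive (writable) copy of 'addr'. */
void dir_read_ex(dir_entry_t *e, uint64_t addr, int req) {
    if (e->exclusive) {
        send_invl(e->owner, addr);          /* former owner must flush and invalidate */
    } else {
        for (int i = 0; i < NUM_NODES; i++)          /* spurious invals are OK:    */
            if (((e->sharers >> i) & 1) && i != req) /* the set may be approximate */
                send_invl(i, addr);
    }
    /* (A real protocol waits for ACKs / revision messages before this point.) */
    e->exclusive = true;
    e->owner     = req;
    e->sharers   = (uint64_t)1 << req;
    send_datex(req, addr);
}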

30 MESI Directory Transaction: Read

P0 acquires an address for reading:

[Figure (Culler/Singh Fig. 8.16): 1. P0 sends a Read request to the Home node; 2. Home replies with DatEx (or DatShr if other copies exist). P1 is not involved.]

31 RdEx with Former Owner

[Figure: 1. P0 sends RdEx to Home; 2. Home sends Invl to the former Owner; 3a. Owner sends a revision message (Rev) to Home; 3b. Owner sends DatEx to P0.]

32 Contention Resolution (for Write)

[Figure: 1a. P0 sends RdEx to Home; 1b. P1 sends RdEx to Home; 2a. Home replies DatEx to P0; 2b. Home sends NACK to P1; 3. P1 retries RdEx; 4. Home sends Invl to P0; 5a. P0 sends Rev to Home; 5b. DatEx is delivered to P1.]

33 Issues with Contention Resolution

 Need to escape race conditions by:

 NACKing requests to busy (pending invalidate) entries

 Original requestor retries

 OR, queuing requests and granting in sequence

 (Or some combination thereof)

 Fairness

 Which requestor should be preferred in a conflict?

 Interconnect delivery order and distance both matter
 We guarantee that some node will make forward progress
 Ping-ponging is a higher-level issue

 With solutions like combining trees (for locks/barriers) and better shared-data-structure design

34 Protocol and Directory Tradeoffs

 Forwarding vs. strict request-reply

 Increases complexity

 Shortens the critical path by creating a chain of request forwarding

 Speculative replies from memory/directory node

 Decreases critical path length in best case

 More complex implementation (and potentially more network traffic)

 Directory storage can imply protocol design

 E.g., linked list for sharer set

A. Hansson et al, “Avoiding message-dependent deadlock in network-based systems on chip,” VLSI Design 2007.

35 Other Issues / Backup

36 Memory Consistency (Briefly)

 We consider only sequential consistency [Lamport79] here

 Sequential Consistency gives the appearance that:

 All operations (R and W) happen atomically in a global order

 Operations from a single thread occur in order in this stream

Proc 0:  A = 1;
Proc 1:  while (A == 0) ;  B = 1;
Proc 2:  while (B == 0) ;  print A

Does Proc 2 print A = 1?

 Thus, ordering between different memory locations exists

 More relaxed models exist; usually require memory barriers when synchronizing
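A minimal sketch of the three-processor example in C11, assuming default (seq_cst) atomics, under which the sequentially consistent outcome holds and Proc 2 must print A = 1; with relaxed atomics and no barriers, printing 0 would become possible:

#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>

atomic_int A = 0, B = 0;

int proc0(void *x) { atomic_store(&A, 1); return 0; }
int proc1(void *x) { while (atomic_load(&A) == 0) ; atomic_store(&B, 1); return 0; }
int proc2(void *x) { while (atomic_load(&B) == 0) ; printf("A = %d\n", atomic_load(&A)); return 0; }

int main(void) {
    thrd_t t[3];
    thrd_create(&t[0], proc0, NULL);
    thrd_create(&t[1], proc1, NULL);
    thrd_create(&t[2], proc2, NULL);
    for (int i = 0; i < 3; i++) thrd_join(t[i], NULL);
    return 0;
}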

37 Correctness Issue: Inclusion

 What happens with multilevel caches?

 Snooping level (say, L2) must know about all data in private hierarchy above it (L1)

 What about directories?
 Inclusive cache is one solution [Baer88]

 L2 must contain all data that L1 contains

 Must propagate invalidates upward to L1
 Other options

 Non-inclusive: inclusion property is optional

 Why would L2 evict if it has more sets and more ways than L1?

 Prefetching!

 Exclusive: line is in L1 xor L2 (AMD K7)
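A tiny C sketch of why inclusion simplifies snooping in a two-level hierarchy; the l2_present/l2_invalidate/l1_invalidate_if_present helpers are assumed for illustration:

#include <stdint.h>
#include <stdbool.h>

bool l2_present(uint64_t addr);              /* assumed cache-lookup helpers */
void l2_invalidate(uint64_t addr);
void l1_invalidate_if_present(uint64_t addr);

/* L2 snoops the bus on behalf of the L1 above it. With inclusion (L1 contents
 * are a subset of L2), a miss in L2 means L1 cannot hold the block either, so
 * only L2 hits need to be propagated upward as invalidations.                */
void l2_snoop_invalidate(uint64_t addr) {
    if (!l2_present(addr))
        return;                               /* inclusion: L1 cannot have it */
    l2_invalidate(addr);
    l1_invalidate_if_present(addr);           /* propagate invalidate upward  */
}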
