18-742 Parallel Lecture 5: Coherence

Chris Craik (TA), Carnegie Mellon University

Readings: Coherence

 Required for Review

 Papamarcos and Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” ISCA 1984.

 Kelm et al, “Cohesion: A Hybrid Memory Model for Accelerators”, ISCA 2010.

 Required

 Censier and Feautrier, “A new solution to coherence problems in multicache systems,” IEEE Trans. Comput., 1978.

 Goodman, “Using cache memory to reduce processor-memory traffic,” ISCA 1983.

 Laudon and Lenoski, “The SGI Origin: a ccNUMA highly scalable server,” ISCA 1997.

 Lenoski et al, “The Stanford DASH Multiprocessor,” IEEE Computer, 25(3):63-79, 1992.

 Martin et al, “Token coherence: decoupling performance and correctness,” ISCA 2003.

 Recommended

 Baer and Wang, “On the inclusion properties for multi-level cache hierarchies,” ISCA 1988.

 Lamport, “How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs”, IEEE Trans. Comput., Sept 1979, pp 690-691.

 Culler and Singh, Parallel Computer Architecture, Chapters 5 and 8.

2 Model

 Many parallel programs communicate through shared memory

 Proc 0 writes to an address, followed by Proc 1 reading

 This implies communication between the two

Proc 0:  Mem[A] = 1
Proc 1:  …  Print Mem[A]

 Each read should receive the value last written by anyone

 This requires synchronization (what does “last written” mean?)

 What if Mem[A] is cached (at either end)?
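To make the model concrete, here is a minimal C11-threads sketch of the Proc 0 / Proc 1 exchange above. The flag variable (ready) and the function names are illustrative additions, not from the slides; the atomic flag stands in for the synchronization that defines “last written”.

#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>

int shared_a = 0;                 /* Mem[A] in the slide                          */
atomic_int ready = 0;             /* synchronization flag: defines "last written" */

int proc0(void *arg) {
    shared_a = 1;                                            /* Mem[A] = 1        */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish the write */
    return 0;
}

int proc1(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* wait for Proc 0   */
    printf("%d\n", shared_a);                                /* Print Mem[A] -> 1 */
    return 0;
}

int main(void) {
    thrd_t t0, t1;
    thrd_create(&t0, proc0, NULL);
    thrd_create(&t1, proc1, NULL);
    thrd_join(t0, NULL);
    thrd_join(t1, NULL);
    return 0;
}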

3 Cache Coherence

 Basic question: If multiple processors cache the same block, how do they ensure they all see a consistent state?

[Figure: P1 and P2, each with a private cache, connected through an interconnection network to main memory, which holds x = 1000]

4 The Cache Coherence Problem

[Figure: P1 executes ld r2, x; its cache now holds x = 1000, matching main memory]

5 The Cache Coherence Problem

[Figure: P2 also executes ld r2, x; both caches now hold x = 1000]

6 The Cache Coherence Problem

[Figure: P1 executes add r1, r2, r4 and st x, r1; P1's cache now holds x = 2000, while P2's cache and main memory still hold 1000]

7 The Cache Coherence Problem

[Figure: P2 now executes ld r5, x; it should NOT load the stale value 1000, yet its own cache and main memory both still hold 1000]

8 Cache Coherence: Whose Responsibility?

 Software

 Can the programmer ensure coherence if caches are invisible to software?

 What if the ISA provided a cache-flush instruction?

 What needs to be flushed (which lines, and which caches)?
 When does this need to be done?

 Hardware

 Simplifies software’s job

 Knows sharer set

 Doesn’t need to be conservative in synchronization

9 Coherence: Guarantees

 Writes to location A by P0 should be seen by P1 (eventually), and all writes to A should appear in some order

 Coherence needs to provide:

 Write propagation: guarantee that updates will propagate

 Write serialization: provide a consistent global order seen by all processors
 Need a global point of serialization for this store ordering

 Ordering between writes to different locations is a memory consistency problem: a separate issue

10 Coherence: Update vs. Invalidate

 How can we safely update replicated data?

 Option 1: push updates to all copies

 Option 2: ensure there is only one copy (local), update it

 On a Read:

 If local copy isn’t valid, put out request

 (If another node has a copy, it returns it, otherwise memory does)

11 Coherence: Update vs. Invalidate

 On a Write:

 Read block into cache as before
 Update Protocol:

 Write to block, and simultaneously broadcast written data to sharers

 (Other nodes update their caches if data was present)
 Invalidate Protocol:

 Write to block, and simultaneously broadcast invalidation of address to sharers

 (Other nodes clear block from cache)

12 Update vs. Invalidate

 Which do we want?

 Write frequency and sharing behavior are critical
 Update
 + If the sharer set is constant and updates are infrequent, avoids the cost of the invalidate-reacquire pattern (updates are broadcast instead)
 - If data is rewritten without intervening reads by other cores, the updates were useless
 - Write-through cache policy → bus becomes a bottleneck

 Invalidate
 + After the invalidation broadcast, the writing core has exclusive access rights
 + Only cores that keep reading after each write retain a copy
 - If write contention is high, leads to ping-ponging (rapid mutual invalidation and reacquisition)

13 Cache Coherence Methods

 How do we ensure that the proper caches are updated?

 Snoopy Bus [Goodman83, Papamarcos84]

 Bus-based, single point of serialization

 Processors observe other processors’ actions and infer ownership
 E.g.: P1 makes a “read-exclusive” request for A on the bus; P0 sees this and invalidates its own copy of A

 Directory [Censier78, Lenoski92, Laudon97]

 Single point of serialization per block, distributed among nodes

 Processors make explicit requests for blocks

 Directory tracks ownership (sharer set) for each block

 Directory coordinates invalidation appropriately
 E.g.: P1 asks the directory for an exclusive copy; the directory asks P0 to invalidate, waits for the ACK, then responds to P1

14 Snoopy Bus vs. Directory Coherence

 Snoopy
 + Critical path is short: miss → bus transaction to memory
 + Global serialization is easy: the bus already provides this (arbitration)
 + Simple: adapt bus-based uniprocessors easily
 - Requires a single point of serialization (the bus): not scalable

 (not quite true that snoopy needs bus: recent work on this later)

 Directory
 - Requires extra storage space to track sharer sets
 Can be approximate (false positives are OK)
 - Adds indirection to the critical path: request → directory → memory
 - Protocols and race conditions are more complex
 + Exactly as scalable as the interconnect and directory storage (much more scalable than a bus)

15 Snoopy-Bus Coherence

16 Snoopy Cache Coherence

 Caches “snoop” (observe) each other’s write/read operations

 A simple protocol:

 Write-through, no-write-allocate cache
 Actions: PrRd, PrWr, BusRd, BusWr

[State diagram: two states, Valid and Invalid.
Valid: PrRd/--, PrWr/BusWr (stay Valid); snooped BusWr/-- → Invalid.
Invalid: PrRd/BusRd → Valid; PrWr/BusWr (stays Invalid, no write-allocate).]
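A minimal C sketch of this two-state protocol, assuming hypothetical bus_read()/bus_write() hooks for the bus transactions (these names are not from the readings):

typedef enum { INVALID, VALID } vi_state_t;

void bus_read(void);    /* assumed hooks that issue BusRd / BusWr transactions */
void bus_write(void);

/* Processor-side events (PrRd / PrWr) */
vi_state_t vi_processor(vi_state_t s, int is_write) {
    if (is_write) {                 /* PrWr / BusWr: write-through, no write-allocate */
        bus_write();
        return s;                   /* Valid stays Valid, Invalid stays Invalid       */
    }
    if (s == INVALID)               /* PrRd / BusRd on a miss                         */
        bus_read();
    return VALID;
}

/* Snoop-side event: another cache's BusWr invalidates our copy */
vi_state_t vi_snoop_buswrite(vi_state_t s) {
    (void)s;
    return INVALID;
}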

17 Snoopy Invalidation Protocol: MSI

 Extend single valid bit per block to three states:

 M(odified): cache line is only copy and is dirty

 S(hared): cache line is one of several copies

 I(nvalid): not present

 Read miss makes a Read request on the bus, saves the block in S state
 Write miss makes a ReadEx request, saves the block in M state

 When a processor snoops a ReadEx from another writer, it must invalidate its own copy (if any)
 S→M upgrade can be made without re-reading data from memory (via Invl)
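The processor-side and snoop-side behavior described above can be sketched in C roughly as follows. The enum and function names are mine, and the S→M upgrade is shown here as a separate Invl transaction even though the state diagram on the next slide draws it as BusRdX:

typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_INVL } bus_op_t;

/* Processor-side events: returns the new state; *op is the bus transaction to issue. */
msi_state_t msi_processor(msi_state_t s, int is_write, bus_op_t *op) {
    *op = BUS_NONE;
    if (!is_write) {                                    /* PrRd                         */
        if (s == MSI_I) { *op = BUS_RD; return MSI_S; } /* read miss -> BusRd -> S      */
        return s;                                       /* hit in S or M                */
    }
    if (s == MSI_M) return MSI_M;                       /* PrWr hit in M: no bus op     */
    if (s == MSI_S) { *op = BUS_INVL; return MSI_M; }   /* S -> M upgrade, no data read */
    *op = BUS_RDX; return MSI_M;                        /* write miss -> ReadEx -> M    */
}

/* Snoop-side events: react to another cache's bus transaction. */
msi_state_t msi_snoop(msi_state_t s, bus_op_t seen, int *must_flush) {
    *must_flush = (s == MSI_M && (seen == BUS_RD || seen == BUS_RDX)); /* write back dirty data */
    if (seen == BUS_RDX || seen == BUS_INVL) return MSI_I;  /* another writer: invalidate */
    if (seen == BUS_RD && s == MSI_M)        return MSI_S;  /* downgrade after flush      */
    return s;
}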

18 MSI State Machine

[MSI state diagram (ObservedEvent/Action), after Culler/Singh96:
M: PrRd/--, PrWr/-- (stay in M); BusRd/Flush → S; BusRdX/Flush → I
S: PrRd/--, BusRd/-- (stay in S); PrWr/BusRdX → M; BusRdX/-- → I
I: PrRd/BusRd → S; PrWr/BusRdX → M]

19 MSI Correctness

 Write propagation

 Immediately after write:

 New value will exist only in writer’s cache

 Block will be in M state

 Transition into M state ensures all other caches are in I state

 Upon a read in another cache, that cache will miss → BusRd

 BusRd causes a flush (writeback), and the read will see the new value

 Write serialization

 Only one cache can be in M state at a time

 Entering M state generates BusRdX

 BusRdX causes transition out of M state in other caches

 Order of block ownership (M state) defines write ordering

 This order is global by virtue of central bus

20 More States: MESI

 Invalid→Shared→Modified sequence takes two bus ops

 What if data is not shared? Unnecessary broadcasts

 Exclusive state: this is the only copy, and it is clean

 Block is exclusive if, during BusRd, no other cache had it

 Wired-OR “shared” signal on the bus can determine this: snooping caches assert the signal if they also have a copy
 Another BusRd also causes a transition into Shared

 Silent transition Exclusive→Modified is possible on write!

 MESI is also called the Illinois protocol [Papamarcos84]
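A small C sketch of what the E state adds, assuming a wired-OR shared-signal input sampled during the BusRd (all names here are illustrative):

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state_t;

/* Read miss: issue BusRd; if no other cache asserts the shared signal,
 * install the block in E (exclusive, clean) instead of S.               */
mesi_state_t mesi_read_miss(int shared_wire_asserted) {
    return shared_wire_asserted ? MESI_S : MESI_E;
}

/* Write hit: only S needs a bus transaction (BusRdX/upgrade);
 * E -> M is the silent transition, and M stays M.                       */
mesi_state_t mesi_write_hit(mesi_state_t s, int *needs_bus_rdx) {
    *needs_bus_rdx = (s == MESI_S);
    return MESI_M;
}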

21 MESI State Machine

[MESI state diagram, after Culler/Singh96. PrRd(S): read miss with the shared signal asserted; PrRd(S'): read miss with it deasserted.
M: PrRd/--, PrWr/-- (stay in M); BusRd/Flush → S
E: PrRd/-- (stay in E); PrWr/-- → M (silent); BusRd/$ Transfer → S
S: PrRd/--, BusRd/-- (stay in S); PrWr/BusRdX → M
I: PrRd(S')/BusRd → E; PrRd(S)/BusRd → S; PrWr/BusRdX → M
BusRdX (all incoming) → I, flushing data as needed]

22 Snoopy Invalidation Tradeoffs

 Should a downgrade from M go to S or I?
 S: if data is likely to be reused
 I: if data is likely to be migratory
 Cache-to-cache transfer: on a BusRd, should data come from another cache, or from memory?
 Another cache: may be faster, if memory is slow or highly contended
 Memory: simpler, since memory doesn’t need to wait to see if a cache has the data first; less contention at the other caches; but requires writeback on M downgrade
 Writeback on Modified→Shared: necessary?
 One possibility: Owner (O) state (MOESI system)
 One cache owns the latest data (memory is not updated)
 Memory writeback happens when all caches evict their copies

23 Update-Based Coherence: Dragon

 Four states:

 (E)xclusive: Only copy, clean

 (Sm): shared, modified (Owner state in MOESI)

 (Sc): shared, clean (with respect to Sm, not memory)

 (M)odified: only copy, dirty
 Use of updates allows multiple copies of dirty data

 No I state: there is no invalidate (only ordinary evictions)

 Invariant: at most one Sm in a sharer set

 If a cache is Sm, it is authoritative, and Sc caches are clean relative to this, not clean to memory

McCreight, E. “The Dragon computer system: an early overview.” Tech Report, Xerox Corp., Sept 1984. (cited in Culler/Singh96)
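As a rough C sketch of the write-hit behavior implied above (names are mine; the full transitions are on the next slide): a write in a shared state broadcasts a BusUpd, and the shared wire then tells the writer whether any sharers remain, so it can drop to an exclusive state.

typedef enum { DR_E, DR_SC, DR_SM, DR_M } dragon_state_t;

dragon_state_t dragon_write_hit(dragon_state_t s, int *send_bus_upd,
                                int shared_wire /* asserted by other sharers */) {
    if (s == DR_E || s == DR_M) {         /* only copy: silent transition to M    */
        *send_bus_upd = 0;
        return DR_M;
    }
    *send_bus_upd = 1;                    /* Sc or Sm: push the update to sharers */
    return shared_wire ? DR_SM : DR_M;    /* no sharers left -> exclusive M       */
}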

24 Dragon State Machine

[Dragon state diagram, after Culler/Singh96 (S = shared signal asserted, S' = not asserted):
PrRdMiss/BusRd(S') → E;  PrRdMiss/BusRd(S) → Sc
PrWrMiss/BusRd(S') → M;  PrWrMiss/(BusRd(S); BusUpd) → Sm
E:  PrRd/--;  PrWr/-- → M;  BusRd/-- → Sc
Sc: PrRd/--, BusRd/--, BusUpd/Update;  PrWr/BusUpd(S) → Sm;  PrWr/BusUpd(S') → M
Sm: PrRd/--, BusRd/Flush;  BusUpd/Update → Sc;  PrWr/BusUpd(S) (stay);  PrWr/BusUpd(S') → M
M:  PrRd/--, PrWr/--;  BusRd/Flush → Sm]

25 Update-Protocol Tradeoffs

 Shared-Modified state vs. keeping memory up-to-date

 Equivalent to write-through cache when there are multiple sharers

 Immediate vs. lazy notification of Sc-block eviction

 Immediate: if there is only one sharer left, it can go to E or M and save a BusUpd later (only one saved)

 Lazy: no extra traffic required upon eviction

26 Directory-Based Coherence

27 Directory-Based Protocols

 Required when scaling past the capacity of a single bus

 Distributed, but:

 Coherence still requires single point of serialization (for write serialization)

 This can be different for every block (striped across nodes)

 We can reason about the protocol for a single block: one server (directory node), many clients (private caches)

 Directory receives Read and ReadEx requests, and sends Invl requests: invalidation is explicit (as opposed to snoopy buses)

28 Directory: Data Structures

0x00  Shared: {P0, P1, P2}
0x04  ---
0x08  Exclusive: P2
0x0C  ---
…     ---

 Key operation to support is set inclusion test

 False positives are OK: want to know which caches may contain a copy of a block, and spurious invals are ignored

 False positive rate determines performance
 Most accurate (and expensive): full bit-vector

 Compressed representation, linked list, Bloom filter [Zebchuk09] are all possible

 Here, we will assume the directory has perfect knowledge

29 Directory: Basic Operations

 Follow semantics of snoop-based system

 but with explicit request, reply messages

 Directory:

 Receives Read, ReadEx, Upgrade requests from nodes

 Sends Inval/Downgrade messages to sharers if needed

 Forwards request to memory if needed

 Replies to requestor and updates sharing state

 Protocol design is flexible

 Exact forwarding paths depend on implementation

 For example, do cache-to-cache transfer?
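The sketch below combines the full-bit-vector sharer set from the previous slide with the ReadEx handling described here. The send_invl/send_datex helpers and NUM_NODES are assumptions for illustration, not a real API, and a real protocol would wait for ACKs/revision messages before replying:

#include <stdint.h>
#include <stdbool.h>

#define NUM_NODES 64

typedef struct {
    bool     exclusive;        /* one owner holds the block in M/E            */
    int      owner;            /* valid only when exclusive                   */
    uint64_t sharers;          /* bit i set => node i may hold a copy         */
} dir_entry_t;

void send_invl(int node, uint64_t addr);    /* assumed network primitives */
void send_datex(int node, uint64_t addr);

/* Node 'req' asks for an exclusive (writable) copy of 'addr'. */
void dir_read_ex(dir_entry_t *e, uint64_t addr, int req) {
    if (e->exclusive) {
        send_invl(e->owner, addr);          /* former owner must flush and invalidate */
    } else {
        for (int i = 0; i < NUM_NODES; i++)          /* spurious invals are OK:    */
            if (((e->sharers >> i) & 1) && i != req) /* the set may be approximate */
                send_invl(i, addr);
    }
    /* (A real protocol waits for ACKs / revision messages before this point.) */
    e->exclusive = true;
    e->owner     = req;
    e->sharers   = (uint64_t)1 << req;
    send_datex(req, addr);
}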

30 MESI Directory Transaction: Read

P0 acquires an address for reading:

[Figure (Culler/Singh Fig. 8.16): 1. P0 sends a Read request to the Home node; 2. Home replies with DatEx (or DatShr if other copies exist). P1 is not involved.]

31 RdEx with Former Owner

[Figure: 1. P0 sends RdEx to Home; 2. Home sends Invl to the former Owner; 3a. Owner sends a revision message (Rev) to Home; 3b. Owner sends DatEx to P0.]

32 Contention Resolution (for Write)

[Figure: 1a. P0 sends RdEx to Home; 1b. P1 sends RdEx to Home; 2a. Home replies DatEx to P0; 2b. Home sends NACK to P1; 3. P1 retries RdEx; 4. Home sends Invl to P0; 5a. P0 sends Rev to Home; 5b. DatEx is delivered to P1.]

33 Issues with Contention Resolution

 Need to escape race conditions by:

 NACKing requests to busy (pending invalidate) entries

 Original requestor retries

 OR, queuing requests and granting in sequence

 (Or some combination thereof)

 Fairness

 Which requestor should be preferred in a conflict?

 Interconnect delivery order and distance both matter
 We guarantee that some node will make forward progress
 Ping-ponging is a higher-level issue

 With solutions like combining trees (for locks/barriers) and better shared-data-structure design

34 Protocol and Directory Tradeoffs

 Forwarding vs. strict request-reply

 Increases complexity

 Shortens the critical path by creating a chain of request forwarding

 Speculative replies from memory/directory node

 Decreases critical path length in best case

 More complex implementation (and potentially more network traffic)

 Directory storage can imply protocol design

 E.g., linked list for sharer set

A. Hansson et al, “Avoiding message-dependent deadlock in network-based systems on chip,” VLSI Design 2007.

35 Other Issues / Backup

36 Memory Consistency (Briefly)

 We consider only sequential consistency [Lamport79] here

 Sequential Consistency gives the appearance that:

 All operations (R and W) happen atomically in a global order

 Operations from a single thread occur in order in this stream

Proc 0:  A = 1;
Proc 1:  while (A == 0) ;  B = 1;
Proc 2:  while (B == 0) ;  print A

Does Proc 2 print A = 1?

 Thus, ordering between different memory locations exists

 More relaxed models exist; usually require memory barriers when synchronizing
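A minimal sketch of the three-processor example in C11, assuming default (seq_cst) atomics, under which the sequentially consistent outcome holds and Proc 2 must print A = 1; with relaxed atomics and no barriers, printing 0 would become possible:

#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>

atomic_int A = 0, B = 0;

int proc0(void *x) { atomic_store(&A, 1); return 0; }
int proc1(void *x) { while (atomic_load(&A) == 0) ; atomic_store(&B, 1); return 0; }
int proc2(void *x) { while (atomic_load(&B) == 0) ; printf("A = %d\n", atomic_load(&A)); return 0; }

int main(void) {
    thrd_t t[3];
    thrd_create(&t[0], proc0, NULL);
    thrd_create(&t[1], proc1, NULL);
    thrd_create(&t[2], proc2, NULL);
    for (int i = 0; i < 3; i++) thrd_join(t[i], NULL);
    return 0;
}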

37 Correctness Issue: Inclusion

 What happens with multilevel caches?

 Snooping level (say, L2) must know about all data in private hierarchy above it (L1)

 What about directories?
 Inclusive cache is one solution [Baer88]

 L2 must contain all data that L1 contains

 Must propagate invalidates upward to L1
 Other options

 Non-inclusive: inclusion property is optional

 Why would L2 evict if it has more sets and more ways than L1?

 Prefetching!

 Exclusive: line is in L1 xor L2 (AMD K7)
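A tiny C sketch of why inclusion simplifies snooping in a two-level hierarchy; the l2_present/l2_invalidate/l1_invalidate_if_present helpers are assumed for illustration:

#include <stdint.h>
#include <stdbool.h>

bool l2_present(uint64_t addr);              /* assumed cache-lookup helpers */
void l2_invalidate(uint64_t addr);
void l1_invalidate_if_present(uint64_t addr);

/* L2 snoops the bus on behalf of the L1 above it. With inclusion (L1 contents
 * are a subset of L2), a miss in L2 means L1 cannot hold the block either, so
 * only L2 hits need to be propagated upward as invalidations.                */
void l2_snoop_invalidate(uint64_t addr) {
    if (!l2_present(addr))
        return;                               /* inclusion: L1 cannot have it */
    l2_invalidate(addr);
    l1_invalidate_if_present(addr);           /* propagate invalidate upward  */
}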
