ECE 259 / CPS 221: Advanced Computer Architecture II (Parallel Computer Architecture)
Shared Memory MPs – Coherence & Snooping

Copyright 2006 Daniel J. Sorin, Duke University
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

Outline

• Motivation for Cache-Coherent Shared Memory
• Snooping Cache Coherence (Chapter 5)
  – Basic systems
  – Design tradeoffs
• Implementing Snooping Systems (Chapter 6)
• Advanced Snooping Systems

What is (Hardware) Shared Memory?

• Take multiple microprocessors
• Implement a memory system with a single global physical address space (usually)
  – Communication assist HW does the “magic” of cache coherence
• Goal 1: Minimize memory latency
  – Use co-location & caches
• Goal 2: Maximize memory bandwidth
  – Use parallelism & caches

Some Memory System Options

[Figure: four organizations – (a) shared cache: processors P1…Pn share an interleaved first-level cache above interleaved main memory; (b) bus-based shared memory: per-processor caches on a shared bus with main memory and I/O devices; (c) dancehall: per-processor caches on one side of an interconnection network, memories on the other; (d) distributed memory: a memory module co-located with each processor, all connected by an interconnection network]

Cache Coherence

• According to Webster’s dictionary …
  – Cache: a secure place of storage
  – Coherent: logically consistent
• Cache Coherence: keep storage logically consistent
  – Coherence requires enforcement of 2 properties
  1) Write propagation
     – All writes eventually become visible to other processors
  2) Write serialization
     – All processors see writes to the same block in the same order

Why Cache Coherent Shared Memory?

• Pluses
  – For applications – looks like multitasking uniprocessor
  – For OS – only evolutionary extensions required
  – Easy to do communication without OS
  – Software can worry about correctness first and then performance
• Minuses
  – Proper synchronization is complex
  – Communication is implicit, so it may be harder to optimize
  – More work for hardware designers (i.e., us!)
• Result
  – Symmetric Multiprocessors (SMPs) are the most successful parallel machines ever
  – And the first with multi-billion-dollar markets!

Cache Coherence In More Detail

• Efficient naming
  – Virtual to physical mapping with TLBs
  – Ability to name relevant portions of objects
• Easy and efficient caching
  – Caching is natural and well-understood
  – Can be done in HW automatically
• Low communication overhead
  – Low overhead since protection is built into memory system
  – Easy for HW (i.e., cache controller) to packetize requests and replies
• Integration of latency tolerance
  – Demand-driven: consistency models, prefetching, multithreading
  – Can extend to push data to PEs and use bulk transfer

Symmetric Multiprocessors (SMPs)

• Multiple microprocessors
• Each has a cache (or multiple caches in a hierarchy)
• Connect with logical bus (totally-ordered broadcast)
  – Physical bus = set of shared wires
  – Logical bus = functional equivalent of physical bus
• Implement Snooping Cache Coherence Protocol
  – Broadcast all cache misses on bus
  – All caches “snoop” bus and may act (e.g., respond with data)
  – Memory responds otherwise

Outline

• Motivation for Cache-Coherent Shared Memory
• Snooping Cache Coherence (Chapter 5)
  – Basic systems
  – Design tradeoffs
• Implementing Snooping Systems (Chapter 6)
• Advanced Snooping Systems

Cache Coherence Problem (Step 1)

[Figure: P1 and P2 connect through an interconnection network to a main memory holding x; P1 executes ld r2, x and brings a copy of x into its cache]

Cache Coherence Problem (Step 2)

[Figure: P2 then also executes ld r2, x; both P1 and P2 now hold cached copies of x]

Cache Coherence Problem (Step 3)

[Figure: one of the processors now executes add r1, r2, r4 and st x, r1, updating its cached copy of x; the other processor’s cached copy (and main memory) are now stale – this is the cache coherence problem]

Snooping Cache-Coherence Protocols

• Bus provides serialization point (more on this later)
• Each cache controller “snoops” all bus transactions
  – Transaction is relevant if it is for a block this cache contains
  – Take action to ensure coherence
    » Invalidate
    » Update
    » Supply value to requestor if Owner
  – Actions depend on the state of the block and the protocol
• Main memory controller also snoops on bus
  – If no cache is owner, then memory is owner
• Simultaneous operation of independent controllers

Simple 2-State Invalidate Snooping Protocol

• Write-through, no-write-allocate cache
• Proc actions: Load, Store
• Bus actions: GETS, GETX
• Transitions (notation: observed event / action taken):
  – Valid: Load / --; Store / OwnGETX; OtherGETS / --; OtherGETX / -- (go to Invalid)
  – Invalid: Load / OwnGETS (go to Valid); Store / OwnGETX; OtherGETS / --; OtherGETX / --
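To make the notation concrete, here is a minimal Python sketch of the 2-state protocol above. This is our own illustration, not course code; the class and event names are invented, and a real controller is of course hardware, not software.

    # Minimal sketch of the 2-state write-through, no-write-allocate
    # snooping protocol described above.

    class Bus:
        def __init__(self):
            self.caches = []
        def broadcast(self, txn, block, requestor):
            for cache in self.caches:
                if cache is not requestor:
                    cache.snoop(txn, block)

    class TwoStateCache:
        def __init__(self, bus):
            self.state = {}                 # block -> 'V'; absent means Invalid
            self.bus = bus
            bus.caches.append(self)

        def load(self, block):              # Load / OwnGETS on a miss
            if block not in self.state:
                self.bus.broadcast('GETS', block, self)
                self.state[block] = 'V'

        def store(self, block):             # Store / OwnGETX (write-through)
            self.bus.broadcast('GETX', block, self)
            # no-write-allocate: a miss stays Invalid, a hit stays Valid

        def snoop(self, txn, block):
            if txn == 'GETX':               # OtherGETX / -- : invalidate copy
                self.state.pop(block, None)
            # OtherGETS needs no action: memory is up-to-date (write-through)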

Snooping Design Choices

• Controller updates state of blocks in response to processor (ld/st) and snoop (observed bus transaction) events, and generates bus transactions
• Often have duplicate cache tags
• Snooping protocol
  – Set of states, set of events
  – State-transition diagram
  – Actions
• Basic choices
  – Write-through vs. write-back
  – Invalidate vs. update
  – Which states to allow

A 3-State Write-Back Invalidation Protocol

• 2-State Protocol
  + Simple hardware and protocol
  – Uses lots of bandwidth (every write goes on bus!)
• 3-State Protocol (MSI)
  – Modified
    » One cache exclusively has valid (modified) copy → Owner
    » Memory is stale
  – Shared
    » >= 1 cache and memory have valid copy (memory = owner)
  – Invalid (only memory has a valid copy, and memory is owner)
• Must invalidate all other copies before entering Modified state
  – Requires a bus transaction (to order and invalidate)

MSI Processor and Bus Actions

• Processor:
  – Load
  – Store
  – Writeback on replacement of modified block
• Bus:
  – GetShared (GETS): get without intent to modify; data could come from memory or another cache
  – GetExclusive (GETX): get with intent to modify; must invalidate all other caches’ copies
  – PutExclusive (PUTX): cache controller puts contents on bus, and memory is updated
  – Definition: a cache-to-cache transfer occurs when another cache satisfies a GETS or GETX request
• Let’s draw it!

An MSI Protocol Example

  Proc Action     P1 state   P2 state   P3 state   Bus Act   Data from
  initially       I          I          I          --        --
  1. P1 load u    I→S        I          I          GETS      Memory
  2. P3 load u    S          I          I→S        GETS      Memory
  3. P3 store u   S→I        I          S→M        GETX      Memory or P1 (?)
  4. P1 load u    I→S        I          M→S        GETS      P3’s cache
  5. P2 load u    S          I→S        S          GETS      Memory

• Single-writer, multiple-reader protocol
• Why Modified to Shared in line 4?
• What if the block is not in any cache? Memory responds
• Read then write produces 2 bus transactions
  – Slow and wasteful of bandwidth for a common sequence of actions
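As a sanity check, here is a small Python sketch (invented classes, not course code) that replays the five-step example and prints the data source for each bus transaction:

    # Replays the MSI example above; output matches the "Data from" column.

    M, S, I = 'M', 'S', 'I'

    class MSICache:
        def __init__(self, name):
            self.name, self.state = name, I

        def load(self, bus):
            if self.state == I:
                bus.issue(self, 'GETS')
                self.state = S

        def store(self, bus):
            if self.state != M:
                bus.issue(self, 'GETX')
                self.state = M

        def snoop(self, txn):
            if txn == 'GETS' and self.state == M:
                self.state = S       # owner supplies data; memory updated
            elif txn == 'GETX':
                self.state = I       # invalidate on another writer

    class LogicalBus:
        def __init__(self, caches):
            self.caches = caches
        def issue(self, requestor, txn):
            owner = next((c for c in self.caches
                          if c is not requestor and c.state == M), None)
            source = f"{owner.name}'s cache" if owner else 'Memory'
            for c in self.caches:
                if c is not requestor:
                    c.snoop(txn)
            print(f"{requestor.name} {txn}: data from {source}")

    p1, p2, p3 = (MSICache(n) for n in ('P1', 'P2', 'P3'))
    bus = LogicalBus([p1, p2, p3])
    p1.load(bus); p3.load(bus); p3.store(bus); p1.load(bus); p2.load(bus)
    # -> Memory, Memory, Memory, P3's cache, Memory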

MSI State Diagram

• Transitions (observed event / action taken):
  – M: Load / --; Store / --; OtherGETS / -- (to S); OtherGETX / -- (to I); Writeback / OwnPUTX (to I)
  – S: Load / --; Store / OwnGETX (to M); OtherGETS / --; OtherGETX / -- (to I); Writeback / -- (to I, silent)
  – I: Load / OwnGETS (to S); Store / OwnGETX (to M)
• Note: we never take any action on an OtherPUTX

4-State (MESI) Invalidation Protocol

• Often called the Illinois protocol
  – Modified (dirty)
  – Exclusive (clean unshared): only copy, not dirty
  – Shared
  – Invalid
• Requires a shared signal to detect whether other caches have a copy of the block
  – I→S if shared signal is asserted by any other cache
  – I→E otherwise
• E→M transition doesn’t require a bus transaction
  – Why is this a good thing?

More Generally: MOESI

• [Sweazey & Smith, ISCA86]
• M – Modified (dirty)
• O – Owned (dirty but shared) WHY?
• E – Exclusive (clean unshared): only copy, not dirty
• S – Shared
• I – Invalid
• [Figure: MOESI states organized by validity (M, O, E, S), exclusiveness (M, E), and ownership (M, O)]
• Variants
  – MSI
  – MESI
  – MOSI
  – MOESI

4-State Write-Back Update Protocol

• Dragon (Xerox PARC)
• States
  – Exclusive (E): one copy, clean, memory is up-to-date
  – Shared-Clean (SC): could be two or more copies, memory unknown
  – Shared-Modified (SM): could be two or more copies, memory stale
  – Modified (M)
• On a write, send new data to update other caches
• Adds Update bus transaction
  – Do not confuse Update with Upgrade!
• Adds cache controller Update operation
• Must obtain bus before updating local copy
  – Bus is serialization point → important for consistency (later …)

Tradeoffs in Protocol Design (C&S 5.4)

• Design issues:
  – Which state transitions
  – What types of bus transactions
  – Cache block size
  – Cache associativity
  – Write-back vs. write-through caching
  – Etc.
• Methodology: count protocol state transitions
  – Can then compute bandwidth, miss rates, etc.
  – Results depend on workload (different workload → different transition rates)

Computing Bandwidth

• Why bandwidth?
• Monitor state transitions
  – Count bus transactions
  – We know how many bytes each bus transaction requires

MESI State Transitions and Bandwidth

• Bytes on bus per transition (6 bytes of address/command, plus 64 bytes when a data block moves):

  FROM/TO   NP            I             E             S             M
  NP        --            --            GETS (6+64)   GETS (6+64)   GETX (6+64)
  I         --            --            GETS (6+64)   GETS (6+64)   GETX (6+64)
  E         --            --            --            --            --
  S         --            --            NA            --            Upgr (6)
  M         PUTX (6+64)   PUTX (6+64)   NA            (6+64)        --

Study #1: Bandwidth of MSI vs. MESI

• Assume an X MIPS/MFLOPS processor
  – Use with measured state transition counts to obtain transitions/sec
• Compute state transitions/sec
• Compute bus transactions/sec
• Compute bytes/sec
• What is the BW savings of MESI over MSI?
• Difference between protocols is the Exclusive state
  – For blocks MESI would hold in E, MSI (which holds them in S) requires an extra Upgrade to reach M
• Result is very small benefit! Why?
  – Small number of E→M transitions (depends on workload!)
  – Upgrade consumes only 6 bytes on bus
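The counting methodology boils down to a weighted sum: transitions per second, times bus bytes per transition. A sketch with made-up transition rates (the byte costs follow the table above; the rates are assumptions, not measurements):

    BYTES = {'GETS': 6 + 64, 'GETX': 6 + 64, 'Upgr': 6, 'PUTX': 6 + 64}

    transitions_per_sec = {      # assumed workload measurements, not real data
        'GETS': 200_000, 'GETX': 50_000, 'Upgr': 10_000, 'PUTX': 40_000,
    }

    bus_bytes_per_sec = sum(BYTES[t] * rate
                            for t, rate in transitions_per_sec.items())
    print(f"{bus_bytes_per_sec / 1e6:.2f} MB/s of bus traffic")   # 20.36 MB/s

Comparing MSI to MESI is then just rerunning the sum with the E→M Upgrades removed; since each Upgrade is only 6 bytes, the difference is tiny.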

Study #2: MSI Upgrade vs. GETX

• MSI S→M transition issues an Upgrade
  – Could have block invalidated while waiting for Upgrade response
  – Adds complexity to detect this (and deal with it correctly)
• Could instead just issue GETX
  – Pretend like block is in Invalid state
• Upgrade achieves ~10% to 20% improvement
  – Compared to just issuing GETX
  – Application dependent (as are all studies in architecture!)

Cache Block Size

• Block size is unit of transfer and of coherence
  – Doesn’t have to be; could do coherence at a smaller granularity [Goodman]
• Uniprocessor 3Cs
  – Compulsory, Capacity, Conflict
• Shared memory adds Coherence miss type (a 4th C)
  – True sharing miss fetches data written by another processor
  – False sharing miss results from independent data in same coherence block
• Increasing block size
  – Usually fewer 3C misses but more bandwidth
  – Usually more false sharing misses
• P.S. on increasing cache size
  – Usually fewer capacity/conflict misses (& compulsory don’t matter)
  – No effect on true/false “coherence” misses (so they may come to dominate)

Study #3: Invalidate vs. Update

• Pattern 1 (one write before reads):

    for i = 1 to k
        P1(write, x);          // one write before reads
        P2 to PN-1(read, x);
    end for i

• Pattern 2 (many writes before reads):

    for i = 1 to k
        for j = 1 to m
            P1(write, x);      // many writes before reads
        end for j
        P2(read, x);
    end for i

Invalidate vs. Update, cont.

• Pattern 1 (one write before reads), with N = 16, M = 10, K = 10
  – Update
    » Iteration 1: N regular cache misses (70 bytes each)
    » Each remaining iteration: one update (14 bytes: 6 control, 8 data)
    » Total update traffic = 16*70 + 9*14 = 1,246 bytes
    » (Book assumes 10 updates instead of 9…)
  – Invalidate
    » Iteration 1: N regular cache misses (70 bytes each)
    » Each remaining iteration: P1 generates an upgrade (6 bytes) and the 15 other processors each take a read miss (70 bytes)
    » Total invalidate traffic = 16*70 + 9*6 + 15*9*70 = 10,624 bytes
• Pattern 2 (many writes before reads)
  – Update = 1,400 bytes
  – Invalidate = 824 bytes
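These totals can be checked mechanically; this short script (same parameters as the slide: N = 16 processors, K = 10 iterations, M = 10 writes per iteration) reproduces all four numbers:

    N, K, M = 16, 10, 10
    MISS, UPDATE, UPGRADE = 70, 14, 6          # bytes per bus transaction

    # Pattern 1: one write, then N-1 readers, per iteration
    update_1 = N * MISS + (K - 1) * UPDATE                       # 1,246
    inval_1  = N * MISS + (K - 1) * (UPGRADE + (N - 1) * MISS)   # 10,624

    # Pattern 2: M writes, then one read by P2, per iteration
    update_2 = K * M * UPDATE                                    # 1,400
    inval_2  = MISS + (K - 1) * UPGRADE + K * MISS               # 824

    print(update_1, inval_1, update_2, inval_2)   # 1246 10624 1400 824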

Invalidate vs. Update, cont.

• What about real workloads?
  – Update can generate too much traffic
  – Must selectively limit it
• Current assessment
  – Update very hard to implement correctly (because of consistency … discussion coming in a couple of weeks)
  – Rarely done
• Future assessment
  – May be same as current, or
  – Chip multiprocessors may revive update protocols
    » More intra-chip bandwidth
    » Easier to have predictable timing paths?

Outline

• Motivation for Cache-Coherent Shared Memory
• Snooping Cache Coherence (Chapter 5)
• Implementing Snooping Systems (Chapter 6)
• Advanced Snooping Systems

Review: Symmetric Multiprocessors (SMP)

• Multiple (micro-)processors
• Each has cache (today, a cache hierarchy)
• Connect with logical bus (totally-ordered broadcast)
• Implement Snooping Cache Coherence Protocol
  – Broadcast all cache “misses” on bus
  – All caches “snoop” bus and may act
  – Memory responds otherwise

Implementation Issues

• How does memory know another cache will respond, so it doesn’t have to?
• Is it okay if a cache miss is not an atomic event (check tags, queue for bus, get bus, etc.)?
• What about L1/L2 caches & split-transaction buses?
• Can we guarantee we won’t get deadlock?
• What happens on a PTE update with multiple TLBs?
• Can one use virtual caches in SMPs?

This is why they pay architects the big bucks!

Review: MSI State Diagram

• Same protocol as before (observed event / action taken):
  – M: Load / --; Store / --; OtherGETS / -- (to S); OtherGETX / -- (to I); Writeback / OwnPUTX (to I)
  – S: Load / --; Store / OwnGETX (to M); OtherGETS / --; OtherGETX / -- (to I); Writeback / -- (to I, silent)
  – I: Load / OwnGETS (to S); Store / OwnGETX (to M)
• Note: we never take any action on an OtherPUTX

Outline for Implementing Snooping

• Coherence Control Implementation
• Writebacks, Non-Atomicity
• Hierarchical Caches
• Split Buses
• Deadlock, Livelock, & Starvation
• Three Case Studies
• TLB Coherence
• Virtual Cache Issues

Snooping SMP Design Goals

• Goals
  – Correctness
  – High performance
  – Simple hardware (reduced complexity & cost)
• Conflicts between goals
  – High performance → multiple outstanding low-level events → more complex interactions → more potential correctness bugs

Cache Controllers and Tags

• On a miss in a uniprocessor:
  – Assert request for memory bus
  – Wait for bus grant
  – Drive address and command lines
  – Wait for command to be accepted by relevant device
  – Transfer data
• In a snoop-based multiprocessor, cache controller must also:
  – Monitor bus and serve processor
    » Can view as two controllers: bus-side and processor-side
    » With single-level cache: dual tags (not data) or dual-ported tag RAM
    » Synchronize tags on updates
  – Respond to bus transactions when necessary

Base Cache Coherence Design

• Single-level write-back cache
• Invalidation protocol
• One outstanding memory request per processor
• Atomic memory bus transactions
  – No interleaving of transactions
• Atomic operations within a process
  – One finishes before next in program order
• Now, we’re going to gradually add complexity
  – Why? Faster latencies and higher bandwidths!
  – But we’ll stick with invalidation protocol (instead of update)

Reporting Snoop Results: How?

• Collective response from caches must appear on bus
• Wired-OR signals
  – Shared: asserted if any cache has a copy (used for E state)
  – Dirty/Inhibit: asserted if some cache has a dirty copy
    » Don’t need to know which, since it will do what’s necessary
  – Snoop-valid: asserted when OK to check other two signals
• May require priority scheme for cache-to-cache transfers
  – Which cache should supply data when in shared state?
  – Commercial implementations allow memory to provide data
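A tiny sketch (our own illustration; the signal names follow the slide) of how the wired-OR lines summarize per-cache snoop results for the memory controller:

    def combine_snoop_results(per_cache):
        """per_cache: list of (shared, dirty) pairs, one per snooping cache."""
        shared = any(s for s, d in per_cache)   # a copy exists -> fill in S, not E
        dirty  = any(d for s, d in per_cache)   # a dirty owner exists -> inhibit
                                                # memory; owner supplies the data
        return shared, dirty

    # One cache holds the block Modified, two others do not have it:
    print(combine_snoop_results([(False, False), (True, True), (False, False)]))
    # -> (True, True)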

Reporting Snoop Results: When?

• Memory needs to know what, if anything, to do
• Static delay: fixed number of clocks from address appearing on bus
  – Dual tags required to reduce contention with processor
  – Still must be conservative (update both on write: E → M)
  – Pentium Pro, HP servers, Sun Enterprise (pre E-10K)
• Variable delay
  – Memory assumes cache will supply data until all say “sorry”
  – Less conservative, more flexible, more complex
  – Memory can fetch data early and hold (SGI Challenge)
• Immediately: bit-per-block state in memory
  – HW complexity in commodity main memory system

Base Organization

[Figure: single-level cache with dual tags. The processor-side controller and the bus-side (snoop) controller each have their own copy of the tags and state; both share the cache data RAM. A write-back buffer with its own tag and comparator sits between the cache and the system (Addr/Cmd/Data) bus and is also snooped; a snoop-state/data buffer interfaces to the system bus]

Writebacks

• Must allow processor to proceed on a miss
  – Fetch the block
  – Perform writeback later
• Need writeback buffer
  – Must handle bus transactions in writeback buffer
    » Snoop writeback buffer
  – Must care about the order of reads and writes
  – Affects memory consistency model (yuck – trust me on this for now)

Optimization #1: Non-Atomic State Transitions

• Operations involve multiple actions
  – Look up cache tags
  – Bus arbitration
  – Check for outstanding writeback
  – Even if bus is atomic, overall set of actions is not
  – Race conditions among multiple operations
• Suppose P1 and P2 attempt to write cached block A
  – Each decides to issue an Upgrade to transition from S → M
• Issues
  – Handle requests for other blocks while waiting to acquire bus
  – Must handle requests for this block A
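One way to handle this race is with transient states, which the next slide introduces. Here is a sketch (invented names, heavily simplified): between issuing the Upgrade and winning the bus, an observed invalidation demotes the request to a full GETX.

    class Controller:
        def __init__(self):
            self.state = 'S'            # block A cached in Shared

        def processor_store(self):
            if self.state == 'S':
                self.state = 'S->M'     # transient: Upgrade queued, bus not won

        def snoop_invalidation(self):   # observed OtherGETX or OtherUpgrade
            if self.state == 'S->M':
                self.state = 'I->M'     # lost the race: our copy is now stale
            elif self.state in ('S', 'M'):
                self.state = 'I'

        def bus_grant(self):
            if self.state == 'S->M':
                txn = 'Upgrade'         # data still valid; just order + invalidate
            else:                       # 'I->M'
                txn = 'GETX'            # must also fetch the data
            self.state = 'M'
            return txn

    # P1 and P2 both try to upgrade block A; P2 wins the bus first:
    p1, p2 = Controller(), Controller()
    p1.processor_store(); p2.processor_store()
    print(p2.bus_grant())               # -> Upgrade
    p1.snoop_invalidation()
    print(p1.bus_grant())               # -> GETX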

Non-Atomicity → Transient States

• Two types of states
  – Stable (e.g., MOESI)
  – Transient or Intermediate (e.g., S→M: request issued, bus not yet granted)
• Increases complexity

[Figure: MSI-like diagram extended with transient states I→S,E, I→M, and S→M between issuing a bus request (PrRd/BusReq, PrWr/BusReq) and receiving the grant (BusGrant/BusRd, BusGrant/BusRdX, BusGrant/BusUpgr); a cache in S→M that observes another BusRdX falls back to I→M. Note: details of this figure aren’t important]

Optimization #2: Multi-level Cache Hierarchies

• How to snoop with multi-level caches?
  – Independent at every level?
  – Maintain cache inclusion?
• Requirements for Inclusion
  – Data in higher level (closer to processor) is a subset of data in lower level
  – Modified in higher level → marked modified in lower level
• Then we only need to snoop the lowest-level cache
  – If L2 says a block is not present (or not modified), the same holds in L1
• Is inclusion automatically preserved?
  – Replacements: all higher-level misses go to lower level
  – Not necessarily – see the violations on the next slide

Violations of Inclusion

• The L1 and L2 may choose to replace different blocks
  – Differences in reference history
    » Set-associative first-level cache with LRU replacement
  – Split higher-level caches
    » Instruction & data blocks go in different caches at L1, but collide in L2
    » What if L2 is set-associative?
  – Differences in block size
• But a common case works automatically
  – L1 direct-mapped, and
  – L1 has fewer sets than L2, and
  – L1 and L2 have same block size

Inclusion: To Be or Not To Be

• Most common inclusion solution
  – Ensure L2 holds a superset of L1I and L1D
  – On an L2 replacement, or a coherence request that must source data or invalidate, forward the action to the L1 caches
  – Can maintain bits in L2 cache to filter some actions from forwarding
• But inclusion may not be ideal
  – Restricted associativity in unified L2 can limit blocks in split L1s
  – Not that hard to always snoop L1s
  – If L2 isn’t much bigger than L1, then inclusion is wasteful
• Thus, many new designs don’t maintain inclusion
  – Exclusion: no block is in more than one cache
  – Not Inclusive != Exclusive, and Not Exclusive != Inclusive
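A toy sketch (invented structures) of the common solution above: the inclusive L2 forwards its evictions and invalidations down to L1, so snooping only the L2 remains safe.

    class InclusiveHierarchy:
        def __init__(self):
            self.l1, self.l2 = set(), set()   # sets of cached block addresses

        def fill(self, block):
            self.l2.add(block)                # inclusion: L1 is a subset of L2
            self.l1.add(block)

        def l2_evict_or_invalidate(self, block):
            self.l2.discard(block)
            self.l1.discard(block)            # forward to L1 to keep inclusion

        def snoop_hit(self, block):
            # only the lowest level is checked; an L2 miss implies an L1 miss
            return block in self.l2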

Optimization #3: Split-Transaction (Pipelined) Bus

• Supports multiple simultaneous transactions
  – Higher throughput!! (perhaps worse latency)

[Figure: on an atomic bus, each request holds the bus through the memory-access delay until its response; on a split-transaction bus, the requests and responses of different transactions overlap in time]

Potential Problems

• Two transactions to same block (conflicting)
  – Mid-transaction snoop hits
  – E.g., in S, going to M, observe OtherGETX
• Buffering requests and responses
  – Need flow control to prevent deadlock
• Ordering of snoop responses
  – When does snoop response appear with respect to data response?

One Solution (like the SGI PowerPath-2)

• NACK (Negative ACKnowledgment) for flow control
  – Snooper can NACK a transaction if it can’t buffer it
• Out-of-order responses
  – Snoop results presented with data response
• Disallow multiple concurrent transactions to one line
  – Not necessary, but it can improve designer sanity

Serialization Point in Split-Transaction Buses

• Is the bus still the serialization point?
  – Yes! When a request wins the bus, it is serialized (unless NACKed)
  – Data and snoop response can show up way later
  – Snoop decisions are made based on what’s been serialized
• Example (allows multiple outstanding to same block)
  – Initially: block B is Invalid in all caches
  – P1 issues GETX for B, waits for bus
  – P2 issues GETX for B, waits for bus
  – P2’s request wins the bus (but no data from memory until later)
  – P1’s request wins the bus … who responds?
  – P2 will respond, since P2 is the owner (even before data arrives!)
  – P2 receives data from memory
  – P2 sends data to P1
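A sketch (invented structures, much simplified) of the example above: ownership transfers at request-serialization time, so P2 owes P1 the data even before P2's own data has arrived.

    owner = {}                           # block -> owner fixed at serialization

    def serialize_getx(proc, block):
        previous = owner.get(block, 'Memory')
        owner[block] = proc              # requestor becomes owner immediately
        return previous                  # ...and the previous owner must respond

    print(serialize_getx('P2', 'B'))     # -> Memory (memory will supply P2 later)
    print(serialize_getx('P1', 'B'))     # -> P2 (P2 responds once it has the data)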

A More General Split-Transaction Bus Design

• 4 buses + flow control and snoop results
  – Command (type of transaction)
  – Address
  – Tag (unique identifier for response)
  – Data (doesn’t require address)
• Forms of coherence transactions
  – GETS, GETX (both are “request + response”)
  – PUTX (“request + data”)
  – Upgrade (“request” only)
• Per-processor request table tracks all outstanding transactions

Multi-level Caches with Split-Transaction Bus

• General structure uses queues between
  – Bus and L2 cache
  – L2 cache and L1 cache
• Many potential deadlock problems
• Classify all messages to break cyclic dependences
  – Requests only generate responses
  – Responses don’t generate any other messages
• Requestor guarantees space for all responses
• Use separate request and response queues
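A small sketch (an invented queue/credit scheme, not the book's design) of those rules: requests and responses use separate queues, and a request may only be issued once buffer space for its response is reserved, so responses can always be sunk.

    from collections import deque

    class NodeQueues:
        def __init__(self, max_outstanding):
            self.requests  = deque()
            self.responses = deque()
            self.credits   = max_outstanding    # reserved response-buffer slots

        def try_issue_request(self, req):
            if self.credits == 0:
                return False                    # stall a REQUEST, never a response
            self.credits -= 1                   # reserve room for its response
            self.requests.append(req)
            return True

        def sink_response(self, resp):
            self.responses.append(resp)         # guaranteed room (credit taken)
            self.credits += 1                   # consuming it frees the slot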

Multi-Level Caches with Split Bus

[Figure: each processor’s L1 and L2 caches are connected by response/request queues in both directions (L1→L2 and L2→L1), and each L2 connects to the bus through separate request and response queues]

More on Correctness

• Partial correctness (never wrong): maintain coherence and consistency
• Full correctness (always right): prevent the following:
• Deadlock:
  – All system activity ceases
  – Cycle of resource dependences
• Livelock:
  – No processor makes forward progress
  – Constant on-going transactions at hardware level
  – E.g., simultaneous writes in invalidation-based protocol
• Starvation:
  – Some processors make no forward progress
  – E.g., interleaved memory system with NACK on bank busy

Deadlock, Livelock, Starvation

• Deadlock: can be caused by request-reply protocols
  – When issuing requests, must service incoming transactions
  – E.g., cache awaiting bus grant must snoop & flush blocks
  – Else may not respond to request that will release bus: deadlock
• Livelock:
  – Window of vulnerability problem [Kubiatowicz et al., MIT]
  – Handling invalidations between obtaining ownership & write
  – Solution: don’t let exclusive ownership be stolen before write
• Starvation:
  – Solve by using fair arbitration on bus and FIFO buffers

Deadlock Avoidance

• Responses are never delayed by requests waiting for a response
• Responses are guaranteed to be sunk
• Requests will eventually be serviced, since the number of responses is bounded by the number of outstanding requests
• Must classify messages according to deadlock and coherence semantics
  – If type 1 messages (requests) spawn type 2 messages (responses), then type 2 messages can’t be allowed to spawn type 1 messages
  – More generally, must avoid cyclic dependences among message types
    » We will see that directory protocols often have 3 message types
    » Request, ForwardedRequest, Response

SGI Challenge Overview

[Figure: (a) a four-processor board with R4400 CPUs and caches; (b) machine organization with interleaved memory (16 GB maximum), I/O subsystem (SCSI-2, VME-64, HPPI, graphics), and the Powerpath-2 bus (256 data, 40 address, 47.6 MHz)]

• 36 MIPS R4400 (peak 2.7 GFLOPS, 4 per board) or 18 MIPS R8000 (peak 5.4 GFLOPS, 2 per board)
• 8-way interleaved memory (up to 16 GB)
• 1.2 GB/s Powerpath-2 bus @ 47.6 MHz, 16 slots, 329 signals
• 128-byte lines (1 + 4 cycles)
• Split-transaction with up to 8 outstanding reads
  – All transactions take five cycles
• Miss latency nearly 1 µs (mostly on CPU board, not bus…)

Processor and Memory Systems

[Figure: four MIPS R4400 processors, each with an L2 cache and a CC chip holding duplicate tags, share one A (address) chip and four bit-sliced D (data) chips per board on the Powerpath-2 bus]

• 4 MIPS R4400 processors per board share A / D chips
• A chip has address bus interface, request table, control logic
• CC chip per processor has duplicate set of tags
• Processor requests go from CC chip to A chip to bus
• 4 bit-sliced D chips interface CC chip to bus

SGI Powerpath-2 Bus

• Non-multiplexed (i.e., separate A and D buses), 256-bit data / 40-bit address, 47.6 MHz, 8 outstanding requests
• Wide → more interface chips, so higher latency, but more bandwidth at a slower clock
• Large block size also calls for wider bus
• Uses Illinois MESI protocol (cache-to-cache sharing)
• More detail in chapter

Bus Design and Request-Response Matching

• Essentially two separate buses, arbitrated independently
  – “Request” bus for command and address
  – “Response” bus for data
• Out-of-order responses imply need for matching each request with its corresponding response
  – Request gets 3-bit tag when it wins arbitration (8 outstanding max)
  – Response includes data as well as corresponding request tag
  – Tags allow response to not use address bus, leaving it free
• Separate bus lines for arbitration and for snoop results

Bus Design (continued)

• Each of request and response phase is 5 bus cycles
  – Response: 4 cycles for data (128 bytes, 256-bit bus), 1 turnaround
  – Request phase: arbitration, resolution, address, decode, acknowledge
  – Request-response transaction takes 3 or more of these phases

[Figure: bus timing – the address bus repeats Arb, Rslv, Addr, Dcd, Ack cycles while data transfers (D0–D3) for earlier transactions overlap on the data bus; two pipelined read operations are shown]

1. Arbitration: at least one requestor asserts a bus request
2. Resolution: determine the winner
3. Address: winner drives the address
4. Decode: cache tags looked up; extend the ack cycle if the lookup cannot finish in time; determine who will respond, if any
5. Acknowledge: actual response comes later, with re-arbitration

• Write-backs have request phase only: arbitrate both data + address buses

Bus Design (continued)

• Flow control through negative acknowledgement (NACK)
• No conflicting requests for same block allowed on bus
  – 8 outstanding requests total makes conflict detection tractable
  – Eight-entry “request table” in each cache controller
  – New request on bus added to all tables at same index, determined by its tag
  – Entry holds address, request type, state in that cache (if determined already), …
  – All entries checked on bus or processor accesses for a match, so fully associative
  – Entry freed when response appears, so tag can be reassigned by bus
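A sketch (assumed fields, simplified) of that eight-entry request table: every controller inserts each serialized request at the index given by its 3-bit tag, checks all entries associatively, and frees the entry when the matching response appears.

    class RequestTable:
        SIZE = 8                                  # 8 outstanding requests max

        def __init__(self):
            self.entries = [None] * self.SIZE     # tag -> (addr, req_type)

        def insert(self, tag, addr, req_type):
            assert self.entries[tag] is None      # tag reused only after response
            self.entries[tag] = (addr, req_type)

        def conflicts(self, addr):
            # checked on every bus/processor access (fully associative)
            return any(e and e[0] == addr for e in self.entries)

        def respond(self, tag):
            addr, req_type = self.entries[tag]
            self.entries[tag] = None              # response frees tag for reuse
            return addr, req_type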

Bus Interface with Request Table

[Figure: cache controller bus interface. An eight-entry request table (fields: address, issue + merge check, originator, my-response, miscellaneous information) is checked by a comparator on every bus or processor access. Also shown: request + response queue to/from the cache, snoop-state logic, a write-back buffer with its own tag comparator, and a data buffer, all attached to the address + command bus and the data + tag bus]

Memory Access Latency

• 250 ns access time from address on bus to data on bus
• But overall latency seen by processor is 1000 ns!
  – 300 ns for request to get from processor to bus
    » Down through cache hierarchy, CC chip, and A chip
  – 400 ns later, data gets to D chips
    » 3 bus cycles to address phase of request transaction, 12 to access main memory, 5 to deliver data across bus to D chips
  – 300 ns more for data to get to processor chip
    » Up through D chips, CC chip, and the 64-bit-wide interface to the processor chip; load data into primary cache, restart pipeline

Challenge I/O Subsystem

[Figure: I/O board with personality ASICs (SCSI, VME, HPPI, graphics, peripherals) on a 320 MB/s HIO bus, connected through an address map / datapath system-bus interface to the system address bus and the 1.2 GB/s system data bus]

• Multiple I/O cards on system bus, each with a 320 MB/s HIO bus
  – Personality ASICs connect these to devices (standard and graphics)
• Proprietary HIO bus
  – 64-bit multiplexed address/data, split read transactions, up to 4 per device
  – Pipelined, but centralized arbitration, with several transaction lengths
  – Address translation via mapping RAM in system bus interface
• I/O board acts like a processor to the memory system

SUN Enterprise 6000 Overview

[Figure: CPU/memory cards (two processors with L2 caches plus a memory controller) and I/O cards, each with a bus interface, on the Gigaplane™ bus (256 data, 41 address, 83 MHz)]

• Up to 30 UltraSPARC processors, MOESI protocol
• Gigaplane™ bus has peak BW 2.67 GB/s, 300 ns access latency
• Up to 112 outstanding transactions (max 7 per board)
• 16 bus slots, for processing or I/O boards
  – 2 CPUs and 1 GB memory per board
  – Memory distributed, but protocol treats it as centralized (UMA)

Sun Gigaplane Bus

• Non-multiplexed, split-transaction, 256-bit data / 41-bit address, 83.5 MHz (plus 32 ECC lines, 7 tag, 18 arbitration, etc.; 388 signals total)
• Cards plug in on both sides: 8 per side
• 112 outstanding transactions, up to 7 from each board
  – Designed for multiple outstanding transactions per processor
• Emphasis on reducing latency, unlike Challenge
  – Speculative arbitration if address bus not scheduled from previous cycle
  – Else regular 1-cycle arbitration, and 7-bit tag assigned in next cycle
• Snoop result associated with request (5 cycles later)
• Main memory can stake claim to data bus 3 cycles into this, and start memory access speculatively
  – Two cycles later, asserts tag bus to inform others of coming transfer
• MOESI protocol

Enterprise Processor and Memory System

• 2 processors per board, external L2 caches, 2 memory banks with crossbar
• Data lines buffered through UDB to drive internal 1.3 GB/s UPA bus
• Wide path to memory, so a full 64-byte line transfers in 2 bus cycles

[Figure: board with two UltraSPARC processors (with L2 caches and tags), UDB data buffers, SysIO chips serving FiberChannel modules, 10/100 Ethernet, fast wide SCSI, and two 25 MHz SBus slots; memory of 16 72-bit SIMMs; D-tags, address controller, and data controller (crossbar) connecting to the Gigaplane connector]

Enterprise I/O System

• I/O board has same bus interface ASICs as processor boards
• But internal bus half as wide, and no memory path
• Only cache-block-sized transactions, like processing boards
  – Uniformity simplifies design
  – ASICs implement single-block cache, follow coherence protocol
• Two independent 64-bit, 25 MHz SBuses
  – One for two dedicated FiberChannel modules connected to disk
  – One for Ethernet and fast wide SCSI
  – Can also support three SBus interface cards for arbitrary peripherals
• Performance and cost of I/O scale with # of I/O boards

Sun Enterprise 10000 (aka E10K or Starfire)

• How far can you go with snooping coherence?
• Quadruple request/snoop bandwidth using four “logical” address buses
  – Each handles 1/4 of the physical address space
  – Impose logical ordering for consistency: for writes on the same cycle, those on bus 0 occur “before” bus 1, etc.
• Get rid of the data bandwidth problem: use a network
  – E10K uses a 16×16 crossbar between CPU boards & memory boards
  – Each CPU board has up to 4 CPUs: max 64 CPUs total
• 10.7 GB/s max BW, 468 ns unloaded miss latency
• We will discuss a paper on E10K later

Outline for Implementing Snooping

• Coherence Control Implementation
• Writebacks, Non-Atomicity, & Serialization/Order
• Hierarchical Cache
• Split Buses
• Deadlock, Livelock, & Starvation
• Three Case Studies
• TLB Coherence
• Virtual Cache Issues

Translation Lookaside Buffer

• Cache of page table entries
• Page table maps virtual page to physical frame

[Figure: virtual pages 0, 4, and 7 of a virtual address space mapping to physical frames 3, 4, and 7]

The TLB Coherence Problem

• Since the TLB is a cache, it must be kept coherent
• Change of PTE on one processor must be seen by all processors
• Why might a PTE be cached in more than 1 TLB?
  – Actual sharing of data
  – Process migration
• Changes are infrequent
  – Get the OS to do it

TLB Shootdown

• To modify a TLB entry, the modifying processor must
  – LOCK page table
  – Flush TLB entries
  – Queue TLB operations
  – Send inter-processor interrupt
  – Spin until other processors are done
  – UNLOCK page table
• SLOW!
• But most common solution today
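A pseudocode-style sketch (invented structures; a real OS uses page-table locks, IPIs, and per-CPU spinning) of the shootdown sequence above:

    class Cpu:
        def __init__(self):
            self.tlb = {}                   # vpn -> cached PTE

        def flush(self, vpn):               # runs in the IPI handler
            self.tlb.pop(vpn, None)

    def tlb_shootdown(page_table, vpn, new_pte, all_cpus):
        # 1. LOCK page table (elided here)
        page_table[vpn] = new_pte           # 2. modify the mapping
        for cpu in all_cpus:
            cpu.flush(vpn)                  # 3-5. queue op, send IPI, spin
                                            #      until every flush completes
        # 6. UNLOCK page table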

TLB Shootdown Improvements

• Evolutionary changes
  – Keep track of which processors ever had the mapping & only shoot them down
  – Defer shootdowns on “upgrade” changes (e.g., page from read-only to read-write)
  – SGI Origin “poison” bit for also deferring downgrades
• Revolutionary changes
  – “Invalidate TLB entry” instruction (e.g., PowerPC)
  – No TLB (e.g., Berkeley SPUR)
    » Use virtual L1 caches, so address translation happens only on a miss
    » On miss, walk the page table (PTEs will often be cached normally)
    » PTE changes kept coherent by normal cache coherence

Virtual Caches & Synonyms

• Problem
  – Synonyms: V0 & V1 map to P1
  – When doing coherence on a block in P1, how do you find V0 & V1?
• Option 1: don’t do virtual caches (most common today)
• Option 2: don’t allow synonyms
  – Probably use a segmented global address space
  – E.g., Berkeley SPUR had each process pick 4 of 256 1GB segments
  – Still requires reverse address translation
• Option 3: allow virtual cache & synonyms
  – How do we implement reverse address translation?
  – See Wang et al. next

Wang et al. [ISCA89]

• Basic idea
  – Extended Goodman one-level cache idea [ASPLOS87]
  – Virtual L1 and physical L2
  – Do coherence on physical addresses
  – Each L2 block maintains a pointer to the corresponding L1 block (if any)
    » Requires log2(#L1_blocks) - log2(page_size / block_size) bits
  – Never allow a block to be simultaneously cached under synonyms
• Example where V0 & V1 map to P2
  – Initially V1 in L1, and P2 in L2 points to V1
  – Processor references V0
  – L1 miss
  – L2 detects synonym in L1
  – Change L1 tag and L2 pointer so that L1 has V0 instead of V1
  – Resume

Virtual Caches & Homonyms

• Homonym
  – “Pool” of water and “pool” the game
  – V0 of one process maps to P2, while V0 of another process maps to P3
• Flush cache on context switch
  – Simple but performs poorly
• Address-space IDs (ASIDs)
  – In architecture & part of context state
• Mapping-valid bit of Wang et al.
  – Add mapping-valid as a “second” valid bit on L1 cache block
  – On context switch, do “flash clear” of mapping-valid bits
  – Interesting case is a valid block with mapping invalid
    » On processor access, re-validate mapping
    » On replacement (i.e., writeback), treat as valid block
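A toy sketch (invented structures) of the Wang et al. synonym mechanism above: the physical L2 remembers which virtual tag currently caches each block, so a synonym access re-tags the L1 copy instead of duplicating it.

    class SynonymFilter:
        def __init__(self, translate):
            self.translate = translate      # virtual addr -> physical addr
            self.l1 = {}                    # virtual addr -> data (virtual L1)
            self.l2_ptr = {}                # physical addr -> virtual addr in L1

        def access(self, vaddr):
            if vaddr in self.l1:
                return self.l1[vaddr]       # L1 hit under this virtual name
            paddr = self.translate(vaddr)
            old = self.l2_ptr.get(paddr)
            if old is not None:             # synonym cached under another name
                self.l1[vaddr] = self.l1.pop(old)    # re-tag the L1 block
            else:
                self.l1[vaddr] = f"data@{paddr}"     # normal fill from L2/memory
            self.l2_ptr[paddr] = vaddr      # L2 pointer now names this synonym
            return self.l1[vaddr]

    # V0 and V1 both map to physical block P2:
    c = SynonymFilter(lambda v: 'P2')
    c.access('V1'); print(c.access('V0'))   # second access re-tags; no duplicate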

Outline for Implementing Snooping

• Coherence Control Implementation
• Writebacks, Non-Atomicity, & Serialization/Order
• Hierarchical Cache
• Split Buses
• Deadlock, Livelock, & Starvation
• Three Case Studies
• TLB Coherence
• Virtual Cache Issues

Sun UltraEnterprise 10000 (Starfire)

• Shared-wire bus is bottleneck in snooping systems
  – Tough to implement at high speed
  – Centralized shared resource
• Solution: multiple “logical buses”
• PRESENTATION

Outline

• Motivation for Cache-Coherent Shared Memory
• Snooping Cache Coherence
• Implementing Snooping Systems
• Advanced Snooping Systems
  – Sun UltraEnterprise 10000
  – Multicast Snooping (Wisconsin)

Multicast Snooping

• Bus is bottleneck in snooping systems
  – But why broadcast requests when we can multicast?
• PRESENTATION

Outline

• Motivation for Cache-Coherent Shared Memory
• Snooping Cache Coherence
• Implementing Snooping Systems
• Advanced Snooping Systems
