ECE 259 / CPS 221: Advanced Computer Architecture II (Parallel Computer Architecture)
Shared Memory MPs – Coherence & Snooping

Copyright 2006 Daniel J. Sorin, Duke University
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

Outline

• Motivation for Cache-Coherent Shared Memory
• Snooping Cache Coherence (Chapter 5)
  – Basic systems
  – Design tradeoffs
• Implementing Snooping Systems (Chapter 6)
• Advanced Snooping Systems

What is (Hardware) Shared Memory?

• Take multiple microprocessors
• Implement a memory system with a single global physical address space (usually)
  – Communication assist HW does the “magic” of cache coherence
• Goal 1: Minimize memory latency
  – Use co-location & caches
• Goal 2: Maximize memory bandwidth
  – Use parallelism & caches

Some Memory System Options

[Figure: four organizations – (a) shared cache: processors P1…Pn share an interleaved first-level cache above interleaved main memory; (b) bus-based shared memory: per-processor caches on a shared bus with main memory and I/O devices; (c) dancehall: per-processor caches on one side of an interconnection network, memories on the other; (d) distributed memory: a memory module co-located with each processor, all connected by an interconnection network]

Cache Coherence

• According to Webster’s dictionary …
  – Cache: a secure place of storage
  – Coherent: logically consistent
• Cache Coherence: keep storage logically consistent
  – Coherence requires enforcement of 2 properties
  1) Write propagation
     – All writes eventually become visible to other processors
  2) Write serialization
     – All processors see writes to the same block in the same order

Why Cache Coherent Shared Memory?

• Pluses
  – For applications – looks like multitasking uniprocessor
  – For OS – only evolutionary extensions required
  – Easy to do communication without OS
  – Software can worry about correctness first and then performance
• Minuses
  – Proper synchronization is complex
  – Communication is implicit, so it may be harder to optimize
  – More work for hardware designers (i.e., us!)
• Result
  – Symmetric Multiprocessors (SMPs) are the most successful parallel machines ever
  – And the first with multi-billion-dollar markets!

Cache Coherence In More Detail

• Efficient naming
  – Virtual to physical mapping with TLBs
  – Ability to name relevant portions of objects
• Easy and efficient caching
  – Caching is natural and well-understood
  – Can be done in HW automatically
• Low communication overhead
  – Low overhead since protection is built into memory system
  – Easy for HW (i.e., cache controller) to packetize requests and replies
• Integration of latency tolerance
  – Demand-driven: consistency models, prefetching, multithreading
  – Can extend to push data to PEs and use bulk transfer

Symmetric Multiprocessors (SMPs)

• Multiple microprocessors
• Each has a cache (or multiple caches in a hierarchy)
• Connect with logical bus (totally-ordered broadcast)
  – Physical bus = set of shared wires
  – Logical bus = functional equivalent of physical bus
• Implement Snooping Cache Coherence Protocol
  – Broadcast all cache misses on bus
  – All caches “snoop” bus and may act (e.g., respond with data)
  – Memory responds otherwise

Outline

• Motivation for Cache-Coherent Shared Memory
• Snooping Cache Coherence (Chapter 5)
  – Basic systems
  – Design tradeoffs
• Implementing Snooping Systems (Chapter 6)
• Advanced Snooping Systems

Cache Coherence Problem (Step 1)

[Figure: P1 and P2 connect through an interconnection network to a main memory holding x; P1 executes ld r2, x and brings a copy of x into its cache]

Cache Coherence Problem (Step 2)

[Figure: P2 then also executes ld r2, x; both P1 and P2 now hold cached copies of x]

Cache Coherence Problem (Step 3)

[Figure: one of the processors now executes add r1, r2, r4 and st x, r1, updating its cached copy of x; the other processor’s cached copy (and main memory) are now stale – this is the cache coherence problem]

Snooping Cache-Coherence Protocols

• Bus provides serialization point (more on this later)
• Each cache controller “snoops” all bus transactions
  – Transaction is relevant if it is for a block this cache contains
  – Take action to ensure coherence
    » Invalidate
    » Update
    » Supply value to requestor if Owner
  – Actions depend on the state of the block and the protocol
• Main memory controller also snoops on bus
  – If no cache is owner, then memory is owner
• Simultaneous operation of independent controllers

Simple 2-State Invalidate Snooping Protocol

• Write-through, no-write-allocate cache
• Proc actions: Load, Store
• Bus actions: GETS, GETX
• Transitions (notation: observed event / action taken):
  – Valid: Load / --; Store / OwnGETX; OtherGETS / --; OtherGETX / -- (go to Invalid)
  – Invalid: Load / OwnGETS (go to Valid); Store / OwnGETX; OtherGETS / --; OtherGETX / --
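To make the notation concrete, here is a minimal Python sketch of the 2-state protocol above. This is our own illustration, not course code; the class and event names are invented, and a real controller is of course hardware, not software.

    # Minimal sketch of the 2-state write-through, no-write-allocate
    # snooping protocol described above.

    class Bus:
        def __init__(self):
            self.caches = []
        def broadcast(self, txn, block, requestor):
            for cache in self.caches:
                if cache is not requestor:
                    cache.snoop(txn, block)

    class TwoStateCache:
        def __init__(self, bus):
            self.state = {}                 # block -> 'V'; absent means Invalid
            self.bus = bus
            bus.caches.append(self)

        def load(self, block):              # Load / OwnGETS on a miss
            if block not in self.state:
                self.bus.broadcast('GETS', block, self)
                self.state[block] = 'V'

        def store(self, block):             # Store / OwnGETX (write-through)
            self.bus.broadcast('GETX', block, self)
            # no-write-allocate: a miss stays Invalid, a hit stays Valid

        def snoop(self, txn, block):
            if txn == 'GETX':               # OtherGETX / -- : invalidate copy
                self.state.pop(block, None)
            # OtherGETS needs no action: memory is up-to-date (write-through)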

Snooping Design Choices

• Controller updates state of blocks in response to processor (ld/st) and snoop (observed bus transaction) events, and generates bus transactions
• Often have duplicate cache tags
• Snooping protocol
  – Set of states, set of events
  – State-transition diagram
  – Actions
• Basic choices
  – Write-through vs. write-back
  – Invalidate vs. update
  – Which states to allow

A 3-State Write-Back Invalidation Protocol

• 2-State Protocol
  + Simple hardware and protocol
  – Uses lots of bandwidth (every write goes on bus!)
• 3-State Protocol (MSI)
  – Modified
    » One cache exclusively has valid (modified) copy → Owner
    » Memory is stale
  – Shared
    » >= 1 cache and memory have valid copy (memory = owner)
  – Invalid (only memory has a valid copy, and memory is owner)
• Must invalidate all other copies before entering Modified state
  – Requires a bus transaction (to order and invalidate)

MSI Processor and Bus Actions

• Processor:
  – Load
  – Store
  – Writeback on replacement of modified block
• Bus:
  – GetShared (GETS): get without intent to modify; data could come from memory or another cache
  – GetExclusive (GETX): get with intent to modify; must invalidate all other caches’ copies
  – PutExclusive (PUTX): cache controller puts contents on bus, and memory is updated
  – Definition: a cache-to-cache transfer occurs when another cache satisfies a GETS or GETX request
• Let’s draw it!

An MSI Protocol Example

  Proc Action     P1 state   P2 state   P3 state   Bus Act   Data from
  initially       I          I          I          --        --
  1. P1 load u    I→S        I          I          GETS      Memory
  2. P3 load u    S          I          I→S        GETS      Memory
  3. P3 store u   S→I        I          S→M        GETX      Memory or P1 (?)
  4. P1 load u    I→S        I          M→S        GETS      P3’s cache
  5. P2 load u    S          I→S        S          GETS      Memory

• Single-writer, multiple-reader protocol
• Why Modified to Shared in line 4?
• What if the block is not in any cache? Memory responds
• Read then write produces 2 bus transactions
  – Slow and wasteful of bandwidth for a common sequence of actions
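As a sanity check, here is a small Python sketch (invented classes, not course code) that replays the five-step example and prints the data source for each bus transaction:

    # Replays the MSI example above; output matches the "Data from" column.

    M, S, I = 'M', 'S', 'I'

    class MSICache:
        def __init__(self, name):
            self.name, self.state = name, I

        def load(self, bus):
            if self.state == I:
                bus.issue(self, 'GETS')
                self.state = S

        def store(self, bus):
            if self.state != M:
                bus.issue(self, 'GETX')
                self.state = M

        def snoop(self, txn):
            if txn == 'GETS' and self.state == M:
                self.state = S       # owner supplies data; memory updated
            elif txn == 'GETX':
                self.state = I       # invalidate on another writer

    class LogicalBus:
        def __init__(self, caches):
            self.caches = caches
        def issue(self, requestor, txn):
            owner = next((c for c in self.caches
                          if c is not requestor and c.state == M), None)
            source = f"{owner.name}'s cache" if owner else 'Memory'
            for c in self.caches:
                if c is not requestor:
                    c.snoop(txn)
            print(f"{requestor.name} {txn}: data from {source}")

    p1, p2, p3 = (MSICache(n) for n in ('P1', 'P2', 'P3'))
    bus = LogicalBus([p1, p2, p3])
    p1.load(bus); p3.load(bus); p3.store(bus); p1.load(bus); p2.load(bus)
    # -> Memory, Memory, Memory, P3's cache, Memory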

MSI State Diagram

• Transitions (observed event / action taken):
  – M: Load / --; Store / --; OtherGETS / -- (to S); OtherGETX / -- (to I); Writeback / OwnPUTX (to I)
  – S: Load / --; Store / OwnGETX (to M); OtherGETS / --; OtherGETX / -- (to I); Writeback / -- (to I, silent)
  – I: Load / OwnGETS (to S); Store / OwnGETX (to M)
• Note: we never take any action on an OtherPUTX

4-State (MESI) Invalidation Protocol

• Often called the Illinois protocol
  – Modified (dirty)
  – Exclusive (clean unshared): only copy, not dirty
  – Shared
  – Invalid
• Requires a shared signal to detect whether other caches have a copy of the block
  – I→S if shared signal is asserted by any other cache
  – I→E otherwise
• E→M transition doesn’t require a bus transaction
  – Why is this a good thing?

More Generally: MOESI

• [Sweazey & Smith, ISCA86]
• M – Modified (dirty)
• O – Owned (dirty but shared) WHY?
• E – Exclusive (clean unshared): only copy, not dirty
• S – Shared
• I – Invalid
• [Figure: MOESI states organized by validity (M, O, E, S), exclusiveness (M, E), and ownership (M, O)]
• Variants
  – MSI
  – MESI
  – MOSI
  – MOESI

4-State Write-Back Update Protocol

• Dragon (Xerox PARC)
• States
  – Exclusive (E): one copy, clean, memory is up-to-date
  – Shared-Clean (SC): could be two or more copies, memory unknown
  – Shared-Modified (SM): could be two or more copies, memory stale
  – Modified (M)
• On a write, send new data to update other caches
• Adds Update bus transaction
  – Do not confuse Update with Upgrade!
• Adds cache controller Update operation
• Must obtain bus before updating local copy
  – Bus is serialization point → important for consistency (later …)

Tradeoffs in Protocol Design (C&S 5.4)

• Design issues:
  – Which state transitions
  – What types of bus transactions
  – Cache block size
  – Cache associativity
  – Write-back vs. write-through caching
  – Etc.
• Methodology: count protocol state transitions
  – Can then compute bandwidth, miss rates, etc.
  – Results depend on workload (different workload → different transition rates)

Computing Bandwidth

• Why bandwidth?
• Monitor state transitions
  – Count bus transactions
  – We know how many bytes each bus transaction requires

MESI State Transitions and Bandwidth

• Bytes on bus per transition (6 bytes of address/command, plus 64 bytes when a data block moves):

  FROM/TO   NP            I             E             S             M
  NP        --            --            GETS (6+64)   GETS (6+64)   GETX (6+64)
  I         --            --            GETS (6+64)   GETS (6+64)   GETX (6+64)
  E         --            --            --            --            --
  S         --            --            NA            --            Upgr (6)
  M         PUTX (6+64)   PUTX (6+64)   NA            (6+64)        --

Study #1: Bandwidth of MSI vs. MESI

• Assume an X MIPS/MFLOPS processor
  – Use with measured state transition counts to obtain transitions/sec
• Compute state transitions/sec
• Compute bus transactions/sec
• Compute bytes/sec
• What is the BW savings of MESI over MSI?
• Difference between protocols is the Exclusive state
  – For blocks MESI would hold in E, MSI (which holds them in S) requires an extra Upgrade to reach M
• Result is very small benefit! Why?
  – Small number of E→M transitions (depends on workload!)
  – Upgrade consumes only 6 bytes on bus
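The counting methodology boils down to a weighted sum: transitions per second, times bus bytes per transition. A sketch with made-up transition rates (the byte costs follow the table above; the rates are assumptions, not measurements):

    BYTES = {'GETS': 6 + 64, 'GETX': 6 + 64, 'Upgr': 6, 'PUTX': 6 + 64}

    transitions_per_sec = {      # assumed workload measurements, not real data
        'GETS': 200_000, 'GETX': 50_000, 'Upgr': 10_000, 'PUTX': 40_000,
    }

    bus_bytes_per_sec = sum(BYTES[t] * rate
                            for t, rate in transitions_per_sec.items())
    print(f"{bus_bytes_per_sec / 1e6:.2f} MB/s of bus traffic")   # 20.36 MB/s

Comparing MSI to MESI is then just rerunning the sum with the E→M Upgrades removed; since each Upgrade is only 6 bytes, the difference is tiny.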

Study #2: MSI Upgrade vs. GETX

• MSI S→M transition issues an Upgrade
  – Could have block invalidated while waiting for Upgrade response
  – Adds complexity to detect this (and deal with it correctly)
• Could instead just issue GETX
  – Pretend like block is in Invalid state
• Upgrade achieves ~10% to 20% improvement
  – Compared to just issuing GETX
  – Application dependent (as are all studies in architecture!)

Cache Block Size

• Block size is unit of transfer and of coherence
  – Doesn’t have to be; could do coherence at a smaller granularity [Goodman]
• Uniprocessor 3Cs
  – Compulsory, Capacity, Conflict
• Shared memory adds Coherence miss type (a 4th C)
  – True sharing miss fetches data written by another processor
  – False sharing miss results from independent data in same coherence block
• Increasing block size
  – Usually fewer 3C misses but more bandwidth
  – Usually more false sharing misses
• P.S. on increasing cache size
  – Usually fewer capacity/conflict misses (& compulsory don’t matter)
  – No effect on true/false “coherence” misses (so they may come to dominate)

Study #3: Invalidate vs. Update

• Pattern 1 (one write before reads):

    for i = 1 to k
        P1(write, x);          // one write before reads
        P2 to PN-1(read, x);
    end for i

• Pattern 2 (many writes before reads):

    for i = 1 to k
        for j = 1 to m
            P1(write, x);      // many writes before reads
        end for j
        P2(read, x);
    end for i

Invalidate vs. Update, cont.

• Pattern 1 (one write before reads), with N = 16, M = 10, K = 10
  – Update
    » Iteration 1: N regular cache misses (70 bytes each)
    » Each remaining iteration: one update (14 bytes: 6 control, 8 data)
    » Total update traffic = 16*70 + 9*14 = 1,246 bytes
    » (Book assumes 10 updates instead of 9…)
  – Invalidate
    » Iteration 1: N regular cache misses (70 bytes each)
    » Each remaining iteration: P1 generates an upgrade (6 bytes) and the 15 other processors each take a read miss (70 bytes)
    » Total invalidate traffic = 16*70 + 9*6 + 15*9*70 = 10,624 bytes
• Pattern 2 (many writes before reads)
  – Update = 1,400 bytes
  – Invalidate = 824 bytes
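These totals can be checked mechanically; this short script (same parameters as the slide: N = 16 processors, K = 10 iterations, M = 10 writes per iteration) reproduces all four numbers:

    N, K, M = 16, 10, 10
    MISS, UPDATE, UPGRADE = 70, 14, 6          # bytes per bus transaction

    # Pattern 1: one write, then N-1 readers, per iteration
    update_1 = N * MISS + (K - 1) * UPDATE                       # 1,246
    inval_1  = N * MISS + (K - 1) * (UPGRADE + (N - 1) * MISS)   # 10,624

    # Pattern 2: M writes, then one read by P2, per iteration
    update_2 = K * M * UPDATE                                    # 1,400
    inval_2  = MISS + (K - 1) * UPGRADE + K * MISS               # 824

    print(update_1, inval_1, update_2, inval_2)   # 1246 10624 1400 824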

Invalidate vs. Update, cont.

• What about real workloads?
  – Update can generate too much traffic
  – Must selectively limit it
• Current assessment
  – Update very hard to implement correctly (because of consistency … discussion coming in a couple of weeks)
  – Rarely done
• Future assessment
  – May be same as current, or
  – Chip multiprocessors may revive update protocols
    » More intra-chip bandwidth
    » Easier to have predictable timing paths?

Outline

• Motivation for Cache-Coherent Shared Memory
• Snooping Cache Coherence (Chapter 5)
• Implementing Snooping Systems (Chapter 6)
• Advanced Snooping Systems

Review: Symmetric Multiprocessors (SMP)

• Multiple (micro-)processors
• Each has cache (today, a cache hierarchy)
• Connect with logical bus (totally-ordered broadcast)
• Implement Snooping Cache Coherence Protocol
  – Broadcast all cache “misses” on bus
  – All caches “snoop” bus and may act
  – Memory responds otherwise

Implementation Issues

• How does memory know another cache will respond, so it doesn’t have to?
• Is it okay if a cache miss is not an atomic event (check tags, queue for bus, get bus, etc.)?
• What about L1/L2 caches & split-transaction buses?
• Can we guarantee we won’t get deadlock?
• What happens on a PTE update with multiple TLBs?
• Can one use virtual caches in SMPs?

This is why they pay architects the big bucks!

Review: MSI State Diagram

• Same protocol as before (observed event / action taken):
  – M: Load / --; Store / --; OtherGETS / -- (to S); OtherGETX / -- (to I); Writeback / OwnPUTX (to I)
  – S: Load / --; Store / OwnGETX (to M); OtherGETS / --; OtherGETX / -- (to I); Writeback / -- (to I, silent)
  – I: Load / OwnGETS (to S); Store / OwnGETX (to M)
• Note: we never take any action on an OtherPUTX

Outline for Implementing Snooping

• Coherence Control Implementation
• Writebacks, Non-Atomicity
• Hierarchical Caches
• Split Buses
• Deadlock, Livelock, & Starvation
• Three Case Studies
• TLB Coherence
• Virtual Cache Issues

Snooping SMP Design Goals

• Goals
  – Correctness
  – High performance
  – Simple hardware (reduced complexity & cost)
• Conflicts between goals
  – High performance → multiple outstanding low-level events → more complex interactions → more potential correctness bugs

Cache Controllers and Tags

• On a miss in a uniprocessor:
  – Assert request for memory bus
  – Wait for bus grant
  – Drive address and command lines
  – Wait for command to be accepted by relevant device
  – Transfer data
• In a snoop-based multiprocessor, cache controller must also:
  – Monitor bus and serve processor
    » Can view as two controllers: bus-side and processor-side
    » With single-level cache: dual tags (not data) or dual-ported tag RAM
    » Synchronize tags on updates
  – Respond to bus transactions when necessary

Base Cache Coherence Design

• Single-level write-back cache
• Invalidation protocol
• One outstanding memory request per processor
• Atomic memory bus transactions
  – No interleaving of transactions
• Atomic operations within a process
  – One finishes before next in program order
• Now, we’re going to gradually add complexity
  – Why? Faster latencies and higher bandwidths!
  – But we’ll stick with invalidation protocol (instead of update)

Reporting Snoop Results: How?

• Collective response from caches must appear on bus
• Wired-OR signals
  – Shared: asserted if any cache has a copy (used for E state)
  – Dirty/Inhibit: asserted if some cache has a dirty copy
    » Don’t need to know which, since it will do what’s necessary
  – Snoop-valid: asserted when OK to check other two signals
• May require priority scheme for cache-to-cache transfers
  – Which cache should supply data when in shared state?
  – Commercial implementations allow memory to provide data
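A tiny sketch (our own illustration; the signal names follow the slide) of how the wired-OR lines summarize per-cache snoop results for the memory controller:

    def combine_snoop_results(per_cache):
        """per_cache: list of (shared, dirty) pairs, one per snooping cache."""
        shared = any(s for s, d in per_cache)   # a copy exists -> fill in S, not E
        dirty  = any(d for s, d in per_cache)   # a dirty owner exists -> inhibit
                                                # memory; owner supplies the data
        return shared, dirty

    # One cache holds the block Modified, two others do not have it:
    print(combine_snoop_results([(False, False), (True, True), (False, False)]))
    # -> (True, True)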

Reporting Snoop Results: When?

• Memory needs to know what, if anything, to do
• Static delay: fixed number of clocks from address appearing on bus
  – Dual tags required to reduce contention with processor
  – Still must be conservative (update both on write: E → M)
  – Pentium Pro, HP servers, Sun Enterprise (pre E-10K)
• Variable delay
  – Memory assumes cache will supply data until all say “sorry”
  – Less conservative, more flexible, more complex
  – Memory can fetch data early and hold (SGI Challenge)
• Immediately: bit-per-block state in memory
  – HW complexity in commodity main memory system

Base Organization

[Figure: single-level cache with dual tags. The processor-side controller and the bus-side (snoop) controller each have their own copy of the tags and state; both share the cache data RAM. A write-back buffer with its own tag and comparator sits between the cache and the system (Addr/Cmd/Data) bus and is also snooped; a snoop-state/data buffer interfaces to the system bus]

Writebacks

• Must allow processor to proceed on a miss
  – Fetch the block
  – Perform writeback later
• Need writeback buffer
  – Must handle bus transactions in writeback buffer
    » Snoop writeback buffer
  – Must care about the order of reads and writes
  – Affects memory consistency model (yuck – trust me on this for now)

Optimization #1: Non-Atomic State Transitions

• Operations involve multiple actions
  – Look up cache tags
  – Bus arbitration
  – Check for outstanding writeback
  – Even if bus is atomic, overall set of actions is not
  – Race conditions among multiple operations
• Suppose P1 and P2 attempt to write cached block A
  – Each decides to issue an Upgrade to transition from S → M
• Issues
  – Handle requests for other blocks while waiting to acquire bus
  – Must handle requests for this block A
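One way to handle this race is with transient states, which the next slide introduces. Here is a sketch (invented names, heavily simplified): between issuing the Upgrade and winning the bus, an observed invalidation demotes the request to a full GETX.

    class Controller:
        def __init__(self):
            self.state = 'S'            # block A cached in Shared

        def processor_store(self):
            if self.state == 'S':
                self.state = 'S->M'     # transient: Upgrade queued, bus not won

        def snoop_invalidation(self):   # observed OtherGETX or OtherUpgrade
            if self.state == 'S->M':
                self.state = 'I->M'     # lost the race: our copy is now stale
            elif self.state in ('S', 'M'):
                self.state = 'I'

        def bus_grant(self):
            if self.state == 'S->M':
                txn = 'Upgrade'         # data still valid; just order + invalidate
            else:                       # 'I->M'
                txn = 'GETX'            # must also fetch the data
            self.state = 'M'
            return txn

    # P1 and P2 both try to upgrade block A; P2 wins the bus first:
    p1, p2 = Controller(), Controller()
    p1.processor_store(); p2.processor_store()
    print(p2.bus_grant())               # -> Upgrade
    p1.snoop_invalidation()
    print(p1.bus_grant())               # -> GETX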

Non-Atomicity → Transient States

• Two types of states
  – Stable (e.g., MOESI)
  – Transient or Intermediate (e.g., S→M: request issued, bus not yet granted)
• Increases complexity

[Figure: MSI-like diagram extended with transient states I→S,E, I→M, and S→M between issuing a bus request (PrRd/BusReq, PrWr/BusReq) and receiving the grant (BusGrant/BusRd, BusGrant/BusRdX, BusGrant/BusUpgr); a cache in S→M that observes another BusRdX falls back to I→M. Note: details of this figure aren’t important]

Optimization #2: Multi-level Cache Hierarchies

• How to snoop with multi-level caches?
  – Independent at every level?
  – Maintain cache inclusion?
• Requirements for Inclusion
  – Data in higher level (closer to processor) is a subset of data in lower level
  – Modified in higher level → marked modified in lower level
• Then we only need to snoop the lowest-level cache
  – If L2 says a block is not present (or not modified), the same holds in L1
• Is inclusion automatically preserved?
  – Replacements: all higher-level misses go to lower level
  – Not necessarily – see the violations on the next slide

Violations of Inclusion

• The L1 and L2 may choose to replace different blocks
  – Differences in reference history
    » Set-associative first-level cache with LRU replacement
  – Split higher-level caches
    » Instruction & data blocks go in different caches at L1, but collide in L2
    » What if L2 is set-associative?
  – Differences in block size
• But a common case works automatically
  – L1 direct-mapped, and
  – L1 has fewer sets than L2, and
  – L1 and L2 have same block size

Inclusion: To Be or Not To Be

• Most common inclusion solution
  – Ensure L2 holds a superset of L1I and L1D
  – On an L2 replacement, or a coherence request that must source data or invalidate, forward the action to the L1 caches
  – Can maintain bits in L2 cache to filter some actions from forwarding
• But inclusion may not be ideal
  – Restricted associativity in unified L2 can limit blocks in split L1s
  – Not that hard to always snoop L1s
  – If L2 isn’t much bigger than L1, then inclusion is wasteful
• Thus, many new designs don’t maintain inclusion
  – Exclusion: no block is in more than one cache
  – Not Inclusive != Exclusive, and Not Exclusive != Inclusive
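A toy sketch (invented structures) of the common solution above: the inclusive L2 forwards its evictions and invalidations down to L1, so snooping only the L2 remains safe.

    class InclusiveHierarchy:
        def __init__(self):
            self.l1, self.l2 = set(), set()   # sets of cached block addresses

        def fill(self, block):
            self.l2.add(block)                # inclusion: L1 is a subset of L2
            self.l1.add(block)

        def l2_evict_or_invalidate(self, block):
            self.l2.discard(block)
            self.l1.discard(block)            # forward to L1 to keep inclusion

        def snoop_hit(self, block):
            # only the lowest level is checked; an L2 miss implies an L1 miss
            return block in self.l2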

Optimization #3: Split-Transaction (Pipelined) Bus

• Supports multiple simultaneous transactions
  – Higher throughput!! (perhaps worse latency)

[Figure: on an atomic bus, each request holds the bus through the memory-access delay until its response; on a split-transaction bus, the requests and responses of different transactions overlap in time]

Potential Problems

• Two transactions to same block (conflicting)
  – Mid-transaction snoop hits
  – E.g., in S, going to M, observe OtherGETX
• Buffering requests and responses
  – Need flow control to prevent deadlock
• Ordering of snoop responses
  – When does snoop response appear with respect to data response?

One Solution (like the SGI PowerPath-2)

• NACK (Negative ACKnowledgment) for flow control
  – Snooper can NACK a transaction if it can’t buffer it
• Out-of-order responses
  – Snoop results presented with data response
• Disallow multiple concurrent transactions to one line
  – Not necessary, but it can improve designer sanity

Serialization Point in Split-Transaction Buses

• Is the bus still the serialization point?
  – Yes! When a request wins the bus, it is serialized (unless NACKed)
  – Data and snoop response can show up way later
  – Snoop decisions are made based on what’s been serialized
• Example (allows multiple outstanding to same block)
  – Initially: block B is Invalid in all caches
  – P1 issues GETX for B, waits for bus
  – P2 issues GETX for B, waits for bus
  – P2’s request wins the bus (but no data from memory until later)
  – P1’s request wins the bus … who responds?
  – P2 will respond, since P2 is the owner (even before data arrives!)
  – P2 receives data from memory
  – P2 sends data to P1
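A sketch (invented structures, much simplified) of the example above: ownership transfers at request-serialization time, so P2 owes P1 the data even before P2's own data has arrived.

    owner = {}                           # block -> owner fixed at serialization

    def serialize_getx(proc, block):
        previous = owner.get(block, 'Memory')
        owner[block] = proc              # requestor becomes owner immediately
        return previous                  # ...and the previous owner must respond

    print(serialize_getx('P2', 'B'))     # -> Memory (memory will supply P2 later)
    print(serialize_getx('P1', 'B'))     # -> P2 (P2 responds once it has the data)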

A More General Split-Transaction Bus Design

• 4 buses + flow control and snoop results
  – Command (type of transaction)
  – Address
  – Tag (unique identifier for response)
  – Data (doesn’t require address)
• Forms of coherence transactions
  – GETS, GETX (both are “request + response”)
  – PUTX (“request + data”)
  – Upgrade (“request” only)
• Per-processor request table tracks all outstanding transactions

Multi-level Caches with Split-Transaction Bus

• General structure uses queues between
  – Bus and L2 cache
  – L2 cache and L1 cache
• Many potential deadlock problems
• Classify all messages to break cyclic dependences
  – Requests only generate responses
  – Responses don’t generate any other messages
• Requestor guarantees space for all responses
• Use separate request and response queues
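A small sketch (an invented queue/credit scheme, not the book's design) of those rules: requests and responses use separate queues, and a request may only be issued once buffer space for its response is reserved, so responses can always be sunk.

    from collections import deque

    class NodeQueues:
        def __init__(self, max_outstanding):
            self.requests  = deque()
            self.responses = deque()
            self.credits   = max_outstanding    # reserved response-buffer slots

        def try_issue_request(self, req):
            if self.credits == 0:
                return False                    # stall a REQUEST, never a response
            self.credits -= 1                   # reserve room for its response
            self.requests.append(req)
            return True

        def sink_response(self, resp):
            self.responses.append(resp)         # guaranteed room (credit taken)
            self.credits += 1                   # consuming it frees the slot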

Multi-Level Caches with Split Bus

[Figure: each processor’s L1 and L2 caches are connected by response/request queues in both directions (L1→L2 and L2→L1), and each L2 connects to the bus through separate request and response queues]

More on Correctness

• Partial correctness (never wrong): maintain coherence and consistency
• Full correctness (always right): prevent the following:
• Deadlock:
  – All system activity ceases
  – Cycle of resource dependences
• Livelock:
  – No processor makes forward progress
  – Constant on-going transactions at hardware level
  – E.g., simultaneous writes in invalidation-based protocol
• Starvation:
  – Some processors make no forward progress
  – E.g., interleaved memory system with NACK on bank busy

Deadlock, Livelock, Starvation

• Deadlock: can be caused by request-reply protocols
  – When issuing requests, must service incoming transactions
  – E.g., cache awaiting bus grant must snoop & flush blocks
  – Else may not respond to request that will release bus: deadlock
• Livelock:
  – Window of vulnerability problem [Kubiatowicz et al., MIT]
  – Handling invalidations between obtaining ownership & write
  – Solution: don’t let exclusive ownership be stolen before write
• Starvation:
  – Solve by using fair arbitration on bus and FIFO buffers

Deadlock Avoidance

• Responses are never delayed by requests waiting for a response
• Responses are guaranteed to be sunk
• Requests will eventually be serviced, since the number of responses is bounded by the number of outstanding requests
• Must classify messages according to deadlock and coherence semantics
  – If type 1 messages (requests) spawn type 2 messages (responses), then type 2 messages can’t be allowed to spawn type 1 messages
  – More generally, must avoid cyclic dependences among message types
    » We will see that directory protocols often have 3 message types
    » Request, ForwardedRequest, Response

SGI Challenge Overview

[Figure: (a) a four-processor board with R4400 CPUs and caches; (b) machine organization with interleaved memory (16 GB maximum), I/O subsystem (SCSI-2, VME-64, HPPI, graphics), and the Powerpath-2 bus (256 data, 40 address, 47.6 MHz)]

• 36 MIPS R4400 (peak 2.7 GFLOPS, 4 per board) or 18 MIPS R8000 (peak 5.4 GFLOPS, 2 per board)
• 8-way interleaved memory (up to 16 GB)
• 1.2 GB/s Powerpath-2 bus @ 47.6 MHz, 16 slots, 329 signals
• 128-byte lines (1 + 4 cycles)
• Split-transaction with up to 8 outstanding reads
  – All transactions take five cycles
• Miss latency nearly 1 µs (mostly on CPU board, not bus…)

Processor and Memory Systems

[Figure: four MIPS R4400 processors, each with an L2 cache and a CC chip holding duplicate tags, share one A (address) chip and four bit-sliced D (data) chips per board on the Powerpath-2 bus]

• 4 MIPS R4400 processors per board share A / D chips
• A chip has address bus interface, request table, control logic
• CC chip per processor has duplicate set of tags
• Processor requests go from CC chip to A chip to bus
• 4 bit-sliced D chips interface CC chip to bus

SGI Powerpath-2 Bus

• Non-multiplexed (i.e., separate A and D buses), 256-bit data / 40-bit address, 47.6 MHz, 8 outstanding requests
• Wide → more interface chips, so higher latency, but more bandwidth at a slower clock
• Large block size also calls for wider bus
• Uses Illinois MESI protocol (cache-to-cache sharing)
• More detail in chapter

Bus Design and Request-Response Matching

• Essentially two separate buses, arbitrated independently
  – “Request” bus for command and address
  – “Response” bus for data
• Out-of-order responses imply need for matching each request with its corresponding response
  – Request gets 3-bit tag when it wins arbitration (8 outstanding max)
  – Response includes data as well as corresponding request tag
  – Tags allow response to not use address bus, leaving it free
• Separate bus lines for arbitration and for snoop results

Bus Design (continued)

• Each of request and response phase is 5 bus cycles
  – Response: 4 cycles for data (128 bytes, 256-bit bus), 1 turnaround
  – Request phase: arbitration, resolution, address, decode, acknowledge
  – Request-response transaction takes 3 or more of these phases

[Figure: bus timing – the address bus repeats Arb, Rslv, Addr, Dcd, Ack cycles while data transfers (D0–D3) for earlier transactions overlap on the data bus; two pipelined read operations are shown]

1. Arbitration: at least one requestor asserts a bus request
2. Resolution: determine the winner
3. Address: winner drives the address
4. Decode: cache tags looked up; extend the ack cycle if the lookup cannot finish in time; determine who will respond, if any
5. Acknowledge: actual response comes later, with re-arbitration

• Write-backs have request phase only: arbitrate both data + address buses

Bus Design (continued)

• Flow control through negative acknowledgement (NACK)
• No conflicting requests for same block allowed on bus
  – 8 outstanding requests total makes conflict detection tractable
  – Eight-entry “request table” in each cache controller
  – New request on bus added to all tables at same index, determined by its tag
  – Entry holds address, request type, state in that cache (if determined already), …
  – All entries checked on bus or processor accesses for a match, so fully associative
  – Entry freed when response appears, so tag can be reassigned by bus
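A sketch (assumed fields, simplified) of that eight-entry request table: every controller inserts each serialized request at the index given by its 3-bit tag, checks all entries associatively, and frees the entry when the matching response appears.

    class RequestTable:
        SIZE = 8                                  # 8 outstanding requests max

        def __init__(self):
            self.entries = [None] * self.SIZE     # tag -> (addr, req_type)

        def insert(self, tag, addr, req_type):
            assert self.entries[tag] is None      # tag reused only after response
            self.entries[tag] = (addr, req_type)

        def conflicts(self, addr):
            # checked on every bus/processor access (fully associative)
            return any(e and e[0] == addr for e in self.entries)

        def respond(self, tag):
            addr, req_type = self.entries[tag]
            self.entries[tag] = None              # response frees tag for reuse
            return addr, req_type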

Bus Interface with Request Table

[Figure: cache controller bus interface. An eight-entry request table (fields: address, issue + merge check, originator, my-response, miscellaneous information) is checked by a comparator on every bus or processor access. Also shown: request + response queue to/from the cache, snoop-state logic, a write-back buffer with its own tag comparator, and a data buffer, all attached to the address + command bus and the data + tag bus]

Memory Access Latency

• 250 ns access time from address on bus to data on bus
• But overall latency seen by processor is 1000 ns!
  – 300 ns for request to get from processor to bus
    » Down through cache hierarchy, CC chip, and A chip
  – 400 ns later, data gets to D chips
    » 3 bus cycles to address phase of request transaction, 12 to access main memory, 5 to deliver data across bus to D chips
  – 300 ns more for data to get to processor chip
    » Up through D chips, CC chip, and the 64-bit-wide interface to the processor chip; load data into primary cache, restart pipeline

Challenge I/O Subsystem

[Figure: I/O board with personality ASICs (SCSI, VME, HPPI, graphics, peripherals) on a 320 MB/s HIO bus, connected through an address map / datapath system-bus interface to the system address bus and the 1.2 GB/s system data bus]

• Multiple I/O cards on system bus, each with a 320 MB/s HIO bus
  – Personality ASICs connect these to devices (standard and graphics)
• Proprietary HIO bus
  – 64-bit multiplexed address/data, split read transactions, up to 4 per device
  – Pipelined, but centralized arbitration, with several transaction lengths
  – Address translation via mapping RAM in system bus interface
• I/O board acts like a processor to the memory system

SUN Enterprise 6000 Overview

[Figure: CPU/memory cards (two processors with L2 caches plus a memory controller) and I/O cards, each with a bus interface, on the Gigaplane™ bus (256 data, 41 address, 83 MHz)]

• Up to 30 UltraSPARC processors, MOESI protocol
• Gigaplane™ bus has peak BW 2.67 GB/s, 300 ns access latency
• Up to 112 outstanding transactions (max 7 per board)
• 16 bus slots, for processing or I/O boards
  – 2 CPUs and 1 GB memory per board
  – Memory distributed, but protocol treats it as centralized (UMA)

Sun Gigaplane Bus

• Non-multiplexed, split-transaction, 256-bit data / 41-bit address, 83.5 MHz (plus 32 ECC lines, 7 tag, 18 arbitration, etc.; 388 signals total)
• Cards plug in on both sides: 8 per side
• 112 outstanding transactions, up to 7 from each board
  – Designed for multiple outstanding transactions per processor
• Emphasis on reducing latency, unlike Challenge
  – Speculative arbitration if address bus not scheduled from previous cycle
  – Else regular 1-cycle arbitration, and 7-bit tag assigned in next cycle
• Snoop result associated with request (5 cycles later)
• Main memory can stake claim to data bus 3 cycles into this, and start memory access speculatively
  – Two cycles later, asserts tag bus to inform others of coming transfer
• MOESI protocol

Enterprise Processor and Memory System

• 2 processors per board, external L2 caches, 2 memory banks with crossbar
• Data lines buffered through UDB to drive internal 1.3 GB/s UPA bus
• Wide path to memory, so a full 64-byte line transfers in 2 bus cycles

[Figure: board with two UltraSPARC processors (with L2 caches and tags), UDB data buffers, SysIO chips serving FiberChannel modules, 10/100 Ethernet, fast wide SCSI, and two 25 MHz SBus slots; memory of 16 72-bit SIMMs; D-tags, address controller, and data controller (crossbar) connecting to the Gigaplane connector]

Enterprise I/O System

• I/O board has same bus interface ASICs as processor boards
• But internal bus half as wide, and no memory path
• Only cache-block-sized transactions, like processing boards
  – Uniformity simplifies design
  – ASICs implement single-block cache, follow coherence protocol
• Two independent 64-bit, 25 MHz SBuses
  – One for two dedicated FiberChannel modules connected to disk
  – One for Ethernet and fast wide SCSI
  – Can also support three SBus interface cards for arbitrary peripherals
• Performance and cost of I/O scale with # of I/O boards

Sun Enterprise 10000 (aka E10K or Starfire)

• How far can you go with snooping coherence?
• Quadruple request/snoop bandwidth using four “logical” address buses
  – Each handles 1/4 of the physical address space
  – Impose logical ordering for consistency: for writes on the same cycle, those on bus 0 occur “before” bus 1, etc.
• Get rid of the data bandwidth problem: use a network
  – E10K uses a 16×16 crossbar between CPU boards & memory boards
  – Each CPU board has up to 4 CPUs: max 64 CPUs total
• 10.7 GB/s max BW, 468 ns unloaded miss latency
• We will discuss a paper on E10K later

Outline for Implementing Snooping

• Coherence Control Implementation
• Writebacks, Non-Atomicity, & Serialization/Order
• Hierarchical Cache
• Split Buses
• Deadlock, Livelock, & Starvation
• Three Case Studies
• TLB Coherence
• Virtual Cache Issues

Translation Lookaside Buffer

• Cache of page table entries
• Page table maps virtual page to physical frame

[Figure: virtual pages 0, 4, and 7 of a virtual address space mapping to physical frames 3, 4, and 7]

The TLB Coherence Problem

• Since the TLB is a cache, it must be kept coherent
• Change of PTE on one processor must be seen by all processors
• Why might a PTE be cached in more than 1 TLB?
  – Actual sharing of data
  – Process migration
• Changes are infrequent
  – Get the OS to do it

TLB Shootdown

• To modify a TLB entry, the modifying processor must
  – LOCK page table
  – Flush TLB entries
  – Queue TLB operations
  – Send inter-processor interrupt
  – Spin until other processors are done
  – UNLOCK page table
• SLOW!
• But most common solution today
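A pseudocode-style sketch (invented structures; a real OS uses page-table locks, IPIs, and per-CPU spinning) of the shootdown sequence above:

    class Cpu:
        def __init__(self):
            self.tlb = {}                   # vpn -> cached PTE

        def flush(self, vpn):               # runs in the IPI handler
            self.tlb.pop(vpn, None)

    def tlb_shootdown(page_table, vpn, new_pte, all_cpus):
        # 1. LOCK page table (elided here)
        page_table[vpn] = new_pte           # 2. modify the mapping
        for cpu in all_cpus:
            cpu.flush(vpn)                  # 3-5. queue op, send IPI, spin
                                            #      until every flush completes
        # 6. UNLOCK page table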

TLB Shootdown Improvements

• Evolutionary changes
  – Keep track of which processors ever had the mapping & only shoot them down
  – Defer shootdowns on “upgrade” changes (e.g., page from read-only to read-write)
  – SGI Origin “poison” bit for also deferring downgrades
• Revolutionary changes
  – “Invalidate TLB entry” instruction (e.g., PowerPC)
  – No TLB (e.g., Berkeley SPUR)
    » Use virtual L1 caches, so address translation happens only on a miss
    » On miss, walk the page table (PTEs will often be cached normally)
    » PTE changes kept coherent by normal cache coherence

Virtual Caches & Synonyms

• Problem
  – Synonyms: V0 & V1 map to P1
  – When doing coherence on a block in P1, how do you find V0 & V1?
• Option 1: don’t do virtual caches (most common today)
• Option 2: don’t allow synonyms
  – Probably use a segmented global address space
  – E.g., Berkeley SPUR had each process pick 4 of 256 1GB segments
  – Still requires reverse address translation
• Option 3: allow virtual cache & synonyms
  – How do we implement reverse address translation?
  – See Wang et al. next

Wang et al. [ISCA89]

• Basic idea
  – Extended Goodman one-level cache idea [ASPLOS87]
  – Virtual L1 and physical L2
  – Do coherence on physical addresses
  – Each L2 block maintains a pointer to the corresponding L1 block (if any)
    » Requires log2(#L1_blocks) - log2(page_size / block_size) bits
  – Never allow a block to be simultaneously cached under synonyms
• Example where V0 & V1 map to P2
  – Initially V1 in L1, and P2 in L2 points to V1
  – Processor references V0
  – L1 miss
  – L2 detects synonym in L1
  – Change L1 tag and L2 pointer so that L1 has V0 instead of V1
  – Resume

Virtual Caches & Homonyms

• Homonym
  – “Pool” of water and “pool” the game
  – V0 of one process maps to P2, while V0 of another process maps to P3
• Flush cache on context switch
  – Simple but performs poorly
• Address-space IDs (ASIDs)
  – In architecture & part of context state
• Mapping-valid bit of Wang et al.
  – Add mapping-valid as a “second” valid bit on L1 cache block
  – On context switch, do “flash clear” of mapping-valid bits
  – Interesting case is a valid block with mapping invalid
    » On processor access, re-validate mapping
    » On replacement (i.e., writeback), treat as valid block
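A toy sketch (invented structures) of the Wang et al. synonym mechanism above: the physical L2 remembers which virtual tag currently caches each block, so a synonym access re-tags the L1 copy instead of duplicating it.

    class SynonymFilter:
        def __init__(self, translate):
            self.translate = translate      # virtual addr -> physical addr
            self.l1 = {}                    # virtual addr -> data (virtual L1)
            self.l2_ptr = {}                # physical addr -> virtual addr in L1

        def access(self, vaddr):
            if vaddr in self.l1:
                return self.l1[vaddr]       # L1 hit under this virtual name
            paddr = self.translate(vaddr)
            old = self.l2_ptr.get(paddr)
            if old is not None:             # synonym cached under another name
                self.l1[vaddr] = self.l1.pop(old)    # re-tag the L1 block
            else:
                self.l1[vaddr] = f"data@{paddr}"     # normal fill from L2/memory
            self.l2_ptr[paddr] = vaddr      # L2 pointer now names this synonym
            return self.l1[vaddr]

    # V0 and V1 both map to physical block P2:
    c = SynonymFilter(lambda v: 'P2')
    c.access('V1'); print(c.access('V0'))   # second access re-tags; no duplicate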

Outline for Implementing Snooping

• Coherence Control Implementation
• Writebacks, Non-Atomicity, & Serialization/Order
• Hierarchical Cache
• Split Buses
• Deadlock, Livelock, & Starvation
• Three Case Studies
• TLB Coherence
• Virtual Cache Issues

Sun UltraEnterprise 10000 (Starfire)

• Shared-wire bus is bottleneck in snooping systems
  – Tough to implement at high speed
  – Centralized shared resource
• Solution: multiple “logical buses”
• PRESENTATION

Outline

• Motivation for Cache-Coherent Shared Memory
• Snooping Cache Coherence
• Implementing Snooping Systems
• Advanced Snooping Systems
  – Sun UltraEnterprise 10000
  – Multicast Snooping (Wisconsin)

Multicast Snooping

• Bus is bottleneck in snooping systems
  – But why broadcast requests when we can multicast?
• PRESENTATION

Outline

• Motivation for Cache-Coherent Shared Memory
• Snooping Cache Coherence
• Implementing Snooping Systems
• Advanced Snooping Systems
