Scalable Cache Coherence

[Figure: two multiprocessor organizations — CPUs/caches on a shared bus to memory vs. CPUs/caches attached through routers (R) in a point-to-point network to distributed memories; LdM/StM requests travel through the network]

• Part I: bus bandwidth
  • Replace the non-scalable bandwidth substrate (bus)…
  • …with a scalable one (point-to-point network, e.g., mesh)
• Part II: snooping bandwidth
  • Most snoops result in no action
  • Replace the non-scalable broadcast protocol (spam everyone)…
  • …with a scalable directory protocol (only notify processors that care)

Point-to-Point Interconnects
• Glueless MP: no need for additional “glue” chips
  + Can be arbitrarily large: 1000’s of processors
  • Massively parallel processors (MPPs)
  • Only government (DoD) has cache-coherent MPPs…
  • Companies have much smaller systems: 32–64 processors
  • “Scalable multi-processors”
• AMD Opteron/Phenom – point-to-point, glueless, broadcast
  • Non-uniform memory architecture (NUMA)
• Multicore: on-chip mesh interconnection networks


Directory Coherence Protocols
• Observe: address space statically partitioned
  + Can easily determine which memory module holds a given line
  • That memory module sometimes called the “home”
  – Can’t easily determine which processors have the line in their caches
• Bus-based protocol: broadcast events to all processors/caches
  ± Simple and fast, but non-scalable
• Directories: non-broadcast coherence protocol
  • Extend memory to track caching information
  • For each physical cache line whose home this is, track:
    • Owner: which processor has a dirty copy (i.e., M state)
    • Sharers: which processors have clean copies (i.e., S state)
  • Processor sends coherence events to the home directory
  • Home directory only sends events to processors that care
  • For multicore w/ shared L3, put directory info in cache tags
  • (a sketch of this bookkeeping follows below)

MSI Directory Protocol
• Processor side
  • Directory follows its own protocol (obvious in principle)
• Similar to bus-based MSI
  • Same three states (I, S, M)
  • Same five actions (keep BR/BW names)
  • Minus red arcs/actions
    • Events that would not trigger action anyway
    + Directory won’t bother you unless you need to act

[State diagram: states I, S, M with the usual MSI transitions, arcs labeled Load, Store, LdMiss, StMiss, and WB]
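The directory’s bookkeeping is small enough to sketch in a few lines of C. This is a hedged illustration under stated assumptions, not any real machine’s protocol: the dir_entry_t layout, the 64-processor sharer bitmask, and the send_* message helpers are all inventions for the sketch.

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical message primitives standing in for the interconnect;
   * real hardware sends packets, not function calls. */
  static void send_data_from_memory(int req)       { printf("mem -> P%d\n", req); }
  static void send_fetch(int owner, int req)       { printf("P%d -> P%d (fetch)\n", owner, req); }
  static void send_invalidate(int target, int req) { printf("inv P%d (for P%d)\n", target, req); }

  /* One directory entry, assuming at most 64 processors so the
   * sharer list fits in a single bitmask. */
  typedef struct {
      uint64_t sharers; /* bit i set => processor i has a clean (S) copy */
      int      owner;   /* processor holding the dirty (M) copy, or -1   */
  } dir_entry_t;

  /* Read miss (BR) arriving at the home: if some cache owns the line
   * in M, fetch from it (it writes back and downgrades to S);
   * otherwise memory supplies the data. Requester becomes a sharer. */
  void dir_read_miss(dir_entry_t *e, int req) {
      if (e->owner >= 0) {
          send_fetch(e->owner, req);
          e->sharers |= 1ull << e->owner;   /* old owner: M -> S */
          e->owner = -1;
      } else {
          send_data_from_memory(req);
      }
      e->sharers |= 1ull << req;
  }

  /* Write miss (BW): invalidate the owner and every other sharer,
   * then record the requester as the sole (M-state) owner. */
  void dir_write_miss(dir_entry_t *e, int req) {
      if (e->owner >= 0)
          send_invalidate(e->owner, req);
      for (int p = 0; p < 64; p++)
          if (((e->sharers >> p) & 1) && p != req)
              send_invalidate(p, req);
      e->sharers = 0;
      e->owner = req;
  }

  int main(void) { /* tiny demo mirroring the example trace below */
      dir_entry_t e = { 0, -1 };
      dir_read_miss(&e, 0);    /* P0 ld: sharers {0}       */
      dir_write_miss(&e, 0);   /* P0 st: owner 0 (M)       */
      dir_read_miss(&e, 1);    /* P1 ld: sharers {0,1}     */
      dir_write_miss(&e, 1);   /* P1 st: owner 1, P0 -> I  */
      return 0;
  }

Note that only the processors named in the entry ever receive messages — this is exactly the “only notify processors that care” property that replaces broadcast.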


MSI Directory Protocol: Example
• Both processors run the same withdrawal code against a shared account balance (initially 500); directory entries shown as state:sharers:memory-value, cache lines as state:value

  Processor 0 / Processor 1 code:
    0: addi r1,accts,r3
    1: ld   0(r3),r4
    2: blt  r4,r2,done
    3: sub  r4,r2,r4
    4: st   r4,0(r3)

  Step        P0 cache   P1 cache   Directory
  (initially) –          –          –:–:500
  P0: ld      S:500      –          S:0:500
  P0: st      M:400      –          M:0:500 (memory value now stale)
  P1: ld      S:400      S:400      S:0,1:400
  P1: st      I          M:300      M:1:400 (memory value now stale)

• ld by P1 sends BR to the directory
  • Directory sends BR to P0; P0 sends P1 the data, does WB, goes to S
• st by P1 sends BW to the directory
  • Directory sends BW to P0; P0 goes to I

Directory Flip Side: Latency
[Figure: 2-hop miss (P0→Dir→P0) vs. 3-hop miss (P0→Dir→P1→P0)]
• Directory protocols
  + Lower bandwidth consumption → more scalable
  – Longer latencies
• Two read-miss situations
  • Unshared: get data from memory
    • Snooping: 2 hops (P0→memory→P0)
    • Directory: 2 hops (P0→memory→P0)
  • Shared or exclusive: get data from another processor (P1)
    • Assume cache-to-cache transfer optimization
    • Snooping: 2 hops (P0→P1→P0)
    – Directory: 3 hops (P0→memory→P1→P0)
    • Common case: with many processors, high probability someone has the line


Directory Flip Side: Complexity
• Latency is not the only issue for directories
  • Subtle correctness issues as well
  • These stem from the unordered nature of the underlying interconnect
• Individual requests to a single cache line must be ordered
  • Bus-based snooping: all processors see all requests in the same order
    • Ordering is automatic
  • Point-to-point network: requests may arrive in different orders
    • Directory has to enforce ordering explicitly
    • Cannot initiate actions on request B…
    • …until all relevant processors complete actions on request A
    • Requires directory to collect acks, queue requests, etc.
      (see the serialization sketch after the next slide)
• Directory protocols
  • Obvious in principle
  – Complicated in practice

Coherence on Real Machines
• Many uniprocessors designed with on-chip snooping logic
  • Can be easily combined to form multi-processors
  • e.g., Intel Pentium4 Xeon
  • And multicore, of course
• Larger-scale (directory) systems built from smaller MPs
  • e.g., Sun Wildfire, NUMA-Q, IBM Summit
• Some machines are not cache coherent
  • e.g., CRAY-T3D/E
  • Shared data is uncachable
  • If you want to cache shared data, copy it to a private data section
  • Basically, cache coherence implemented in software
  • Have to really know what you are doing as a programmer
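Returning to the complexity slide above: a minimal sketch of how a home directory might serialize racing requests to one line, queuing request B until request A’s acks have all arrived. Everything here (line_state_t, the QLEN bound, the elided pending_acks count) is an assumption for illustration, not a real protocol engine.

  #include <stdbool.h>

  #define QLEN 8  /* assumed bound on racing requests, for the sketch */

  typedef struct { int requester; bool is_write; } request_t;

  typedef struct {
      bool      busy;          /* a request for this line is in flight */
      int       pending_acks;  /* invalidation acks still outstanding  */
      request_t queue[QLEN];   /* deferred requests, oldest first      */
      int       head, tail;    /* ring-buffer indices                  */
  } line_state_t;

  void start_processing(line_state_t *line, request_t req); /* below */

  /* A request arrives at the home node: if another request to the same
   * line is still in flight, queue it rather than acting on it. */
  void on_request(line_state_t *line, request_t req) {
      if (line->busy) {
          line->queue[line->tail++ % QLEN] = req;  /* B waits for A */
          return;
      }
      start_processing(line, req);
  }

  /* Begin request A: mark the line busy and (elided) send the needed
   * invalidations/fetches, setting pending_acks to the reply count. */
  void start_processing(line_state_t *line, request_t req) {
      line->busy = true;
      line->pending_acks = 1; /* placeholder; real count depends on sharers */
      (void)req;
  }

  /* An ack arrives: once all relevant processors have completed their
   * actions for request A, the next queued request may begin. */
  void on_ack(line_state_t *line) {
      if (--line->pending_acks > 0)
          return;
      line->busy = false;
      if (line->head != line->tail)
          start_processing(line, line->queue[line->head++ % QLEN]);
  }

The queue is what a bus gives you for free: on a bus, arbitration itself imposes a single global order on requests, so no ack counting is needed.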


Roadmap Checkpoint
[Figure: layered system diagram — applications (App) over system software over Mem/CPU/I/O hardware]
• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models

Tricky Shared Memory Examples
• Answer the following question:
  • Initially: all variables zero (that is, x is 0, y is 0, flag is 0, A is 0)
• What value pairs can be read by the two loads? (x, y) pairs:

  thread 1      thread 2
  load x        store 1 → y
  load y        store 1 → x

• (0,0) and (1,1) are easy to see
• load x, store 1 → y, load y, store 1 → x gives (0,1)
• Is it possible to get (1,0)? (try the harness below)
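You can run this litmus test yourself. A hedged C11 harness (names like t1/t2 and the iteration count are illustrative; compile with -pthread) using relaxed atomics, so any reordering we observe comes from the compiler/hardware rather than undefined behavior:

  #include <stdatomic.h>
  #include <stdio.h>
  #include <pthread.h>

  atomic_int x, y;
  int r_x, r_y;  /* values read by thread 1 */

  void *t1(void *arg) {
      r_x = atomic_load_explicit(&x, memory_order_relaxed); /* load x */
      r_y = atomic_load_explicit(&y, memory_order_relaxed); /* load y */
      return NULL;
  }

  void *t2(void *arg) {
      atomic_store_explicit(&y, 1, memory_order_relaxed);   /* store 1 -> y */
      atomic_store_explicit(&x, 1, memory_order_relaxed);   /* store 1 -> x */
      return NULL;
  }

  int main(void) {
      for (int i = 0; i < 100000; i++) {
          atomic_store(&x, 0); atomic_store(&y, 0);
          pthread_t a, b;
          pthread_create(&a, NULL, t1, NULL);
          pthread_create(&b, NULL, t2, NULL);
          pthread_join(a, NULL); pthread_join(b, NULL);
          if (r_x == 1 && r_y == 0)  /* the "impossible-looking" outcome */
              printf("saw (1,0) on iteration %d\n", i);
      }
      return 0;
  }

On strongly ordered machines (x86’s model keeps loads in order with loads and stores in order with stores), (1,0) should never print; on more weakly ordered hardware such as ARM, it can.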


Tricky Shared Memory Examples
• Answer the following questions:
  • Initially: all variables zero (that is, x is 0, y is 0, flag is 0, A is 0)
• What value pairs can be read by the two loads? (x, y) pairs:

  thread 1      thread 2
  load x        store 1 → y
  load y        store 1 → x

• What value pairs can be read by the two loads? (x, y) pairs:

  thread 1      thread 2
  store 1 → y   store 1 → x
  load x        load y

• What value can be read by “load A” below?

  thread 1        thread 2
  store 1 → A     while (flag == 0) { }
  store 1 → flag  load A

Memory Consistency
• Coherence
  • Creates a globally uniform (consistent) view…
  • …of a single memory location (in other words: cache line)
  – Not enough
    • Cache lines A and B can be individually consistent…
    • …but inconsistent with respect to each other
• Memory consistency
  • Creates a globally uniform (consistent) view…
  • …of all memory locations relative to each other
• Who cares? Programmers
  – Globally inconsistent memory creates mystifying behavior


Hiding Store Miss Latency
• Recall (back from the caching unit)
  • Hiding store miss latency
  • How? Store buffer
  • Said it would complicate multiprocessors
  • Yes, it does!

Write Misses and Store Buffers
• Read miss?
  • Load can’t go on without the data → must stall
• Write miss?
  • Technically, no one needs the data → why stall?
• Store buffer: a small buffer
  • Stores put their address/value into the store buffer and keep going
  • Store buffer writes stores to D$ in the background
  • Loads must search the store buffer (in addition to D$)
  + Eliminates stalls on write misses (mostly)
  – Creates some problems (a behavioral sketch follows the figure)

[Figure: Processor feeding a store buffer (SB) and cache, with a write-back buffer (WBB) between the cache and the next-level cache]
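A hedged behavioral sketch of that store buffer (function names, the SB_ENTRIES size, and the dmem array standing in for the D$ are all assumptions). The key behaviors are: a store never stalls, a load sees this processor’s own buffered stores, and nothing becomes visible to other processors until the drain step.

  #include <stdint.h>

  #define SB_ENTRIES 8              /* assumed store-buffer size      */
  #define MEM_WORDS  1024

  static uint64_t dmem[MEM_WORDS];  /* stand-in for the D$ / memory   */

  typedef struct { uint64_t addr, data; } sb_entry_t;
  static sb_entry_t sb[SB_ENTRIES];
  static int sb_head, sb_tail;      /* ring buffer: head = oldest     */

  /* Store: append to the buffer and keep going -- no stall, even if
   * the line would miss in the D$. (Full-buffer handling elided.)    */
  void cpu_store(uint64_t addr, uint64_t data) {
      sb[sb_tail % SB_ENTRIES] = (sb_entry_t){ addr, data };
      sb_tail++;
  }

  /* Load: search the store buffer newest-to-oldest first, so this
   * processor always sees its own stores; otherwise read the D$.     */
  uint64_t cpu_load(uint64_t addr) {
      for (int i = sb_tail - 1; i >= sb_head; i--)
          if (sb[i % SB_ENTRIES].addr == addr)
              return sb[i % SB_ENTRIES].data; /* store-to-load forwarding */
      return dmem[addr % MEM_WORDS];
  }

  /* Background drain: retire the oldest store into the D$. Until this
   * runs, no *other* processor can observe the buffered store.       */
  void sb_drain_one(void) {
      if (sb_head == sb_tail) return;         /* buffer empty */
      sb_entry_t e = sb[sb_head % SB_ENTRIES];
      dmem[e.addr % MEM_WORDS] = e.data;
      sb_head++;
  }

That gap between “my store happened” and “everyone can see my store” is exactly what causes the trouble on the next slides.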


Store Buffers & Consistency

  A = flag = 0;
  Processor 0          Processor 1
  A = 1;               while (!flag); // spin
  flag = 1;            print A;

• Consider the following execution:
  • Processor 0’s write to A misses the cache; it is put in the store buffer
  • Processor 0 keeps going
  • Processor 0’s write of “1” to flag hits and completes
  • Processor 1 reads flag… sees the value “1”
  • Processor 1 exits the loop
  • Processor 1 prints “0” for A
• Ramification: store buffers can cause “strange” behavior
  • How strange depends on lots of things

Coherence vs. Consistency

  A = 0; flag = 0;
  Processor 0          Processor 1
  A = 1;               while (!flag); // spin
  flag = 1;            print A;

• Intuition says: P1 prints A = 1
• Coherence says: absolutely nothing
  • P1 can see P0’s write of flag before its write of A!!! How?
  • P0 has a coalescing store buffer that reorders writes
  • Or out-of-order execution
  • Or the compiler reorders instructions
• Imagine trying to figure out why this code sometimes “works” and sometimes doesn’t
• Real systems act in this strange manner
  • What is allowed is defined as part of the processor’s ISA
• (a C11 version of this litmus test appears below)
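A hedged C11 rendering of the A/flag example (thread names p0/p1 are illustrative; compile with -pthread). With relaxed ordering, printing “A = 0” is allowed on weakly ordered hardware; the comments show where release/acquire would forbid it.

  #include <stdatomic.h>
  #include <stdio.h>
  #include <pthread.h>

  atomic_int A, flag;

  void *p0(void *arg) {
      atomic_store_explicit(&A, 1, memory_order_relaxed);    /* A = 1    */
      atomic_store_explicit(&flag, 1, memory_order_relaxed); /* flag = 1 */
      /* Safe version: memory_order_release on the flag store keeps the
       * write to A ordered before the write to flag.                   */
      return NULL;
  }

  void *p1(void *arg) {
      while (!atomic_load_explicit(&flag, memory_order_relaxed))
          ;                                                  /* spin     */
      /* Safe version: memory_order_acquire on the flag load keeps the
       * read of A ordered after the read of flag.                      */
      printf("A = %d\n", atomic_load_explicit(&A, memory_order_relaxed));
      return NULL;
  }

  int main(void) {
      pthread_t t0, t1;
      pthread_create(&t0, NULL, p0, NULL);
      pthread_create(&t1, NULL, p1, NULL);
      pthread_join(t0, NULL);
      pthread_join(t1, NULL);
      return 0;
  }

This is why “what is allowed” must be spelled out by the ISA (and, at the language level, by the C11 memory model): the same source can print 0 or 1 depending on how aggressively writes are buffered and reordered.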


Memory Consistency Models
• Sequential consistency (SC) (MIPS, PA-RISC)
  • Formal definition of the memory view programmers expect
  • Processors see their own loads and stores in program order
    + Provided naturally, even with out-of-order execution
  • But also: processors see others’ loads and stores in program order
  • And finally: all processors see the same global load/store ordering
    – Last two conditions not naturally enforced by coherence
  • Corresponds to some sequential interleaving of uniprocessor orders
  • Indistinguishable from a multi-programmed uniprocessor
• Processor consistency (PC) (x86, SPARC)
  • Allows an in-order store buffer
  • Stores can be deferred, but must be put into the cache in order
• Release consistency (RC) (ARM, Itanium, PowerPC)
  • Allows an un-ordered store buffer
  • Stores can be put into the cache in any order

Restoring Order
• Sometimes we need ordering (mostly we don’t)
  • Prime example: ordering between “lock” and data
• How? Insert fences (memory barriers)
  • Special instructions, part of the ISA
• Example: ensure that loads/stores don’t cross lock acquire/release operations

  acquire
  fence
  critical section
  fence
  release

• How do fences work?
  • They stall execution until write buffers are empty
  • Makes lock acquisition and release slow(er)
  • Use a synchronization library, don’t write your own
  • (a fenced spin-lock sketch follows)
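A hedged sketch of the acquire/fence/…/fence/release pattern above, as a test-and-set spin lock in C11 atomics. The names (locked_update, shared_data) are illustrative; in practice atomic_flag’s built-in acquire/release orderings imply the needed barriers, and the explicit atomic_thread_fence calls are written out only to mirror the slide’s structure.

  #include <stdatomic.h>

  static atomic_flag lock = ATOMIC_FLAG_INIT;
  static int shared_data;

  void locked_update(int v) {
      while (atomic_flag_test_and_set_explicit(&lock, memory_order_relaxed))
          ;                                       /* acquire: spin on t&s */
      atomic_thread_fence(memory_order_acquire);  /* fence: later accesses
                                                     can't float above    */

      shared_data = v;                            /* critical section     */

      atomic_thread_fence(memory_order_release);  /* fence: earlier accesses
                                                     can't sink below     */
      atomic_flag_clear_explicit(&lock, memory_order_relaxed); /* release */
  }

The fences are what make the lock meaningful under PC or RC: without them, the store to shared_data could drain from the store buffer after the lock release, letting another processor enter the critical section and read stale data. This is also why the slide’s advice stands: use a synchronization library rather than hand-rolling this.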


Shared Memory Summary
• Synchronization: regulated access to shared data
  • Key feature: atomic lock acquisition operation (e.g., t&s)
  • Performance optimizations: test-and-test-and-set, queue locks
• Coherence: consistent view of individual cache lines
  • Absolute coherence not needed; relative coherence is OK
  • VI and MSI protocols, cache-to-cache transfer optimization
  • Implementation? Snooping, directories
• Consistency: consistent view of all memory locations
  • Programmers intuitively expect sequential consistency (SC)
    • Global interleaving of individual processor access streams
    – Not always naturally provided; may prevent optimizations
  • Weaker ordering: consistency only for synchronization points

Flynn’s Taxonomy
• Proposed by Michael Flynn in 1966
• SISD – single instruction, single data
  • Traditional uniprocessor
• SIMD – single instruction, multiple data
  • Execute the same instruction on many data elements
  • Vector machines, graphics engines
• MIMD – multiple instruction, multiple data
  • Each processor executes its own instructions
  • Multicores are all built this way
• SPMD – single program, multiple data (extension proposed by Frederica Darema)
  • MIMD machine, each node executing the same code
• MISD – multiple instruction, single data


Summary

[Figure: layered system diagram — applications (App) over system software over Mem/CPU/I/O hardware]

• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models
