“Scalable” Cache Coherence

[Figure: CPUs with caches ($) and memory modules connected through point-to-point routers (R) rather than a shared bus; a LdM/StM request travels only to the line's home memory]

• Part I: bus bandwidth
  • Replace non-scalable bandwidth substrate (bus)…
    …with scalable one (point-to-point network, e.g., mesh)
• Part II: processor snooping bandwidth
  • Most snoops result in no action
  • Replace non-scalable broadcast protocol (spam everyone)…
    …with scalable directory protocol (only notify processors that care)

Scalable Cache Coherence

• Point-to-point interconnects
  • Glueless MP: no need for additional “glue” chips
  + Can be arbitrarily large: 1000’s of processors
    • Massively parallel processors (MPPs)
  • Only government (DoD) has cache-coherent MPPs…
  • Companies have much smaller systems: 32–64 processors
    • Scalable multi-processors
  • AMD Opteron/Phenom – point-to-point, glueless, broadcast
• Distributed memory: non-uniform memory architecture (NUMA)
• Multicore: on-chip mesh interconnection networks

Directory Coherence Protocols

• Observe: address space statically partitioned
  + Can easily determine which memory module holds a given line
    • That memory module sometimes called “home”
  – Can’t easily determine which processors have line in their caches
• Bus-based protocol: broadcast events to all processors/caches
  ± Simple and fast, but non-scalable
• Directories: non-broadcast coherence protocol
  • Extend memory to track caching information
  • For each physical cache line whose home this is, track:
    • Owner: which processor has a dirty copy (i.e., M state)
    • Sharers: which processors have clean copies (i.e., S state)
  • Processor sends coherence event to home directory
  • Home directory only sends events to processors that care
  • For multicore w/ shared L3, put directory info in cache tags

MSI Directory Protocol

[Figure: MSI state machine with states I, S, M; transitions labeled LdMiss, StMiss, Store, WB, and Load]

• Processor side
  • Directory follows its own protocol (obvious in principle)
• Similar to bus-based MSI
  • Same three states
  • Same five actions (keep BR/BW names)
  • Minus red arcs/actions
    • Events that would not trigger action anyway
    + Directory won’t bother you unless you need to act

MSI Directory Protocol (example)

  Processor 0            Processor 1            P0      P1      Directory
                                                                –:–:500
  0: addi r1,accts,r3
  1: ld 0(r3),r4                                S:500           S:0:500
  2: blt r4,r2,done
  3: sub r4,r2,r4
  4: st r4,0(r3)                                M:400           M:0:500 (stale)
                         0: addi r1,accts,r3
                         1: ld 0(r3),r4         S:400   S:400   S:0,1:400
                         2: blt r4,r2,done
                         3: sub r4,r2,r4
                         4: st r4,0(r3)         I:      M:300   M:1:400

• ld by P1 sends BR to directory
  • Directory sends BR to P0; P0 sends P1 the data, does WB, goes to S
• st by P1 sends BW to directory
  • Directory sends BW to P0; P0 goes to I

Directory Flip Side: Latency

[Figure: a 2-hop miss (P0 to Dir and back) next to a 3-hop miss (P0 to Dir to P1 and back to P0)]

• Directory protocols
  + Lower bandwidth consumption → more scalable
  – Longer latencies
• Two read miss situations
  • Unshared: get data from memory
    • Snooping: 2 hops (P0 → memory → P0)
    • Directory: 2 hops (P0 → memory → P0)
  • Shared or exclusive: get data from other processor (P1)
    • Assume cache-to-cache transfer optimization
    • Snooping: 2 hops (P0 → P1 → P0)
    – Directory: 3 hops (P0 → memory → P1 → P0)
    • Common: many processors → high probability someone has it
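To make the directory bookkeeping above concrete, here is a minimal C sketch of a directory entry and its read-miss handling, assuming a bit-vector sharer representation and at most 64 processors. The names (dir_entry_t, dir_read_miss, send_br) are illustrative, not from the slides.

  #include <stdint.h>

  #define NO_OWNER 0xFF

  void send_br(int dest, int requester);   /* network send, not shown */

  enum dir_state { DIR_I, DIR_S, DIR_M };  /* mirrors the MSI states */

  /* One entry per cache line whose "home" is this memory module.
     Assumes at most 64 processors, tracked with a sharer bit-vector. */
  typedef struct {
      enum dir_state state;
      uint8_t  owner;    /* processor holding the dirty (M) copy, or NO_OWNER */
      uint64_t sharers;  /* bit i set => processor i holds a clean (S) copy */
  } dir_entry_t;

  /* A read miss (BR) arrives at the home directory: only the owner, if
     any, is contacted -- never a broadcast to uninterested processors. */
  void dir_read_miss(dir_entry_t *e, int requester) {
      if (e->state == DIR_M) {
          /* 3-hop case: forward the BR to the owner; the owner sends the
             requester the data, writes back, and downgrades itself to S. */
          send_br(e->owner, requester);
          e->sharers |= 1ULL << e->owner;
          e->owner = NO_OWNER;
      }
      /* 2-hop case (state I or S): home memory supplies the data. */
      e->state = DIR_S;
      e->sharers |= 1ULL << requester;
  }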
Directory Flip Side: Complexity

• Latency is not the only issue for directories
  • Subtle correctness issues as well
  • Stem from unordered nature of underlying interconnect
• Individual requests to single cache must be ordered
  • Bus-based snooping: all processors see all requests in same order
    • Ordering automatic
  • Point-to-point network: requests may arrive in different orders
    • Directory has to enforce ordering explicitly
    • Cannot initiate actions on request B…
      …until all relevant processors complete actions on request A
    • Requires directory to collect acks, queue requests, etc.
• Directory protocols
  • Obvious in principle
  – Complicated in practice

Coherence on Real Machines

• Many uniprocessors designed with on-chip snooping logic
  • Can be easily combined to form multi-processors
  • e.g., Intel Pentium4 Xeon
  • And multicore, of course
• Larger scale (directory) systems built from smaller MPs
  • e.g., Sun Wildfire, NUMA-Q, IBM Summit
• Some shared memory machines are not cache coherent
  • e.g., CRAY-T3D/E
  • Shared data is uncachable
    • If you want to cache shared data, copy it to private data section
  • Basically, cache coherence implemented in software
    • Have to really know what you are doing as a programmer

Roadmap Checkpoint

[Figure: system stack with applications over system software over CPUs, memory, and I/O]

• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models

Tricky Shared Memory Examples

• Answer the following questions:
  • Initially: all variables zero (that is, x is 0, y is 0, flag is 0, A is 0)
• What value pairs can be read by the two loads? (x, y) pairs:

    thread 1          thread 2
    load x            store 1 → y
    load y            store 1 → x

• (0,0) and (1,1) easy to see
• load x, store 1 → y, load y, store 1 → x gives (0,1)
• Is it possible to get (1,0)?

Tricky Shared Memory Examples

• Answer the following questions:
  • Initially: all variables zero (that is, x is 0, y is 0, flag is 0, A is 0)
• What value pairs can be read by the two loads? (x, y) pairs:

    thread 1          thread 2
    load x            store 1 → y
    load y            store 1 → x

• What value pairs can be read by the two loads? (x, y) pairs:

    thread 1          thread 2
    store 1 → y       store 1 → x
    load x            load y

• What value can be read by “Load A” below?

    thread 1          thread 2
    store 1 → A       while(flag == 0) { }
    store 1 → flag    load A

Memory Consistency

• Memory coherence
  • Creates globally uniform (consistent) view…
    …of a single memory location (in other words: cache line)
  – Not enough
    • Cache lines A and B can be individually consistent…
      …but inconsistent with respect to each other
• Memory consistency
  • Creates globally uniform (consistent) view…
    …of all memory locations relative to each other
• Who cares? Programmers
  – Globally inconsistent memory creates mystifying behavior

Hiding Store Miss Latency

• Recall (back from caching unit)
• Hiding store miss latency
  • How? Store buffer
  • Said it would complicate multiprocessors
  • Yes, it does!

Write Misses and Store Buffers

[Figure: processor with a store buffer (SB) in front of the cache, and a write-back buffer (WBB) between the cache and the next-level cache]

• Read miss?
  • Load can’t go on without the data → must stall
• Write miss?
  • Technically, no one needs the data → why stall?
• Store buffer: a small buffer
  • Stores put addr/value into store buffer, keep going
  • Store buffer writes stores to D$ in the background
  • Loads must search store buffer (in addition to D$), as sketched below
  + Eliminates stalls on write misses (mostly)
  – Creates some problems
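A toy model of that search rule, assuming a small FIFO buffer, 64-bit addresses and data, and illustrative names (sb_store, sb_forward):

  #include <stdbool.h>
  #include <stdint.h>

  #define SB_ENTRIES 8

  /* A toy FIFO store buffer: stores enter at the tail and drain to the
     D$ in the background, so the processor need not stall on a write miss. */
  typedef struct { uint64_t addr, data; bool valid; } sb_entry_t;

  static sb_entry_t sb[SB_ENTRIES];
  static int sb_tail;  /* index of the next entry to fill */

  /* Store: buffer the addr/value pair and keep going. */
  void sb_store(uint64_t addr, uint64_t data) {
      sb[sb_tail] = (sb_entry_t){ addr, data, true };
      sb_tail = (sb_tail + 1) % SB_ENTRIES;  /* real hardware stalls when full */
  }

  /* Load: search the store buffer newest-to-oldest before the D$, so a
     load always sees this processor's own buffered stores. */
  bool sb_forward(uint64_t addr, uint64_t *data) {
      for (int i = 1; i <= SB_ENTRIES; i++) {
          int idx = (sb_tail - i + SB_ENTRIES) % SB_ENTRIES;
          if (sb[idx].valid && sb[idx].addr == addr) {
              *data = sb[idx].data;
              return true;   /* forwarded from the store buffer */
          }
      }
      return false;          /* not in the store buffer: go to the D$ */
  }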
Store Buffers & Consistency

    A = flag = 0;

    Processor 0           Processor 1
    A = 1;                while (!flag); // spin
    flag = 1;             print A;

• Consider the following execution:
  • Processor 0’s write to A misses the cache; put in store buffer
  • Processor 0 keeps going
  • Processor 0’s write of “1” to flag hits, completes
  • Processor 1 reads flag… sees the value “1”
  • Processor 1 exits loop
  • Processor 1 prints “0” for A
• Ramification: store buffers can cause “strange” behavior
  • How strange depends on lots of things
  • A runnable version of this example appears at the end of this section

Coherence vs. Consistency

    A = 0; flag = 0;

    Processor 0           Processor 1
    A = 1;                while (!flag); // spin
    flag = 1;             print A;

• Intuition says: P1 prints A=1
• Coherence says: absolutely nothing
  • P1 can see P0’s write of flag before write of A!!! How?
  • P0 has a coalescing store buffer that reorders writes
  • Or out-of-order execution
  • Or compiler re-orders instructions
• Imagine trying to figure out why this code sometimes “works” and sometimes doesn’t
• Real systems act in this strange manner
  • What is allowed is defined as part of the ISA of the processor

Memory Consistency Models

• Sequential consistency (SC) (MIPS, PA-RISC)
  • Formal definition of memory view programmers expect
  • Processors see their own loads and stores in program order
    + Provided naturally, even with out-of-order execution
  • But also: processors see others’ loads and stores in program order
  • And finally: all processors see same global load/store ordering
    – Last two conditions not naturally enforced by coherence
  • Corresponds to some sequential interleaving of uniprocessor orders
  • Indistinguishable from multi-programmed uni-processor
• Processor consistency (PC) (x86, SPARC)
  • Allows an in-order store buffer
  • Stores can be deferred, but must be put into the cache in order
• Release consistency (RC) (ARM, Itanium, PowerPC)
  • Allows an un-ordered store buffer
  • Stores can be put into cache in any order

Restoring Order

• Sometimes we need ordering (mostly we don’t)
  • Prime example: ordering between “lock” and data
• How? Insert fences (memory barriers)
  • Special instructions, part of ISA
• Example: ensure that loads/stores don’t cross lock acquire/release operations

    acquire
    fence
    critical section
    fence
    release

• How do fences work?
  • They stall execution until write buffers are empty
  • Makes lock acquisition and release slow(er)
• Use synchronization library, don’t write your own
  • A fence-based spinlock sketch appears at the end of this section

Shared Memory Summary

• Synchronization: regulated access to shared data
  • Key feature: atomic lock acquisition operation (e.g., t&s)
  • Performance optimizations: test-and-test-and-set, queue locks
• Coherence: consistent view of individual cache lines
  • Absolute coherence not needed, relative coherence OK
  • VI…

Flynn’s Taxonomy

• Proposed by Michael Flynn in 1966
• SISD – single instruction, single data
  • Traditional uniprocessor
• SIMD – single instruction, multiple data
  • Execute the same instruction on many data elements
  • Vector machines, graphics engines
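To connect the flag/A example above to real code, here is a runnable C11 reconstruction of it (a sketch; it assumes C11 <threads.h>, which not every libc ships). With relaxed atomics the compiler and hardware may reorder the two stores, so printing 0 is an allowed outcome on weakly ordered machines; promoting the flag store to release and the flag load to acquire rules it out.

  #include <stdatomic.h>
  #include <stdio.h>
  #include <threads.h>

  atomic_int A = 0, flag = 0;

  int writer(void *arg) {
      (void)arg;
      atomic_store_explicit(&A, 1, memory_order_relaxed);
      /* memory_order_release here would forbid the "prints 0" outcome */
      atomic_store_explicit(&flag, 1, memory_order_relaxed);
      return 0;
  }

  int reader(void *arg) {
      (void)arg;
      /* memory_order_acquire here would pair with a release store above */
      while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
          ;  /* spin */
      printf("A = %d\n", atomic_load_explicit(&A, memory_order_relaxed));
      return 0;  /* may legally print A = 0 without release/acquire */
  }

  int main(void) {
      thrd_t t0, t1;
      thrd_create(&t0, writer, NULL);
      thrd_create(&t1, reader, NULL);
      thrd_join(t0, NULL);
      thrd_join(t1, NULL);
      return 0;
  }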

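And here is a minimal sketch of the acquire/fence/critical-section/fence/release picture from the “Restoring Order” slide, written with C11 atomics; the explicit fences stand in for the ISA's memory-barrier instructions, and as the slide says, real code should use a synchronization library instead.

  #include <stdatomic.h>

  static atomic_flag lock = ATOMIC_FLAG_INIT;

  void acquire(void) {
      /* spin on atomic test-and-set until it returns "previously clear" */
      while (atomic_flag_test_and_set_explicit(&lock, memory_order_relaxed))
          ;
      /* fence: critical-section loads/stores cannot move above the lock */
      atomic_thread_fence(memory_order_acquire);
  }

  void release(void) {
      /* fence: critical-section loads/stores cannot move below the unlock */
      atomic_thread_fence(memory_order_release);
      atomic_flag_clear_explicit(&lock, memory_order_relaxed);
  }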