Scalable Cache Coherence

[Figure: two multiprocessor organizations — CPUs/caches on a shared bus to memory vs. CPUs/caches attached through routers (R) in a point-to-point network to distributed memories; LdM/StM requests travel through the network]

• Part I: bus bandwidth
  • Replace the non-scalable bandwidth substrate (bus)…
  • …with a scalable one (point-to-point network, e.g., mesh)
• Part II: snooping bandwidth
  • Most snoops result in no action
  • Replace the non-scalable broadcast protocol (spam everyone)…
  • …with a scalable directory protocol (only notify processors that care)

Point-to-Point Interconnects
• Glueless MP: no need for additional “glue” chips
  + Can be arbitrarily large: 1000’s of processors
  • Massively parallel processors (MPPs)
  • Only government (DoD) has cache-coherent MPPs…
  • Companies have much smaller systems: 32–64 processors
  • “Scalable multi-processors”
• AMD Opteron/Phenom – point-to-point, glueless, broadcast
  • Non-uniform memory architecture (NUMA)
• Multicore: on-chip mesh interconnection networks


Directory Coherence Protocols
• Observe: address space statically partitioned
  + Can easily determine which memory module holds a given line
  • That memory module sometimes called the “home”
  – Can’t easily determine which processors have the line in their caches
• Bus-based protocol: broadcast events to all processors/caches
  ± Simple and fast, but non-scalable
• Directories: non-broadcast coherence protocol
  • Extend memory to track caching information
  • For each physical cache line whose home this is, track:
    • Owner: which processor has a dirty copy (i.e., M state)
    • Sharers: which processors have clean copies (i.e., S state)
  • Processor sends coherence events to the home directory
  • Home directory only sends events to processors that care
  • For multicore w/ shared L3, put directory info in cache tags
  • (a sketch of this bookkeeping follows below)

MSI Directory Protocol
• Processor side
  • Directory follows its own protocol (obvious in principle)
• Similar to bus-based MSI
  • Same three states (I, S, M)
  • Same five actions (keep BR/BW names)
  • Minus red arcs/actions
    • Events that would not trigger action anyway
    + Directory won’t bother you unless you need to act

[State diagram: states I, S, M with the usual MSI transitions, arcs labeled Load, Store, LdMiss, StMiss, and WB]
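The directory’s bookkeeping is small enough to sketch in a few lines of C. This is a hedged illustration under stated assumptions, not any real machine’s protocol: the dir_entry_t layout, the 64-processor sharer bitmask, and the send_* message helpers are all inventions for the sketch.

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical message primitives standing in for the interconnect;
   * real hardware sends packets, not function calls. */
  static void send_data_from_memory(int req)       { printf("mem -> P%d\n", req); }
  static void send_fetch(int owner, int req)       { printf("P%d -> P%d (fetch)\n", owner, req); }
  static void send_invalidate(int target, int req) { printf("inv P%d (for P%d)\n", target, req); }

  /* One directory entry, assuming at most 64 processors so the
   * sharer list fits in a single bitmask. */
  typedef struct {
      uint64_t sharers; /* bit i set => processor i has a clean (S) copy */
      int      owner;   /* processor holding the dirty (M) copy, or -1   */
  } dir_entry_t;

  /* Read miss (BR) arriving at the home: if some cache owns the line
   * in M, fetch from it (it writes back and downgrades to S);
   * otherwise memory supplies the data. Requester becomes a sharer. */
  void dir_read_miss(dir_entry_t *e, int req) {
      if (e->owner >= 0) {
          send_fetch(e->owner, req);
          e->sharers |= 1ull << e->owner;   /* old owner: M -> S */
          e->owner = -1;
      } else {
          send_data_from_memory(req);
      }
      e->sharers |= 1ull << req;
  }

  /* Write miss (BW): invalidate the owner and every other sharer,
   * then record the requester as the sole (M-state) owner. */
  void dir_write_miss(dir_entry_t *e, int req) {
      if (e->owner >= 0)
          send_invalidate(e->owner, req);
      for (int p = 0; p < 64; p++)
          if (((e->sharers >> p) & 1) && p != req)
              send_invalidate(p, req);
      e->sharers = 0;
      e->owner = req;
  }

  int main(void) { /* tiny demo mirroring the example trace below */
      dir_entry_t e = { 0, -1 };
      dir_read_miss(&e, 0);    /* P0 ld: sharers {0}       */
      dir_write_miss(&e, 0);   /* P0 st: owner 0 (M)       */
      dir_read_miss(&e, 1);    /* P1 ld: sharers {0,1}     */
      dir_write_miss(&e, 1);   /* P1 st: owner 1, P0 -> I  */
      return 0;
  }

Note that only the processors named in the entry ever receive messages — this is exactly the “only notify processors that care” property that replaces broadcast.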


MSI Directory Protocol: Example
• Both processors run the same withdrawal code against a shared account balance (initially 500); directory entries shown as state:sharers:memory-value, cache lines as state:value

  Processor 0 / Processor 1 code:
    0: addi r1,accts,r3
    1: ld   0(r3),r4
    2: blt  r4,r2,done
    3: sub  r4,r2,r4
    4: st   r4,0(r3)

  Step        P0 cache   P1 cache   Directory
  (initially) –          –          –:–:500
  P0: ld      S:500      –          S:0:500
  P0: st      M:400      –          M:0:500 (memory value now stale)
  P1: ld      S:400      S:400      S:0,1:400
  P1: st      I          M:300      M:1:400 (memory value now stale)

• ld by P1 sends BR to the directory
  • Directory sends BR to P0; P0 sends P1 the data, does WB, goes to S
• st by P1 sends BW to the directory
  • Directory sends BW to P0; P0 goes to I

Directory Flip Side: Latency
[Figure: 2-hop miss (P0→Dir→P0) vs. 3-hop miss (P0→Dir→P1→P0)]
• Directory protocols
  + Lower bandwidth consumption → more scalable
  – Longer latencies
• Two read-miss situations
  • Unshared: get data from memory
    • Snooping: 2 hops (P0→memory→P0)
    • Directory: 2 hops (P0→memory→P0)
  • Shared or exclusive: get data from another processor (P1)
    • Assume cache-to-cache transfer optimization
    • Snooping: 2 hops (P0→P1→P0)
    – Directory: 3 hops (P0→memory→P1→P0)
    • Common case: with many processors, high probability someone has the line


Directory Flip Side: Complexity
• Latency is not the only issue for directories
  • Subtle correctness issues as well
  • These stem from the unordered nature of the underlying interconnect
• Individual requests to a single cache line must be ordered
  • Bus-based snooping: all processors see all requests in the same order
    • Ordering is automatic
  • Point-to-point network: requests may arrive in different orders
    • Directory has to enforce ordering explicitly
    • Cannot initiate actions on request B…
    • …until all relevant processors complete actions on request A
    • Requires directory to collect acks, queue requests, etc.
      (see the serialization sketch after the next slide)
• Directory protocols
  • Obvious in principle
  – Complicated in practice

Coherence on Real Machines
• Many uniprocessors designed with on-chip snooping logic
  • Can be easily combined to form multi-processors
  • e.g., Intel Pentium4 Xeon
  • And multicore, of course
• Larger-scale (directory) systems built from smaller MPs
  • e.g., Sun Wildfire, NUMA-Q, IBM Summit
• Some machines are not cache coherent
  • e.g., CRAY-T3D/E
  • Shared data is uncachable
  • If you want to cache shared data, copy it to a private data section
  • Basically, cache coherence implemented in software
  • Have to really know what you are doing as a programmer
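Returning to the complexity slide above: a minimal sketch of how a home directory might serialize racing requests to one line, queuing request B until request A’s acks have all arrived. Everything here (line_state_t, the QLEN bound, the elided pending_acks count) is an assumption for illustration, not a real protocol engine.

  #include <stdbool.h>

  #define QLEN 8  /* assumed bound on racing requests, for the sketch */

  typedef struct { int requester; bool is_write; } request_t;

  typedef struct {
      bool      busy;          /* a request for this line is in flight */
      int       pending_acks;  /* invalidation acks still outstanding  */
      request_t queue[QLEN];   /* deferred requests, oldest first      */
      int       head, tail;    /* ring-buffer indices                  */
  } line_state_t;

  void start_processing(line_state_t *line, request_t req); /* below */

  /* A request arrives at the home node: if another request to the same
   * line is still in flight, queue it rather than acting on it. */
  void on_request(line_state_t *line, request_t req) {
      if (line->busy) {
          line->queue[line->tail++ % QLEN] = req;  /* B waits for A */
          return;
      }
      start_processing(line, req);
  }

  /* Begin request A: mark the line busy and (elided) send the needed
   * invalidations/fetches, setting pending_acks to the reply count. */
  void start_processing(line_state_t *line, request_t req) {
      line->busy = true;
      line->pending_acks = 1; /* placeholder; real count depends on sharers */
      (void)req;
  }

  /* An ack arrives: once all relevant processors have completed their
   * actions for request A, the next queued request may begin. */
  void on_ack(line_state_t *line) {
      if (--line->pending_acks > 0)
          return;
      line->busy = false;
      if (line->head != line->tail)
          start_processing(line, line->queue[line->head++ % QLEN]);
  }

The queue is what a bus gives you for free: on a bus, arbitration itself imposes a single global order on requests, so no ack counting is needed.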


Roadmap Checkpoint
[Figure: layered system diagram — applications (App) over system software over Mem/CPU/I/O hardware]
• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models

Tricky Shared Memory Examples
• Answer the following question:
  • Initially: all variables zero (that is, x is 0, y is 0, flag is 0, A is 0)
• What value pairs can be read by the two loads? (x, y) pairs:

  thread 1      thread 2
  load x        store 1 → y
  load y        store 1 → x

• (0,0) and (1,1) are easy to see
• load x, store 1 → y, load y, store 1 → x gives (0,1)
• Is it possible to get (1,0)? (try the harness below)
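You can run this litmus test yourself. A hedged C11 harness (names like t1/t2 and the iteration count are illustrative; compile with -pthread) using relaxed atomics, so any reordering we observe comes from the compiler/hardware rather than undefined behavior:

  #include <stdatomic.h>
  #include <stdio.h>
  #include <pthread.h>

  atomic_int x, y;
  int r_x, r_y;  /* values read by thread 1 */

  void *t1(void *arg) {
      r_x = atomic_load_explicit(&x, memory_order_relaxed); /* load x */
      r_y = atomic_load_explicit(&y, memory_order_relaxed); /* load y */
      return NULL;
  }

  void *t2(void *arg) {
      atomic_store_explicit(&y, 1, memory_order_relaxed);   /* store 1 -> y */
      atomic_store_explicit(&x, 1, memory_order_relaxed);   /* store 1 -> x */
      return NULL;
  }

  int main(void) {
      for (int i = 0; i < 100000; i++) {
          atomic_store(&x, 0); atomic_store(&y, 0);
          pthread_t a, b;
          pthread_create(&a, NULL, t1, NULL);
          pthread_create(&b, NULL, t2, NULL);
          pthread_join(a, NULL); pthread_join(b, NULL);
          if (r_x == 1 && r_y == 0)  /* the "impossible-looking" outcome */
              printf("saw (1,0) on iteration %d\n", i);
      }
      return 0;
  }

On strongly ordered machines (x86’s model keeps loads in order with loads and stores in order with stores), (1,0) should never print; on more weakly ordered hardware such as ARM, it can.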


Tricky Shared Memory Examples
• Answer the following questions:
  • Initially: all variables zero (that is, x is 0, y is 0, flag is 0, A is 0)
• What value pairs can be read by the two loads? (x, y) pairs:

  thread 1      thread 2
  load x        store 1 → y
  load y        store 1 → x

• What value pairs can be read by the two loads? (x, y) pairs:

  thread 1      thread 2
  store 1 → y   store 1 → x
  load x        load y

• What value can be read by “load A” below?

  thread 1        thread 2
  store 1 → A     while (flag == 0) { }
  store 1 → flag  load A

Memory Consistency
• Coherence
  • Creates a globally uniform (consistent) view…
  • …of a single memory location (in other words: cache line)
  – Not enough
    • Cache lines A and B can be individually consistent…
    • …but inconsistent with respect to each other
• Memory consistency
  • Creates a globally uniform (consistent) view…
  • …of all memory locations relative to each other
• Who cares? Programmers
  – Globally inconsistent memory creates mystifying behavior


Hiding Store Miss Latency
• Recall (back from the caching unit)
  • Hiding store miss latency
  • How? Store buffer
  • Said it would complicate multiprocessors
  • Yes, it does!

Write Misses and Store Buffers
• Read miss?
  • Load can’t go on without the data → must stall
• Write miss?
  • Technically, no one needs the data → why stall?
• Store buffer: a small buffer
  • Stores put their address/value into the store buffer and keep going
  • Store buffer writes stores to D$ in the background
  • Loads must search the store buffer (in addition to D$)
  + Eliminates stalls on write misses (mostly)
  – Creates some problems (a behavioral sketch follows the figure)

[Figure: Processor feeding a store buffer (SB) and cache, with a write-back buffer (WBB) between the cache and the next-level cache]
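A hedged behavioral sketch of that store buffer (function names, the SB_ENTRIES size, and the dmem array standing in for the D$ are all assumptions). The key behaviors are: a store never stalls, a load sees this processor’s own buffered stores, and nothing becomes visible to other processors until the drain step.

  #include <stdint.h>

  #define SB_ENTRIES 8              /* assumed store-buffer size      */
  #define MEM_WORDS  1024

  static uint64_t dmem[MEM_WORDS];  /* stand-in for the D$ / memory   */

  typedef struct { uint64_t addr, data; } sb_entry_t;
  static sb_entry_t sb[SB_ENTRIES];
  static int sb_head, sb_tail;      /* ring buffer: head = oldest     */

  /* Store: append to the buffer and keep going -- no stall, even if
   * the line would miss in the D$. (Full-buffer handling elided.)    */
  void cpu_store(uint64_t addr, uint64_t data) {
      sb[sb_tail % SB_ENTRIES] = (sb_entry_t){ addr, data };
      sb_tail++;
  }

  /* Load: search the store buffer newest-to-oldest first, so this
   * processor always sees its own stores; otherwise read the D$.     */
  uint64_t cpu_load(uint64_t addr) {
      for (int i = sb_tail - 1; i >= sb_head; i--)
          if (sb[i % SB_ENTRIES].addr == addr)
              return sb[i % SB_ENTRIES].data; /* store-to-load forwarding */
      return dmem[addr % MEM_WORDS];
  }

  /* Background drain: retire the oldest store into the D$. Until this
   * runs, no *other* processor can observe the buffered store.       */
  void sb_drain_one(void) {
      if (sb_head == sb_tail) return;         /* buffer empty */
      sb_entry_t e = sb[sb_head % SB_ENTRIES];
      dmem[e.addr % MEM_WORDS] = e.data;
      sb_head++;
  }

That gap between “my store happened” and “everyone can see my store” is exactly what causes the trouble on the next slides.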


Store Buffers & Consistency

  A = flag = 0;
  Processor 0          Processor 1
  A = 1;               while (!flag); // spin
  flag = 1;            print A;

• Consider the following execution:
  • Processor 0’s write to A misses the cache; it is put in the store buffer
  • Processor 0 keeps going
  • Processor 0’s write of “1” to flag hits and completes
  • Processor 1 reads flag… sees the value “1”
  • Processor 1 exits the loop
  • Processor 1 prints “0” for A
• Ramification: store buffers can cause “strange” behavior
  • How strange depends on lots of things

Coherence vs. Consistency

  A = 0; flag = 0;
  Processor 0          Processor 1
  A = 1;               while (!flag); // spin
  flag = 1;            print A;

• Intuition says: P1 prints A = 1
• Coherence says: absolutely nothing
  • P1 can see P0’s write of flag before its write of A!!! How?
  • P0 has a coalescing store buffer that reorders writes
  • Or out-of-order execution
  • Or the compiler reorders instructions
• Imagine trying to figure out why this code sometimes “works” and sometimes doesn’t
• Real systems act in this strange manner
  • What is allowed is defined as part of the processor’s ISA
• (a C11 version of this litmus test appears below)
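A hedged C11 rendering of the A/flag example (thread names p0/p1 are illustrative; compile with -pthread). With relaxed ordering, printing “A = 0” is allowed on weakly ordered hardware; the comments show where release/acquire would forbid it.

  #include <stdatomic.h>
  #include <stdio.h>
  #include <pthread.h>

  atomic_int A, flag;

  void *p0(void *arg) {
      atomic_store_explicit(&A, 1, memory_order_relaxed);    /* A = 1    */
      atomic_store_explicit(&flag, 1, memory_order_relaxed); /* flag = 1 */
      /* Safe version: memory_order_release on the flag store keeps the
       * write to A ordered before the write to flag.                   */
      return NULL;
  }

  void *p1(void *arg) {
      while (!atomic_load_explicit(&flag, memory_order_relaxed))
          ;                                                  /* spin     */
      /* Safe version: memory_order_acquire on the flag load keeps the
       * read of A ordered after the read of flag.                      */
      printf("A = %d\n", atomic_load_explicit(&A, memory_order_relaxed));
      return NULL;
  }

  int main(void) {
      pthread_t t0, t1;
      pthread_create(&t0, NULL, p0, NULL);
      pthread_create(&t1, NULL, p1, NULL);
      pthread_join(t0, NULL);
      pthread_join(t1, NULL);
      return 0;
  }

This is why “what is allowed” must be spelled out by the ISA (and, at the language level, by the C11 memory model): the same source can print 0 or 1 depending on how aggressively writes are buffered and reordered.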


Memory Consistency Models
• Sequential consistency (SC) (MIPS, PA-RISC)
  • Formal definition of the memory view programmers expect
  • Processors see their own loads and stores in program order
    + Provided naturally, even with out-of-order execution
  • But also: processors see others’ loads and stores in program order
  • And finally: all processors see the same global load/store ordering
    – Last two conditions not naturally enforced by coherence
  • Corresponds to some sequential interleaving of uniprocessor orders
  • Indistinguishable from a multi-programmed uniprocessor
• Processor consistency (PC) (x86, SPARC)
  • Allows an in-order store buffer
  • Stores can be deferred, but must be put into the cache in order
• Release consistency (RC) (ARM, Itanium, PowerPC)
  • Allows an un-ordered store buffer
  • Stores can be put into the cache in any order

Restoring Order
• Sometimes we need ordering (mostly we don’t)
  • Prime example: ordering between “lock” and data
• How? Insert fences (memory barriers)
  • Special instructions, part of the ISA
• Example: ensure that loads/stores don’t cross lock acquire/release operations

  acquire
  fence
  critical section
  fence
  release

• How do fences work?
  • They stall execution until write buffers are empty
  • Makes lock acquisition and release slow(er)
  • Use a synchronization library, don’t write your own
  • (a fenced spin-lock sketch follows)
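A hedged sketch of the acquire/fence/…/fence/release pattern above, as a test-and-set spin lock in C11 atomics. The names (locked_update, shared_data) are illustrative; in practice atomic_flag’s built-in acquire/release orderings imply the needed barriers, and the explicit atomic_thread_fence calls are written out only to mirror the slide’s structure.

  #include <stdatomic.h>

  static atomic_flag lock = ATOMIC_FLAG_INIT;
  static int shared_data;

  void locked_update(int v) {
      while (atomic_flag_test_and_set_explicit(&lock, memory_order_relaxed))
          ;                                       /* acquire: spin on t&s */
      atomic_thread_fence(memory_order_acquire);  /* fence: later accesses
                                                     can't float above    */

      shared_data = v;                            /* critical section     */

      atomic_thread_fence(memory_order_release);  /* fence: earlier accesses
                                                     can't sink below     */
      atomic_flag_clear_explicit(&lock, memory_order_relaxed); /* release */
  }

The fences are what make the lock meaningful under PC or RC: without them, the store to shared_data could drain from the store buffer after the lock release, letting another processor enter the critical section and read stale data. This is also why the slide’s advice stands: use a synchronization library rather than hand-rolling this.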


Shared Memory Summary
• Synchronization: regulated access to shared data
  • Key feature: atomic lock acquisition operation (e.g., t&s)
  • Performance optimizations: test-and-test-and-set, queue locks
• Coherence: consistent view of individual cache lines
  • Absolute coherence not needed; relative coherence is OK
  • VI and MSI protocols, cache-to-cache transfer optimization
  • Implementation? Snooping, directories
• Consistency: consistent view of all memory locations
  • Programmers intuitively expect sequential consistency (SC)
    • Global interleaving of individual processor access streams
    – Not always naturally provided; may prevent optimizations
  • Weaker ordering: consistency only for synchronization points

Flynn’s Taxonomy
• Proposed by Michael Flynn in 1966
• SISD – single instruction, single data
  • Traditional uniprocessor
• SIMD – single instruction, multiple data
  • Execute the same instruction on many data elements
  • Vector machines, graphics engines
• MIMD – multiple instruction, multiple data
  • Each processor executes its own instructions
  • Multicores are all built this way
• SPMD – single program, multiple data (extension proposed by Frederica Darema)
  • MIMD machine, each node executing the same code
• MISD – multiple instruction, single data


Summary

[Figure: layered system diagram — applications (App) over system software over Mem/CPU/I/O hardware]

• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models
