CIS 501: Introduction to Computer Architecture
Unit 11: Shared-Memory Multiprocessors

[Title figure: the hardware/software stack — Application, OS, Compiler, Firmware, CPU/I/O/Memory, Digital Circuits, Gates & Transistors]

This Unit: Shared-Memory Multiprocessors
• Three issues
  • Cache coherence
  • Synchronization
  • Memory consistency
• Two cache coherence approaches
  • "Snooping" (SMPs): < 16 processors
  • "Directory"/scalable: lots of processors

Readings
• H+P: Chapter 6

Thread-Level Parallelism

    struct acct_t { int bal; };
    shared struct acct_t accts[MAX_ACCT];
    int id, amt;
    if (accts[id].bal >= amt) {
      accts[id].bal -= amt;
      spew_cash();
    }

The if-block compiles to the following assembly (focus on this sequence):

    0: addi r1,accts,r3
    1: ld 0(r3),r4
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)
    5: call spew_cash

• Thread-level parallelism (TLP)
  • Collection of asynchronous tasks: not started and stopped together
  • Data shared loosely, dynamically
  • Example: database/web server (each query is a thread)
• accts is shared: it can't be register allocated, even if it were scalar
• id and amt are private variables, register allocated to r1 and r2
• Two threads running this sequence concurrently can race; see the sketch below
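To see the race concretely, here is a minimal, self-contained sketch assuming POSIX threads; the main() scaffolding, the fixed id/amt values, and the omission of spew_cash() are mine, not the slides'. Two threads each withdraw $100 from account #241. Because the load (instruction 1) and the store (instruction 4) are separate instructions, the two read-modify-write sequences can interleave and lose an update.

    #include <pthread.h>
    #include <stdio.h>

    #define MAX_ACCT 1024

    struct acct_t { int bal; };
    struct acct_t accts[MAX_ACCT];          /* shared */

    /* One withdrawal transaction: the same load/test/sub/store
     * sequence as instructions 0-5 on the slide. */
    void *withdraw(void *arg)
    {
        (void)arg;
        int id = 241, amt = 100;            /* private: would live in r1, r2 */
        if (accts[id].bal >= amt) {         /* 1: ld, 2: blt */
            accts[id].bal -= amt;           /* 3: sub, 4: st -- NOT atomic */
            /* spew_cash(); */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        accts[241].bal = 500;

        pthread_create(&t0, NULL, withdraw, NULL);
        pthread_create(&t1, NULL, withdraw, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        /* Expected 300; an unlucky interleaving of the two
         * load/store pairs can leave 400 instead. */
        printf("final balance: %d\n", accts[241].bal);
        return 0;
    }

This lost update is the synchronization problem listed among the unit's three issues. The coherence problem examined next is distinct: even a properly synchronized program needs the hardware to keep the cached copies of accts[241].bal consistent.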
Shared Memory
• Shared memory: multiple execution contexts sharing a single address space
  • Multiple programs (MIMD)
  • Or, more frequently, multiple copies of one program (SPMD)
• Implicit (automatic) communication via loads and stores
+ Simple software
  • No need for messages; communication happens naturally
  – Maybe too naturally
• Supports irregular, dynamic communication patterns
  • Both DLP and TLP
– Complex hardware
  • Must create a uniform view of memory
  • Several aspects to this, as we will see

Shared-Memory Multiprocessors
• Provide a shared-memory abstraction
  • Familiar and efficient for programmers

[Figure: processors P1–P4, each with a cache and bus interface, sharing one memory system]

Paired vs. Separate Processor/Memory?
• Separate processor/memory
  • Uniform memory access (UMA): equal latency to all memory
  + Simple software: doesn't matter where you put data
  – Lower peak performance
  • Bus-based UMAs common: symmetric multiprocessors (SMPs)
• Paired processor/memory
  • Non-uniform memory access (NUMA): faster to local memory
  – More complex software: where you put data matters
  + Higher peak performance, assuming proper data placement

[Figure: left, CPU($) blocks and memories M1–M4 on a shared interconnect (UMA); right, CPU($)+Mem+Router nodes connected point-to-point (NUMA)]

Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
  + Low latency
  – Low bandwidth: doesn't scale beyond ~16 processors
  + Shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
  – Longer latency: may need multiple "hops" to communicate
  + Higher bandwidth: scales to 1000s of processors
  – Cache coherence protocols are complex

[Figure: left, CPU($) blocks and memories on a shared bus; right, CPU($)+Mem+R nodes in a ring]

Organizing Point-To-Point Networks
• Network topology: organization of the network
  • Tradeoff: performance (connectivity, latency, bandwidth) vs. cost
• Router chips
  • Networks that require separate router chips are indirect
  • Networks that use processor/memory/router packages are direct
    + Fewer components, "glueless MP"
• Point-to-point network examples
  • Indirect tree (left)
  • Direct mesh or ring (right)

[Figure: left, an indirect tree of router chips with CPU($)+Mem leaves; right, a direct mesh of CPU($)+Mem+R nodes]

Implementation #1: Snooping Bus MP
• Two basic implementations; this is the first
• Bus-based systems
  • Typically small: 2–8 (maybe 16) processors
  • Typically processors split from memories (UMA)
    • Sometimes multiple processors on a single chip (CMP)
  • Symmetric multiprocessors (SMPs)
  • Common; I use one every day

Implementation #2: Scalable MP
• General point-to-point network-based systems
  • Typically processor/memory/router blocks (NUMA)
    • Glueless MP: no need for additional "glue" chips
  • Can be arbitrarily large: 1000s of processors
    • Massively parallel processors (MPPs)
  • In reality, only the government (DoD) has MPPs…
    • Companies have much smaller systems: 32–64 processors
    • Scalable multiprocessors

Issues for Shared Memory Systems
• Three in particular
  • Cache coherence
  • Synchronization
  • Memory consistency model
• Not unrelated to each other
• Different solutions for SMPs and MPPs

An Example Execution
• Two $100 withdrawals from account #241 at two ATMs
  • Each transaction maps to a thread on a different processor
  • Track accts[241].bal (its address is in r3)

    Processor 0              Processor 1
    0: addi r1,accts,r3
    1: ld 0(r3),r4
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)
    5: call spew_cash
                             0: addi r1,accts,r3
                             1: ld 0(r3),r4
                             2: blt r4,r2,6
                             3: sub r4,r2,r4
                             4: st r4,0(r3)
                             5: call spew_cash

No-Cache, No-Problem
• Scenario I: processors have no caches
• No problem: both withdrawals read and write memory directly

    Processor 0              Processor 1              Mem
    0: addi r1,accts,r3                               500
    1: ld 0(r3),r4                                    500
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)                                    400
    5: call spew_cash
                             0: addi r1,accts,r3
                             1: ld 0(r3),r4           400
                             2: blt r4,r2,6
                             3: sub r4,r2,r4
                             4: st r4,0(r3)           300
                             5: call spew_cash

Cache Incoherence
• Scenario II: processors have write-back caches
  • Potentially 3 copies of accts[241].bal: memory, P0$, P1$
  • Can get incoherent (inconsistent), as the sketch after this trace reproduces

    Processor 0              Processor 1            P0$    P1$    Mem
    0: addi r1,accts,r3                                           500
    1: ld 0(r3),r4                                  V:500         500
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)                                  D:400         500
    5: call spew_cash
                             0: addi r1,accts,r3
                             1: ld 0(r3),r4         D:400  V:500  500
                             2: blt r4,r2,6
                             3: sub r4,r2,r4
                             4: st r4,0(r3)         D:400  D:400  500
                             5: call spew_cash
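A few lines of C make this trace mechanical. The sketch below is a toy model of my own (the one-entry cache struct and the cache_read/cache_write names are hypothetical, not from the slides): each processor has a private write-back cache that fills from memory on a miss and never snoops. Replaying the two withdrawals ends with D:400 in both caches while memory still says 500, matching the last row of the trace.

    #include <stdio.h>
    #include <stdbool.h>

    /* A one-line write-back cache: hypothetical model for illustration. */
    struct cache {
        bool valid, dirty;
        int  data;
    };

    static int mem = 500;                   /* accts[241].bal in memory */

    /* Read through the cache: a miss fills from memory (no snooping!). */
    static int cache_read(struct cache *c)
    {
        if (!c->valid) {
            c->data  = mem;
            c->valid = true;
            c->dirty = false;
        }
        return c->data;
    }

    /* Write-back policy: update only the cached copy, mark it dirty. */
    static void cache_write(struct cache *c, int v)
    {
        c->data  = v;
        c->valid = true;
        c->dirty = true;                    /* memory is now stale */
    }

    static void withdraw(struct cache *c, int amt)
    {
        int bal = cache_read(c);            /* 1: ld        */
        if (bal >= amt)                     /* 2: blt       */
            cache_write(c, bal - amt);      /* 3-4: sub, st */
    }

    int main(void)
    {
        struct cache p0 = {0}, p1 = {0};

        withdraw(&p0, 100);                 /* processor 0's transaction */
        withdraw(&p1, 100);                 /* processor 1's transaction */

        /* Three disagreeing copies: prints P0$=D:400 P1$=D:400 mem=500. */
        printf("P0$=%c:%d P1$=%c:%d mem=%d\n",
               p0.dirty ? 'D' : 'V', p0.data,
               p1.dirty ? 'D' : 'V', p1.data, mem);
        return 0;
    }

Note that no interleaving is even needed here: the two transactions run back-to-back, and the hardware alone loses a withdrawal.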
Write-Thru Doesn't Help
• Scenario II again, but with write-thru caches
• This time only 2 (different) copies of accts[241].bal
• No problem? What if another withdrawal happens on processor 0?

    Processor 0              Processor 1            P0$    P1$    Mem
    0: addi r1,accts,r3                                           500
    1: ld 0(r3),r4                                  V:500         500
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)                                  V:400         400
    5: call spew_cash
                             0: addi r1,accts,r3
                             1: ld 0(r3),r4         V:400  V:400  400
                             2: blt r4,r2,6
                             3: sub r4,r2,r4
                             4: st r4,0(r3)         V:400  V:300  300
                             5: call spew_cash

What To Do?
• We have already seen this problem before, with DMA
  • The DMA controller acts as another processor
  • It changes cached versions behind the processor's back
• Possible solutions
  • No caches? Slow
  • Make shared data uncachable? Still slow
  • Timely flushes and wholesale invalidations? Slow, but OK for DMA
  • Hardware cache coherence
    + Minimal flushing, maximum caching → best performance

Hardware Cache Coherence
• Absolute coherence
  • All copies have the same data at all times
  – Hard to implement and slow
  + Not strictly necessary
• Relative coherence
  • Temporary incoherence is OK (e.g., write-back)
  • As long as all loads get the right values
• Coherence controller
  • Examines bus traffic (addresses and data)
  • Executes the coherence protocol: what to do with the local copy when you see different things happening on the bus

[Figure: a CPU whose coherence controller (CC) sits between the D$ tags/data and the bus]

Bus-Based Coherence Protocols
• Bus-based coherence protocols
  • Also called snooping or broadcast
  • ALL controllers see ALL transactions IN THE SAME ORDER
    • The protocol relies on this
• Three processor-side events
  • R: read
  • W: write
  • WB: write-back
• Two bus-side events
  • BR: bus-read, a read miss on another processor
  • BW: bus-write, a write miss or write-back on another processor
• Point-to-point network protocols also exist
  • Called directories

VI (MI) Coherence Protocol
• VI (valid-invalid) protocol, aka MI
• Two states (per block)
  • V (valid): have the block
  • I (invalid): don't have the block
  + Can implement with a valid bit
• Protocol diagram; convention: observed-event/generated-event
  • I → V on R/BR or W/BW (a miss fetches the block)
  • V → V on R or W (hits)
  • I → I on BR or BW (nothing to give up)
  • V → I on BR or BW, generating a WB if the local copy is dirty
• Summary: if anyone else wants to read or write the block
  • Give it up: transition to the I state
  • Write-back if your own copy is dirty (see the sketch after this slide's trace)
• This is an invalidate protocol
  • Alternative, an update protocol: copy the data around, don't invalidate
  • Sounds good, but wastes a lot of bandwidth

VI Protocol (Write-Back Cache)

    Processor 0              Processor 1            P0$    P1$    Mem
    0: addi r1,accts,r3                                           500
    1: ld 0(r3),r4                                  V:500         500
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)                                  V:400         500
    5: call spew_cash
                             0: addi r1,accts,r3
                             1: ld 0(r3),r4         I:WB   V:400  400
                             2: blt r4,r2,6
                             3: sub r4,r2,r4
                             4: st r4,0(r3)                V:300  400
                             5: call spew_cash

• The ld by processor 1 generates a BR
• Processor 0 responds by writing back (WB) its dirty copy and transitioning to I
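The VI state machine is small enough to write down directly. The sketch below is a minimal model under my own naming (vi_step, struct block, and the replayed event sequence are mine, not slide code); it applies the transitions listed above: a local read or write in I generates a bus-read or bus-write and moves to V, and an observed BR or BW in V surrenders the block, writing it back first if dirty.

    #include <stdio.h>
    #include <stdbool.h>

    enum vi_state { I, V };                       /* two states per block  */
    enum event    { R, W, BR, BW };               /* processor- & bus-side */

    struct block {
        enum vi_state state;
        bool dirty;
    };

    /* Apply one event to one cache's copy of a block.
     * Convention from the slide: observed-event/generated-event. */
    static void vi_step(struct block *b, enum event e)
    {
        switch (b->state) {
        case I:
            if (e == R) { puts("I: R/BR -> V"); b->state = V; }
            if (e == W) { puts("I: W/BW -> V"); b->state = V; b->dirty = true; }
            /* BR/BW observed while in I: ignore, we hold no copy */
            break;
        case V:
            if (e == W) b->dirty = true;          /* R/W: hit, stay in V */
            if (e == BR || e == BW) {             /* someone else wants it */
                if (b->dirty) puts("V: BR|BW/WB -> I (write back dirty data)");
                else          puts("V: BR|BW -> I");
                b->state = I;
                b->dirty = false;
            }
            break;
        }
    }

    int main(void)
    {
        /* Replay processor 0's view of the VI trace above. */
        struct block b = { I, false };
        vi_step(&b, R);    /* 1: ld  -> V:500                 */
        vi_step(&b, W);    /* 4: st  -> V:400, dirty          */
        vi_step(&b, BR);   /* P1's ld hits the bus: WB, -> I  */
        return 0;
    }

Even this toy version shows the cost of VI: processor 0 loses its copy on processor 1's read, even though both processors could safely share a clean block. That observation motivates the move to MSI below.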

VI → MSI
• VI protocol is inefficient

MSI Protocol (Write-Back Cache)
