Shared-Memory Multiprocessors Gates & Transistors

This Unit: Shared Memory Multiprocessors Application • Three issues OS • Cache coherence Compiler Firmware CIS 501 • Synchronization • Memory consistency Introduction To Computer Architecture CPU I/O Memory • Two cache coherence approaches • “Snooping” (SMPs): < 16 processors Digital Circuits • “Directory”/Scalable: lots of processors Unit 11: Shared-Memory Multiprocessors Gates & Transistors CIS 501 (Martin/Roth): Shared Memory Multiprocessors 1 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 2 Readings Thread-Level Parallelism struct acct_t { int bal; }; • H+P shared struct acct_t accts[MAX_ACCT]; • Chapter 6 int id,amt; 0: addi r1,accts,r3 if (accts[id].bal >= amt) 1: ld 0(r3),r4 { 2: blt r4,r2,6 accts[id].bal -= amt; 3: sub r4,r2,r4 spew_cash(); 4: st r4,0(r3) } 5: call spew_cash • Thread-level parallelism (TLP) • Collection of asynchronous tasks: not started and stopped together • Data shared loosely, dynamically • Example: database/web server (each query is a thread) • accts is shared, can’t register allocate even if it were scalar • id and amt are private variables, register allocated to r1, r2 • Focus on this CIS 501 (Martin/Roth): Shared Memory Multiprocessors 3 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 4 Shared Memory Shared-Memory Multiprocessors • Shared memory • Provide a shared-memory abstraction • Multiple execution contexts sharing a single address space • Familiar and efficient for programmers • Multiple programs (MIMD) • Or more frequently: multiple copies of one program (SPMD) P1 P2 P3 P4 • Implicit (automatic) communication via loads and stores + Simple software • No need for messages, communication happens naturally – Maybe too naturally • Supports irregular, dynamic communication patterns Memory System • Both DLP and TLP – Complex hardware • Must create a uniform view of memory • Several aspects to this as we will see CIS 501 (Martin/Roth): Shared Memory Multiprocessors 5 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 6 Shared-Memory Multiprocessors Paired vs. Separate Processor/Memory? • Provide a shared-memory abstraction • Separate processor/memory • Familiar and efficient for programmers • Uniform memory access (UMA): equal latency to all memory + Simple software, doesn’t matter where you put data P P P P – Lower peak performance 1 2 3 4 • Bus-based UMAs common: symmetric multi-processors (SMP) • Paired processor/memory Cache Cache Cache Cache • Non-uniform memory access (NUMA): faster to local memory M1 M2 M3 M4 – More complex software: where you put data matters Interface Interface Interface Interface + Higher peak performance: assuming proper data placement CPU($) CPU($) CPU($) CPU($) Interconnection Network CPU($) CPU($) CPU($) CPU($) Mem R Mem R Mem R Mem R Mem Mem Mem Mem CIS 501 (Martin/Roth): Shared Memory Multiprocessors 7 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 8 Shared vs. Point-to-Point Networks Organizing Point-To-Point Networks • Shared network: e.g., bus (left) • Network topology: organization of network + Low latency • Tradeoff performance (connectivity, latency, bandwidth) ! cost – Low bandwidth: doesn’t scale beyond ~16 processors • Router chips + Shared property simplifies cache coherence protocols (later) • Networks that require separate router chips are indirect • Point-to-point network: e.g., mesh or ring (right) • Networks that use processor/memory/router packages are direct – Longer latency: may need multiple “hops” to communicate + Fewer components, “Glueless MP” + Higher bandwidth: scales to 1000s of processors • Point-to-point network examples – Cache coherence protocols are complex • Indirect tree (left) • Direct mesh or ring (right) CPU($) CPU($) CPU($) CPU($) CPU($) CPU($) R CPU($) CPU($) Mem R Mem R Mem R Mem R Mem R R Mem Mem R R Mem R R Mem R R Mem Mem R Mem R Mem R Mem R Mem R R Mem CPU($) CPU($) CPU($) CPU($) CPU($) CPU($) CPU($) CPU($) CIS 501 (Martin/Roth): Shared Memory Multiprocessors 9 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 10 Implementation #1: Snooping Bus MP Implementation #1: Scalable MP CPU($) CPU($) CPU($) CPU($) CPU($) CPU($) Mem R R Mem Mem R R Mem Mem Mem Mem Mem CPU($) CPU($) • Two basic implementations • General point-to-point network-based systems • Typically processor/memory/router blocks (NUMA) • Bus-based systems • Glueless MP: no need for additional “glue” chips • Typically small: 2–8 (maybe 16) processors • Can be arbitrarily large: 1000’s of processors • Typically processors split from memories (UMA) • Massively parallel processors (MPPs) • Sometimes multiple processors on single chip (CMP) • In reality only government (DoD) has MPPs… • Symmetric multiprocessors (SMPs) • Companies have much smaller systems: 32–64 processors • Common, I use one everyday • Scalable multi-processors CIS 501 (Martin/Roth): Shared Memory Multiprocessors 11 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 12 Issues for Shared Memory Systems An Example Execution • Three in particular Processor 0 Processor 1 0: addi r1,accts,r3 CPU0 CPU1 Mem • Cache coherence 1: ld 0(r3),r4 • Synchronization 2: blt r4,r2,6 • Memory consistency model 3: sub r4,r2,r4 4: st r4,0(r3) • Not unrelated to each other 5: call spew_cash 0: addi r1,accts,r3 1: ld 0(r3),r4 2: blt r4,r2,6 • Different solutions for SMPs and MPPs 3: sub r4,r2,r4 4: st r4,0(r3) 5: call spew_cash • Two $100 withdrawals from account #241 at two ATMs • Each transaction maps to thread on different processor • Track accts[241].bal (address is in r3) CIS 501 (Martin/Roth): Shared Memory Multiprocessors 13 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 14 No-Cache, No-Problem Cache Incoherence Processor 0 Processor 1 Processor 0 Processor 1 0: addi r1,accts,r3 500 0: addi r1,accts,r3 500 1: ld 0(r3),r4 500 1: ld 0(r3),r4 V:500 500 2: blt r4,r2,6 2: blt r4,r2,6 3: sub r4,r2,r4 3: sub r4,r2,r4 4: st r4,0(r3) 400 4: st r4,0(r3) D:400 500 5: call spew_cash 0: addi r1,accts,r3 5: call spew_cash 0: addi r1,accts,r3 1: ld 0(r3),r4 400 1: ld 0(r3),r4 D:400 V:500 500 2: blt r4,r2,6 2: blt r4,r2,6 3: sub r4,r2,r4 3: sub r4,r2,r4 4: st r4,0(r3) 300 4: st r4,0(r3) D:400 D:400 500 5: call spew_cash 5: call spew_cash • Scenario I: processors have no caches • Scenario II: processors have write-back caches • No problem • Potentially 3 copies of accts[241].bal: memory, p0$, p1$ • Can get incoherent (inconsistent) CIS 501 (Martin/Roth): Shared Memory Multiprocessors 15 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 16 Write-Thru Doesn’t Help What To Do? Processor 0 Processor 1 • Have already seen this problem before with DMA 0: addi r1,accts,r3 500 • DMA controller acts as another processor 1: ld 0(r3),r4 V:500 500 2: blt r4,r2,6 • Changes cached versions behind processor’s back 3: sub r4,r2,r4 4: st r4,0(r3) V:400 400 • Possible solutions 5: call spew_cash 0: addi r1,accts,r3 1: ld 0(r3),r4 V:400 V:400 400 • No caches? Slow 2: blt r4,r2,6 • Make shared data uncachable? Still slow 3: sub r4,r2,r4 • Timely flushes and wholesale invalidations? Slow, but OK for DMA 4: st r4,0(r3) V:400 V:300 300 • Hardware cache coherence 5: call spew_cash + Minimal flushing, maximum caching " best performance • Scenario II: processors have write-thru caches • This time only 2 (different) copies of accts[241].bal • No problem? What if another withdrawal happens on processor 0? CIS 501 (Martin/Roth): Shared Memory Multiprocessors 17 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 18 Hardware Cache Coherence Bus-Based Coherence Protocols • Absolute coherence • Bus-based coherence protocols CPU • All copies have same data at all times • Also called snooping or broadcast – Hard to implement and slow • ALL controllers see ALL transactions IN SAME ORDER + Not strictly necessary • Protocol relies on this • Three processor-side events • Relative coherence • R: read CC • Temporary incoherence OK (e.g., write-back) • W: write D$ tags D$ data • As long as all loads get right values • WB: write-back • Two bus-side events • Coherence controller: • BR: bus-read, read miss on another processor bus • Examines bus traffic (addresses and data) • BW: bus-write, write miss or write-back on another processor • Executes coherence protocol • What to do with local copy when you see • Point-to-point network protocols also exist different things happening on bus • Called directories CIS 501 (Martin/Roth): Shared Memory Multiprocessors 19 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 20 VI (MI) Coherence Protocol VI Protocol (Write-Back Cache) BR/BW • VI (valid-invalid) protocol: aka MI Processor 0 Processor 1 • Two states (per block) 0: addi r1,accts,r3 500 1: ld 0(r3),r4 V:500 500 I • V (valid): have block 2: blt r4,r2,6 • I (invalid): don’t have block 3: sub r4,r2,r4 # + Can implement with valid bit 4: st r4,0(r3) V:400 500 W B • Protocol diagram (left) 5: call spew_cash 0: addi r1,&accts,r3 B, WB # 1: ld 0(r3),r4 I:WB V:400 400 W • Convention: event#generated-event 2: blt r4,r2,6 # • Summary R, W 3: sub r4,r2,r4 B • If anyone wants to read/write block 4: st r4,0(r3) V:300 400 # • Give it up: transition to I state R BR/BW 5: call spew_cash • Write-back if your own copy is dirty • This is an invalidate protocol • ld by processor 1 generates a BR V • Update protocol: copy data, don’t invalidate • processor 0 responds by WB its dirty copy, transitioning to I R/W • Sounds good, but wastes a lot of bandwidth CIS 501 (Martin/Roth): Shared Memory Multiprocessors 21 CIS 501 (Martin/Roth): Shared Memory Multiprocessors 22 VI " MSI MSI Protocol (Write-Back Cache) BR/BW • VI protocol is inefficient Processor 0

Shared-Memory Multiprocessors Gates & Transistors

Parallel Computer Architecture

Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach

Thread-Level Parallelism – Part 1

Efficient Synchronization Mechanisms for Scalable GPU Architectures

CIS 501 Computer Architecture This Unit: Shared Memory

Tardis: Time Traveling Coherence Algorithm for Distributed Shared Memory

Gpgpu & Accelerators

CIS 501 Computer Architecture This Unit: Shared Memory Multiprocessors Readings Multiplying Performance

Snooping Cache Coherence I

Memory Coherence in Shared Virtual Systems

Introduction and Cache Coherency

1 Parallel Computer Architecture Concepts