Shared Memory Multiprocessors Readings Multiplying Performance

This Unit: Shared Memory Multiprocessors App App App •! Thread-level parallelism (TLP) System software •! Shared memory model •! Multiplexed uniprocessor CIS 371 Mem CPUCPU I/O CPUCPU •! Hardware multihreading CPUCPU Computer Organization and Design •! Multiprocessing •! Synchronization •! Lock implementation Unit 9: Multicore •! Locking gotchas (Shared Memory Multiprocessors) •! Cache coherence •! Bus-based protocols •! Directory protocols •! Memory consistency models CIS 371 (Martin/Roth): Multicore 1 CIS 371 (Martin/Roth): Multicore 2 Readings Multiplying Performance •! P&H •! A single processor can only be so fast •! Chapter 7.1-7.3 •! Limited clock frequency •! Limited instruction-level parallelism •! Limited cache hierarchy •! What if we need even more computing power? •! Use multiple processors! •! But how? •! High-end example: Sun Ultra Enterprise 25k •! 72 UltraSPARC IV+ processors, 1.5Ghz •! 1024 GBs of memory •! Niche: large database servers •! $$$ CIS 371 (Martin/Roth): Multicore 3 CIS 371 (Martin/Roth): Multicore 4 Multicore: Mainstream Multiprocessors Application Domains for Multiprocessors •! Multicore chips •! Scientific computing/supercomputing •! IBM Power5 •! Examples: weather simulation, aerodynamics, protein folding Core 1 Core 2 •! Two 2+GHz PowerPC cores •! Large grids, integrating changes over time •! Shared 1.5 MB L2, L3 tags •! Each processor computes for a part of the grid •! AMD Quad Phenom •! Server workloads •! Four 2+ GHz cores •! Example: airline reservation database 1.5MB L2 •! Per-core 512KB L2 cache •! Many concurrent updates, searches, lookups, queries •! Shared 2MB L3 cache •! Processors handle different requests •! Intel Core i7 Quad •! Media workloads •! Four cores, private L2s •! Processors compress/decompress different parts of image/frames L3 tags •! Shared 6 MB L3 •! Desktop workloads… •! Sun Niagara •! Gaming workloads… Why multicore? What else would •! 8 cores, each 4-way threaded But software must be written to expose parallelism you do with 500 million transistors? •! Shared 2MB L2, shared FP •! For servers, not desktop CIS 371 (Martin/Roth): Multicore 5 CIS 371 (Martin/Roth): Multicore 6 But First, Uniprocessor Concurrency Multithreaded Programming Model •! Software “thread” •! Programmer explicitly creates multiple threads •! Independent flow of execution •! Context state: PC, registers •! All loads & stores to a single shared memory space •! Threads generally share the same memory space •! Each thread has a private stack frame for local variables •! “Process” like a thread, but different memory space •! A “thread switch” can occur at any time •! Java has thread support built in, C/C++ supports P-threads library •! Pre-emptive multithreading by OS •! Generally, system software (the O.S.) manages threads •! Common uses: •! “Thread scheduling”, “context switching” •! Handling user interaction (GUI programming) •! All threads share the one processor •! Handling I/O latency (send network message, wait for response) •! Hardware timer interrupt occasionally triggers O.S. •! Expressing parallel work via Thread-Level Parallelism (TLP) •! Quickly swapping threads gives illusion of concurrent execution •! Much more in CIS380 CIS 371 (Martin/Roth): Multicore 7 CIS 371 (Martin/Roth): Multicore 8 Aside: Hardware Multithreading Simplest Multiprocessor PC PC Regfile0 Regfile PC I$ D$ Regfile1 I$ D$ THR PC •! Hardware Multithreading (MT) Regfile •! Multiple threads dynamically share a single pipeline (caches) • Replicate thread contexts: PC and register file ! •! Replicate entire processor pipeline! •! Coarse-grain MT: switch on L2 misses Why? •! Instead of replicating just register file & PC •! Simultaneous MT: no explicit switching, fine-grain interleaving •! Exception: share caches (we’ll address this bottleneck later) •! Pentium4 is 2-way hyper-threaded, leverages out-of-order core +! MT Improves utilization and throughput •! Same “shared memory” or “multithreaded” model •! Single programs utilize <50% of pipeline (branch, cache miss) •! Loads and stores from two processors are interleaved •! MT does not improve single-thread performance •! Advantages/disadvantages over hardware multithreading? •! Individual threads run as fast or even slower CIS 371 (Martin/Roth): Multicore 9 CIS 371 (Martin/Roth): Multicore 10 Shared Memory Implementations Shared Memory Issues •! Multiplexed uniprocessor •! Three in particular, not unrelated to each other •! Runtime system and/or OS occasionally pre-empt & swap threads •! Interleaved, but no parallelism •! Synchronization •! How to regulate access to shared data? •! Hardware multithreading •! How to implement critical sections? •! Tolerate pipeline latencies, higher efficiency •! Same interleaved shared-memory model •! Cache coherence •! How to make writes to one cache “show up” in others? •! Multiprocessing •! Multiply execution resources, higher peak performance •! Memory consistency model •! Same interleaved shared-memory model • How to keep programmer sane while letting hardware optimize? •! Foreshadowing: allow private caches, further disentangle cores ! •! How to reconcile shared memory with out-of-order execution? •! All have same shared memory programming model CIS 371 (Martin/Roth): Multicore 11 CIS 371 (Martin/Roth): Multicore 12 Example: Parallelizing Matrix Multiply Example: Thread-Level Parallelism struct acct_t { int bal; }; shared struct acct_t accts[MAX_ACCT]; X = int id, amt; 0: addi r1,accts,r3 A B C if (accts[id].bal >= amt) 1: ld 0(r3),r4 { 2: blt r4,r2,6 for (I = 0; I < 100; I++) accts[id].bal -= amt; 3: sub r4,r2,r4 for (J = 0; J < 100; J++) give_cash(); 4: st r4,0(r3) for (K = 0; K < 100; K++) } 5: call give_cash C[I][J] += A[I][K] * B[K][J]; •! How to parallelize matrix multiply over 100 processors? •! Thread-level parallelism (TLP) •! One possibility: give each processor an iteration of I •! Collection of asynchronous tasks: not started and stopped together for (J = 0; J < 100; J++) •! Data shared “loosely” (sometimes yes, mostly no), dynamically for (K = 0; K < 100; K++) • Example: database/web server (each query is a thread) C[my_id()][J] += A[my_id()][K] * B[K][J]; ! •! accts is shared, can’t register allocate even if it were scalar •! Each processor runs copy of loop above •! id and amt are private variables, register allocated to r1, r2 •! my_id() function gives each processor ID from 0 to N •! Running example •! Parallel processing library provides this function CIS 371 (Martin/Roth): Multicore 13 CIS 371 (Martin/Roth): Multicore 14 An Example Execution A Problem Execution Time Time Thread 0 Thread 1 Mem Thread 0 Thread 1 Mem 0: addi r1,accts,r3 0: addi r1,accts,r3 500 500 1: ld 0(r3),r4 1: ld 0(r3),r4 2: blt r4,r2,6 2: blt r4,r2,6 3: sub r4,r2,r4 3: sub r4,r2,r4 4: st r4,0(r3) 400 <<< Interrupt >>> 5: call give_cash 0: addi r1,accts,r3 0: addi r1,accts,r3 1: ld 0(r3),r4 1: ld 0(r3),r4 2: blt r4,r2,6 2: blt r4,r2,6 3: sub r4,r2,r4 3: sub r4,r2,r4 4: st r4,0(r3) 300 4: st r4,0(r3) 400 5: call give_cash 5: call give_cash 4: st r4,0(r3) 400 •! Two $100 withdrawals from account #241 at two ATMs 5: call give_cash •! Each transaction maps to thread on different processor •! Problem: wrong account balance! Why? •! Track accts[241].bal (address is in r3) •! Solution: synchronize access to account balance CIS 371 (Martin/Roth): Multicore 15 CIS 371 (Martin/Roth): Multicore 16 Synchronization A Synchronized Execution Time •! Synchronization: a key issue for shared memory Thread 0 Thread 1 Mem call acquire(lock) •! Regulate access to shared data (mutual exclusion) 500 •! Software constructs: semaphore, monitor, mutex 0: addi r1,accts,r3 1: ld 0(r3),r4 •! Low-level primitive: lock 2: blt r4,r2,6 •! Operations: acquire(lock)and release(lock) 3: sub r4,r2,r4 •! Region between acquire and release is a critical section <<< Interrupt >>> call acquire(lock) Spins! •! Must interleave acquire and release <<< Interrupt >>> •! Interfering acquire will block 4: st r4,0(r3) 400 call release(lock) struct acct_t { int bal; }; shared struct acct_t accts[MAX_ACCT]; 5: call give_cash (still in acquire) shared int lock; 0: addi r1,accts,r3 int id, amt; 1: ld 0(r3),r4 acquire(lock); •! Fixed, but how do 2: blt r4,r2,6 if (accts[id].bal >= amt) { accts[id].bal -= amt; // critical section we implement 3: sub r4,r2,r4 give_cash(); } acquire & release? 4: st r4,0(r3) 300 release(lock); 5: call give_cash CIS 371 (Martin/Roth): Multicore 17 CIS 371 (Martin/Roth): Multicore 18 Strawman Lock (Incorrect) Strawman Lock (Incorrect) Time •! Spin lock: software lock implementation Thread 0 Thread 1 Mem A0: ld 0(&lock),r6 •! acquire(lock): while (lock != 0) {} lock = 1; 0 A1: bnez r6,#A0 A0: ld r6,0(&lock) •! “Spin” while lock is 1, wait for it to turn 0 A2: addi r6,1,r6 A1: bnez r6,#A0 A0: ld 0(&lock),r6 A3: st r6,0(&lock) A2: addi r6,1,r6 1 A1: bnez r6,A0 CRITICAL_SECTION A3: st r6,0(&lock) 1 A2: addi r6,1,r6 CRITICAL_SECTION A3: st r6,0(&lock) •! release(lock): lock = 0; •! Spin lock makes intuitive sense, but doesn’t actually work R0: st r0,0(&lock) // r0 holds 0 •! Loads/stores of two acquire sequences can be interleaved •! Lock acquire sequence also not atomic •! Same problem as before! •! Note, release is trivially atomic CIS 371 (Martin/Roth): Multicore 19 CIS 371 (Martin/Roth): Multicore 20 A Correct Implementation: SYSCALL Lock Better Spin Lock: Use Atomic Swap ACQUIRE_LOCK: •! ISA provides an atomic lock acquisition instruction A1: disable_interrupts atomic •! Example: atomic swap A2: ld r6,0(&lock) swap r1,0(&lock) mov r1->r2 A3: bnez r6,#A0 •! Atomically

Shared Memory Multiprocessors Readings Multiplying Performance

2.5 Classification of Parallel Computers

Chapter 5 Multiprocessors and Thread-Level Parallelism

Chapter 5 Thread-Level Parallelism

Trafficdb: HERE's High Performance Shared-Memory Data Store

Scheduling of Shared Memory with Multi - Core Performance in Dynamic Allocator Using Arm Processor

The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors

Lecture 17: Multiprocessors

Use of Shared Memory in the Context of Embedded Multi-Core Processors: Exploration of the Technology and Its Limits

GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use

Instruction Level Parallelism Example

CIS 501 Computer Architecture This Unit: Shared Memory

Programming with Shared Memory PART I