
This Unit: Multiprocessors

CIS 501: Computer Architecture
Unit 9: Multicore (Shared Memory Multiprocessors)

Slides originally developed by Amir Roth with contributions by Milo Martin at University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

[Diagram: App / App / App above System software, above Mem, CPUs, I/O]

• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models

CIS 501 (Martin): Multicore

Readings

• Textbook (MA:FSPTCM)
  • Sections 7.0, 7.1.3, 7.2-7.4
  • Section 8.2

Beyond

• Consider "daxpy":

  daxpy(double *x, double *y, double *z, double a):
    for (i = 0; i < SIZE; i++)
      z[i] = a*x[i] + y[i];

• Lots of instruction-level parallelism (ILP)
  • Great!
  • But how much can we really exploit? 4 wide? 8 wide?
  • Limits to (efficient) super-scalar execution
• But, if SIZE is 10,000, the loop has 10,000-way parallelism!
  • How do we exploit it?

Multiplying Performance

• A single processor can only be so fast
  • Limited clock frequency
  • Limited instruction-level parallelism
  • Limited cache hierarchy
• What if we need even more computing power?
  • Use multiple processors!
• But how?
  • High-end example: Sun Ultra Enterprise 25k
    • 72 UltraSPARC IV+ processors, 1.5GHz
    • 1024 GBs of memory
    • Niche: large servers
    • $$$

Multiplying Performance: "daxpy" Revisited

• Consider "daxpy":
• Break it up into N "chunks" on N cores!
  • Done by the programmer (or maybe a really smart compiler)

  daxpy(int chunk_id, double *x, double *y, double *z, double a):
    chunk_size = SIZE / N
    my_start = chunk_id * chunk_size
    my_end = my_start + chunk_size
    for (i = my_start; i < my_end; i++)
      z[i] = a*x[i] + y[i]

  Example: SIZE = 400, N = 4
  Chunk ID   Start   End
  0          0       99
  1          100     199
  2          200     299
  3          300     399

• Assumes
  • Local variables are "private" and x, y, and z are "shared"
  • Assumes SIZE is a multiple of N (that is, SIZE % N == 0)

Multicore: Mainstream Multiprocessors

• Multicore chips
• IBM Power5
  • Two 2+ GHz PowerPC cores
  • Shared 1.5 MB L2, L3 tags
  [Die photo: Core 1, Core 2, 1.5MB L2, L3 tags]
• AMD Quad Phenom
  • Four 2+ GHz cores
  • Per-core 512KB L2 cache
  • Shared 2MB L3 cache
• Intel Core i7 Quad
  • Four cores, private L2s
  • Shared 6 MB L3
• Sun Niagara
  • 8 cores, each 4-way threaded
  • Shared 2MB L2, shared FP
  • For servers, not desktop
• Why multicore? What else would you do with 1 billion transistors?

Sun Niagara II
[Die photo]

Intel Quad-Core "Core i7"
[Die photo]

Application Domains for Multiprocessors

• Scientific computing/supercomputing
  • Examples: weather simulation, aerodynamics, protein folding
  • Large grids, integrating changes over time
  • Each processor computes for a part of the grid
• Server workloads
  • Example: airline reservation database
  • Many concurrent updates, searches, lookups, queries
  • Processors handle different requests
• Media workloads
  • Processors compress/decompress different parts of image/frames
• Desktop workloads…
• Gaming workloads…
• But software must be written to expose parallelism


"THREADING" & SHARED MEMORY EXECUTION MODEL

First, Uniprocessor Concurrency

• Software "thread": Independent flows of execution
  • "private" per-thread state
    • Context state: PC, registers
    • Stack (per-thread local variables)
  • "shared" state: Globals, heap, etc.
  • Threads generally share the same memory space
  • A "process" is like a thread, but with a different memory space
  • Java has thread support built in, C/C++ supports P-threads
• Generally, system software (the O.S.) manages threads
  • "Thread scheduling", "context switching"
  • In single-core system, all threads share the one processor
    • Hardware timer interrupt occasionally triggers O.S.
  • Quickly swapping threads gives illusion of concurrent execution
• Much more in an operating systems course

Multithreaded Programming Model

• Programmer explicitly creates multiple threads
• All loads & stores to a single shared memory space
• Each thread has a private stack frame for local variables
• A "thread switch" can occur at any time
  • Pre-emptive multithreading by OS
• Common uses:
  • Handling user interaction (GUI programming)
  • Handling I/O latency (send network message, wait for response)
  • Expressing parallel work via Thread-Level Parallelism (TLP)
    • This is our focus!

Simplest Multiprocessor

[Diagram: two PC/register-file pairs sharing one I$ and one D$]

• Replicate entire processor!
  • Instead of replicating just register file & PC
  • Exception: share the caches (we'll address this bottleneck later)
• Multiple threads execute
  • "Shared memory" programming model
  • Operations (loads and stores) are interleaved at random
  • Loads return the value written by most recent store to location


Alternative: Hardware Multithreading

[Diagram: one pipeline with two PCs and two register files (Regfile0, Regfile1), a thread-select (THR), shared I$ and D$]

• Hardware Multithreading (MT)
  • Multiple threads dynamically share a single pipeline
  • Replicate only per-thread structures: program counter & registers
  • Hardware interleaves instructions
+ Multithreading improves utilization and throughput
  • Single programs utilize <50% of pipeline (branch, cache miss)
• Multithreading does not improve single-thread performance
  • Individual threads run as fast or even slower. Why?
• Coarse-grain MT: switch on L2 misses
• Simultaneous MT: no explicit switching, fine-grain interleaving

Shared Memory Implementations

• Multiplexed uniprocessor
  • Runtime system and/or OS occasionally pre-empt & swap threads
  • Interleaved, but no parallelism
• Hardware multithreading
  • Tolerate pipeline latencies, higher efficiency
  • Same interleaved shared-memory model
• Multiprocessing
  • Multiply execution resources, higher peak performance
  • Same interleaved shared-memory model
  • Foreshadowing: allow private caches, further disentangle cores
• All support the shared memory programming model

Four Shared Memory Issues

1. Parallel programming • How does the programmer express the parallelism?

2. Synchronization • How to regulate access to shared data? • How to implement “locks”?

3. Cache coherence
• If cores have private (non-shared) caches
• How to make writes to one cache "show up" in others?

4. Memory consistency models
• How to keep programmer sane while letting hardware optimize?
• How to reconcile shared memory with store buffers?

PARALLEL PROGRAMMING


Parallel Programming

• One use of multiprocessors: multiprogramming
  • Running multiple programs with no interaction between them
  • Works great for a few cores, but what next?
• Or, programmers must explicitly express parallelism
  • "Coarse" parallelism beyond what the hardware can extract implicitly
  • Even the compiler can't extract it in most cases
• How?
  • Call libraries that perform well-known computations in parallel
    • Example: a matrix multiply routine, etc.
    • Parallel "for" loops, task-based parallelism, …
  • Add code annotations ("this loop is parallel"), OpenMP
  • Explicitly spawn "threads", OS schedules them on the cores
• Parallel programming: key challenge in multicore revolution

Example: Parallelizing Matrix Multiply

  C = A × B
  for (I = 0; I < 100; I++)
    for (J = 0; J < 100; J++)
      for (K = 0; K < 100; K++)
        C[I][J] += A[I][K] * B[K][J];

• How to parallelize matrix multiply?
  • Replace outer "for" loop with "parallel_for"
    • Support by many parallel programming environments
  • Implementation: give each of N processors loop iterations

  int start = (100/N) * my_id();
  for (I = start; I < start + 100/N; I++)
    for (J = 0; J < 100; J++)
      for (K = 0; K < 100; K++)
        C[I][J] += A[I][K] * B[K][J];

  • Each processor runs copy of loop above
  • Library provides my_id() function

Example: Bank Accounts

• Consider shared

  struct acct_t { int balance; … };
  struct acct_t accounts[MAX_ACCT];        // current balances
  struct trans_t { int id; int amount; };
  struct trans_t transactions[MAX_TRANS];  // debit amounts

  for (i = 0; i < MAX_TRANS; i++) {
    debit(transactions[i].id, transactions[i].amount);
  }

  void debit(int id, int amount) {
    if (accounts[id].balance >= amount) {
      accounts[id].balance -= amount;
    }
  }

• Example of thread-level parallelism (TLP)
  • Collection of asynchronous tasks: not started and stopped together
  • Data shared "loosely" (sometimes yes, mostly no), dynamically
  • Example: database/web server (each query is a thread)
• Can we do these "debit" operations in parallel?
  • Does the order matter?

Example: Bank Accounts

  struct acct_t { int bal; … };
  shared struct acct_t accts[MAX_ACCT];
  void debit(int id, int amt) {        0: addi r1,accts,r3
    if (accts[id].bal >= amt) {        1: ld 0(r3),r4
      accts[id].bal -= amt;            2: blt r4,r2,done
    }                                  3: sub r4,r2,r4
  }                                    4: st r4,0(r3)

• accts is global and thus shared, can't register allocate
• id and amt are private variables, register allocated to r1, r2
• Running example


An Example Execution

  Time  Thread 0              Thread 1              Mem
        0: addi r1,accts,r3                         500
        1: ld 0(r3),r4
        2: blt r4,r2,done
        3: sub r4,r2,r4
        4: st r4,0(r3)                              400
                              0: addi r1,accts,r3
                              1: ld 0(r3),r4
                              2: blt r4,r2,done
                              3: sub r4,r2,r4
                              4: st r4,0(r3)        300

• Two $100 withdrawals from account #241 at two ATMs
  • Each transaction executed on different processor
  • Track accts[241].bal (address is in r3)

A Problem Execution

  Time  Thread 0              Thread 1              Mem
        0: addi r1,accts,r3                         500
        1: ld 0(r3),r4
        2: blt r4,r2,done
        3: sub r4,r2,r4
        <<< Switch >>>
                              0: addi r1,accts,r3
                              1: ld 0(r3),r4
                              2: blt r4,r2,done
                              3: sub r4,r2,r4
                              4: st r4,0(r3)        400
        4: st r4,0(r3)                              400

• Problem: wrong account balance! Why?
• Solution: synchronize access to account balance

Synchronization

• Synchronization: a key issue for shared memory
• Regulate access to shared data (mutual exclusion)
• Low-level primitive: lock (higher-level: "semaphore" or "mutex")
  • Operations: acquire(lock) and release(lock)
  • Region between acquire and release is a critical section
  • Must interleave acquire and release
    • Interfering acquire will block
• Another option: barrier synchronization
  • Blocks until all threads reach barrier, used at end of "parallel_for"

  struct acct_t { int bal; … };
  shared struct acct_t accts[MAX_ACCT];
  shared int lock;
  void debit(int id, int amt):
    acquire(lock);
    if (accts[id].bal >= amt) {    // critical section
      accts[id].bal -= amt;
    }
    release(lock);

SYNCHRONIZATION


A Synchronized Execution

  Time  Thread 0                 Thread 1                 Mem
        call acquire(lock)                                500
        0: addi r1,accts,r3
        1: ld 0(r3),r4
        2: blt r4,r2,done
        3: sub r4,r2,r4
        <<< Switch >>>
                                 call acquire(lock)
                                 Spins!
        <<< Switch >>>
        4: st r4,0(r3)                                    400
        call release(lock)
                                 (still in acquire)
                                 0: addi r1,accts,r3
                                 1: ld 0(r3),r4
                                 2: blt r4,r2,done
                                 3: sub r4,r2,r4
                                 4: st r4,0(r3)           300

• Fixed, but how do we implement acquire & release?

Strawman Lock (Incorrect)

• Spin lock: software lock implementation
  • acquire(lock): while (lock != 0) {} lock = 1;
    • "Spin" while lock is 1, wait for it to turn 0

      A0: ld 0(&lock),r6
      A1: bnez r6,A0
      A2: addi r6,1,r6
      A3: st r6,0(&lock)

  • release(lock): lock = 0;

      R0: st r0,0(&lock)   // r0 holds 0

Strawman Lock (Incorrect)

  Time  Thread 0               Thread 1               Mem
        A0: ld 0(&lock),r6                            0
        A1: bnez r6,#A0
                               A0: ld r6,0(&lock)
                               A1: bnez r6,#A0
        A2: addi r6,1,r6
        A3: st r6,0(&lock)                            1
                               A2: addi r6,1,r6
                               A3: st r6,0(&lock)     1
        CRITICAL_SECTION       CRITICAL_SECTION

• Spin lock makes intuitive sense, but doesn't actually work
  • Loads/stores of two acquire sequences can be interleaved
  • Lock acquire sequence also not atomic
  • Same problem as before!
• Note, release is trivially atomic

A Correct Implementation: SYSCALL Lock

  ACQUIRE_LOCK:
    A1: disable_interrupts    <- atomic
    A2: ld r6,0(&lock)
    A3: bnez r6,#A0
    A4: addi r6,1,r6
    A5: st r6,0(&lock)
    A6: enable_interrupts
    A7: return

• Implement lock in a SYSCALL
  • Only kernel can control interleaving by disabling interrupts
+ Works…
– Large system call overhead
– But not in a hardware multithreading or a multiprocessor…


Better Spin Lock: Use Atomic Swap

• ISA provides an atomic lock acquisition instruction
  • Example: atomic swap

      swap r1,0(&lock)

    atomically executes:

      mov r1->r2
      ld r1,0(&lock)
      st r2,0(&lock)

• New acquire sequence (value of r1 is 1)

      A0: swap r1,0(&lock)
      A1: bnez r1,A0

  • If lock was initially busy (1), doesn't change it, keep looping
  • If lock was initially free (0), acquires it (sets it to 1), break loop
• Ensures lock held by at most one thread
• Other variants: exchange, compare-and-swap, test-and-set (t&s), or fetch-and-add

Atomic Update/Swap Implementation

[Diagram: two PC/register-file pairs sharing I$ and D$]

• How is atomic swap implemented?
  • Need to ensure no intervening memory operations
  • Requires blocking access by other threads temporarily (yuck)
• How to pipeline it?
  • Both a load and a store (yuck)
  • Not very RISC-like
• Some ISAs provide a "load-link" and "store-conditional" insn. pair

RISC Test-And-Set

• swap: a load and store in one insn is not very "RISC"
  • Broken up into micro-ops, but then how is it made atomic?
• ll/sc: load-locked / store-conditional
  • Atomic load/store pair

      ll r1,0(&lock)
      // potentially other insns
      sc r2,0(&lock)

  • On ll, processor remembers address…
  • …and looks for writes by other processors
  • If write is detected, next sc to same address is annulled
    • Sets failure condition

Lock Correctness

  Thread 0                 Thread 1
  A0: swap r1,0(&lock)
  A1: bnez r1,#A0          A0: swap r1,0(&lock)
  CRITICAL_SECTION         A1: bnez r1,#A0
                           A0: swap r1,0(&lock)
                           A1: bnez r1,#A0

+ Lock actually works…
  • Thread 1 keeps spinning
• Sometimes called a "test-and-set lock"
  • Named after the common "test-and-set" atomic instruction


"Test-and-Set" Lock Performance

  Thread 0                 Thread 1
  A0: swap r1,0(&lock)
  A1: bnez r1,#A0          A0: swap r1,0(&lock)
  A0: swap r1,0(&lock)     A1: bnez r1,#A0
  A1: bnez r1,#A0          A0: swap r1,0(&lock)
                           A1: bnez r1,#A0

– …but performs poorly
  • Consider 3 processors rather than 2
  • Processor 2 (not shown) has the lock and is in the critical section
  • But what are processors 0 and 1 doing in the meantime?
    • Loops of swap, each of which includes a st
    – Repeated stores by multiple processors costly (more in a bit)
    – Generating a ton of useless interconnect traffic

Test-and-Test-and-Set Locks

• Solution: test-and-test-and-set locks
• New acquire sequence

    A0: ld r1,0(&lock)
    A1: bnez r1,A0
    A2: addi r1,1,r1
    A3: swap r1,0(&lock)
    A4: bnez r1,A0

• Within each loop iteration, before doing a swap
  • Spin doing a simple test (ld) to see if lock value has changed
  • Only do a swap (st) if lock is actually free
• Processors can spin on a busy lock locally (in their own cache)
+ Less unnecessary interconnect traffic
• Note: test-and-test-and-set is not a new instruction!
  • Just different software

Queue Locks

• Test-and-test-and-set locks can still perform poorly
  • If lock is contended for by many processors
  • Lock release by one processor creates "free-for-all" by others
  – Interconnect gets swamped with swap requests
• Software queue lock
  • Each waiting processor spins on a different location (a queue)
  • When lock is released by one processor...
    • Only the next processor sees its location go "unlocked"
    • Others continue spinning locally, unaware lock was released
  • Effectively, passes lock from one processor to the next, in order
  + Greatly reduced network traffic (no mad rush for the lock)
  + Fairness (lock acquired in FIFO order)
  – Higher overhead in case of no contention (more instructions)
  – Poor performance if one thread gets swapped out

Programming With Locks Is Tricky

• Multicore processors are the way of the foreseeable future
  • Thread-level parallelism anointed as parallelism model of choice
  • Just one problem…
• Writing lock-based multi-threaded programs is tricky!
• More precisely:
  • Writing programs that are correct is "easy" (not really)
  • Writing programs that are highly parallel is "easy" (not really)
  – Writing programs that are both correct and parallel is difficult
  • And that's the whole point, unfortunately
• Selecting the "right" kind of lock for performance
  • Spin lock, queue lock, ticket lock, read/writer lock, etc.
• Locking granularity issues

Coarse-Grain Locks: Correct but Slow

• Coarse-grain locks: e.g., one lock for entire database
+ Easy to make correct: no chance for unintended interference
– Limits parallelism: no two critical sections can proceed in parallel

  struct acct_t { int bal; … };
  shared struct acct_t accts[MAX_ACCT];
  shared Lock_t lock;
  void debit(int id, int amt) {
    acquire(lock);
    if (accts[id].bal >= amt) {
      accts[id].bal -= amt;
    }
    release(lock);
  }

Fine-Grain Locks: Parallel But Difficult

• Fine-grain locks: e.g., multiple locks, one per record
+ Fast: critical sections (to different records) can proceed in parallel
– Difficult to make correct: easy to make mistakes
  • This particular example is easy
  • Requires only one lock per critical section

  struct acct_t { int bal; Lock_t lock; … };
  shared struct acct_t accts[MAX_ACCT];
  void debit(int id, int amt) {
    acquire(accts[id].lock);
    if (accts[id].bal >= amt) {
      accts[id].bal -= amt;
    }
    release(accts[id].lock);
  }

• What about critical sections that require two locks?

Multiple Locks

• Multiple locks: e.g., acct-to-acct transfer
  • Must acquire both id_from, id_to locks
  • Running example with accts 241 and 37
  • Simultaneous transfers 241 → 37 and 37 → 241
  • Contrived… but even contrived examples must work correctly too

  struct acct_t { int bal; Lock_t lock; … };
  shared struct acct_t accts[MAX_ACCT];
  void transfer(int id_from, int id_to, int amt) {
    acquire(accts[id_from].lock);
    acquire(accts[id_to].lock);
    if (accts[id_from].bal >= amt) {
      accts[id_from].bal -= amt;
      accts[id_to].bal += amt;
    }
    release(accts[id_to].lock);
    release(accts[id_from].lock);
  }

Multiple Locks And Deadlock

  Thread 0                       Thread 1
  id_from = 241;                 id_from = 37;
  id_to = 37;                    id_to = 241;
  acquire(accts[241].lock);      acquire(accts[37].lock);
  // wait to acquire lock 37     // wait to acquire lock 241
  // waiting…                    // waiting…
  // still waiting…              // …

• Deadlock: circular wait for shared resources
  • Thread 0 has lock 241, waits for lock 37
  • Thread 1 has lock 37, waits for lock 241
• Obviously this is a problem
• The solution is …


Correct Multiple Lock Program

• Always acquire multiple locks in same order
• Just another thing to keep in mind when programming

  struct acct_t { int bal; Lock_t lock; … };
  shared struct acct_t accts[MAX_ACCT];
  void transfer(int id_from, int id_to, int amt) {
    int id_first = min(id_from, id_to);
    int id_second = max(id_from, id_to);
    acquire(accts[id_first].lock);
    acquire(accts[id_second].lock);
    if (accts[id_from].bal >= amt) {
      accts[id_from].bal -= amt;
      accts[id_to].bal += amt;
    }
    release(accts[id_second].lock);
    release(accts[id_first].lock);
  }

Correct Multiple Lock Execution

  Thread 0                          Thread 1
  id_from = 241;                    id_from = 37;
  id_to = 37;                       id_to = 241;
  id_first = min(241,37) = 37;      id_first = min(37,241) = 37;
  id_second = max(241,37) = 241;    id_second = max(37,241) = 241;
  acquire(accts[37].lock);
                                    // wait to acquire lock 37
  acquire(accts[241].lock);         // waiting…
  // do stuff                       // …
  release(accts[241].lock);         // still waiting…
  release(accts[37].lock);
                                    acquire(accts[37].lock);
                                    // …

• Great, are we done? No

More Lock Madness

• What if…
  • Some actions (e.g., deposits, transfers) require 1 or 2 locks…
  • …and others (e.g., prepare statements) require all of them?
  • Can these proceed in parallel?
• What if…
  • There are locks for global variables (e.g., operation id counter)?
  • When should operations grab this lock?
• What if… what if… what if…

And To Make It Worse…

• Acquiring locks is expensive…
  • By definition requires slow atomic instructions
  • Specifically, acquiring write permissions to the lock
  • Ordering constraints (see soon) make it even slower
• …and 99% of the time unnecessary
  • Most concurrent actions don't actually share data
  – You're paying to acquire the lock(s) for no reason

• So lock-based programming is difficult…
  • …wait, it gets worse
• Fixing these problems is an area of active research
• One proposed solution: "transactional memory"


Research: Transactional Memory (TM)

• Transactional Memory
+ Programming simplicity of coarse-grain locks
+ Higher concurrency (parallelism) of fine-grain locks
  • Critical sections only serialized if data is actually shared
+ No lock acquisition overhead
• Hottest thing since sliced bread (or was a few years ago)
• No fewer than nine research projects:
  • Brown, Stanford, MIT, Wisconsin, Texas, Rochester, Sun/Oracle, Intel
  • Penn, too

Transactional Memory: The Big Idea

• Big idea I: no locks, just shared data
  • Look ma, no locks
• Big idea II: optimistic (speculative) concurrency
  • Execute critical section speculatively, abort on conflicts
  • "Better to beg for forgiveness than to ask for permission"

  struct acct_t { int bal; … };
  shared struct acct_t accts[MAX_ACCT];
  void transfer(int id_from, int id_to, int amt) {
    begin_transaction();
    if (accts[id_from].bal >= amt) {
      accts[id_from].bal -= amt;
      accts[id_to].bal += amt;
    }
    end_transaction();
  }

Transactional Memory: Read/Write Sets

• Read set: set of shared addresses critical section reads
  • Example: accts[37].bal, accts[241].bal
• Write set: set of shared addresses critical section writes
  • Example: accts[37].bal, accts[241].bal

Transactional Memory: Begin

• begin_transaction
  • Take a local register checkpoint
  • Begin locally tracking read set (remember addresses you read)
    • See if anyone else is trying to write it
  • Locally buffer all of your writes (invisible to other processors)
+ Local actions only: no lock acquire

Transactional Memory: End

• end_transaction
  • Check read set: is all data you read still valid (i.e., no writes to any)
  • Yes? Commit transaction: commit writes
  • No? Abort transaction: restore checkpoint

Transactional Memory Implementation

• How are read-set/write-set implemented?
  • Track locations accessed using bits in the cache
• Read-set: additional "transactional read" bit per block
  • Set on reads between begin_transaction and end_transaction
  • Any other write to block with set bit → triggers abort
  • Flash cleared on transaction abort or commit
• Write-set: additional "transactional write" bit per block
  • Set on writes between begin_transaction and end_transaction
  • Before first write, if dirty, initiate writeback ("clean" the block)
  • Flash cleared on transaction commit
  • On transaction abort: blocks with set bit are invalidated

Transactional Execution

  Thread 0                        Thread 1
  id_from = 241;                  id_from = 37;
  id_to = 37;                     id_to = 241;
  begin_transaction();            begin_transaction();
  if(accts[241].bal > 100) {      if(accts[37].bal > 100) {
    …                               accts[37].bal -= amt;
    // write accts[241].bal         accts[241].bal += amt;
    // abort                      }
                                  end_transaction();
                                  // no writes to accts[241].bal
                                  // no writes to accts[37].bal
                                  // commit

Transactional Execution II (More Likely)

  Thread 0                        Thread 1
  id_from = 241;                  id_from = 450;
  id_to = 37;                     id_to = 118;
  begin_transaction();            begin_transaction();
  if(accts[241].bal > 100) {      if(accts[450].bal > 100) {
    accts[241].bal -= amt;          accts[450].bal -= amt;
    accts[37].bal += amt;           accts[118].bal += amt;
  }                               }
  end_transaction();              end_transaction();
  // no write to accts[241].bal   // no write to accts[450].bal
  // no write to accts[37].bal    // no write to accts[118].bal
  // commit                       // commit

• Critical sections execute in parallel


So, Let's Just Do Transactions?

• What if…
  • Read-set or write-set bigger than cache?
  • Transaction gets swapped out in the middle?
  • Transaction wants to do I/O or SYSCALL (not-abortable)?
• How do we transactify existing lock-based programs?
  • Replace acquire with begin_trans does not always work
• Several different kinds of transaction semantics
  • Are transactions atomic relative to code outside of transactions?
• Do we want transactions in hardware or in software?
  • What we just saw is hardware transactional memory (HTM)
  • That's what these research groups are looking at

Speculative Lock Elision

  Processor 0
  acquire(accts[37].lock);  // don't actually set lock to 1
                            // begin tracking read/write sets
  // CRITICAL_SECTION
  release(accts[37].lock);  // check read set
                            // no conflicts? Commit, don't actually set lock to 0
                            // conflicts? Abort, retry by acquiring lock

• Until TM interface solidifies…
  • …speculatively transactify lock-based programs in hardware
  • Speculative Lock Elision (SLE) [Rajwar+, MICRO'01]
  – Doesn't capture all advantages of transactional memory…
  + No need to rewrite programs
  + Can always fall back on lock-based execution (overflow, I/O, etc.)
• Best-effort hardware TM: Azul Systems, Sun's Rock processor

Roadmap Checkpoint

[Diagram: App / App / App above System software, above Mem, CPUs, I/O]

• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence  <- we are here
  • Bus-based protocols
  • Directory protocols
• Memory consistency models

Recall: Simplest Multiprocessor

[Diagram: two PC/register-file pairs sharing one I$ and one D$]

• What if we don't want to share the L1 caches?
  • Bandwidth and latency issue
• Solution: use per-processor ("private") caches
  • Coordinate them with a Cache Coherence Protocol


Shared-Memory Multiprocessors

• Conceptual model
  • The shared-memory abstraction
  • Familiar and feels natural to programmers
  • Life would be easy if systems actually looked like this…
• …but systems actually look more like this
  • Processors have caches
  • Memory may be physically distributed
  • Arbitrary interconnect

[Diagrams: conceptual picture — P0..P3 connected directly to one Memory;
 realistic picture — each Pi with a private cache $, a local memory Mi, and a
 router/interface on an arbitrary Interconnect]

Revisiting Our Motivating Example

  Processor 0                         Processor 1
  0: addi $r3,$r1,&accts
  1: lw $r4,0($r3)
  2: blt $r4,$r2,6        critical section
  3: sub $r4,$r4,$r2      (locks not shown)
  4: sw $r4,0($r3)
                                      0: addi $r3,$r1,&accts
                                      1: lw $r4,0($r3)
                                      2: blt $r4,$r2,6        critical section
                                      3: sub $r4,$r4,$r2      (locks not shown)
                                      4: sw $r4,0($r3)

No-Cache, No-Problem

  Processor 0                Processor 1                Mem
  0: addi $r3,$r1,&accts                                $500
  1: lw $r4,0($r3)                                      $500
  2: blt $r4,$r2,6
  3: sub $r4,$r4,$r2
  4: sw $r4,0($r3)                                      $400
                             0: addi $r3,$r1,&accts
                             1: lw $r4,0($r3)           $400
                             2: blt $r4,$r2,6
                             3: sub $r4,$r4,$r2
                             4: sw $r4,0($r3)           $300

• Two $100 withdrawals from account #241 at two ATMs
  • Each transaction maps to thread on different processor
  • Track accts[241].bal (address is in $r3)
• Scenario I: processors have no caches
  • No problem


Cache Incoherence

  Processor 0                Processor 1                CPU0    CPU1    Mem
  0: addi $r3,$r1,&accts                                                $500
  1: lw $r4,0($r3)                                      $500            $500
  2: blt $r4,$r2,6
  3: sub $r4,$r4,$r2
  4: sw $r4,0($r3)                                      $400            $500
                             0: addi $r3,$r1,&accts
                             1: lw $r4,0($r3)           $400    $500    $500
                             2: blt $r4,$r2,6
                             3: sub $r4,$r4,$r2
                             4: sw $r4,0($r3)           $400    $400    $500

Write-Through Doesn't Fix It

  Processor 0                Processor 1                CPU0    CPU1    Mem
  0: addi $r3,$r1,&accts                                                $500
  1: lw $r4,0($r3)                                      $500            $500
  2: blt $r4,$r2,6
  3: sub $r4,$r4,$r2
  4: sw $r4,0($r3)                                      $400            $400
                             0: addi $r3,$r1,&accts
                             1: lw $r4,0($r3)           $400    $400    $400
                             2: blt $r4,$r2,6
                             3: sub $r4,$r4,$r2
                             4: sw $r4,0($r3)           $400    $300    $300

• Scenario II(a): processors have write-back caches
  • Potentially 3 copies of accts[241].bal: memory, two caches
  • Can get incoherent (inconsistent)
• Scenario II(b): processors have write-through caches
  • This time only two (different) copies of accts[241].bal
  • No problem? What if another withdrawal happens on processor 0?

What To Do?

• No caches?
  – Too slow
• Make shared data uncachable?
  – Faster, but still too slow
  • Entire accts database is technically "shared"
• Flush all other caches on writes to shared data?
  • Can work well in some cases, but can make caches ineffective
• Hardware cache coherence
  • Rough goal: all caches have same data at all times
  + Minimal flushing, maximum caching → best performance

Bus-based Multiprocessor

• Simple multiprocessors use a bus
  • All processors see all requests at the same time, same order
• Memory
  • Single memory module, -or-
  • Banked memory module

  [Diagram: P0..P3, each with a cache $, on a shared Bus to memory banks M0..M3]

Hardware Cache Coherence

• Coherence
  • All copies have same data at all times
• Coherence controller (CC):
  • Examines bus traffic (addresses and data)
  • Executes coherence protocol
    • What to do with local copy when you see different things happening on bus
  [Diagram: CPU above coherence controller (CC) with D$ tags and D$ data, attached to the bus]
• Summary
  • Each processor runs a state machine
  • Three processor-initiated events
    • Ld: load, St: store, WB: write-back
  • Two remote-initiated events
    • LdMiss: read miss from another processor
    • StMiss: write miss from another processor

VI (MI) Coherence Protocol

• VI (valid-invalid) protocol: aka "MI"
• Two states (per block in cache)
  • V (valid): have block
  • I (invalid): don't have block
  + Can implement with valid bit
• Protocol diagram (left & next slide)
  [State diagram: I → V on Load/Store; V → I on LdMiss/StMiss (with WB); V → V on Load/Store]
• If anyone wants to read/write block
  • Give it up: transition to I state
  • Write-back if your own copy is dirty
• This is an invalidate protocol
• Update protocol: copy data, don't invalidate
  • Sounds good, but uses too much bandwidth

VI Protocol State Transition Table

               This Processor          Other Processor
  State        Load        Store       Load Miss    Store Miss
  Invalid(I)   Load Miss   Store Miss  ---          ---
               → V         → V
  Valid(V)     Hit         Hit         Send Data    Send Data
                                       → I          → I

• Rows are "states", columns are "events"
• Writeback events not shown
• Memory controller not shown
  • Memory sends data when no processor responds

VI Protocol (Write-Back Cache)

  Processor 0                Processor 1                CPU0    CPU1    Mem
  0: addi $r3,$r1,&accts                                                500
  1: lw $r4,0($r3)                                      V:500           500
  2: blt $r4,$r2,6
  3: sub $r4,$r4,$r2
  4: sw $r4,0($r3)                                      V:400           500
                             0: addi $r3,$r1,&accts
                             1: lw $r4,0($r3)           I:      V:400   400
                             2: blt $r4,$r2,6
                             3: sub $r4,$r4,$r2
                             4: sw $r4,0($r3)                   V:300   400

• lw by processor 1 generates an "other load miss" event (LdMiss)
  • Processor 0 responds by sending its dirty copy, transitioning to I

VI → MSI

• VI protocol is inefficient
  – Only one cached copy allowed in the entire system
  – Multiple copies can't exist even if read-only
    • Not a problem in the example
    • Big problem in reality
• MSI (modified-shared-invalid)
  • Fixes the problem: splits the "V" state into two states
    • M (modified): local dirty copy
    • S (shared): local clean copy
  • Allows either
    • Multiple read-only copies (S state) --OR--
    • Single read/write copy (M state)
• M → S transition also updates memory
  • After which memory will respond (as all processors will be in S)
[State diagram: I → S on Load, I → M on Store, S → M on Store (upgrade), M → S on LdM, S/M → I on StMiss; WB from M]

MSI Protocol State Transition Table

                This Processor                Other Processor
  State         Load      Store               Load Miss    Store Miss
  Invalid (I)   Load Miss Store Miss          ---          ---
                → S       → M
  Shared (S)    Hit       Upgrade Miss → M    ---          → I
  Modified (M)  Hit       Hit                 Send Data    Send Data
                                              → S          → I
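The MSI table can likewise be written down as a transition function. This is an illustrative sketch (the dictionary encoding and action strings are made up for this example); real controllers also handle writebacks and transient states omitted here:

```python
# Sketch of the MSI state transition table (illustrative, not a real controller).
# States: M (modified/dirty), S (shared/clean), I (invalid).
def msi_next_state(state, event):
    """Events: 'Load'/'Store' (this processor),
               'LdMiss'/'StMiss' (another processor's miss seen on the bus)."""
    table = {
        ("I", "Load"):   ("S", "load miss"),
        ("I", "Store"):  ("M", "store miss"),
        ("S", "Load"):   ("S", "hit"),
        ("S", "Store"):  ("M", "upgrade miss"),
        ("S", "StMiss"): ("I", None),           # someone else wants to write
        ("M", "Load"):   ("M", "hit"),
        ("M", "Store"):  ("M", "hit"),
        ("M", "LdMiss"): ("S", "send data, update memory"),
        ("M", "StMiss"): ("I", "send data"),
    }
    return table.get((state, event), (state, None))  # unlisted events: no action

# The slides' running example, seen from P0: P0 writes (I -> M),
# then P1 reads (P0: M -> S), then P1 writes (P0: S -> I).
s, _ = msi_next_state("I", "Store")
s, _ = msi_next_state(s, "LdMiss")
s, _ = msi_next_state(s, "StMiss")
print(s)   # I
```

The key difference from VI shows up in the `("S", "LdMiss")` case: it is absent from the table, so another processor's read leaves a clean shared copy in place.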

MSI Protocol (Write-Back Cache)

  Processor 0               Processor 1               CPU0    CPU1    Mem
                                                                      500
  0: addi $r3,$r1,&accts
  1: lw $r4,0($r3)                                    S:500           500
  2: blt $r4,$r2,6
  3: sub $r4,$r4,$r2
  4: sw $r4,0($r3)                                    M:400           500
                            0: addi $r3,$r1,&accts
                            1: lw $r4,0($r3)          S:400   S:400   400
                            2: blt $r4,$r2,6
                            3: sub $r4,$r4,$r2
                            4: sw $r4,0($r3)          I:      M:300   400

• lw by processor 1 generates an "other load miss" event (LdMiss)
  • Processor 0 responds by sending its dirty copy, transitioning to S
• sw by processor 1 generates an "other store miss" event (StMiss)
  • Processor 0 responds by transitioning to I

Cache Coherence and Cache Misses

• Coherence introduces two new kinds of cache misses
  • Upgrade miss
    • On stores to read-only blocks
    • Delay to acquire write permission to a read-only block
  • Coherence miss
    • Miss to a block evicted by another processor's requests
• Making the cache larger…
  • Doesn't reduce these types of misses
  • So, as caches grow large, these sorts of misses dominate
• False sharing
  • Two or more processors sharing parts of the same block
  • But not the same bytes within that block (no actual sharing)
  • Creates pathological "ping-pong" behavior
  • Careful data placement may help, but is difficult
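False sharing is purely a question of address layout: two variables conflict exactly when they map to the same cache block. The sketch below illustrates this with assumed 64-byte blocks and hypothetical addresses (`0x1000`, `0x1008` are invented for the example):

```python
# False sharing happens when two independently-written variables map to
# the same cache block. Illustrative sketch: 64-byte blocks, two 8-byte
# counters laid out back-to-back in memory (hypothetical addresses).
BLOCK_SIZE = 64

def block_of(addr):
    """Block number a byte address falls into."""
    return addr // BLOCK_SIZE

counter_p0 = 0x1000        # written only by processor 0
counter_p1 = 0x1008        # written only by processor 1

shares_block = block_of(counter_p0) == block_of(counter_p1)
print(shares_block)   # True: every store ping-pongs the block between caches

# Padding the second counter out to its own block removes the false sharing
padded_p1 = counter_p0 + BLOCK_SIZE
print(block_of(counter_p0) == block_of(padded_p1))   # False
```

This is the "careful data placement" the slide mentions: padding each hot, per-thread variable to a block boundary trades memory for the elimination of coherence ping-pong.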


Exclusive Clean Protocol Optimization

  Processor 0               Processor 1                 CPU0    CPU1    Mem
                                                                        500
  0: addi $r3,$r1,&accts
  1: lw $r4,0($r3)                                      E:500           500
  2: blt $r4,$r2,6
  3: sub $r4,$r4,$r2
  4: sw $r4,0($r3) (No miss!)                           M:400           500
                            0: addi $r3,$r1,&accts
                            1: lw $r4,0($r3)            S:400   S:400   400
                            2: blt $r4,$r2,6
                            3: sub $r4,$r4,$r2
                            4: sw $r4,0($r3)            I:      M:300   400

• Most modern protocols also include an E (exclusive) state
• Interpretation: "I have the only cached copy, and it's a clean copy"
  • Why would this state be useful?
• Load misses lead to "E" if no other processor is caching the block

MESI Protocol State Transition Table

                 This Processor             Other Processor
  State          Load    Store              Load Miss    Store Miss
  Invalid (I)    Miss    Miss               ---          ---
                 → S or E  → M
  Shared (S)     Hit     Upg Miss → M       ---          → I
  Exclusive (E)  Hit     Hit → M            Send Data    Send Data
                                            → S          → I
  Modified (M)   Hit     Hit                Send Data    Send Data
                                            → S          → I

Snooping Bandwidth Scaling Problems

• Coherence events generated on…
  • L2 misses (and writebacks)
• Problem #1: N² bus traffic
  • All N processors send their misses to all N−1 other processors
  • Assume: 2 IPC, 2 GHz clock, 0.01 misses/insn per processor
  • 0.01 misses/insn × 2 insn/cycle × 2 cycles/ns × 64 B blocks
    = 2.56 GB/s… per processor
  • With 16 processors, that's 40 GB/s! With 128, that's 320 GB/s!!
  • You can use multiple buses… but that complicates the protocol
• Problem #2: N² processor snooping bandwidth
  • Most snoops result in no action
  • 0.01 events/insn × 2 insn/cycle = 0.02 events/cycle per processor
  • 16 processors: 0.32 bus-side tag lookups per cycle
    • Add 1 extra port to the cache tags? Okay
  • 128 processors: 2.56 tag lookups per cycle! 3 extra tag ports?

"Scalable" Cache Coherence

• Part I: bus bandwidth
  • Replace the non-scalable bandwidth substrate (bus)…
  • …with a scalable one (point-to-point network, e.g., mesh)
• Part II: processor snooping bandwidth
  • Replace the non-scalable broadcast protocol…
  • …with a scalable directory protocol (only notify processors that care)
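The bandwidth arithmetic above is easy to check. The sketch below just reproduces the slide's numbers; note the slide rounds its headline figures (16 × 2.56 is actually 40.96 GB/s, and 128 × 2.56 is about 328 GB/s, quoted as 320):

```python
# Reproduce the snooping-bandwidth arithmetic from the slide.
misses_per_insn = 0.01
insn_per_cycle  = 2
cycles_per_ns   = 2        # 2 GHz clock
block_bytes     = 64

# Per-processor miss traffic: bytes per ns == GB per second
per_proc_gb_s = misses_per_insn * insn_per_cycle * cycles_per_ns * block_bytes
print(round(per_proc_gb_s, 2))        # 2.56 GB/s per processor
print(round(16 * per_proc_gb_s, 2))   # ~41 GB/s with 16 processors
print(round(128 * per_proc_gb_s, 2))  # ~328 GB/s with 128 (slide rounds to 320)

# Bus-side snoop lookups per cycle (tag-port pressure)
events_per_cycle = misses_per_insn * insn_per_cycle   # 0.02 per processor
print(round(16 * events_per_cycle, 2))    # 0.32 lookups/cycle with 16
print(round(128 * events_per_cycle, 2))   # 2.56 lookups/cycle with 128
```

The quadratic feel comes from multiplying per-processor traffic by N on a medium every one of the N processors must observe.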

Point-to-Point Interconnects

[Figure: four CPU($) nodes connected through routers (R), each with a local memory (Mem), forming a mesh]

+ Can be arbitrarily large: 1000's of processors
  • Massively parallel processors (MPPs)
  • Only scientists & government (DoD & DoE) have MPPs…
  • Companies have much smaller systems: 32–64 processors
    • Scalable multi-processors
• Distributed memory: non-uniform memory access (NUMA)
• Multicore: on-chip mesh interconnection networks
  • Each node: a core, L1/L2 caches, and a "bank" (1/nth) of the L3 cache
  • Multiple memory controllers (which talk to off-chip DRAM)

Directory Coherence Protocols

• Observe: address space is statically partitioned
  + Can easily determine which memory module holds a given line
    • That memory module is sometimes called the "home"
  – Can't easily determine which processors have the line in their caches
• Bus-based protocol: broadcast events to all processors/caches
  ± Simple and fast, but non-scalable
• Directories: non-broadcast coherence protocol
  • Extend memory to track caching information
  • For each physical cache line whose home this is, track:
    • Owner: which processor has a dirty copy (i.e., M state)
    • Sharers: which processors have clean copies (i.e., S state)
  • Processor sends coherence event to the "home" (directory)
    • Home directory sends events only to processors as needed
  • For a multicore with a shared L3 cache, put directory info in the cache tags

MSI Directory Protocol

• Processor side
  • Directory follows its own protocol
  • Similar to bus-based MSI
    • Same three states
    • Same five actions (keep BR/BW names)
  • Minus red arcs/actions
    • Events that would not trigger action anyway
    + Directory won't bother you unless you need to act
[State diagram: I/S/M with Load, Store, LdMiss, StMiss, WB transitions]

MSI Directory Protocol (example)

  Processor 0             Processor 1             P0      P1      Directory
                                                                  –:–:500
  0: addi r1,accts,r3
  1: ld 0(r3),r4                                  S:500           S:0:500
  2: blt r4,r2,done
  3: sub r4,r2,r4
  4: st r4,0(r3)                                  M:400           M:0:500 (stale)
                          0: addi r1,accts,r3
                          1: ld 0(r3),r4          S:400   S:400   S:0,1:400
                          2: blt r4,r2,done
                          3: sub r4,r2,r4
                          4: st r4,0(r3)                  M:300   M:1:400

• ld by P1 sends BR to the directory
  • Directory sends BR to P0; P0 sends P1 the data, does a WB, goes to S
• st by P1 sends BW to the directory
  • Directory sends BW to P0; P0 goes to I
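The directory's bookkeeping (owner + sharers per home line) can be sketched as a small class. This is an illustrative model with invented names (`DirectoryEntry`, the message strings), and it ignores the transient "blocked" states a real directory needs:

```python
# Sketch of per-line directory state at the home node: track the one owner
# (M copy) and the set of sharers (S copies). Illustrative only.
class DirectoryEntry:
    def __init__(self):
        self.owner = None      # processor holding the line in M, if any
        self.sharers = set()   # processors holding it in S

    def load_miss(self, requester):
        """A read miss arrives: downgrade any owner, record a new sharer."""
        msgs = []
        if self.owner is not None:
            msgs.append(f"forward to owner P{self.owner}; owner -> S")
            self.sharers.add(self.owner)
            self.owner = None
        else:
            msgs.append("send data from memory")
        self.sharers.add(requester)
        return msgs

    def store_miss(self, requester):
        """A write miss arrives: invalidate all other holders, set new owner."""
        holders = set(self.sharers)
        if self.owner is not None:
            holders.add(self.owner)
        msgs = [f"invalidate P{p}" for p in sorted(holders) if p != requester]
        self.sharers = set()
        self.owner = requester
        return msgs

# The slides' sequence: P0 writes, P1 reads, P1 writes.
d = DirectoryEntry()
d.store_miss(0)                 # P0 becomes owner (M)
print(d.load_miss(1))           # forward to P0; both end up sharers
print(d.store_miss(1))          # invalidate P0; P1 becomes owner
print(d.owner, d.sharers)       # 1 set()
```

Only the processors named in the returned messages are contacted, which is exactly the bandwidth advantage over broadcast snooping.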


Directory Flip Side: Latency

• Directory protocols
  + Lower bandwidth consumption → more scalable
  – Longer latencies
• Two read miss situations
  • Unshared: get data from memory
    • Snooping: 2 hops (P0 → memory → P0)
    • Directory: 2 hops (P0 → memory → P0)
  • Shared or exclusive: get data from the other processor (P1)
    • Assume cache-to-cache transfer optimization
    • Snooping: 2 hops (P0 → P1 → P0)
    – Directory: 3 hops (P0 → memory → P1 → P0)
    • Common; with many processors, high probability someone has it
[Figure: 2-hop miss (P0 ↔ Dir) vs. 3-hop miss (P0 → Dir → P1 → P0)]

Snooping Example: Step #1 Snooping Example: Step #2

P0 P1 P2 P0 P1 P2

Load A Load A Cache Cache Cache Cache Cache Cache Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State ------Miss! A 500 M ------A 500 M ------

Bus LdMiss: Addr=A Bus

Shared Addr Data State Shared Addr Data State Cache A 1000 Modified Cache A 1000 Modified B 0 Idle B 0 Idle

Memory A 1000 Memory A 1000 CIS 501 (Martin): Multicore B 0 85 CIS 501 (Martin): Multicore B 0 86

Snooping Example: Step #3 Snooping Example: Step #4

P0 P1 P2 P0 P1 P2

Load A Load A Cache Cache Cache Cache Cache Cache Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State ------A 500 S ------A 500 S A 500 S ------Response: Addr=A, Data=500 Response: Addr=A, Data=500 Bus Bus

Shared Addr Data State Shared Addr Data State Cache A 1000 Modified Cache A 500 Shared, Dirty B 0 Idle B 0 Idle

Memory A 1000 Memory A 1000 CIS 501 (Martin): Multicore B 0 87 CIS 501 (Martin): Multicore B 0 88 Snooping Example: Step #5 Snooping Example: Step #6

P0 P1 P2 P0 P1 P2

Load A <- 500 Store 400 -> A Cache Cache Cache Cache Cache Cache Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State A 500 S A 500 S ------A 500 S Miss! A 500 S ------

Bus Bus

Shared Addr Data State Shared Addr Data State Cache A 500 Shared, Dirty Cache A 500 Shared, Dirty B 0 Idle B 0 Idle

Memory A 1000 Memory A 1000 CIS 501 (Martin): Multicore B 0 89 CIS 501 (Martin): Multicore B 0 90

Snooping Example: Step #7 Snooping Example: Step #8

P0 P1 P2 P0 P1 P2

Store 400 -> A Store 400 -> A Cache Cache Cache Cache Cache Cache Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State A 500 S Miss! A 500 S ------A 500 S Miss! A -- I ------

UpgradeMiss: Addr=A Bus UpgradeMiss: Addr=A Bus

Shared Addr Data State Shared Addr Data State Cache A 500 Shared, Dirty Cache A 500 Modified B 0 Idle B 0 Idle

Memory A 1000 Memory A 1000 CIS 501 (Martin): Multicore B 0 91 CIS 501 (Martin): Multicore B 0 92 Snooping Example: Step #9 Snooping Example: Step #10

P0 P1 P2 P0 P1 P2

Store 400 -> A Store 400 -> A Cache Cache Cache Cache Cache Cache Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State A 500 M Miss! A -- I ------A 400 M Miss! A -- I ------

Bus Bus

Shared Addr Data State Shared Addr Data State Cache A 500 Modified Cache A 500 Modified B 0 Idle B 0 Idle

Memory A 1000 Memory A 1000 CIS 501 (Martin): Multicore B 0 93 CIS 501 (Martin): Multicore B 0 94

Directory Example: Step #1 Directory Example: Step #2

P0 P1 P2 P0 P1 P2

Load A Load A Cache Cache Cache Cache Cache Cache Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State ------Miss! A 500 M ------A 500 M ------LdMiss: Addr=A Point-to-Point Interconnect Point-to-Point Interconnect LdMissForward: Addr=A, Req=P0 Shared Addr Data State Sharers Shared Addr Data State Sharers Cache A 1000 Modified P1 Cache A 1000 Blocked P1 B 0 Idle -- B 0 Idle --

Memory A 1000 Memory A 1000 CIS 501 (Martin): Multicore B 0 95 CIS 501 (Martin): Multicore B 0 96 Directory Example: Step #3 Directory Example: Step #4

P0 P1 P2 P0 P1 P2

Load A Load A Cache Cache Cache Cache Cache Cache Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State ------A 500 S ------A 500 S A 500 S ------Response: Addr=A, Data=500 Response: Addr=A, Data=500 Point-to-Point Interconnect Point-to-Point Interconnect

Shared Addr Data State Sharers Shared Addr Data State Sharers Cache A 1000 Blocked P1 Cache A 1000 Blocked P1 B 0 Idle -- B 0 Idle --

Memory A 1000 Memory A 1000 CIS 501 (Martin): Multicore B 0 97 CIS 501 (Martin): Multicore B 0 98

Directory Example: Step #5 Directory Example: Step #6

P0 P1 P2 P0 P1 P2

Load A <- 500 Store 400 -> A Cache Cache Cache Cache Cache Cache Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State A 500 S A 500 S ------A 500 S Miss! A 500 S ------Unblock: Addr=A, Data=500 Point-to-Point Interconnect Point-to-Point Interconnect

Shared Addr Data State Sharers Shared Addr Data State Sharers Cache A 500 Shared, Dirty P0, P1 Cache A 500 Shared, Dirty P0, P1 B 0 Idle -- B 0 Idle --

Memory A 1000 Memory A 1000 CIS 501 (Martin): Multicore B 0 99 CIS 501 (Martin): Multicore B 0 100 Directory Example: Step #7 Directory Example: Step #8

P0 P1 P2 P0 P1 P2

Store 400 -> A Store 400 -> A Cache Cache Cache Cache Cache Cache Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State A 500 S A 500 S ------A 500 S A -- I ------UpgradeMiss: Addr=A Ack: Addr=A, Acks=1 Point-to-Point Interconnect Point-to-Point Interconnect Invalidate: Addr=A, Req=P0, Acks=1 Invalidate: Addr=A, Req=P0, Acks=1 Shared Addr Data State Sharers Shared Addr Data State Sharers Cache A 500 Blocked P0, P1 Cache A 500 Blocked P0, P1 B 0 Idle -- B 0 Idle --

Memory A 1000 Memory A 1000 CIS 501 (Martin): Multicore B 0 101 CIS 501 (Martin): Multicore B 0 102

Directory Example: Step #9 Directory Example: Step #10

P0 P1 P2 P0 P1 P2

Store 400 -> A Store 400 -> A Cache Cache Cache Cache Cache Cache Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State Addr Data State A 500 M A -- I ------A 400 M A -- I ------Unblock: Addr=A Point-to-Point Interconnect Point-to-Point Interconnect

Shared Addr Data State Sharers Shared Addr Data State Sharers Cache A 500 Blocked P0, P1 Cache A 500 Modified P0 B 0 Idle -- B 0 Idle --

Memory A 1000 Memory A 1000 CIS 501 (Martin): Multicore B 0 103 CIS 501 (Martin): Multicore B 0 104 Roadmap Checkpoint

[Figure: apps / system software / multiprocessor hardware (Mem, CPUs, I/O)]

• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models

MEMORY CONSISTENCY


Shared Memory Example #1

• Initially: all variables zero (that is, x is 0, y is 0)

  thread 1        thread 2
  store 1 → y     store 1 → x
  load x          load y

• What value pairs can be read by the two loads? (x, y)

Shared Memory Example #2

• Initially: all variables zero (that is, x is 0, y is 0)

  thread 1        thread 2
  store 1 → y     load x
  store 1 → x     load y

• What value pairs can be read by the two loads? (x, y)

CIS 501 (Martin): Multicore 107 CIS 501 (Martin): Multicore 108 Shared Memory Example #3 “Answer” to Example #1 • Initially: all variables zero (flag is 0, a is 0) • Initially: all variables zero (that is, x is 0, y is 0) thread 1 thread 2 store 1 ! y store 1 ! x thread 1 thread 2 load x load y store 1 ! a while(flag == 0) { } • What value pairs can be read by the two loads? store 1 ! flag load a store 1 ! y store 1 ! y store 1 ! y load x store 1 ! x store 1 ! x store 1 ! x load x load y • What value can be read by “load a”? load y load y load x (x=0, y=1) (x=1, y=1) (x=1, y=1) store 1 ! x store 1 ! x store 1 ! x load y store 1 ! y store 1 ! y store 1 ! y load y load x load x load x load y (x=1, y=0) (x=1, y=1) (x=1, y=1) • What about (x=0, y=0)? CIS 501 (Martin): Multicore 109 CIS 501 (Martin): Multicore 110

"Answer" to Example #2

• Initially: all variables zero (that is, x is 0, y is 0)

  thread 1        thread 2
  store 1 → y     load x
  store 1 → x     load y

• What value pairs can be read by the two loads?
  • (x=1, y=1)
  • (x=0, y=0)
  • (x=0, y=1)
• Is (x=1, y=0) allowed?

"Answer" to Example #3

• Initially: all variables zero (flag is 0, a is 0)

  thread 1            thread 2
  store 1 → a         while(flag == 0) { }
  store 1 → flag      load a

• What value can be read by "load a"?
  • "load a" can see the value "1"
• Can "load a" read the value zero?

What is Going On?

• Reordering of memory operations to different addresses!
• In the compiler
  • Compiler is generally allowed to reorder memory operations to different addresses
  • Many other compiler optimizations also cause problems
• In the hardware
  • To tolerate write latency
    • Processors don't wait for writes to complete
    • And why should they? No reason on a uniprocessor
  • To simplify out-of-order execution
• Who cares? Programmers
  – Globally inconsistent memory creates mystifying behavior

Memory Consistency

• Coherence
  • Creates a globally uniform (consistent) view…
  • Of a single memory location (in other words: cache line)
  – Not enough
    • Cache lines A and B can be individually consistent…
    • But inconsistent with respect to each other
• Memory consistency
  • Creates a globally uniform (consistent) view…
  • Of all memory locations relative to each other


Coherence vs. Consistency

  A=0, flag=0
  Processor 0       Processor 1
  A=1;              while (!flag); // spin
  flag=1;           print A;

• Intuition says: P1 prints A=1
• Coherence says: absolutely nothing
  • P1 can see P0's write of flag before its write of A!!! How?
    • P0 has a coalescing store buffer that reorders writes
    • Or out-of-order load execution
    • Or the compiler reorders instructions
  • Said it would complicate multiprocessors
    • Yes. It does.
• Imagine trying to figure out why this code sometimes "works" and sometimes doesn't
• Real systems are allowed to act in this strange manner
  • What is allowed? Defined as part of the ISA and/or language

Hiding Store Miss Latency

• Why? Why allow such odd behavior?
• Reason #1: hiding store miss latency
• Recall (back from the caching unit)
  • Hiding store miss latency
  • How? Store buffer

Recall: Write Misses and Store Buffers

• Read miss?
  • Load can't go on without the data, it must stall
• Write miss?
  • Technically, no instruction is waiting for the data, so why stall?
• Store buffer: a small buffer
  • Stores put their address/value into the store buffer, keep going
  • Store buffer writes stores to the D$ in the background
  • Loads must search the store buffer (in addition to the D$)
  + Eliminates stalls on write misses (mostly)
  – Creates some problems (later)
[Figure: Processor → store buffer (SB) → Cache → writeback buffer (WBB) → next-level cache]
• Store buffer vs. writeback buffer
  • Store buffer: "in front of" the D$, for hiding store misses
  • Writeback buffer: "behind" the D$, for hiding writebacks

Two Kinds of Store Buffers

• FIFO (first-in, first-out) store buffers
  • All stores enter the store buffer, drain into the cache in order
  • In an in-order processor…
    • Allows later loads to execute under a store miss
  • In an out-of-order processor…
    • Instructions "commit" with older stores still in the store queue
• "Coalescing" store buffers
  • Organized like a mini-cache (tags, blocks, etc.)
    • But with per-byte valid bits
  • At commit, stores that miss the cache are placed in the store buffer
    • Stores that hit in the cache are written into the cache
  • When the store miss returns, all stores to that address drain into the cache
    • That is, not necessarily in FIFO order
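A FIFO store buffer's two obligations — drain in program order, and let loads forward from buffered stores — can be sketched as follows. This is an illustrative model (the class and method names are invented), not a hardware description:

```python
# Sketch of a FIFO store buffer sitting in front of the data cache.
from collections import deque

class StoreBuffer:
    def __init__(self, cache):
        self.cache = cache          # dict: address -> value (models the D$)
        self.pending = deque()      # FIFO of (address, value) buffered stores

    def store(self, addr, value):
        """Stores enter the buffer; the processor keeps going (no stall)."""
        self.pending.append((addr, value))

    def load(self, addr):
        """Loads must search the store buffer (youngest matching store
        wins) before falling back to the cache."""
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.cache.get(addr, 0)

    def drain_one(self):
        """Background drain: the oldest store writes into the cache
        (FIFO order keeps stores in program order)."""
        addr, value = self.pending.popleft()
        self.cache[addr] = value

cache = {"A": 0}
sb = StoreBuffer(cache)
sb.store("A", 400)        # store miss: buffered, processor doesn't stall
print(sb.load("A"))       # 400: forwarded from the store buffer
print(cache["A"])         # 0: the new value isn't globally visible yet
sb.drain_one()
print(cache["A"])         # 400
```

The gap between the two prints of `cache["A"]` is precisely the window in which this processor and the rest of the system disagree about memory.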

Store Buffers & Consistency

  A=0, flag=0
  Processor 0       Processor 1
  A=1;              while (!flag); // spin
  flag=1;           print A;

• Consider the following execution:
  • Processor 0's write to A misses the cache; put in the store buffer
  • Processor 0 keeps going
  • Processor 0's write of "1" to flag hits, writes to the cache
  • Processor 1 reads flag… sees the value "1"
  • Processor 1 exits the loop
  • Processor 1 prints "0" for A
• Uh, oh.
• Ramification: store buffers can cause "strange" behavior
  • How strange depends on lots of things

Simplifying Out-of-Order Execution

• Why? Why allow such odd behavior?
• Reason #2: simplifying out-of-order execution
• One key benefit of out-of-order execution:
  • Out-of-order execution of loads to (same or different) addresses

  thread 1        thread 2
  store 1 → y     load x
  store 1 → x     load y
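The execution above is deterministic enough to replay directly. The sketch below models globally visible state as one dict and P0's not-yet-drained store as a private dict (all names invented for this illustration):

```python
# Replay the slide's execution: P0's store to A misses and waits in its
# store buffer; its store to flag hits and becomes globally visible at
# once. P1 reads only the globally visible state. Illustrative model.
memory = {"A": 0, "flag": 0}     # caches + memory: what other processors see
p0_store_buffer = {}             # P0's private, not-yet-visible stores

# P0 executes: A = 1 (miss -> buffered), then flag = 1 (hit -> visible)
p0_store_buffer["A"] = 1
memory["flag"] = 1

# P1 executes: spin on flag, then print A
assert memory["flag"] == 1       # P1 exits its spin loop...
print(memory["A"])               # ...and prints 0, the "impossible" result

# Eventually the store buffer drains and the write to A becomes visible
memory.update(p0_store_buffer)
print(memory["A"])               # 1 -- but P1 has already printed
```

Nothing here violates coherence: each location individually behaves fine; it is the ordering *between* the two locations that the store buffer breaks.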

Simplifying Out-of-Order Execution

• Two options:
  • Option #1: allow this sort of "odd" reordering
  • Option #2: add more hardware to prevent these reorderings
• How to prevent?
  • Scan the load queue (LQ) on stores from other threads
  • Flush and roll back on a conflict
• How to detect these stores from other threads?
  • Leverage cache coherence!
  • As long as the block remains in the cache…
    • Another core can't write to it
  • Thus, any time a block leaves the cache (invalidation or eviction)…
    • Scan the load queue. If any loads to that address have executed
      but not committed, squash the pipeline and restart

3 Classes of Memory Consistency Models

• Sequential consistency (SC) (MIPS, PA-RISC)
  • Formal definition of the memory view programmers expect
    • 1. Processors see their own loads and stores in program order
    • 2. Processors see others' loads and stores in program order
    • 3. All processors see the same global load/store ordering
  – Last two conditions not naturally enforced by coherence
  • Corresponds to some sequential interleaving of uniprocessor orders
  • Indistinguishable from a multi-programmed uniprocessor
• Processor consistency (PC) (x86, SPARC)
  • Allows an in-order store buffer
  • Stores can be deferred, but must be put into the cache in order
• Release consistency (RC) (ARM, Itanium, PowerPC)
  • Allows an unordered store buffer
  • Stores can be put into the cache in any order
  • Loads reordered, too

Answer to Example #1

• Initially: all variables zero (that is, x is 0, y is 0)

  thread 1        thread 2
  store 1 → y     store 1 → x
  load x          load y

• What value pairs can be read by the two loads?

  store 1 → y     store 1 → y     store 1 → y
  load x          store 1 → x     store 1 → x
  store 1 → x     load x          load y
  load y          load y          load x
  (x=0, y=1)      (x=1, y=1)      (x=1, y=1)

  store 1 → x     store 1 → x     store 1 → x
  load y          store 1 → y     store 1 → y
  store 1 → y     load y          load x
  load x          load x          load y
  (x=1, y=0)      (x=1, y=1)      (x=1, y=1)

• What about (x=0, y=0)? Yes! (for x86, SPARC, ARM, PowerPC)

Answer to Example #2

• Initially: all variables zero (that is, x is 0, y is 0)

  thread 1        thread 2
  store 1 → y     load x
  store 1 → x     load y

• What value pairs can be read by the two loads?
  • (x=1, y=1)
  • (x=0, y=0)
  • (x=0, y=1)
• Is (x=1, y=0) allowed?
  • Yes! (for ARM, PowerPC, Itanium, and Alpha)
  • No! (for Intel/AMD x86, Sun SPARC, IBM 370)
  • Assuming the compiler didn't reorder anything…
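Under sequential consistency, the legal outcomes of Example #1 are exactly those produced by some interleaving of the two program orders, so they can be enumerated mechanically. This sketch does that enumeration (all names are invented for the illustration):

```python
# Enumerate every sequentially consistent (SC) interleaving of Example #1
# and collect the (x, y) value pairs the two loads can observe.
from itertools import combinations

t1 = [("st", "y"), ("ld", "x")]   # thread 1: store 1 -> y; load x
t2 = [("st", "x"), ("ld", "y")]   # thread 2: store 1 -> x; load y

def run(schedule):
    """Execute one total order of the four operations, return (x, y) read."""
    mem = {"x": 0, "y": 0}
    loads = {}
    for op, var in schedule:
        if op == "st":
            mem[var] = 1
        else:
            loads[var] = mem[var]
    return (loads["x"], loads["y"])

outcomes = set()
n = len(t1) + len(t2)
for slots in combinations(range(n), len(t1)):  # which slots run thread 1
    schedule, i1, i2 = [], iter(t1), iter(t2)
    for pos in range(n):
        schedule.append(next(i1) if pos in slots else next(i2))
    outcomes.add(run(schedule))

print(sorted(outcomes))   # [(0, 1), (1, 0), (1, 1)] -- SC never yields (0, 0)
```

The (0, 0) outcome that real x86/SPARC/ARM/PowerPC machines can produce is exactly the one this exhaustive SC search cannot: it requires each load to bypass its thread's earlier store, i.e., hardware reordering beyond any interleaving.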

Answer to Example #3

• Initially: all variables zero (flag is 0, a is 0)

  thread 1            thread 2
  store 1 → a         while(flag == 0) { }
  store 1 → flag      load a

• What value can be read by "load a"?
  • "load a" can see the value "1"
• Can "load a" read the value zero? (same as last slide)
  • Yes! (for ARM, PowerPC, Itanium, and Alpha)
  • No! (for Intel/AMD x86, Sun SPARC, IBM 370)
  • Assuming the compiler didn't reorder anything…

Restoring Order (Hardware)

• Sometimes we need ordering (mostly we don't)
  • Prime example: ordering between "lock" and data
• How? Insert fences (memory barriers)
  • Special instructions, part of the ISA
• Example
  • Ensure that loads/stores don't cross lock acquire/release operations

    acquire
    fence
    critical section
    fence
    release

• How do fences work?
  • They stall execution until the write buffers are empty
  • Makes lock acquisition and release slow(er)
• Use a synchronization library, don't write your own
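The effect of a fence ("stall until the write buffers are empty") can be shown in the same toy store-buffer model used earlier: drain before publishing the flag, and the flag/A idiom becomes safe. Illustrative sketch only, with invented names:

```python
# Sketch: a fence drains the store buffer before later memory operations
# execute, which is what makes the flag/A idiom safe in this toy model.
memory = {"A": 0, "flag": 0}     # globally visible state
store_buffer = {}                # this processor's buffered stores

def fence():
    """Stall until all buffered stores are globally visible."""
    memory.update(store_buffer)
    store_buffer.clear()

# P0 with a fence between its two stores:
store_buffer["A"] = 1     # store to A misses: buffered
fence()                   # drain: A's new value is now visible everywhere
memory["flag"] = 1        # only then publish the flag

# P1: once it sees flag == 1, it is guaranteed to also see A == 1
assert memory["flag"] == 1
print(memory["A"])        # 1
```

This also makes the cost visible: the fence serializes the store with respect to the buffer drain, which is why fenced lock acquire/release is slower than plain stores.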

Restoring Order (Software) Summary

Restoring Order (Software)

• These slides have focused mostly on hardware reordering
  • But the compiler also reorders instructions (reason #3)
• How do we tell the compiler not to reorder things?
  • Depends on the language…
• In Java:
  • The built-in "synchronized" constructs inform the compiler to limit its
    optimization scope (prevent reorderings across synchronization)
  • Or, the programmer uses the "volatile" keyword to explicitly mark variables
  • The Java compiler also inserts the hardware-level ordering instructions
• In C/C++:
  • Much more murky, as the language doesn't define synchronization
  • Lots of hacks: "inline assembly", volatile, atomics (newly proposed)
  • Programmer may need to explicitly insert hardware-level fences
• Use a synchronization library, don't write your own

Summary

[Figure: apps / system software / multiprocessor hardware (Mem, CPUs, I/O)]

• Explicit parallelism
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • VI, MSI, MESI
  • Bus-based protocols
  • Directory protocols
• Memory consistency models