CIS 501 Computer Architecture This Unit: Shared Memory

Unit 9: Multicore (Shared Memory Multiprocessors)

This Unit: Shared Memory Multiprocessors

• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models

(Figure: the system stack: applications, system software, and hardware with memory, CPUs, and I/O.)

Slides originally developed by Amir Roth with contributions by Milo Martin at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

Readings

• Textbook (MA:FSPTCM)
  • Sections 7.0, 7.1.3, 7.2-7.4
  • Section 8.2

Beyond Implicit Parallelism

• Consider "daxpy":

    daxpy(double *x, double *y, double *z, double a):
        for (i = 0; i < SIZE; i++)
            z[i] = a*x[i] + y[i];

• Lots of instruction-level parallelism (ILP)
  • Great!
  • But how much can we really exploit? 4-wide? 8-wide?
  • Limits to (efficient) superscalar execution
• But, if SIZE is 10,000, the loop has 10,000-way parallelism!
  • How do we exploit it?

Explicit Parallelism

• Consider "daxpy" again
• Break it up into N "chunks" on N cores!
  • Done by the programmer (or maybe a really smart compiler)

    daxpy(int chunk_id, double *x, double *y, double *z, double a):
        chunk_size = SIZE / N
        my_start = chunk_id * chunk_size
        my_end = my_start + chunk_size
        for (i = my_start; i < my_end; i++)
            z[i] = a*x[i] + y[i]

• Example: SIZE = 400, N = 4

    Chunk ID   Start   End
    0            0      99
    1          100     199
    2          200     299
    3          300     399

• Assumes
  • Local variables are "private" and x, y, and z are "shared"
  • SIZE is a multiple of N (that is, SIZE % N == 0)

Multiplying Performance

• A single processor can only be so fast
  • Limited clock frequency
  • Limited instruction-level parallelism
  • Limited cache hierarchy
• What if we need even more computing power?
  • Use multiple processors!
  • But how?
• High-end example: Sun Ultra Enterprise 25k
  • 72 UltraSPARC IV+ processors, 1.5 GHz
  • 1024 GBs of memory
  • Niche: large database servers
  • $$$

Multicore: Mainstream Multiprocessors

• Multicore chips
• IBM Power5
  • Two 2+ GHz PowerPC cores
  • Shared 1.5 MB L2, L3 tags
• AMD Quad Phenom
  • Four 2+ GHz cores
  • Per-core 512 KB L2 cache
  • Shared 2 MB L3 cache
• Intel Core i7 Quad
  • Four cores, private L2s
  • Shared 6 MB L3
• Sun Niagara
  • 8 cores, each 4-way threaded
  • Shared 2 MB L2, shared FP
  • For servers, not desktops
• Why multicore? What else would you do with 1 billion transistors?

(Die photos: IBM Power5 (Core 1, Core 2, 1.5 MB L2, L3 tags), Sun Niagara II, and the Intel quad-core "Core i7".)

Application Domains for Multiprocessors

• Scientific computing/supercomputing
  • Examples: weather simulation, aerodynamics, protein folding
  • Large grids, integrating changes over time
  • Each processor computes for a part of the grid
• Server workloads
  • Example: airline reservation database
  • Many concurrent updates, searches, lookups, queries
  • Processors handle different requests
• Media workloads
  • Processors compress/decompress different parts of image/frames
• Desktop workloads…
• Gaming workloads…
• But software must be written to expose parallelism

First, Uniprocessor Concurrency

• Software "thread": independent flows of execution
  • "Private" per-thread state
    • Context state: PC, registers
    • Stack (per-thread local variables)
  • "Shared" state: globals, heap, etc.
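The private-versus-shared split above can be made concrete with a small C sketch using the POSIX threads library (the function and variable names here are illustrative, not from the slides): each thread gets its own copy of the local my_id on its private stack, while both threads read and write the one shared global array.

```c
#include <pthread.h>

/* Globals (and the heap) are shared by all threads; each thread's
   stack, holding its locals, is private to that thread. */
static int shared_results[2];           /* shared state */

static void *worker(void *arg) {
    int my_id = *(int *)arg;            /* private local, on this thread's stack */
    shared_results[my_id] = my_id + 1;  /* write to shared memory */
    return NULL;
}

static void run_two_threads(void) {
    pthread_t t0, t1;
    int id0 = 0, id1 = 1;
    pthread_create(&t0, NULL, worker, &id0);
    pthread_create(&t1, NULL, worker, &id1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
}
```

Because the two threads write disjoint elements of the shared array, this particular program is race-free; overlapping writes would not be.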
• Threads generally share the same memory space
  • A "process" is like a thread, but with a different memory space
• Java has thread support built in; C/C++ supports the P-threads library
• Generally, system software (the O.S.) manages threads
  • "Thread scheduling", "context switching"
  • In a single-core system, all threads share the one processor
    • A hardware timer interrupt occasionally triggers the O.S.
    • Quickly swapping threads gives the illusion of concurrent execution
  • Much more in an operating systems course

"THREADING" & SHARED MEMORY EXECUTION MODEL

Multithreaded Programming Model

• Programmer explicitly creates multiple threads
• All loads & stores to a single shared memory space
  • Each thread has a private stack frame for local variables
• A "thread switch" can occur at any time
  • Pre-emptive multithreading by the OS
• Common uses:
  • Handling user interaction (GUI programming)
  • Handling I/O latency (send network message, wait for response)
  • Expressing parallel work via thread-level parallelism (TLP)
  • This is our focus!

Simplest Multiprocessor

• Replicate the entire processor pipeline!
  • Instead of replicating just the register file & PC
  • Exception: share the caches (we'll address this bottleneck later)
• Multiple threads execute
  • "Shared memory" programming model
  • Operations (loads and stores) are interleaved at random
  • A load returns the value written by the most recent store to the location

(Figure: two complete pipelines, each with its own PC and register file, sharing the I$ and D$.)

Alternative: Hardware Multithreading

• Hardware multithreading (MT)
  • Multiple threads dynamically share a single pipeline
  • Replicate only per-thread structures: program counter & registers
  • Hardware interleaves instructions
+ Multithreading improves utilization and throughput
  • Single programs utilize <50% of pipeline (branch, cache miss)
• Multithreading does not improve single-thread performance
  • Individual threads run as fast or even slower. Why?
• Coarse-grain MT: switch on L2 misses
• Simultaneous MT: no explicit switching, fine-grain interleaving

(Figure: one pipeline with two PCs and register files, selected by a thread ID.)

Shared Memory Implementations

• Multiplexed uniprocessor
  • Runtime system and/or OS occasionally pre-empts & swaps threads
  • Interleaved, but no parallelism
• Hardware multithreading
  • Tolerate pipeline latencies, higher efficiency
  • Same interleaved shared-memory model
• Multiprocessing
  • Multiply execution resources, higher peak performance
  • Same interleaved shared-memory model
  • Foreshadowing: allow private caches, further disentangle cores
• All support the shared memory programming model

Four Shared Memory Issues

1. Parallel programming
  • How does the programmer express the parallelism?
2. Synchronization
  • How to regulate access to shared data?
  • How to implement "locks"?
3. Cache coherence
  • If cores have private (non-shared) caches
  • How to make writes to one cache "show up" in others?
4. Memory consistency models
  • How to keep programmer sane while letting hardware optimize?
  • How to reconcile shared memory with store buffers?

PARALLEL PROGRAMMING
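As a concrete starting point, the chunked daxpy from earlier can be spawned with P-threads. This is a sketch under stated assumptions: SIZE, N, and the daxpy_chunk/daxpy_parallel names are illustrative, not from the slides.

```c
#include <pthread.h>

#define SIZE 400   /* assumed, as in the slide's chunk-table example */
#define N 4        /* assumed core/thread count; SIZE % N == 0       */

static double x[SIZE], y[SIZE], z[SIZE];  /* shared arrays */
static double a = 2.0;                    /* assumed scale factor */

/* One thread's share of the daxpy loop: chunk_id selects which
   SIZE/N-element slice this thread computes. */
static void *daxpy_chunk(void *arg) {
    long chunk_id = (long)arg;
    long chunk_size = SIZE / N;
    long my_start = chunk_id * chunk_size;
    long my_end = my_start + chunk_size;
    for (long i = my_start; i < my_end; i++)
        z[i] = a * x[i] + y[i];
    return NULL;
}

/* Spawn N threads, one per chunk, then wait for all of them. */
static void daxpy_parallel(void) {
    pthread_t tid[N];
    for (long t = 0; t < N; t++)
        pthread_create(&tid[t], NULL, daxpy_chunk, (void *)t);
    for (long t = 0; t < N; t++)
        pthread_join(tid[t], NULL);
}
```

Each chunk writes a disjoint slice of z, so the threads need no synchronization; pthread_join provides the "wait for all chunks" step.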
Parallel Programming

• One use of multiprocessors: multiprogramming
  • Running multiple programs with no interaction between them
  • Works great for a few cores, but what next?
• Or, programmers must explicitly express parallelism
  • "Coarse" parallelism beyond what the hardware can extract implicitly
  • Even the compiler can't extract it in most cases
• How?
  • Call libraries that perform well-known computations in parallel
    • Example: a matrix multiply routine, etc.
  • Add code annotations ("this loop is parallel"), OpenMP
  • Explicitly spawn "threads"; the OS schedules them on the cores
• Parallel programming: key challenge in the multicore revolution

Example: Parallelizing Matrix Multiply

• C = A × B

    for (I = 0; I < 100; I++)
        for (J = 0; J < 100; J++)
            for (K = 0; K < 100; K++)
                C[I][J] += A[I][K] * B[K][J];

• How to parallelize matrix multiply?
  • Replace outer "for" loop with "parallel_for"
    • Supported by many parallel programming environments
    • Parallel "for" loops, task-based parallelism, …
  • Implementation: give each of N processors 100/N loop iterations

    int start = (100/N) * my_id();
    for (I = start; I < start + 100/N; I++)
        for (J = 0; J < 100; J++)
            for (K = 0; K < 100; K++)
                C[I][J] += A[I][K] * B[K][J];

  • Each processor runs a copy of the loop above
  • Library provides the my_id() function

Example: Bank Accounts

• Consider shared

    struct acct_t { int balance; … };
    struct acct_t accounts[MAX_ACCT];     // current balances

    struct trans_t { int id; int amount; };
    struct trans_t transactions[MAX_TRANS]; // debit amounts

    for (i = 0; i < MAX_TRANS; i++) {
        debit(transactions[i].id, transactions[i].amount);
    }

    void debit(int id, int amount) {
        if (accounts[id].balance >= amount) {
            accounts[id].balance -= amount;
        }
    }

• Example of thread-level parallelism (TLP)
  • Collection of asynchronous tasks: not started and stopped together
  • Data shared "loosely" (sometimes yes, mostly no), dynamically
  • Example: database/web server (each query is a thread)
• Can we do these "debit" operations in parallel?
• Does the order matter?

Example: Bank Accounts

    struct acct_t { int bal; … };
    struct acct_t accts[MAX_ACCT];

    void debit(int id, int amt) {     0: addi r1,accts,r3
        if (accts[id].bal >= amt)     1: ld 0(r3),r4
        {                             2: blt r4,r2,done
            accts[id].bal -= amt;     3: sub r4,r2,r4
        }                             4: st r4,0(r3)
    }

• accts is global and thus shared; it can't be register allocated
• id and amt are private variables, register allocated to r1, r2
• Running example

An Example Execution / A Problem Execution

(Figures: two interleavings of Thread 0 and Thread 1 over time with the resulting memory state; the second interleaving illustrates the problem.)
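The lost-update race in the problem execution lives between the load (instruction 1) and the store (instruction 4) of debit. A minimal C sketch of the buggy routine, plus a lock-based fix of the kind the synchronization discussion ahead covers (the mutex and the debit_locked name are illustrative, not from the slides):

```c
#include <pthread.h>

struct acct_t { int bal; };

#define MAX_ACCT 4
static struct acct_t accts[MAX_ACCT];
static pthread_mutex_t accts_lock = PTHREAD_MUTEX_INITIALIZER;

/* The slide's debit: a load of bal (instruction 1), a compare, a
   subtract, and a store (instruction 4). If another thread's debit
   runs between the load and the store, both threads read the same
   old balance and one debit is lost: the problem-execution race. */
static void debit(int id, int amt) {
    if (accts[id].bal >= amt)
        accts[id].bal -= amt;
}

/* Fixed version: hold a lock across the read-modify-write so the
   whole sequence is atomic with respect to other debits. */
static void debit_locked(int id, int amt) {
    pthread_mutex_lock(&accts_lock);
    if (accts[id].bal >= amt)
        accts[id].bal -= amt;
    pthread_mutex_unlock(&accts_lock);
}
```

One global lock serializes all debits; a per-account lock would let debits to different accounts proceed in parallel while still closing the race on any one account.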
