CS 267: Introduction to Parallel Machines and Programming Models
Lecture 3
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr15/

Outline
• Overview of parallel machines (~hardware) and programming models (~software)
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs or GPUs
  • Grid
• Note: Parallel machine may or may not be tightly coupled to programming model
  • Historically, tight coupling
  • Today, portability is important



A generic parallel architecture
• Figure: processors (Proc) connected through an Interconnection Network to memories (Memory)
• Where is the memory physically located?
• Is it connected directly to processors?
• What is the connectivity of the network?

Parallel Programming Models
• Programming model is made up of the languages and libraries that create an abstract view of the machine
• Control
  • How is parallelism created?
  • What orderings exist between operations?
• Data
  • What data is private vs. shared?
  • How is logically shared data accessed or communicated?
• Synchronization
  • What operations can be used to coordinate parallelism?
  • What are the atomic (indivisible) operations?
• Cost
  • How do we account for the cost of each of the above?


Simple Example
• Consider applying a function f to the elements of an array A and then computing its sum:
  ∑_{i=0}^{n-1} f(A[i])
• Questions:
  • Where does A live? All in single memory? Partitioned?
  • What work will be done by each processor?
  • They need to coordinate to get a single result, how?
• Figure: A = array of all data; fA = f(A); s = sum(fA)

Programming Model 1: Shared Memory
• Program is a collection of threads of control
  • Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common blocks, or global heap
• Threads communicate implicitly by writing and reading shared variables
• Threads coordinate by synchronizing on shared variables
• Figure: threads P0, P1, …, Pn, each with private memory (e.g., i: 2, i: 5, i: 8), all reading and writing a shared memory that holds s (s = ..., y = ..s ...)
• (A minimal sketch of this model in code follows below.)
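The shared-memory model above can be made concrete with POSIX threads in C. This is a minimal sketch, not code from the lecture; the variable names and thread count are illustrative assumptions, and the unsynchronized update of s is deliberate, since the race it exposes is exactly what the next slides discuss.

```c
/* Minimal sketch of the shared-memory model using POSIX threads (C).
   Illustrative only: names and values are made up. */
#include <pthread.h>
#include <stdio.h>

static int s = 0;                 /* logically shared: lives in the global data segment */

void *worker(void *arg) {
    int i = *(int *)arg;          /* logically private: each thread's own stack copy */
    s += i;                       /* implicit communication through shared memory
                                     (unsynchronized here -- see the race discussion below) */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {2, 5};
    for (int k = 0; k < 2; k++)
        pthread_create(&t[k], NULL, worker, &ids[k]);
    for (int k = 0; k < 2; k++)
        pthread_join(t[k], NULL);
    printf("s = %d\n", s);        /* threads coordinated only by join; no locks yet */
    return 0;
}
```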


Simple Example
• Shared memory strategy:
  • small number p << n = size(A) of processors attached to a single memory
  • goal: compute ∑_{i=0}^{n-1} f(A[i])
• Parallel Decomposition:
  • Each evaluation and each partial sum is a task
  • Assign n/p numbers to each of p procs
    • Each computes independent "private" results and partial sum
    • Collect the p partial sums and compute a global sum
• Two Classes of Data:
  • Logically Shared
    • The original n numbers, the global sum
  • Logically Private
    • The individual function evaluations
    • What about the individual partial sums?

Shared Memory "Code" for Computing a Sum

  fork(sum, a[0:n/2-1]);
  sum(a[n/2, n-1]);

  static int s = 0;

  Thread 1                      Thread 2
  for i = 0, n/2-1              for i = n/2, n-1
      s = s + f(A[i])               s = s + f(A[i])

• What is the problem with this program? (A runnable version of this racy code is sketched below.)
• A race condition or data race occurs when:
  - Two processors (or two threads) access the same variable, and at least one does a write
  - The accesses are concurrent (not synchronized) so they could happen simultaneously
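Here is how the racy pseudocode above might look as runnable C with POSIX threads. It is a sketch only; f(), the array contents, and the two-thread split are assumptions made for illustration.

```c
/* A runnable version of the racy sum from the slide, using POSIX threads (C). */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];
static double s = 0.0;            /* shared: both threads update it */

static double f(double x) { return x * x; }

void *partial(void *arg) {
    long t = (long)arg;           /* thread id: 0 or 1 */
    long lo = t * (N / 2), hi = lo + N / 2;
    for (long i = lo; i < hi; i++)
        s = s + f(A[i]);          /* DATA RACE: unsynchronized read-modify-write of s */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1.0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, partial, (void *)0);
    pthread_create(&t2, NULL, partial, (void *)1);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %f (expected %d, but updates may be lost)\n", s, N);
    return 0;
}
```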


Shared Memory "Code" for Computing a Sum
• Assume A = [3, 5], f(x) = x^2, and s = 0 initially
• For this program to work, s should be 3^2 + 5^2 = 34 at the end
  • but it may be 34, 9, or 25

  Thread 1                                   Thread 2
  compute f(A[i]) and put in reg0   (9)      compute f(A[i]) and put in reg0   (25)
  reg1 = s                          (0)      reg1 = s                           (0)
  reg1 = reg1 + reg0                (9)      reg1 = reg1 + reg0                (25)
  s = reg1                          (9)      s = reg1                          (25)

• The atomic operations are reads and writes
  • Never see ½ of one number, but the += operation is not atomic
• All computations happen in (private) registers

Improved Code for Computing a Sum

  static int s = 0;
  static lock lk;

  Thread 1                              Thread 2
  local_s1 = 0                          local_s2 = 0
  for i = 0, n/2-1                      for i = n/2, n-1
      local_s1 = local_s1 + f(A[i])         local_s2 = local_s2 + f(A[i])
  lock(lk);                             lock(lk);
  s = s + local_s1                      s = s + local_s2
  unlock(lk);                           unlock(lk);

• Why not do the lock inside the loop?
• Since addition is associative, it's OK to rearrange order
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - But there is still a race condition on the update of shared s
  - The race condition can be fixed by adding locks (only one thread can hold a lock at a time; others wait for it)
• (A pthreads version with a real mutex is sketched below.)
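The improved pseudocode above, with private partial sums and one lock-protected update per thread, might be written in C with POSIX threads as follows; again a sketch with illustrative names, not the lecture's own code.

```c
/* Lock-protected version of the sum, mirroring the "Improved Code" slide. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];
static double s = 0.0;
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return x * x; }

void *partial(void *arg) {
    long t = (long)arg;
    long lo = t * (N / 2), hi = lo + N / 2;
    double local_s = 0.0;             /* private partial sum: no sharing in the loop */
    for (long i = lo; i < hi; i++)
        local_s += f(A[i]);
    pthread_mutex_lock(&lk);          /* one update of shared s per thread */
    s += local_s;
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1.0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, partial, (void *)0);
    pthread_create(&t2, NULL, partial, (void *)1);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %f\n", s);            /* now deterministic */
    return 0;
}
```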


Review so far and plan for Lecture 3
  Programming Models            Machine Models
  1. Shared Memory              1a. Shared Memory
                                1b. Multithreaded Procs.
                                1c. Distributed Shared Mem.
  2. Message Passing            2a. Distributed Memory
                                2b. Internet & Grid Computing
  2a. Global Address Space      2c. Global Address Space
  3. Data Parallel              3a. SIMD
                                3b. Vector
  4. Hybrid                     4. Hybrid
  What about GPU? What about Cloud?

Machine Model 1a: Shared Memory
• Processors all connected to a large shared memory
• Typically called Symmetric Multiprocessors (SMPs)
  • SGI, Sun, HP, Intel, IBM SMPs
  • Multicore chips, except that all caches are shared
• Advantage: uniform memory access (UMA)
• Cost: much cheaper to access data in cache than main memory
• Difficulty scaling to large numbers of processors
  • <= 32 processors typical
• Figure: P1, P2, …, Pn, each with a cache ($), on a bus to a shared cache and memory (Note: $ = cache)



Problems Scaling Shared Memory Hardware
• Why not put more processors on (with larger memory)?
• The memory bus becomes a bottleneck
• Caches need to be kept coherent
• Example from a Parallel Spectral Transform Shallow Water Model (PSTSWM) demonstrates the problem
  • Experimental results (and slide) from Pat Worley at ORNL

Example: Problem in Scaling Shared Memory
• Performance degradation is a "smooth" function of the number of processes
• No shared data between them, so there should be perfect parallelism
• This is an important kernel in atmospheric models
  • 99% of the floating point operations are multiplies or adds, which generally run well on all processors
  • But it does sweeps through memory with little reuse of operands, so uses the bus and shared memory frequently
• These experiments show performance per processor, with one "copy" of the code running independently on varying numbers of procs
  • The best case for shared memory: no sharing
  • But the data doesn't all fit in the registers/cache
• (Code was run for 18 vertical levels with a range of horizontal sizes.)


Machine Model 1b: Multithreaded Processor
• Multiple thread "contexts" without full processors
• Memory and some other state is shared
• Sun Niagara processor (for servers)
  • Up to 64 threads all running simultaneously (8 threads x 8 cores)
  • In addition to sharing memory, they share floating point units
  • Why? Switch between threads for long-latency memory operations
• Cray MTA and Eldorado processors (for HPC)
• Figure: thread contexts T0, T1, …, Tn over a shared cache ($), shared floating point units, etc., and a single memory

Eldorado Processor (logical view)
• Figure: programs running in parallel are split into serial code and concurrent threads of computation (subproblems A and B, loop iterations i = 0, 1, …, n), multithreaded across multiple processors
• Source: John Feo, Cray

Machine Model 1c: Distributed Shared Memory
• Memory is logically shared, but physically distributed
• Any processor can access any address in memory
• Cache lines (or pages) are passed around the machine
• SGI is canonical example (+ research machines)
• Scales to 512 (SGI Altix (Columbia) at NASA/Ames)
• Limitation is cache coherency protocols – how to keep cached copies of the same address consistent
• Figure: P1, P2, …, Pn, each with a cache ($) and local memory, connected by a network
  • Cache lines (pages) must be large to amortize overhead → locality still critical to performance

Review so far and plan for Lecture 3
• Roadmap slide: programming models vs. machine models, as shown earlier. What about GPU? What about Cloud?



Review so far and plan for Lecture 3
• Roadmap slide: programming models vs. machine models, as shown earlier. What about GPU? What about Cloud?

Programming Model 2: Message Passing
• Program consists of a collection of named processes
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO shared data
• Logically shared data is partitioned over local processes
• Processes communicate by explicit send/receive pairs
  • Coordination is implicit in every communication event
  • MPI (Message Passing Interface) is the most commonly used SW
• Figure: processes P0, P1, …, Pn, each with private memory (s: 12, s: 14, s: 11; i: 2, i: 3, i: 1), connected by a network; values move only through explicit operations, e.g., one process executes "send P1, s" and another executes "receive Pn, s" before using the value (y = ..s ...)
• (A minimal MPI sketch of this model follows below.)
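A minimal C/MPI sketch of the model just described: each process owns a private s, and the only way another process sees it is an explicit, matched send/receive pair. This is illustrative code, not taken from the slides; run it with at least two processes, e.g., mpirun -n 2.

```c
/* Minimal sketch of the message-passing model in C with MPI.
   Process 0 sends its private value of s to process 1; nothing is shared. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) { MPI_Finalize(); return 0; }    /* needs at least two processes */

    int s = 12 + rank;                               /* each process has its own private s */
    if (rank == 0) {
        MPI_Send(&s, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int remote;
        MPI_Recv(&remote, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received s = %d from rank 0 (local s = %d)\n", remote, s);
    }
    MPI_Finalize();
    return 0;
}
```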



Computing s = f(A[1]) + f(A[2]) on each processor
• First possible solution – what could go wrong?

  Processor 1                      Processor 2
  xlocal = f(A[1])                 xlocal = f(A[2])
  send xlocal, proc2               send xlocal, proc1
  receive xremote, proc2           receive xremote, proc1
  s = xlocal + xremote             s = xlocal + xremote

• If send/receive acts like the telephone system? The post office?
• Second possible solution

  Processor 1                      Processor 2
  xlocal = f(A[1])                 xlocal = f(A[2])
  send xlocal, proc2               receive xremote, proc1
  receive xremote, proc2           send xlocal, proc1
  s = xlocal + xremote             s = xlocal + xremote

• What if there are more than 2 processors?
• (An MPI version of this exchange is sketched below.)

MPI – the de facto standard
• MPI has become the de facto standard for parallel computing using message passing
• Pros and Cons of standards
  • MPI finally created a standard for applications development in the HPC community → portability
  • The MPI standard is a least common denominator building on mid-80s technology, so may discourage innovation
• Programming Model reflects hardware!
• "I am not sure how I will program a Petaflops computer, but I am sure that I will need MPI somewhere" – HDS 2001
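One way the two-processor exchange above might be written in C with MPI. MPI_Sendrecv pairs the send and the receive inside the library, so the exchange cannot deadlock even if a plain send were to block until matched (the "telephone system" behavior the slide asks about). f() and the contents of A are illustrative assumptions.

```c
/* A deadlock-free version of the exchange from the slide, in MPI (C). */
#include <mpi.h>
#include <stdio.h>

static double f(double x) { return x * x; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs != 2) {                        /* this sketch assumes exactly 2 ranks */
        if (rank == 0) printf("run with exactly 2 processes\n");
        MPI_Finalize();
        return 0;
    }

    double A[2] = {3.0, 5.0};
    double xlocal = f(A[rank]);               /* rank 0 plays "Processor 1" in the slide */
    int partner = 1 - rank;
    double xremote;

    MPI_Sendrecv(&xlocal, 1, MPI_DOUBLE, partner, 0,
                 &xremote, 1, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    double s = xlocal + xremote;              /* every rank ends with the same s */
    printf("rank %d: s = %f\n", rank, s);
    MPI_Finalize();
    return 0;
}
```

For more than two processors, the same pattern is usually replaced by a collective such as MPI_Allreduce, which computes the global sum on every rank.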

Machine Model 2a: Distributed Memory
• Cray XE6 (Hopper), Cray XC30 (Edison)
• PC Clusters (Berkeley NOW, Beowulf)
• Edison, Hopper, most of the Top500, are distributed memory machines, but the nodes are SMPs
• Each processor has its own memory and cache but cannot directly access another processor's memory
• Each "node" has a Network Interface (NI) for all communication and synchronization
• Figure: P0, P1, …, Pn, each with its own memory and NI, connected by an interconnect

PC Clusters: Contributions of Beowulf
• An experiment in systems (1994)
• Established vision of low cost, high end computing
• Cost effective because it uses off-the-shelf parts
• Demonstrated effectiveness of PC clusters for some (not all) classes of applications
• Provided networking software
• Conveyed findings to broad community (great PR)
  • Tutorials and book
• Design standard to rally community!
• Standards beget: books, trained people, software … virtuous cycle
• Adapted from Gordon Bell, presentation at Salishan 2000

Tflop/s and Pflop/s Clusters (2009 data)
The following are examples of clusters configured out of separate networks and processor components
• About 82% of Top 500 are clusters (Nov 2009, up from 72% in 2005)
• 4 of top 10
• IBM Cell cluster at Los Alamos (Roadrunner) is #2
  • 12,960 Cell chips + 6,948 dual-core AMD Opterons; 129,600 cores altogether
  • 1.45 PFlops peak, 1.1 PFlops Linpack, 2.5 MWatts
  • Infiniband connection network
• For more details use "database/sublist generator" at www.top500.org

Machine Model 2b: Internet/Grid Computing
• SETI@Home: Running on 3.3M hosts, 1.3M users (1/2013)
  • ~1000 CPU Years per Day (older data)
  • 485,821 CPU Years so far
• Sophisticated Data & Signal Processing Analysis
• Distributes Datasets from Arecibo Radio Telescope
• Next Step – Allen Telescope Array
• Google "volunteer computing" or "BOINC"

Programming Model 2a: Global Address Space
• Program consists of a collection of named threads
  • Usually fixed at program startup time
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local processes
  • Cost model says remote data is expensive
• Examples: UPC, Titanium, Co-Array Fortran
• Global Address Space programming is an intermediate point between message passing and shared memory
• Figure: threads P0, P1, …, Pn, each with private memory (i: 1, i: 5, i: 8) and its own slice of a shared array s (s[0]: 26, s[1]: 32, s[n]: 27); any thread may read any element (y = ..s[i] ...) and writes its own slice (s[myThread] = ...)

Machine Model 2c: Global Address Space
• Cray T3D, T3E, X1, and HP Alphaserver cluster
• Clusters built with Quadrics, Myrinet, or Infiniband
• The network interface supports RDMA (Remote Direct Memory Access)
  • NI can directly access memory without interrupting the CPU
  • One processor can read/write memory with one-sided operations (put/get)
  • Not just a load/store as on a shared memory machine
    • Continue computing while waiting for memory op to finish
  • Remote data is typically not cached locally
• Figure: P0, P1, …, Pn, each with memory and an NI, connected by an interconnect; the global address space may be supported in varying degrees
• (A one-sided put/get sketch follows below.)
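The slide's programming examples are PGAS languages (UPC, Titanium, Co-Array Fortran). As a C-only stand-in for the one-sided put/get idea, here is a sketch using MPI's RMA interface; this substitution is mine, not the lecture's, and the window layout is an illustrative assumption.

```c
/* One-sided get illustrated with MPI RMA in C: read a remote value without
   the remote process calling receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process exposes one int; together they act like a global array s[0..nprocs-1]. */
    int s = 100 + rank;
    MPI_Win win;
    MPI_Win_create(&s, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int remote = -1;
    MPI_Win_fence(0, win);
    MPI_Get(&remote, 1, MPI_INT, (rank + 1) % nprocs, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                    /* completes the one-sided operation */

    printf("rank %d read s[%d] = %d\n", rank, (rank + 1) % nprocs, remote);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```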

Review so far and plan for Lecture 3
• Roadmap slide: programming models vs. machine models, as shown earlier. What about GPU? What about Cloud?

Programming Model 3: Data Parallel
• Single thread of control consisting of parallel operations
  • A = B + C could mean add two arrays in parallel
• Parallel operations applied to all (or a defined subset) of a data structure, usually an array
  • Communication is implicit in parallel operators
  • Elegant and easy to understand and reason about
  • Coordination is implicit – statements executed synchronously
  • Similar to Matlab language for array operations
• Drawbacks:
  • Not all problems fit this model
  • Difficult to map onto coarse-grained machines
• Figure: A = array of all data; fA = f(A); s = sum(fA)
• (A small data-parallel sketch follows below.)
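The slide's two data-parallel operations, A = B + C and s = sum(f(A)), written as a C/OpenMP sketch. The model itself is array-language style (Matlab-like); OpenMP is used here only as a concrete stand-in for "apply this to every element", and the array sizes and f() are assumptions.

```c
/* Data-parallel operations as a C/OpenMP sketch. */
#include <stdio.h>

#define N 1000000
static double A[N], B[N], C[N];

static double f(double x) { return x * x; }

int main(void) {
    for (int i = 0; i < N; i++) { B[i] = 1.0; C[i] = 2.0; }

    /* A = B + C: one logical operation applied to every element. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = B[i] + C[i];

    /* s = sum(f(A)): a data-parallel reduction. */
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < N; i++)
        s += f(A[i]);

    printf("s = %f\n", s);
    return 0;
}
```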



Machine Model 3a: SIMD System
• A large number of (usually) small processors
• A single "control processor" issues each instruction
  • Each processor executes the same instruction
  • Some processors may be turned off on some instructions
• Originally machines were specialized to scientific computing, few made (CM2, Maspar)
• Programming model can be implemented in the compiler
  • mapping n-fold parallelism to p processors, n >> p, but it's hard (e.g., HPF)
• Figure: a control processor driving P1, P2, P3, …, Pn, each with its own memory and NI, over an interconnect

Machine Model 3b: Vector Machines
• Vector architectures are based on a single processor
  • Multiple functional units
  • All performing the same operation
  • Instructions may specify large amounts of parallelism (e.g., 64-way) but hardware executes only a subset in parallel
• Historically important
  • Overtaken by MPPs in the 90s
• Re-emerging in recent years
  • At a large scale in the Earth Simulator (NEC SX6) and Cray X1
  • At a small scale in SIMD media extensions to microprocessors (a small SSE sketch follows below)
    • SSE, SSE2 (Intel: Pentium/IA64)
    • Altivec (IBM/Motorola/Apple: PowerPC)
    • VIS (Sun: Sparc)
  • At a larger scale in GPUs
• Key idea: Compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to
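A small-scale example of the SIMD media extensions mentioned above: adding four floats at a time with SSE intrinsics in C. A sketch for illustration; in practice a compiler's auto-vectorizer often generates equivalent code from the plain scalar loop.

```c
/* Small-scale SIMD: 4-wide float addition with SSE intrinsics. */
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

void add_arrays(const float *b, const float *c, float *a, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {                 /* 4 lanes per vector register */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_loadu_ps(&c[i]);
        _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));
    }
    for (; i < n; i++)                           /* scalar cleanup for the remainder */
        a[i] = b[i] + c[i];
}

int main(void) {
    float b[6] = {1, 2, 3, 4, 5, 6}, c[6] = {10, 20, 30, 40, 50, 60}, a[6];
    add_arrays(b, c, a, 6);
    for (int i = 0; i < 6; i++) printf("%g ", a[i]);
    printf("\n");
    return 0;
}
```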

Vector Processors
• Vector instructions operate on a vector of elements
  • These are specified as operations on vector registers
  • Figure: a scalar add r3 = r1 + r2 vs. a vector add vr3 = vr1 + vr2 (logically performs #elts adds in parallel)
• A vector register holds ~32-64 elts
• The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
• The hardware performs a full vector operation in #elements-per-vector-register / #pipes steps (actually, it performs #pipes adds in parallel)
• (A strip-mining sketch of how a compiler maps a long loop onto this follows below.)

Cray X1: Parallel Vector Architecture
• Cray combines several technologies in the X1
  • 12.8 Gflop/s Vector processors (MSP)
  • Shared caches (unusual on earlier vector machines)
  • 4 processor nodes sharing up to 64 GB of memory
  • Single System Image to 4096 Processors
  • Remote put/get between nodes (faster than MPI)
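A plain-C strip-mining sketch of the idea above: a long loop is processed in chunks of the vector-register length, the way a vectorizing compiler would map it onto vector registers and pipes. VLEN and the names are assumptions for illustration.

```c
/* Strip-mining a long loop into vector-length chunks. */
#include <stdio.h>

#define VLEN 64   /* assumed vector-register length (~32-64 elements per the slide) */

void vadd(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i += VLEN) {
        int len = (n - i < VLEN) ? n - i : VLEN;
        /* Conceptually, this inner loop is one vector add instruction;
           the hardware retires it in len / #pipes steps. */
        for (int j = 0; j < len; j++)
            c[i + j] = a[i + j] + b[i + j];
    }
}

int main(void) {
    enum { N = 200 };
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
    vadd(a, b, c, N);
    printf("c[0]=%g c[%d]=%g\n", c[0], N - 1, c[N - 1]);
    return 0;
}
```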


Earth Simulator Architecture
• Parallel Vector Architecture
  • High speed (vector) processors
  • High memory bandwidth (vector architecture)
  • Fast network (new crossbar switch)
• Rearranging commodity parts can't match this performance

Review so far and plan for Lecture 3
• Roadmap slide: programming models vs. machine models, as shown earlier (now with 3a. SIMD & GPU). What about GPU? What about Cloud?



Machine Model 4: Hybrid machines
• Multicore/SMPs are a building block for a larger machine with a network
• Old name:
  • CLUMP = Cluster of SMPs
• Many modern machines look like this:
  • Edison and Hopper (2x12 way nodes), most of Top500
• What is an appropriate programming model #4 ???
  • Treat machine as "flat", always use message passing, even within SMP (simple, but ignores an important part of memory hierarchy)
  • Shared memory within one SMP, but message passing outside of an SMP
• GPUs may also be a building block
  • Nov 2014 Top500: 14% have accelerators, but 35% of performance

Accelerators in Top 500 (Nov 2014)
• Figure: chart of the number of Top500 systems with accelerators/co-processors


Performance of Accelerators in Top500, Nov 2014
• Figure: chart of the performance contributed by accelerated systems

Performance Share of Accelerators in Top500, Nov 2014
• Figure: chart of accelerators' share of total Top500 performance


Programming Model 4: Hybrids
• Programming models can be mixed
  • Message passing (MPI) at the top level with shared memory within a node is common
  • New DARPA HPCS languages mix data parallel and threads in a global address space
  • Global address space models can (often) call message passing libraries or vice versa
  • Global address space models can be used in a hybrid mode
    • Shared memory when it exists in hardware
    • Communication (done by the runtime system) otherwise
• For better or worse
  • often programmed this way for peak performance
• (A minimal MPI+OpenMP sketch follows below.)

Review so far and plan for Lecture 3
• Roadmap slide: programming models vs. machine models, as shown earlier. What about GPU? What about Cloud?
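The common hybrid pattern named above, MPI between nodes and shared memory within a node, might look like this in C with MPI + OpenMP: OpenMP threads form a per-rank partial sum and MPI_Allreduce combines them. A sketch only; the data layout and names are illustrative assumptions.

```c
/* Hybrid MPI + OpenMP sketch: shared memory within a rank, messages across ranks. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000
static double A[N];

static double f(double x) { return x * x; }

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < N; i++) A[i] = 1.0;

    /* Each rank works on a contiguous slice of the (replicated, for simplicity) array. */
    int chunk = N / nprocs;
    int lo = rank * chunk;
    int hi = (rank == nprocs - 1) ? N : lo + chunk;

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)   /* shared memory within the node */
    for (int i = lo; i < hi; i++)
        local += f(A[i]);

    double s = 0.0;                               /* message passing across nodes */
    MPI_Allreduce(&local, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("s = %f\n", s);
    MPI_Finalize();
    return 0;
}
```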


What about GPU and Cloud?
• GPU's big performance opportunity is data parallelism
  • Most programs have a mixture of highly parallel operations, and some not so parallel
  • GPUs provide a threaded programming model (CUDA) for data parallelism to accommodate both
  • Current research attempting to generalize programming model to other architectures, for portability (OpenCL)
  • Guest lecture later in the semester
• Cloud computing lets large numbers of people easily share O(10^5) machines
  • MapReduce was first programming model: data parallel on distributed memory
  • More flexible models (Hadoop, Spark, …) invented since then
  • Guest lecture later in the semester
• Both may be used for class projects

Lessons from Lecture 3
• Three basic conceptual models
  • Shared memory
  • Distributed memory
  • Data parallel
  and hybrids of these machines
• All of these machines rely on dividing up work into parts that are:
  • Mostly independent (little synchronization)
  • About same size (load balanced)
  • Have good locality (little communication)
• Next Lecture: How to identify parallelism and locality in applications
