CS575 Parallel Processing
Lecture two: Parallel Computer Models
Wim Bohm, Colorado State University
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.

Instruction and Data Streams
- John von Neumann: stored program architecture
  - One instruction stream, one data stream
- Michael Flynn: Taxonomy of // computers
  - SIMD: Single Instruction, Multiple Data -- data parallel
  - MIMD: Multiple Instruction, Multiple Data -- multi-processing
SISD: sequential machines
- Limiting factors
  - Instruction execution rate
  - Memory latency: ~100x slower than the instruction rate
- Coping with high memory latency
  - Caches (level 1, 2, 3)
  - Pre-fetching
  - Memory interleaving
How did instruction rate get so high?
- Higher clock rates
- Instruction Level Parallelism (ILP)
- 'RISC' architectures
- Pipelining
- Multiple functional units
- Reservation stations
- Speculation / branch prediction

SIMD
- One control unit
- Many processing elements
  - All executing the same instruction
  - Local memories
- Conditional execution
  - Where ocean: perform ocean step; where land: perform land step
  - Gives rise to idle processors (optimized in GPUs); see the sketch after this list
- PE interconnection
  - Usually a 2D mesh
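The ocean/land conditional execution above maps directly onto divergent (predicated) execution on today's SIMD/SIMT hardware. The CUDA sketch below is illustrative only: the kernel name, the isOcean mask, and the two update rules are invented placeholders, chosen just to show how lanes whose condition is false sit idle while the other branch executes.

    // Illustrative sketch of SIMD-style conditional execution (all names invented).
    // Every thread runs the same instruction stream; within a warp, threads whose
    // predicate is false idle while the other branch executes ("idle processors").
    __global__ void oceanLandStep(const int *isOcean, float *cell, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (isOcean[i])
            cell[i] *= 0.9f;        // "ocean step" (placeholder rule)
        else
            cell[i] += 1.0f;        // "land step" (placeholder rule)
    }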
SIMD machines
- Bit level
  - ILLIAC IV (very early research machine, Illinois)
  - MPP (Burroughs), DAP (ICL), CM2 (Thinking Machines), MasPar
- Coarser grain
  - old: CM5: fat tree of SPARCs (distributed memory)
  - new: GPUs

GPU programming model
- The host memcpy-s data and launches kernels on threads (see the host-side sketch below)
- A GPU is a collection of Streaming Multiprocessors (SMs)
- The computation is organized as a grid (collection) of thread blocks
- Each SM has a shared memory and a large register file
- There is one global memory, and thread blocks can do vector reads and writes from / to it (coalescing)
[Diagram: host memory -- memcpy -- global memory; the device runs a grid of thread blocks]
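The "host memcpy-s data and launches kernels" bullet corresponds to the usual CUDA host-side pattern. The sketch below is a minimal example under assumptions: myKernel, the array names, and the block size are placeholders, not anything specified in the slides.

    // Minimal host-side sketch: copy data to global memory, launch a grid of
    // thread blocks, copy the result back. Kernel and variable names are invented.
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data, int n)       // placeholder kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;                    // trivial stand-in computation
    }

    void runOnGpu(float *host, int n)
    {
        float *dev;
        cudaMalloc(&dev, n * sizeof(float));                              // global memory
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice); // host memcpy-s data

        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;         // grid of thread blocks
        myKernel<<<blocks, threadsPerBlock>>>(dev, n);                    // launch kernel on threads

        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
    }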
GPU programming model
[Diagram: a grid of thread blocks; each thread block has its own shared memory; all blocks read and write one global memory]

questions...
- How do threads / thread-blocks get allocated on streaming multiprocessors?
- How do threads synchronize / communicate?
- How do threads disambiguate memory accesses: which thread reads / writes which memory location?
thread allocation / synchronization
- A thread block gets allocated on any streaming multiprocessor, and thread blocks are completely independent of each other, i.e. they cannot communicate with each other at all!!
  - pro: the computation can now run on any number of streaming multiprocessors
  - con: this makes programming a GPU harder
- Threads in one thread block can synchronize: the __syncthreads() command
[Diagram: thread blocks spread over the SMs, each SM with its own shared memory]

threads and memory access
- Each thread block has 2D (x,y) indices in the grid
- Each thread has 3D (p,q,r) indices in the block
- So each thread has its own identity (x,y,p,q,r)
- and can therefore decide which memory locations to access (responsibility of the programmer); see the sketch below
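A small CUDA sketch of these two slides, with invented kernel and array names: each thread derives its identity from blockIdx/threadIdx (CUDA allows up to 3D indices for both) and uses it to choose which global memory locations it touches, and __syncthreads() is a barrier for the threads of one block only; blocks never synchronize with each other.

    // Illustrative only: each thread picks its own locations from its identity,
    // stages data in per-block shared memory, and reverses the data within its block.
    // Launch as reverseEachBlock<<<numBlocks, BLOCK>>>(in, out, n).
    #define BLOCK 256

    __global__ void reverseEachBlock(const float *in, float *out, int n)
    {
        __shared__ float buf[BLOCK];               // per-block shared memory

        int tid  = threadIdx.x;                    // identity within the block
        int base = blockIdx.x * blockDim.x;        // identity of the block in the grid
        int i    = base + tid;

        if (i < n) buf[tid] = in[i];               // read "my" global memory location
        __syncthreads();                           // barrier for this block only

        int src = blockDim.x - 1 - tid;            // partner element within the block
        if (i < n && base + src < n)
            out[i] = buf[src];                     // write "my" output location
    }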
SIMD Vector machines
- Vector units added to the CPU perform the same instruction on all elements of a vector register
- Chaining avoids writing/reading memory (register-to-register op), e.g. V4 = (V1+V2)*V3 (see the sketch below)
- First vector machine: Cray1 (Seymour Cray)
- Vector is now used in e.g. SSE units / instructions in Pentiums
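The chained operation V4 = (V1+V2)*V3 is one fused element-wise step. Below is a rough data-parallel CUDA analogue (kernel and array names are invented): on a vector machine the intermediate V1+V2 stays in a vector register and is chained straight into the multiply, while here the intermediate lives in a per-thread register rather than going back to memory.

    // Illustrative element-wise analogue of the chained vector op V4 = (V1+V2)*V3.
    __global__ void chainedAddMul(const float *v1, const float *v2,
                                  const float *v3, float *v4, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float t = v1[i] + v2[i];   // intermediate stays in a register
            v4[i]   = t * v3[i];       // and is fed directly into the multiply
        }
    }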
MIMD
- Multiple processors, each executing its own code
- Distributed memory
  - Complete 'PE+Memory' nodes connected to a network
  - Some interconnect network topology
  - Cosmic cube, Ncube, Paragon, SP2
  - Extreme: Networks of Workstations, Clusters
  - Data centers: racks with busses
- Programming model: Message Passing
Shared memory MIMD
- Shared memory
  - Symmetric Multiprocessor: CPUs, bus, memory
  - Many vendors: SUN, HP, SGI
- Memory access
  - UMA: Uniform Memory Access
  - NUMA: Non-Uniform Memory Access
- Memory hierarchy
  - Potential for better performance
  - Problem: memory (cache) coherence

Parallel Random Access Machine
- PRAM: a theoretical model for // computing
- Idealized MIMD machine
  - Unbounded number of processors
  - Synchronous computation
  - Parallel construct: for all x in Y: statement
  - Shared memory (unbounded size)
  - All processors have constant access time to all memory locations
Order of evaluation
- for all i in 1..n, for all j in 1..n:
      S1: x[i,j] = expr[i,j]
      S2: ...
- n^2 processors execute all S1[i,j] first, then all S2
  - each parallel step takes one time step!!
- In S1, all expr[i,j] (RHSs) are evaluated first and then assigned to the x[i,j] (LHSs); see the sketch below

PRAM subclasses: concurrent R/W
- EREW: Exclusive Read, Exclusive Write
  - No concurrent read or write on one location
- CREW: Concurrent Read, Exclusive Write
- CRCW: Concurrent Read and Concurrent Write
  - CW: three categories
    - COMMON: all writes must be equal
    - ARBITRARY: one write succeeds
    - SUM: add all written quantities together
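The "all RHSs are read before any LHS is written" rule above is a property of the synchronous PRAM model, not of real asynchronous threads. A common way to emulate one such step is double buffering, as in the illustrative CUDA sketch below (kernel and array names are invented); each kernel launch then plays the role of one synchronous PRAM step.

    // Illustrative emulation of one synchronous PRAM step such as
    //     for all i in 2..n: A[i] = A[i-1]
    // All RHS reads come from 'in', all LHS writes go to 'out' (0-based indices).
    __global__ void shiftRight(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = (i == 0) ? in[0] : in[i - 1];   // every read sees the old values
    }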
Example: minimum of n numbers
- Algorithm 1: CRCW common
- Numbers in C[1] .. C[n]

      for all i in 1..n: M[i] = 0
      for all i in 1..n, for all j in 1..n:
          if C[i] > C[j]: M[i] = 1
      for all i in 1..n:
          if M[i] == 0: min = C[i]

- # processors required? (to execute parallel steps in one go)
- Time?
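A CUDA rendering of Algorithm 1 is sketched below, purely as an illustration (kernel and variable names are invented, indices are 0-based): one thread per (i,j) pair marks M, then one thread per i writes the minimum. It leans on the CRCW-COMMON assumption that concurrent writes are allowed because every racing writer stores the same value; on real hardware one would usually prefer atomics or a tree reduction.

    // Illustrative CRCW-COMMON style minimum (0-based); M must be zeroed first.
    __global__ void markLarger(const int *C, int *M, int n)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && j < n && C[i] > C[j])
            M[i] = 1;                  // all concurrent writers write the same value
    }

    __global__ void writeMin(const int *C, const int *M, int *minOut, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && M[i] == 0)
            *minOut = C[i];            // every surviving writer holds the minimum value
    }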
What if we cannot write concurrently (EREW)?
- Algorithm 2: using a complete binary tree
  - all internal nodes have 2 children, except for the rightmost in the lowest layer
  - numbers stored in the leaves of the tree
  - Where's my left/right child?
- Exercise: write an EREW program for this
- # processors required? Time?

Parallel Complexity
- Which algorithm is "better"?
- Parallel complexity: T.P
  - T: time complexity
  - P: max # processors
- Algorithm 1, Algorithm 2 ?
- A parallel algorithm is optimal if: O(parallel) = O(best sequential)
- Can you find a parallel algorithm that is optimal?
  - Hint: cluster leaves into groups

Parallel sort on CRCW (sum)
- Numbers in C[1] .. C[n]

      for all i in 1..n: M[i] = 0
      for all i in 1..n, for all j in 1..n:
          if (C[i] > C[j]) M[i] += 1
          if (C[i] == C[j] and i >= j) M[i] += 1
      for all i in 1..n: B[M[i]] = C[i]

- What is the parallel complexity?

Partial sums, or Parallel Prefix
- N numbers V1 to Vn stored in A[1] to A[n]
- Compute all partial sums (V1 + .. + Vk)

      d = 1
      do log(n) times
          for all i in 1..n:
              if (i - d) > 0: A[i] = A[i] + A[i-d]
          d *= 2

- Complexity? Optimality?

Pointer jumping example: list ranking
- Lisp style list of nodes, all pointing back to their parent (Prev). Root points at itself.
- Find the distance from the root for all nodes.

      forall nodes k:
          P(k) = Prev(k)
          if (P(k) != k) dist(k) = 1 else dist(k) = 0
          repeat m times:
              dist(k) += dist(P(k))
              P(k) = P(P(k))

- What is the value of m?
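Below is an illustrative CUDA version of one pointer-jumping step (all names are invented, indices are 0-based). Because GPU threads are not synchronous like PRAM processors, the sketch double-buffers P and dist so that every read sees the old values; the host swaps the buffers and launches the kernel m times. The initialization step (P(k) = Prev(k), dist(k) = 0 or 1) is analogous and omitted.

    // One pointer-jumping step for list ranking; old values in P/dist,
    // new values in Pnew/distNew. The root points at itself and keeps dist 0,
    // so adding dist[P[k]] is harmless once a node already points at the root.
    __global__ void jumpStep(const int *P, const int *dist,
                             int *Pnew, int *distNew, int n)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k < n) {
            distNew[k] = dist[k] + dist[P[k]];   // add my current parent's distance
            Pnew[k]    = P[P[k]];                // jump to my grandparent
        }
    }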