
CS575 Parallel Processing
Lecture 2: Parallel Computer Models
Wim Bohm
Colorado State University

Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.


Instruction and Data Streams

• John von Neumann: stored program architecture
  • One instruction stream, one data stream
• Michael Flynn: taxonomy of parallel computers
  • SIMD: Single Instruction, Multiple Data
    • data parallelism
  • MIMD: Multiple Instruction, Multiple Data
    • multiprocessing

SISD: sequential machines

• Limiting factors
  • Instruction execution rate
  • Memory latency
    • ~100x slower than instruction rate
• Coping with high memory latency
  • Caches (level 1, 2, 3)
  • Pre-fetching
  • Memory interleaving

How did instruction rate get so high?

• Higher clock rates
• Instruction Level Parallelism (ILP)
  • 'RISC' architectures
  • Pipelining
  • Multiple functional units
  • Reservation stations
  • Speculation / branch prediction


SIMD

• One control unit, many processing elements
  • All executing the same instruction
  • Local memories
• Conditional execution
  • Where ocean: perform the ocean step; where land: perform the land step
  • Gives rise to idle processors (optimized in GPUs); see the CUDA sketch below
• PE interconnection
  • Usually a 2D mesh
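To make the conditional-execution point concrete, here is a minimal CUDA sketch (the kernel, the array names, and the isOcean mask are illustrative assumptions, not from the slides): every thread evaluates the branch, and threads whose condition is false sit idle while the others execute their step.

// Hypothetical kernel: one thread per grid cell of an ocean/land model.
// Threads of the same warp that take different branches are serialized,
// so some processing elements sit idle during each branch.
__global__ void timeStep(const int *isOcean, float *cell, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (isOcean[i])
        cell[i] = cell[i] * 0.99f + 1.0f;   // "ocean step" (placeholder update)
    else
        cell[i] = cell[i] * 0.95f;          // "land step" (placeholder update)
}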

SIMD machines

• Bit level
  • ILLIAC IV (very early research machine, Illinois)
  • MPP (Burroughs), DAP (ICL)
  • CM2 (Thinking Machines), MasPar
• Coarser grain
  • old: CM5: fat tree of SPARCs (distributed memory)
  • new: GPUs
    • collection of Streaming Multiprocessors (SMs)
    • organized as a grid (collection) of thread blocks
    • each SM has a shared memory and a large register file
    • there is one global memory, and thread blocks can do vector reads and writes from / to it (coalescing)


GPU programming model

[Figure: the host memcpy-s data between host memory and GPU global memory, and launches kernels that execute as a grid of thread blocks.]
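A minimal CUDA sketch of this host-side pattern, assuming a trivial scale kernel and illustrative sizes: allocate device memory, memcpy the input into global memory, launch the kernel over a grid of thread blocks, and memcpy the result back.

#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread scales one element in global memory.
__global__ void scale(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // host memcpy-s data in

    int threads = 256;
    int blocks = (n + threads - 1) / threads;                     // grid of thread blocks
    scale<<<blocks, threads>>>(d, n);                             // host launches the kernel

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // host memcpy-s result out
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    delete[] h;
    return 0;
}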

GPU programming model

[Figure: a grid of thread blocks; each thread block has its own shared memory; all thread blocks read and write one global memory.]


questions...

• How do threads / thread blocks get allocated on streaming multiprocessors?
• How do threads synchronize / communicate?
• How do threads disambiguate memory accesses?
  • which thread reads / writes which memory location?

thread allocation / synchronization

• A thread block gets allocated on any streaming multiprocessor, and thread blocks are completely independent of each other, i.e. they cannot communicate with each other at all!!
  • pro: the computation can run on any number of streaming multiprocessors
  • con: this makes programming a GPU harder
• Threads in one thread block can synchronize
  • __syncthreads() command


threads and memory access

[Figure: each thread block in the grid has its own shared memory; all blocks share the global memory.]

• each thread block has 2D (x,y) indices in the grid
• each thread has 3D (p,q,r) indices in the block
• so each thread has its own identity (x,y,p,q,r)
• and can therefore decide which memory locations to access (responsibility of the programmer)
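A small CUDA sketch of both points, using a 1D grid and 1D blocks for brevity (the kernel and the tile-reversal example are illustrative assumptions): each thread combines blockIdx (its block's place in the grid) and threadIdx (its place in the block) into its own identity, uses that identity to choose which memory locations to touch, and cooperates with the other threads of its block through shared memory and __syncthreads().

// Each block reverses its own 256-element tile of the input, staging it in
// the block's shared memory. Assumes n is a multiple of the block size (256).
__global__ void reverseTiles(const float *in, float *out, int n)
{
    __shared__ float tile[256];                     // per-block shared memory

    int local  = threadIdx.x;                       // thread's index within the block
    int global = blockIdx.x * blockDim.x + local;   // thread's index within the grid

    tile[local] = in[global];                       // each thread loads one element
    __syncthreads();                                // wait until the whole tile is loaded

    out[global] = tile[blockDim.x - 1 - local];     // read a slot another thread wrote
}

A launch such as reverseTiles<<<n/256, 256>>>(in, out, n) creates one block per tile.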

SIMD Vector machines

• Vector units added to the CPU
  • perform the same instruction on all elements of a vector register
  • Chaining avoids writing / reading memory (register-to-register ops), e.g. V4 = (V1+V2)*V3
• First vector machine: Cray-1 (Seymour Cray)
• Vectors are now used in e.g. SSE units / instructions in Pentiums


MIMD

• Multiple processors
  • Each executing its own code
• Complete 'PE + Memory' nodes connected to a network
  • Some interconnect network topology
  • Cosmic Cube, nCUBE, Paragon, SP2
  • Extreme: Networks of Workstations, Clusters
  • Data centers: racks with busses
• Programming model: …

Shared memory MIMD

• Shared memory
  • Symmetric Multiprocessor (SMP): CPUs, caches, memory
  • Many vendors: SUN, HP, SGI
• Memory Access
  • UMA: Uniform Memory Access
  • NUMA: Non-Uniform Memory Access
    • Potential for better performance
    • Problem: memory (cache) coherence


Parallel Random Access Machine

• PRAM: a theoretical model for parallel computation
• Idealized MIMD machine
  • Unbounded number of processors
  • Synchronous computation
  • Parallel construct: for all x in Y: statement
  • Shared memory (unbounded size)
  • All processors have constant access time to all memory locations

Order of evaluation

for all i in 1..n, for all j in 1..n
    S1: x[i,j] = expr[i,j]
    S2: ...

• n^2 processors execute all S1[i,j] first, then all S2
  • In one time step!!
• In S1, all expr[i,j] (the RHSs) are evaluated first, and then assigned to the x[i,j] (the LHSs)
  • (see the sketch after the next slide)


PRAM Subclasses: concurrent R/W

• EREW: Exclusive Read, Exclusive Write
  • No concurrent read or write on one location
• CREW: Concurrent Read, Exclusive Write
• CRCW: Concurrent Read and Concurrent Write
  • CW: three categories
    • COMMON: all writes must be equal
    • ARBITRARY: one write succeeds
    • SUM: add all written quantities together
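The order-of-evaluation rule above (all RHSs are read before any LHS is written) is exactly what a naive CUDA translation of a PRAM forall loses: thread blocks are not synchronized with each other, so one thread may read an element another thread has already overwritten. A common fix, sketched below with illustrative names, is to read from one buffer and write to another, which restores the synchronous PRAM semantics for that step.

// One synchronous PRAM-style step, e.g.  for all i:  x[i] = x[i-1] + x[i].
// Reading and writing the same array would race across blocks, so the step
// reads old values from 'in' and writes new values to 'out'; the host swaps
// the two pointers between steps.
__global__ void step(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float left = (i > 0) ? in[i - 1] : 0.0f;   // RHS: read only old values
    out[i] = left + in[i];                     // LHS: write only the new buffer
}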

Example: minimum of n numbers

Algorithm 1: CRCW common
Numbers in C[1] .. C[n]

for all i in 1..n: M[i] = 0
for all i in 1..n, for all j in 1..n:
    if (C[i] > C[j]) M[i] = 1
for all i in 1..n:
    if (M[i] == 0) min = C[i]

• # processors required (to execute the parallel steps in one go)? Time?


What if we cannot write concurrently (EREW)?

• Algorithm 2: use a complete binary tree
  • all internal nodes have 2 children
    • except for the rightmost in the lowest layer
  • numbers stored in the leaves of the tree
• Where's my left / right child?
• Exercise: write an EREW program for this
  • # processors required? Time?
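For the binary-tree idea, a common CUDA rendition (not a full EREW program, and not the exercise's answer in general) is a reduction in shared memory: at each tree level, half of the remaining threads each take the minimum of two children, until the root holds the minimum. Kernel and array names are illustrative; this handles one block of up to 256 numbers.

#include <cfloat>   // FLT_MAX

// Minimum of up to 256 values via a binary tree in shared memory:
// each loop iteration is one tree level; 'stride' threads combine pairs.
__global__ void blockMin(const float *C, float *result, int n)
{
    __shared__ float node[256];
    int i = threadIdx.x;

    node[i] = (i < n) ? C[i] : FLT_MAX;          // leaves; pad missing ones with +inf
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (i < stride)                          // internal node: min of its two children
            node[i] = fminf(node[i], node[i + stride]);
        __syncthreads();                         // finish this level before the next
    }
    if (i == 0) *result = node[0];               // the root holds the minimum
}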

Parallel Complexity

• Which algorithm is "better"?
• Parallel complexity: T·P
  • T: parallel time (number of parallel steps)
  • P: max # processors
• Algorithm 1, Algorithm 2?
• A is optimal if: O(parallel T·P) = O(best sequential)
• Can you find a parallel algorithm that is optimal?
  • Hint: cluster leaves into groups


Parallel sort on CRCW (sum)

Numbers in C[1] .. C[n]

for all i in 1..n: M[i] = 0
for all i in 1..n, for all j in 1..n:
    if (C[i] > C[j]) M[i] += 1
    if (C[i] == C[j] and i >= j) M[i] += 1
for all i in 1..n: B[M[i]] = C[i]

• What is the parallel complexity?
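The SUM-style concurrent write in the ranking step maps naturally onto an atomic add in CUDA. A sketch with one thread per (i, j) pair and 0-based arrays (the two-kernel split, the names, and the launch shapes are illustrative; M must be zeroed, e.g. with cudaMemset, before the first kernel):

// Kernel 1: the concurrent "M[i] += 1" writes are combined by atomicAdd,
// playing the role of the CRCW SUM rule. M[i] becomes the 1-based rank of C[i].
__global__ void rank(const float *C, int *M, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n) return;

    if (C[i] > C[j] || (C[i] == C[j] && i >= j))
        atomicAdd(&M[i], 1);
}

// Kernel 2: scatter each element to its ranked position (B is 0-based here).
__global__ void scatter(const float *C, const int *M, float *B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) B[M[i] - 1] = C[i];
}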

Partial sums, or Parallel Prefix

N numbers V1 to Vn stored in A[1] to A[n]
Compute all partial sums (V1+..+Vk)

d = 1
do log(n) times
    for all i in 1..n:
        if (i-d) > 0: A[i] = A[i] + A[i-d]
    d *= 2

• Complexity? Optimality?
• (a CUDA sketch of this doubling step follows after the next slide)


example: list ranking

• Lisp-style list of nodes, all pointing back to their parent (Prev). Root points at itself.
• Find distance from root for all nodes.

forall nodes k:
    P(k) = Prev(k)
    if (P(k) != k) dist(k) = 1 else dist(k) = 0
    repeat m times:
        dist(k) += dist(P(k))
        P(k) = P(P(k))

• What is the value of m?
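A CUDA sketch of the parallel-prefix doubling step, with 0-based arrays and illustrative names. On a synchronous PRAM the in-place update A[i] = A[i] + A[i-d] is safe; in CUDA each step reads from one buffer and writes to the other (the same double-buffering point made earlier), and the host launches one kernel per doubling step.

#include <algorithm>   // std::swap

// One doubling step of the inclusive prefix sum:
// out[i] = in[i] + in[i-d]  if i-d >= 0, otherwise out[i] = in[i].
__global__ void scanStep(const float *in, float *out, int n, int d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = (i - d >= 0) ? in[i] + in[i - d] : in[i];
}

// Host side: ceil(log2(n)) steps, doubling d and swapping buffers each time.
float *prefixSums(float *a, float *b, int n)
{
    int threads = 256, blocks = (n + threads - 1) / threads;
    for (int d = 1; d < n; d *= 2) {
        scanStep<<<blocks, threads>>>(a, b, n, d);
        std::swap(a, b);    // the output of this step is the input of the next
    }
    return a;               // buffer that now holds all prefix sums
}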

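And a CUDA sketch of one pointer-jumping step for the list-ranking loop, again double-buffered (names are illustrative). The host launches it repeatedly, the slide's m times, swapping the old and new P/dist arrays between launches.

// One pointer-jumping step: every node adds its current parent's distance,
// then makes its grandparent its new parent (the pointer "jumps" ahead).
// The root points at itself with dist 0, so it is a fixed point.
__global__ void jumpStep(const int *P, const int *dist,
                         int *Pnew, int *distNew, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    distNew[k] = dist[k] + dist[P[k]];
    Pnew[k]    = P[P[k]];
}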