
Parallel algorithms for clusters (CGM model)

Parallel Machine Models

How to abstract parallel machines into a simplified model such that

• algorithm/application design is not hampered by too many details
• algorithms and code are portable
• calculated complexity predictions match the actually observed running times with sufficient accuracy

Parallel Machine Models

[Figure: one algorithm, designed and analyzed against the model with complexity O(f(n,p)); its code runs on different machines: PC cluster, SunFire, Cray T3E, SGI Origin 2000]

Parallel Machine Models

[Figure: two algorithms with complexities O(f1(n,p)) and O(f2(n,p)); their codes yield running times T1 and T2, respectively, on each machine (PC cluster, SunFire, Cray T3E, SGI Origin 2000)]

Coarse Grained Multicomputer

• Coarse grained computation
• Coarse grained communication
• Coarse grained memory (lower bounds on memory size)

[Plot: speedup S(p) versus p with the ideal line S(p) = p; CGM algorithms target the coarse grained range, classical fine grained algorithms the range up to p = n]

Coarse Grained Memory

• ignore small n/p
• lower bound on local proc. mem. n/p, e.g. n/p > p

[Figure: p processors, each with local memory of size n/p, connected through a network switch]

Coarse Grained Computation

Compute in supersteps with barrier synchronization (as in BSP).

[Figure: p processors computing in synchronized supersteps]
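In MPI terms, such a superstep structure can be organized as a loop over rounds. The sketch below is illustrative only and not taken from the lecture; local_compute and cgm_supersteps are placeholder names. Each round performs local computation on the n/p local items, then one communication step, then a barrier.

/* Minimal sketch of a CGM/BSP superstep loop (illustrative; assumes MPI). */
#include <mpi.h>

/* placeholder for the per-round work on the n/p local data items */
static void local_compute(double *data, int local_n, int round) {
    (void)data; (void)local_n; (void)round;
}

void cgm_supersteps(double *data, int local_n, int rounds, MPI_Comm comm) {
    for (int r = 0; r < rounds; r++) {
        local_compute(data, local_n, r);   /* 1. local computation         */
        /* 2. one communication step (e.g. an MPI_Alltoallv) would go here */
        MPI_Barrier(comm);                 /* 3. barrier synchronization   */
    }
}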

Coarse Grained Communication

• All comm. steps are h-relations, h = O(n/p)
• No individual messages

[Figure: three communication rounds (round 1, round 2, round 3), each an h-relation among the p processors]

h-Relation (all-to-all personalized communication)

[Figure: each of the p processors sends out at most h = n/p data items and receives in at most h = n/p data items through the network switch]
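In MPI terms, an h-relation corresponds to one all-to-all personalized exchange. The sketch below is my illustration (h_relation and the int payload are assumptions, not from the lecture): the per-destination counts are exchanged first with MPI_Alltoall, then the data moves in a single MPI_Alltoallv.

/* Sketch: realizing one h-relation with MPI_Alltoallv (illustrative). */
#include <mpi.h>
#include <stdlib.h>

/* Exchange the items in sendbuf, partitioned by sendcounts[i] per target
   processor i, and return the received items; h should be O(n/p). */
int *h_relation(const int *sendbuf, const int *sendcounts,
                int *recvtotal, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);

    int *recvcounts = malloc(p * sizeof(int));
    int *sdispl = malloc(p * sizeof(int));
    int *rdispl = malloc(p * sizeof(int));

    /* every processor first learns how much it will receive from whom */
    MPI_Alltoall((void *)sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    sdispl[0] = rdispl[0] = 0;
    for (int i = 1; i < p; i++) {
        sdispl[i] = sdispl[i - 1] + sendcounts[i - 1];
        rdispl[i] = rdispl[i - 1] + recvcounts[i - 1];
    }
    *recvtotal = rdispl[p - 1] + recvcounts[p - 1];

    int *recvbuf = malloc((*recvtotal > 0 ? *recvtotal : 1) * sizeof(int));

    /* the h-relation itself: one all-to-all personalized communication */
    MPI_Alltoallv((void *)sendbuf, (int *)sendcounts, sdispl, MPI_INT,
                  recvbuf, recvcounts, rdispl, MPI_INT, comm);

    free(recvcounts); free(sdispl); free(rdispl);
    return recvbuf;
}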

h-Relation Performance

[Plot: measured h-relation performance]

CGM Performance Prediction

Performance Measures:

• number of rounds, r
• local computation, t
• communication volume, C ≤ r n
• scalability, λ (lower bound: n/p > p^(1/λ))

Number of Rounds, r

Important performance parameter:
• goal: r = O(1), O(log p), …
• independent of n, slowly growing with p
• as n increases, the number of messages remains unchanged and only the message length increases
• improved latency hiding
• amortizes overheads

Local Computation, t

• goal: t = Ts(n) / p, where Ts(n) is the sequential running time

• typical: t = r Ts(n) / p, optimal if r = O(1)

Communication Volume, C

• typical: C = r n

• can sometimes be improved via h-relations with h < n/p

Scalability, λ

• lower bound on n/p

• DEF.: n/p > p^(1/λ) implies scalability λ

• λ = 1/2 : n/p > p^2
• λ = 1 : n/p > p
• λ = 2 : n/p > p^(1/2)

CGM Alg. Performance

• practical parallel algorithms
• efficient & portable

[Plot: speedup S(p) versus p with the ideal line S(p) = p; CGM algorithms target the coarse grained range, classical fine grained algorithms the range up to p = n]

CGM Algorithms for Parallel Sorting

Det. Sample Sort

[Figure: p processors, each holding n/p data items in local memory, connected by the communication network]

Det. Sample Sort

CGM Algorithm:
• sort locally and create p-sample

[Figure: each processor's sorted local data with its local p-sample]

Det. Sample Sort

CGM Algorithm:
• send all p-samples to processor 1

[Figure: all local p-samples are sent to processor 1]

Det. Sample Sort

CGM Algorithm:
• proc. 1: sort all received samples and compute global p-sample

[Figure: processor 1 sorts the p^2 received samples and forms the global p-sample]

Det. Sample Sort

CGM Algorithm:
• broadcast global p-sample
• bucket locally according to global p-sample
• send bucket i to proc. i
• resort locally

[Figure: data redistributed to the processors according to the global p-sample]
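Put together, the steps on the preceding slides can be sketched in MPI roughly as follows (my illustration, not code from the lecture): it collects the samples on processor 0 rather than processor 1, uses p-1 global splitters, assumes n is divisible by p, and leaves out the array-balancing post-processing described below; cgm_sample_sort and cmp_int are names I chose.

/* Sketch of deterministic sample sort in the CGM model, using MPI.
   Assumes n divisible by p and n/p >= p^2; error handling omitted. */
#include <mpi.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts the local_n = n/p items in data; returns a freshly allocated array
   with the items this processor holds after redistribution (size *out_n,
   at most 2*n/p by the lemma below; no final balancing here). */
int *cgm_sample_sort(int *data, int local_n, int *out_n, MPI_Comm comm) {
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);

    /* step 1: sort locally and create p-sample (every (n/p^2)-th item) */
    qsort(data, local_n, sizeof(int), cmp_int);
    int *sample = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++)
        sample[i] = data[(long)i * local_n / p];

    /* step 2: send all p-samples to one processor (rank 0 here) */
    int *all_samples = NULL;
    if (rank == 0) all_samples = malloc((size_t)p * p * sizeof(int));
    MPI_Gather(sample, p, MPI_INT, all_samples, p, MPI_INT, 0, comm);

    /* step 3: rank 0 sorts the p^2 samples and picks the global p-sample */
    int *splitters = malloc(p * sizeof(int));
    if (rank == 0) {
        qsort(all_samples, (size_t)p * p, sizeof(int), cmp_int);
        for (int i = 1; i < p; i++)
            splitters[i - 1] = all_samples[i * p];
        free(all_samples);
    }

    /* step 4: broadcast global p-sample */
    MPI_Bcast(splitters, p - 1, MPI_INT, 0, comm);

    /* step 5: bucket the sorted local data, bucket i -> processor i */
    int *sendcounts = calloc(p, sizeof(int));
    int *sdispl = malloc(p * sizeof(int));
    for (int i = 0, b = 0; i < local_n; i++) {
        while (b < p - 1 && data[i] > splitters[b]) b++;
        sendcounts[b]++;
    }
    sdispl[0] = 0;
    for (int i = 1; i < p; i++) sdispl[i] = sdispl[i - 1] + sendcounts[i - 1];

    /* step 6: one h-relation: exchange counts, then the data */
    int *recvcounts = malloc(p * sizeof(int));
    int *rdispl = malloc(p * sizeof(int));
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);
    rdispl[0] = 0;
    for (int i = 1; i < p; i++) rdispl[i] = rdispl[i - 1] + recvcounts[i - 1];
    *out_n = rdispl[p - 1] + recvcounts[p - 1];

    int *recv = malloc((*out_n > 0 ? *out_n : 1) * sizeof(int));
    MPI_Alltoallv(data, sendcounts, sdispl, MPI_INT,
                  recv, recvcounts, rdispl, MPI_INT, comm);

    /* step 7: resort locally */
    qsort(recv, *out_n, sizeof(int), cmp_int);

    free(sample); free(splitters);
    free(sendcounts); free(sdispl); free(recvcounts); free(rdispl);
    return recv;
}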

Det. Sample Sort

Lemma: Each proc. receives at most 2 n/p data items.

(Proof idea: each local p-sample splits its processor's sorted data into blocks of n/p^2 items; between two consecutive global sample elements there are at most p of the p^2 local samples, so a bucket collects at most p complete blocks in total plus at most one partial block per processor, i.e. at most 2p · n/p^2 = 2 n/p items.)

[Figure: blocks of n/p^2 items per processor, delimited by the local samples, shown against the global sample]

Det. Sample Sort

[Figure: bucket sizes across processors 1 to 8 before and after balancing to exactly n/p items each]

Post-Processing: "Array Balancing"
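One way to realize this balancing step (a hedged sketch under my own assumptions, not necessarily the scheme used in the lecture): a parallel prefix sum over the local counts gives each item its global rank, and the item with global rank g is sent to processor g / (n/p) in one further all-to-all. Because the pieces arrive in processor order, each processor's n/p received items remain sorted; array_balance is a name I chose.

/* Hedged sketch of array balancing via prefix sum + one all-to-all. */
#include <mpi.h>
#include <stdlib.h>

/* data: locally sorted items after redistribution (my_n of them, at most
   2*n/p); np = n/p. Returns exactly np items, global order preserved. */
int *array_balance(const int *data, int my_n, int np, MPI_Comm comm) {
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);

    /* global rank of my first item = sum of my_n over lower-ranked procs */
    long my_n_l = my_n, offset = 0;
    MPI_Exscan(&my_n_l, &offset, 1, MPI_LONG, MPI_SUM, comm);
    if (rank == 0) offset = 0;   /* MPI_Exscan leaves rank 0 undefined */

    /* item with global rank g belongs on processor g / (n/p) */
    int *sendcounts = calloc(p, sizeof(int));
    for (int i = 0; i < my_n; i++)
        sendcounts[(offset + i) / np]++;

    int *recvcounts = malloc(p * sizeof(int));
    int *sdispl = malloc(p * sizeof(int));
    int *rdispl = malloc(p * sizeof(int));
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);
    sdispl[0] = rdispl[0] = 0;
    for (int i = 1; i < p; i++) {
        sdispl[i] = sdispl[i - 1] + sendcounts[i - 1];
        rdispl[i] = rdispl[i - 1] + recvcounts[i - 1];
    }

    int *out = malloc(np * sizeof(int));   /* exactly n/p items arrive */
    MPI_Alltoallv((void *)data, sendcounts, sdispl, MPI_INT,
                  out, recvcounts, rdispl, MPI_INT, comm);

    free(sendcounts); free(recvcounts); free(sdispl); free(rdispl);
    return out;
}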

Det. Sample Sort

• 5 MPI_AlltoAllv for n/p > p^2
• O(n/p log n) local comp.

Performance: Det. Sample Sort