Parallel algorithms for clusters (CGM model)

Parallel Machine Models
How to abstract parallel machines into a simplified model such that
• algorithm/application design is not hampered by too many details,
• algorithms and code are portable, and
• calculated time complexity predictions match the actually observed running times with sufficient accuracy?

Parallel Machine Models
[Figure: one model layer between algorithm and machines: an algorithm is analyzed once in the model, with predicted complexity O(f(n,p)); its code then runs on the different targets (PC cluster, SunFire, Cray T3E, SGI Origin 2000).]

Parallel Machine Models
[Figure: two algorithms with predicted complexities O(f1(n,p)) and O(f2(n,p)); the model's predicted running times T1 and T2 carry over to each target machine (PC cluster, SunFire, Cray T3E, SGI Origin 2000).]

Coarse Grained Multicomputer
CGM:
• coarse grained computation
• coarse grained communication
• coarse grained memory (lower bounds on memory size)

[Figure: speedup S(p) versus p, comparing CGM with the classical speedup curve; the x-axis extends to p = n.]

Coarse Grained Memory
• ignore small n/p
• lower bound on local proc. mem., e.g. n/p > p

[Figure: p processors, each with local memory n/p, connected through a network/switch; n data items in total.]

Coarse Grained Computation
Compute in supersteps with barrier synchronization (as in BSP).
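A minimal sketch of this superstep structure, with Python threads standing in for processors (all names here are illustrative, not from the source):

```python
import threading

P = 4                                   # simulated processors
chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
barrier = threading.Barrier(P)
results = [0] * P

def worker(rank):
    # Superstep 1: local computation on this processor's chunk.
    results[rank] = sum(chunks[rank])
    barrier.wait()                      # barrier synchronization: superstep ends
    # Superstep 2: all superstep-1 results are now visible everywhere.
    total = sum(results)
    barrier.wait()                      # second barrier: all reads finish before writes
    results[rank] = total               # every processor now holds the global sum

threads = [threading.Thread(target=worker, args=(r,)) for r in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Two barriers per exchange keep the reads and the writes in separate supersteps, mirroring the BSP rule that communicated values become visible only at the next superstep.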
[Figure: processors P1 … Pp executing supersteps separated by barriers.]

Coarse Grained Communication
• All comm. steps are h-relations, h = O(n/p)
• No individual messages

[Figure: three communication rounds; each round is one h-relation among the processors.]

H-Relation (all-to-all personalized communication)
[Figure: an h-relation through the network/switch: each processor sends (out) and receives (in) at most h data; local memory n/p per processor.]

h-Relation Performance
[Figure: measured h-relation performance.]

CGM Performance Prediction
Performance Measures:
• number of rounds, r
• local computation, t
• communication volume, C < r n
• scalability, λ (lower bound: n/p > p^(1/λ))

Number of Rounds, r
• most important performance parameter
• goal: r = O(1), O(log p), …
• independent of n, slowly growing with p
• as n increases, the number of messages remains unchanged and only the message length increases
• improved latency hiding
• amortizes overheads

Local Computation, t
• goal: t = Ts(n) / p
• typical: t = r Ts(n) / p; optimal if r = O(1)

Communication Volume, C
• typical: C = r n
• can sometimes be improved via h-relations with h < n/p

Scalability, λ
• lower bound on n/p
• DEF.: n/p > p^(1/λ) implies scalability λ
• λ = 1/2: n/p > p^2
• λ = 1: n/p > p
• λ = 2: n/p > p^(1/2)

CGM Alg. Performance
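The three example values above follow directly from the definition; a tiny illustrative sketch (the function name is not from the source):

```python
def min_local_memory(p, lam):
    # Scalability lam requires local memory n/p > p**(1/lam).
    return p ** (1.0 / lam)

# Smaller lam is more restrictive:
# min_local_memory(16, 0.5) -> 256.0   (n/p > p^2)
# min_local_memory(16, 1)   -> 16.0    (n/p > p)
# min_local_memory(16, 2)   -> 4.0     (n/p > p^(1/2))
```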
• practical parallel algorithms
• efficient & portable

[Figure: speedup S(p) versus p, comparing CGM with the classical speedup curve up to p = n.]

CGM Algorithms for Parallel Sorting

Det. Sample Sort
[Figure: starting point: n data items distributed over p processors, n/p items each.]

Det. Sample Sort
CGM Algorithm:
• sort locally and create a p-sample

[Figure: each processor now holds its locally sorted data plus a p-sample.]

Det. Sample Sort
CGM Algorithm:
• send all p-samples to processor 1

[Figure: all p-samples collected at processor 1.]

Det. Sample Sort
CGM Algorithm:
• proc. 1: sort all received samples and compute the global p-sample

[Figure: processor 1 holds the global p-sample.]

Det. Sample Sort
CGM Algorithm:
• broadcast the global p-sample
• bucket locally according to the global p-sample
• send bucket i to proc. i
• resort locally

[Figure: buckets are exchanged so that processor i ends up with bucket i.]

Det. Sample Sort
Lemma: Each proc. receives at most 2 n/p data items
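The whole algorithm can be simulated sequentially to illustrate the steps and the lemma's 2n/p bound; this is a sketch under the coarse-grained assumption n/p > p^2, with lists standing in for processors (all names are illustrative):

```python
import random
from bisect import bisect_right

random.seed(0)
p = 4
n = 4 * p ** 3                      # n/p = 4p^2 > p^2 (coarse grained)
data = [random.randrange(10 * n) for _ in range(n)]

# Step 1: distribute n/p items per processor and sort locally.
m = n // p
procs = [sorted(data[i * m:(i + 1) * m]) for i in range(p)]

# Step 1b: each processor picks a p-sample (every (n/p^2)-th element).
step = m // p
samples = [proc[step - 1::step] for proc in procs]

# Steps 2-3: "processor 1" gathers and sorts all p^2 samples,
# then keeps every p-th as the global p-sample (p - 1 splitters).
all_samples = sorted(s for ps in samples for s in ps)
global_sample = all_samples[p - 1::p][:p - 1]

# Steps 4-6: bucket locally by the global sample; bucket i goes to processor i.
buckets = [[] for _ in range(p)]
for proc in procs:
    for x in proc:
        buckets[bisect_right(global_sample, x)].append(x)

# Step 7: resort locally.
result = [sorted(b) for b in buckets]
```

Concatenating the buckets yields the globally sorted sequence, and no processor receives more than 2n/p items, matching the lemma.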
[Figure: proof illustration with p = 8 processors holding n/p items each: between consecutive global sample elements, each processor contributes at most n/p^2 items.]

Post-Processing: “Array Balancing”

Det. Sample Sort
• 5 MPI_AlltoAllv rounds, for n/p > p^2
• O(n/p log n) local computation

[Figure: final situation: each processor holds its sorted bucket of at most 2n/p items.]

Performance: Det. Sample Sort