Parallel algorithms for clusters (CGM model)

Parallel Machine Models
How to abstract parallel machines into a simplified model such that
• algorithm/application design is not hampered by too many details,
• algorithms and code are portable, and
• calculated time complexity predictions match the actually observed running times with sufficient accuracy?

Parallel Machine Models
[Figure: one model layer between algorithm and machines: an algorithm is analyzed once in the model, with predicted complexity O(f(n,p)); its code then runs on the different targets (PC cluster, SunFire, Cray T3E, SGI Origin 2000).]

Parallel Machine Models
[Figure: two algorithms with predicted complexities O(f1(n,p)) and O(f2(n,p)); the model's predicted running times T1 and T2 carry over to each target machine (PC cluster, SunFire, Cray T3E, SGI Origin 2000).]

Coarse Grained Multicomputer
CGM:
• coarse grained computation
• coarse grained communication
• coarse grained memory (lower bounds on memory size)

[Figure: speedup S(p) versus p, comparing CGM with the classical speedup curve; the x-axis extends to p = n.]

Coarse Grained Memory
• ignore small n/p
• lower bound on local proc. mem., e.g. n/p > p

[Figure: p processors, each with local memory n/p, connected through a network/switch; n data items in total.]

Coarse Grained Computation
Compute in supersteps with barrier synchronization (as in BSP).
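A minimal sketch of this superstep structure, with Python threads standing in for processors (all names here are illustrative, not from the source):

```python
import threading

P = 4                                   # simulated processors
chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
barrier = threading.Barrier(P)
results = [0] * P

def worker(rank):
    # Superstep 1: local computation on this processor's chunk.
    results[rank] = sum(chunks[rank])
    barrier.wait()                      # barrier synchronization: superstep ends
    # Superstep 2: all superstep-1 results are now visible everywhere.
    total = sum(results)
    barrier.wait()                      # second barrier: all reads finish before writes
    results[rank] = total               # every processor now holds the global sum

threads = [threading.Thread(target=worker, args=(r,)) for r in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Two barriers per exchange keep the reads and the writes in separate supersteps, mirroring the BSP rule that communicated values become visible only at the next superstep.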
[Figure: processors P1 … Pp executing supersteps separated by barriers.]

Coarse Grained Communication
• All comm. steps are h-relations, h = O(n/p)
• No individual messages

[Figure: three communication rounds; each round is one h-relation among the processors.]

H-Relation (all-to-all personalized communication)
[Figure: an h-relation through the network/switch: each processor sends (out) and receives (in) at most h data; local memory n/p per processor.]

h-Relation Performance
[Figure: measured h-relation performance.]

CGM Performance Prediction
Performance Measures:
• number of rounds, r
• local computation, t
• communication volume, C < r n
• scalability, λ (lower bound: n/p > p^(1/λ))

Number of Rounds, r
• most important performance parameter
• goal: r = O(1), O(log p), …
• independent of n, slowly growing with p
• as n increases, the number of messages remains unchanged and only the message length increases
• improved latency hiding
• amortizes overheads

Local Computation, t
• goal: t = Ts(n) / p
• typical: t = r Ts(n) / p; optimal if r = O(1)

Communication Volume, C
• typical: C = r n
• can sometimes be improved via h-relations with h < n/p

Scalability, λ
• lower bound on n/p
• DEF.: n/p > p^(1/λ) implies scalability λ
• λ = 1/2: n/p > p^2
• λ = 1: n/p > p
• λ = 2: n/p > p^(1/2)

CGM Alg. Performance
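The three example values above follow directly from the definition; a tiny illustrative sketch (the function name is not from the source):

```python
def min_local_memory(p, lam):
    # Scalability lam requires local memory n/p > p**(1/lam).
    return p ** (1.0 / lam)

# Smaller lam is more restrictive:
# min_local_memory(16, 0.5) -> 256.0   (n/p > p^2)
# min_local_memory(16, 1)   -> 16.0    (n/p > p)
# min_local_memory(16, 2)   -> 4.0     (n/p > p^(1/2))
```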
• practical parallel algorithms
• efficient & portable

[Figure: speedup S(p) versus p, comparing CGM with the classical speedup curve up to p = n.]

CGM Algorithms for Parallel Sorting

Det. Sample Sort
[Figure: starting point: n data items distributed over p processors, n/p items each.]

Det. Sample Sort
CGM Algorithm:
• sort locally and create a p-sample

[Figure: each processor now holds its locally sorted data plus a p-sample.]

Det. Sample Sort
CGM Algorithm:
• send all p-samples to processor 1

[Figure: all p-samples collected at processor 1.]

Det. Sample Sort
CGM Algorithm:
• proc. 1: sort all received samples and compute the global p-sample

[Figure: processor 1 holds the global p-sample.]

Det. Sample Sort
CGM Algorithm:
• broadcast the global p-sample
• bucket locally according to the global p-sample
• send bucket i to proc. i
• resort locally

[Figure: buckets are exchanged so that processor i ends up with bucket i.]

Det. Sample Sort
Lemma: Each proc. receives at most 2 n/p data items
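The whole algorithm can be simulated sequentially to illustrate the steps and the lemma's 2n/p bound; this is a sketch under the coarse-grained assumption n/p > p^2, with lists standing in for processors (all names are illustrative):

```python
import random
from bisect import bisect_right

random.seed(0)
p = 4
n = 4 * p ** 3                      # n/p = 4p^2 > p^2 (coarse grained)
data = [random.randrange(10 * n) for _ in range(n)]

# Step 1: distribute n/p items per processor and sort locally.
m = n // p
procs = [sorted(data[i * m:(i + 1) * m]) for i in range(p)]

# Step 1b: each processor picks a p-sample (every (n/p^2)-th element).
step = m // p
samples = [proc[step - 1::step] for proc in procs]

# Steps 2-3: "processor 1" gathers and sorts all p^2 samples,
# then keeps every p-th as the global p-sample (p - 1 splitters).
all_samples = sorted(s for ps in samples for s in ps)
global_sample = all_samples[p - 1::p][:p - 1]

# Steps 4-6: bucket locally by the global sample; bucket i goes to processor i.
buckets = [[] for _ in range(p)]
for proc in procs:
    for x in proc:
        buckets[bisect_right(global_sample, x)].append(x)

# Step 7: resort locally.
result = [sorted(b) for b in buckets]
```

Concatenating the buckets yields the globally sorted sequence, and no processor receives more than 2n/p items, matching the lemma.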
[Figure: proof illustration with p = 8 processors holding n/p items each: between consecutive global sample elements, each processor contributes at most n/p^2 items.]

Post-Processing: “Array Balancing”

Det. Sample Sort
• 5 MPI_AlltoAllv rounds, for n/p > p^2
• O(n/p log n) local computation

[Figure: final situation: each processor holds its sorted bucket of at most 2n/p items.]

Performance: Det. Sample Sort