Procs. ZEUS Workshop on Parallel Programming and Computation, Linköping, Sweden

Portability versus Efficiency:

Parallel Applications on PVM and Parix

Alexander Reinefeld                        Volker Schnecke

Center for Parallel Computing              Mathematik/Informatik
Universität Paderborn                      Universität Osnabrück
ar@uni-paderborn.de                        volker@informatik.uni-osnabrueck.de

Abstract

Analogous to the shift from assembler language programming to third-generation languages in the early years of computer science, we are currently witnessing a paradigm change towards the use of portable programming models in parallel high-performance computing. As before, the use of a high-level programming environment must be paid for by reduced system performance.

But how much does portability cost in practice? Is it worth paying that price? What effect has the choice of the programming model on the architecture of the algorithm?

In this paper we attempt to answer these questions by comparing two applications from the domain of combinatorial optimization that have been implemented with the Parix and PVM programming models. Performance benchmarks have been run on three different systems: a massively parallel system with relatively slow T805 transputer processors, a moderately parallel Parsytec GC/PowerPlus system with powerful PowerPC processors, and a UNIX workstation cluster connected by an Ethernet LAN. While the Parix implementations clearly turned out to be fastest, PVM gives portability at the cost of a small, acceptable loss in performance.

Introduction

Contemporary parallel computing systems are strikingly diverse in their architectures and operating systems. A variety of interconnection networks can be found in today's most successful commercial machines: hypercube (nCube), 2D grid (Paragon, Parsytec GC), ring/crossbar combination (Convex SPP), multistage network (IBM SP2), fat tree (CM-5), hierarchical rings (KSR), and 3D torus (Cray T3D). All of them have their specific advantages, and it is still an open question which topologies will survive in the future.

Unfortunately, the programming interfaces are also very dissimilar, making it difficult to port a given code from one hardware platform to another. As it seems, portability can only be achieved by the use of vendor-independent programming environments like PVM, MPI, PARMACS, P4, Zipcode, Express, and Linda. An overview of programming environments can be found in the literature.

Portability and scalability are the keywords of this paper. Ideally, a parallel implementation would meet both properties, but in practice interdependencies make this impossible. We compared the performance of two application programs running on Parix (PARallel extensions to UnIX) and PVM. Three different hardware platforms have been used in our experiments: a 1024-node transputer system, a Parsytec GC/PowerPlus with PowerPC processors, and a UNIX workstation cluster of SUN SparcStations. Our experiments give answers to the following questions: Is there any impact of the choice of the programming model on the architecture of the application program? How fast is PVM as compared to Parix? To what extent are PVM programs portable and scalable?

Applications

As an application problem domain we have chosen the class of discrete optimization problems, which can be defined in terms of finding a solution path in a tree or graph from an initial root state to a goal node. Many applications in planning and scheduling can be formulated with this model: the cutting stock problem, the bin packing problem, vehicle routing, VLSI floorplan optimization, satisfiability problems, the traveling salesman problem, the N×N puzzle, and partial constraint satisfaction problems. All of them are known to be NP-hard.

For our benchmarks we have chosen two typical applications with different solution techniques: the VLSI floorplan optimization problem, which is solved with a depth-first branch-and-bound strategy, and the N×N puzzle, which is solved with an iterative-deepening search. Both are based on depth-first search (DFS), which traverses the decision tree in a top-down manner from left to right. DFS is commonly used in combinatorial optimization problems because it needs only O(d·w) storage space in trees of depth d and width w: at any time, only the siblings of the nodes on the current search path must be kept in memory.

VLSI Floorplan Optimization

The floorplan area optimization is a stage in the design of VLSI chips. Here, the relative placements and areas of the building blocks of a chip are known, but their exact dimensions can still be varied over a wide range. A floorplan is represented by two dual polar graphs G = (V, E) and H = (W, F) and a list of potential implementations for each block. As shown in Figure 1, the vertices in V and W represent the vertical and horizontal line segments of the floorplan. There exists an edge e = (v_i, v_j) in the graph G if there is a block in the floorplan whose left and right edges lie on the corresponding vertical line segments. For a specific configuration, i.e. a floorplan with exact block sizes, the edges are weighted with the dimensions of the blocks in this configuration. The solution of the floorplan optimization problem is a configuration with minimum layout area, given by the product of the longest paths in the graphs G and H.

[Figure 1: A floorplan with blocks B1, ..., B4 and the dual polar graphs G and H.]
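
Since the two polar graphs are acyclic, the longest paths can be computed in linear time by relaxing edges in topological order. The following C sketch illustrates this evaluation step for one configuration; the array-based graph encoding and all identifiers are our assumptions, not the paper's code.

    #define MAXV   64   /* max. vertices per polar graph (assumed limit) */
    #define MAXDEG  8   /* max. out-degree (assumed limit)               */

    /* Longest source-to-sink path in a DAG whose vertices 0..nv-1 are
     * already numbered in topological order.  adj[u][i] is the target of
     * the i-th outgoing edge of u; w[u][i] is its weight, i.e. the width
     * (graph G) resp. height (graph H) of the corresponding block.  The
     * layout area of a configuration is longest_path(G) * longest_path(H). */
    int longest_path(int nv, const int deg[], const int adj[][MAXDEG],
                     const int w[][MAXDEG])
    {
        int dist[MAXV] = { 0 };
        for (int u = 0; u < nv; u++)                /* topological order */
            for (int i = 0; i < deg[u]; i++) {
                int v = adj[u][i];
                if (dist[u] + w[u][i] > dist[v])
                    dist[v] = dist[u] + w[u][i];    /* relax edge u -> v */
            }
        return dist[nv - 1];                 /* sink is the last vertex */
    }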

Our implementation builds a tree where the leaves are complete configurations and the inner nodes at depth d represent partial floorplans consisting of the blocks B_1, ..., B_d. The depth-first branch-and-bound (DFBB) solution algorithm employs a heuristic cost function to eliminate unnecessary parts of the search space that are known not to contain an optimal solution. When a new, possibly non-optimal solution has been found, the search continues with the improved cost bound, now pruning all subtrees with cost estimates higher than the new cost bound. Newly established cost bounds are broadcast so that all processors share the best available bound at any stage of the search.
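
To make the pruning rule concrete, the following C fragment sketches the sequential core of such a DFBB search. It is a minimal illustration under assumed types and helpers (Node, lower_bound, expand, ...), not the authors' implementation; in the parallel version, best_cost is additionally updated by incoming broadcasts.

    #include <limits.h>

    #define MAX_CHILDREN 8                 /* assumed branching limit     */
    typedef struct {
        int depth;                         /* blocks B1..Bdepth placed    */
        /* ... further configuration data (assumed) ... */
    } Node;

    extern int  lower_bound(const Node *); /* heuristic cost estimate     */
    extern int  cost(const Node *);        /* area of a complete leaf     */
    extern int  is_leaf(const Node *);
    extern int  expand(const Node *, Node *); /* returns no. of children  */
    extern void record_solution(const Node *);

    static int best_cost = INT_MAX;        /* best complete solution yet  */

    /* Depth-first branch-and-bound: prune every subtree whose estimate
     * already reaches the best known bound. */
    void dfbb(const Node *n)
    {
        if (lower_bound(n) >= best_cost)
            return;                        /* cannot improve: prune       */
        if (is_leaf(n)) {
            if (cost(n) < best_cost) {
                best_cost = cost(n);       /* improved bound; the parallel */
                record_solution(n);        /* code broadcasts it here      */
            }
            return;
        }
        Node child[MAX_CHILDREN];
        int k = expand(n, child);
        for (int i = 0; i < k; i++)
            dfbb(&child[i]);
    }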

The N×N Puzzle

Applications with bad initial cost bounds and many alternatives in the decision trees are solved with heuristic search algorithms like A* and IDA*. One such example is the N×N puzzle. Here, a given initial configuration of N²−1 tiles in a squared tray of size N×N must be rearranged with the fewest number of moves into a given goal configuration. No effective upper cost bounds are known for this problem, and hence it cannot be solved with DFBB. Even the relatively small 15-puzzle (N = 4) has a search space of 16!/2 ≈ 10^13 states. While it would seem easy to obtain just any solution, finding an optimal (shortest) solution is NP-complete. In general, it takes some hundred million node expansions to solve a random problem instance of the 15-puzzle using the Manhattan distance (the sum of the minimum displacements of each tile from its goal position) as a heuristic estimate function.
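
For illustration, the Manhattan distance of a position can be computed as follows; the board encoding (tile t's goal square is square t, 0 denotes the blank) is our assumption.

    #include <stdlib.h>   /* abs() */

    /* Manhattan distance for the N x N puzzle: sum of the grid distances
     * of all tiles from their goal squares; the blank (0) is skipped.
     * board[sq] holds the tile on square sq; in the goal position, tile t
     * sits on square t (assumed encoding). */
    int manhattan(const int board[], int n)
    {
        int h = 0;
        for (int sq = 0; sq < n * n; sq++) {
            int t = board[sq];
            if (t == 0)
                continue;                     /* don't count the blank */
            h += abs(sq / n - t / n)          /* row distance          */
               + abs(sq % n - t % n);         /* column distance       */
        }
        return h;
    }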

Iterativedeepening A IDA simulates a b estrst searchby a series of depthrst

searches with successively increased costb ounds Its low space overhead of O d makes

it feasible in applications where A cannot b e used due to memory limitations

With a nonoverestimating heuristic estimate function IDA is guaranteed to nd an

optimal shortest solution In contrast to DFBB IDA halts after nding a rst

solution b ecause optimality is guaranteed by the iterative approach with the minimal

costb ound increments
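
A minimal sequential IDA* skeleton is sketched below; h() is the admissible heuristic (e.g. the Manhattan distance above), and the state/move interface is assumed for illustration.

    #include <limits.h>

    typedef struct state State;          /* puzzle position (assumed)   */
    extern int  h(const State *);        /* admissible estimate         */
    extern int  is_goal(const State *);
    extern int  num_moves(const State *);
    extern void do_move(State *, int);
    extern void undo_move(State *, int);

    static int next_bound;               /* smallest f-value > bound    */

    /* Depth-first probe: returns 1 iff a solution within `bound` exists. */
    static int probe(State *s, int g, int bound)
    {
        int f = g + h(s);
        if (f > bound) {
            if (f < next_bound)
                next_bound = f;          /* minimal next increment      */
            return 0;
        }
        if (is_goal(s))
            return 1;
        for (int m = 0; m < num_moves(s); m++) {
            do_move(s, m);
            int found = probe(s, g + 1, bound);
            undo_move(s, m);
            if (found)
                return 1;
        }
        return 0;
    }

    /* IDA*: a series of depth-first searches with minimally increased
     * cost bounds; the first solution found is optimal. */
    int ida_star(State *root)
    {
        int bound = h(root);
        for (;;) {
            next_bound = INT_MAX;
            if (probe(root, 0, bound))
                return bound;            /* optimal solution length     */
            bound = next_bound;
        }
    }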

Parallel Depth-First Search

Depth-first search can be performed in parallel by partitioning the search space into many small disjunct parts (subtrees) that can be explored concurrently. We have developed a scheme called search-frontier splitting that can be applied to breadth-first, depth-first, and best-first search. It partitions the search space into g subtrees that are taken from a search frontier containing nodes n with the same cost value f(n). Search-frontier splitting has two phases:

1. Initially, all processors synchronously generate the same search frontier and store the nodes in their local memories. A search frontier is a level in the tree where all nodes have the same cost value. Each node represents an indivisible piece of work for the next phase. Hence, on a p-processor system, the search frontier should be chosen large enough so that it contains at least p nodes.

2. In the asynchronous search phase, each processor selects a disjunct set of frontier nodes and expands them in depth-first fashion. When a processor becomes idle, it requests a work packet (an unprocessed frontier node) from another processor. Work requests are forwarded from one processor to the next until one of them has work to share. When there are no work packets left, the search space is exhausted and the search is terminated. When a processor finds a solution, all others are informed by a broadcast.

A skeleton of both phases is sketched below.
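
The following C-style skeleton summarizes both phases from the view of a single processor; all helper names are assumptions made for this sketch, and the frontier-size factor follows the footnote below.

    typedef struct { int board[16]; int g; } Node;  /* work packet (assumed) */

    extern int  root_cost(void);
    extern int  frontier_size(void);
    extern int  expand_frontier_to(int bound);   /* next minimal f-value  */
    extern int  grab_local_node(Node *);         /* own frontier share    */
    extern int  request_work(Node *);            /* ask other processors  */
    extern void depth_first_search(Node *);      /* DFBB / IDA* iteration */
    extern void poll_messages(void);             /* bounds, solutions ... */

    /* Phase 1: all processors synchronously build the same frontier by
     * raising the cost bound minimally until it is large enough. */
    void build_frontier(int p)
    {
        int bound = root_cost();
        while (frontier_size() < 2 * p)          /* const*p, const assumed */
            bound = expand_frontier_to(bound);
    }

    /* Phase 2: asynchronous search with work requests. */
    void async_search(void)
    {
        Node n;
        for (;;) {
            if (!grab_local_node(&n) &&          /* own share exhausted   */
                !request_work(&n))               /* request is forwarded  */
                break;                           /* no packets anywhere   */
            depth_first_search(&n);
            poll_messages();                     /* serve requests/bounds */
        }
    }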

Search-frontier splitting is a general scheme: it can be employed in depth-first branch-and-bound (DFBB) and in iterative-deepening depth-first search (IDA*). In DFBB, the search continues until all nodes have been either explored or pruned due to inferiority. Newly established bounds are communicated to all processors to improve the local bounds as soon as possible.

(Footnote: It seems natural to generate the search frontier iteratively, by incrementing the cost value by the minimum amount until there are at least const · p entries in the frontier-node array; in our experiments, const was set to a small value.)

In the iterative-deepening variant AIDA*, subsequent iterations are started on the previously established frontier-node arrays. Work packets change ownership when they are sent to other processors, thereby improving the global load balance over the iterations: lightly loaded processors which have asked for work in the last iteration will be better utilized in the next. As a consequence, the communication overhead decreases during the search process.

Parallel Implementations on Parix and PVM

We implemented the search-frontier splitting scheme on Parix and PVM, using the VLSI floorplan optimization problem and the 15-puzzle as application domains.

Parix

PARIX (PARallel UnIX extensions) is the native operating system on Parsytec GC systems. It provides UNIX functionality at the frontend, with library extensions for the needs of the parallel system. The Parix software package comprises components for the program development environment (compilers, tools, etc.), the runtime environment (libraries), multi-user administration, and control-net software. Parix is available for a variety of parallel Parsytec computers. At the Paderborn Center for Parallel Computing we run Parix on:

- a 1024-node transputer system GCel, where the T805 nodes are interconnected in a 2-dimensional mesh topology,
- a medium-sized PowerXplorer with PowerPC application processors and T805s for the communication,
- a high-performance GC/PowerPlus with two PowerPC application processors and four T805 communication processors per node, interconnected in a 2-dimensional 'fat' mesh with four links in each direction.

A virtual topologies library provides an optimal mapping of the application process structure onto the underlying hardware interconnection structure. Based on the virtual topologies, common global communication functions like broadcast, global sum/min/max, and barrier synchronization are available for groups of processes.

Algorithm Architecture on Parix. While in principle any of the Parix virtual topologies could have been chosen for our implementation, it is clear that the torus and ring embeddings yield the lowest dilation and congestion on our 2D hardware grid.

For the work distribution we used a packet-forwarding scheme, where work requests are forwarded to neighboring processors until either some work is sent back or the message makes a full round through the system, thereby indicating that no work is available. On a torus topology, requests are first forwarded along the horizontal ring. If none of the processors has work to share, the request is then sent along the vertical ring (see Figure 2). This yields a widespread work distribution, while each processor communicates only with a subset of O(√p) processors.

[Figure 2: Process architecture on Parix — left: work-request forwarding on the torus; right: the sender, receiver, and worker threads on each node.]

The right part of Figure 2 illustrates the process architecture of our Parix implementation. Each processor executes nine concurrent threads (lightweight processes). All threads run in the same context, that is, they share the global variables defined by the main program. Hence the worker and receiver threads have direct memory access to the processor's frontier-node array for retrieving new work packets; memory contention is arbitrated by semaphores. The sender and receiver threads serve incoming messages and send work requests on demand.
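
The thread interplay can be pictured with the following sketch. We use POSIX semaphores for illustration only; the Parix thread library differs in names but follows the same shared-context pattern.

    #include <semaphore.h>

    #define MAX_FRONTIER 4096
    typedef struct { int board[16]; int g; } Node;  /* work packet (assumed) */

    /* Frontier-node array shared by all threads of one processor. */
    static Node  frontier[MAX_FRONTIER];
    static int   n_frontier;
    static sem_t frontier_lock;                   /* binary semaphore */

    void init_frontier(void) { sem_init(&frontier_lock, 0, 1); }

    /* Worker thread: take one packet out of the shared array, if any. */
    int get_packet(Node *out)
    {
        int ok = 0;
        sem_wait(&frontier_lock);                 /* arbitrate contention */
        if (n_frontier > 0) {
            *out = frontier[--n_frontier];
            ok = 1;
        }
        sem_post(&frontier_lock);
        return ok;
    }

    /* Receiver thread: store a packet that arrived from another node. */
    void put_packet(const Node *in)
    {
        sem_wait(&frontier_lock);
        frontier[n_frontier++] = *in;
        sem_post(&frontier_lock);
    }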

PVM

The Parallel Virtual Machine (PVM) is one of the most popular message-passing models. PVM is in the public domain and can be obtained free of charge via ftp. PVM makes a heterogeneous network of parallel and serial UNIX computers appear as a single concurrent computational resource. Controlled interaction of a heterogeneous set of workstations is done via TCP/IP sockets or native MPP communication protocols, with the possibility of dynamically adding more resources to the network as they are needed.

Applications view PVM as a general and flexible parallel computing system that supports the message-passing paradigm. PVM is not restricted to specific architectures. Application programs may have arbitrary control and dependency structures, with arbitrary and even dynamically changing relationships among the parallel processes. This allows the most general form of MIMD resp. SPMD programming. Task granularity, communication frequency, and load balancing are completely at the disposal of the programmer. There exist library calls for blocking and non-blocking send and receive operations. Groups of processes can be defined for broadcasts, barrier synchronization, and other group functions. Release 3.3 provides features for the efficient use of architecture-dependent and low-level communication modes on parallel systems.

[Figure 3: Process architecture on PVM — the communicator task maintains the frontier-node array and exchanges work packets and requests with the local worker task and with neighboring nodes.]
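
For readers unfamiliar with the PVM interface, the following minimal C fragment shows the calls involved in one message transfer: the sender packs data into a buffer and ships it with a tag; the receiver blocks and unpacks. The tag value and the request semantics are our illustration, not code from the paper.

    #include "pvm3.h"

    #define TAG_REQUEST 17               /* message tag (arbitrary choice) */

    /* Send a work request carrying our task id to another PVM task. */
    void send_request(int neighbor_tid)
    {
        int mytid = pvm_mytid();
        pvm_initsend(PvmDataDefault);    /* fresh send buffer, XDR coding */
        pvm_pkint(&mytid, 1, 1);         /* pack one int, stride 1        */
        pvm_send(neighbor_tid, TAG_REQUEST);
    }

    /* Blocking receive of a work request from any task (-1 wildcard). */
    int recv_request(void)
    {
        int requester;
        pvm_recv(-1, TAG_REQUEST);
        pvm_upkint(&requester, 1, 1);
        return requester;                /* tid of the idle requester */
    }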

Algorithm Architecture on PVM. Only few modifications were necessary to adapt our Parix application to the PVM programming model; both models provide similar constructs for the communication interfaces. Compared to Parix, PVM lacks a common execution context for sharing objects among tasks running on the same processor. Therefore, we had to implement a mechanism for sharing the frontier nodes by explicit message transfer. In PVM, all synchronization and information sharing is done solely via messages, even between tasks running on the same processor. While this is certainly the most abstract, hardware-independent approach, it may cause unnecessary inefficiencies.

In our implementation (Figure 3), the frontier-node array is maintained by the communicator task. When a worker task runs out of work, it sends a message to its communicator executing on the same node. When there are no work packets left in the frontier-node array, the communicator task sends, as before, a request to its neighboring processors. However, unlike in the Parix approach, work packets are directly returned to the requester. This does not cause any extra programming effort, because PVM implies a clique network, directly connecting every processor to every other.
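
A sketch of the worker's side of this protocol is given below; the tags, the byte-wise packet encoding, and the helper names are assumptions for illustration. The communicator answers a TAG_GET message either with a packet or, once the frontier is globally exhausted, with TAG_EMPTY.

    #include "pvm3.h"

    #define TAG_GET    1              /* worker -> communicator: need work */
    #define TAG_PACKET 2              /* reply: one frontier node          */
    #define TAG_EMPTY  3              /* reply: no work left anywhere      */

    typedef struct { int board[16]; int g; } Node;  /* packet (assumed) */
    extern void depth_first_search(Node *);

    /* Worker main loop: fetch packets from the local communicator task. */
    void worker(int comm_tid)
    {
        for (;;) {
            pvm_initsend(PvmDataDefault);
            pvm_send(comm_tid, TAG_GET);          /* ask for a packet   */

            int bufid = pvm_recv(comm_tid, -1);   /* wait for any reply */
            int len, tag, tid;
            pvm_bufinfo(bufid, &len, &tag, &tid); /* inspect reply tag  */
            if (tag == TAG_EMPTY)
                break;                            /* search terminated  */

            Node n;                               /* TAG_PACKET reply   */
            pvm_upkbyte((char *)&n, sizeof n, 1);
            depth_first_search(&n);               /* DFBB / IDA* on it  */
        }
    }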

Performance Comparison

We ran two application programs (the 15-puzzle and the floorplan optimization) on three machines (GCel, GCPP, UNIX workstation cluster) and two programming models (Parix and PVM). Not counting the workstation experiments, for which only PVM was available, we ran a large series of benchmarks. From the bulk of statistical data we selected the most significant results to be presented in the following sections.

[Table 1: 15-puzzle results on the GCel transputer system — for p processors: elapsed time, speedup (sp), and efficiency (eff) under Parix and PVM.]

[Table 2: 15-puzzle results on the GC/PowerPlus — same quantities.]

Results for the 15-Puzzle

Tables 1 and 2 show the results obtained with a subset of Korf's standard random problem instances of the 15-puzzle on the GCel and GCPP. For systems with p processors, the total elapsed time, the speedup (sp), and the corresponding efficiency (eff) are given. All times are wall-clock times measured in seconds. Speedups and efficiencies are normalized to the fastest sequential Parix implementation.

With the Parix process model discussed above (Figure 2), we achieved an almost perfect scalability on both Parsytec machines, from the small configurations up to the largest networks. This is a remarkable result, considering that the largest GCel configuration comprises as many as 1024 processors, while the GCPP has far fewer PEs. The latter, however, is much more powerful, resulting in a much faster execution on the largest system. Clearly, for such short execution runs the presented speedups can hardly be improved.

Judging by the peak MFLOPS rates, a single GC/PowerPlus node is expected to be many times faster than a GCel transputer node. In this special application, the measured speedup is even larger than the ratio of the peak rates. On the other hand, the communication bandwidth of the GCPP is only moderately higher than that of the GCel. Due to this worse communication/computation performance ratio, it is clear that the puzzle application does not scale as well on the larger GCPP systems.

Being implemented on top of Parix, PVM requires some additional CPU time for bookkeeping, buffer copying, and function calls. For the larger GCel system this results in:

- increased communication latency, due to the larger number of hops,
- increased message traffic, due to the increased number of fine-grained work packets,
- increased computing overhead, since the T805 processors also perform the routing.

Even so, the PVM performance losses seem to be acceptable for medium-sized systems.

[Figure 4: PVM speedup on the SUN SparcStation cluster (Ethernet) — best, average, and worst of the runs over the number of processors p.]

PVM's availability and portability allowed us to develop our implementations on a workstation cluster, while the production runs were performed on all kinds of parallel systems. We also benchmarked a heterogeneous PVM environment that allows a collection of serial and parallel computers to appear as a single virtual machine. The task management is done by a PVM daemon running on the UNIX frontend of the parallel system. Since every communication is first checked as to whether its destination lies within or outside the MPP system, the heterogeneous PVM cannot be as efficient as the homogeneous PVM discussed above. Performance results for the heterogeneous PVM may be found in [19].

Figure 4 illustrates the performance achieved on a cluster of SUN SparcStations. While all workstations are of the same model and speed, the measurements were taken on busy systems with different load factors and a busy Ethernet with several bridges. This results in widely varying execution speeds. Figure 4 shows the average, the slowest, and the fastest of several runs on the same sample problem set. As can be seen, the initially good speedup seems to level off when more workstations are employed. The exact data indicates that this is due to the longer startup times and increased communication latencies.

[Figure 5: Communication speed between neighboring nodes — transfer time (sec) over packet size (bytes) for the GCel, PowerXplorer, and GCPP, each under Parix and PVM.]

Communication Speed of Parix and PVM

In a separate experiment we benchmarked the latency times and the communication throughput of Parix and PVM on the GCel, GCPP, and PowerXplorer. Figure 5 shows the communication performance between neighboring processors for each of the systems. The PVM times include complete packing and unpacking on both sides, sender and receiver.

The most apparent fact is that, in contrast to the other performance graphs, the GCPP curves are not straight. This is attributed to the fixed initial overhead required for multiplexing the message among the four links of the communication subsystem. The multiplexing is done by four dedicated T805 communication processors. In practice, the computing performance would greatly benefit from latency-hiding techniques, which is not reflected in these performance graphs; hence these graphs represent the worst case.

For the systems with PowerPC processors (GCPP and PowerXplorer), the communication throughputs of PVM and Parix are very similar. Only for the GCel is there a larger discrepancy between the Parix and PVM communication performance, illustrated by the dotted line. This is caused by the GCel transputer nodes, which are much slower than the PowerPC nodes. Hence it takes much longer on the GCel to initiate a communication (function calls, data copying, etc.). Once the initial setup is done, the same throughput is achieved on the GCel as on a PowerXplorer. Due to the 'fat' communication mesh with four links in each direction, the GCPP communication is faster.

[Figure 6: Speedup of the VLSI floorplan optimization on the GC/PowerPlus for Parix and PVM over the number of processors p.]

Results on the VLSI Floorplan Optimization Problem

Figure 6 shows the results obtained with the VLSI floorplan optimization algorithm. We used a standard VLSI benchmark problem from the literature, with b building blocks and six different implementations per block; this gives a search space of 6^b nodes. Compared to the puzzle application, more communication is necessary in the floorplan optimization, because the solution speed depends much on the availability of efficient bound values. Each time a new bound is found by any of the worker tasks, it is broadcast to all others.

The Parix implementation uses a torus topology for submitting work requests and for transferring work packets. Our PVM implementation, in contrast, makes use of the implicit clique topology: a master process hands out work packets to the slaves, which search their assigned subtrees and broadcast newly found bound values.

As can be seen in Figure 6, this simple farming approach yields sufficient performance on small systems. On larger systems, a hierarchy of master processes should be implemented to eliminate potential communication bottlenecks.

Conclusions

We evaluated two programming models, PVM and Parix, on three hardware platforms with two different application programs. What are the lessons learned? Recalling the title of our paper, 'Portability versus Efficiency', we draw the following conclusions.

Portability. PVM provides a portable programming model for implementing parallel applications on a wide variety of message-based MIMD systems. Many important scientific and industrial applications have been ported to PVM with the expectation to gain more independence from specific hardware platforms and vendors. First results from the Esprit project EUROPORT indicate that it is possible to execute large PVM applications on various systems, ranging from moderately parallel symmetric multiprocessor systems to massively parallel systems with many hundred nodes.

Hardware independence, however, may tempt the programmer to use program features that might later be found inefficient on certain systems. Time-critical applications may need post-optimization to exploit specific MPP system features. As an example, PVM's clique-structured communication pattern is tailored for bus-connected workstation clusters (Ethernet, FDDI, ATM). As seen above, clique communication poses problems on massively parallel 2D systems, where the communication latency depends on the number of hops a message needs to reach the recipient. Here, nearest-neighbor communication should be used instead.

Efficiency. Executing on top of the native operating system, the PVM programming interface clearly induces additional overheads. The slower the application processor, the more time is spent in setting up the communication between processors. Compared to Parix, PVM needs more bookkeeping for the management of communication buffers, for any necessary data conversions, and for additional system calls.

The performance losses are especially pronounced in the results obtained on the large transputer system (Table 1): on the largest GCel configuration, the PVM implementation took considerably longer than the Parix implementation to compute the same solution.

Due to unpredictable execution times, time-critical interactive applications should better not be run on workstation clusters. Different processor capabilities may induce load imbalances, and the system load might change dynamically during the program run time. While the different processor speeds did not affect the performance of our PVM implementations, which include a dynamic load-balancing strategy, we still observed widely varying execution times from one execution run to the next.

Process Architecture. The choice of the best process architecture for a given application may be influenced by the hardware architecture and the programming model. Compared to the Parix implementation, PVM needs more effort to implement a mechanism for sharing work packets among the communicator and worker tasks executing on the same node. In Parix, all threads run in the same context and share the global variables defined by the main program. Hence all threads have direct memory access to the same work-packet array maintained on a node; memory contention is arbitrated by semaphores. In PVM, in contrast, all communication is performed via explicit message transfer, regardless of whether it is a long-distance communication or a communication between tasks running on the same processor.

Typically, the PVM process architecture is designed without knowing the topology of the target hardware system. Since PVM does not provide information about the location of the tasks in the hardware network, fast communication (e.g., nearest-neighbor communication) is hard to implement. The virtual topologies provided by PARMACS and MPI allow more efficient mappings of tasks to processors. While such mapping functions can also be implemented by matching the Parix process descriptors to the PVM task identifiers, such programming tricks clearly violate our ultimate goal of portability.

Acknowledgements

Thanks to our local PVM expert Axel Keller, who helped a lot in getting our experiments done within a tight time frame. Also thanks to Jing Li for implementing parts of the PVM floorplan optimizer.

References

[1] S. Arvindam, V. Kumar, V. Rao: Efficient parallel algorithms for searching problems: Applications in VLSI CAD. 3rd Symp. on the Frontiers of Massively Parallel Computation, Maryland.
[2] N. Christofides, C. Whitlock: An algorithm for two-dimensional cutting problems. Operations Research.
[3] E. C. Freuder, R. J. Wallace: Partial constraint satisfaction. Artificial Intelligence.
[4] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam: PVM Users' Guide and Reference Manual. Oak Ridge National Laboratory, Knoxville, TN, Techn. Rep. ORNL/TM; available via ftp from cs.utk.edu.
[5] R. Hempel, A. J. G. Hey, O. McBryan, D. W. Walker (eds.): Special Issue on Message Passing Interfaces. Parallel Computing.
[6] M. Held, R. M. Karp: The traveling salesman problem and minimum spanning trees. Operations Research.
[7] R. E. Korf: Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence.
[8] V. Kumar, V. Rao: Scalable parallel formulations of depth-first search. In: Kumar, Gopalakrishnan, Kanal (eds.): Parallel Algorithms for Machine Intelligence and Vision. Springer.
[9] V. Kumar, A. Grama, A. Gupta, G. Karypis: Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, Redwood City, CA.
[10] Message Passing Interface Forum: MPI: A message-passing interface standard. Comp. Sc. Dept., Univ. of Tennessee, Knoxville, TN.
[11] N. J. Nilsson: Principles of Artificial Intelligence. Tioga Publ., Palo Alto, CA.
[12] Parsytec: Parix PowerPC Software Documentation.
[13] J. Pearl: Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Reading, MA.
[14] V. N. Rao, V. Kumar, K. Ramesh: A parallel implementation of iterative-deepening A*. Procs. AAAI.
[15] V. N. Rao, V. Kumar: On the efficiency of parallel backtracking. IEEE Trans. on Parallel and Distributed Systems.
[16] D. Ratner, M. Warmuth: Finding a shortest solution for the N×N extension of the 15-puzzle is intractable. Procs. AAAI.
[17] A. Reinefeld, T. A. Marsland: Enhanced iterative-deepening search. IEEE Trans. on Pattern Analysis and Machine Intelligence.
[18] A. Reinefeld, V. Schnecke: Work-load balancing in highly parallel depth-first search. Procs. Scalable High Performance Computing Conf. (SHPCC), Knoxville.
[19] A. Reinefeld, V. Schnecke: Performance of PVM on a highly parallel transputer system. First European PVM Users' Group Meeting, Rome, Italy.
[20] T. Römke, M. Röttger, U. Schroeder, J. Simon: An Efficient Mapping Library for Parix. Procs. ZEUS Workshop on Parallel Programming and Computation, Linköping, Sweden.
[21] M. Röttger, U.-P. Schroeder, J. Simon: Virtual Topologies Library for PARIX. University of Paderborn, Techn. Rep. tr-ri; available via ftp or WWW.
[22] P. M. A. Sloot, A. Hoekstra, L. O. Hertzberger: A comparison of the Iserver-Occam, Parix, Express, and PVM programming environments on a Parsytec GCel. In: W. Gentzsch, U. Harms (eds.): HPCN Munich. Springer Lecture Notes in Computer Science.
[23] L. Stockmeyer: Optimal orientations of cells in silicon floorplan designs. Information and Control.
[24] V. S. Sunderam, G. A. Geist, J. Dongarra, R. Manchek: The PVM concurrent computing system: Evolution, experiences, and trends. Parallel Computing.
[25] K. V. Viswanathan, A. Bagchi: Best-first search methods for constrained two-dimensional cutting stock problems. Operations Research.
[26] T.-C. Wang, D. F. Wong: An optimal algorithm for floorplan area optimization. Procs. ACM/IEEE Design Automation Conf.
[27] S. Wimer, I. Koren, I. Cederbaum: Optimal aspect ratios of building blocks in VLSI. ACM/IEEE Design Automation Conference.