Procs. ZEUS Workshop on Parallel Programming and Computation, Linköping, Sweden

Portability versus Efficiency:

Parallel Applications on PVM and Parix

Alexander Reinefeld                        Volker Schnecke

Center for Parallel Computing              Mathematik/Informatik
Universität Paderborn                      Universität Osnabrück
ar@uni-paderborn.de                        volker@informatik.uni-osnabrueck.de

Abstract

Analogous to the shift from assembler language programming to third-generation languages in the early years of computer science, we are currently witnessing a paradigm change towards the use of portable programming models in parallel high-performance computing. As before, the use of a high-level programming environment must be paid for by reduced system performance.

But how much does portability cost in practice? Is it worth paying that price? What effect has the choice of the programming model on the architecture of the algorithm?

In this paper we attempt to answer these questions by comparing two applications from the domain of combinatorial optimization that have been implemented with the Parix and PVM programming models. Performance benchmarks have been run on three different systems: a massively parallel system with relatively slow T805 transputer processors, a moderately parallel Parsytec GC/PowerPlus system with powerful PowerPC processors, and a UNIX workstation cluster connected by an Ethernet LAN. While the Parix implementations clearly turned out to be fastest, PVM gives portability at the cost of a small, acceptable loss in performance.

Introduction

Contemporary parallel computing systems are strikingly diverse in their architectures and operating systems. A variety of interconnection networks can be found in today's most successful commercial machines: hypercube (nCube), 2D grid (Paragon, Parsytec GC), ring/crossbar combination (Convex SPP), multistage network (IBM SP2), fat tree (CM-5), hierarchical rings (KSR), and 3D torus (Cray T3D). All of them have their specific advantages, and it is still an open question which topologies will survive in the future.

Unfortunately, the programming interfaces are also very dissimilar, making it difficult to port a given code from one hardware platform to another. As it seems, portability can only be achieved by the use of vendor-independent programming environments like PVM, MPI, PARMACS, P4, Zipcode, Express, and Linda. An overview of programming environments can be found in the literature.

Portability and scalability are the keywords of this paper. Ideally, a parallel implementation would meet both properties, but in practice interdependencies make this impossible. We compared the performance of two application programs running on Parix (PARallel extensions to UnIX) and PVM. Three different hardware platforms have been used in our experiments: a 1024-node transputer system, a Parsytec GC/PowerPlus with PowerPC processors, and a UNIX workstation cluster of SUN SparcStations. Our experiments give answers to the following questions: Is there any impact of the choice of the programming model on the architecture of the application program? How fast is PVM as compared to Parix? To what extent are PVM programs portable and scalable?

Applications

As an application problem domain we have chosen the class of discrete optimization problems, which can be defined in terms of finding a solution path in a tree or graph from an initial root state to a goal node. Many applications in planning and scheduling can be formulated with this model: the cutting stock problem, the bin packing problem, vehicle routing, VLSI floorplan optimization, satisfiability problems, the traveling salesman problem, the N×N puzzle, and partial constraint satisfaction problems. All of them are known to be NP-hard.

For our benchmarks we have chosen two typical applications with different solution techniques: the VLSI floorplan optimization problem, which is solved with a depth-first branch-and-bound strategy, and the N×N puzzle, which is solved with an iterative-deepening search. Both are based on depth-first search (DFS), which traverses the decision tree in a top-down manner from left to right. DFS is commonly used in combinatorial optimization problems because it needs only O(d·w) storage space in trees of depth d and width w: at any time, only the siblings of the nodes on the current search path must be kept in memory.

VLSI Floorplan Optimization

The floorplan area optimization is a stage in the design of VLSI chips. Here, the relative placements and areas of the building blocks of a chip are known, but their exact dimensions can still be varied over a wide range. A floorplan is represented by two dual polar graphs G = (V, E) and H = (W, F) and a list of potential implementations for each block. As shown in Figure 1, the vertices in V and W represent the vertical and horizontal line segments of the floorplan. There exists an edge e = (v_i, v_j) in the graph G if there is a block in the floorplan whose left and right edges lie on the corresponding vertical line segments. For a specific configuration, i.e. a floorplan with exact block sizes, the edges are weighted with the dimensions of the blocks in this configuration. The solution of the floorplan optimization problem is a configuration with minimum layout area, given by the product of the longest paths in the graphs G and H.

[Figure 1: A floorplan with blocks B1, ..., B4 and the dual polar graphs G and H.]
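
Since the two polar graphs are acyclic, the longest paths can be computed in linear time by relaxing edges in topological order. The following C sketch illustrates this evaluation step for one configuration; the array-based graph encoding and all identifiers are our assumptions, not the paper's code.

    #define MAXV   64   /* max. vertices per polar graph (assumed limit) */
    #define MAXDEG  8   /* max. out-degree (assumed limit)               */

    /* Longest source-to-sink path in a DAG whose vertices 0..nv-1 are
     * already numbered in topological order.  adj[u][i] is the target of
     * the i-th outgoing edge of u; w[u][i] is its weight, i.e. the width
     * (graph G) resp. height (graph H) of the corresponding block.  The
     * layout area of a configuration is longest_path(G) * longest_path(H). */
    int longest_path(int nv, const int deg[], const int adj[][MAXDEG],
                     const int w[][MAXDEG])
    {
        int dist[MAXV] = { 0 };
        for (int u = 0; u < nv; u++)                /* topological order */
            for (int i = 0; i < deg[u]; i++) {
                int v = adj[u][i];
                if (dist[u] + w[u][i] > dist[v])
                    dist[v] = dist[u] + w[u][i];    /* relax edge u -> v */
            }
        return dist[nv - 1];                 /* sink is the last vertex */
    }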

Our implementation builds a tree where the leaves are complete configurations and the inner nodes at depth d represent partial floorplans consisting of the blocks B_1, ..., B_d. The depth-first branch-and-bound (DFBB) solution algorithm employs a heuristic cost function to eliminate unnecessary parts of the search space that are known not to contain an optimal solution. When a new, possibly non-optimal solution has been found, the search continues with the improved cost bound, now pruning all subtrees with cost estimates higher than the new cost bound. Newly established cost bounds are broadcast so that all processors share the best available bound at any stage of the search.
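
To make the pruning rule concrete, the following C fragment sketches the sequential core of such a DFBB search. It is a minimal illustration under assumed types and helpers (Node, lower_bound, expand, ...), not the authors' implementation; in the parallel version, best_cost is additionally updated by incoming broadcasts.

    #include <limits.h>

    #define MAX_CHILDREN 8                 /* assumed branching limit     */
    typedef struct {
        int depth;                         /* blocks B1..Bdepth placed    */
        /* ... further configuration data (assumed) ... */
    } Node;

    extern int  lower_bound(const Node *); /* heuristic cost estimate     */
    extern int  cost(const Node *);        /* area of a complete leaf     */
    extern int  is_leaf(const Node *);
    extern int  expand(const Node *, Node *); /* returns no. of children  */
    extern void record_solution(const Node *);

    static int best_cost = INT_MAX;        /* best complete solution yet  */

    /* Depth-first branch-and-bound: prune every subtree whose estimate
     * already reaches the best known bound. */
    void dfbb(const Node *n)
    {
        if (lower_bound(n) >= best_cost)
            return;                        /* cannot improve: prune       */
        if (is_leaf(n)) {
            if (cost(n) < best_cost) {
                best_cost = cost(n);       /* improved bound; the parallel */
                record_solution(n);        /* code broadcasts it here      */
            }
            return;
        }
        Node child[MAX_CHILDREN];
        int k = expand(n, child);
        for (int i = 0; i < k; i++)
            dfbb(&child[i]);
    }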

The N×N Puzzle

Applications with bad initial cost bounds and many alternatives in the decision trees are solved with heuristic search algorithms like A* and IDA*. One such example is the N×N puzzle. Here, a given initial configuration of N²−1 tiles in a squared tray of size N×N must be rearranged with the fewest number of moves into a given goal configuration. No effective upper cost bounds are known for this problem, and hence it cannot be solved with DFBB. Even the relatively small 15-puzzle (N = 4) has a search space of 16!/2 ≈ 10^13 states. While it would seem easy to obtain just any solution, finding an optimal (shortest) solution is NP-complete. In general, it takes some hundred million node expansions to solve a random problem instance of the 15-puzzle using the Manhattan distance (the sum of the minimum displacements of each tile from its goal position) as a heuristic estimate function.
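
For illustration, the Manhattan distance of a position can be computed as follows; the board encoding (tile t's goal square is square t, 0 denotes the blank) is our assumption.

    #include <stdlib.h>   /* abs() */

    /* Manhattan distance for the N x N puzzle: sum of the grid distances
     * of all tiles from their goal squares; the blank (0) is skipped.
     * board[sq] holds the tile on square sq; in the goal position, tile t
     * sits on square t (assumed encoding). */
    int manhattan(const int board[], int n)
    {
        int h = 0;
        for (int sq = 0; sq < n * n; sq++) {
            int t = board[sq];
            if (t == 0)
                continue;                     /* don't count the blank */
            h += abs(sq / n - t / n)          /* row distance          */
               + abs(sq % n - t % n);         /* column distance       */
        }
        return h;
    }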

Iterativedeepening A IDA simulates a b estrst searchby a series of depthrst

searches with successively increased costb ounds Its low space overhead of O d makes

it feasible in applications where A cannot b e used due to memory limitations

With a nonoverestimating heuristic estimate function IDA is guaranteed to nd an

optimal shortest solution In contrast to DFBB IDA halts after nding a rst

solution b ecause optimality is guaranteed by the iterative approach with the minimal

costb ound increments
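
A minimal sequential IDA* skeleton is sketched below; h() is the admissible heuristic (e.g. the Manhattan distance above), and the state/move interface is assumed for illustration.

    #include <limits.h>

    typedef struct state State;          /* puzzle position (assumed)   */
    extern int  h(const State *);        /* admissible estimate         */
    extern int  is_goal(const State *);
    extern int  num_moves(const State *);
    extern void do_move(State *, int);
    extern void undo_move(State *, int);

    static int next_bound;               /* smallest f-value > bound    */

    /* Depth-first probe: returns 1 iff a solution within `bound` exists. */
    static int probe(State *s, int g, int bound)
    {
        int f = g + h(s);
        if (f > bound) {
            if (f < next_bound)
                next_bound = f;          /* minimal next increment      */
            return 0;
        }
        if (is_goal(s))
            return 1;
        for (int m = 0; m < num_moves(s); m++) {
            do_move(s, m);
            int found = probe(s, g + 1, bound);
            undo_move(s, m);
            if (found)
                return 1;
        }
        return 0;
    }

    /* IDA*: a series of depth-first searches with minimally increased
     * cost bounds; the first solution found is optimal. */
    int ida_star(State *root)
    {
        int bound = h(root);
        for (;;) {
            next_bound = INT_MAX;
            if (probe(root, 0, bound))
                return bound;            /* optimal solution length     */
            bound = next_bound;
        }
    }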

Parallel Depth-First Search

Depth-first search can be performed in parallel by partitioning the search space into many small disjunct parts (subtrees) that can be explored concurrently. We have developed a scheme called search-frontier splitting that can be applied to breadth-first, depth-first, and best-first search. It partitions the search space into g subtrees that are taken from a search frontier containing nodes n with the same cost value f(n). Search-frontier splitting has two phases:

1. Initially, all processors synchronously generate the same search frontier and store the nodes in their local memories. A search frontier is a level in the tree where all nodes have the same cost value. Each node represents an indivisible piece of work for the next phase. Hence, on a p-processor system, the search frontier should be chosen large enough so that it contains at least p nodes.

2. In the asynchronous search phase, each processor selects a disjunct set of frontier nodes and expands them in depth-first fashion. When a processor becomes idle, it requests a work packet (an unprocessed frontier node) from another processor. Work requests are forwarded from one processor to the next until one of them has work to share. When there are no work packets left, the search space is exhausted and the search is terminated. When a processor finds a solution, all others are informed by a broadcast.

A skeleton of both phases is sketched below.
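
The following C-style skeleton summarizes both phases from the view of a single processor; all helper names are assumptions made for this sketch, and the frontier-size factor follows the footnote below.

    typedef struct { int board[16]; int g; } Node;  /* work packet (assumed) */

    extern int  root_cost(void);
    extern int  frontier_size(void);
    extern int  expand_frontier_to(int bound);   /* next minimal f-value  */
    extern int  grab_local_node(Node *);         /* own frontier share    */
    extern int  request_work(Node *);            /* ask other processors  */
    extern void depth_first_search(Node *);      /* DFBB / IDA* iteration */
    extern void poll_messages(void);             /* bounds, solutions ... */

    /* Phase 1: all processors synchronously build the same frontier by
     * raising the cost bound minimally until it is large enough. */
    void build_frontier(int p)
    {
        int bound = root_cost();
        while (frontier_size() < 2 * p)          /* const*p, const assumed */
            bound = expand_frontier_to(bound);
    }

    /* Phase 2: asynchronous search with work requests. */
    void async_search(void)
    {
        Node n;
        for (;;) {
            if (!grab_local_node(&n) &&          /* own share exhausted   */
                !request_work(&n))               /* request is forwarded  */
                break;                           /* no packets anywhere   */
            depth_first_search(&n);
            poll_messages();                     /* serve requests/bounds */
        }
    }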

Search-frontier splitting is a general scheme: it can be employed in depth-first branch-and-bound (DFBB) and in iterative-deepening depth-first search (IDA*). In DFBB, the search continues until all nodes have been either explored or pruned due to inferiority. Newly established bounds are communicated to all processors to improve the local bounds as soon as possible.

(Footnote: It seems natural to generate the search frontier iteratively, by incrementing the cost value by the minimum amount until there are at least const · p entries in the frontier-node array; in our experiments, const was set to a small value.)

In the iterative-deepening variant AIDA*, subsequent iterations are started on the previously established frontier-node arrays. Work packets change ownership when they are sent to other processors, thereby improving the global load balance over the iterations: lightly loaded processors which have asked for work in the last iteration will be better utilized in the next. As a consequence, the communication overhead decreases during the search process.

Parallel Implementations on Parix and PVM

We implemented the search-frontier splitting scheme on Parix and PVM, using the VLSI floorplan optimization problem and the 15-puzzle as application domains.

Parix

PARIX (PARallel UnIX extensions) is the native operating system on Parsytec GC systems. It provides UNIX functionality at the frontend, with library extensions for the needs of the parallel system. The Parix software package comprises components for the program development environment (compilers, tools, etc.), the runtime environment (libraries), multi-user administration, and control-net software. Parix is available for a variety of parallel Parsytec computers. At the Paderborn Center for Parallel Computing we run Parix on:

- a 1024-node transputer system GCel, where the T805 nodes are interconnected in a 2-dimensional mesh topology,
- a medium-sized PowerXplorer with PowerPC application processors and T805s for the communication,
- a high-performance GC/PowerPlus with two PowerPC application processors and four T805 communication processors per node, interconnected in a 2-dimensional 'fat' mesh with four links in each direction.

A virtual topologies library provides an optimal mapping of the application process structure onto the underlying hardware interconnection structure. Based on the virtual topologies, common global communication functions like broadcast, global sum/min/max, and barrier synchronization are available for groups of processes.

Algorithm Architecture on Parix. While in principle any of the Parix virtual topologies could have been chosen for our implementation, it is clear that the torus and ring embeddings yield the lowest dilation and congestion on our 2D hardware grid.

For the work distribution we used a packet-forwarding scheme, where work requests are forwarded to neighboring processors until either some work is sent back or the message makes a full round through the system, thereby indicating that no work is available. On a torus topology, requests are first forwarded along the horizontal ring. If none of the processors has work to share, the request is then sent along the vertical ring (see Figure 2). This yields a widespread work distribution, while each processor communicates only with a subset of O(√p) processors.

[Figure 2: Process architecture on Parix — left: work-request forwarding on the torus; right: the sender, receiver, and worker threads on each node.]

The right part of Figure 2 illustrates the process architecture of our Parix implementation. Each processor executes nine concurrent threads (lightweight processes). All threads run in the same context, that is, they share the global variables defined by the main program. Hence the worker and receiver threads have direct memory access to the processor's frontier-node array for retrieving new work packets; memory contention is arbitrated by semaphores. The sender and receiver threads serve incoming messages and send work requests on demand.
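
The thread interplay can be pictured with the following sketch. We use POSIX semaphores for illustration only; the Parix thread library differs in names but follows the same shared-context pattern.

    #include <semaphore.h>

    #define MAX_FRONTIER 4096
    typedef struct { int board[16]; int g; } Node;  /* work packet (assumed) */

    /* Frontier-node array shared by all threads of one processor. */
    static Node  frontier[MAX_FRONTIER];
    static int   n_frontier;
    static sem_t frontier_lock;                   /* binary semaphore */

    void init_frontier(void) { sem_init(&frontier_lock, 0, 1); }

    /* Worker thread: take one packet out of the shared array, if any. */
    int get_packet(Node *out)
    {
        int ok = 0;
        sem_wait(&frontier_lock);                 /* arbitrate contention */
        if (n_frontier > 0) {
            *out = frontier[--n_frontier];
            ok = 1;
        }
        sem_post(&frontier_lock);
        return ok;
    }

    /* Receiver thread: store a packet that arrived from another node. */
    void put_packet(const Node *in)
    {
        sem_wait(&frontier_lock);
        frontier[n_frontier++] = *in;
        sem_post(&frontier_lock);
    }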

PVM

The Parallel Virtual Machine (PVM) is one of the most popular message-passing models. PVM is in the public domain and can be obtained free of charge via ftp. PVM makes a heterogeneous network of parallel and serial UNIX computers appear as a single concurrent computational resource. Controlled interaction of a heterogeneous set of workstations is done via TCP/IP sockets or native MPP communication protocols, with the possibility of dynamically adding more resources to the network as they are needed.

Applications view PVM as a general and flexible parallel computing system that supports the message-passing paradigm. PVM is not restricted to specific architectures. Application programs may have arbitrary control and dependency structures, with arbitrary and even dynamically changing relationships among the parallel processes. This allows the most general form of MIMD resp. SPMD programming. Task granularity, communication frequency, and load balancing are completely at the disposal of the programmer. There exist library calls for blocking and non-blocking send and receive operations. Groups of processes can be defined for broadcasts, barrier synchronization, and other group functions. Release 3.3 provides features for the efficient use of architecture-dependent and low-level communication modes on parallel systems.

[Figure 3: Process architecture on PVM — the communicator task maintains the frontier-node array and exchanges work packets and requests with the local worker task and with neighboring nodes.]
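
For readers unfamiliar with the PVM interface, the following minimal C fragment shows the calls involved in one message transfer: the sender packs data into a buffer and ships it with a tag; the receiver blocks and unpacks. The tag value and the request semantics are our illustration, not code from the paper.

    #include "pvm3.h"

    #define TAG_REQUEST 17               /* message tag (arbitrary choice) */

    /* Send a work request carrying our task id to another PVM task. */
    void send_request(int neighbor_tid)
    {
        int mytid = pvm_mytid();
        pvm_initsend(PvmDataDefault);    /* fresh send buffer, XDR coding */
        pvm_pkint(&mytid, 1, 1);         /* pack one int, stride 1        */
        pvm_send(neighbor_tid, TAG_REQUEST);
    }

    /* Blocking receive of a work request from any task (-1 wildcard). */
    int recv_request(void)
    {
        int requester;
        pvm_recv(-1, TAG_REQUEST);
        pvm_upkint(&requester, 1, 1);
        return requester;                /* tid of the idle requester */
    }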

Algorithm Architecture on PVM. Only few modifications were necessary to adapt our Parix application to the PVM programming model; both models provide similar constructs for the communication interfaces. Compared to Parix, PVM lacks a common execution context for sharing objects among tasks running on the same processor. Therefore, we had to implement a mechanism for sharing the frontier nodes by explicit message transfer. In PVM, all synchronization and information sharing is done solely via messages, even between tasks running on the same processor. While this is certainly the most abstract, hardware-independent approach, it may cause unnecessary inefficiencies.

In our implementation (Figure 3), the frontier-node array is maintained by the communicator task. When a worker task runs out of work, it sends a message to its communicator executing on the same node. When there are no work packets left in the frontier-node array, the communicator task sends, as before, a request to its neighboring processors. However, unlike in the Parix approach, work packets are directly returned to the requester. This does not cause any extra programming effort, because PVM implies a clique network, directly connecting every processor to every other.
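
A sketch of the worker's side of this protocol is given below; the tags, the byte-wise packet encoding, and the helper names are assumptions for illustration. The communicator answers a TAG_GET message either with a packet or, once the frontier is globally exhausted, with TAG_EMPTY.

    #include "pvm3.h"

    #define TAG_GET    1              /* worker -> communicator: need work */
    #define TAG_PACKET 2              /* reply: one frontier node          */
    #define TAG_EMPTY  3              /* reply: no work left anywhere      */

    typedef struct { int board[16]; int g; } Node;  /* packet (assumed) */
    extern void depth_first_search(Node *);

    /* Worker main loop: fetch packets from the local communicator task. */
    void worker(int comm_tid)
    {
        for (;;) {
            pvm_initsend(PvmDataDefault);
            pvm_send(comm_tid, TAG_GET);          /* ask for a packet   */

            int bufid = pvm_recv(comm_tid, -1);   /* wait for any reply */
            int len, tag, tid;
            pvm_bufinfo(bufid, &len, &tag, &tid); /* inspect reply tag  */
            if (tag == TAG_EMPTY)
                break;                            /* search terminated  */

            Node n;                               /* TAG_PACKET reply   */
            pvm_upkbyte((char *)&n, sizeof n, 1);
            depth_first_search(&n);               /* DFBB / IDA* on it  */
        }
    }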

Performance Comparison

We ran two application programs (the 15-puzzle and the floorplan optimization) on three machines (GCel, GCPP, UNIX workstation cluster) and two programming models (Parix and PVM). Not counting the workstation experiments, for which only PVM was available, we ran a large series of benchmarks. From the bulk of statistical data we selected the most significant results to be presented in the following sections.

[Table 1: 15-puzzle results on the GCel transputer system — for p processors: elapsed time, speedup (sp), and efficiency (eff) under Parix and PVM.]

[Table 2: 15-puzzle results on the GC/PowerPlus — same quantities.]

Results for the 15-Puzzle

Tables 1 and 2 show the results obtained with a subset of Korf's standard random problem instances of the 15-puzzle on the GCel and GCPP. For systems with p processors, the total elapsed time, the speedup (sp), and the corresponding efficiency (eff) are given. All times are wall-clock times measured in seconds. Speedups and efficiencies are normalized to the fastest sequential Parix implementation.

With the Parix process model discussed above (Figure 2), we achieved an almost perfect scalability on both Parsytec machines, from the small configurations up to the largest networks. This is a remarkable result, considering that the largest GCel configuration comprises as many as 1024 processors, while the GCPP has far fewer PEs. The latter, however, is much more powerful, resulting in a much faster execution on the largest system. Clearly, for such short execution runs the presented speedups can hardly be improved.

Judging by the peak MFLOPS rates, a single GC/PowerPlus node is expected to be many times faster than a GCel transputer node. In this special application, the measured speedup is even larger than the ratio of the peak rates. On the other hand, the communication bandwidth of the GCPP is only moderately higher than that of the GCel. Due to this worse communication/computation performance ratio, it is clear that the puzzle application does not scale as well on the larger GCPP systems.

Being implemented on top of Parix, PVM requires some additional CPU time for bookkeeping, buffer copying, and function calls. For the larger GCel system this results in:

- increased communication latency, due to the larger number of hops,
- increased message traffic, due to the increased number of fine-grained work packets,
- increased computing overhead, since the T805 processors also perform the routing.

Even so, the PVM performance losses seem to be acceptable for medium-sized systems.

[Figure 4: PVM speedup on the SUN SparcStation cluster (Ethernet) — best, average, and worst of the runs over the number of processors p.]

PVM's availability and portability allowed us to develop our implementations on a workstation cluster, while the production runs were performed on all kinds of parallel systems. We also benchmarked a heterogeneous PVM environment that allows a collection of serial and parallel computers to appear as a single virtual machine. The task management is done by a PVM daemon running on the UNIX frontend of the parallel system. Since every communication is first checked as to whether its destination lies within or outside the MPP system, the heterogeneous PVM cannot be as efficient as the homogeneous PVM discussed above. Performance results for the heterogeneous PVM may be found in [19].

Figure 4 illustrates the performance achieved on a cluster of SUN SparcStations. While all workstations are of the same model and speed, the measurements were taken on busy systems with different load factors and a busy Ethernet with several bridges. This results in widely varying execution speeds. Figure 4 shows the average, the slowest, and the fastest of several runs on the same sample problem set. As can be seen, the initially good speedup seems to level off when more workstations are employed. The exact data indicates that this is due to the longer startup times and increased communication latencies.

[Figure 5: Communication speed between neighboring nodes — transfer time (sec) over packet size (bytes) for the GCel, PowerXplorer, and GCPP, each under Parix and PVM.]

Communication Speed of Parix and PVM

In a separate experiment we benchmarked the latency times and the communication throughput of Parix and PVM on the GCel, GCPP, and PowerXplorer. Figure 5 shows the communication performance between neighboring processors for each of the systems. The PVM times include complete packing and unpacking on both sides, sender and receiver.

The most apparent fact is that, in contrast to the other performance graphs, the GCPP curves are not straight. This is attributed to the fixed initial overhead required for multiplexing the message among the four links of the communication subsystem. The multiplexing is done by four dedicated T805 communication processors. In practice, the computing performance would greatly benefit from latency-hiding techniques, which is not reflected in these performance graphs; hence these graphs represent the worst case.

For the systems with PowerPC processors (GCPP and PowerXplorer), the communication throughputs of PVM and Parix are very similar. Only for the GCel is there a larger discrepancy between the Parix and PVM communication performance, illustrated by the dotted line. This is caused by the GCel transputer nodes, which are much slower than the PowerPC nodes. Hence it takes much longer on the GCel to initiate a communication (function calls, data copying, etc.). Once the initial setup is done, the same throughput is achieved on the GCel as on a PowerXplorer. Due to the 'fat' communication mesh with four links in each direction, the GCPP communication is faster.

[Figure 6: Speedup of the VLSI floorplan optimization on the GC/PowerPlus for Parix and PVM over the number of processors p.]

Results on the VLSI Floorplan Optimization Problem

Figure 6 shows the results obtained with the VLSI floorplan optimization algorithm. We used a standard VLSI benchmark problem from the literature, with b building blocks and six different implementations per block; this gives a search space of 6^b nodes. Compared to the puzzle application, more communication is necessary in the floorplan optimization, because the solution speed depends much on the availability of efficient bound values. Each time a new bound is found by any of the worker tasks, it is broadcast to all others.

The Parix implementation uses a torus topology for submitting work requests and for transferring work packets. Our PVM implementation, in contrast, makes use of the implicit clique topology: a master process hands out work packets to the slaves, which search their assigned subtrees and broadcast newly found bound values.

As can be seen in Figure 6, this simple farming approach yields sufficient performance on small systems. On larger systems, a hierarchy of master processes should be implemented to eliminate potential communication bottlenecks.

Conclusions

We evaluated two programming models, PVM and Parix, on three hardware platforms with two different application programs. What are the lessons learned? Recalling the title of our paper, 'Portability versus Efficiency', we draw the following conclusions.

Portability. PVM provides a portable programming model for implementing parallel applications on a wide variety of message-based MIMD systems. Many important scientific and industrial applications have been ported to PVM with the expectation to gain more independence from specific hardware platforms and vendors. First results from the Esprit project EUROPORT indicate that it is possible to execute large PVM applications on various systems, ranging from moderately parallel symmetric multiprocessor systems to massively parallel systems with many hundred nodes.

Hardware independence, however, may tempt the programmer to use program features that might later be found inefficient on certain systems. Time-critical applications may need post-optimization to exploit specific MPP system features. As an example, PVM's clique-structured communication pattern is tailored for bus-connected workstation clusters (Ethernet, FDDI, ATM). As seen above, clique communication poses problems on massively parallel 2D systems, where the communication latency depends on the number of hops a message needs to reach the recipient. Here, nearest-neighbor communication should be used instead.

Efficiency. Executing on top of the native operating system, the PVM programming interface clearly induces additional overheads. The slower the application processor, the more time is spent in setting up the communication between processors. Compared to Parix, PVM needs more bookkeeping for the management of communication buffers, for any necessary data conversions, and for additional system calls.

The performance losses are especially pronounced in the results obtained on the large transputer system (Table 1): on the largest GCel configuration, the PVM implementation took considerably longer than the Parix implementation to compute the same solution.

Due to unpredictable execution times, time-critical interactive applications should better not be run on workstation clusters. Different processor capabilities may induce load imbalances, and the system load might change dynamically during the program run time. While the different processor speeds did not affect the performance of our PVM implementations, which include a dynamic load-balancing strategy, we still observed widely varying execution times from one execution run to the next.

Process Architecture. The choice of the best process architecture for a given application may be influenced by the hardware architecture and the programming model. Compared to the Parix implementation, PVM needs more effort to implement a mechanism for sharing work packets among the communicator and worker tasks executing on the same node. In Parix, all threads run in the same context and share the global variables defined by the main program. Hence all threads have direct memory access to the same work-packet array maintained on a node; memory contention is arbitrated by semaphores. In PVM, in contrast, all communication is performed via explicit message transfer, regardless of whether it is a long-distance communication or a communication between tasks running on the same processor.

Typically, the PVM process architecture is designed without knowing the topology of the target hardware system. Since PVM does not provide information about the location of the tasks in the hardware network, fast communication (e.g., nearest-neighbor communication) is hard to implement. The virtual topologies provided by PARMACS and MPI allow more efficient mappings of tasks to processors. While such mapping functions can also be implemented by matching the Parix process descriptors to the PVM task identifiers, such programming tricks clearly violate our ultimate goal of portability.

Acknowledgements

Thanks to our local PVM expert Axel Keller, who helped a lot in getting our experiments done within a tight time frame. Also thanks to Jing Li for implementing parts of the PVM floorplan optimizer.

References

[1] S. Arvindam, V. Kumar, V. Rao: Efficient parallel algorithms for searching problems: Applications in VLSI CAD. 3rd Symp. on the Frontiers of Massively Parallel Computation, Maryland.
[2] N. Christofides, C. Whitlock: An algorithm for two-dimensional cutting problems. Operations Research.
[3] E. C. Freuder, R. J. Wallace: Partial constraint satisfaction. Artificial Intelligence.
[4] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam: PVM Users' Guide and Reference Manual. Oak Ridge National Laboratory, Knoxville, TN, Techn. Rep. ORNL/TM; available via ftp from cs.utk.edu.
[5] R. Hempel, A. J. G. Hey, O. McBryan, D. W. Walker (eds.): Special Issue on Message Passing Interfaces. Parallel Computing.
[6] M. Held, R. M. Karp: The traveling salesman problem and minimum spanning trees. Operations Research.
[7] R. E. Korf: Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence.
[8] V. Kumar, V. Rao: Scalable parallel formulations of depth-first search. In: Kumar, Gopalakrishnan, Kanal (eds.): Parallel Algorithms for Machine Intelligence and Vision. Springer.
[9] V. Kumar, A. Grama, A. Gupta, G. Karypis: Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, Redwood City, CA.
[10] Message Passing Interface Forum: MPI: A message-passing interface standard. Comp. Sc. Dept., Univ. of Tennessee, Knoxville, TN.
[11] N. J. Nilsson: Principles of Artificial Intelligence. Tioga Publ., Palo Alto, CA.
[12] Parsytec: Parix PowerPC Software Documentation.
[13] J. Pearl: Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Reading, MA.
[14] V. N. Rao, V. Kumar, K. Ramesh: A parallel implementation of iterative-deepening A*. Procs. AAAI.
[15] V. N. Rao, V. Kumar: On the efficiency of parallel backtracking. IEEE Trans. on Parallel and Distributed Systems.
[16] D. Ratner, M. Warmuth: Finding a shortest solution for the N×N extension of the 15-puzzle is intractable. Procs. AAAI.
[17] A. Reinefeld, T. A. Marsland: Enhanced iterative-deepening search. IEEE Trans. on Pattern Analysis and Machine Intelligence.
[18] A. Reinefeld, V. Schnecke: Work-load balancing in highly parallel depth-first search. Procs. Scalable High Performance Computing Conf. (SHPCC), Knoxville.
[19] A. Reinefeld, V. Schnecke: Performance of PVM on a highly parallel transputer system. First European PVM Users' Group Meeting, Rome, Italy.
[20] T. Römke, M. Röttger, U. Schroeder, J. Simon: An Efficient Mapping Library for Parix. Procs. ZEUS Workshop on Parallel Programming and Computation, Linköping, Sweden.
[21] M. Röttger, U.-P. Schroeder, J. Simon: Virtual Topologies Library for PARIX. University of Paderborn, Techn. Rep. tr-ri; available via ftp or WWW.
[22] P. M. A. Sloot, A. Hoekstra, L. O. Hertzberger: A comparison of the Iserver-Occam, Parix, Express, and PVM programming environments on a Parsytec GCel. In: W. Gentzsch, U. Harms (eds.): HPCN Munich. Springer Lecture Notes in Computer Science.
[23] L. Stockmeyer: Optimal orientations of cells in silicon floorplan designs. Information and Control.
[24] V. S. Sunderam, G. A. Geist, J. Dongarra, R. Manchek: The PVM concurrent computing system: Evolution, experiences, and trends. Parallel Computing.
[25] K. V. Viswanathan, A. Bagchi: Best-first search methods for constrained two-dimensional cutting stock problems. Operations Research.
[26] T.-C. Wang, D. F. Wong: An optimal algorithm for floorplan area optimization. Procs. ACM/IEEE Design Automation Conf.
[27] S. Wimer, I. Koren, I. Cederbaum: Optimal aspect ratios of building blocks in VLSI. ACM/IEEE Design Automation Conference.