Reducing I/O Complexity by Simulating Coarse Grained Parallel Algorithms

Frank Dehne, David Hutchinson, Anil Maheshwari
School of Computer Science, Carleton University, Ottawa, Canada K1S 5B6
{dehne, hutchins, maheshwa}@scs.carleton.ca

Wolfgang Dittrich
Bosch Telecom GmbH, UC-ON/ERS, Gerberstraße 33, 71522 Backnang, Germany

Abstract

Block-wise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a deterministic simulation technique which transforms parallel algorithms into (parallel) external memory algorithms. Specifically, we present a deterministic simulation technique which transforms Coarse Grained Multicomputer (CGM) algorithms into external memory algorithms for the Parallel Disk Model. Our technique optimizes block-wise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory.

We obtain new improved parallel external memory algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including 3D convex hulls (2D Voronoi diagrams), and various graph problems. All of the (parallel) external memory algorithms obtained via simulation are analyzed with respect to the computation time, communication time and the number of I/Os. Our results answer the challenge posed by the ACM working group on storage I/O for large-scale computing [8].

1 Introduction

Cormen and Goodrich [8] posed the challenge of combining BSP-like parallel algorithms with the requirements for parallel disk I/O. Solutions based on probabilistic methods were presented in [11] and [14]. In this paper, we present deterministic solutions which are based on a deterministic simulation of parallel algorithms for the Coarse Grained Multicomputer (CGM) model and answer that challenge. The analysis of the I/O complexity of our algorithms is done as in the PDM model. In addition, we analyze the running time and communication time.

The parameter space for EM problems which we are proposing in this paper is both practical and interesting. The logarithmic term in the I/O complexity of sorting is bounded by a constant c if N/B ≤ (M/B)^c, where M = N/v. Since this constraint involves the parameters v, B, N, c, we have a four-dimensional constraint space. For practical purposes, the parameter B can be fixed at about 10^3 for disk I/O [22]. This reduces the constraint to a surface N^(c−1) = v^c B^(c−1) in three dimensional space. Any point on or above the surface represents a valid set of parameters for the elimination of the logarithmic factor. For 100 processors or less, for instance, any problem size greater than about 10 mega-items is sufficient.

Due to page restrictions for the conference proceedings, some details are omitted in this paper. An extended version with additional figures and charts can be found at http://www.scs.carleton.ca/~dehne.

We first review some definitions for the BSP and CGM models; consult [11, 23, 13] for more details. A BSP algorithm A on a fixed problem instance P and computer configuration C can be characterized by the parameters (N, v, g, L, λ), where N is the problem size (in problem items), v is the number of processors, g is the time required for a communication operation, L is the minimum time required for the processors to synchronize, and λ is the number of BSP supersteps required by A on P and C. (The times are in number of processor cycles.) A CGM algorithm is a special case of a BSP algorithm where the communication part of each superstep consists of exactly one h-relation with h = N/v. Such a superstep is called a round. The notion of an h-relation is often used in the analysis of parallel algorithms based on BSP-like models. An h-relation is a communication superstep in which each of the v processors sends and receives at most h data items. An algorithm for a CGM with multiple disks attached to each processor is referred to as an EM-CGM algorithm.

The following outlines the results obtained.

1. We show that any v-processor CGM algorithm A with λ supersteps/rounds, local memory size μ, computation time β + λL, communication time gα + λL and message size Θ(N/v²) can be simulated, deterministically, as a p-processor EM-CGM algorithm A' with computation time (v/p)β + O(λ(v/p)μ) + λ(v/p)L, communication time (v/p)gα + λ(v/p)L, and I/O time G·O(λ(v/p)(μ/(DB))) + λ(v/p)L, for p ≤ v, M = Θ(μ), N = Ω(vDB), and B = O(N/v²). G is the time (in processor cycles) for a parallel I/O operation.

   • Let g(N), L(N), v(N) be increasing functions of N. If A is c-optimal (see [16, 11] for definition) on the CGM for g ≤ g(N), L ≤ L(N) and v ≤ v(N), then A' is a c-optimal EM-CGM algorithm for β = ω(λμ), g ≤ g(N), G = BD·o(β/(λμ)), and L ≤ (p/v)·L(N).

   • A' is work-optimal, communication-efficient, and I/O-efficient (see [11] for definitions) if β = Ω(λμ), g ≤ g(N), G = BD·O(β/(λμ)), and L ≤ (p/v)·L(N).

   • While our parameter space is constrained to a coarse grained scenario which is typical of the CGM constraints for parallel computation, we show that this parameter space is both interesting and appropriate for EM computation. Once the expression for the general lower bounds reported by [1, 26, 3] is specialized for the subset of the parameter space that we consider, it becomes fully compatible with our results. This answers questions of Cormen [7] and Vitter [27] on the apparent contradictions between the results of [11] and the previously stated lower bounds.

2. We obtain new, simple, parallel EM algorithms for sorting, permutation, and matrix transpose with I/O complexity O(N/(pDB)).

3. We obtain parallel EM algorithms with I/O complexity O(N/(pDB)) for the following computational geometry/GIS problems: (a) 3-dimensional convex hull and planar Voronoi diagram (these results are probabilistic since the underlying CGM algorithms are probabilistic), (b) lower envelope of line segments (here, N denotes the size of the input plus output), (c) area of union of rectangles, (d) 3D-maxima, (e) nearest neighbour problem for planar point set, (f) weighted dominance counting for planar point set, (g) uni-directional and multi-directional separability.

4. We obtain parallel EM algorithms with I/O complexity O((N log N)/(pDB)) for the following computational geometry/GIS and graph problems: (a) trapezoidal decomposition, (b) triangulation, (c) segment tree construction, (d) batched planar point location.

5. We obtain parallel EM algorithms with I/O complexity O((N log v)/(pDB)) for the following computational geometry/GIS and graph problems: (a) list ranking, (b) Euler tour of a tree, (c) connected components, (d) spanning forest, (e) lowest common ancestor in a tree, (f) tree contraction, (g) expression tree evaluation, (h) open ear decomposition, (i) biconnected components.

6. In contrast to previous work, all of our methods are also scalable with respect to the number of processors.

The results for Items (2), (3) and (4) are subject to the conditions N = Ω(vDB), N ≥ v²B + v(v−1)/2, and N/v > v^δ, where δ > 1 is a constant that depends on the problem. The latter constraint arises in the CGM algorithm which we simulate. For the problems examined in this paper, δ ≤ 3.

Our results show that the EM-CGM is a good generic programming model that facilitates the design of I/O-efficient algorithms in the presence of multiple processors and multiple disks. It has relatively few parameters, generalizes the PDM model, and answers the challenge of [8].

Note that our results apply to a wide spectrum of parallel algorithms. For example, several well known simulation techniques map parallel algorithms designed for the PRAM model to the BSP model. Furthermore, by generating programs for a single processor computer from coarse grained parallel algorithms, our approach can also be used to control cache memory faults. This supports a suggestion of Vishkin [24, 25].

2 Single Processor Target Machine

In this section we describe a deterministic simulation technique that permits a CGM algorithm to be simulated as an external memory algorithm on a single processor target machine. We first consider the simulation of a single compound superstep and, in particular, how the contexts and messages of the virtual processors can be stored on disk, and retrieved efficiently in the next superstep. The management of the contexts is straightforward. Since we know the size of the contexts of the processors, we can distribute the contexts deterministically.

The main issue is how to organize the generated messages on the D disks so that they can be accessed using blocked and fully parallel I/O operations. This task is simpler if the messages have a fixed length. Although a CGM algorithm has the property that Θ(N/v) data is deemed to be exchanged by each processor in every superstep, there is no guarantee on the size of individual messages. Algorithm BalancedRouting gives us a technique for achieving fixed size messages.

Algorithm 1: BalancedRouting (from [4])

Input: Each of the v processors has n/v elements, which are divided into v messages, each of arbitrary length. Let msg_ij denote the message to be sent from processor i to processor j, and let |msg_ij| be the length of such a message.

Output: The v messages in each processor are delivered to their final destinations in two balanced rounds of communication, and each processor then contains at most h data.

Superstep A: For i = 0 to v−1 in parallel
  Processor i allocates v local bins, one for each processor
  Step 1: For j = 0 to v−1
            For ℓ = 0 to |msg_ij| − 1
              Processor i allocates the ℓ-th word of msg_ij to local bin (i + j + ℓ) mod v
  Step 2: For j = 0 to v−1
            Processor i sends bin j to processor j

Superstep B: For j = 0 to v−1 in parallel
  Step 3: Processor j reorganizes the messages it received in Step 2 into bins according to each element's final destination
  Step 4: Processor j routes the contents of bin k to processor k, for 0 ≤ k ≤ v−1

Observation 1 If bin_min is the smallest bin created at a processor in Step (1) of Superstep A, then the other v−1 bins can contain at most 1 + 2 + ... + (v−1) = v(v−1)/2 more elements than does bin_min.

Theorem 1 We are given v processors, and n data items. Each processor has exactly n/v data to be redistributed among the processors, and no processor is to be the recipient of more than h data. The redistribution can be accomplished in two communication rounds of balanced communication: (A) Messages in the first round are at least n/v² − (v−1)/2, and at most n/v² + (v−1)/2 in size, and (B) Messages in the second round are at least h/v − (v−1)/2, and at most h/v + (v−1)/2 in size.

Proof. The proof of the maximum message sizes is given in [4]. In the following, we give a proof for the minimum message sizes. In Superstep A each processor initially has n/v data. At the end of Superstep B, each processor will have at most h data.

First, we consider the minimum message size in Superstep A. Due to the round robin allocation mechanism, a given bin after Step 1 will contain at most one more element of a message to processor j than does any other bin. Let us fix any processor i. Consider the bin sizes after all of the messages have been distributed among the bins by processor i. Clearly, all of the bins will contain at least as many elements as the smallest bin, bin_min. Let e_j be the number of extra elements (more than this minimum) in bin j at Step 2. The crucial observation is that if bin_min is the smallest bin, then the other v−1 bins can hold at most 1 + 2 + ... + (v−1) = v(v−1)/2 extra elements. Thus, n/v = Σ_j |bin_j| = v·|bin_min| + Σ_j e_j. Since Σ_j e_j ≤ v(v−1)/2, we have |bin_min| ≥ n/v² − (v−1)/2.

We now turn to the message sizes in Superstep B. The elements which arrive at processor j as a result of Step 2 are the contents of the j-th local bins formed in Step 1 at each of the v processors 0 through v−1. We can think of the j-th local bin of each of the v processors as a component of a single, global superbin, which is the union of the j-th local bins of all v processors. Consider only the messages destined for a fixed processor k which are held by each processor i, 0 ≤ i ≤ v−1, prior to Step 1. These are allocated among the superbins, starting with superbin (i + k) mod v. Superbin j now contains the message which is to be sent from processor j to processor k in Step 4.

In a similar manner to the analysis of Superstep A, let E_j be the number of extra elements in superbin j after Step 1. Let sbin_min be the superbin which contains the minimum number of elements after Step 1, and hence |sbin_min| represents the minimum message size in Step 4. When processor k is one of the processors which receives the maximum h data elements, we have h = Σ_j |sbin_j| = v·|sbin_min| + Σ_j E_j, and since Σ_j E_j ≤ v(v−1)/2, we have |sbin_min| ≥ h/v − (v−1)/2. □

The notion of an h-relation is often used in the analysis of parallel algorithms based on BSP-like models (e.g. BSP, BSP*, CGM). An h-relation is a communication superstep in which each of the v processors sends and receives at most h data items. It is typically used in bounding the communication complexity in an asymptotic analysis. Based on this usage of an h-relation, we have:

Corollary 1 An arbitrary h-relation can be replaced by two "balanced" relations whose message sizes are bounded below by h/v − (v−1)/2 and above by h/v + (v−1)/2.

Lemma 1 An arbitrary minimum message size b_min can be assured provided that

    N ≥ v²·(b_min + (v−1)/2),    (1)

where N is the total number of problem items summed over the v processors.

Proof. From Corollary 1, we can achieve a minimum message size b_min provided that b_min ≤ N/v² − (v−1)/2. □

Now we look at the deterministic simulation of CGM algorithms as EM-CGM algorithms. Not every CGM algorithm will require balancing, but Lemma 2 ensures that we can obtain balanced message sizes when necessary by increasing the number of supersteps by a factor of 2.

Lemma 2 Let A be a CGM algorithm with N data, v processors, and λ communication steps. The λ communication steps of A can be replaced by 2λ steps of balanced communication in which the minimum message size is Ω(B) and the maximum message size is 2N/v², provided that N ≥ v²(B + (v−1)/2).

Proof. The minimum and maximum message sizes follow from Corollary 1, with h = N/v, and the constraint that (v−1)/2 ≤ N/v². This is true if N ≥ v²(v−1)/2, which is absorbed by (1) from Lemma 1. □

We will now turn to the actual simulation results, which rely on a message size of kN/v², for a known constant k ≥ 1. As we have seen, this is guaranteed by Lemma 2. Not every CGM algorithm will require Lemma 2.

Lemma 3 A compound superstep of a v-processor CGM algorithm A with computation time τ + L, communication time g·O(N/v) + L, message size kN/v² for a known constant k ≥ 1, and local memory size μ can be simulated in a compound superstep of a single processor EM-CGM algorithm in computation time vτ + O(vμ) and I/O time G·O(vμ/(DB)), provided that h = Θ(N/v), M = Ω(μ), D = O(N/(vB)), and B = O(N/v²).

The proof is given below, following the presentation and discussion of Algorithm 2. This algorithm expects the input messages to the virtual processors in the current superstep to be organized (by destination) in a parallel format on the disks, and it also writes the messages generated in the current superstep to the disks in a parallel format. We use the phrase "a parallel format" to mean an arrangement of the data that permits fully parallel access to the disks, both for writing the messages, and for reading them back in a different order in the next superstep. Two instances of a parallel format are the consecutive and staggered formats.

Consecutive format: We say that a disk read/write operation on D blocks is consecutive when the q-th block, 0 ≤ q ≤ D−1, is read/written from/to disk (d + q) mod D on track T_0 + ⌊(d + q)/D⌋, where T_0 is the track used for the first of the D blocks to be read/written, and d is the disk offset (from disk 0) for the first of the D blocks to be read/written.

Staggered format: We say that a disk read/write operation on D blocks involving n' messages, each of size at most b' blocks, is staggered when the q-th block, 0 ≤ q ≤ b'−1, of the j-th message, 0 ≤ j ≤ n'−1, is read/written from/to disk (d + jS + q) mod D on track T_0 + ⌊(d + jS + q)/D⌋, where T_0 is the track used for the 0-th block of the message, d is the disk offset (from disk 0) for the first of the D blocks to be read/written, and ⌈S/D⌉ is the number of tracks by which consecutive messages are to be staggered (separated).

Algorithm 2 simulates a compound superstep of a v-processor CGM on a single processor EM-CGM with D disks.

Algorithm 2: SeqCompoundSuperstep

Input: For each i ∈ {0, ..., v−1} the blocks of the context of virtual processor i are stored on the disks in consecutive format, and the arriving messages of virtual processor i are spread over the D disks in consecutive format.

Output: (i) The (changed) contexts of the simulated processors are spread across the disks in consecutive format. (ii) The generated messages for each processor in the next superstep are stored in consecutive format on the disks.

For i = 0 to v−1:
(a) Read the context of virtual processor i from the disks into memory.
(b) Read the packets received by virtual processor i from the disks.
(c) Simulate the local computation of virtual processor i.
(d) Write the packets which were sent by virtual processor i to the D disks in staggered format.
(e) Write the changed context of virtual processor i back to the D disks (in consecutive format).

We use the results of Lemma 2. Since messages are at most 2N/v² in size, we can allocate fixed sized slots for them on the disks while preserving an O(vμ) disk space requirement. In many practical EM situations 2N/v² will be a significant overestimate of the maximum message size, as v²B << N. The assurance of minimum message size Ω(B) implies that I/O operations will be blocked. (Footnote: Although a CGM algorithm may occasionally use smaller messages than B, it is charged for an h-relation in each superstep as if every processor sent and received h = N/v data.)

Details of Algorithm 2

Details of Steps (a) and (e): Since we know the size of the contexts of the processors, we can distribute the contexts deterministically. We reserve an area of total size vμ on the disks, vμ/(DB) blocks on each disk, where we store the contexts. We split the context V_j of virtual processor j into blocks of size B and store the i-th block of V_j on disk (i + j⌈μ/B⌉) mod D using track ⌊(i + j⌈μ/B⌉)/D⌋. Since the context of each processor is now in striped format on the disks, we can easily read and write the contexts using D disks in parallel for every I/O operation.

Details of Step (b): The previous compound superstep guaranteed that the blocks which contain the messages destined for the current processor are stored in consecutive format on the disks. Therefore, we can use a similar technique to fetch the messages as we used to fetch the contexts.

Details of Step (d): After the Computation Phase, all messages sent by the current virtual processor have been generated and stored in internal memory. The coarse-grained nature of the underlying BSP-like algorithm results in large messages (see Lemma 2) which are as long or longer than the block size B. We cut the messages into blocks of size B. Each block inherits the destination address from its original message. In rounds, we write the blocks out to the disks, as described in detail below.

Let b represent the maximum message size, and let b' represent the maximum number of disk blocks per message. Hence, b' = ⌈b/B⌉. Let msg_ij represent the message sent from processor i to processor j in one communication superstep. We will store the messages destined for a fixed processor j in standard consecutive format, beginning with msg_0j and ending with msg_{v−1,j}. We ensure that the first block of msg_{i+1,j} is assigned to disk (b_0 + b') mod D, for 0 ≤ i ≤ v−2, where b_0 is the disk number of the first block for msg_ij. In other words, the starting block positions for messages to consecutive processors are appropriately staggered on the disks to ensure that we can write blocks of messages to consecutively numbered processors in a single parallel I/O when b' mod D ≠ 0. Let T_j = ⌈jvb'/D⌉ be the track offset for msg_0j (the first such message destined for processor j). Let d_j = jvb' mod D be the disk offset (from disk 0) for the first block of msg_0j. The q-th block of msg_ij is assigned to disk (d_j + ib' + q) mod D on track T_j + ⌊(d_j + ib' + q)/D⌋. This storage scheme maintains what we will call the messaging matrix across the parallel disks. The messages destined to a particular virtual processor are stored in a band, or stripe, of consecutive parallel tracks.

Outgoing message blocks are placed in a FIFO queue for servicing by procedure DiskWrite. DiskWrite removes at most D blocks from the queue in each write cycle and writes them to the disks in a single write operation. Blocks are serviced strictly in FIFO order. Blocks will be added to the current write cycle and removed from the queue until a block is encountered whose disk number conflicts with that of an earlier block in the current write cycle.

Since the messages destined for any given processor are stored in consecutive format on the disks, we can read the messages received by a virtual processor using D disks in parallel for every I/O operation. Except possibly for the last, each parallel read performed by the simulation of processor j will obtain D message blocks. By staggering the first message blocks for consecutive virtual processors across the disks, we can achieve fully parallel writes to the disks.

The scheme just described requires two copies of the messaging matrix because the messages generated by virtual processor i in compound superstep k must be stored on disk before virtual processor i+1 can read the messages generated for it in compound superstep k−1. We can avoid this extra space requirement, however, as follows.

Observation 2 By alternating between {consecutive reads, staggered writes} and {staggered reads, consecutive writes} in successive compound supersteps, the simulation can achieve fully parallel I/O on all message blocks with a single copy of the messaging matrix.

We now begin with the proof of Lemma 3.

Algorithm SeqCompoundSuperstep loads each virtual processor into the real memory, requiring that M = Ω(μ). Since the messages sent or received in a superstep by a virtual processor are h = Θ(N/v) in total size, we require that DB = O(N/v) to ensure that our I/O scheme for messages is efficient. This means that D = O(N/(vB)).

Disk Space: The disk space needed by the simulation is the total context size O(vμ), which includes space for messages. At most one track is wasted for each virtual processor. The space used on each disk is O(vμ/(DB)), since μ = Ω(DB).

Computation Time: Steps (a) and (e) of algorithm SeqCompoundSuperstep require computation time O(vμ). In Steps (b) and (d), O(N/v) message data is routed for each virtual processor. Over all v processors, this adds O(N) computation time overall, which can be ignored. Step (c) consumes vτ computation time.

I/O Time: Steps (a) and (e) consume O(vμ/(DB))·G time, and Steps (b) and (d) consume O(N/(DB))·G time. Since Θ(N/v) message data is sent in each superstep, and μ = Ω(N/v), we have time O(vμ/(DB))·G due to I/O overall.

Thus, overall, the computation time is vτ + O(vμ) and the I/O time is O(vμ/(DB))·G. This concludes the proof of Lemma 3. □

Theorem 2 A v-processor CGM algorithm A with λ supersteps, local memory size μ, running time β + g·O(λN/v) + λL, and message size Θ(N/v²) can be simulated as a single processor EM-CGM algorithm A' with time vβ + O(λvμ) + G·O(λvμ/(DB)) for M = Θ(μ), B = O(N/v²), and N = Ω(vDB). In particular, algorithm A' is c-optimal if A is c-optimal, β = ω(λμ), and G = DB·o(β/(λμ)). Furthermore, algorithm A' is work-optimal and I/O-efficient if A is work-optimal and communication-efficient, β = Ω(λμ), and G = DB·O(β/(λμ)).

Proof Sketch. We use the results of Lemma 3. The computation time required to simulate the computation steps of A is vβ. The computational overhead associated with the I/O steps (Steps (a),(b),(d),(e)) is O(λvμ) + O(λN). Since vμ > N, the total computation time is bounded by vβ + O(λvμ). When c-optimality is required, we therefore need O(λvμ) = o(vβ), or β = ω(λμ). Note that when μ = Θ(N/v), we can substitute β = ω(λN/v) for β = ω(λμ). For work-optimality, we require that O(λvμ) = O(vβ), or β = Ω(λμ).

The I/O time (Steps (a),(b),(d),(e)) is G·[O(λvμ/(DB)) + O(λN/(DB))], which is bounded by G·O(λvμ/(DB)). For c-optimality, we require the I/O time to be in o(vβ), which means that G = DB·o(β/(λμ)). For I/O-efficiency, we require the I/O time to be in O(vβ), which means that G = DB·O(β/(λμ)). □
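The block placements defined by the consecutive and staggered formats can be stated as a pair of small functions (an illustrative Python sketch; the names consecutive_place and staggered_place are ours, not the paper's). Each maps a block to a (disk, track) pair following the definitions above:

```python
def consecutive_place(d, T0, q, D):
    """Block q (0 <= q <= D-1) of a consecutive read/write:
    disk (d+q) mod D, track T0 + (d+q) // D."""
    return ((d + q) % D, T0 + (d + q) // D)

def staggered_place(d, T0, j, q, S, D):
    """Block q of message j in a staggered read/write:
    disk (d + j*S + q) mod D, track T0 + (d + j*S + q) // D."""
    off = d + j * S + q
    return (off % D, T0 + off // D)
```

With D = 4 and stagger S = 1, a consecutive operation on D blocks touches all D disks, and the 0-th blocks of D consecutive messages also fall on D distinct disks, which is what makes the fully parallel writes of Step (d) possible.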

3 Multiple Processor Target Machine

For the case of p ≥ 1 processors on the EM-CGM machine we simulate a compound superstep of a CGM algorithm A using the algorithm ParCompoundSuperstep, shown below. Unlike in the case of a single real processor, we are now forced to perform real communication between the real processors of the target machine. Each real processor i, 0 ≤ i ≤ p−1, executes algorithm ParCompoundSuperstep in parallel. For ease of exposition, we assume that p divides v.

Algorithm 3: ParCompoundSuperstep

Objective: Simulation of a compound superstep of a v-processor CGM on a p-processor EM-CGM.

Input: The message and context blocks of the virtual processors are divided among the real processors and their local disks. Each real processor i, 0 ≤ i ≤ p−1, holds O(N/(pB)) blocks of messages and vμ/(pB) blocks of context, and each local disk contains O(N/(pDB)) blocks of messages and vμ/(pDB) blocks of context.

Output: The changed contexts and generated messages distributed as required for the next compound superstep.

For j = 0 to v/p − 1 do:
(a) Read the context for virtual processor i·(v/p) + j from the local disks.
(b) Read any message blocks addressed to virtual processor i·(v/p) + j from the local disks.
(c) Simulate the computation supersteps of virtual processor i·(v/p) + j, collecting all generated messages in the local internal memory.
(d) Send all generated messages to the required (real) destination processor. Upon arrival, the messages are arranged within the internal memory of the real destination processor and then written to its disks as in the single processor simulation; see Algorithm 2.
(e) Write the contexts for virtual processor i·(v/p) + j back to the local disks; see Algorithm 2.

Lemma 4 A compound superstep of a v-processor CGM algorithm A with computation time τ + L, communication time g·O(N/v) + L, message size Θ(N/v²), and local memory size μ can be simulated as v/p compound supersteps of a p-processor EM-CGM algorithm A' in parallel computation time (v/p)τ + O((v/p)μ) + (v/p)L and I/O time G·O((v/p)(μ/(DB))) + (v/p)L, for p ≤ v, N = Ω(vDB), and B = O(N/v²).

Proof. We have p ≤ v real processors, so the time to simulate Step (c) is (v/p)τ. Computational overhead is contributed by O((v/p)μ) + O(N/p), due to swapping of contexts (Steps (a),(e)) and messaging I/O (Steps (b),(d)). As before, the computational overhead is dominated by the cost of swapping contexts. The original compound superstep is replaced by v/p compound supersteps on the target machine.

The I/O time is determined by the cost of swapping contexts plus the cost of simulating the original messaging via I/O. These costs are G·O((v/p)(μ/(DB))) and G·O(N/(pDB)) respectively, and the total is dominated by G·O((v/p)(μ/(DB))). □

Theorem 3 A v-processor CGM algorithm A with λ supersteps, computation time β + λL, communication time gα + λL, local memory size μ, and message size Θ(N/v²) can be simulated as a p-processor EM-CGM algorithm A' with computation time (v/p)β + O(λ(v/p)μ) + λ(v/p)L, communication time (v/p)gα + λ(v/p)L, and I/O time G·O(λ(v/p)(μ/(DB))) + λ(v/p)L, for p ≤ v, M = Θ(μ), N = Ω(vDB), and B = O(N/v²).

Let g(N), L(N), and v(N) be increasing functions of N. If A is c-optimal on the CGM for g ≤ g(N), L ≤ L(N), and v ≤ v(N), then A' is a c-optimal EM-CGM algorithm for β = ω(λμ), g ≤ g(N), G = BD·o(β/(λμ)), and L ≤ (p/v)·L(N). A' is work-optimal, communication-efficient, and I/O-efficient if A is work-optimal and communication-efficient, β = Ω(λμ), g ≤ g(N), G = BD·O(β/(λμ)), and L ≤ (p/v)·L(N).

Proof Sketch. We use the results of Lemma 4. The computation time required to simulate the computation steps of A is (v/p)β. The computational overhead associated with the I/O and communication steps (Steps (a),(b),(d),(e)) is O(λ((v/p)μ + N/p)). Since μ = Ω(N/v), the total computation time is bounded by (v/p)β + O(λ(v/p)μ). When c-optimality is required, we need β = ω(λμ). Note that in many cases μ = Θ(N/v). Also, when only work-optimality is required, β = Ω(λμ) suffices.

The I/O time (Steps (a),(b),(d),(e)) is G·[O(λ(v/p)(μ/(DB)) + λN/(pDB))], which is bounded by G·O(λ(v/p)(μ/(DB))). For c-optimality, we require the I/O time to be in o((v/p)β), which means that G = DB·o(β/(λμ)). For I/O-efficiency we need only that G = DB·O(β/(λμ)). Since the number of supersteps increases by a factor of v/p, we require that L ≤ (p/v)·L(N). □

4 Conclusions and Future Work

The key result of this paper is a deterministic simulation theorem that maps algorithms for coarse grained parallel computing models to external memory algorithms for the parallel disk model with multiple processors. As a corollary we obtain external memory algorithms for a spectrum of problems, and all of these algorithms are scalable in terms of the number of disks and processors.

Our recent work involves replacing Algorithm 1 by another algorithm for achieving fixed size messages in such a way that the slackness condition for the problem size improves to N ≥ v^(1+ε)·B, where ε is a constant, 0 < ε ≤ 1. Moreover, this provides a better bound for M: M ≥ N^(1/(1+ε)). However, the number of supersteps increases by a factor of at most 2⌈1/ε⌉.

References

[1] AGGARWAL, A., AND VITTER, J. S. The Input/Output complexity of sorting and related problems. Communications of the ACM 31, 9 (1988), 1116–1127.

[2] ARGE, L. Efficient External-Memory Data Structures and Applications. PhD thesis, University of Aarhus, February/August 1996.

[3] ARGE, L., KNUDSEN, M., AND LARSEN, K. A general lower bound on the I/O-complexity of comparison-based algorithms. In Proc. Workshop on Algorithms and Data Structures, LNCS 709 (1993), pp. 83–94.

[4] BADER, D., HELMAN, D., AND JAJA, J. Practical parallel algorithms for personalized communication and integer sorting. Journal of Experimental Algorithmics 1 (1996). http://www.jea.acm.org/1996/BaderPersonalized/.

[5] CACERES, E., DEHNE, F., FERREIRA, A., FLOCCHINI, P., RIEPING, I., SANTORO, N., AND SONG, S. Efficient parallel graph algorithms for coarse grained multicomputers and BSP. In Proc. Int. Colloquium on Algorithms, Languages and Programming, LNCS 1256 (1997), pp. 390–400.

[6] CHIANG, Y.-J., GOODRICH, M. T., GROVE, E. F., TAMASSIA, R., VENGROFF, D. E., AND VITTER, J. S. External-memory graph algorithms. In Proc. ACM-SIAM Symp. on Discrete Algorithms (1995), pp. 139–149.

[7] CORMEN, T. H. Personal communication, 1997.

[8] CORMEN, T. H., AND GOODRICH, M. T. Position Statement, ACM Workshop on Strategic Directions in Computing Research: Working Group on Storage I/O for Large-Scale Computing. ACM Computing Surveys 28A(4) (Dec. 1996).

[9] CORMEN, T. H., SUNDQUIST, T., AND WISNIEWSKI, L. F. Asymptotically tight bounds for performing BMMC permutations on parallel disk systems. Tech. Rep. PCS-TR94-223, Dartmouth College Dept. of Computer Science, July 1994. To appear in SIAM J. on Computing.

[10] DEHNE, F., DENG, X., DYMOND, P., FABRI, A., AND KHOKHAR, A. A randomized parallel 3d convex hull algorithm for coarse grained multicomputers. In Proc. ACM Symp. on Parallel Algorithms and Architectures (1995), pp. 27–33.

[11] DEHNE, F., DITTRICH, W., AND HUTCHINSON, D. Efficient external memory algorithms by simulating coarse-grained parallel algorithms. In Proc. ACM Symp. on Parallel Algorithms and Architectures (1997), pp. 106–115.

[12] DEHNE, F., DITTRICH, W., HUTCHINSON, D., AND MAHESHWARI, A. Parallel virtual memory. 2-page abstract to appear in Proc. 10th ACM-SIAM Symp. on Discrete Algorithms (1999), Baltimore, USA.

[13] DEHNE, F., FABRI, A., AND RAU-CHAPLIN, A. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. ACM Annual Conference on Computational Geometry (1993), pp. 298–307.

[14] DITTRICH, W., HUTCHINSON, D., AND MAHESHWARI, A. Blocking in parallel multisearch problems. In Proc. ACM Symp. on Parallel Algorithms and Architectures (1998). (To appear).

[15] FLOYD, R. W. Permuting information in idealized two-level storage. In Complexity of Computer Calculations (1972), pp. 105–109. R. Miller and J. Thatcher, Eds. Plenum, New York.

[16] GERBESSIOTIS, A. V., AND VALIANT, L. G. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing (1994).

[17] GIBSON, G. A., VITTER, J. S., AND WILKES, J. Strategic directions in storage I/O issues in large-scale computing. ACM Computing Surveys 28(4) (Dec. 1996), 779–793.

[18] GOODRICH, M. Communication efficient parallel sorting. In Proc. ACM Symp. on Theory of Computation (1996).

[19] MEHLHORN, K. Personal communication, Dagstuhl seminar on data structures, 1998.

[20] NODINE, M. H., AND VITTER, J. S. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proc. ACM Symp. on Parallel Algorithms and Architectures (1993), pp. 120–129.

[21] SIBEYN, J., AND KAUFMANN, M. BSP-like external-memory computation. In Proc. 3rd Italian Conference on Algorithms and Complexity (1997).

[22] STEVENS, R. W. Advanced Programming in the Unix Environment. Addison-Wesley, Don Mills, Ont., Canada, 1995.

[23] VALIANT, L. G. A bridging model for parallel computation. Communications of the ACM 33, 8 (August 1990), 103–111.

[24] VISHKIN, U. Can parallel algorithms enhance serial implementations? Communications of the ACM 39(9) (1996), 88–91.

[25] VISHKIN, U. From algorithm parallelism to instruction-level parallelism: An encode-decode chain using prefix-sum. In Proc. ACM Symp. on Parallel Algorithms and Architectures (June 1997), pp. 260–270.

[26] VITTER, J. S. External memory algorithms. In Proc. ACM Symp. on Principles of Database Systems (1998), 119–128.

[27] VITTER, J. S. Personal communication, 1998.

[28] VITTER, J. S., AND SHRIVER, E. A. M. Algorithms for parallel memory, I: Two-level memories. Algorithmica 12, 2–3 (1994), 110–147.

[29] VITTER, J. S., AND SHRIVER, E. A. M. Algorithms for parallel memory, II: Hierarchical multilevel memories. Algorithmica 12, 2–3 (1994), 148–169.