Parallel Algorithms in External Memory

By

David Alexander Hutchinson

A thesis submitted to

the Faculty of Graduate Studies and Research

in partial fulfilment of

the requirements for the degree of

Doctor of Philosophy

Ottawa-Carleton Institute for Computer Science

School of Computer Science

Carleton University

Ottawa, Ontario

May

Includes revisions as of February

The original version of this thesis is available from University Microfilms International

© Copyright

David Alexander Hutchinson

Abstract

External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. The Parallel Disk Model (PDM) of Vitter and Shriver is widely used to discriminate between external memory algorithms on the basis of input/output (I/O) complexity. Parallel algorithms are designed to efficiently utilize the computing power of multiple processing units interconnected by a communication mechanism. A popular model for developing and analyzing parallel algorithms is the Bulk Synchronous Parallel (BSP) model due to Valiant.

In this work we develop simulation techniques, both randomized and deterministic, which produce efficient EM algorithms from efficient algorithms developed under BSP-like models. Our techniques can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine.

We propose new, more comprehensive models for EM and parallel algorithms which consider the total costs incurred by the algorithm, including computation, I/O, and communication. The new EM-BSP, EM-BSP*, and EM-CGM models combine the features of the BSP and PDM, and thereby answer a challenge posed by the ACM Working Group on Storage I/O for Large-Scale Computing.

We obtain parallel external memory algorithms for a large number of problems, including sorting, permutation, matrix transpose, geometric and GIS problems (including convex hulls and Voronoi diagrams), and various graph problems.

Acknowledgements

I would like to thank all of the people who have helped me to complete the work described in this thesis. Knowing that I cannot thank everyone who has helped me in my studies, and fearing that I would forget someone who has helped a great deal, I have procrastinated as usual in writing these acknowledgements. However, the time has come, so I hope that those who I have not listed will forgive me. I am sure that I will remember all of you tomorrow, after the publication deadline has passed.

I would like to thank my three advisors, Frank Dehne, Anil Maheshwari, and Jörg-Rüdiger Sack, for their patience, advice, and for their confidence that I would eventually finish this work.

In particular, I want to thank my initial advisor, Jörg-Rüdiger Sack, for taking a gamble on me, for the excellent infrastructure and financial support he provided, and for his critical advice, both on technical and non-technical issues.

I would like to thank Frank Dehne for his help and encouragement in writing papers, and for his unflagging enthusiasm for our results and my prospects.

To Anil Maheshwari I offer my gratitude for the many hours we spent discussing Computer Science and other things over the past several years, and for his unfailing willingness to do so on an impromptu basis. Without his day-to-day help, encouragement, and quiet confidence in our work, I would not have been successful.

I would like to express my appreciation to Wolfgang Dittrich for his collaboration on many nice results, and for showing me the power of mathematics in expressing ideas. It has been a real pleasure working with him.

To my external examiner, Abhiram Ranade, and the other members of my examination committee, Thomas Kunz, Ekow Otoo, and Ivan Stojmenovic, thanks are due for their hard work in reading this thesis and for the useful suggestions they made.

I would like to thank my other coauthors, Mark Lanthier, Doron Nussbaum, David Roytenberg, Radu Velicescu, and Norbert Zeh, and also Patrick Morin and Kumanan Yogaratnam, for many fruitful discussions and other cooperative work. I also would like to thank Lyudmil Alexandrov for many interesting discussions and for his suggestions for improving the algorithms in Chapter.

For their encouragement, comments, and questions which led me to new ideas, I would like to acknowledge the help of Lars Arge, Tom Cormen, and Jeff Vitter. To Lars, thanks also for a comprehensive BibTeX file on external memory algorithms. This was a great help to me.

I would like to thank the people of the School of Computer Science for the excellent working environment during my studies. Particular thanks go to Rosemary Carter, Barbara Coleman, Linda Pfeiffer, Maureen Sherman, and Marlene Wilson. Thanks also to the technical staff of SCS, who kept the machines running and put up with my many suggestions.

To all of my faculty, graduate, and undergraduate colleagues who I have worked with as part of the Paradigm Group, I offer my appreciation for your contributions to a great working environment. Particular thanks go to Mark Lanthier for organizing the coke fund, the pizza schedule, and the hockey games, and to Doron Nussbaum and Pat Morin for organizing the pizza talks.

Finally, I must acknowledge the help and support of a group of people that have made the crucial difference between success and failure for me. First, I would like to thank my parents for instilling the desire to stretch for new and wonderful things, and for their unconditional support in all of my endeavours. Lastly, I offer very special thanks to my wife Gail and our children Lauren, Julia, Steven, and Jessica, for their support and encouragement, for their cheerfulness, and for their love and affection. I could not have continued without your help.

David Hutchinson

Ottawa

May

This research was partially supported by Almerco Inc., by the Natural Sciences and Engineering Research Council of Canada, and by pTeran Computing Inc.

Contents

Abstract

Acknowledgements

Introduction

Overview

Motivation

What Makes a Good EM Algorithm

Effectiveness of EM Techniques

Fundamental Assumptions for Implementing Efficient I/O Algorithms

Basic Data Formats on Multiple Disks

Previous EM Research

A Brief Survey

External Memory Paradigms

Batch Filtering

Distribution Sweeping

Data Structuring

Simulation

Lower Bounds on I/O Complexity

Sorting Lower Bound

Permutation Lower Bound

Matrix Transpose Lower Bound

A General Lower Bounding Technique

Realistic Parameter Ranges

The Role of RAIDs

Storage Space

Preliminaries, Notation and Terminology

Asymptotic Notation

BSP Models

A Central Issue: Communication Bottlenecks

Papers Contributing to the Thesis

Claim to Originality or Contribution to Knowledge

Collaborations

Contributions

Organization of the Thesis

Models

Overview

Organization of This Chapter

EM Models

Aggarwal and Vitter Single Disk Model

The Parallel Disk Model (PDM)

Hierarchical Memory Models

Parallel Computation Models

The Parallel Random Access Memory (PRAM) Model

The Bulk Synchronous Parallel (BSP) Model

The Extended BSP (BSP*) Parallel Processing Model

The Coarse Grained Multicomputer (CGM) Model

Other Parallel Computing Models

Combining PDM and BSP-like Models

Objective: A More Comprehensive Model

The EM-BSP Model

System Model

Cost Model

The EM-BSP* Model

The EM-CGM Model

Goodness Criteria for Parallel EM Algorithms

Extending the c-Optimality Criterion

Alternative Goodness Criteria for EM-BSP Algorithms

EM-BSP Algorithms from Simulation

Overview of the Method

Overview and Rationale

Randomized Simulation

Single Processor Target Machine

Multiple Processor Target Machine

Deterministic Simulation

Introduction

Balancing Communication

Single Processor Target Machine

Multiple Processor Target Machine

Summary

Deterministic Simulation with Lower Slackness

Motivation

Overview

The LowSlackRouting

The Parallel Algorithm NumberBlocks

Analysis of LowSlackRouting

The EM Algorithm EMLowSlackRouting

The EM Algorithm EMNumberBlocks

Analysis of EMLowSlackRouting

Single Processor Target Machine

Multiple Processor Target Machine

Problems of Geometrically Changing Size

Introduction

Reducing I/O Overheads

New EM Algorithms

Overview

Apparent Reductions in I/O Complexity

Fundamental Problems

Sorting

Permutation

Matrix Transpose

Summary

The Multisearch Problem

Overview

Multisearch on the EM-CGM Model

EM-BSP Multisearch

EM-BSP Multisearch on a Small Tree

EM-BSP Multisearch on a Large Tree

Summary

Other New External Memory Algorithms

Experiments

Buffer Tree Case Study

Introduction

Description of the Buffer Tree

Implementation of the Buffer Tree

Evaluating the Implementation

Conclusions and Lessons Learned

Simulation Case Study

Introduction and Motivation

An EM Runtime System for Parallel Algorithms

One Real Processor

Multiple Real Processors

A Deterministic EM-CGM Samplesort

Experimental Results

Discussion and Conclusions

Conclusions and Future Work

A Useful Probabilistic Results

A Tail Estimates

A Balancing Lemmas

A Summary of Notation and Index

Bibliography

List of Figures

A Disk Drive

Layout of Disk Blocks on a Single Processor Machine

Layout of Disk Blocks on a Multiple Processor Machine

The Parallel Disk Model with One Processor

The Parallel Disk Model with Multiple Processors

A BSP Computation

A Parallel Machine with External Memory

The memory of a virtual processor

Reorganization in Step of SimulateRouting

Illustration of maximum imbalance in the bin sizes of a processor during superstep A

Illustration of the Messaging Matrix

The division of the |S| processors into v groups

The redistribution of blocks for a given bucket

The tree of processors which numbers the blocks in bucket

The tree for the numbering of blocks in bucket

Imbalance in Item Assignment to Processors

The assignment of messages to local disks on a real destination processor

Imbalance in Data Received by the Real Processors

Overview of New EM Algorithms

Multisearch example

Overview of New EM Algorithms (continued)

Overview of New EM Algorithms (continued)

The buffer tree

Timings for Internal Quicksort and Buffer Tree Sort on Random Inputs

Timings for Buffer Tree Sort and TPIE

Number of Block Pushes for Buffer Tree Sort on Random Inputs

Timings for Buffer Tree Sort with Smaller Fanout

Timings for Buffer Tree Sort with Larger Fanout

Pipelining sort with generation of inputs

Running Times of PDM and Iostream I/O Access Methods

Stevens' measurements on the effects of varying the block size

Organization of EM Library Services

System Design of the External Memory Library

Running times for CGM SampleSort and EM-CGM SampleSort

Chapter

Introduction

Overview

Motivation

In recent years, computer processor speeds have increased dramatically, computer storage device capacities have increased, prices have dropped, and applications have grown in their demands on storage resources. Disk speeds have not kept pace with main memory speeds, however. The bottleneck between main memory and external memory (EM) devices such as disks threatens to limit the size of very large scale applications. While internal memories have also become much more reliable, cheaper, and larger, we see many new and upsized applications that demand more storage than is feasible through internal memories. The area of external memory (EM) algorithm design aims to create algorithms which perform efficiently when the internal memory of the computer is significantly smaller than the storage required for the application data. Such applications require the bulk of their data to reside in external memory, since for practical machines main memory is only a fraction of the required size.

Applications in geographic information systems (GIS), astrophysics, virtual reality, computational biology, VLSI design, weather prediction, computerized medical treatment, simulation and modelling, visualization, and computational geometry fall naturally into this category. Parallel disk I/O and parallel computing have been identified as critical components of a suitable high performance computer for a number of the Grand Challenge problems; see for instance the Scalable I/O Initiative project.

The time required to perform an I/O operation is typically several orders of magnitude larger than the time for an internal CPU operation. In addition, the time to set up the data transfer (disk head movement, rotational delay) is much larger

CHAPTER INTRODUCTION

Input: We are given a list of N items to be permuted, where N is much larger than the number of items M that fit into the internal memory of the computer. The permutation to be applied is given as a list of N values.

Output: The permuted list of N items is to be output.

Assumptions: We assume that N is items (GB), M is items, the average time for a disk access is seconds, and the disk block size B is items.

The inputs, the specified permutation, and the resulting output list must all be stored on disk (external memory). In internal memory the output can be computed in O(N) time. If we apply the same approach to our problem, however, we expend O(N) I/Os in the worst case, or I/O time of c1 · N seconds for some constant c1. However, it has been shown that permutation in external memory requires only O((N/B) log_{M/B}(N/B)) I/Os, giving a time of c2 · (N/B) log_{M/B}(N/B) seconds for some constant c2. So, assuming similar constants c1 and c2, we have the EM algorithm for permutation running about B times faster than the internal memory approach. This illustrates the most important factor in a good EM algorithm, which is taking full advantage of the disk blocking factor B.

Example: Permutation in External Memory
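The I/O counts in the example above can be compared numerically. Since the example's actual parameter values are not reproduced here, the figures below (N, M, B, and the per-access time t) are hypothetical stand-ins, chosen only to illustrate the roughly B-fold advantage of the blocked approach; a minimal sketch:

```python
import math

def naive_ios(N):
    # Worst case for the internal-memory approach run on disk: one I/O per item.
    return N

def em_permute_ios(N, M, B):
    # EM permutation bound from the example: (N/B) * log_{M/B}(N/B) I/Os.
    return (N / B) * math.log(N / B, M / B)

# Hypothetical parameters (the thesis's actual values are not shown here).
N = 10**9      # problem size in items
M = 10**6      # items fitting in internal memory
B = 10**3      # items per disk block
t = 0.01       # assumed seconds per disk access

speedup = naive_ios(N) / em_permute_ios(N, M, B)
print(f"naive I/O time:   {naive_ios(N) * t:.0f} s")
print(f"blocked I/O time: {em_permute_ios(N, M, B) * t:.0f} s")
print(f"speedup: {speedup:.0f}x  (B = {B}, "
      f"log factor = {math.log(N / B, M / B):.1f})")
```

With these numbers the logarithmic factor is small (here 2), so the speedup is about B divided by a small constant, consistent with the "about B times faster" claim in the example.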

than the time to actually transfer the data. Data items are grouped into blocks and accessed blockwise by efficient external memory algorithms, in order to amortize the setup time over a large number of data items. Algorithms developed originally for internal memory models, such as the RAM and PRAM models, are frequently found to be inefficient in the EM environment for this reason. Such algorithms typically have poor locality of reference, causing too many I/O operations to be performed during the execution of an application.
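The amortization argument above can be made concrete. With a setup time s (head movement plus rotational delay) and a per-item transfer time t, one blocked I/O of B items costs s + B·t, so the cost per item is s/B + t. The timings below are hypothetical, chosen only to show how a larger B dilutes the setup cost:

```python
def per_item_cost(s, t, B):
    # One blocked I/O costs s + B*t; amortized over the B items
    # it moves, each item pays s/B + t.
    return (s + B * t) / B

# Hypothetical timings: 10 ms setup, 1 microsecond per item transferred.
s, t = 10e-3, 1e-6
for B in (1, 100, 10_000):
    print(f"B = {B:6d}: {per_item_cost(s, t, B) * 1e6:10.1f} us per item")
```

As B grows, the per-item cost falls from being dominated by the setup time toward the raw transfer time t, which is why EM algorithms strive to make every disk access move a full block of useful data.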

Prior to the evolution of a distinct EM methodology, large problems were often handled via virtual memory techniques, which used demand paging to swap pages between disk and internal memory. Unfortunately, such techniques are heuristic, and the performance can be very poor in difficult cases. As motivation for the study of external memory algorithms, the simple problem of permuting N numbers is offered as an example; see the example above.

As a result, researchers have developed special algorithms optimized for EM computation. In many cases these have been hand-crafted, so the choices for an application developer have been somewhat limited as a result. One objective of the current


work, therefore, is to identify ways of transferring classes of existing algorithms to the EM domain, and thereby take advantage of existing work for this new purpose. Our focus is on algorithms which are provably optimal by some well-defined criterion.

What Makes a Good EM Algorithm

Informally, a good EM algorithm makes full use of the resources at its disposal. This includes consideration of the following factors:

the disk blocking factor B, as described above;

the available internal memory, captured by the parameter M, the number of data items that fit into internal memory;

the ability to perform I/O concurrently to all D available disk drives;

full utilization of all p available processors; and

efficient use of the communication bandwidth available between processors.

Most existing EM algorithms are based on what has become known as the Parallel Disk Model (PDM), proposed originally by Vitter and Shriver. For optimality, this model requires that EM algorithms use asymptotically the minimum number of disk operations required to solve a problem in the worst case. We will refer to this as the I/O-optimality criterion. Of course, the running time of an algorithm is a less abstract and more directly relevant issue. While I/O can be the most costly component of an external memory computation if it is not done efficiently, other factors such as computation and communication time can be equally dominant if their costs are not controlled. We will describe new work towards an alternative criterion which includes other costs, such as computation time and, in the case of parallel processors, the communication time required by an algorithm.

There are interesting similarities between parallel algorithms for certain kinds of parallel models and EM algorithms. We will outline a class of parallel algorithms that also lend themselves to use or simulation as efficient external memory algorithms. Parallel algorithms break a problem into subproblems on which computations can proceed independently for some period of time before synchronization (i.e., communication) is required between the processors. Between synchronization events, references to data by a particular processor are restricted to data items currently residing in the local memory of that processor. This is possible because the subproblems in a given parallel step are independent.

In the case of fine grained parallel algorithms, the processors may proceed independently for only one machine instruction before data is transferred between them.


Such a situation may occur with algorithms based on the various PRAM models, for instance. In contrast, parallel computation models such as the Coarse Grained Multicomputer (CGM) and Extended Bulk Synchronous Parallel (BSP*) provide incentives for coarser granularity in the computation, encouraging a large number of machine instructions to be executed by each processor before interprocessor communication occurs. In this thesis we distinguish between two types of granularity: granularity of the computation, as just described, and granularity of the communication that occurs between processors. We find that the attribute of coarse grained communication in an algorithm allows us to model the communication between processors in the parallel algorithm directly as blocked I/O to and from external memory in a corresponding EM algorithm. The presence of coarse grained computation limits the number of synchronization steps required overall, and therefore limits the number of context swaps we need to do when simulating an algorithm. Coarse grained computation is also known in the parallel computing literature as large slackness. The degree of slackness of an algorithm is the ratio N/p, where N is the problem size and p is the number of processors.

Previous EM research has focussed primarily on the sequential RAM model. Only limited exploration of parallel computing has been done. However, the domain of EM algorithms is tied to large problems, and so the applicability of parallel algorithms is clear. The relevance and timeliness of combining EM and parallel BSP-like models is highlighted by the recent publication of a challenge to this effect. In this position statement, Cormen and Goodrich make note of the similarities between parallel and EM computing, and the desirability of creating a parallel EM model based on a BSP-like model and the Parallel Disk Model (PDM).

Effectiveness of EM Techniques

Few implementation studies have been done to determine the performance of EM algorithms developed over the past years. The few studies that have been done confirm the effectiveness of these techniques in practice. A study by Chiang reported that selected EM algorithms for computational geometry outperformed conventional techniques for large enough problem sizes, and were steady and efficient over the range of problem sizes tested. Cormen and Nicol reported large FFT calculations based on EM techniques outperforming their internal memory counterparts by a factor in practice on a single disk system, and that with disks the EM algorithms outperformed their internal memory counterparts by a factor.


Fundamental Assumptions for Implementing Efficient I/O Algorithms

Algorithms which are designed to control their I/O costs are sometimes said to be I/O-aware. Many conventional operating system services are geared towards optimizing performance for applications which are not I/O-aware. This can be counterproductive when an I/O-aware program is running. The literature on optimal EM algorithms describes many algorithms which control the details of their own I/O operations, and implicitly require that the computer operating system "get out of the way".

A declustering technique is a method by which a file of data is spread over multiple disks, either on a single processor or on a multiple processor system. Two components of such a method are the striping unit (the number of bytes, or perhaps problem items, accessed on each disk by a single parallel I/O operation) and the way in which such units are distributed over the disks. To be effective, I/O-aware programs require the following unusual abilities (Cormen and Kotz):

control over declustering; the optimal algorithms control declustering explicitly;

control over the size of the striping unit; the optimal algorithms assume that the striping unit is one block;

the ability to query the system about the configuration (number of disks, block size, number of processors, amount of physical memory, current declustering method);

control over the offset on each disk, and the ability to bypass parity operations for independent I/O (e.g., RAID; see Section); and

the ability to turn off the operating system's I/O caching and prefetching behaviours; the optimal algorithms do their own caching and prefetching.

Naturally, for benchmarking work to be accurate, it is also necessary to control the resources consumed by system tasks and other users. The implementation work of Chiang illustrated the difficulties in measuring I/O counts when the operating system was allowed to manage memory and I/O as if the program was not I/O-aware. Our own implementation work (see also Chapter) also indicates that these issues can be difficult to address properly. Item in particular proves to be a difficult one.


Figure: A Disk Drive (read/write heads, discs)

Basic Data Formats on Multiple Disks

In this section we briefly discuss the format of data on a disk drive and the access patterns which we will use.

A disk drive can be thought of as a stack of one or more rotating discs, or platters, as illustrated in Figure. On each surface of each disc, an equal number of concentric circular disk tracks are recorded. These are usually broken down further into sectors, which can be thought of as the basic units of data storage, typically bytes in size. Because tracks at the outer edges of a disc are longer than tracks near the centre, the number of sectors per track varies. Operating systems usually aggregate a number of sectors to form a disk storage block. Data is recorded onto or read from a disk drive via a set of moveable read/write heads, one per recordable surface. The heads typically move in unison, and are therefore restricted to accessing the same track on each of the discs. The tracks accessible by the heads at a particular position collectively form a cylinder. The time to access data on a disk drive is made up of rotational delay, or the time for the disk to rotate to the necessary position, plus the seek time, which is the time needed to move the heads to the appropriate cylinder.

Typical numbers for a disk drive at time of writing are exemplified by the Seagate Barracuda STFC Fibre Channel disk. Selected parameters are listed below:

FORMATTED CAPACITY (GB)

AVERAGE SECTORS PER TRACK (rounded down)

ACTUATOR TYPE: ROTARY VOICE COIL

TRACKS

CYLINDERS (user)

(Courtesy of Seagate Technology Inc., http://www.seagate.com/support)


HEADS (PHYSICAL)

DISCS

SPINDLE SPEED (RPM)

AVERAGE LATENCY (msec)

SECTORS PER DRIVE

AVERAGE ACCESS (ms, read/write; drive level, with controller overhead)

SINGLE TRACK SEEK (ms, read/write)

MAX FULL SEEK (ms)

In the example, a sector is about bytes, the time for a complete revolution of the disk is ms, and the average time to access a sector, including both rotational delay and seek time, is ms.

We will use a simplified model of a disk drive. We assume that there is only a single disc with a single recordable surface, and therefore a single head. Our disk storage block (we often simply use the term block) consumes an entire track of a disk, and therefore we think of track offset and block number on a disk as synonymous.

When multiple disks are present on a single processor, we get the situation illustrated in Figure. Data storage is similar to a matrix with three dimensions: (offset within a block, block offset within a disk, disk number). In the example, index one varies most quickly, followed by index three, and finally by index two.

When D disks are present on a single processor, we will use the following terms, illustrated via Figure. A parallel I/O is a read or write operation in which O(D) blocks are accessed simultaneously, one from each disk. A stripe is a set of D blocks, one per disk, stored at the same track offset on each disk. A striped I/O is a parallel I/O which accesses the blocks of a single stripe. An independent I/O is a parallel I/O in which the tracks accessed may differ from disk to disk. A consecutive I/O is a parallel I/O which accesses items with consecutive labels when the data is laid out as in Figure. A staggered I/O is a parallel I/O in which the block offset changes by a fixed amount between each pair of consecutive disks.

We will use the terminology consecutive format to mean that external input or output data is, or can be, accessed using consecutive I/Os. We use the terms striped format, staggered format, and independent format in a similar manner.

When multiple disks are present on multiple processors, we get the situation illustrated in Figure. Data storage is now similar to a matrix with four dimensions: (offset within a block, block offset within a disk, disk number, processor number). However, in all of our algorithms we develop independent subproblems at each processor, and a global ordering of items as in Figure is not typically required, except perhaps as a final step.


Disk number:     1      2      3      4      5      6      7      8
Stripe 0:      1,2    3,4    5,6    7,8   9,10  11,12  13,14  15,16
Stripe 1:    17,18  19,20  21,22  23,24  25,26  27,28  29,30  31,32
Stripe 2:    33,34  35,36  37,38  39,40  41,42  43,44  45,46  47,48
Stripe 3:    49,50  51,52  53,54  55,56  57,58  59,60  61,62  63,64

Figure: Layout of Disk Blocks on a Single Processor Machine

Processor number:           0                            1
Disk number:     0      1      2      3      0      1      2      3
Stripe 0:      1,2    3,4    5,6    7,8   9,10  11,12  13,14  15,16
Stripe 1:    17,18  19,20  21,22  23,24  25,26  27,28  29,30  31,32
Stripe 2:    33,34  35,36  37,38  39,40  41,42  43,44  45,46  47,48
Stripe 3:    49,50  51,52  53,54  55,56  57,58  59,60  61,62  63,64

Figure: Layout of Disk Blocks on a Multiple Processor Machine
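The consecutive layout shown in the single-processor figure (items labelled from 1, blocks of B items, D disks, stripes filled in round-robin order) can be expressed as a mapping from an item's label to its location. The sketch below is illustrative only; it assumes the single-processor layout with B = 2 and D = 8, matching the figure:

```python
def item_location(label, B, D):
    """Map a 1-based item label to (stripe, disk, offset) under the
    consecutive layout: offset within a block varies fastest, then
    disk number, then stripe (block offset within a disk)."""
    block = (label - 1) // B          # global block number, 0-based
    return (block // D,               # stripe: block offset on its disk
            block % D + 1,            # disk number, 1-based as in the figure
            (label - 1) % B)          # offset within the block

# Check a few entries against the single-processor figure (B = 2, D = 8).
assert item_location(1, 2, 8)  == (0, 1, 0)   # stripe 0, disk 1, first slot
assert item_location(16, 2, 8) == (0, 8, 1)   # end of stripe 0, disk 8
assert item_location(17, 2, 8) == (1, 1, 0)   # wraps to stripe 1, disk 1
assert item_location(64, 2, 8) == (3, 8, 1)   # last item in the figure
```

A consecutive I/O then corresponds to reading one full stripe: the D blocks at a common track offset, holding B·D consecutively labelled items.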


Previous EM Research

The development of external memory algorithms can be viewed as extending algorithm design to the case where internal memory is insufficient to contain the entire problem. Since many of the analyses that have been presented for algorithms in computer science have used asymptotic arguments that assume unit cost access for every problem item (i.e., a uniform random access memory model), there are many diverse areas where EM considerations can, and perhaps should, be applied in order to update these analyses. The number of papers dealing with EM issues has therefore been growing rapidly over the past few years, and it is difficult to provide a comprehensive survey. Vitter currently maintains the most definitive survey on EM issues, and updates it periodically as new results are published.

A Brief Survey

We list here selected research in EM with some relevance to the focus of this thesis. Some of the discussion is repeated in subsequent sections which deal with specific external memory topics.

Floyd is credited with some of the earliest work towards optimal EM algorithms. He studied matrix transposition and permutation algorithms in a paged memory with a limited set of basic operations.

Aggarwal and Vitter studied sorting on a model which involved a single disk which could read or write D blocks in one operation. They showed matching upper and lower bounds for the number of I/O operations required for sorting, FFT, matrix transpose, and permutation under this model. Their model did not constrain the locations from which the D blocks were to be accessed. In other words, the blocks could be from consecutive block locations on the disk, or they could be widely separated. In this respect the model was unrealistic, in that no physical disk system allows simultaneous transfer of D blocks under those circumstances.

Vitter and Shriver considered a more realistic model, where each of D parallel disks could simultaneously read or write a single block in a single parallel I/O operation (see Figure). This model has become known as the Parallel Disk Model (PDM). Since the previous model of Aggarwal and Vitter was more permissive, their lower bounds carry over to the new model. Vitter and Shriver concentrated on randomized algorithms relating to sorting and permutation, and introduced the first optimal randomized algorithms for sorting and related problems in EM on the PDM.

They also extended their work from two-level memories to include multilevel memory hierarchies. These papers presented some of the earliest work on parallel processing for EM. They assumed an interconnection network that could sort the records in the internal memories of the p processors in randomized logarithmic time, O((M/P) log M), where M is the total size of the internal memories and P is the number of processors. Examples include PRAM-like interconnection networks, cube-connected cycles, and the hypercube. See Section for a description of the Parallel Disk Model.

A number of models have been proposed for multilayer memories: the Hierarchical Memory Model (HMM), the Block Transfer Model (BTM), and the Uniform Memory Model (UMM). These models have not been heavily used, perhaps due to their complexity.

Cormen, and Cormen, Sundquist, and Wisniewski, followed up on an observation of Aggarwal and Vitter that certain types of permutations in EM can be done faster than the general permutation case. Cormen identified a number of such permutations, including matrix transpose, address bit reversal, hypercube routing, vector reversal, perfect shuffle, inverse perfect shuffle, gray code, and inverse gray code computations. He showed that certain fundamental operations, such as complementing and circular shifting of address bits, were common to these problems, and that they could be done faster than general permutations in EM.

Nodine and Vitter studied deterministic sorting algorithms for EM, in both distribution sort and merge sort flavours. They also reported I/O-optimal deterministic sorting algorithms for parallel memory hierarchies. These algorithms are also simultaneously optimal in terms of internal computation for single processors and for the PRAM. For non-PRAM interconnection networks, such as hypercube and cube-connected cycles, only the randomized algorithms of are simultaneously optimal in both I/O and computation. A summary of randomized and deterministic EM sorting techniques is presented in.

Goodrich, Tsay, Vengroff, and Vitter sketched EM algorithms for a large number of problems in computational geometry, and proposed several useful paradigms, such as distribution sweeping and batch filtering, for designing efficient EM algorithms (see Section for more information). They focussed on the two-layer, single processor version of the PDM, but also included a short discussion of the applicability of multiprocessor and multilayer models to their work. Many of the computational geometry algorithms proposed in are listed in Section, where they are compared to new EM algorithms developed using the methods presented in this thesis.

Chiang, Goodrich, Grove, Tamassia, Vengroff, and Vitter presented EM versions of many useful algorithms related to graphs. New EM paradigms called time-forward processing and PRAM simulation are introduced (see Section for a description of PRAM simulation). A technique for deriving lower bounds for EM problems is also described. A number of the graph algorithms proposed in are listed in Section, where they are compared to new EM algorithms developed using the methods presented in this thesis.

(M is the total size of the internal memories and P is the number of processors.)

Motivated by the goal of constructing I/O-efficient versions of commonly used internal memory data structures, Arge proposed the data structuring paradigm, and in particular the buffer tree. A buffer tree is an external memory search tree based on the (a,b)-tree, which is a generalization of the B-tree. It allows several update operations, such as insert, delete, search, and delete-min, and it enables the transformation of a class of internal-memory algorithms to external memory algorithms by exchanging the data structures used. A large number of external memory algorithms have been proposed using the buffer tree data structure, including sorting, priority queues, range trees, segment trees, and time-forward processing. These in turn are subroutines for many external memory graph algorithms, such as expression tree evaluation, centroid decomposition, least common ancestor, minimum spanning trees, and ear decomposition. There are a number of major advantages of the buffer tree approach. It applies to a large class of problems whose solutions use search trees as the underlying data structure. This enables the use of many normal internal memory algorithms, and hides the I/O-specific parts of the technique in the data structures. Several techniques based on the buffer tree (e.g. time-forward processing) are simpler than competitive EM techniques, and are of the same I/O complexity or better with respect to their counterparts.

The major innovation of the buffer tree is the addition of buffers to each internal node of the tree, together with lazy insert, delete, and query strategies. The basic idea is that requests (query, insert, delete) accumulate in the buffer of an internal node until it is worthwhile to perform the necessary I/O operations to move them to the next level of the tree. See Section for a more detailed description of the buffer tree.

Using the buffer tree, Arge et al. presented I/O-optimal algorithms for a number of problems related to GIS, including map overlay, red-blue line segment intersection, trapezoidal decomposition, triangulation of simple polygons, and planar point location. General techniques presented include an external memory version of fractional cascading using the buffer tree, and an extended external segment tree.

Vengroff and Vitter describe external memory algorithms for three dimensional range searching and query processing.

Chiang studied the performance of algorithms to report the intersections of orthogonal line segments. The study compared the optimal EM algorithm reported in with various versions of a similar internal memory algorithm which used balanced trees. Various memory sizes and distributions of the data features were tried. The conclusion was that the EM algorithm was both efficient and reliable, i.e. it handled all of the cases in an acceptable fashion, and significantly outperformed the other techniques when the internal memory was small relative to the problem size. He used some interesting techniques for generating his test data, and his tests highlighted some of the difficulties with running I/O-aware algorithms on an operating system that is designed for non-I/O-aware applications.

TPIE (Transparent Parallel IO Interface) is a library of I/O-aware functions and services designed for building I/O-efficient applications (see Vengroff, and Vengroff and Vitter). Many of the I/O-optimal algorithms and techniques in the literature are included. At the time of writing, the buffer tree and related applications are exceptions.

Barve, Grove, and Vitter examined the practicality of EM sorting techniques and proposed a randomized sorting technique that is I/O-optimal, yet simpler in practice than the previous EM sorting methods summarized in.

ViC* (Colvin and Cormen) is a language for coding data parallel programs with out-of-core (external) variables. The ViC* compiler provides the necessary input/output calls to allow a user program to manipulate huge data objects without explicitly coding the I/O requests. The compiler makes use of multiple disks and its knowledge of program structure to reduce the I/O costs borne by the program.

Cormen and Hirschl reported the implementation of, and timing results for, external memory radix sort and BMMC permutations on a single processor with multiple disks.

Cormen and Nicol reported on the development of I/O-optimal FFT applications on a single processor system with one or multiple disks, and Cormen, Wegman, and Nicol implemented FFT calculations on multiple processors with multiple disks.

External Memory Paradigms

A number of general techniques, or paradigms, for obtaining efficient EM algorithms have been identified by various researchers in EM. In this section we describe several which are relevant to the research proposed here.

Batch Filtering

Batch filtering is a technique first described by Goodrich et al. Given O(N) queries and a layered decision DAG (directed acyclic graph) of size O(N) stored on disk, batch filtering permits the queries to be answered in an optimal O(N/B) I/O operations. Batch filtering can be viewed as a technique for performing multisearch (see Section) for N queries on a DAG of size O(N) on a single processor.

We assume a single processor machine with internal memory size M, for M ≥ B. A decision DAG with a single source node is stored in external memory, level by level, in consecutive format. The queries are partially sorted in order of the successor node they wish to visit on the current level of the graph. Since there is a single source node of the DAG, the queries can initially be in arbitrary order. The algorithm reads the first block of the current level of the DAG (e.g. a multiway node of size B) and the first block of queries for this node from the disk. The queries destined for this node are processed sequentially, block by block. Each query can be directed in one of B possible ways by the current node, and the algorithm keeps B corresponding block buffers in memory. Each block buffer represents the head of a list of output blocks destined for a given successor node on the next level of the graph. Query blocks are read as necessary until a query requires a different graph node. Full output blocks are written to disks and concatenated to the end of the appropriate list, depending on their destination. When the first query with a different successor is encountered, we are done with the current graph node and can read the next one. This can continue until the level is finished. For the next level, the output lists are concatenated to form the input for the next round.
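The per-level routing step can be sketched as follows. This toy Python model (all names are invented for illustration) ignores blocks and disks entirely, and only shows how queries, already grouped by the node they must visit, are regrouped by successor to form the input for the next round:

```python
# Toy model of one batch-filtering round: the queries for each node of
# the current DAG level are routed to that node's successors, and the
# routed queries are collected in per-successor lists (standing in for
# the B block buffers the real algorithm keeps in memory).

def filter_level(queries_by_node, route):
    """queries_by_node: {node: [query, ...]} for the current level.
    route(node, query) -> successor node on the next level.
    Returns the queries regrouped by successor, i.e. the input for
    the next round."""
    next_level = {}
    for node, queries in queries_by_node.items():  # one node at a time
        for q in queries:                          # its queries, in order
            succ = route(node, q)
            next_level.setdefault(succ, []).append(q)
    return next_level

# A tiny 2-way "decision DAG": the node at depth d sends a query left or
# right according to bit d of the query value.
def route(node, q):
    _, depth = node
    return ('L' if (q >> depth) & 1 == 0 else 'R', depth + 1)

level0 = {('src', 0): [0, 1, 2, 3]}    # single source node: any order works
level1 = filter_level(level0, route)   # {('L', 1): [0, 2], ('R', 1): [1, 3]}
```

In the real algorithm each of these lists lives on disk as a chain of size-B blocks, and only one block per successor (plus the current node and query block) is held in memory; that is what bounds the cost at O(N/B) I/Os per level.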

Distribution Sweeping

The Distribution Sweeping technique was introduced by Goodrich et al. The input for a geometric problem (e.g. orthogonal line intersection) is stored in external memory. The technique first divides the input into O(M/B) slabs, i.e. it is typically sorted along one dimension of the data and divided evenly into O(M/B) buckets. The slabs are scanned (one block from each slab can simultaneously exist in memory) along a second dimension of the data. This typically requires a second sort, to arrange the data elements within each slab in increasing order by the second dimension. The algorithm then solves or processes those subproblems which consist of interactions between slabs. More concretely, this amounts to processing data elements that span a slab. When the scan is finished, all of the inter-slab interactions have been processed. The algorithm then proceeds recursively on each slab.

This technique has been used as the basis for an I/O-optimal EM algorithm to report all intersections of orthogonal line segments. Implementation and performance of the orthogonal line intersection algorithm are reported in.
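The recursion skeleton, with the problem-specific inter-slab step left abstract, might look like the following Python sketch (not drawn from the thesis; `handle_interactions` is a placeholder for, e.g., the orthogonal-intersection sweep):

```python
import math

def distribution_sweep(items, m, handle_interactions):
    """Split `items` (assumed sorted by the first dimension) into m slabs,
    let the problem-specific `handle_interactions` process everything that
    spans slab boundaries at this level, then recurse within each slab."""
    if len(items) <= 1:
        return
    size = math.ceil(len(items) / m)
    slabs = [items[i:i + size] for i in range(0, len(items), size)]
    handle_interactions(slabs)       # inter-slab work at this level
    for slab in slabs:
        distribution_sweep(slab, m, handle_interactions)

calls = []
distribution_sweep(list(range(16)), 4, lambda slabs: calls.append(len(slabs)))
# There are O(log_m n) levels of recursion; each level scans the data
# once, which yields the O((N/B) log_{M/B}(N/B)) I/O bound when the
# fan-out m is chosen as Theta(M/B).
```

With 16 items and m = 4 the recursion has two splitting levels: one call on the full input and one per slab of four.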

Data Structuring

One of the first I/O-optimal data structures in EM was described by Arge. The buffer tree is based on the (a,b)-tree, which is a generalization of the B-tree. The major difference between the buffer tree and the (a,b)-tree is the addition of large buffers to each tree node. The buffers collect requests, which may include insert, delete, and query operations, as they navigate down the tree from the root to the appropriate leaves. A request r in the buffer of node q does not advance to the next level of the tree unless there are sufficient other requests in the buffer to pay for the costs of reading in the children of q and bucket sorting the requests in q's buffer according to q's splitters.

Unlike B-tree based applications, requests against a buffer tree data structure are not satisfied immediately. Processing proceeds in a lazy manner, and results are reported only when, and in the order in which, requests are forced down to the leaf level of the tree. Once a block of requests reaches a leaf, processing of those requests continues to completion. Only filtering of the requests through the buffer part of the tree is done in a lazy manner. As requests can get stuck in the buffers of the buffer tree if there are insufficient other requests to flush them out, it is generally necessary to force-empty the buffers after all requests have been inserted, before all requests are fully processed. This involves scanning through the buffers and generating sufficient dummy requests to make the requests descend to the leaf level of the tree.

The buffer tree is suitable for processing subproblems which tolerate the lazy behaviour of the buffer tree, i.e. they do not require their output in a particular order.
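A minimal Python sketch of this lazy buffering idea follows (illustrative only; real buffer-tree nodes hold on the order of M requests, flush whole blocks, and rebalance like an (a,b)-tree):

```python
import bisect

class BufferNode:
    """One internal node of a toy buffer tree. Requests accumulate in
    `buffer` and are pushed to the children, bucket-sorted by the node's
    splitters, only once there are enough of them to pay for the I/O."""

    def __init__(self, splitters, flush_at):
        self.splitters = splitters                    # sorted routing keys
        self.children = [[] for _ in range(len(splitters) + 1)]
        self.flush_at = flush_at                      # laziness threshold
        self.buffer = []                              # pending requests

    def insert(self, key):
        self.buffer.append(key)                       # lazy: just record it
        if len(self.buffer) >= self.flush_at:
            self.flush()

    def flush(self):
        """Empty the buffer (also used to force-empty at the end)."""
        for key in self.buffer:
            child = bisect.bisect_left(self.splitters, key)
            self.children[child].append(key)          # bucket by splitters
        self.buffer = []

node = BufferNode(splitters=[10, 20], flush_at=4)
for k in [5, 25, 15, 12]:
    node.insert(k)        # nothing moves until the fourth request arrives
```

After the fourth insert the buffer flushes: 5 lands in the first child, 15 and 12 in the middle one, 25 in the last, and the buffer is empty again.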

The buffer tree has been implemented as part of the LEDA-SM project (see Crauser et al.). The performance of the external memory priority queue based on the buffer tree was compared to the performance of an external priority queue derived from the internal memory radix heap data structure. The authors report that the new structure based on radix heaps outperforms the buffer tree for the range of inputs tested, but they note that, unlike radix heaps, the buffer tree permits the priority data type to be a non-integer. They also found their implementation of buffer tree sort to be slower than implementations of other external memory sorting algorithms that they tested.

Simulation

Simulating parallel algorithms in EM is one of the primary ideas in this thesis. There appear to have been three forms in which the simulation paradigm has appeared in the EM literature:

1. Localizing parallel references algorithmically.

A class of parallel algorithms from which good EM algorithms can be derived is illustrated by an algorithm described briefly in. The authors identify a PRAM algorithm for searching the dictionaries in a fractional cascaded binary tree which has the property that the p PRAM processors access data in a compact pattern, spanning O(p/B) blocks at each PRAM step. In general, if a PRAM algorithm has the property that the processors in each step access the data at locations that are grouped into clusters of size B, then a single processor external memory algorithm can be derived in a straightforward way by simulating the activities of the p PRAM processors.

2. Localizing parallel references by sorting.

Chiang et al. describe a simulation of PRAM algorithms based on sorting the PRAM instructions and references at each parallel computation step. The sort ensures that the data references that would be made by the p PRAM processors are localized in a series of consecutive memory locations, and therefore can be accessed efficiently as a series of disk blocks. The sorting technique must also be an efficient EM sort. Using this technique, I/O-optimal EM algorithms can be derived from certain work-inefficient PRAM algorithms for problems with a geometrically decreasing size property.
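The heart of the idea, one read phase of a PRAM step served by sorting, can be sketched in Python as follows (an illustrative fragment only; the full simulation also sorts the computed values back to their destinations, and the sort itself must be an efficient EM sort):

```python
def serve_reads(memory, requests):
    """requests: list of (processor, address) pairs for one PRAM step.
    Sorting by address turns the scattered accesses into a single
    sequential pass over `memory`; in external memory this costs a scan
    (O(N/B) I/Os) instead of one random block read per request."""
    result = {}
    for proc, addr in sorted(requests, key=lambda r: r[1]):
        result[proc] = memory[addr]   # addresses visited in increasing order
    return result

mem = [10, 11, 12, 13]
out = serve_reads(mem, [(0, 3), (1, 0), (2, 2)])   # {0: 13, 1: 10, 2: 12}
```

The sequential pass is what makes the simulation I/O-efficient: each disk block of `memory` is touched at most once per PRAM step, regardless of how scattered the processors' requests are.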

3. Simulating BSP algorithms.

This idea was articulated independently by Sibeyn and Kaufmann, and by Dehne, Dittrich, and Hutchinson. The BSP model of parallel computation was proposed in and is discussed in Chapter.

The authors of propose an approach for simulating optimal BSP algorithms to produce efficient EM algorithms. Their work is presented in the context of a single disk, uniprocessor EM machine, but they suggest that the concept can be extended to multiple disks. They do not explain, however, how to accommodate the blocking factor, which is an intrinsic issue in efficient I/O design, nor do they provide mechanisms for handling multiple disks or multiple physical processors on the target EM machine.

Lower Bounds on I/O Complexity

A number of important results regarding lower bounds for EM algorithms are listed in Sections to. The reader is referred to the survey by Vitter for a more complete treatment.

Sorting Lower Bound

Perhaps the most fundamental is the lower bound shown for sorting by Aggarwal and Vitter. This result states that the I/O complexity of sorting is

    sort(N) = Θ( (N / (BD)) log_{M/B} (N/B) ).

Vitter points out why disk striping is nonoptimal in terms of I/O complexity. With a single disk, the equation becomes sort(N) = Θ( (N/B) log_{M/B} (N/B) ). Striping data across D disks is equivalent to increasing the blocksize to BD, giving

    sort_striping(N) = Θ( (N / (BD)) log_{M/(BD)} (N / (BD)) ),

which is larger than sort(N) by a factor of log_{M/(BD)} (M/B). If D is very large, the base M/(BD) of this logarithm approaches unity, and striping becomes strikingly inferior to using the disks independently for sorting.
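To make the striping penalty concrete, the following small computation evaluates both I/O counts (constant factors dropped) for one arbitrary but plausible parameter setting:

```python
import math

def sort_ios(N, M, B, D):
    """Aggarwal-Vitter sorting bound, up to constant factors:
    (N / (B*D)) * log_{M/B}(N/B) parallel I/O operations."""
    return (N / (B * D)) * math.log(N / B, M / B)

# Independent disks vs. striping: striping D disks behaves like a single
# disk whose blocksize is B*D.
N, M, B, D = 2**30, 2**20, 2**10, 2**8
independent = sort_ios(N, M, B, D)       # (2^12) * 2  =  8192 I/Os
striped = sort_ios(N, M, B * D, 1)       # (2^12) * 6  = 24576 I/Os
```

At these parameters striping pays three times the I/O, and as N grows the gap approaches the factor log_{M/(BD)}(M/B), here log_4(2^10) = 5.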


Permutation Lower Bound

Aggarwal and Vitter also showed that permutation in external memory has I/O complexity

    perm(N) = Θ( min{ N/D, (N / (BD)) log_{M/B} (N/B) } ).

Cormen showed that special cases of permutation corresponding to affine transformations have lower I/O complexity than the general lower bound. These special classes of permutations can be generated by certain binary matrix transformations on the address bits of the permutation elements.

Matrix Transpose Lower Bound

Aggarwal and Vitter showed that the I/O complexity of transposing a p × q matrix (with N = pq elements) is

    Θ( (N / (BD)) log_{M/B} min{ M, p, q, N/B } ).

A General Lower Bounding Technique

Arge et al. showed that, in general, assuming a comparison-based model, any problem requiring Ω(N log N) comparisons requires Ω( (N/B) log_{M/B} (N/B) ) I/O operations in the Parallel Disk Model.

Realistic Parameter Ranges

The general lower bounds of Aggarwal and Vitter and others are extremely useful in designing efficient external memory algorithms, and give an accurate picture of the general case for all possible values of the parameters.

Our simulation techniques produce external memory algorithms whose constraints restrict them to only a subset of the parameter constellations covered by the lower bounds of the preceding sections. However, we will argue that this subset is a useful one and is often encountered in real machines. In this subset we find that the I/O complexity of sorting, for instance, is only O(N / (BD)), because log_{M/B} (N/B) ≤ c for a constant c. We will discuss this in more detail in Section.


The Role of RAIDs

The performance advantages of parallel disk drives are recognized by the growing popularity of RAID arrays (Redundant Array of Inexpensive Disks; see). RAID technology permits an array of R individual disk drives, each with blocksize B, to be accessed as a single logical disk with a blocksize of up to RB, depending on the RAID level used. The RAID levels are designed to provide varying degrees of error detection and correction, based on the observation that the mean time between failures (MTBF) for an array which depends on D disks is 1/D times the MTBF for one of the disks. RAID level 0 is the simplest case, where read and write operations of BD items access a block of application data on each of the R = D disks. The higher RAID levels introduce redundancy of one form or another to cope with the eventuality of disk or operating system failure. Thus read and write operations of BD items access a block of application data on each of D disks, where D ≤ R. Redundancy is used in these RAID levels to permit data to be recovered if a disk failure occurs.

In this thesis we are concerned with algorithms that are capable of efficiently using the large blocksizes that are required for efficient use of disks, including RAIDs, but the reliability issues are beyond the scope of our discussions. For some of our algorithms we also find that the reads and writes are not striped, i.e. the blocks accessed on each disk are on different logical tracks. It is not clear how useful a RAID array is for these algorithms, as we may need to bypass the RAID logic in order to access the disks independently.

Several of the RAID levels calculate checksums for error correcting purposes on a stripe by stripe basis. Non-striped writes may modify a block in up to D different stripes, necessitating the recalculation of their checksums and resulting in large I/O and computational overhead. While non-striped access to disk promises a theoretical advantage in I/O complexity (see Section), striped I/O may in practice be more efficient if RAID technology is used.

In general, in this thesis, we assume that the disks are accessible independently, according to the EM-BSP and PDM models. We note when the access to the disks is striped, but otherwise we pay little attention to the issues surrounding the use of RAIDs for implementing these algorithms.

Storage Space

The amount of disk space required by an external memory algorithm is important. For very large problems, even small constant factors in space consumption can be crucial to the practicality of an algorithm. The algorithms considered in this thesis generally require space which is linear in the problem size. Exceptions include the EM-BSP algorithms derived from the CGM parallel segment tree, and the buffer tree technique whose implementation is considered in Chapter. Both increase space consumption by a logarithmic factor. For the simulation techniques that are considered in Chapters and, the space consumption is linear in the problem size. In this thesis we focus on asymptotic analyses of our techniques. While the precise constants are of interest to a practitioner, they are beyond the scope of our treatment here.

Preliminaries, Notation and Terminology

Asymptotic Notation

We define the asymptotic notation O(f(N)), o(f(N)), Ω(f(N)), and Θ(f(N)) in the usual way. The following definitions are adapted from.

Let f, g be any functions such that f: N → R⁺ ∪ {0} and g: N → R⁺ ∪ {0}.

    O(f(N)) = { t: N → R⁺ ∪ {0} | ∃c ∈ R⁺ ∃n₀ ∈ N ∀N ≥ n₀ : t(N) ≤ c·f(N) }
    o(f(N)) = { t: N → R⁺ ∪ {0} | ∀c ∈ R⁺ ∃n₀ ∈ N ∀N ≥ n₀ : t(N) ≤ c·f(N) }
    Ω(f(N)) = { t: N → R⁺ ∪ {0} | ∃c ∈ R⁺ ∃n₀ ∈ N ∀N ≥ n₀ : t(N) ≥ c·f(N) }
    ω(f(N)) = { t: N → R⁺ ∪ {0} | ∀c ∈ R⁺ ∃n₀ ∈ N ∀N ≥ n₀ : t(N) ≥ c·f(N) }
    Θ(f(N)) = O(f(N)) ∩ Ω(f(N))

We extend the notation to expressions containing asymptotic notations as follows. Let g be any function such that g: N → R⁺ ∪ {0}. Let L₁ and L₂ represent any sets of functions, such as one of O(f(N)), o(f(N)), Ω(f(N)), or ω(f(N)). Let ∘ represent any binary operator. Then

    L₁ ∘ L₂ = { t(N) = t₁(N) ∘ t₂(N) | t₁ ∈ L₁, t₂ ∈ L₂ }
    L₁ ∘ g(N) = { t(N) = t₁(N) ∘ g(N) | t₁ ∈ L₁ }    (g(N) ∘ L₁ is similar)
    L₁ ∘ a = { t(N) = t₁(N) ∘ a | t₁ ∈ L₁ }          (a ∘ L₁ is similar)

Let Φ represent any set of functions, such as those defined above. As is common practice in asymptotic analysis of algorithms, we will also use the notation g(N) = Φ to mean g(N) ∈ Φ.


BSP Models

In Chapter we discuss the BSP (Bulk Synchronous Parallel) parallel computing model and two related models, BSP* (Extended BSP) and CGM (Coarse Grained Multicomputer), which contain extensions or simplifications of the basic BSP ideas. We will refer to these three models collectively as BSP-like models, or simply as BSP models. We may refer to a BSP algorithm with N data and v processors which communicates exclusively via h-relations of size h = N/v as a CGM algorithm, and a BSP algorithm whose message size is at least b, for some positive integer b, may also be referred to as a BSP* algorithm. Please refer to Chapter for more information on the BSP, BSP*, and CGM parallel models.

A Central Issue: Communication Bottlenecks

The idea of decomposing a problem into smaller subproblems is a common approach in computer science; divide and conquer algorithms and parallel computing algorithms are some examples. Communication between subproblems is required in order for such approaches to be successful. Dijkstra pointed out the similarities between concurrent processes, which communicate in space, and sequential processes, which communicate in time. Message passing is a typical means of communicating in space, and disk files are a common method of communicating between processes whose execution is separated in time.

Optimizing both kinds of communication is naturally an important consideration in controlling the running time of practical algorithms. Recognition of this issue is one of the main differences between the BSP and PRAM parallel computing models. It is the main motivation for the development of the BSP* and CGM parallel computing models. It also provides the primary issues in external memory algorithm design.

An important common factor in controlling the cost of these types of communication is aggregation, or blocking, of the communicated data. In many real communication systems (e.g. disk input/output, interprocessor message passing) there is a significant overhead cost in preparing the communication system for a data transfer. This may include determining and reserving a communication path in a network, or waiting until disk hardware has been physically adjusted to the appropriate position.

Blocking associates multiple data items with a common block. When a communication operation is required on one of the data items in the aggregation, it is applied to the block. Provided that the algorithm operates on all, or a significant fraction, of the elements in the block before discarding it, this reduces overall costs by paying the overhead cost only once for the block, instead of once for each of the items actually accessed by the algorithm.
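A back-of-the-envelope cost model shows why blocking pays off. In the sketch below (the numbers are arbitrary), each transfer carries a fixed overhead, e.g. a seek or path set-up, plus a per-item cost:

```python
def transfer_cost(items, block_size, overhead, per_item):
    """Cost of moving `items` data items when every block transfer pays a
    fixed `overhead` (seek, path set-up) plus `per_item` for each item."""
    blocks = -(-items // block_size)          # ceiling division
    return blocks * overhead + items * per_item

# One item per transfer vs. 1000-item blocks, with overhead 100x per-item:
unblocked = transfer_cost(10**6, 1, overhead=100, per_item=1)     # 101,000,000
blocked = transfer_cost(10**6, 1000, overhead=100, per_item=1)    #   1,100,000
```

Paying the overhead once per block rather than once per item buys nearly two orders of magnitude here, provided, as noted above, that the algorithm actually uses most of each block before discarding it.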


Papers Contributing to the Thesis

Some of the results described in this thesis have been published as papers with various coauthors. These papers are described briefly below. However, any errors in the current work are mine alone.

A randomized technique is presented in for simulating BSP algorithms in external memory (see Section for a description of the BSP model of parallel computation). The resulting algorithm is shown to have blocked I/O to one or multiple disks, on a single or multiple processor machine. All I/O is done in parallel to multiple disks, if present, on each processor, and communication between processors is balanced and is also done blockwise. The authors propose new models for evaluating external memory algorithms on multiple processor machines, which address suggestions made by Cormen and Goodrich (see Section for a description of this challenge). This work is covered by Chapters and.

Hutchinson, Maheshwari, Sack, and Velicescu (see also Section) implemented the buffer tree for external sorting and priority queue applications, and suggested modified parameter settings which improved performance in practice. To the authors' knowledge, this was the first full implementation of the buffer tree. Buffer tree sort was found to be well behaved, with the running time increasing linearly over the range of inputs tested. Buffer tree sort easily outperformed the well known quicksort technique using virtual memory.

Dittrich, Hutchinson, and Maheshwari presented external memory, multiple processor algorithms for multisearch, and showed them to be efficient using the models previously proposed in. More details of this work can also be found in Section of this thesis.

Dehne, Dittrich, Hutchinson, and Maheshwari presented a deterministic technique for simulating CGM algorithms in external memory (see Section for a description of the CGM model of parallel computation). A number of existing parallel algorithms are shown to have efficient external memory, multiple processor variants obtained via the simulation technique and analyzed under the models of.

Dehne, Dittrich, Hutchinson, and Maheshwari also explore the transformation of parallel algorithms via this simulation technique as a way to make existing parallel algorithms scalable to larger problem sizes, without using demand paging for handling virtual memory.


Claim to Originality or Contribution to Knowledge

Section summarizes the main contributions of this thesis. The claims to originality are subject to the contributions and assistance of various coauthors, which are summarized in Section.

Collaborations

The randomized simulation technique of Chapter and the EM-BSP, EM-BSP*, and EM-CGM models first appeared in, and also in.

The buffer tree implementation study described in Section appeared earlier as.

External memory multisearch algorithms for the EM-BSP described in Section first appeared in, and later in.

An implementation design and the experiments described in Section are also described in.

A summary of the simulation approach, its application to adding scalability to parallel algorithms, and the preliminary experimental results of Section were described in.

The deterministic simulation technique of Chapter is described in.

Contributions

In general terms, this work adds a number of results on parallel processing in communication-aware parallel models to the existing body of external memory research.

Our new EM-BSP, EM-BSP*, and EM-CGM parallel external memory computing models combine the features of the external memory Parallel Disk Model (PDM) with Bulk Synchronous Parallel (BSP) parallel computing models, answering a challenge of Cormen and Goodrich in ACM Computing Surveys.

We propose a general paradigm for producing parallel external memory algorithms from BSP algorithms. We present detailed analyses of instances of the paradigm, including general derivation techniques. Examples presented include:

(a) The BSP* lower bound on message sizes in the parallel computing domain allows blocked I/O in the external memory domain. Using randomization, we also achieve balanced communication to multiple real processors, and balanced, independent I/O to multiple disks on each processor.

(b) Balanced message sizes in a BSP algorithm allow blocked I/O and balanced communication to multiple processors, and striped I/O to multiple disks on each processor, without randomization.

(c) Adapting a known parallel routing algorithm to our requirements, we show that it converts a BSP algorithm to the requirements of either of techniques (a) or (b) above.

(d) We describe what appears to be a new routing algorithm for parallel algorithms, which routes messages indirectly via a series of rounds and has an adjustable slackness requirement.

(e) Using our new indirect routing technique, we describe a second deterministic approach to obtaining external memory algorithms for an EM-BSP model from existing parallel BSP algorithms. This requires smaller slackness than technique (b) above, and so reduces the internal memory size required by the resulting external memory algorithm.

(f) We provide an optimization technique for our simulations when the problem size varies geometrically over a series of rounds of the original BSP algorithm.

(g) We explain an apparent contradiction between established lower bounds on I/O complexity for various problems and the I/O complexities we derive for several of our new algorithms.

We present a large number of new external memory algorithms obtained from our simulations:

- Sorting, permutation, and matrix transpose: optimal parallel external memory algorithms existed for models which assumed logarithmic or unit-cost communication; the new algorithms work on more general models.

- We present the first parallel EM multisearch algorithms.

- We obtain parallel external memory algorithms for various problems in computational geometry, geographic information systems, and graph theory.

We report on the implementation and performance testing of a buffer tree. We believe this to be the first such implementation of this well known EM data structure. We also describe preliminary implementation experiments of one of our simulation techniques, providing evidence towards its validation as a practical technique.

Organization of the Thesis

In Chapter we survey selected previous work on mo dels for external memory com

putation and for parallel computation We highlightthechallenge presented by Cor

men and Go o drich to combine these mo dels and we present three new combined

mo dels that address many of the issues identied in

In Chapter we describ e the general framework of a new simulation technique

that creates ecient external memory algorithms from ecient BSP algorithms

In Chapter we describ e a randomized approach which implements the general

framework describ ed in Chapter In particular we fo cus on how to eciently

write the messages between pro cessors to parallel disks and eciently retrieve

them in the next comp ound sup erstep We show that if the simulation is p erformed

on a coptimal BSP algorithm the resulting EM algorithm is coptimal with high

probability The resulting EMBSP algorithm in particular EMBSP acquires

unication present in its prop erty of blo cked IO from the attribute of blo cked comm

the original BSP algorithm from which it was derived The techniques describ ed

in Chapter ensure that this prop erty of blo cked communication between tasks is

preserved and that in addition IO is done in parallel if multiple disks are present

In Chapter we describ e a deterministic simulation technique for pro ducing an

EMBSP in particular an EMCGM algorithm from a particular typ e of BSP algo

rithm in particular a CGM algorithm This approach requires that each of the v

virtual pro cessors generate at least vb items of communication data in every sup er

step where the parameter b represents the desired communication blo ck size This

allows us to establish a tight upp er and lower bound on the message size to every

pro cessor and allows a simple ecient deterministic simulation

In Chapter we describ e another deterministic approach which also generates an

EMCGM algorithm from a CGM algorithm It requires less communication volume

than the metho ds of Chapter Each of v virtual pro cessors need only generate v b

items of communication data in every sup erstep where is a constant

The simulation requires a constantnumber or rounds for each communication round

algorithm although this constant is larger than in Chapter of the CGM

In Chapter we describ e an extension to the three simulation techniques which

allows certain problems such as list ranking whose problem size reduces geometrically

N

in each of a series of r rounds to b e simulated in a linear ie O number of IO

B op erations

CHAPTER INTRODUCTION

In Chapter a number of new EMBSP in particular EMBSP and EMCGM

algorithms are presented together with their computation communication and IO

complexities Many of these algorithms are obtained from known BSP and CGM

algorithms via the simulations of Chapters and We also describ e new multisearch

algorithms derived from aknown BSP algorithm

In Chapter we describ e two case studies in implementing EM algorithms We

describ e an implementation of the buer tree technique of Arge and an implemen

tation of the deterministic simulation technique of Chapter applied to a parallel

algorithm for Samplesort

Chapter  concludes with a summary of the contributions of this research and some suggestions for future work.

In order to assist the reader in understanding the arguments of this thesis, a summary of notation and an index of terms can be found beginning on page .

Chapter

Models

Overview

In this chapter we develop three new models of computation on parallel external memory machines. The new models, which we refer to as EM-BSP, EM-BSP*, and EM-CGM, can be viewed as uniting the external memory Parallel Disk Model (PDM) with the BSP, BSP*, and CGM parallel computing models, respectively. The PDM is described in Section , and the BSP, BSP*, and CGM parallel computing models are described in Section . While the new models each have slightly different parameters due to their differing heritage, they share a number of common features:

1. they capture three contributors to the running time of an algorithm, namely computation, communication, and I/O; and

2. the goodness criteria for algorithms evaluated on each of the models are similar and subject to the same issues.

The BSP, BSP*, and CGM models are very similar; algorithms which are efficient on the BSP* and CGM models are also BSP algorithms. Because of this similarity, we may refer at times to BSP* and CGM algorithms as BSP algorithms, and to EM-BSP* and EM-CGM algorithms as EM-BSP algorithms.

Organization of This Chapter

In Section  we describe previous models for external memory computation, in particular the Parallel Disk Model. Section  describes several models of computation for parallel machines, including the BSP, BSP*, and CGM parallel computing models. Section  then introduces the new parallel EM models: EM-BSP, EM-BSP*, and EM-CGM.

In describing each of these models we identify two fundamental components: the system model and the cost model. The system model describes the architecture of the model, and the cost model provides criteria by which competing algorithms on the model can be compared. The cost model consists of a measurement scheme and one or more goodness criteria, which together provide a measure by which two algorithms can be ranked.

EM Models

In this section we examine some of the external memory models that have been proposed to date.

Aggarwal and Vitter: Single Disk Model

This model involves a single processor accessing a single disk. In a single I/O request (or operation), D different blocks can be transferred between disk and memory. There is no restriction on the location of the blocks on the disk, and so they may be in consecutive block locations or widely separated. This is more permissive than real disk systems, and so the model is viewed as somewhat unrealistic compared to the later PDM model of Vitter and Shriver. However, Aggarwal and Vitter showed lower bounds for a number of problems using their permissive model, and these lower bounds carry over to the PDM. The cost models for the two models are identical (see below).

The Parallel Disk Model (PDM)

System Model: The model of Vitter and Shriver has become known simply as the PDM, or Parallel Disk Model (see Figure ). The model allows for D disks to be accessed simultaneously in a single I/O operation. Either striped or independent I/O is allowed. The disks may be attached to a single processor, or individually to different processors which are interconnected by some network. The basic PDM is independent of the number of processors, but both single and multicomputer versions were introduced.

For the randomized algorithms reported there, multicomputer interconnection networks on which sorting could be done in parallel randomized logarithmic time were considered (e.g., cube-connected cycles). The later deterministic algorithms of Nodine and Vitter considered only a PRAM-like interconnection between processors.


Figure : The Parallel Disk Model with One Processor.

The goodness criterion is simultaneous optimality of I/O complexity and computation complexity.

Figure : The Parallel Disk Model with Multiple Processors.


Cost Model: The cost measure used by the PDM is the number of I/O operations required asymptotically by an algorithm A for a given problem P, as the size of P grows to infinity. All other costs, such as computation time, are ignored. Algorithm A is said to be optimal (we use the term I/O-optimal) if the number of I/O operations required by A meets the lower bound for the number of I/O operations required to solve P. The following additional rules apply:

1. for each I/O operation performed, the model assumes that one block of data can be transferred for each of the D disks;

2. for each block transferred, the model assumes that B items of data could have been transferred, where B is the number of items of data that fit into a disk block;

3. throughout the computation, the model assumes that the available internal memory, which is capable of holding M items of data, is fully utilized by the algorithm. For instance, the lower bound for sorting incorporates the fact that internally sorting a memory-load of data is free under the model. The number of passes over the data, and therefore the number of I/O operations, is affected by how many items fit into internal memory.

The PDM has been widely used for developing and analyzing external memory algorithms. Much of the research described in Chapter  is based on this model of external memory computation.
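Under these accounting rules, the PDM cost of sorting is the well-known Θ((N/(DB)) · log_{M/B}(N/B)) parallel I/O operations. The sketch below simply evaluates that expression; the function name and the concrete machine parameters are illustrative, not taken from the thesis.

```python
import math

def pdm_sort_ios(N, M, B, D):
    """Asymptotic PDM sorting cost (constants omitted):
    (N / (D*B)) parallel I/Os per pass, times ceil(log_{M/B}(N/B)) passes.
    Internal sorting of a memory-load is free under the model."""
    passes = max(1, math.ceil(math.log(N / B, M / B)))
    return math.ceil(N / (D * B)) * passes

# Hypothetical machine: 2^20 items, 2^16-item memory, 2^10-item blocks.
print(pdm_sort_ios(N=2**20, M=2**16, B=2**10, D=1))  # 2048 parallel I/Os
print(pdm_sort_ios(N=2**20, M=2**16, B=2**10, D=4))  # 512: 4 disks, 1/4 the I/Os
```

Doubling D halves the I/O count of each pass, which is exactly the incentive the model gives for using all disks in parallel.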

Hierarchical Memory Models

In a modern computer system there are potentially many layers of memory, each with a different access time and data unit (block) size. These layers form a hierarchy, ranging from those with the fastest access time, such as the on-chip registers of the CPU, through various kinds of cache memory and main memory, to local mechanical disk drives, or even remote disks accessed via a network. In concept, there is potential at each interface between layers of the hierarchy to optimize the data transfer algorithmically. A number of models have been developed to capture the relationships between layers of this hierarchy. These include the Hierarchical Memory Model (HMM) of Aggarwal, Alpern, Chandra, and Snir; the Block Transfer Model (BTM) of Aggarwal, Chandra, and Snir; and the Uniform Memory Model (UMM) of Alpern, Carter, and Feig. Due to the added complexity of the multilevel models, relatively few researchers have used them to design algorithms, with one exception.

In this thesis we focus primarily on the disk-memory interface, because the speed difference and blocking requirements of this interface are the most severe, offering the most potential for time savings, and because the hierarchical models are significantly more complex.


Parallel Computation Models

Now we examine some of the parallel computing models that have been proposed. As in the EM models, there are both a system model and a cost model component to each. The system model describes the architecture of the model, and the cost model provides a criterion by which competing algorithms on the model can be compared.

The Parallel Random Access Memory (PRAM) Model

The PRAM model has been used for many years as a model of parallel computation. Its main advantage is its simplicity.

System Model: The PRAM model assumes that p processors are available and that they are attached to a single common memory. The processors operate synchronously; in each PRAM step, p instructions are executed in parallel. Any item in the common memory can be accessed at unit cost by any processor. To resolve the issues associated with multiple processors reading and writing a single item in the common memory concurrently, several submodels have been defined:

Concurrent Read, Concurrent Write (CRCW): an item can be read or written concurrently in this model. To model the effect of concurrent writes, several further submodels have been proposed.

Exclusive Read, Exclusive Write (EREW): neither read nor write concurrency is allowed by the model.

Concurrent Read, Exclusive Write (CREW): only read concurrency is permitted by the model.

Exclusive Read, Concurrent Write (ERCW): only write concurrency is permitted by the model.

Cost Model: The time complexity T_p of a PRAM algorithm A for problem P with p processors is defined as the maximum number of memory accesses performed by any processor during the execution of A. The work complexity W_p is p · T_p. The goodness criteria used are typically:

Parallel Time Optimality: minimize T_p, where T_p = O(T_s / p) and T_s is the time complexity of the best sequential algorithm for P.

Parallel Work Optimality: minimize W_p, where W_p = O(T_s) and T_s is the time complexity of the best sequential algorithm for P.
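As a toy illustration of these measures (not an example from the thesis), consider summing n items on an EREW PRAM: each processor first adds about n/p items locally, and a binary combining tree then finishes in ⌈log2 p⌉ steps.

```python
import math

def parallel_sum_time(n, p):
    """PRAM (EREW) time for summing n items with p processors:
    ceil(n/p) local additions, then ceil(log2 p) combining steps."""
    return math.ceil(n / p) + math.ceil(math.log2(p)) if p > 1 else n

def work(p, T_p):
    """Work complexity W_p = p * T_p."""
    return p * T_p

T = parallel_sum_time(1024, 32)
print(T, work(32, T))  # 37 1184
```

With n = 1024 and p = 32 this gives T_p = 37 and W_p = 1184, which is O(T_s) for the 1023-addition sequential sum, so the algorithm is work-optimal.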


The Bulk Synchronous Parallel (BSP) Model

The BSP model was proposed by Valiant as a bridging model for parallel computation: a standard model on which both hardware and software designs can agree. The stated objective of BSP is to provide a model that is simultaneously useful to hardware designers, algorithm designers, and programmers of parallel computers. It is neither a hardware nor a programming model, but somewhere in between.

System Model: The BSP model has parameters N, v, g, and L. It consists of v processor-memory components (we use the letter v because these will be virtual processors during our simulations; see Chapters  and ), a router that delivers messages in a point-to-point fashion, and a facility to synchronize all processors (prevent processors from proceeding until all have reached a given barrier in their code). Each processor has a unique label in the range 0, ..., v - 1.

A computation proceeds in a succession of supersteps separated by synchronizations, usually divided into communication and computation supersteps (see Figure ). In computation supersteps, processors perform local computations on data that is available locally at the beginning of the superstep, and issue send operations. Between computation supersteps, a communication superstep is performed, where each processor exchanges data with its peers via the router. An h-relation is a superstep where O(h) data are sent and received by every processor.

The model anticipates that the routers on different systems may have different characteristics relative to the computation speed of the processors. For explanation purposes, let N_comp = {the number of computational operations that can be performed by the processors per unit time} and N_comm = {the number of words that can be delivered by the router per unit time}. The parameters of the BSP model are then defined as follows:

N: the number of items in the problem to be solved.

v: the number of processors available.

g: N_comp / N_comm, i.e., the time conversion factor between communication and computation operations. Another way of expressing the meaning of g is as the time required to send a single word of data between two processors, where time is measured in number of CPU operations.

L: the minimum setup time or latency of a superstep, measured in CPU operations. This can involve a number of factors, including the time required for the processors to perform a barrier synchronization.


Figure : A BSP Computation (v independent subproblems; alternating computation supersteps 1, ..., λ and communication supersteps).

Cost Model: The cost of a computation on the BSP model is represented by the sum of three quantities: one for computation time, one for communication time, and one for the time required by the processors to synchronize. Let A be a BSP algorithm operating on input data of size N items. Let π be its parallel computation time, let η be the number of words of data sent between processors during the execution of A, and let λ be the number of supersteps required by A to solve a problem of size N. Then the cost T(A) of the computation is

    T(A) = π + g·η + λ·L.

The c-optimality goodness criterion was proposed for the BSP model.

Definition: Let A be the best sequential algorithm on the RAM for the problem under consideration, and let T(A) be its worst case runtime. Let c be a constant. A c-optimal parallel algorithm A* meets the following criteria:

1. The ratio between the computation time of A* and T(A)/p is in c + o(1).

2. The ratio between the communication time of A* and the computation time T(A)/p is in o(1).

All asymptotic bounds refer to the problem size n, as n → ∞. We say that a BSP algorithm is one-optimal if it fulfils the requirements of the above definition for c = 1.
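The cost expression can be read directly as a function of the three measured quantities. The sketch below evaluates T(A) = π + gη + λL; all numeric values are hypothetical.

```python
def bsp_cost(pi, eta, lam, g, L):
    """BSP running time: computation pi, plus g times the eta words
    communicated, plus lam supersteps each paying the latency L."""
    return pi + g * eta + lam * L

# Hypothetical algorithm: 10^6 operations, 10^4 words sent, 10 supersteps,
# on a machine with g = 4 and L = 100.
print(bsp_cost(10**6, 10**4, 10, g=4, L=100))  # 1041000
```

On this example the g·η and λ·L terms are small relative to π, which is the situation the c-optimality criterion below formalizes.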

The Extended BSP (BSP*) Parallel Processing Model

The BSP* model is a variant of the BSP model in which each message is assessed a minimum cost equivalent to a message of size b items. The BSP* model therefore gives incentives to send messages of at least b in size. An appropriate value for b in a particular case is determined by the characteristics of the particular router.

System Model: The BSP* model has parameters N, v, g*, L, and b. Similarly to BSP, it consists of v processor-memory components, a router that delivers messages in a point-to-point fashion, and a facility to synchronize all processors in a barrier style. Each processor has a unique label in the range 0, ..., v - 1.

As in BSP, a computation proceeds in a succession of supersteps separated by synchronizations, usually divided into communication and computation supersteps. In computation supersteps, processors perform local computations on data that is available locally at the beginning of the superstep, and issue send operations. In communication supersteps, the send operations are implemented, i.e., the exchange of data between the processors is done by the router.

The parameters of the BSP* model are defined as follows:

N: the number of items in the problem to be solved.

v: the number of processors available.

g*: the time (in number of processor operations) the router needs to deliver a packet of size b (in number of machine words) when in continuous use.

L: the minimum setup time or latency of a superstep, measured in CPU operations.

b: the minimum message size required by the router to achieve its rated throughput.

Cost Model: The cost of a computation on the BSP* model is represented by the sum of three quantities: one for computation time, one for communication time, and one for the time required by the processors to synchronize. Let A be a BSP* algorithm operating on input data of size N items. Let π be its parallel computation time, let η be the number of messages sent between processors during the execution of A, and let λ be the number of supersteps required by A to solve a problem of size N.

Let η_ij be the maximum of the number of words sent or received by processor j in the i-th communication superstep. The cost of the i-th communication superstep is then t_i = max_{1≤j≤v} w_ij, where w_ij = ⌈η_ij / b⌉ is the maximum of the number of blocks sent or received by processor j in superstep i. Let η* = Σ_i t_i. Then the cost T(A) of the computation is

    T(A) = π + g*·η* + λ·L.

Note that a message which is shorter than b words incurs the same cost as a message which is b words in length.

The c-optimality goodness criterion is typically used to evaluate an algorithm on the BSP* model; see Section .

We treat the BSP model as a special case of BSP*, differing only in the way the accounting is done for messages shorter than b items.
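The block-based accounting of a single BSP* communication superstep can be sketched as follows; the traffic figures are hypothetical, and the point is that a 1-word message pays the same as a full b-word block.

```python
import math

def bspstar_superstep_comm_cost(words_per_proc, b, g_star):
    """Cost t_i * g* of one BSP* communication superstep: processor j's
    traffic of eta_ij words is charged as w_ij = ceil(eta_ij / b) blocks,
    and the superstep costs g* times the maximum block count."""
    blocks = [math.ceil(w / b) for w in words_per_proc]
    return g_star * max(blocks)

# Three processors sending 1, 64, and 130 words, with b = 64 and g* = 8:
# block counts are 1, 1, 3, so the superstep costs 8 * 3 = 24.
print(bspstar_superstep_comm_cost([1, 64, 130], b=64, g_star=8))
```

Note that the 1-word sender is charged one full block, exactly the short-message penalty the model uses to reward messages of at least b words.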

The Coarse Grained Multicomputer (CGM) Model

The CGM model is another special case of the BSP model. In the spectrum of granularity of parallel computation, the PRAM model is the most fine-grained and the CGM is the most coarse-grained model.

System Model: The CGM model has parameters N and v. The model assumes that a collection of v processors are interconnected via a network. Initially, each processor holds an equal portion of the N data items of the problem to be solved, so each processor contains O(N/v) data items. A CGM algorithm is BSP-like in that it proceeds in an alternating sequence of computation and communication rounds (supersteps). In a single computation round, each processor performs a computation internally on the O(N/v) items in its local memory. Between computation rounds, a communication round is performed, where each processor sends a total of O(N/v) data to its peers and receives O(N/v) data in return. It is this restriction on the communication volume of a superstep that differentiates the CGM model from other BSP models.


Cost Model: A common goodness criterion for CGM algorithms is to simultaneously minimize (i) the parallel computation time and (ii) the number of supersteps performed (O(1) rounds is the ideal). This has the effect of defining an optimal CGM algorithm for a given problem as being an algorithm that has the minimum number of supersteps and performs the same amount of work as an optimal sequential algorithm for the same problem.

In this thesis we will treat CGM as a model which has parameters N, v, g, and L as in BSP, but for which each communication superstep consists of one h-relation for h = N/v. The cost measure of a CGM algorithm in our work will be the sum of the computation, communication, and synchronization times, as in BSP. Let A be a BSP algorithm operating on input data of size N items. Let π be its parallel computation time, and let λ be the number of supersteps required by A to solve a problem of size N. Then the cost T(A) of the computation is

    T(A) = π + λ·(g·(N/v) + L).

A superstep which involves less than N/v data sent or received by each processor is nonetheless charged for an (N/v)-relation.
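A minimal sketch of this cost measure, with hypothetical parameter values; note that λ multiplies the full (N/v)-relation charge regardless of how much data is actually sent.

```python
def cgm_cost(pi, lam, N, v, g, L):
    """CGM running time: every one of the lam communication supersteps is
    charged as a full (N/v)-relation, even if less data is sent."""
    return pi + lam * (g * (N / v) + L)

# Hypothetical algorithm: 10^6 operations, 8 supersteps, N = 10^6 items
# on v = 100 processors, with g = 4 and L = 500.
print(cgm_cost(pi=10**6, lam=8, N=10**6, v=100, g=4, L=500))  # 1324000.0
```

Because λ is the only communication knob, minimizing the number of rounds is the natural optimization target under this measure.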

Other Parallel Computing Models

A large number of other parallel computing models have been proposed. The LogP model is one of the most frequently mentioned. This model differs slightly from BSP in that it captures the communication overhead, which is the time spent by a processor sending or receiving a message. A detailed comparison between the LogP and BSP models has been presented; its authors conclude that BSP and LogP are nearly equivalent in modelling capability. A variant of LogP is the LogGP model, which makes special provisions for large messages. Both LogP and LogGP require detailed schedules of the communication activities.

Combining PDM and BSP-like Models

Objective: A More Comprehensive Model

While the Parallel Disk Model captures computation and I/O costs, it is designed for a specific type of communication network, where a communication operation is expected to take a single unit of time, comparable to a single CPU instruction. BSP and similar parallel models capture communication and computational costs for a more general class of interconnection networks, but do not capture I/O costs.


Figure : A Parallel Machine with External Memory.

The ACM Working Group on Storage I/O for Large-scale Computing presented the challenge of synthesizing a coherent model that combines the best aspects of the Parallel Disk Model and Bulk Synchronous Parallel models, to develop and analyze algorithms that use parallel I/O, computation, and communication.

Our EM-BSP, EM-BSP*, and EM-CGM models, introduced in Sections , , and  respectively, address many of the issues outlined in that report. In particular, the EM-CGM model, introduced in Section , has a small number of parameters, is not difficult to use, and appears to correctly predict, at least qualitatively, the performance of algorithms designed on the model. See Chapter  for preliminary implementation experiments of an EM-CGM Samplesort algorithm.

The EM-BSP Model

We first extend the BSP model to include secondary local memories. The basic idea is illustrated in Figure : each processor has, in addition to its local memory, an external memory in the form of a set of hard disks.

System Model

We apply this idea to extend the BSP model to its external memory version, EM-BSP, by adding the following to the standard BSP parameters:

M is the local memory size of each processor;

D is the number of disk drives of each processor;

B is the transfer block size of a disk drive; and

G is the ratio of local computational capacity (number of local computation operations) divided by local I/O capacity (number of blocks of size B that can be transferred between the local disks and memory) per unit time.

In this thesis we will use the parameter v to refer to the number of processors of a BSP, and the parameter p to refer to the number of processors of an EM-BSP.

In many practical cases, all processors have the same number of disks. We restrict ourselves to that case, although the model does not forbid different numbers of drives and memory sizes for each processor. We denote the disk drives of each processor by D_1, D_2, ..., D_D. Each drive consists of a sequence of tracks, consecutively numbered starting with 0, which can be accessed by direct (random) access using their unique track number. A track stores exactly one block of B records.

Each processor can use all of its D disk drives concurrently, and transfer D · B items between the local disks and its local memory in a single I/O operation, at cost G. In such an operation, we permit only one track per disk to be accessed, without any restriction on which track is accessed on each disk. We assume that a processor can store in its local memory at least one block from each local disk at the same time, i.e., M ≥ DB.

Cost Model

Like a computation on the BSP model, the computation on the EM-BSP model proceeds in a succession of supersteps. We adapt communication and computation supersteps from the BSP model, and allow multiple I/O operations during a single computation superstep. For the EM-BSP model, the computation cost t_comp and communication cost t_comm are the same as for the BSP model. The total cost of each superstep is defined as t_comp + t_comm + t_IO + L. The term t_IO is the additional I/O cost charged for the superstep, where t_IO = max_{1≤j≤p} {w_j^IO} and w_j^IO is the I/O cost incurred by processor j. Recall that each I/O operation costs G time units.

Note that the model gives incentives to access all disk drives using block transfers. For instance, a single-processor EM-BSP with D disks is capable of transferring a block of B items to or from each disk in a single I/O operation; an operation involving fewer elements incurs the same cost.
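The per-superstep accounting can be sketched as follows; the operation counts and parameter values are hypothetical.

```python
def em_bsp_superstep_cost(t_comp, t_comm, io_ops_per_proc, G, L):
    """EM-BSP superstep cost t_comp + t_comm + t_IO + L, where
    t_IO = G * max_j (parallel I/O operations issued by processor j);
    each I/O operation moves up to D*B items (one block per disk)."""
    t_io = G * max(io_ops_per_proc)
    return t_comp + t_comm + t_io + L

# Three processors issuing 5, 7, and 6 I/O operations in one superstep,
# with G = 50 and L = 100: cost is 1000 + 200 + 50*7 + 100 = 1650.
print(em_bsp_superstep_cost(1000, 200, [5, 7, 6], G=50, L=100))
```

Because only the slowest processor's I/O count matters, balancing I/O across processors is rewarded in the same way the BSP term g·η rewards balanced communication.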

The EM-BSP* Model

The EM-BSP* is derived from the BSP* model in a similar way to how the EM-BSP model is formed from BSP. We add to the BSP* model the additional parameters M (local memory size at each processor), D (number of disk drives at each processor), B (transfer block size for a local disk drive), and G (ratio of local computational capacity to local I/O capacity).

In this thesis we will use the parameter v to refer to the number of processors of a BSP*, and the parameter p to refer to the number of processors of an EM-BSP*.

The cost of each EM-BSP* superstep is defined as t_comp + t_comm + t_IO + L, where t_comp and t_comm refer to the computation time and communication time as defined for the BSP* model, and t_IO is defined as indicated above for the EM-BSP model.

We treat the EM-BSP as a special case of EM-BSP*, differing only in its accounting of the costs of short messages.

The EM-CGM Model

The EM-CGM model has the parameters of the CGM, plus the following additional parameters: M (local memory size at each processor), D (number of disk drives at each processor), B (transfer block size for a local disk drive), and G (ratio of local computational capacity to local I/O capacity).

In this thesis we will use the parameter v to refer to the number of processors of a CGM, and the parameter p to refer to the number of processors of an EM-CGM.

As in the CGM, the cost of each communication superstep is identical, and reflects the cost of each processor sending and receiving N/p data items in each communication superstep. We treat the EM-CGM as a special case of EM-BSP*, differing only in this constraint on communication volume. We use the full set of parameters N, p, g, L, M, D, B, G, as used in the EM-BSP*, for analysis purposes of the EM-CGM.

Goodness Criteria for Parallel EM Algorithms

Extending the c-optimality Criterion

Since even small multiplicative constant factors in runtime are important, we characterize the performance of an EM-BSP algorithm by comparing its runtime with an optimal sequential or PRAM algorithm. We measure ratios between runtimes on pairs of models that have the same set of local instructions, and adapt the optimality criteria proposed earlier.

Definition: Let A be the best sequential algorithm on the RAM for a problem P of size N items, and let T(A) be its worst case runtime. Let c be a constant. A c-optimal EM-BSP algorithm A* for p processors meets the following criteria (all asymptotic bounds are with respect to the problem size N):

1. The ratio between the computation time of A* and T(A)/p is in c + o(1).

2. The ratio between the communication time of A* and the computation time T(A)/p is in o(1).

3. The ratio between the I/O time of A* and the computation time T(A)/p is in o(1).

Note that the constraint on I/O differentiates this definition from the corresponding one for non-external-memory parallel algorithms.

We shall say that an EM-BSP algorithm is one-optimal if it fulfils the requirements of the above definition for c = 1.

Alternative Goodness Criteria for EM-BSP Algorithms

The simultaneous satisfaction of the constraints c + o(1), o(1), and o(1) is a stringent requirement to place on an EM-BSP algorithm. It is not always possible to show that the running time of an algorithm is dominated by its computation time, for instance.

A relaxation of this requirement leads to the notion of balancing the costs of computation, communication, and I/O, so that none of them dominates the running time, and the work done in each is asymptotically the same as the work of an optimal sequential algorithm.

We will use the terms work-optimal, communication-efficient, and I/O-efficient to describe an algorithm for which the corresponding ratios above are O(1). An algorithm which is work-optimal, communication-efficient, and I/O-efficient is therefore one whose running time complexity is no worse than the complexity T(A)/p; constant factors are ignored.

For some problems, such as searching a large data structure on disk, the cost of I/O may necessarily dominate the computation and communication time. In this case we require that the algorithm be I/O-optimal, meaning that the number of I/O operations matches the lower bound for the number of I/Os required to solve the problem. We may still be able to show that the computation ratio is c + o(1) and the communication ratio is o(1), but if not, we require that the algorithm be work-optimal and I/O-efficient.
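These criteria can be phrased as simple predicates on measured costs. The sketch below approximates the asymptotic o(1) conditions by a fixed ratio threshold (`slack`), which is an illustrative simplification rather than part of the definitions.

```python
def goodness(comp, comm, io, T_seq, p, c=1.0, slack=0.05):
    """Approximate check of the c-optimality conditions on measured costs:
    computation within c (+ slack) of T_seq/p, and communication and I/O
    each a vanishing (here: < slack) fraction of T_seq/p."""
    base = T_seq / p
    return {
        "c_optimal_comp": comp / base <= c + slack,
        "comm_efficient": comm / base < slack,
        "io_efficient": io / base < slack,
    }

# Hypothetical measurements: sequential time 8e6 on p = 8 processors.
print(goodness(comp=1.02e6, comm=3e4, io=4e4, T_seq=8e6, p=8))
```

Here all three checks pass: computation is within 2% of T_seq/p, and communication and I/O are each well under the 5% threshold.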

Chapter

EM-BSP Algorithms from Simulation

Overview of the Method

The purpose of this chapter is to describe, in general terms, how a suitable BSP algorithm can be executed as an EM-BSP algorithm on a machine with multiple disks. We will call this simulating a parallel algorithm as an external memory algorithm. In essence, the simulation is of the message passing that occurs in the parallel algorithm. This approach reflects Dijkstra's observation that, in an abstract sense, communication between processes in space (e.g., executing on the processors of a BSP computer) is no different than communication in time (in this case, by passing messages between tasks via the disks of an EM-BSP computer). At a less abstract level, however, the simulation of a BSP algorithm for this purpose involves major changes to the way that communication between processes is performed, and certain constraints on the parameters of the problem.

The main ideas of the mapping between a suitable BSP algorithm A and a corresponding external memory algorithm A* are as follows:

1. A communication operation between processors in A is replaced by one or more output operations to disk, and corresponding input operations from disk, in A*.

2. We choose values for the parameters of A that ensure that I/O in A* can be done blockwise.

3. To accommodate multiple disks on the target machine, we choose techniques and parameter values which allow us to use all of the disks in parallel; and

(Recall that we treat BSP* and CGM algorithms as particular types of BSP algorithms.)


Figure : The memory of a virtual processor. The program code is assumed to be of constant size; in many practical cases it is the same for all virtual processors, and therefore a single copy is sufficient. The context is of size μ and consists of the memory portion which is not read-only. The message area is of size γ and is contained in the context.

4. To accommodate multiple processors on the target machine, we ensure that the communication between processors in A is balanced, so that no real processor receives or sends much more data than any other.

The execution of a BSP algorithm proceeds as a series of compound supersteps, and can therefore be simulated by repeated application of the simulation steps for a single compound superstep. Message send/receive operations are modeled by disk read/write operations.

We adopt the following terminology. The v processors of the BSP machine will be called virtual processors. Each communication superstep will be divided into a sending superstep and a receiving superstep. During a sending superstep, messages are generated, and during a receiving superstep they are received. A compound superstep is composed of a receiving, a computation, and a sending superstep. The context of a virtual processor is the local memory it uses, and the context size of a virtual processor is the maximum size of its context used during the computation. The maximum context size over all virtual processors is denoted μ; the maximum size of the data communicated by a virtual processor is represented by γ, and γ = O(μ) (see Figure ).

Outline of the simulation for a compound superstep: A compound superstep for the v virtual processors of a BSP machine is simulated by performing the following steps in a round-robin fashion, for k virtual processors at a time (k ≤ v):

1. Fetching phase: Read the contexts and the messages to be received by the current virtual processors from disk into memory.

2. Computation phase: Perform the computations indicated by the BSP algorithm for these k virtual processors in this compound superstep.

3. Writing phase: Save the current contexts and the messages sent by the current virtual processors on disk.

The Fetching phase (respectively Computation phase, Writing phase) performs the operations necessary to simulate the receiving superstep (respectively computation superstep, sending superstep) of a compound superstep. A compound superstep produces messages which are received in the following compound superstep. The simulation must store the generated messages on disk in such a way that they can be fetched efficiently during the Fetching phase of the next compound superstep. By efficiently we mean, in this context, that both input and output operations are fully blocked to the disk block size B, and that, if parallel disks are present, they are utilized in parallel (i.e., for D parallel disks, input and output operations are performed D blocks at a time). These requirements can easily be met for the contexts, as we know their maximum size and can preallocate a dedicated area for each, spread across the D disks. For the generated message traffic, however, these requirements may be more difficult to meet, as we may not know the communication pattern for a particular compound superstep.

The communication time of a BSP superstep is O(gγ + L), where γ is the maximum size of the data communicated by a virtual processor. Recall that μ is the maximum context size of a virtual processor; hence γ = O(μ). In the following, the context and messages generated by a virtual processor are divided into blocks, and the blocks are spread evenly over the available disks.

Algorithm SeqCompoundSuperstep simulates a single compound superstep of a v-processor BSP on a uniprocessor EM-BSP machine. We simulate k ≤ v virtual processors at a time, and we refer to such a collection of processors as a batch. We also use this term to refer to the messages associated with a batch of processors. Note that μ ≤ M. To maximize the use of available memory, we choose k = ⌊M/μ⌋. The following outlines the basic ideas of the algorithm; the details are presented in subsequent chapters.

Algorithm SeqCompoundSuperstep

Objective: Simulation of a compound superstep of a v-processor BSP on a single-processor EM-BSP with D disks.

Input: For each i, 0 ≤ i ≤ ⌈v/k⌉ − 1, the blocks of the contexts and arriving messages of batch i (i.e., virtual processors ik, ..., (i + 1)k − 1) are spread over the D disks in a parallel format.

Output: (i) The changed contexts of the k simulated processors are spread across the disks in striped format. (ii) The messages generated during the compound superstep are grouped by destination into ⌈v/k⌉ batches, and each batch is stored in a parallel format on the disks.

1. for i = 0 to ⌈v/k⌉ − 1:

(a) Read the contexts V_ik, ..., V_(i+1)k−1 of the k virtual processors of batch i from the disks into memory.

(b) Read the packets received by the k virtual processors of batch i from the disks.

(c) Simulate the local computation of the k virtual processors of batch i.

(d) Write the packets which were sent by the k virtual processors to the D disks.

(e) Write the changed contexts V_ik to V_(i+1)k−1 back to the D disks.

2. Reorganize the blocks containing the generated messages into consecutive format for each batch of k processors, so that they can be accessed in parallel from the disks in the simulation of the next compound superstep.

End of Algorithm

Algorithm SeqCompoundSuperstep simulates a single compound superstep of a BSP algorithm. Steps (a) and (b) correspond to the Fetching Phase, Step (c) is the Computation Phase, and Steps (d) and (e) comprise the Writing Phase.

Details of Steps (a) and (e): Since we know the size of the contexts of the processors, and since the order in which we simulate the virtual processors is static during the simulation, we can distribute the k contexts deterministically. We reserve an area of total size vμ on the disks (⌈vμ/(BD)⌉ blocks on each disk), where we store the contexts. We split the context V_j of virtual processor j into blocks of size B, and store the i-th block of V_j on disk (i + j⌈μ/B⌉) mod D, using track ⌊(i + j⌈μ/B⌉)/D⌋. Since the context of each processor is now in consecutive format on the disks, we can easily read and write the contexts of k consecutive processors using D disks in parallel for every I/O operation.
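The disk/track mapping just described can be expressed as a small helper. This is an illustrative sketch under the stated mapping (the i-th block of context V_j goes to disk (i + j⌈μ/B⌉) mod D); the function name is hypothetical.

```python
def context_block_location(i, j, mu, B, D):
    """Location of the i-th block of context V_j (context size mu, block
    size B) on D disks in consecutive format: the block's position in the
    global sequence of context blocks determines its disk and track."""
    blocks_per_context = -(-mu // B)   # ceil(mu / B) without importing math
    g = j * blocks_per_context + i     # global block index
    return g % D, g // D               # (disk number, track number)
```

For example, with μ = 4, B = 2, D = 3, the two blocks of V_0 land on disks 0 and 1 of track 0, and V_1's blocks continue on disk 2 (track 0) and disk 0 (track 1), so any D consecutive blocks occupy D distinct disks.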

Details of Step (b): The details of this step vary slightly depending on the technique used: the randomized simulation or one of the deterministic simulations described in subsequent chapters. In all cases, however, Step 2 of the previous compound superstep guarantees that the blocks which contain the messages destined for the current processors are stored in a reserved area, evenly distributed over the disks. Therefore we can use a similar technique to fetch the messages as we used to fetch the contexts.

Details of Step (d) and Step 2: Writing the messages to the parallel disks (Step (d)) is more difficult than writing the contexts to the disks. This is because we must ensure that in each I/O operation a single block is written to each disk in parallel, or that, asymptotically, D disks receive a block of data. We use the fact that we know the size of the contexts and that they are all the same size; this makes the swapping of contexts rather straightforward. However, we do not know the size of interprocessor messages in general. They may be of widely differing sizes, from zero items to M items. In the single real processor case, the messages can be delivered by sorting. However, we are ultimately interested in an algorithm that works for a multiprocessor target machine.

In addition to writing the messages to the disks in parallel, we must also arrange that in Step (b) of the next superstep we can read the messages intended for each processor using all of the disks in parallel. This is the responsibility of Step 2. Steps (d) and 2 vary depending on the technique used:

In Chapter , a BSP algorithm is assumed as input. Step (d) involves distributing the communication packets of the BSP randomly among the available disks. Step 2 then requires several passes over the data to prepare it for Step (b) of the next compound superstep.

In Chapter , Step (d) involves deterministically splitting the communication from each batch of k processors into v pieces, which are written into fixed locations on the disks. Step 2 in this case is not required.

In Chapter , another deterministic technique is presented. Step (d) involves ⌈1/ε⌉ routing stages, for 0 < ε < 1. In the first ⌈1/ε⌉ − 1 stages, the processors deterministically split their outgoing message data into v^ε pieces, and the pieces are distributed evenly over the processors of v^ε groups. In the final routing stage, the method of Chapter is used. Step 2 is not required.

We now generalize the simulation to p > 1 processors on the target EM-BSP machine.

We first describe the simulation of a compound superstep of a v-processor BSP with communication time gλ + L, computation time τ + L, and context size μ, on a p-processor EM-BSP, for p > 1. We assume that b ≥ B, where b is the message block size of the BSP virtual machine.

Outline of the parallel simulation: As an initial step, v/p virtual processors, numbered iv/p, ..., (i + 1)v/p − 1, are assigned to each simulating processor i.¹ As before, the simulation of each compound superstep is composed of a series of rounds. During each of v/(pk) rounds, a real processor simulates the steps of k virtual processors. Messages sent between virtual processors on different real machines require real communication by the simulation. If these messages are sent directly to their destinations, the traffic may be unbalanced, causing inefficiencies in communication. We describe the techniques used to deal with this problem below.

¹ This assignment is different for the deterministic simulation described in Chapter . See below.

We maintain v/(pk) batches to store the generated messages. The j-th batch contains the messages destined for the virtual processors simulated in the j-th round of the current compound superstep. Each processor writes its share of the v/(pk) batches to its local disks.
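The batch bookkeeping can be sketched as follows, assuming, per the algorithm's input specification below, that batch j holds the messages for virtual processors jkp, ..., (j + 1)kp − 1. Names are hypothetical.

```python
def batch_of(dest_vp, p, k):
    """Batch index of a destination virtual processor: batch j holds the
    messages for virtual processors j*k*p, ..., (j+1)*k*p - 1."""
    return dest_vp // (k * p)

def group_into_batches(messages, v, p, k):
    """messages: list of (dest_vp, payload) pairs. Returns the v/(p*k)
    batches into which the generated traffic is split for the next
    compound superstep."""
    batches = [[] for _ in range(v // (p * k))]
    for dest, payload in messages:
        batches[batch_of(dest, p, k)].append((dest, payload))
    return batches
```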

Algorithm ParCompoundSuperstep

Objective: Simulation of a compound superstep of a v-processor BSP on a p-processor EM-BSP.

Input: For every batch j, where 0 ≤ j ≤ v/(pk) − 1, the message and context blocks of the virtual processors jkp, ..., (j + 1)kp − 1 are divided among the real processors and their local disks as follows:

- each real processor holds O(kλ/B) blocks of messages and O(kμ/B) blocks of context;

- each local disk contains O(kλ/(BD)) blocks of messages and O(kμ/(BD)) blocks of context.

Output: The changed contexts and generated messages, distributed as required for the next compound superstep.

1. For j = 0 to v/(pk) − 1 do:
   For i = 0 to p − 1 do in parallel:

   (a) Fetching Phase: processor i reads the contexts and message blocks pertaining to batch j from its local disks in parallel.

   (b) Computing Phase: processor i simulates the computation supersteps of its current virtual processors and collects all generated messages in its local memory.

   (c) Writing Phase: processor i writes the updated contexts and message blocks generated by batch j to its local disks.

2. In a series of communication supersteps, the real processors exchange and reorganize the message data generated in the previous computation rounds.

End of Algorithm
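A sequential Python emulation of the parallel loop may clarify the round structure. It is a sketch only: the p "real processors" run in a Python loop rather than in parallel, the Step-2 reorganization is reduced to appending to per-destination lists, and all names are hypothetical.

```python
def par_compound_superstep(contexts, inboxes, compute, p, k):
    """Emulate Algorithm ParCompoundSuperstep sequentially: v/(p*k) rounds;
    in each round every 'real processor' i simulates k of its own virtual
    processors (consecutive-label assignment: i*(v/p) + j*k onwards)."""
    v = len(contexts)
    rounds = v // (p * k)
    outboxes = [[] for _ in range(v)]        # stand-in for batched disk storage
    for j in range(rounds):
        for i in range(p):                   # 'in parallel' over real processors
            base = i * (v // p) + j * k
            for vp in range(base, base + k):
                contexts[vp], sent = compute(contexts[vp], inboxes[vp])
                for dest, msg in sent:
                    outboxes[dest].append(msg)   # later reorganized into batches
    return contexts, outboxes
```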

Many of the details of Algorithm ParCompoundSuperstep are similar to ones previously described for Algorithm SeqCompoundSuperstep, but there are differences between the three techniques presented in Chapters , and . We outline the main differences below.

In the randomized approach of Chapter :

The message blocks of the BSP are distributed randomly among the real processors.

In a separate round of communication, the message blocks are sent to their correct destinations.

This allows us to ensure that the real processors both send and receive the same amount of data, with high probability.

In concept, these two communication rounds can be performed for each batch as part of Step 2 of ParCompoundSuperstep. Instead, however, we distribute the message blocks randomly for each round at the end of Step (c), and send them to their proper destination as the first part of Step (a). See Chapter for details.
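The two communication rounds can be sketched as follows. This is an illustrative model of the idea (random intermediate destination, then forwarding), not the thesis's code; names are hypothetical.

```python
import random

def randomized_delivery(packets, p, rng=random):
    """Two-stage delivery: stage 1 sends each packet to a randomly chosen
    intermediate processor; stage 2 forwards it to its true destination.
    This balances per-processor send/receive volume with high probability."""
    stage1 = [[] for _ in range(p)]
    for dest, payload in packets:
        stage1[rng.randrange(p)].append((dest, payload))  # random intermediate
    stage2 = [[] for _ in range(p)]
    for held in stage1:
        for dest, payload in held:
            stage2[dest].append(payload)                  # forward to destination
    return stage2
```

Whatever the random intermediate choices, every payload still ends up at its intended destination after the second round.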

In the deterministic approach of Chapter , each round of communication between virtual processors is transformed into two rounds, where every message is of size N/v. This ensures that the communication between real processors is balanced; i.e., every real processor sends and receives the same amount of data in each round.

The deterministic approach of Chapter requires less communication volume and lower slackness than the method of Chapter . This permits the internal memory to be reduced relative to the technique of Chapter . Here the virtual processors are assigned to real processors in a different manner than in the previous cases. Whereas we had previously assigned a range of virtual processors with consecutive labels to each real processor, we now assign the virtual processors to real processors in a round-robin fashion. Our routing algorithm computes an ordering over all of the blocks to be sent in each round, and then sends consecutive blocks in the ordering to virtual processors with consecutive labels. We can show that each real processor then receives the same amount of data, within a constant factor, in every communication round.
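The contrast between the two assignments can be made concrete. A sketch with hypothetical names; the balance check illustrates why consecutive blocks in the computed ordering land evenly on real processors under the round-robin assignment.

```python
def block_owner(vp, v, p):
    """Consecutive-label assignment: processor i simulates
    virtual processors i*(v/p), ..., (i+1)*(v/p) - 1."""
    return vp // (v // p)

def round_robin_owner(vp, p):
    """Round-robin assignment: virtual processor labels are dealt out
    cyclically, so vp is simulated by real processor vp mod p."""
    return vp % p
```

Under the round-robin assignment, any run of consecutive virtual-processor labels is spread over the real processors with counts differing by at most one, which is the balance property the routing argument relies on.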

Overview and Rationale

In Chapter we describe a randomized approach which allows us to efficiently write the messages to disk and efficiently retrieve them in the next compound superstep. We show in Theorems and that if the simulation is performed on a c-optimal BSP algorithm, the resulting EM algorithm is c-optimal with high probability. The resulting EM-BSP algorithm (in particular, EM-BSP*) acquires its property of blocked I/O from the attribute of blocked communication present in the original BSP* algorithm from which it was derived. The techniques described in Chapter ensure that this property of blocked communication between tasks is preserved and that, in addition, I/O is done in parallel if multiple disks are present.

In Chapter we describe a deterministic simulation technique for producing an EM-BSP (in particular, an EM-CGM) algorithm from a particular type of BSP algorithm (in particular, a CGM algorithm). This approach requires that each virtual

processor generate at least v·b items of communication data in every superstep, where the parameter b represents the desired communication block size. This allows us to establish tight upper and lower bounds on the message size to every processor, and allows a simple, efficient, deterministic simulation.

In Chapter we describe another deterministic approach, which also generates an EM-CGM algorithm from a CGM algorithm. It requires less communication volume than the method of Section : each virtual processor need only generate v^ε · b items of communication data in every superstep, where ε (0 < ε < 1) is a constant. The simulation requires a constant number of rounds for each communication round of the CGM algorithm, but this is a larger constant than in Section .

In order to assist the reader in understanding the arguments of this paper, a summary of notation and an index of terms can be found beginning on page .

Chapter

Randomized Simulation

Single Processor Target Machine

We will now describe the details of the BSP to EM-BSP simulation. In the following we use the terms consecutive format, striped format, and parallel format, which were discussed in Section . We will also refer to another type of parallel format, linked format, which will be described below.

The communication time of a BSP superstep is O(g⌈λ/b⌉ + L), where λ is the maximum size of the data communicated by a virtual processor. Recall that μ is the maximum context size of a virtual processor; hence λ = O(μ). In the following, the context and messages generated by a virtual processor are divided into blocks, and the blocks are spread in consecutive format over the available disks.

Algorithm SeqCompoundSS simulates a single compound superstep of a v-processor BSP on a uniprocessor EM-BSP machine with internal memory of size M and D disks. We simulate k virtual processors at a time, and we refer to such a collection of processors, or to the messages they generate, as a batch. Of course, kμ ≤ M.

Algorithm SeqCompoundSS

Objective: Simulation of a compound superstep of a v-processor BSP on a single-processor EM-BSP with D disks.

Input: For each i, 0 ≤ i ≤ ⌈v/k⌉ − 1, the blocks of the contexts and arriving messages of group i (i.e., virtual processors ik, ..., (i + 1)k − 1) are spread over the D disks in consecutive format, so that they can be accessed in parallel.

Output: (i) The changed contexts of the k simulated processors are spread across the disks in consecutive format. (ii) The messages generated during the compound superstep are grouped by destination into ⌈v/k⌉ groups, and each group is stored in consecutive format on the disks.

1. for i = 0 to ⌈v/k⌉ − 1:

(a) Read the contexts V_ik, ..., V_(i+1)k−1 of the k virtual processors of group i from the disks into memory.

(b) Read the packets received by the k virtual processors of group i from the disks.

(c) Simulate the local computation of the k virtual processors of group i.

(d) Write the packets which were sent by the k virtual processors to the D disks.

(e) Write the changed contexts V_ik to V_(i+1)k−1 back to the D disks.

2. Perform SimulateRouting: Reorganize the blocks containing the generated messages into consecutive format for each group of k processors, so that they can be accessed in parallel from the disks in the simulation of the next compound superstep.

End of Algorithm

Algorithm SeqCompoundSS simulates a single compound superstep of the BSP algorithm. Steps (a) and (b) correspond to the Fetching Phase, Step (c) is the Computation Phase, and Steps (d) and (e) comprise the Writing Phase.

Details of Steps (a) and (e): Since we know the size of the contexts of the processors, and since the order in which we simulate the virtual processors is static during the simulation, we can distribute the k contexts deterministically. We reserve an area of total size vμ on the disks (⌈vμ/(BD)⌉ blocks on each disk), where we store the contexts. We split the context V_j of virtual processor j into blocks of size B, and store the i-th block of V_j on disk (i + j⌈μ/B⌉) mod D, using track ⌊(i + j⌈μ/B⌉)/D⌋. Since the context of each processor is now in consecutive format on the disks, we can read and write the contexts of k consecutive processors using D disks in parallel for every I/O operation.

Details of Step (b): Step 2 of the previous compound superstep guaranteed that the blocks which contain the messages destined for the current processors are stored in a reserved area, evenly distributed over the disks. Therefore we can use a similar technique to fetch the messages as we used to fetch the contexts.

Details of Step (d): As described in Chapter , managing the generated message traffic may be more difficult than the management of the contexts, as we may not know the communication pattern for a particular compound superstep. After the Computation Phase, all messages sent by the current group of k processors in the current compound superstep have been generated and stored in internal memory. The coarse-grained nature of the BSP algorithm results in large messages, which are as long as or longer than the block size B. We cut the messages into blocks of size B. Each block is labelled with the destination address from its original message. In ⌈kλ/(BD)⌉ rounds, we write the blocks out to the disks. In each round, a group of D blocks b_0, ..., b_{D−1} is written in parallel to the disks by choosing a random permutation π of {0, ..., D − 1} and writing block b_i to disk π(i).
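One write cycle can be sketched as follows, assuming the blocks are already labelled with their destination addresses; the name `write_round` is hypothetical.

```python
import random

def write_round(blocks, D, rng=random):
    """One write cycle of Step (d): write up to D labelled blocks, one per
    disk, by drawing a random permutation of the disk numbers and sending
    block b_i to disk perm[i]."""
    assert len(blocks) <= D
    perm = list(range(D))
    rng.shuffle(perm)                 # random permutation of {0, ..., D-1}
    placement = {}                    # disk number -> block written there
    for i, blk in enumerate(blocks):
        placement[perm[i]] = blk
    return placement
```

Because the disk numbers form a permutation, no disk ever receives more than one block in a cycle, which is exactly the fully parallel write the simulation needs.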

The blocks are partitioned into D buckets on the disks, depending on their destination address. Each bucket contains the blocks destined for v/D consecutive virtual processors.

In order to maintain the buckets, the simulation uses a table of D pointers on each disk. The i-th entry in the table on a disk points to the head of a list of blocks of bucket i that have been written to that disk. Whenever we write a block of bucket i to disk D_j, we allocate a free track on D_j and concatenate it to the list for bucket i. For convenience, we shall refer to the format just described for the blocks in a bucket as linked format.
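The linked format can be modelled with per-disk track lists and a table of bucket heads. This is an in-memory sketch with hypothetical names; real tracks and free-track allocation are simplified to list indices.

```python
class LinkedBuckets:
    """Per-disk table of D bucket heads: appending a block of bucket i on
    disk d allocates the next free track on d and links it onto the head
    of d's list for bucket i."""
    def __init__(self, D):
        self.D = D
        self.tracks = [[] for _ in range(D)]          # disk -> allocated tracks
        self.heads = [[None] * D for _ in range(D)]   # disk -> bucket -> head

    def append(self, disk, bucket, block):
        track = len(self.tracks[disk])                # allocate a free track
        self.tracks[disk].append((bucket, block, self.heads[disk][bucket]))
        self.heads[disk][bucket] = track              # new head of the chain

    def read_bucket(self, bucket):
        """Collect a bucket's blocks, one chain per disk; the D chains
        could be traversed in parallel, one block per disk per I/O."""
        out = []
        for d in range(self.D):
            t = self.heads[d][bucket]
            while t is not None:
                _, block, prev = self.tracks[d][t]
                out.append(block)
                t = prev
        return out
```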

Clearly, we can read all the blocks composing a bucket stored on D disks in the linked format in O(kλ/(BD)) parallel I/O operations, provided each disk contains the same number of blocks. In Lemma A we will show that, with high probability, the blocks of each bucket are uniformly distributed over the disks.

Algorithm SimulateRouting provides the details of Step 2 of SeqCompoundSS.

Algorithm SimulateRouting

Objective: Reorganize the blocks of messages from the previous computation superstep into consecutive format (Step 2 of Algorithm SeqCompoundSS).

Input: The D buckets of messages, stored on the disks in linked format.

Output: The ⌈v/k⌉ groups of messages, stored on the disks in consecutive format.

1. Allocate space for a copy of bucket i on disk i, for 0 ≤ i ≤ D − 1. Within each copy, reserve space for each of its groups. Read the buckets from the disks in parallel and write them back, one bucket per disk. For the j-th parallel read/write we perform the following:

   for d = 0 to D − 1 in parallel do:
   Read block b_d, belonging to bucket d, from disk (d + j) mod D. Write block b_d to disk d, on the next available track for the group to which it is addressed.

2. From each disk in parallel, read a block of each bucket, writing the blocks back to the disks so that each bucket is in consecutive format:

   for j = 0 to ⌈vλ/(BD)⌉ − 1:
   for d = 0 to D − 1 in parallel do:
   read the j-th block from disk d and write it to disk (d + j) mod D, on track d⌈vλ/(D²B)⌉ + ⌊j/D⌋.

End of Algorithm

After Step 1 of Algorithm SimulateRouting, all messages that will be received by a group of k processors are stored in one region on the same disk. After Step 2, the blocks are stored in consecutive format. In fact, they are stored in fixed locations, like the blocks of the context (see Figure ). This is possible because we know that each virtual processor receives and sends messages of total size λ.

Lemma: Steps (a), (b), and (e) of Algorithm SeqCompoundSS have computation time O(vμ), memory kμ + BD, I/O time O((vμ/(DB))G), and disk space O(vμ/(BD)) blocks per disk.

[Figure : Reorganization in Step 2 of SimulateRouting. Before: each disk contains the message blocks for v/D processors, divided into v/(kD) groups. After: the messages for each group are spread over the D disks in consecutive format.]

Proof: We have to read and write ⌈kμ/B⌉ blocks of context (Steps (a) and (e)) for the simulation of each simulated superstep. In Step (b), the arriving messages occupy ⌈kλ/B⌉ blocks. Since in these cases the blocks are arranged on the disks in consecutive format, we can access D tracks in parallel at a time. Hence we need the following number of I/O operations for one simulated superstep:

∑_{j=1}^{⌈v/k⌉} ( 2⌈kμ/(BD)⌉ + ⌈kλ/(BD)⌉ ) = O( vμ/(BD) + vλ/(BD) + v/k ).

Since λ = O(μ), the number of I/O operations is in O(vμ/(BD)), and this implies O(vμ) additional local computation steps.

Since the memory of the simulating processor must hold the context of k virtual processors and a parallel disk track, M ≥ kμ + BD.
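The I/O accounting in the proof can be checked numerically with a small illustrative function (hypothetical name), which assumes fully blocked transfers in consecutive format, D blocks per parallel I/O, and the per-batch cost of two context transfers (read and write) plus one message read.

```python
import math

def ios_per_superstep(v, k, mu, lam, B, D):
    """Parallel I/Os for Steps (a), (b), (e) of one compound superstep:
    each of the ceil(v/k) batches reads and writes ceil(k*mu/B) context
    blocks and reads ceil(k*lam/B) message blocks, D blocks per I/O."""
    context_ios = math.ceil(math.ceil(k * mu / B) / D)
    message_ios = math.ceil(math.ceil(k * lam / B) / D)
    per_batch = 2 * context_ios + message_ios
    return math.ceil(v / k) * per_batch
```

For instance, v = 8, k = 2, μ = λ = 4, B = 2, D = 2 gives 6 parallel I/Os per batch over 4 batches, matching the O(vμ/(BD)) bound up to the additive O(v/k) rounding term.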

The blocks representing simulated message traffic are divided into buckets by the simulation. The contents of a bucket are written to the available disks in a series of write cycles. In each write cycle, at most one block is written to any disk.

Lemma: The computation time of Algorithm SimulateRouting is O(lλv), and its I/O time is O(l(λv/(BD))G), for k ≥ BD and v ≥ D log D, with probability at least 1 − e^(−l·v/D), for constant l.

Proof: For the purposes of the proof, we assume that the maximum amount of communication is required. This can be accomplished by the introduction of dummy blocks, if necessary. Each bucket contains R = vλ/(BD) blocks, since each of the D buckets contains the messages destined for v/D virtual processors.

By Lemma A, a given disk contains more than l·R/D records of a given bucket with probability at most exp(−l·R/D), for constant l. Let X denote the event that any disk contains more than l·R/D blocks of any bucket. There are D drives and D buckets, so with k ≥ BD and v ≥ D log D we have

Pr[X] ≤ D² exp(−l·R/D)
 = D² exp(−l·vλ/(BD²))
 ≤ exp(−l·vλ/(BD²) + 2 log D)
 ≤ exp(−l·v/D).

After D iterations of Step 1 in Algorithm SimulateRouting, D² blocks per bucket have been moved. Each disk contains fewer than l·vλ/(D²B) blocks of each bucket, with high probability. Thus, after l·vλ/(DB) iterations, all blocks have been moved.

In Step 2, ⌈vλ/(BD)⌉ iterations are performed. During each iteration, a parallel read and a parallel write operation are performed.

Thus the total I/O time of Algorithm SimulateRouting is O((lλv/(BD))G), and the total computation time is O(lλv), with high probability.

Lemma: A compound superstep of a v-processor BSP with computation time τ + L, communication time g⌈λ/b⌉ + L, and local memory of size μ can be simulated on a single-processor EM-BSP with D disks and internal memory of size M, in computation time vτ + O(lλv) and I/O time O(l(vμ/(BD))G), with probability at least 1 − exp(−l·v/D), for constant l, v ≥ D log D, M ≥ kμ + BD, B ≤ b, and arbitrary integer k with BD ≤ k ≤ v.

Proof: The local memory of a virtual processor is of size μ, which is sufficient for incoming messages. Allowing, in addition, for a parallel disk track in memory implies M ≥ kμ + BD memory in the EM-BSP machine.

The disk space needed by the simulation is the total context size vμ, which includes space for incoming messages. By Lemma A, the communicated data is evenly distributed over the disks, with high probability. Therefore we need in total O(vμ/(BD)) space on each disk, with high probability.

Step (c) of Algorithm SeqCompoundSS consumes vτ computation time overall. For each group of k virtual processors, ⌈kλ/b⌉ messages are generated. This adds O(vλ) computation time overall.

During Step (d) of Algorithm SeqCompoundSS, a permutation can be generated in O(D) time, so the computation time for each batch is O((kλ/(BD))·D + kλ) and the I/O time O((kλ/(BD))G). Therefore Step (d) contributes computation time O(vλ) and I/O time O((vλ/(BD))G) for the whole superstep.

By Lemma and Lemma , the computation time and I/O time, respectively, for Steps (a), (b), (e) and Step 2 are O(vμ) and O((vμ/(BD))G·l). Since λ = O(μ), these dominate the costs of Step (d), so the overall computation time is vτ + O(lvμ) and the I/O time is O((vμ/(BD))G·l), with probability at least 1 − exp(−l·v/D).

Lemma A allows us to exploit the independence of the random experiments performed during each compound superstep, in order to prove that the success probability for the entire simulation is as large as for the simulation of a single compound superstep.

Theorem: A v-processor BSP algorithm A with communication time g⌈λ/b⌉ + L, computation time τ + L, and local memory μ can be simulated as a single-processor EM-BSP algorithm A′ with computation time vτ + O(lλv) and I/O time O(l(vμ/(BD))G), with probability at least 1 − exp(−l·v/D), for suitable constant l and arbitrary integer k ≤ v, M ≥ kμ + BD, v ≥ D log D, B ≤ b.

In particular, if A is c-optimal and μG/(BD) = o(τ), the simulation results in a c-optimal EM-BSP algorithm.

Proof: Since we assume that the amount of communication each BSP processor performs per superstep is bounded by its memory size, we can conclude that the communication time of each compound superstep is bounded by g⌈μ/b⌉ + L.

From Lemma , a compound superstep of A with communication time g⌈λ/b⌉ + L and computation time τ + L can be simulated on an EM-BSP machine with D disks and local memory kμ, in computation time vτ + O(lλv) and I/O time O(l(vμ/(BD))G), with probability at least 1 − exp(−l·v/D).

The computation time required to simulate the computation steps of A is vτ. The computational overhead is O(lλv), which is asymptotically smaller than vτ for λ = o(τ) and constant l.

Lemma A allows us to exploit the independence of the random experiments performed during each compound superstep, in order to prove that the success probability for the entire simulation is as large as for the simulation of a single compound superstep. Since the worst-case runtime of a compound superstep is at most D times the average runtime, the theorem follows from Lemma A with m = v/D, x = D, and v ≥ D log D.

Multiple Processor Target Machine

In this section we generalize the simulation to p processors on the target EM-BSP machine.

We first describe the simulation of a compound superstep of a v-processor BSP with communication time g⌈λ/b⌉ + L, computation time τ + L, and context size μ, on a p-processor EM-BSP, for p > 1. We assume further that b ≥ B, where b is the message block size of the BSP virtual machine.

Outline of the parallel simulation: As an initial step, v/p virtual processors, numbered iv/p, ..., (i + 1)v/p − 1, are assigned to each simulating processor i. As before, the simulation of each compound superstep is composed of a series of rounds. During the j-th of v/(pk) rounds, processor i simulates the steps of the k virtual processors iv/p + jk, ..., iv/p + (j + 1)k − 1. Messages sent between virtual processors on different real machines require real communication by the simulation. If these messages are sent directly to their destinations by the simulation, the traffic may be unbalanced, causing inefficiencies in communication. Therefore we distribute the messages randomly among the processors after each round. After the last round of a compound superstep, each processor reorganizes the messages it has received, so that during the fetching phase of the j-th round of the next compound superstep it can read the messages destined for the virtual processors jkp, ..., (j + 1)kp − 1 in parallel from disk and direct them to the correct simulating processor.

We maintain v/(pk) batches to store the generated messages. The j-th batch contains the messages destined for the virtual processors simulated in the j-th round of the current compound superstep. Each processor writes its share of the v/(pk) batches to its local disks. The main difficulty is the efficient maintenance of the batches, so that the packets which are needed during the j-th simulation round can be read in parallel from the disks.

Algorithm ParCompoundSS is executed by each real processor i, for 0 ≤ i ≤ p − 1, for each compound superstep.

Algorithm ParCompoundSS

Objective: Simulation of a compound superstep of a v-processor BSP on a p-processor EM-BSP.

Input: For every batch j, where 0 ≤ j ≤ v/(pk) − 1, the message and context blocks of the virtual processors jkp, ..., (j + 1)kp − 1 are divided among the real processors and their local disks, so that each disk contains O(kλ/(BD)) blocks of message data and O(kμ/(BD)) blocks of context.

Output: The changed contexts and generated messages, distributed as required for the next compound superstep.

1. For j = 0 to v/(pk) − 1 do:

   (a) Fetching Phase: Processor i reads the contexts for group j from its local disks in parallel, and then reads any message blocks pertaining to group j from its local disks in parallel and sends them to the appropriate simulating processors.

   (b) Computing Phase: Processor i simulates the computation supersteps of its current virtual processors and collects all generated messages in its local memory.

   (c) Writing Phase: Processor i splits all generated messages into packets of size b (the message size) and sends each packet to a randomly chosen processor. Processor i writes the contexts for group j back to its local disks.

2. For i = 0 to p − 1 do in parallel: processor i reorganizes the received batches, using Algorithm SimulateRouting, so that each batch is evenly distributed over the local disks.

End of Algorithm

Many of the details of Algorithm ParCompoundSS are similar to ones previously described for Algorithm SeqCompoundSS, so we focus on the differences. In each round, a processor receives O(kλ/b) packets, with high probability. In Step (c), each processor cuts the packets it receives into blocks of size B. The blocks are written to the disks using a random permutation of disk numbers, as before. Depending on their destination address, the blocks are maintained in D buckets, which are divided among the disks as in Step (d) of the single-processor simulation. Each bucket contains the blocks for v/(pkD) batches. As before, a table of D pointers is maintained for each disk. As before, we can show that each disk will receive the same number of blocks of every bucket, with high probability.

Lemma: Step 2 of Algorithm ParCompoundSS can be done in O(l·λv/p) computation time and O(l(λv/(pDB))G) I/O time, with probability at least 1 − e^(−l·v/(pD)), for suitable constant l, v ≥ pD log(pD), k ≥ BD, and B ≤ b.

Proof: We begin by considering how the packets belonging to a fixed bucket are distributed among the processors by Step (c). A bucket contains the blocks/packets of v/(pkD) batches, and each batch contains kpλ/b packets. In total, for each bucket, vλ/(Db) packets are distributed randomly among the p processors. Since we have D buckets and p processors, the probability that any processor receives more than R = l·vλ/(pDb) packets for any bucket is at most exp(−l·vλ/(pDb) + ln(pD)), from Lemma A with x = vλ/(pDb) and y = pD. To bound this probability, let

(1)  vλ ≥ pDb log(pD).

From the statement of the lemma we have

(2)  v ≥ pD log(pD).

We now show that (2) implies (1). Since D ≥ 1 and B ≤ b, (2) means that v ≥ (pbB/(BD)) log(pD). Multiplying by BD and substituting, we obtain (1), as required. With constraint (1), therefore, we have probability at least 1 − exp(−l·vλ/(pDb)) that each bucket of every processor contains fewer than l·vλ/(pDb) packets. To bound this probability further, to 1 − exp(−l·v/(pD)), we need a further constraint:

vλ/(pDb) ≥ v/(pD), i.e., λ ≥ b.

Since B ≤ b, this constraint is absorbed by k ≥ BD, given in the lemma.

We now turn to the issue of whether the packets for each bucket are balanced across the local disks of each real processor. We know that each processor holds R = vλ/(pDb) packets for each bucket. For a fixed processor, we have a similar situation to the single simulating processor case.

By Lemma A, we know that a fixed drive contains fewer than l·R/D blocks of a fixed bucket with probability at most exp(−l·R/D). Let X denote the event that any disk on any processor contains more than l·R/D blocks of any bucket. There are p processors, D drives, and D buckets. With k ≥ BD we have, similarly to the proof of Lemma :

Pr[X] ≤ pD² exp(−l·R/D)
 = pD² exp(−l·vλ/(pbD²))
 ≤ exp(−l·vB/(pbD) + 2 log D + log p)
 ≤ exp(−l·vB/(pbD) + 2 log(pD)).

Now, provided that

vB/(pbD) ≥ log(pD), i.e., v ≥ pD(b/B) log(pD),

we have

Pr[X] ≤ exp(−l·vB/(pbD)),

and, for B/b a constant, we have

Pr[X] ≤ exp(−l·v/(pD)).

So, in Step 2, each processor can reorganize its packets as required of the input in Algorithm ParCompoundSS in O(l·λv/p) computation time and O(l(λv/(pDB))G) I/O time, with probability at least 1 − exp(−l·v/(pD)), provided that v ≥ pD log(pD), k ≥ BD, B ≤ b, and l ≥ 1.

Lemma: A compound superstep of a v-processor BSP with computation time τ + L, communication time g⌈λ/b⌉ + L, and local memory size μ can be simulated on a p-processor EM-BSP in computation time (v/p)τ + O(l·λv/p) + (v/p)L, communication time O(l·(λv/(pb))g) + (v/p)L, and I/O time O(l·(vμ/(pBD))G) + (v/p)L, with probability at least 1 − e^(−l·v/(pD)), for M ≥ kμ + BD, v ≥ pD log(pD), B ≤ b, BD ≤ k ≤ v/p, k ≥ b log p, and suitable constant l.

Proof: Each of the v/(pk) batches contains the packets generated by pk simulated processors. We ensure that each batch contains pkλ/b packets, by creating dummy packets if necessary. We first consider the runtime of Step 1 for a fixed round j.

Step (a): Each processor reads the blocks belonging to batch $j$ from its local disks in I/O-time $O(\frac{k\mu}{BD}G)$. The blocks destined for a common processor are combined into packets of size $b$, and all packets are then sent to their destination in a single superstep. Each simulating processor sends $O(\frac{k\mu}{b})$ message packets, since each holds $O(\frac{k\mu}{B})$ blocks of messages. Each simulating processor receives $\frac{k\mu}{b}$ packets. Thus an $O(\frac{k\mu}{b})$-relation is routed, consuming $O(\frac{k\mu}{b}g + L)$ communication time. The processors can reassemble messages from received packets in linear time, since each receives $k\mu$ data per round and sufficient real memory exists to hold all of the received messages. Each processor then reads the contexts of its $k$ currently simulated processors in I/O-time $O(\frac{k\mu}{BD}G)$ from its local disks.

Step (b): The steps of the currently simulated processors are performed, and the contexts are written back to the disks. Since $k\mu = \Omega(BD)$, the I/O-time is $l\,O(\frac{k\mu}{BD}G)$, and the computation time is $k\tau + O(lk\mu) + L$.

Step (c): During Step (b), message packets of total size $O(k\mu)$, each one of size $b$, were generated and stored in the local memories of the simulating processors.


Now each processor sends each of its $O(\frac{k\mu}{b})$ packets to a randomly chosen processor. This corresponds to randomly throwing $p\frac{k\mu}{b}$ balls into $p$ bins. With Lemma A, for $x = p\frac{k\mu}{b}$ and $y = p$, we find that each processor receives more than $l\frac{k\mu}{b}$ packets with probability at most $\exp(-l\frac{k\mu}{b} + \ln p)$.

Since these messages are exchanged in one superstep, an $O(l\frac{k\mu}{b})$-relation is routed, requiring $O(l\frac{k\mu}{b}g + L)$ communication time, with probability $1 - \exp(-l\frac{k\mu}{b} + \ln p)$. Thus each processor receives data of size $O(k\mu)$, which is written to its disks in I/O-time $l\,O(\frac{k\mu}{BD}G)$ for suitable $l$ and $b \leq B$, using randomly generated permutations of the disk numbers.

For a single round, the I/O-time is bounded by $l\,O(\frac{k\mu}{BD}G)$, the communication time is $O(l\frac{k\mu}{b}g + L)$, and the computation time is $k\tau + O(kl\mu) + L$, with failure probability at most $\exp(-l\frac{k\mu}{b} + \ln p)$. For
$$k\mu \geq b\log p$$
the failure probability becomes at most $\exp(-l\frac{k\mu}{b})$.

To bound the failure probability at $\exp(-l\frac{v}{pD})$, we require that
$$\frac{k\mu}{b} \geq \frac{v}{pD}, \qquad\text{i.e.,}\qquad k\mu \geq \frac{vb}{pD}.$$

Now we consider the runtime for all $\frac{v}{pk}$ rounds. The rounds are independent. We introduce $z = \frac{v}{pk}$ independent random variables $X_1, \ldots, X_z$, where $X_i$ represents the cost (communication, computation and I/O-time) of the $i$th round. We can apply Lemma A to bound the total communication cost of all $\frac{v}{pk}$ rounds, using the substitutions $x = p$ and $m = \frac{vb}{pD}$, provided
$$m \geq \log p, \qquad\text{i.e.,}\qquad v \geq pD\log p.$$

In total we have I/O-time $l\,O(\frac{\mu v}{pBD}G)$, communication time $l\,O(\frac{\mu v}{pb}g) + \frac{v}{pk}L$, and computation time $\frac{v}{p}\tau + O(l\frac{v}{p}\mu) + \frac{v}{pk}L$, with probability at least $1 - \exp(-l\frac{v}{pD})$.

Incorporating the costs and constraints for the packet-reorganization step from the previous lemma, its constraint is absorbed by $v \geq pD\log(pD)$. □

The next theorem states that a $c$-optimal BSP algorithm is transformed into a $c$-optimal EM algorithm by the simulation, for the general case of $p \geq 1$ simulating processors. (If each communication superstep were an $\frac{N}{v}$-relation, the constraint $k\mu \geq \frac{vb}{pD}$ would become $N \geq \frac{v^2 b}{pDk}$.)


Theorem: Let $l \geq 1$ be a constant. Let $v \geq pD\log(pD)$, $M \geq k\mu$, $\mu \geq BD$, $B \geq b$, $k \geq BD$, $k\mu \geq b\log p$, $k\mu \geq \frac{vb}{pD}$, $k \leq v$ and $k \leq \frac{v}{p}$.

A $v$-processor BSP algorithm $A$ with communication time $\frac{\mu}{b}g\lambda + \lambda L$, computation time $\tau + \lambda L$ and local memory $\mu$ can be simulated as a $p$-processor EM-BSP algorithm $A'$ with computation time $\frac{v}{p}\tau + O(l\frac{v}{p}\mu\lambda) + \frac{v}{p}\lambda L$, communication time $O(l\frac{\mu v}{pb}g\lambda) + \frac{v}{p}\lambda L$, and I/O time $O(l\frac{\mu v}{pBD}G\lambda) + \frac{v}{p}\lambda L$, with probability at least
$$1 - \exp\Big(-l\frac{v}{pD}\Big).$$

If $A$ is also a CGM algorithm, then the constraints $\mu \geq BD$, $k\mu \geq b\log p$ and $k\mu \geq \frac{vb}{pD}$ can be replaced by $N \geq vBD$, $N \geq vB\log p$ and $N \geq \frac{v^2B}{pD}$, respectively.

If $A$ is $c$-optimal on the BSP for $g = g(n)$, $b = b(n)$, $L = L(n)$ and $v = v(n)$, then $A'$ is a $c$-optimal EM-BSP algorithm for $g = g(n)$, $b = b(n)$, $L = L(n)$ and $G = o\big(\frac{pk}{v}BD\big)$.

Proof: Since we assume that the amount of communication each BSP processor performs per superstep is bounded by its memory size $\mu$, we can conclude that the communication time of each compound superstep is bounded by $\frac{\mu}{b}g + L$.

From the lemma above, each compound superstep of $A$ with computation time $\tau + L$, communication time $\frac{\mu}{b}g + L$ and local memory size $\mu$ can be simulated on a $p$-processor EM-BSP in computation time $O(l\frac{v}{p}\tau) + \frac{v}{p}L$, communication time $O(l\frac{\mu v}{pb}g) + \frac{v}{p}L$, and I/O-time $O(l\frac{\mu v}{pBD}G) + \frac{v}{p}L$, with probability at least $1 - e^{-l\frac{v}{pD}}$, for suitable constant $l$, $M \geq k\mu$, $v \geq pD\log(pD)$, $B \geq b$, $k\mu \geq b\log p$, $k \leq v$ and $k \leq \frac{v}{p}$.

The computation time required to simulate the computation steps of $A$ is $\frac{v}{p}\tau$. The computational overhead is $\lambda$ times the overhead of a single superstep, or $O(l\frac{v}{p}\mu\lambda)$, which is asymptotically smaller than $\frac{v}{p}\tau$ for $l\mu\lambda = o(\tau)$ and constant $l$.

The lemma also showed that the communication between real processors within each communication superstep is $l\,O(\frac{\mu v}{pb})$ packets, and so the communication for $A'$ is $l\,O(\frac{\mu v}{pb}\lambda)$ packets.

The I/O volume increases by a factor of $\lambda$ over that for a single compound superstep, to $l\,O(\frac{\mu v}{pBD}\lambda)$ I/Os.

Lemma A allows us to exploit the independence of the random experiments performed during each compound superstep in order to prove that the success probability for the entire simulation is as large as for the simulation of a single compound superstep. Since the worst case runtime of a compound superstep is at most $D$ times the average runtime, the theorem follows from Lemma A with $m = \frac{v}{pD}$, $x = D$ and $v \geq pD\log(pD)$. □

The slackness $\frac{v}{p}$ required by the simulation is controlled by the number of processors and disks we want to employ, as well as the desired success probability. The simulation increases the number of supersteps by a factor of $\frac{v}{pk}$. However, we inherit


the good communication time from the BSP algorithm, which permits a $c$-optimal multiple-processor EM-BSP algorithm.

Chapter

Deterministic Simulation

Introduction

A randomized simulation of BSP algorithms on an EM-BSP machine was described in the previous chapter. This simulation took a BSP algorithm and produced an EM-BSP algorithm, making use of the knowledge that messages of the BSP are no smaller than a known value $b$. In this chapter we describe a deterministic simulation technique that permits a CGM algorithm to be simulated on an EM-CGM machine.

We first consider the simulation of a single compound superstep, and in particular how the contexts and messages of the virtual processors can be stored on disk and retrieved efficiently in the next superstep. We assume that the size of the contexts of the processors is known and is the same for each virtual processor. We can therefore store the blocks of the contexts on fixed tracks of the disks. The main issue is how to organize the generated messages on the $D$ disks so that they can be accessed using blocked and fully parallel I/O operations. This task is simpler if the messages have a fixed length. Although a CGM algorithm has the property that $O(\frac{N}{v})$ data is deemed to be exchanged by each processor in every superstep, there is no guarantee on the size of individual messages.

The next section describes a technique which replaces each communication superstep of a BSP algorithm (not necessarily a CGM algorithm) by two supersteps in which the messages are of uniform size.

In the sections that follow, we describe how the simulation can be accomplished on target machines with one and multiple processors, respectively.

CHAPTER DETERMINISTIC SIMULATION

Balancing Communication

Algorithm BalancedRouting gives us a technique for acquiring messages of nearly uniform size. We will show that we can derive upper and lower bounds on the message size which allow us to save and retrieve messages on the disks in an efficient and convenient manner.

A total of $n$ data is divided evenly among $v$ processors. Each item is labelled with the identifier of a destination processor, in such a way that no more than $h$ items have the same destination, where $h \geq \frac{n}{v}$. Algorithm BalancedRouting delivers these items to their destination in two rounds of communication, where no message differs in size from any other by more than $v$ items (see the theorem below).

Algorithm BalancedRouting (from Bader et al.)

Input: Each of the $v$ processors has $\frac{n}{v}$ elements, which are divided into $v$ messages, each of arbitrary length. Let $msg_{ij}$ denote the message to be sent from processor $i$ to processor $j$, and let $|msg_{ij}|$ be the length of such a message.

Output: The $v$ messages in each processor are delivered to their final destinations in two balanced rounds of communication, and each processor then contains at most $h$ data.

A. For $i = 0$ to $v-1$ in parallel:
   1. Processor $i$ allocates $v$ local bins, one for each processor.
   2. For $j = 0$ to $v-1$:
        For $\ell = 0$ to $|msg_{ij}| - 1$:
          Processor $i$ allocates the $\ell$th word of $msg_{ij}$ to local bin $(i + j + \ell) \bmod v$.
   3. Processor $i$ sends bin $j$ to processor $j$.

(Superstep Boundary)

B. For $j = 0$ to $v-1$ in parallel:
   1. Processor $j$ reorganizes the messages it received in Superstep A into bins according to each element's final destination.
   2. Processor $j$ routes the contents of bin $k$ to processor $k$, for $k = 0, \ldots, v-1$.

End of Algorithm
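The two supersteps above can be sketched in code. The Python below is illustrative only: the bin index $(i + j + \ell) \bmod v$ follows the reconstruction above (the exact indices were lost in reproduction), and the function name is ours, not the thesis's.

```python
def balanced_routing(outgoing, v):
    """Two-round balanced routing.

    outgoing[i][j] is the list of items processor i must deliver to
    processor j.  Returns received[k] = the items delivered to processor k.
    The round-robin scatter keeps first-round message sizes within v of
    one another, regardless of the original message lengths.
    """
    # Superstep A: processor i scatters every message word-by-word over
    # its v local bins, tagging each word with its final destination,
    # then sends bin j to processor j.
    inbox = [[None] * v for _ in range(v)]   # inbox[j][i] = bin j built at proc i
    for i in range(v):
        bins = [[] for _ in range(v)]
        for j in range(v):
            for l, word in enumerate(outgoing[i][j]):
                bins[(i + j + l) % v].append((j, word))
        for j in range(v):
            inbox[j][i] = bins[j]
    # Superstep B: processor j regroups the received words by final
    # destination and routes bin k to processor k.
    received = [[] for _ in range(v)]
    for j in range(v):
        for part in inbox[j]:
            for dest, word in part:
                received[dest].append(word)
    return received
```

Whatever offset is used in the scatter, correctness only requires that every word keeps its destination tag; the offset affects only how evenly the intermediate bins fill.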

Observation: If $bin_{min}$ is the smallest bin created at a processor in the allocation step of Superstep A, then the other $v-1$ bins can contain at most $1 + 2 + \cdots + (v-1) = \frac{v(v-1)}{2}$ more elements than does $bin_{min}$ (see the figure below).


Theorem: We are given $v$ processors and $n$ data items. Each processor has exactly $\frac{n}{v}$ data to be redistributed among the processors, and no processor is to be the recipient of more than $h$ data. The redistribution can be accomplished in two communication rounds of balanced communication: (A) messages in the first round are at least $\frac{n}{v^2} - \frac{v-1}{2}$ and at most $\frac{n}{v^2} + \frac{v-1}{2}$ in size, and (B) messages in the second round are at least $\frac{h}{v} - \frac{v-1}{2}$ and at most $\frac{h}{v} + \frac{v-1}{2}$ in size.

Proof: The proof of the maximum message sizes is given by Bader et al. The proof of the minimum message sizes relies on the Observation above (see the figure below). In round A each processor initially has $\frac{n}{v}$ data. At the end of Superstep B each processor will have at most $h$ data.

First we consider the minimum message size in Superstep A. Due to the round-robin allocation mechanism, a given bin after the allocation step will contain at most one more element of a message to processor $j$ than does any other bin. Let us fix any processor $i$, and consider the bin sizes after all of the messages have been distributed among the bins by processor $i$ (see the figure below). Clearly, all of the bins will contain at least as many elements as the smallest bin, $bin_{min}$. Let $e_j$ be the number of extra elements (more than this minimum) in bin $j$. The crucial observation is that if $bin_{min}$ is the smallest bin, then the other $v-1$ bins can hold at most $1 + 2 + \cdots + (v-1) = \frac{v(v-1)}{2}$ extra elements. Thus
$$v\,|bin_{min}| + \sum_j e_j \geq \frac{n}{v}.$$
Since $\sum_j e_j \leq \frac{v(v-1)}{2}$, we have
$$v\,|bin_{min}| + \frac{v(v-1)}{2} \geq \frac{n}{v},$$
$$|bin_{min}| \geq \frac{n}{v^2} - \frac{v-1}{2}.$$

We now turn to the message sizes in Superstep B. The elements which arrive at processor $j$ as a result of Superstep A are the contents of the $j$th local bins formed at each of the processors $1$ through $v$. We can think of the $j$th local bin of each of the $v$ processors as a component of a single global superbin, which is the union of the $j$th local bins of all $v$ processors. Consider only the messages destined for a fixed processor $k$ which are held by each processor $i$, $1 \leq i \leq v$, prior to Superstep A. These are allocated among the superbins, starting with superbin $(i + k) \bmod v$, by the round-robin allocation. Superbin $j$ now contains the message which is to be sent from processor $j$ to processor $k$ in Superstep B.


Figure: Illustration of maximum imbalance in the bin sizes of a processor during Superstep A, showing bin numbers $0, 1, \ldots, v-1$ with evenly placed items and extra items. The light circles represent evenly placed elements, and the dark circles are unevenly placed, or extra, ones. The maximum bin size ($bin_{v-1}$ in the diagram) is at most $\frac{v-1}{2}$ more than the average, and the minimum bin size ($bin_0$ in the diagram) is at most $\frac{v-1}{2}$ less than the average.

In a similar manner to the analysis of Superstep A, let $E_j$ be the number of extra elements in superbin $j$ after the allocation step. Let $sbin_{min}$ be the superbin which contains the minimum number of elements after the allocation step; hence $|sbin_{min}|$ represents the minimum message size in Superstep B. When processor $k$ is one of the processors which receives the maximum $h$ data elements, we have $h \leq v\,|sbin_{min}| + \sum_j E_j$, and since $\sum_j E_j \leq \frac{v(v-1)}{2}$, we have
$$|sbin_{min}| \geq \frac{h}{v} - \frac{v-1}{2}. \ \square$$
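The lower bound on the smallest bin can be checked numerically. The Python sketch below is illustrative only: it distributes one processor's messages over $v$ bins round-robin (with an arbitrary choice of message lengths and start offsets, which the bound does not depend on) and verifies $|bin_{min}| \geq \frac{n/v}{v} - \frac{v-1}{2}$.

```python
def min_bin_size(msg_lens, v):
    """Distribute each message round-robin over v bins, as in Superstep A
    at a single processor, and return the size of the smallest bin."""
    bins = [0] * v
    for j, length in enumerate(msg_lens):
        q, r = divmod(length, v)
        for b in range(v):
            bins[b] += q               # every bin gets the even share
        for l in range(r):
            bins[(j + l) % v] += 1     # the r "extra" items, one per bin
    return min(bins)

# Check the bound for one processor holding n/v items split arbitrarily.
v = 8
msg_lens = [5, 0, 17, 3, 9, 1, 12, 4]  # arbitrary message lengths
n_over_v = sum(msg_lens)
assert min_bin_size(msg_lens, v) >= n_over_v / v - (v - 1) / 2
```

Each message contributes at most one extra item to any bin, so over $v$ messages no bin can fall more than $\frac{v-1}{2}$ below the average on balance, which is exactly what the derivation above formalizes.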

The notion of an $h$-relation is often used in the analysis of parallel algorithms based on BSP-like models (e.g. BSP, BSP*, CGM). An $h$-relation is a communication superstep in which each of the $v$ processors sends and receives at most $h$ data items. It is typically used in bounding the communication complexity in an asymptotic analysis. Based on this usage of an $h$-relation, we have:

Corollary: An arbitrary $h$-relation can be replaced by two balanced $h$-relations whose message size is bounded below by $\frac{h}{v} - \frac{v-1}{2}$ and above by $\frac{h}{v} + \frac{v-1}{2}$.

Proof: In the theorem above, clearly $h \geq \frac{n}{v}$. For $h = \frac{n}{v}$ we have a message size of at most $\frac{h}{v} + \frac{v-1}{2}$ for each of the two rounds of communication. We can reproduce the precise conditions of the theorem by adding dummy items, if necessary, in Superstep A, to ensure that $h = \frac{n}{v}$.

We can assume a minimum message size of $\frac{h}{v} - \frac{v-1}{2}$ in the second round because the cost of communication is bounded by the assumption of an $h$-relation. When every processor is the destination of $h$ data it does not affect the worst case complexity of the superstep. We can therefore assume that every processor receives $h$ data, by adding dummy items for the sake of the argument. Hence the minimum message size for any processor in Superstep B becomes $\frac{h}{v} - \frac{v-1}{2}$, without affecting the asymptotic communication cost of the superstep. □

Lemma: An arbitrary minimum message size $b_{min}$ can be assured, provided that
$$N \geq v^2\Big(b_{min} + \frac{v-1}{2}\Big),$$
where $N$ is the total number of problem items, summed over the $v$ processors.

Proof: From the corollary above, we can achieve a minimum message size $b_{min}$ provided that $b_{min} \leq \frac{N}{v^2} - \frac{v-1}{2}$. □

Assurances regarding the minimum message size are particularly relevant to the BSP* model. Later in this chapter we discuss the use of BalancedRouting and the theorem above for creating BSP* algorithms from BSP algorithms.

In the next two sections we describe the simulation of CGM algorithms with the additional properties that BalancedRouting provides. Not every CGM algorithm will require balancing, but the lemma above ensures that we can obtain balanced message sizes when necessary, by increasing the number of supersteps by a factor of two.

Lemma: Let $A$ be a CGM algorithm with $N$ data, $v$ processors and $\lambda$ communication steps, where $h = \frac{N}{v}$ in every step. The communication steps of $A$ can be replaced by $2\lambda$ steps of balanced communication in which the minimum message size is $B$ and the maximum message size is $\frac{N}{v^2} + \frac{v-1}{2}$, provided that $N \geq v^2\big(B + \frac{v-1}{2}\big)$.

Proof: The minimum and maximum message sizes follow from the corollary above, with $h = \frac{N}{v}$. This is true if $N \geq v^2\big(B + \frac{v-1}{2}\big)$, and the remaining constraint is absorbed by the hypothesis of the lemma above. □

In many practical EM situations, $\frac{N}{v^2} + \frac{v-1}{2}$ will be a significant overestimate of the maximum message size, as $v \ll \frac{N}{v}$.


Single Processor Target Machine

We now turn to the actual simulation results, which rely on a message size of at most $c\frac{N}{v'^2}$ for a known constant $c$. As we have seen, this is guaranteed, where $v' = \frac{v}{k}$, by algorithm BalancedRouting (see the lemma above). Not every algorithm will require balancing; see the algorithm CGM-Transpose, presented elsewhere in this thesis, for an example of a CGM algorithm whose messages are already balanced.

For the case of a single EM processor, we simulate a compound superstep of a BSP-like (e.g. CGM) algorithm $A$ using the algorithm SeqCompoundSuperstep shown below. No real communication is required. The next lemma gives the resulting complexities for a single compound superstep, and the theorem that follows summarizes the overall simulation result for a single real processor. For ease of exposition, we assume that $k$ divides $v$.

Lemma: A compound superstep of a $v$-processor CGM algorithm $A$ with computation time $\tau + L$, communication time $g\,O(\frac{N}{v}) + L$, message size $c\frac{N}{v^2}$ for a known constant $c$, and local memory size $\mu$ can be simulated as a single-processor EM-CGM algorithm $A'$ in computation time $v\tau + O(v\mu)$ and I/O time $O(\frac{v\mu}{BD} + \frac{N}{BD})G$, provided that $M = \Omega(k\mu + BD)$, $N \geq v'BD$ and $N \geq v'^2B$, for an arbitrary integer $k \leq v$ and $v' = \frac{v}{k}$.

k

N

in size we can allo cate xed sized slots for Pro of Since messages are at most c

v

them on the disks while preserving an O v disk space requirement A requires that

N

B so N v B The assurance of minimum message size B implies

v

that IO op erations will be blo cked Although a CGM algorithm may o ccasionally

N

use smaller messages than B it is charged for an relation in each sup erstep as if

v

every pro cessor sent and received h data

We simulate a comp ound sup erstep of A using algorithm SeqComp oundSup erstep

shown below The algorithm exp ects the input messages to the virtual pro cessors in

the current sup erstep to be organized by destination in a parallel format on the

disks and it also writes the messages generated in the current sup erstep to the disks

in a parallel format We use the phrase a paral lel format to mean an arrangementof

the data that p ermits fully parallel access to the disks b oth for writing the messages

and for reading them back in a dierent order in the next sup erstep Two instances

of a parallel format are the consecutive and staggered formats A more intuitive

denition app ears in Section

Consecutive format: We say that a disk read/write operation on $D$ blocks is consecutive when the $q$th block ($0 \leq q < D$) is read/written from/to disk $(d + q) \bmod D$ on track $T + \lfloor (d + q)/D \rfloor$, where $T$ is the track used for the first of the $D$ blocks to be read/written, and $d$ is the disk offset from disk $0$ for the first of the $D$ blocks to be read/written.

Staggered format: We say that a disk read/write operation on $D$ blocks, involving $n'$ messages each of size at most $b'$ blocks, is staggered when the $q$th block ($0 \leq q < b'$) of the $j$th message ($0 \leq j < n'$) is read/written from/to disk $(d + jS + q) \bmod D$ on track $T + \lfloor (d + jS + q)/D \rfloor$, where $T$ is the track used for the first blocks of the message, $d$ is the disk offset from disk $0$ for the first of the $D$ blocks to be read/written, and $\lceil S/D \rceil$ is the number of tracks by which consecutive messages are to be staggered (separated).
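Under the reconstruction of the two definitions above, block placement is simple index arithmetic. The Python sketch below is a plausible reading rather than a verbatim transcription: $d$ is the disk offset, $T$ the base track, and $S$ the block stagger between consecutive messages, as in the definitions.

```python
def consecutive_place(q, d, T, D):
    """(disk, track) of the q-th block of a consecutive read/write."""
    return (d + q) % D, T + (d + q) // D

def staggered_place(q, j, d, T, S, D):
    """(disk, track) of the q-th block of the j-th message in a staggered
    read/write; consecutive messages start S block positions apart."""
    off = d + j * S + q
    return off % D, T + off // D

# With D = 4 disks, any D consecutively placed blocks land on D distinct
# disks, so they can be moved in a single parallel I/O operation.
D = 4
disks = [consecutive_place(q, 1, 0, D)[0] for q in range(D)]
assert sorted(disks) == list(range(D))
```

The same arithmetic shows why staggering matters: with a stagger $S$ that is not a multiple of $D$, the first blocks of consecutive messages fall on distinct disks, so one block of each of $D$ different messages can be written in a single parallel I/O.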

Algorithm SeqCompoundSuperstep simulates a compound superstep of a $v$-processor CGM on a single-processor EM-CGM with $D$ disks. It simulates $k$ processors at a time, where $k \leq v$, provided that the simulating processor has $M = \Omega(k\mu)$ memory.

Algorithm SeqCompoundSuperstep

Input: For each $j \in \{0, \ldots, v-1\}$, the blocks of the context of virtual processor $j$ are stored on the disks in consecutive format, and the arriving messages of virtual processor $j$ are spread over the $D$ disks in consecutive format.

Output: (i) The changed contexts of the $v$ simulated processors are spread across the disks in consecutive format. (ii) The generated messages, for each processor in the next superstep, are stored in consecutive format on the disks.

For $j = 0$ to $\frac{v}{k} - 1$:
  (a) Read the contexts of virtual processors $jk$ to $(j+1)k - 1$ from the disks into memory.
  (b) Read the packets received by virtual processors $jk$ to $(j+1)k - 1$ from the disks.
  (c) Simulate the local computation of virtual processors $jk$ to $(j+1)k - 1$.
  (d) Write the packets which were sent by virtual processors $jk$ to $(j+1)k - 1$ to the $D$ disks in the staggered format illustrated in the figure below. (See below for details.)
  (e) Write the changed contexts of virtual processors $jk$ to $(j+1)k - 1$ back to the $D$ disks in consecutive format.

End of Algorithm

Details of Steps (a) and (e): We reserve an area of total size $v\mu$ on the disks ($\frac{v\mu}{BD}$ blocks on each disk) where we store the contexts. We split the context $V_j$ of virtual processor $j$ into $\frac{\mu}{B}$ blocks of size $B$, and store the $i$th block of $V_j$ on disk $(i + j\frac{\mu}{B}) \bmod D$, using track $\lfloor (i + j\frac{\mu}{B})/D \rfloor$. Since the context of each processor is now in striped format on the disks, we can read and write the contexts of $k$ consecutive virtual processors using $D$ disks in parallel for every I/O operation.


Details of Step (b): The previous compound superstep guaranteed that the blocks which contain the messages destined for the current processor are stored in consecutive format on the disks. Therefore we can use a similar technique to fetch the messages as we used to fetch the contexts.

Details of Step (d): After the Computation Phase, all messages sent by the current $k$ virtual processors have been generated and stored in internal memory. The coarse-grained nature of the underlying BSP algorithm results in large messages (see the lemma above), which are as long as or longer than the block size $B$. We cut the messages into blocks of size $B$; each block inherits the destination address from its original message. In $\frac{k\mu}{BD}$ rounds we write the blocks out to the disks, as described in detail below. Recall that $\mu$ is the maximum size of the data sent or received by any virtual processor over all supersteps.

Let $b_{max}$ represent the maximum message size, and let $b'$ represent the maximum number of disk blocks per message; hence $b' = \lceil \frac{b_{max}}{B} \rceil$. Let $msg_{ij}$ represent the message sent from processor $v_i$ to processor $v_j$ in one communication superstep. We will store the messages destined for a fixed processor $j$ in consecutive format, beginning with $msg_{0j}$ and ending with $msg_{(v-1)j}$. We ensure that the first block of $msg_{0j}$ is assigned to disk $(b_{j-1} + b') \bmod D$ for $0 < j < v$, where $b_{j-1}$ is the disk number of the first block for $msg_{0,j-1}$. In other words, the starting block positions for messages to consecutive processors are appropriately staggered on the disks, to ensure that we can write blocks of messages to consecutively numbered processors in a single parallel I/O when $b' \bmod D \neq 0$.

Let $T_j = j\lceil \frac{vb'}{D} \rceil$ be the track offset for $msg_{0j}$, the first such message destined for processor $v_j$. Let $d_j = jb' \bmod D$ be the disk offset from disk $0$ for the first block of $msg_{0j}$. The $q$th block of $msg_{ij}$ is assigned to disk $(d_j + ib' + q) \bmod D$ on track $T_j + \lfloor (d_j + ib' + q)/D \rfloor$. This storage scheme maintains what we will call the messaging matrix across the parallel disks. The messages destined to a particular virtual processor are stored in a band, or stripe, of consecutive parallel tracks. See the figure below.

Outgoing message blocks are placed in a FIFO queue for servicing by procedure DiskWrite. DiskWrite removes at most $D$ blocks from the queue in each write cycle and writes them to the disks in a single write operation. Blocks are serviced strictly in FIFO order. Blocks will be added to the current write cycle and removed from the queue until a block is encountered whose disk number conflicts with that of an earlier block in the current write cycle.

Since the messages destined for any given processor are stored in consecutive format on the disks, we can read the messages received by a virtual processor using $D$ disks in parallel for every I/O operation. Except possibly for the last, each parallel read performed by the simulation of processor $v_j$ will obtain $D$ message blocks. By staggering the first message blocks for consecutive virtual processors across the disks, we can achieve fully parallel writes to the disks.
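Procedure DiskWrite itself is not listed in the text; the following Python sketch (with hypothetical names of our own) implements the stated policy: blocks are served strictly in FIFO order, and a write cycle closes after $D$ blocks or at the first disk-number conflict.

```python
from collections import deque

def disk_write_cycles(queue, D):
    """Group queued (disk, block) pairs into parallel write cycles.

    Blocks are taken strictly in FIFO order; a cycle ends when it already
    holds D blocks, or when the next block's disk number conflicts with a
    disk already used in the cycle.  Returns the list of cycles.
    """
    q = deque(queue)
    cycles = []
    while q:
        cycle, used = [], set()
        while q and len(cycle) < D and q[0][0] not in used:
            disk, block = q.popleft()
            used.add(disk)
            cycle.append((disk, block))
        cycles.append(cycle)
    return cycles
```

A well-staggered queue needs only about $\lceil \text{length}/D \rceil$ cycles, while a queue of blocks all aimed at one disk degenerates to one block per cycle — which is exactly why the messaging matrix staggers starting positions.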


Figure: Illustration of the messaging matrix over disks $0$ through $4$, showing the messages from processor $i$ to processors $j$, $j+1$ and $j+2$. In the example we have $D = 5$ disks and a fixed message size of $b'$ blocks. Messages from processor $i$ to processors $j$, $j+1$ and $j+2$ are shown as shaded rectangles. Messages to consecutively numbered processors are staggered on the disks to permit $D$ blocks to be written in parallel.

The scheme just described requires two copies of the messaging matrix, because the messages generated by virtual processor $i$ in one compound superstep must be stored on disk before virtual processor $i$ has finished processing the messages generated for it in the previous compound superstep.

We can avoid this extra space requirement, however, as follows.

Observation: By alternating between {consecutive reads, staggered writes} and {staggered reads, consecutive writes} in successive compound supersteps, the simulation can achieve fully parallel I/O on all message blocks with a single copy of the messaging matrix.

SeqCompoundSuperstep loads $k$ virtual processors into the real memory at once, requiring that $M = \Omega(k\mu)$. Since the messages sent or received in a superstep by a virtual processor are $h = O(\frac{N}{v})$ in total size, we require that $\frac{kN}{vB} \geq D$ to ensure that our I/O scheme for messages is efficient. This means that $N \geq v'BD$, where $v' = \frac{v}{k}$.


Computation Time: Steps (a) and (e) of algorithm SeqCompoundSuperstep take $O(v\mu)$ computation time overall. In Steps (b) and (d), $O(\frac{N}{v})$ message data is routed for each virtual processor; over all $v$ processors this adds $O(N)$ computation time overall, which can be ignored. Step (c) consumes $v\tau$ computation time.

I/O Time: Steps (a) and (e) consume $O(\frac{v\mu}{BD})G$ time, and Steps (b) and (d) consume $O(\frac{N}{BD})G$ time overall, since $O(N)$ message data is sent in each superstep; as $N \leq v\mu$, we have time $O(\frac{v\mu}{BD})G$ due to I/O overall.

Thus, overall, the computation time is $v\tau + O(v\mu)$ and the I/O-time is $O(\frac{v\mu}{BD})G$. □

Theorem: A $v$-processor CGM algorithm $A$ with $\lambda$ supersteps, local memory size $\mu$, message size $\Theta(\frac{N}{v^2})$ and running time $\tau + g\,O(\frac{N}{v})\lambda + \lambda L$ can be simulated as a single-processor EM-CGM algorithm $A'$ with time $v\tau + O\big(v\mu + \frac{v\mu}{BD}G\big)\lambda$, for $M = \Omega(k\mu + BD)$, $N \geq v'^2B$ and $v' \geq D$, for an arbitrary integer $k \leq v$ and $v' = \frac{v}{k}$.

In particular, algorithm $A'$ is $c$-optimal if $A$ is $c$-optimal and $G = o(BD)$. Furthermore, algorithm $A'$ is work-optimal and I/O-efficient if $A$ is work-optimal and communication-efficient and $G = O(BD)$.

Proof: We use the results of the lemma above. The computation time required to simulate the computation steps of $A$ is $v\tau$. The computational overhead associated with the I/O steps (Steps (a), (b), (d), (e)) is $O(v\mu) + O(N)$. Since $v\mu \geq N$, the total computation time is bounded by $v\tau + O(v\mu)$. When $c$-optimality is required we therefore need $v\mu = o(v\tau)$, or $\mu = o(\tau)$. Note that when $\mu = \Theta(\frac{N}{v})$ we can substitute $\frac{N}{v}$ for $\mu$. For work-optimality we require that $v\tau + O(v\mu) = O(v\tau)$, or $\mu = O(\tau)$.

The number of I/O operations (Steps (a), (b), (d), (e)) is $O(\frac{v\mu}{BD}) + O(\frac{N}{BD})$, which is bounded by $O(\frac{v\mu}{BD})$. For $c$-optimality we require the I/O time to be in $o(v\tau)$, which means that $G = o(BD)$. For I/O-efficiency we require the I/O time to be in $O(v\tau)$, which means that $G = O(BD)$.

The messages sent by a group of $k$ virtual processors to any other group must be large enough in total that they fill a disk block, so $N \geq v'^2B$. Also, the messages sent or received by a group in a single superstep must fill a track of the $D$ parallel disks, meaning that $N \geq v'BD$, where $v' = \frac{v}{k}$. These constraints can be combined into $N \geq v'^2B$ for $v' \geq D$. □


Multiple Processor Target Machine

For the case of $p > 1$ processors on the EM-CGM machine, we simulate a compound superstep of a CGM algorithm $A$ using the algorithm ParCompoundSuperstep shown below. Unlike in the case of a single real processor, we are now forced to perform real communication between the real processors of the target machine.

Each real processor $i$, $0 \leq i \leq p-1$, executes algorithm ParCompoundSuperstep in parallel. For ease of exposition, we assume that $pk$ divides $v$.

Algorithm ParCompoundSuperstep

Objective: Simulation of a compound superstep of a $v$-processor CGM on a $p$-processor EM-CGM.

Input: The message and context blocks of the virtual processors are divided among the real processors and their local disks. Each real processor $i$, $0 \leq i \leq p-1$, holds $O(\frac{N}{pB})$ blocks of messages and $\frac{v\mu}{pB}$ blocks of context, and each local disk contains $O(\frac{N}{pBD})$ blocks of messages and $O(\frac{v\mu}{pBD})$ blocks of context.

Output: The changed contexts and generated messages, distributed as required for the next compound superstep.

For $j = 0$ to $\frac{v}{pk} - 1$ do:
  (a) Read the contexts of the $j$th group of $k$ virtual processors resident on real processor $i$ from the local disks.
  (b) Read any message blocks addressed to these $k$ virtual processors from the local disks.
  (c) Simulate the computation supersteps of these $k$ virtual processors, collecting all generated messages in the local internal memory.
  (d) Send all generated messages to the required real destination processors. Upon arrival, the messages are arranged within the internal memory of the real destination processor and then written to its disks, as in the single processor simulation (see Algorithm SeqCompoundSuperstep).
  (e) Write the contexts of these $k$ virtual processors back to the local disks (see Algorithm SeqCompoundSuperstep).

End of Algorithm

Lemma: A compound superstep of a $v$-processor CGM algorithm $A$ with computation time $\tau + L$, message size $\Theta(\frac{N}{v^2})$, communication time $O(\frac{N}{v})g + L$ and local memory size $\mu$ can be simulated as $\frac{v}{pk}$ compound supersteps of a $p$-processor EM-CGM algorithm $A'$ in parallel computation time $\frac{v}{p}\tau + O(\frac{v}{p}\mu) + \frac{v}{pk}L$, communication time $g\,O(\frac{N}{p}) + \frac{v}{pk}L$, and I/O time $G\,O(\frac{v\mu}{pBD}) + \frac{v}{pk}L$, for $pk \leq v$, arbitrary integer $k \leq \frac{v}{p}$, $N \geq v'^2B$, $v' \geq D$ and $v' = \frac{v}{k}$, provided $M = \Omega(k\mu + BD)$ memory is available on each real processor.

Proof: There are $\frac{v}{p}$ virtual processors resident on each real processor. Each real processor simulates its $\frac{v}{p}$ virtual processors of a round of $A$ in a total of $\frac{v}{pk}$ supersteps, as $k$ virtual processors are simulated on each real processor in each round of $A'$.

Communication time: In Step (d), in each round of $A'$, a real processor receives: $p$ real messages, one from each real processor; $pk\,\frac{v}{p}$ virtual messages, one from each of the $pk$ virtual processors simulated in this round to each of its own $\frac{v}{p}$ resident virtual processors; and $pk\,\frac{v}{p}\,\frac{N}{v^2} = \frac{kN}{v}$ data items, since each virtual message is $\frac{N}{v^2}$ items in size.

It also sends $\frac{kN}{v}$ data in each round of $A'$. We have $\frac{v}{pk}$ such rounds, so we have $\frac{kN}{v}\cdot\frac{v}{pk} = \frac{N}{p}$ data to communicate overall. Therefore the overall communication time of $A'$ is $g\,O(\frac{N}{p}) + \frac{v}{pk}L$.

I/O time: The I/O time is determined by the cost of swapping contexts plus the cost of simulating the original messaging via I/O. Each group of $k$ processors has context of size $O(k\mu)$, which requires $O(\frac{k\mu}{BD})$ I/O operations to swap in or out of memory. Over $\frac{v}{pk}$ supersteps, the swapping of contexts therefore costs $G\,O(\frac{v\mu}{pBD})$. The I/O costs due to messaging between virtual processors are bounded by $O(\frac{kN}{vBD})$ in each superstep, and so the total cost due to virtual messaging over $\frac{v}{pk}$ supersteps is $G\,O(\frac{N}{pBD})$. Since the context size $\mu \geq \frac{N}{v}$, the total I/O cost overall is dominated by $G\,O(\frac{v\mu}{pBD})$.

Computation time: We have $p \leq v$ real processors, so the time to simulate Step (c) is $\frac{v}{p}\tau$. Computational overhead is contributed by $O(\frac{v}{pk}(k\mu + \frac{kN}{v}))$ due to swapping of contexts (Steps (a), (e)) and messaging I/O (Steps (b), (d)). As before, the computational overhead is dominated by the cost of swapping contexts. □

Theorem: A $v$-processor CGM algorithm $A$ with $\lambda$ supersteps, computation time $\tau + \lambda L$, communication time $g\,O(\frac{N}{v})\lambda + \lambda L$, local memory size $\mu$ and message size $b = \Theta(\frac{N}{v^2})$ can be simulated as a $p$-processor EM-CGM algorithm $A'$ with computation time $\frac{v}{p}\tau + O(\frac{v}{p}\mu\lambda) + \frac{v}{pk}\lambda L$, communication time $g\,O(\frac{N}{p}\lambda) + \frac{v}{pk}\lambda L$, and I/O time $G\,O(\frac{v\mu}{pBD}\lambda) + \frac{v}{pk}\lambda L$, for $M = \Omega(k\mu + BD)$, $p \leq v$, $N \geq v'^2B$ and $v' \geq D$, for arbitrary integer $k \leq \frac{v}{p}$ and $v' = \frac{v}{k}$.

Let $g(N)$, $L(N)$ and $v(N)$ be increasing functions of $N$. If $A$ is $c$-optimal on the CGM for $g \leq g(N)$, $L \leq L(N)$ and $v \leq v(N)$, then $A'$ is a $c$-optimal EM-CGM algorithm for $g \leq g(N)$, $G = o(BD)$ and $L \leq \frac{pk}{v}L(N)$. $A'$ is work-optimal, communication-efficient and I/O-efficient if $A$ is work-optimal and communication-efficient, $g \leq g(N)$, $G = O(BD)$ and $L \leq \frac{pk}{v}L(N)$.

Proof: We use the results of the lemma above. The computation time required to simulate the computation steps of $A$ is $\frac{v}{pk}\cdot k\tau = \frac{v}{p}\tau$. The computational overhead associated with the I/O and communication steps (Steps (a), (b), (d), (e)) is $O(\frac{v}{p}\mu + \frac{v}{pk}\cdot\frac{kN}{v})$, from the lemma. Since $\mu \geq \frac{N}{v}$, the total computation time is bounded by $\frac{v}{p}\tau + O(\frac{v}{p}\mu)$. When $c$-optimality is required, we need $\mu = o(\tau)$. Note that in many cases $\mu = \Theta(\frac{N}{v})$. Also, when only work-optimality is required, $\mu = O(\tau)$ suffices.

From the lemma, the communication time of the simulation per superstep of $A$ is $g\,O(\frac{N}{p}) + \frac{v}{pk}L$, giving $g\,O(\frac{N}{p}\lambda) + \frac{v}{pk}\lambda L$ time overall.

The I/O time (Steps (a), (b), (d), (e)) is $G\,O(\frac{v\mu}{pBD} + \frac{N}{pBD})$, which is bounded by $G\,O(\frac{v\mu}{pBD})$. For $c$-optimality we require the I/O time to be in $o(\frac{v}{p}\tau)$, which means that $G = o(BD)$. For I/O-efficiency we need only that $G = O(BD)$. Since the number of supersteps increases by a factor of $\frac{v}{pk}$, we require that $L \leq \frac{pk}{v}L(N)$. □

The next theorem shows that the preceding theorem holds even if $A$ has varying message sizes and $N \geq v'^2B$, provided that, in addition to the constraints of that theorem, $B \geq v' \geq D$.

Theorem: A v-processor CGM algorithm A* with λ supersteps, computation time λ·(N/v) + λ·L, communication time g·λ·(N/v) + λ·L, and local memory size O(N/v) can be simulated as a p-processor EM-CGM algorithm A with computation time O(λ·(v/p)·(N/v) + (v/p)·λ·L), communication time g·O(λ·(v/p)·(N/v)) + (v/p)·λ·L, and I/O time G·O(λ·(v/p)·⌈N/(vBD)⌉ + (v/p)·λ·L), for M ≥ k·BD, p ≤ v, N ≥ v·B, v ≥ D, and B ≤ N/v², for arbitrary integer k ≥ 1 with v ≥ p·k.

Let g(N), L(N), and v(N) be increasing functions of N. If A* is c-optimal on the CGM for g = g(N), L = L(N), and v = v(N), then A is a c-optimal EM-CGM algorithm for g = g(N), G ∈ o(BD·p/v), and L = L(N). A is work-optimal, communication-efficient, and I/O-efficient if A* is work-optimal and communication-efficient, g = g(N), G ∈ O(BD·p/v), and L = L(N).

Proof: We first apply algorithm BalancedRouting to each communication superstep of A*, which ensures message size b ≥ N/v², provided that N ≥ v²·b. If b ≥ B and B ≤ N/v², this condition is preserved by N ≥ v·B and v ≥ D. We can then apply Theorem.
□

Summary

The result of Corollary can be applied to any algorithm which communicates exclusively via h-relations. The concept of an h-relation is relevant primarily with respect to whether it is an assumption in the analysis of the algorithm in question. Typically, a good algorithm has been shown to be asymptotically optimal when the communication volume to and from each processor is bounded by h in each superstep. Using Lemma, we can additionally ensure any desired minimum communication block size of b, at the cost of at most doubling the number of communication rounds, for problems with sufficient slackness. Here we use the term communication to mean either I/O or conventional message passing.

The main results of this chapter are as follows:

1. Using Theorem, algorithms which have both BSP and CGM properties can be converted to EM-BSP/EM-CGM algorithms.

2. Using Corollary, CGM algorithms can be converted to BSP* algorithms with b = N/v².

3. Using Corollary, conforming BSP algorithms can be converted to BSP* algorithms with b = h_min/v, where h_min is the minimum value of h used in any communication superstep.

4. Using Corollary and Theorem, CGM algorithms can be converted to EM-BSP/EM-CGM algorithms with b = N/v².

The term conforming in item 3 above refers to the need for the bounding concept of an h-relation to be a universal assumption in the analysis of the original BSP algorithm, for each of its communication rounds. It is convenient, but not necessary, that the same value of h be used in every round.

Chapter

Deterministic Simulation with

Lower Slackness

Motivation

Algorithm BalancedRouting, described in Section, is a method for delivering the messages generated by the v processors of a parallel algorithm. It does this in twice the original number of communication rounds, while maintaining balanced message sizes, i.e., b = N/v², where N is the problem size.

In the external memory context, this technique permits the messages to be written to and read from the local disks of each real processor in parallel with minimal overhead, and also allows us to keep the communication balanced between real processors. It has the disadvantage that it requires large slackness, i.e., N ≥ v²·b. When we substitute v = N/M into N ≥ v²·b, we obtain the constraint N ≥ (N/M)²·b, so BalancedRouting requires that

M ≥ √(Nb).

This is practical at modest cost on contemporary computers, even for very large problem sizes (at least into the tera-item range), but a smaller value of M would be desirable from a theoretical point of view. In this chapter we will present a more flexible message balancing and distribution algorithm, which we refer to as LowSlackRouting. It works for smaller slackness than BalancedRouting, but involves an increase in the number of rounds. For constant ε > 0, LowSlackRouting requires

N ≥ v^(1+ε)·b.

CHAPTER DETERMINISTIC SIMULATION WITH LOWER SLACKNESS

[Figure: The division of the |S| processors into v^ε groups, Group 1 through Group v^ε, each of |S|/v^ε processors; each processor holds v^ε buckets.]

Figure: The division of the |S| processors into v^ε groups. S is a set of processors which are participating in the current routing subproblem. Initially, S is the set of all v virtual processors. The shaded areas illustrate the components of a fixed bucket on the various processors.

N

Substituting v into we obtain the constraint

M

M N b

When is the same as but it can be made more attractive by

cho osing a smaller value for For example with b and a huge p erhaps

even imp ossible problem size of N BalancedRouting requires internal memory

size M which is currently dicult to achieve However with

LowSlackRouting requires only M for the same conditions and

gives M
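The substitution above can be checked directly; a sketch of the arithmetic, using only the slackness constraint stated above:

```latex
N \ge v^{1+\epsilon} b, \qquad v = \frac{N}{M}
\;\Longrightarrow\;
N \ge \left(\frac{N}{M}\right)^{1+\epsilon} b
\;\Longrightarrow\;
M^{1+\epsilon} \ge N^{\epsilon}\, b
\;\Longrightarrow\;
M \ge N^{\epsilon/(1+\epsilon)}\, b^{1/(1+\epsilon)} .
```

Setting ε = 1 recovers M ≥ √(Nb), the BalancedRouting requirement.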

Overview

We first present a parallel algorithm, LowSlackRouting, which implements an h-relation for h = N/v, while ensuring that the minimum message size is b items. We then adapt the algorithm to an external memory context with p real processors, p ≤ v, where each real processor has D local disks and internal memory O(kN/v).


We show that the algorithms are valid provided that N ≥ v^(1+ε)·b, where ε > 0 is a constant. For ease of exposition, we will assume that the quantities v^ε and v^(1-rε) are positive integers for r = 1, 2, …. Such assumptions do not materially affect the analysis.

Overview of the Parallel Algorithm LowSlackRouting: Algorithm LowSlackRouting, presented in Section, delivers messages generated in a preceding computational superstep to their destinations. Provided that N/v ≥ v^ε·b, it maintains a specified minimum message size b, even though the amount of data held by each processor is too small for it to send such a message to each of the v processors in a given round. Instead, LowSlackRouting constructs v^ε groups of v^(1-rε) processors in the r-th round, for r = 1, …, ⌈1/ε⌉, and messages for processors within a group are combined. LowSlackRouting delivers messages to their eventual destination by routing them indirectly, i.e., through a series of intermediate destinations in a corresponding series of communication rounds. The messages originating from a given processor are divided into v^ε local buckets, depending on their intended destination. The local buckets of the v^(1-rε) processors of a group in the r-th round are treated collectively as v^ε global buckets. Each global bucket corresponds to one of v^ε groups of intermediate destinations (see Figure). By computing a total order on the blocks in each global bucket, we can distribute the blocks evenly among the virtual processors of that group. We will refer to v^ε as the fanout of each round of routing, as a group of v^(1-rε) processors in round r divides its data into v^ε groups of v^(1-(r+1)ε) processors in round r + 1. Message data intended for a particular processor of a group is differentiated and delivered to its correct destination in a later round. For the last round, BalancedRouting is used, as by then its slackness constraint is satisfied.

Lemma deals with the difficult case where a small number of large messages and a large number of small messages are generated in a round of a client algorithm. Lemma allows us to treat the small messages as if they were of size b, without asymptotically affecting the CPU, communication, or I/O costs of algorithm LowSlackRouting.

Since LowSlackRouting performs an (N/v)-relation in each superstep, with a minimum communication block size of b, it can be considered to be both a BSP* and a CGM algorithm. For our purposes, however, the distinction is not material. We are interested primarily in its use as an external memory algorithm.

Overview of External Memory Algorithm EMLowSlackRouting: We now adapt algorithm LowSlackRouting for use in an external memory context, i.e., on an EM-BSP machine. EMLowSlackRouting is essentially a simulation of LowSlackRouting. Few changes are needed, as LowSlackRouting already has coarse, well-behaved communication between its subproblems (virtual processors).


On a single-processor, single-disk machine, a simple approach exists. We know that each group of x processors in each round will eventually receive x·N/v data, and so prior to each round we allocate x·N/v space on the disk for each group. Messages are then concatenated to the end of the list of message data for their respective destination group. As each virtual processor is simulated, it can read as many blocks of message data for its group as is necessary to ensure that it processes N/v data.

For an EM machine with p > 1 processors, however, it is more difficult to ensure that communication is balanced across physical processors and that message data is written to the disks D blocks at a time, where a disk block is B items in size. We will show that, by allocating the v virtual processors of LowSlackRouting to the p real processors of EMLowSlackRouting in round-robin order, we can achieve balanced communication between real processors on the target machine, provided that p = O(v^(1-ε)), where p is the number of real processors. By requiring that the communication block size of LowSlackRouting be sufficiently large, i.e., b ≥ BD (e.g., b = BD), we can fully utilize the D parallel disks on each real processor.

In Section we present the parallel algorithm LowSlackRouting. In Section we present the external memory routing algorithm EMLowSlackRouting, which is derived from LowSlackRouting.

The Parallel Algorithm LowSlackRouting

We are given N problem items and v processors. The routing is performed in ⌈1/ε⌉ rounds of LowSlackRouting. The constant ε > 0 determines the fanout v^ε of each routing stage, and b is the number of problem items in a message block. S indicates the set of processors participating in the routing operation. S is a two-element vector consisting of the first and last labels of the processors participating in the routing operation. Initially, S = (0, v − 1). We use |S| to represent the number of processors in the range indicated by S. Initially, |S| = v, and round r = 1.

Algorithm LowSlackRouting(S)

Input: Each of the |S| processors has N/v ≥ b·v^ε elements, partitioned into |S| messages. Each message may be of arbitrary length (at most N/v).

Output: The O(|S|) messages in each processor are delivered to their final destinations. Each processor receives O(N/v) elements.

1. If |S| ≤ v^ε, the processors in S route their messages to their destination by means of algorithm BalancedRouting.


2. The processors in S organize themselves into v^ε groups, each of size |S|/v^ε (see Figure).

3. Each processor divides its messages into v^ε buckets. The j-th bucket contains those messages destined to the processors in group j. Each message is subdivided into communication blocks of size b.

4. Each processor in S determines a rank for each of its blocks, as follows:

(a) Assign unit weight to each locally-held block.

(b) Create a vector T of the local partial sums of the weights in each bucket, as follows: T[ℓ][j] = the number of blocks in bucket j with size ℓ, for ℓ = 1, …, b and j = 0, …, v^ε − 1.

(c) Call NumberBlocks(S, T) (below). This determines the rank of the first full block and any partial block in each bucket.

(d) Label all of the locally-held blocks with their rank. The full blocks in each bucket receive consecutive ranks.

5. The processors address each block of local bucket j to the (i mod v^(1-rε))-th processor of group j, where i is the rank of the block. Blocks to a common processor are then concatenated to form a single message and sent. Figure illustrates the result of this redistribution process.

6. Superstep boundary; r ← r + 1.

7. Each processor receives at most one message from each of the v^(1-rε) processors which were part of its group in the previous round (Step 5 above).

8. Let S_j be the range of processors in group j, j = 0, …, v^ε − 1. Via v^ε parallel recursive calls, each processor of S_j routes the blocks it received in Step 7 to their destination processor by executing LowSlackRouting(S_j), for all j = 0, …, v^ε − 1.

End of Algorithm
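The bucketing, block formation, ranking, and round-robin addressing of the algorithm above can be illustrated with a small sequential sketch. This is hypothetical Python, not code from the thesis; `v`, `eps`, and `b` are assumed to be chosen so that `v**eps` is integral, and only the first round is modelled:

```python
def split_into_blocks(message, b):
    """Subdivide one message into communication blocks of at most b items."""
    return [message[i:i + b] for i in range(0, len(message), b)]

def first_round_assignment(messages, v, eps, b):
    """Sequential sketch of the bucketing and round-robin addressing steps
    (first round, so each group has v**(1-eps) processors).

    messages: dict (src, dst) -> list of items, with dst in range(v).
    Returns dict (group, member) -> list of blocks for that intermediate
    destination processor.
    """
    fanout = round(v ** eps)      # v^eps buckets / destination groups
    group_size = v // fanout      # v^(1-eps) processors per group
    buckets = {j: [] for j in range(fanout)}
    for (_, dst), msg in messages.items():
        buckets[dst // group_size].extend(split_into_blocks(msg, b))
    assignment = {}
    for j, blocks in buckets.items():
        # Rank full blocks (size b) before partial ones, as NumberBlocks
        # does, then deal the blocks round-robin over the group's members.
        blocks.sort(key=lambda blk: -len(blk))
        for rank, blk in enumerate(blocks):
            assignment.setdefault((j, rank % group_size), []).append(blk)
    return assignment
```

Dealing the ranked blocks round-robin is what keeps the per-processor load within one block of even, which is the property the lemmas below formalize.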

The Parallel Algorithm NumberBlocks

Algorithm NumberBlocks assigns a rank to every block by applying a parallel prefix computation over unit weights. Blocks are numbered independently within each of the v^ε global buckets. The ordering is such that blocks of size b are given ranks lower than any block of size ℓ, for ℓ = 1, …, b − 1. This fact is used by Lemma to show that each processor receives nearly the same amount of data as any other. Figures and illustrate the operation of algorithm NumberBlocks.

Algorithm NumberBlocks(S, T)
(possibly via a fork for each bucket)


[Figure: (a) Before redistribution — the blocks of bucket j are scattered over processors 1, …, |S|. (b) After redistribution — the blocks of bucket j are spread evenly over the processors of group j.]

Figure: The redistribution of blocks for a given bucket. The redistribution step of LowSlackRouting redistributes the blocks in a bucket j evenly among the processors of group j.

Input: Each of the |S| processors has v^ε local buckets, which contain blocks of at most b items. T is a b × v^ε array containing partial sums for the weights of each block size within each bucket.

Output: The blocks in each global bucket are each given a rank which is unique within that bucket.

1. The processors in S organize themselves into |S|/v^ε ordered groups of size v^ε. Let i be the index of the current processor, and let ⌊i/v^ε⌋ be the group for the current processor.

2. Processor i sends the b sums for its j-th local bucket to processor ⌊i/v^ε⌋·v^ε + j. Superstep boundary.

3. Each processor receives b partial sums from each of v^ε children. Each processor is now associated with the computation of ranks for a single bucket; i.e., processor i is associated with the computation of ranks for bucket s(i) = i mod v^ε. Processors 0, v^ε, 2v^ε, … now contain partial sums for bucket 0; processors 1, v^ε + 1, 2v^ε + 1, … now contain partial sums for bucket 1; etc.


4. Each processor computes b local partial sums from the data it received.

5. {partial sums travel from leaves to root}
for r' = 1 to ⌈(1 − rε)/ε⌉ do
If processor i is active at level r' (i.e., i ≡ s(i) (mod v^(r'ε))), then processor i sends its local partial sums to parent processor ⌊i/v^((r'+1)ε)⌋·v^((r'+1)ε) + s(i).
Superstep boundary.
Each processor computes b local partial sums from the data it received.
endfor

6. {partial sums travel from root to leaves}
for r' = ⌈(1 − rε)/ε⌉ downto 1 do
Each processor which was a destination in Step 5 computes the rank of each of the b components received from each of its v^ε children in Step 5.
Each processor sends the b computed ranks to each of its children.
Superstep boundary.
endfor

7. Processor i sends its b partial sums to each of its v^ε children.

End of Algorithm

Description of Algorithm NumberBlocks: We assume a local copy of the loop variable r', not the same one that is used in LowSlackRouting. The blocks of each bucket are labelled with sequential numbers using algorithm NumberBlocks. NumberBlocks orders the blocks according to size, and requires two passes through a tree of processors, each pass requiring ⌈(1 − rε)/ε⌉ + 1 supersteps, for a total of 2(⌈(1 − rε)/ε⌉ + 1) supersteps, where ε > 0 is a constant. The numbering is computed via parallel prefix sum computations, using a v^ε-way tree over the processors for each bucket. Blocks of size b elements are assigned sequence numbers before blocks of size ℓ, for all ℓ = 1, …, b − 1.

In the first superstep, each processor assigns a weight of 1 to each local block and computes a partial sum for the weights of the locally held blocks of each size in each bucket. Next, each processor i first sends its b partial sums for the j-th bucket, for j = 0, …, v^ε − 1, to processor ⌊i/v^ε⌋·v^ε + j. For each bucket, therefore, |S|/v^ε destination processors are selected, or |S| destinations over all buckets. In the remaining steps, each such group of v^ε processors operates independently on a different bucket. In round 1, the processors 0, …, v^ε − 1 at the root of the respective v^ε trees


[Figure: The tree of processors which numbers the blocks in bucket 0 (a v^ε-ary tree of processors).]

will compute the sum of the weights of the blocks of size ℓ, for ℓ = 1, …, b, in buckets 0, …, v^ε − 1, respectively.

The later steps perform a second pass through the trees of processors. This consists of additional rounds during which the results are propagated back down the trees of processors. In each round r', the processors in level r' of the tree return the sequence number of the first block contributing to each of the partial sums received from their children in the first pass.

Figures and illustrate the tree of processors used for calculating the sequence numbers for blocks in buckets 0 and 1, respectively.
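The net effect of the two tree passes can be imitated with a small sequential stand-in (hypothetical code, not from the thesis): given, for one bucket, a table count[i][s − 1] of how many blocks of size s processor i holds, an exclusive prefix sum taken over all (size, processor) pairs — size b first, then sizes b − 1 down to 1 — yields the rank of each processor's first block of each size:

```python
def first_ranks(counts, b):
    """counts[i][s-1] = number of blocks of size s held by processor i
    (one bucket).  Returns ranks[i][s-1] = rank of processor i's first
    block of size s.  Ranks are assigned to full blocks (size b) first,
    then to size b-1, ..., 1, the order NumberBlocks uses."""
    ranks = [[0] * b for _ in counts]
    nxt = 0
    for s in range(b, 0, -1):          # size b first, then smaller sizes
        for i, row in enumerate(counts):
            ranks[i][s - 1] = nxt      # exclusive prefix sum
            nxt += row[s - 1]
    return ranks
```

The parallel algorithm computes exactly these prefix sums, but distributes the additions over a v^ε-ary tree so that no processor ever touches more than v^ε·b values per round.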

Analysis of LowSlackRouting

Lemma: The number of blocks, both full and partially filled, in Step 3 of LowSlackRouting is at most 2N/(vb) per processor.

Proof: Let us assume that partial blocks are padded with dummy items, if necessary, to ensure that the size of every local bucket on every processor is a multiple of b items. Since there are v^ε buckets per processor, each processor adds less than v^ε blocks. Therefore, the total amount of data, both real and dummy, at each processor is at most N/(vb) + v^ε ≤ 2N/(vb) blocks. Note that dummy elements are considered only for purposes of the proof and are not actually sent.
□


[Figure: The tree for the numbering of blocks in bucket 1 (a v^ε-ary tree of processors).]

Lemma: Each virtual processor receives at most 2N/(vb) blocks (some, perhaps, partially full) in Step 7 of LowSlackRouting.

Proof: In the r-th round, r = 1, …, ⌈1/ε⌉, of LowSlackRouting, there are less than v^ε extra blocks generated per bucket as a result of padding partial blocks. There are a minimum of N/(v^(rε)·b) blocks in a bucket in the r-th round, so there are at most N/(v^(rε)·b) + v^ε blocks per bucket. These are spread evenly over the v^(1-rε) processors of the group, so each processor receives at most

(N/(v^(rε)·b) + v^ε) / v^(1-rε) = N/(vb) + v^(ε-1+rε) blocks.

Since v^(1+ε)·b ≤ N, each processor receives at most 2N/(vb) blocks.
□

Lemma: Although some of the blocks it receives may be only partly full, each virtual processor receives at most 2b fewer items than any other processor in any round r of LowSlackRouting, r = 1, …, ⌈1/ε⌉.

Proof: The sequence numbering of the blocks in a bucket by NumberBlocks orders all of the blocks of size b before any block of size ℓ, for ℓ = 1, …, b − 1. The blocks are then allocated in round-robin fashion over the ordered set of processors in


the group. The allocation of data items among the v^(1-rε) processors in the r-th round via partially filled blocks can be modelled by a ⌈(N/(v^(rε)·b))/v^(1-rε)⌉ × v^(1-rε) matrix, as in Figure. The block sizes form a non-increasing sequence b, b, …, b, i₁, i₂, …, with b > i₁ ≥ i₂ ≥ … ≥ 1. The blocks are allocated row by row, from left to right, to the processors (columns). The maximum imbalance in the number of items per processor is the maximum imbalance in column sums of the matrix.

First we consider the case where every processor receives the same number of blocks. In this case, the maximum imbalance in column sums occurs between columns 1 and v^(1-rε). We know that the maximum cell value is b, and so the imbalance reaches its maximum value when one row steps from a cell of size b down to a smaller size; this gives an imbalance of at most b.

In the case where a processor is allocated more blocks than another, the difference in the number of blocks can be at most 1. Let q be the last processor to receive an extra block. Then q may have received at most b items more than processors q + 1, …, v^(1-rε), and, by the previous argument, at most 2b items more than processors 1, …, q − 1.
□

Lemma gives an upper bound on the number of rounds required by NumberBlocks for each round of LowSlackRouting. We generalize the lemma slightly, to account for q·v^(1-rε) participating processors, so that we can use it in analyzing EMNumberBlocks in Lemma. Note that we do not actually use NumberBlocks in the final routing round, as BalancedRouting does not require it.

r

Lemma Let be a constant and let q v Given qv processors

th

participating in the ranking of blocks in v buckets in Step c of the r round of

LowSlac e the rank of each block in each bucket can be kRouting r d

r e communication rounds determined by algorithm Numb erBlo cks in d

Pro of For each bucket NumberBlocks forms a v ary tree of pro cessors over the

r r

qv pro cessors The heightofsuch a tree is dlog qv e d r e since q v

v

The trees for each of the v buckets are traversed simultaneously The parallel prex

ranking op eration involves traversing the tree of pro cessors in parallel from leaves to

r e communication ro ot and then from the ro ot back to the leaves giving d

rounds 


[Figure: Imbalance in Item Assignment to Processors. The columns represent relative processor numbers 1, …, v^(1-rε), and the cell values represent the size of the block in that position; the illustration shows a possible scenario for block size b.]

Lemma: The number of rounds required by LowSlackRouting is at most ⌈1/ε⌉² + ⌈1/ε⌉, for constant ε > 0.

Proof: The ranking of each block in the ranking step of LowSlackRouting in round r, r = 1, …, ⌈1/ε⌉, can be done in 2(⌈(1 − rε)/ε⌉ + 1) rounds, from Lemma. The data sent by processor i to processor j is sent to a group of v^(1-ε) processors containing processor j in the first round, by the send step of LowSlackRouting. In the r-th round, the data sent by processor i to processor j is sent to a group of v^(1-rε) processors containing processor j. Therefore, after ⌈1/ε⌉ rounds, the data sent by processor i to processor j is sent to a group containing only processor j. The routing step itself is performed by the send step of LowSlackRouting, which takes one round per round of routing.

Since there are ⌈1/ε⌉ rounds of LowSlackRouting which require the use of NumberBlocks, the number of rounds overall is

Σ_{r=1}^{⌈1/ε⌉} 2(⌈(1 − rε)/ε⌉ + 1)
≤ Σ_{r=1}^{⌈1/ε⌉} 2(⌈1/ε⌉ − r + 1)
= ⌈1/ε⌉(⌈1/ε⌉ + 1)
= ⌈1/ε⌉² + ⌈1/ε⌉.
□
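As an illustrative numeric check (hypothetical code, assuming the round bounds as reconstructed above), the sum Σ_{r=1}^{⌈1/ε⌉} 2(⌈(1 − rε)/ε⌉ + 1) never exceeds ⌈1/ε⌉(⌈1/ε⌉ + 1); exact rational arithmetic is used so the ceilings are not perturbed by floating-point error:

```python
from fractions import Fraction
import math

def rounds_upper_bound(eps):
    """Evaluate the round-count sum and its closed-form bound.

    eps should be an exact Fraction so that the ceilings are computed
    without floating-point rounding.
    """
    m = math.ceil(1 / eps)                       # number of routing rounds
    total = sum(2 * (math.ceil((1 - r * eps) / eps) + 1)
                for r in range(1, m + 1))        # NumberBlocks rounds
    return total, m * (m + 1)                    # (sum, claimed bound)
```

For instance, eps = 1/4 gives 20 rounds against a bound of 4·5 = 20, so the bound is tight when 1/ε is an integer.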

Lemma: Algorithm NumberBlocks uses O(N/v + L) computation time and O(g·v^ε·b + L) communication time.

Proof: In each round, any processor computes at most b sums from at most v^ε·b ≤ N/v inputs. The maximum number of items received is v^ε·b ≤ N/v. From Lemma, the number of rounds used is a constant.
□

Lemma: LowSlackRouting uses O(α·(N/v + L)) computation time and O(g·(N/v + v^ε·b) + α·L) communication time, where α = ⌈1/ε⌉² + ⌈1/ε⌉.

Proof: From Lemma, we can conclude that each virtual processor has O(N/v) data in each round, for b = O(N/v^(1+ε)). Exclusive of NumberBlocks, therefore, the computation time and communication time of LowSlackRouting are both O(N/v + L) per round. From Lemma, we get the overall computation time, including NumberBlocks, to be O(α·(N/v + L)), and the communication time to be O(g·(N/v + v^ε·b) + α·L), for α = ⌈1/ε⌉² + ⌈1/ε⌉.
□

The EM Algorithm EMLowSlackRouting

In this section we describe algorithm EMLowSlackRouting, which is based on LowSlackRouting. EMLowSlackRouting performs the routing of messages among virtual processors during the simulation of a BSP algorithm on an EM-BSP machine.

In the general case of multiple disks and multiple real processors, a key step is the calculation of a fixed ordering of the communication blocks. This allows us to calculate the destination processor number (both virtual and real) and the disk address to which the block is to be written at its destination. We first describe this general case, and later we examine special cases, such as a single disk and a single real processor, which permit simpler algorithms.

First we describe the simulation of LowSlackRouting as an external memory algorithm, EMLowSlackRouting; then we describe an adaptation of NumberBlocks as external memory algorithm EMNumberBlocks. We do not simulate the parallel algorithm NumberBlocks; instead, we use it directly on the real processors of the EM-BSP machine.

We assume the existence of p real processors on the EM machine. Each real processor i, where 0 ≤ i ≤ p − 1, simulates a fixed set of v/p virtual processors i, p + i, 2p + i, …, ((v/p) − 1)·p + i. In other words, we assign virtual processors to physical processors in a round-robin fashion. As in Chapter, we introduce the concept of a messaging matrix M. Let M be a matrix with v columns and N/(vb) rows; that is, there are v/p columns of M per real processor. Each cell of M, stored in a distributed fashion, contains a block of b items. Each column of M will be used to store the message blocks destined for a particular virtual processor. Each real processor will maintain the columns of M corresponding to the virtual processors which it is responsible for simulating.
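The round-robin assignment described above can be stated in two lines; this is an illustrative sketch (hypothetical code, not from the thesis):

```python
def real_owner(virtual_id, p):
    """Real processor that simulates this virtual processor (round-robin)."""
    return virtual_id % p

def simulated_set(i, v, p):
    """Virtual processors i, p+i, 2p+i, ... simulated by real processor i."""
    return list(range(i, v, p))
```

Because consecutive virtual processor labels land on distinct real processors, any run of up to p consecutive intermediate destinations spreads its traffic evenly over the machine, which is what keeps the simulated communication balanced.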

We have v^ε buckets and, in the first round, v^(1-ε) processors per bucket. We deliver the messages within each bucket to their final destinations by delivering the component blocks of each message to a series of intermediate destinations. The immediate destination virtual processors for a fixed bucket j, for j = 0, …, v^ε − 1, are those with labels j·v^(1-ε), j·v^(1-ε) + 1, …, (j + 1)·v^(1-ε) − 1.

We simulate the routing of communication blocks in external memory by executing algorithm EMLowSlackRouting (below) on each real processor. EMLowSlackRouting is derived from the parallel algorithm LowSlackRouting. As before, S is a two-element vector consisting of the first and last labels of the processors participating in the routing operation. Initially, S = (0, v − 1). We use |S| to represent the number of processors in the range indicated by S. Initially, |S| = v, and round r = 1.

Algorithm EMLowSlackRouting(S)

Input: Each of the |S| virtual processors has N/v ≥ b·v^ε elements, partitioned into |S| messages. Each message may be of arbitrary length (at most N/v). The messages for each virtual processor are stored on the local disks of the real processor which simulates them.

Output: The O(|S|) messages in each processor are delivered to their final destinations. Each processor receives O(N/v) elements.

1. If |S| ≤ v^ε, the processors in S route their messages to their destination by means of algorithm BalancedRouting.

2. The virtual processors in S organize themselves into v^ε groups, each of size |S|/v^ε (see Figure). Each virtual processor divides its messages into v^ε buckets. The j-th bucket contains those messages destined to the processors in group j. Each message is subdivided into communication blocks of size b.

3. Each virtual processor in S determines a rank for each of its blocks, as follows:

(a) Assign unit weight to each locally-held block.

(b) Create a vector T of the local partial sums of the weights in each bucket, as follows: T[ℓ][j] = the number of blocks in bucket j with size ℓ, for ℓ = 1, …, b and j = 0, …, v^ε − 1.

(c) Call EMNumberBlocks(S, T) (below). This determines the rank of the first full block and any partial block in each bucket.

(d) Label all of the locally-held blocks with their rank. The full blocks in each bucket receive consecutive ranks.

4. Call EMSend(S) (below). Each virtual processor addresses the i-th block of local bucket j to the (i mod v^(1-rε))-th processor of group j. Blocks to a common virtual processor are then concatenated to form a single message and sent.

5. Superstep boundary; r ← r + 1.

6. Call EMReceive(S) (below). Each virtual processor receives at most one message from each of the v^(1-rε) processors which were part of its group in the previous round (Step 4 above).

7. Let S_j be the range of virtual processors in group j, j = 0, …, v^ε − 1. Each virtual processor of S_j routes the blocks it received in Step 6 to their destination processor by executing EMLowSlackRouting(S_j), for all j = 0, …, v^ε − 1.

End of Algorithm

Algorithm EMSend(S)

Input: Each communication block held by the current virtual processor is associated with one of v^ε buckets and labelled with a sequence number i, as calculated by NumberBlocks, that is unique within that bucket.

Output: Each communication block held by the current virtual processor is sent to the real processor which is responsible for simulating its immediate destination virtual processor. Each block is labelled with the disk track to which it should be written when it reaches its destination.

1. Let j represent the bucket number of a block, and let i be its sequence number. Calculate for each block a column number and a row number in the messaging matrix, as follows:

(a) Column number C = j·v^(1-rε) + (i mod v^(1-rε)). Note: C is also the number of the virtual processor which is the immediate destination for the current block.

(b) Row number R = ⌊i/v^(1-rε)⌋. Note: R can also be thought of as a sequence number for the current block within virtual processor C.

2. Send block i to its destination real processor, C mod p, for all i.

End of Algorithm

Algorithm EMReceive(S)

Input: The current virtual processor has sent its data to the router and is waiting for data addressed to it to arrive. Each arriving block is labelled with the local disk track to which it should be written. Let i be the sequence number of a block, and let C be the number of the virtual processor to which it is addressed.

Output: The blocks of data addressed to the current virtual processor in the current superstep have been written to the local disks of the appropriate real processor.

1. While messages remain to be received:

(a) Receive the next message from the router.

(b) Write each block of the received message to parallel disk track (C mod (v/p))·(N/(vb)) + ⌊i/v^(1-rε)⌋.

End of Algorithm

Details of Algorithms EMSend and EMReceive: For ease of exposition, we assume that p divides v. We also assume that b ≥ BD (e.g., b = BD), where b is the communication block size, D is the number of local disks on each real processor, and B is the disk block size. We can therefore write each communication block on one or more tracks of the parallel disks of a real processor. Each communication block is labelled with its messaging matrix indices C and R, where column number C = j·v^(1-rε) + (i mod v^(1-rε)) and row number R = ⌊i/v^(1-rε)⌋. The blocks for a fixed virtual destination processor are stored on the real destination processor in a fixed band of tracks on the local disks. The band number is calculated as C mod (v/p). Within a band, R determines the relative track number on which the message is stored. The message block with sequence number i in bucket j is sent to real processor C mod p and written to parallel disk track (C mod (v/p))·(N/(vb)) + R on its local disks. Messages arriving at a real processor can therefore be written to its disks on the correct parallel track, independent of the arrival times of other messages. Figure shows the assignment of messages on the local disks of a real destination processor.
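The addressing arithmetic above can be collected into a small sketch (hypothetical Python, not code from the thesis; `v_group` stands for the group size v^(1-rε) of the current round, and the exact exponents are reconstructions):

```python
def address_block(j, i, v, p, N, b, v_group):
    """Compute where a block lands: (real processor, parallel disk track).

    j: bucket number of the block; i: its rank within the bucket;
    v_group: processors per group this round (v**(1 - r*eps)).
    """
    C = j * v_group + (i % v_group)   # destination virtual processor (column)
    R = i // v_group                  # row: per-destination sequence number
    real = C % p                      # round-robin owner of virtual proc C
    band = C % (v // p)               # band of tracks for virtual proc C
    rows_per_band = N // (v * b)      # rows reserved per virtual processor
    track = band * rows_per_band + R
    return real, track
```

Since (real, track) depends only on (j, i), every sender computes the same address for a given block, which is why arriving messages can be written to disk without any coordination on arrival order.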

The EM Algorithm EMNumberBlocks

It remains to show how NumberBlocks can be executed as an external memory algorithm. NumberBlocks involves two phases: an upward phase, where data propagates from leaves to root, and a downward phase, where data propagates from root to leaves. Each processor which is active in a round of an upward phase sends a block of b items to a parent processor which is specific to a given bucket. (The maximum number of blocks per virtual processor is 2N/(vb), from Lemma.) There are v^ε such


[Figure: The assignment of messages to local disks on a real destination processor: disks 0, 1, …, D − 1, each divided into bands with rows 0, 1, …, one band per simulated virtual processor.]

Figure: The assignment of messages to local disks on a real destination processor. A block with indices (C, R) in the messaging matrix M is assigned to real processor C mod p, and to a band of its local parallel disk tracks which corresponds to the block's destination virtual processor. The parallel disk track within that band is R.

processors sending to each parent, and so communication is fully blocked, and v^ε·b data is received by a parent in any round.

Algorithm NumberBlocks forms v^ε v^ε-ary trees of processors, each one computing an ordering of the blocks for a single bucket. In the parallel external memory version, EMNumberBlocks, the p processors compute the same v^ε prefix sums by forming a v^ε-ary tree over the p processors for each bucket. We therefore do some local processing on each real processor to reduce the number of leaves in each tree from v to p, then apply NumberBlocks to the reduced problem. Finally, each real processor does some local processing to deliver the prefix sum information for each bucket to each of its v/p virtual processors.

Each real processor maintains v/p columns of a v × v^ε matrix M_NB. Each cell of M_NB is large enough to store a message of size b. Each virtual processor i can independently calculate its own rank within the collection of processors sending to a given destination j. Communication from virtual processor i to virtual processor j is implemented by the message from i being written to, and subsequently being read from, M_NB[i][j].


Algorithm EMNumberBlocks(S, T)

Input: Each of the |S| virtual processors has v^ε local buckets, which contain blocks of size ≤ b items. T is a b × v^ε array containing partial sums for the weights of each block size within each bucket. Each of the p real processors has v/p virtual processors which it simulates. Each virtual processor is prepared to execute the algorithm NumberBlocks.

Output: The blocks in each global bucket of the virtual processors are each given a rank which is unique within that bucket.

1. Each real processor sums the values of T held by each of the virtual processors it simulates. It calculates b sums for each bucket, one sum for each block size ≤ b.

2. The p real processors execute the parallel algorithm NumberBlocks(S', T') for S' = the set of all p real processors.

3. Each real processor scans the virtual processors it simulates and delivers its portion of the prefix sum for each of its buckets.

End of Algorithm

Details of EMNumberBlocks: We arrange that the virtual processors are suspended, as in a barrier synchronization operation, when they call NumberBlocks. In the worst case, Steps 1 and 3 of EMNumberBlocks can be implemented by a scan over the contexts of the virtual processors held by each real processor. Step 2 now requires only a parallel algorithm, as no IO is required. The virtual processors simply return from their calls to NumberBlocks with the required results, as calculated by EMNumberBlocks.

Analysis of EMLowSlackRouting

Lemma. Algorithm EMNumberBlocks uses O(N/p + L) computation time, O(g·(N/p) + L) communication time, and O(G·(N/(pBD)) + L) IO time.

Proof. In Step 1 of EMNumberBlocks, each real processor calculates b sums from (v/p)·v^ε·b = O(N/p) items. In each round of NumberBlocks (Step 2 of EMNumberBlocks), any real processor computes at most vb/p sums from at most (v/p)·v^ε·b inputs. The

(In practice, no extra IO operations may be required for Steps 1 and 3. The size of each bucket is contained in the variable T in the stack frame of each virtual processor, and so Step 1 does not require extra IO operations to accomplish. Similarly, when the results of the prefix sum are to be delivered in Step 3, the real processors simply deliver the results to the variable T already in the current stack frame of each process, prior to relinquishing control to each virtual processor.)


maximum number of items received by a real processor is (v/p)·v^ε·b ≤ (v/p)·(N/v) = N/p. From the earlier lemma, the number of rounds used over the p real processors by NumberBlocks is at most r + 1, where r = ⌈1/ε⌉ is the recursion depth in EMLowSlackRouting.

The last round of routing consists of v^(1−ε) subproblems, each involving v^ε processors. The data for each of these processors is spread over v^ε processors and is to be delivered to its final destination in this round.

Since the size of the groups in the final round is only v^ε, we can no longer rely on successive blocks being sent to successive physical processors, and so the round-robin allocation approach can result in large imbalance between physical processors in the data they receive: the largest one can receive O(p·N/(vb)) items. We do, however, have the necessary conditions to apply algorithm BalancedRouting in the final round instead, as the following lemma shows.

Lemma. Let v′ be the number of processors involved, and let N′ be the size of each routing subproblem in the final round of EMLowSlackRouting. Then N′ ≥ (v′)^2·b, and so algorithm BalancedRouting can be used for b ≤ N′/(v′)^2.

Proof. Since for each subproblem there are v^ε processors, each with at least v^ε·b data items, each subproblem in the final round contains N′ ≥ v^(2ε)·b data. Therefore we have a problem of size N′ with v′ = v^ε processors and N′ ≥ (v′)^2·b. From the earlier theorem, therefore, we have the necessary conditions for algorithm BalancedRouting when b ≤ N′/(v′)^2.

We now turn to the issue of whether the communication between real processors in algorithm EMLowSlackRouting is balanced in rounds 1, …, ⌈1/ε⌉ − 1. If communication is unbalanced, not only may an arbitrary processor do more than its share of the work, but it may be asked to receive more data than can fit into its internal memory in a single round.

In general, the kp virtual processors may be participants of, say, z groups in the r-th round.

First we consider the case when k = 1. In general, the first few real processors contain some of the virtual processors of group i, the last few processors contain some of the virtual processors of group i + z − 1, and the processors in between contain all of the virtual processors of groups i + 1 to i + z − 2.

We call the groups i + 1 to i + z − 2, whose virtual processors are completely contained within the p real processors, full groups. A full group produces an imbalance in communication of at most one block between its real processors. The partial groups, defined in the next paragraph, may also have an effect on these real processors, however.

We refer to the (at most two) groups which are only partly contained within the p real processors as partial groups. Within each bucket, every block has a label


between 0 and N/(v^ε·b) − 1, but only part of this range will be represented in such a group. We refer to these buckets as partial buckets.

The following lemmas show that, although we simulate only p virtual processors at a time in each superstep of EMLowSlackRouting, the communication and memory usage of the real processors remains balanced when p ≤ v^ε.

Lemma. Let p ≤ v^ε, and let j be an arbitrary bucket in a fixed round of EMLowSlackRouting. In the routing step of EMLowSlackRouting, each real processor receives at most one more block of bucket j than any other real processor.

Proof. There are at least v^ε processors in any group during the r-th round of EMLowSlackRouting for r ≤ ⌈1/ε⌉ − 1, and during these rounds v^ε ≥ p. We therefore have at least as many virtual processors in a group as there are real processors. Since virtual processors are allocated to real processors in round-robin order, any real processor hosts at most one more virtual processor of group j than any other.

Since the blocks in bucket j are allocated to the virtual processors of group j in round-robin order, each real processor receives at most one more block of bucket j than any other real processor in the routing step of EMLowSlackRouting.
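The round-robin balance property used in this proof is easy to check numerically; the following sketch (illustrative, not from the thesis) counts how many blocks of a single bucket each real processor receives when consecutive blocks go to consecutive virtual processors:

```python
# Round-robin: block t of a bucket goes to virtual processor (start + t) mod v,
# and virtual processor i lives on real processor i mod p.  When p divides v,
# consecutive integers cycle through the residues mod p, so the per-processor
# counts differ by at most one, for any block count and any starting offset.

def real_processor_loads(num_blocks, v, p, start=0):
    loads = [0] * p
    for t in range(num_blocks):
        virtual = (start + t) % v
        loads[virtual % p] += 1
    return loads
```

For example, `real_processor_loads(10, 8, 4, start=3)` spreads 10 blocks so that the largest and smallest per-processor counts differ by at most one.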

Lemma. Let p ≤ v^ε and let k = 1. No real processor receives more than O(N/(vb)) blocks in the routing step of EMLowSlackRouting, in rounds 1, …, ⌈1/ε⌉ − 1.

Proof. Consider the memory of p consecutive virtual processors in the r-th round, for r ≤ ⌈1/ε⌉ − 1, where only a single virtual processor is simulated concurrently in each of the p real processors. Their local memories are divided into v^ε buckets.

We send blocks with consecutive numbers to different virtual processors with consecutive labels. Because of the round-robin allocation of virtual processors to physical processors, blocks with consecutive numbers are guaranteed to be sent to different real processors as well in rounds r = 1, …, ⌈1/ε⌉ − 1. The worrisome issue, however, is whether one real processor receives significantly more data than any other. From the previous lemma, this cannot be as a result of distributing a single bucket. We now examine whether it can still happen due to the fact that we distribute the v^ε different buckets independently.

Since p ≤ v^ε, the p virtual processors may be participants of at most two groups. Let us assume that p_1 virtual processors participate in the first of these groups, p_2 participate in the second, and p_1 + p_2 = p. Let us concentrate on the first of the two groups. In this case the data from p_1 virtual processors is sent to p ≥ p_1 real processors.

 

(In round ⌈1/ε⌉, however, since the groups are only v^ε processors in size, successive blocks would not necessarily be sent to successive real processors if we used the same algorithm.)


The difference in the number of blocks received by a real processor for a single bucket of a single group is at most one, no matter how much data is in that bucket (i.e., partial or not).

The worst case of imbalance occurs when most of the buckets ("thin" buckets) contain only w < p blocks, which get allocated to the same set of w real processors P_1, …, P_w for every thin bucket. The figure below illustrates this case for w = 1. Since we have an (N/v)-relation, each virtual processor distributes O(N/(vb)) blocks in total. Each of the w processors can receive at most v^ε blocks from thin buckets, since we have only v^ε buckets in total.

[Figure: Imbalance in data received by the real processors. The first v^ε − 1 buckets contain only one block each, which coincidentally get routed to the same real processor. The last bucket must then contain p·v^ε − (v^ε − 1) blocks, which get evenly distributed over the p processors.]

Since every virtual processor must send v^ε blocks in total, and there are p of them participating, we will have p·v^ε − w(v^ε − 1) blocks in the last partial bucket (the "fat" bucket).

We know, however, that the data from the fat bucket will be distributed evenly among those real processors which contain the virtual processors of this group. (If, say, two blocks are to be routed to P_1 in a bucket, this would imply that all of the other real processors receive at least one block as well.) Therefore P_1, …, P_w may each receive another (p·v^ε − w(v^ε − 1))/p ≤ v^ε blocks this way, for a total of at most 2v^ε, and this is no more than 2N/(vb), since v^ε ≤ N/(vb) by the slackness constraint.

This process can happen at both ends of the range of p virtual processors. If we assume that some of P_1, …, P_w can receive both sets of extra blocks, we obtain a bound of O(N/(vb)).

vb

We now consider k > 1 processors simulated concurrently by each real processor.

Lemma. Let p ≤ v^ε and let k > 1. No real processor receives more than O(kN/(vb)) blocks in the routing step of EMLowSlackRouting, in rounds 1, …, ⌈1/ε⌉ − 1.

Proof. Each additional p virtual processors added to the memories of the real processors in a simulation round can do no worse than independently produce the scenario of the figure above. Since for k = 1 the maximum communication received by a processor was O(N/(vb)) blocks, the worst case is O(kN/(vb)) blocks for general k.

Lemma. For p = O(v^ε), EMLowSlackRouting uses O(N/p + λL) computation time, O(g·(N/p) + λL) communication time, and O(G·(N/(pBD)) + λL) IO time, where λ = O(⌈1/ε⌉) for constant ε > 0.

Proof. From the preceding lemmas we can conclude that each virtual processor has O(N/v) data in each round, for b = O(N/v^(1+ε)). Exclusive of EMNumberBlocks, therefore, the computation time and communication time of EMLowSlackRouting are both O(N/v + L) per round. From the lemma above we get the overall computation time, including EMNumberBlocks, to be O(N/p + λL), and the communication time to be O(g·(N/p + v^ε·b) + λL), for λ = O(⌈1/ε⌉).

Single Processor Target Machine

In each round of EMLowSlackRouting, the EM processor first allocates v^ε buckets, each of size ⌈b/(BD)⌉·v^(1−rε) disk tracks, where r is the current recursion depth in algorithm LowSlackRouting (the initial call to LowSlackRouting corresponds to r = 1).

The execution of each virtual processor, as it executes algorithm LowSlackRouting, is suspended after the communication superstep. The simulation of this step is performed by simply writing the communication blocks intended for bucket j to the end of a list of blocks stored in the area reserved for bucket j on the disk. When every virtual processor has completed recursion depth r = r′ of LowSlackRouting, the first processor's execution at level r = r′ + 1 begins. For purposes of our EM routing algorithm, we need only ensure that each processor of a given group receives the same number of blocks of the messages sent to that group in the previous superstep. As


it begins execution of a superstep, each virtual processor takes the next ⌈(N/v)/(BD)⌉ undelivered disk tracks for its group.

Both reading and writing of message data is done BD items at a time. As we know beforehand the total number of blocks destined for each group, the sequential simulation for this case need not include the ranking step (NumberBlocks) required by the parallel algorithm LowSlackRouting, and the final routing step need not be done by BalancedRouting, as in the multiple real processor case.
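The bucket lists on disk behave like append-only queues that the next recursion level consumes in track-sized pieces; a minimal in-memory model (ours, for illustration only):

```python
# Single-processor simulation sketch: a communication superstep becomes
# appends to per-bucket lists on "disk"; the next level claims the next
# undelivered tracks of its group, in order.

class DiskBuckets:
    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]
        self.read_pos = [0] * num_buckets  # next undelivered track per bucket

    def write(self, j, block):
        # append a communication block to the area reserved for bucket j
        self.buckets[j].append(block)

    def next_tracks(self, j, count):
        # a virtual processor claims the next `count` undelivered tracks
        start = self.read_pos[j]
        self.read_pos[j] += count
        return self.buckets[j][start:self.read_pos[j]]
```

Because writes only ever append and reads only ever advance, blocks are delivered to the processors of a group in the order they were written, which is all the EM routing argument requires.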

Multiple Processor Target Machine

We now describe a simulation technique for the multiple real processor case. Each real processor i, 0 ≤ i ≤ p − 1, executes algorithm ParCompoundSuperstep in parallel. For ease of exposition, we assume that pk divides v. The v virtual processors are assigned to the p real processors in round-robin order, i.e., real processor i hosts virtual processors i, i + p, i + 2p, …, i + v − p.
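Under this round-robin assignment, the k virtual processors that real processor i simulates in each iteration of the algorithm below are given by simple index arithmetic; a small sketch (illustrative names, assuming pk divides v as stated):

```python
# Round-robin hosting: real processor i hosts virtual processors
# i, i+p, i+2p, ..., i+v-p.  In iteration j it simulates k of them.

def hosted(i, v, p):
    return list(range(i, v, p))

def batch(i, j, v, p, k):
    # the k virtual processors real processor i handles in iteration j
    return [j * k * p + ell * p + i for ell in range(k)]
```

Over all i in 0..p−1 and j in 0..v/(pk)−1, the batches partition the v virtual processors, so every context is read and written exactly once per compound superstep.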

Algorithm ParCompoundSuperstep

Objective: Simulation of a compound superstep of a v-processor CGM on a p-processor EMCGM.

Input: The message and context blocks of the virtual processors are divided among the real processors and their local disks. Each real processor i, 0 ≤ i ≤ p − 1, holds O(N/(pB)) blocks of messages and O(N/(pB)) blocks of context, and each local disk contains O(N/(pBD)) blocks of messages and O(N/(pBD)) blocks of context.

Output: The changed contexts and generated messages, distributed as required for the next compound superstep.

For j = 0 to v/(pk) − 1 do:

(a) Read the contexts of virtual processors jkp + i, (jk + 1)p + i, …, ((j + 1)k − 1)p + i from the local disks.

(b) Read any message blocks addressed to virtual processors jkp + i, (jk + 1)p + i, …, ((j + 1)k − 1)p + i from the local disks.

(c) Simulate the computation supersteps of virtual processors jkp + i, (jk + 1)p + i, …, ((j + 1)k − 1)p + i, collecting all generated messages in the local internal memory.

(d) Send all generated messages to the required real destination processors. Upon arrival, the messages are arranged within the internal memory of the real destination processor and then written to its disks.

(e) Write the contexts for virtual processors jkp + i, (jk + 1)p + i, …, ((j + 1)k − 1)p + i back to the local disks.

End of Algorithm

Details of algorithm ParCompoundSuperstep: Steps (a), (b), (c), and (e) are identical to the corresponding steps in the earlier simulation chapter (see algorithm ParCompoundSuperstep there). Step (d) is accomplished by the routing steps of algorithm EMLowSlackRouting, followed by algorithm BalancedRouting once the routing subproblems are reduced to delivering messages among v^ε processors.

We choose b = BD, and so, for each interim destination during the routing steps of EMLowSlackRouting, every message block can be written to the D local disks in parallel. In the final routing step, involving algorithm BalancedRouting, a distributed messaging matrix is created as in algorithm ParCompoundSuperstep, but for each subproblem of v^ε processors.

We can now restate the simulation theorem with a more general slackness constraint for the case p = O(v^ε).

Theorem. A v-processor CGM algorithm A with λ supersteps, computation time λ·O(N/v) + λL, communication time λ·g·O(N/v) + λL, and local memory size O(N/v) can be simulated as a p-processor EMCGM algorithm A′ with computation time O((v/p)·λ·(N/v) + (v/p)·λL), communication time g·O((v/p)·λ·(N/v) + (v/p)·λL), and IO time G·O(λ·N/(pBD) + (v/p)·λL), for p = O(v^ε), N/v ≥ kBD, and N ≥ v^(1+ε)·BD, for arbitrary integer k ≥ 1.

Let g(N), L(N), and v(N) be increasing functions of N. If A is c-optimal on the CGM for g ≤ g(N), L ≤ L(N), and v ≤ v(N), then A′ is a c-optimal EMCGM algorithm for g ≤ g(N), G = o(BDp/v), and L ≤ L(N). A′ is work-optimal, communication-efficient, and IO-efficient if A is work-optimal and communication-efficient, g ≤ g(N), G = O(BDp/v), and L ≤ L(N).

Proof. The changes between algorithm ParCompoundSuperstep above and its earlier counterpart are restricted to the first ⌈1/ε⌉ routing steps and the constraint that b = BD. EMLowSlackRouting takes a constant number of supersteps for constant ε > 0. Therefore, asymptotically, the computation and communication time, as well as the total number of supersteps, do not change from the earlier theorem.

However, the slackness requirement now becomes N ≥ v^(1+ε)·BD, absorbing a previous requirement that N ≥ vBD, and therefore eliminating the constraint v ≥ D.

Chapter

Problems of Geometrically

Changing Size

Introduction

The techniques described in the previous sections may require asymptotically more IO than is necessary for certain problems, specifically those whose active set of data changes in size by a constant factor in each round. Examples include certain algorithms for parallel prefix computations and list ranking, where during a series of rounds the bridging of nodes eliminates a constant fraction of the input from the computation in each round. Our previous approach is inefficient for this sort of problem because, over the O(log N) rounds of geometrically decreasing problem size, the computation time is perhaps only O(N) (as in the list ranking example), but we would swap a total volume of O(Γ·log N) context, where Γ is the maximum context size. In this section we discuss modifications to the simulation which allow it to take advantage of the knowledge that the problem size is changing in this way. The modifications we will discuss are relevant to the asymptotic complexity of the simulation of a client algorithm with a nonconstant number of rounds of such geometrically decreasing problem sizes, but in practice may also represent a useful saving of elapsed time.

The main idea of this improvement to the simulation is to swap only the active portion of the context of a virtual processor, and in order to do this we require that the client algorithm keep track of which portions of its data space are active and which are not. In practice, we anticipate that the client algorithm will also need to defragment its data space periodically (say, every round, or after a constant number of rounds) to make the active and inactive portions individually contiguous. We will also require the client algorithm to provide the sizes and locations of these portions of its data space to the simulation. Aside from these general implementation remarks, however, we will restrict our attention here to the theoretical issues arising from this modification.

We consider the messaging and the context swapping, but do not specifically consider the executable program code itself. It is not relevant to our asymptotic analysis because it can be considered to be of constant size. Its effects on the analysis are captured in the expressions for the context.

Reducing IO Overheads

We use the term phase to mean a series of consecutive rounds of a BSP-like algorithm. A decreasing phase will refer to a phase in which the active context size is decreasing by a constant factor f > 1 in each round; i.e., if the context size in round i is Γ_i, then the context size in round i + 1 is Γ_i/f. An increasing phase will refer to a phase in which the active context size is increasing by a constant factor f > 1 in each round; i.e., if the context size in round i is Γ_i, then the context size in round i + 1 is f·Γ_i. A monotone phase is a phase which is strictly increasing or decreasing.

Let Γ_min be the minimum size of active context that is swapped during a monotone phase. Clearly, in order to ensure that the IO remains blocked and parallel to all D disks, we require that k·Γ_min ≥ BD, where k is the number of virtual processors simulated in the memory of a real processor concurrently. While the actual active context size may become smaller, we will require that the simulation track the actual active context size only as small as Γ_min. The lemma below shows that if the simulation can track the context for log_f(Γ_max/Γ_min) rounds, the total volume of contexts swapped into internal memory during the phase is only O(Γ_max).

MAX

Lemma. Let Γ_max be the context size at the beginning of a decreasing phase Φ with Δ rounds. For constant f > 1 and Γ_min = Γ_max/Δ, the sum Γ_total of the sizes of the active contexts which must be swapped during Φ is O(Γ_max), and this can be achieved in a fully blocked fashion on a single-processor EMBSP with D disks, provided that k ≥ Δ·BD/Γ_max.

Proof. Γ_total is defined as follows:

Γ_total = Γ_max + Γ_max/f + Γ_max/f^2 + … + Γ_max/f^(Δ−1).

A useful bound on Γ_total for our purposes is

Γ_total ≤ Σ_{i=0}^{⌈log_f(Γ_max/Γ_min)⌉} Γ_max/f^i + (Δ − log_f(Γ_max/Γ_min))·Γ_min.

Now, if we choose Γ_min = Γ_max/Δ, this becomes

Γ_total ≤ Σ_{i=0}^{⌈log_f(Γ_max/Γ_min)⌉} Γ_max/f^i + Γ_max.

Now, using the identity

Σ_{i≥0} x^i = 1/(1 − x), where x < 1,

we obtain, for x = 1/f,

Γ_total ≤ Γ_max·f/(f − 1) + Γ_max = O(Γ_max).

So, for Δ ≥ log_f(Γ_max/Γ_min), we have Γ_total = O(Γ_max).

In order for the IO to be blocked and in parallel to all D disks, we also require that the contexts of k processors fill a parallel disk track; in other words,

k·Γ_min ≥ BD.

Since we chose Γ_min = Γ_max/Δ, we get

k·Γ_max/Δ ≥ BD, i.e., k ≥ Δ·BD/Γ_max.



The case of an increasing phase is symmetric. During the last (resp. first) Δ − ⌈log_f(Γ_max/Γ_min)⌉ rounds of a decreasing (resp. increasing) phase, the actual context size may be less than Γ_min = Γ_max/Δ, but we can transfer Γ_max/Δ data in each of these rounds without exceeding O(Γ_max) data transfer overall for the phase, or violating our blocking requirements for the disks.
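The O(Γ_max) bound can be checked numerically. The sketch below (an illustration consistent with the choice Γ_min = Γ_max/Δ made above, with names of our own) sums the volume swapped over a decreasing phase:

```python
# Total context volume swapped over a decreasing phase: Gamma_max/f^i per
# round while above Gamma_min, and Gamma_min per round thereafter.

def total_swapped(gamma_max, f, rounds):
    gamma_min = gamma_max / rounds  # Gamma_min = Gamma_max / Delta
    total = 0.0
    size = float(gamma_max)
    for _ in range(rounds):
        total += max(size, gamma_min)  # never track below Gamma_min
        size /= f
    return total
```

For f = 2 the geometric part sums to less than 2·Γ_max and the clipped tail adds at most Γ_max, so the total stays below 3·Γ_max however many rounds the phase runs.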

Lemma. A monotone phase Φ with Δ rounds, computation time τ + ΔL, and parameter f can be simulated on a single-processor EMCGM machine with O(k·Γ_max) internal memory, in O(v·τ + v·Γ_max/B) computation operations and O(v·Γ_max/(BD)) IO operations, provided that k ≥ Δ·BD/Γ_max, N_max ≥ Δ·(v/k)·((v/k) + B)·B, and D ≤ v/k, for integer constant k ≥ 1.


Proof. Algorithm SeqCompoundSuperstep can be used for the simulation, with changes discussed below. We discuss the case of a decreasing phase; an increasing phase is symmetric to a decreasing phase, and so the arguments remain the same.

IO Due to Contexts: First we consider how the simulation should deal with the contexts. We can handle the swapping of contexts by allocating a fixed area on the D disks, exactly as in the fixed-context-size case, but with the space for each group of k virtual processors divided into two parts: (i) the active portion of the contexts, and (ii) the inactive portion of the contexts. The inactive portion is not swapped into the real memory of the simulating processor until needed in a future round.

We first consider the case when the active context size is decreasing during a phase. In each round the active portion decreases in size by the factor f. At the end of a round, the simulation concatenates the newly deactivated part of the context to the inactive portion on the disks, and writes the active portion into the remainder of the disk space reserved for the context. By the lemma above, we need only O(Γ_max/(BD)) IO operations per virtual processor for the swapping of contexts during such a phase.

IO Due to Messages: We assume that the total message data volume in each round is also limited by the volume of the active contexts, and that the message volume during a phase varies by the same factor f between rounds as does the context size. As before, we use a fixed-size messaging matrix throughout the simulation. As the message volume drops during a decreasing phase, less of the space allocated to the messages for each virtual processor will be required, but no change in the address calculation is required.

It remains to ensure that the message volume is sufficient for algorithm BalancedRouting to work, and that the IO operations performed by algorithms SeqCompoundSuperstep and ParCompoundSuperstep to simulate this communication remain blocked and in parallel to the D disks.

Let N_min and N_max represent the minimum and maximum message volumes, respectively, of a round during phase Φ. We apply the same analysis to the maximum message volume in a round as we used to derive the bound on Γ_total above; this gives us a minimum data volume of N_min = N_max/Δ over a monotone phase of Δ rounds.

First we examine the constraints required by algorithm BalancedRouting. The relevant lemma requires a lower bound on the problem size N, i.e., N ≥ v·(v + B)·B. If we replace N by the minimum data size N_min = N_max/Δ, this constraint becomes N_max ≥ Δ·v·(v + B)·B. When k virtual processors are simulated concurrently in each real processor, there are only v/k distinct destinations for the messages, and so this constraint becomes N_max ≥ Δ·(v/k)·((v/k) + B)·B.

For algorithms SeqCompoundSuperstep and ParCompoundSuperstep we have the requirement N ≥ vBD, which ensures that the messages sent or received by a group of k virtual processors in one round are enough to justify doing a parallel IO to the D disks. In our current scenario, however, we must adapt this condition to the minimum data size. If we replace N by the minimum data size N_min = N_max/Δ, this constraint becomes N_max ≥ Δ·vBD. This constraint is absorbed by the constraint N_max ≥ Δ·(v/k)·((v/k) + B)·B when D ≤ v/k.

Computation Time: As v virtual processors are simulated on a single real processor, the total time for Step (c) is O(v·τ). IO contributes computational overhead proportional to the number of blocks input or output, or O(v·Γ_max/B).

Lemma. A monotone phase Φ of a CGM with Δ rounds, computation time τ + ΔL, communication time g·h + ΔL, and parameter f can be simulated on a p-processor EMCGM machine in O(v·τ/p + v·Γ_max/(pB)) parallel computation operations, O(v/(pk)) communication operations, and O(v·Γ_max/(pBD)) IO operations, and with synchronization time (v/(pk))·ΔL, provided that k ≥ Δ·BD/Γ_max, N_max ≥ Δ·(v/k)·((v/k) + B)·B, D ≤ v/k, and p ≤ v/k, for integer constant k ≥ 1.

Proof. We use the results of the single-processor lemma above. Each of the p real processors simulates v/p virtual processors, k at a time. The synchronization time is therefore (v/(pk))·ΔL. Because there are now p processors instead of only one, the IO operations decrease from O(v·Γ_max/(BD)) to O(v·Γ_max/(pBD)), giving computational overhead of O(v·Γ_max/(pB)). The basic computation time decreases from O(v·τ) to O(v·τ/p), however. Since real communication is not required between virtual processors simulated on the same real machine, and since k virtual processors are simulated concurrently on each real machine, the number of non-local processor groups is v/(pk), and the number of communication operations is reduced, compared to the CGM, to O(v/(pk)). The constraints carry over from the single-processor case.

Theorem. Let A be a v-processor BSP-like algorithm with λ supersteps and m monotone phases Φ_1, …, Φ_m, where during each phase Φ_i the active problem size changes by a constant factor f_i in each of Δ_i rounds. Let Γ_i be the maximum context size

(When we assume that computational overhead for IO is proportional to the number of blocks read/written, it is equivalent to assuming that the CPU is involved only in posting the request for each block, and that the IO is performed by an IO channel or disk controller directly from/to a buffer in our program's data space. While this is possible, and often true on larger systems, many older versions of the C IO library have performed the IO from/to a separate system buffer. This approach requires that the data be copied to/from the user buffer after/before the input/output operation. The copy operation requires the involvement of the CPU for every item of the buffer, and so this approach implies computational overhead which is B times larger.)


for each phase, and let τ_i + Δ_i·L and g·h_i + Δ_i·L be the computation time and communication time, respectively, of each phase Φ_i of A. Let Γ_max = max_i Γ_i.

A can be simulated on a single-processor EMCGM machine in O((v/(BD))·Σ_i Γ_i) parallel IO operations, each of BD items, provided that k ≥ Δ_i·BD/Γ_i for each i, N_max ≥ Δ_i·(v/k)·((v/k) + B)·B, and D ≤ v/k, for integer constant k ≥ 1.

Proof. We use the single-processor lemma above for each monotone phase, plus the arguments of the earlier simulation theorem.

Theorem. Let A be a v-processor BSP-like algorithm with λ supersteps and m monotone phases Φ_1, …, Φ_m, where during each phase Φ_i the active problem size changes by a constant factor f_i in each of Δ_i rounds. Let Γ_i be the maximum context size, and τ_i + Δ_i·L and g·h_i + Δ_i·L be the computation time and communication time, respectively, of each phase Φ_i of A.

A can be simulated on a p-processor EMCGM machine in O((v/(pBD))·Σ_i Γ_i) parallel IO operations, each of pBD items, provided that k ≥ Δ_i·BD/Γ_i for each i, N_max ≥ Δ_i·(v/k)·((v/k) + B)·B, D ≤ v/k, and p ≤ v/k, for integer constant k ≥ 1.

Proof. We use the p-processor lemma above for each monotone phase, plus the arguments of the earlier simulation theorem.

We therefore can simulate an algorithm containing monotone phases with less IO cost than the simplest approach would suggest. Under the conditions stated in the theorems, it costs only a constant times N/(pBD) parallel IO operations for each of the monotone phases, despite the fact that such phases may involve more than a constant number of supersteps.

Chapter

New EM Algorithms

Overview

A number of basic algorithms with applications in data structuring, computational geometry, and graph theory have been developed for BSP models by various authors. In this section we present adaptations of a selection of these algorithms for the EMBSP models.

In many cases the IO complexities of these EMBSP algorithms appear at first glance to contradict known lower bounds for their problems. We first explain this contradiction by discussing the effect of our constraints on the lower bounds. We then present simple BSP algorithms for the fundamental problems of sorting, permutation, and matrix transpose, and describe their adaptations to our EMBSP models. We also present several new multisearch algorithms for EMBSP and EMCGM, and a number of other new EMCGM algorithms which we obtained from known CGM algorithms.

Apparent Reductions in IO Complexity

Several known lower bounds on IO complexity contain a multiplicative factor of log_{M/B}(N/B). We obtain a number of results where this factor does not appear. This can be explained by the fact that the term log_{M/B}(N/B) is a constant when N ≥ v^(1+ε)·b for ε > 0, as shown by the lemma below.

For instance, this lemma, together with Aggarwal and Vitter's lower bound of Ω((N/(BD))·log_{M/B}(N/B)) IOs for sorting, implies lower bounds of Ω(N/(pBD)) IO operations for a number of problems, such as sorting, general permutation, matrix transpose, computing FFTs, etc., when the slackness constraint N ≥ v^(1+ε)·b is valid.


Lemma. For slackness N ≥ v^(1+ε)·b, where ε > 0 is a constant, the value of log_{M/B}(N/B) is a constant whose value depends on ε but not on N.

Proof. For constant ε > 0,

N ≥ v^(1+ε)·b, i.e., v^(1+ε)·b ≤ N.

Now, choosing c = (1 + ε)/ε, and substituting v ≥ N/M and b ≥ B, this becomes

(N/M)^(1+ε)·B ≤ N,

and rearranging gives

(M/B)^c ≥ N/B, so that log_{M/B}(N/B) ≤ c.

We can conclude that log_{M/B}(N/B) is a constant whose value depends on ε but not on N, when N ≥ v^(1+ε)·b.

So, for constant ε > 0, when N ≥ v^(1+ε)·b, the logarithmic term is a constant of size at most (1 + ε)/ε.
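This behaviour is easy to verify numerically. The sketch below (illustrative only; it chooses v so that the slackness bound N = v^(1+ε)·b is tight, with b = B and M = N/v) evaluates the logarithmic term for growing N:

```python
import math

# With slackness N >= v^(1+eps) * b, taking b = B and M = N/v, the term
# log_{M/B}(N/B) stays at or below (1+eps)/eps regardless of how large N grows.

def log_term(N, M, B):
    return math.log(N / B) / math.log(M / B)

eps = 0.5
bound = (1 + eps) / eps  # = 3 for eps = 0.5
B = 64
for N in [2 ** 20, 2 ** 30, 2 ** 40]:
    v = (N / B) ** (1 / (1 + eps))  # makes N = v^(1+eps) * B exactly
    M = N / v
    assert log_term(N, M, B) <= bound + 1e-9
```

The term equals (1 + ε)/ε exactly at the tight slackness boundary and only shrinks when the slackness is larger, which is why the logarithmic factor disappears from the IO complexities below.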

Fundamental Problems

We first present new EMCGM algorithms, obtained from CGM algorithms via the simulation results above, for the fundamental problems of sorting, permutation, and matrix transpose. In each case the CGM algorithm uses O(1) communication rounds and O(N/v) internal memory per processor. Since our simulation techniques require that N ≥ v^(1+ε)·b, the IO complexities of the resulting parallel external memory algorithms do not exhibit the logarithmic factor known to be present in the general case for these problems.

Sorting

The time complexity of sorting N items is Θ(N lg N) for a comparison-based model of computation. On the PDM, sorting has been shown to have IO complexity Θ((N/(BD))·log_{M/B}(N/B)) for general values of N, M, D, and B.

Goodrich has described a deterministic BSP algorithm for sorting which has a constant number of supersteps for N ≥ v^(1+ε), ε a fixed constant. We can achieve


IO complexity O(N/(pBD)) by simulating this algorithm using the deterministic simulation technique described above, for p = O(v^ε) and N ≥ v^(1+ε)·BD, ε a fixed constant.

Alternatively, we can simulate the algorithm CGMSampleSort, due to Shi and Schaeffer, which is described and analyzed later in this chapter, using the deterministic simulation techniques described above. The next two lemmas give the performance of CGMSampleSort and its external memory version, EMCGMSampleSort, respectively, analyzed under the EMCGM model presented earlier. The proofs of both lemmas are given later in this chapter.

N N N

log L computation time O g L Lemma CGMSampleSort uses O

v v v

N

communication time O local memory per processor and O communication

v

N

supersteps provided that v

v

Lemma CGMSampleSort can be simulated on a p processor EMCGM ma

N log Nv

v N v

L computation time O g L communication time chine in O

p p p p

v N

L IO time provided that p v N v B D B B v and O G

pB D p

N N

Since CGMSamplesort requires v we obtain IO complexityof O

v pB D

for EMCGMSampleSort for N v B v D B v
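The idea behind a CGM sample sort can be sketched in a few lines: each processor sorts locally, a global set of v − 1 splitters is chosen from a regular sample, and a single all-to-all round routes every item to the processor owning its splitter interval. The following simplified, single-round, in-memory sketch is our own illustration, not the analyzed algorithm from the cited work, and makes no load-balance guarantee.

```python
import bisect

def cgm_sample_sort(data, v):
    """Single-round sample sort across v simulated processors (illustrative)."""
    n = len(data)
    # Local sort on each simulated processor's chunk.
    chunks = [sorted(data[i * n // v:(i + 1) * n // v]) for i in range(v)]
    # Each processor contributes a regular sample; pick v - 1 global splitters.
    sample = sorted(s for c in chunks for s in c[::max(1, len(c) // v)])
    splitters = sample[len(sample) // v::max(1, len(sample) // v)][:v - 1]
    # One all-to-all communication round: route items by splitter interval.
    inbox = [[] for _ in range(v)]
    for c in chunks:
        for x in c:
            inbox[bisect.bisect_right(splitters, x)].append(x)
    # Each processor sorts what it received; concatenation is globally sorted.
    return [x for box in inbox for x in sorted(box)]

print(cgm_sample_sort([5, 3, 8, 1, 9, 2, 7, 4], 2))  # [1, 2, 3, 4, 5, 7, 8, 9]
```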

Permutation

Permutation of N items on a RAM has time complexity Θ(N). On the PDM this problem has I/O complexity Θ(min{N/D, (N/(BD))·log_{M/B}(N/B)}) (see the cited work). However, we can achieve I/O complexity O(N/(pBD)) by simulating algorithm CGMPermute for p = O(v), N ≥ v^(1+ε)·BD, and fixed constant ε. The slackness constraint N ≥ v^(1+ε)·BD comes from the routing algorithm LowSlackRouting.

Algorithm CGMPermute

V is an N-element vector containing items to be permuted; P is a corresponding N-element vector containing new indices for each element of V.

Input: Each processor i, 0 ≤ i < v, holds an (N/v)-element vector V_i containing elements i·(N/v)+1 to (i+1)·(N/v) of V, and an (N/v)-element vector P_i containing elements i·(N/v)+1 to (i+1)·(N/v) of P.

Output: Each processor i contains items i·(N/v)+1 to (i+1)·(N/v) of the permuted vector V.

Assumption: v divides N evenly.

(1) Each processor i, 0 ≤ i < v, sends the items of V_i to the processors holding the items indicated by P_i.

(2) Each processor performs the necessary rearrangements in its local memory to complete the calculation of the permutation.

End of Algorithm

CGMPermute performs the indicated permutation in a single communication round consisting of an (N/v)-relation. The internal computation time is O(N/v), and the memory used per processor is O(N/v).
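The two steps of CGMPermute can be sketched directly, simulating the v virtual processors in memory. The function name and data layout below are ours, chosen for illustration; 0-based indices are used throughout.

```python
def cgm_permute(V, P, v):
    """Permute V so that item V[j] ends up at index P[j]; v must divide len(V)."""
    N = len(V)
    assert N % v == 0
    chunk = N // v
    # Step 1: each processor i sends each of its items to the processor
    # owning the item's destination index (a single (N/v)-relation).
    inbox = [[] for _ in range(v)]
    for i in range(v):
        for j in range(i * chunk, (i + 1) * chunk):
            dest = P[j] // chunk
            inbox[dest].append((P[j], V[j]))
    # Step 2: each processor rearranges the received items in local memory.
    out = [None] * N
    for i in range(v):
        for idx, item in inbox[i]:
            out[idx] = item
    return out

# Example: rotate four items one position to the right on v = 2 processors.
print(cgm_permute(['a', 'b', 'c', 'd'], [1, 2, 3, 0], 2))  # ['d', 'a', 'b', 'c']
```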

Matrix Transp ose

Transposing an n × m matrix, where N = n·m, takes Θ(N) time on a RAM. On the Parallel Disk Model this problem has I/O complexity Θ((N/(BD))·(log min{M, n, m, N/B})/(log(M/B))). However, we can achieve I/O complexity O(N/(pBD)) by simulating algorithm CGMTranspose below, for N ≥ v^(1+ε)·BD. For ease of exposition we assume that v divides N.

Algorithm CGMTranspose

Let M be an n × m matrix and let a_{ij} be the element of M in row i and column j. Let M̃ be a one-dimensional array containing the elements of M row by row, i.e. a_{11}, a_{12}, ..., a_{1m}, a_{21}, ..., a_{nm}. Let M^T be the transpose of M, and let a^T_{ji} = a_{ij} be the element of M^T in row j and column i. Let M̃^T be a one-dimensional array containing the elements of M^T row by row, i.e. a_{11}, a_{21}, ..., a_{n1}, a_{12}, ..., a_{nm}.

Input: Processor i, 0 ≤ i < v, holds items i·(N/v)+1 to (i+1)·(N/v) of M̃.

Output: Processor i holds items i·(N/v)+1 to (i+1)·(N/v) of M̃^T.

(1) Each processor determines the destination processor for each of its items and sends them in a single superstep to their destination.

(2) Each processor inserts the received items into the appropriate positions in its memory.

End of Algorithm

CGMTranspose transposes the matrix M in a single communication round consisting of an (N/v)-relation. The internal computation time is O(N/v), and the memory used is O(N/v) per processor.
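The destination computation in step (1) is simple index arithmetic: element a_{ij}, stored at row-major position i·m + j of M̃, belongs at position j·n + i of M̃^T. The following in-memory sketch of the exchange is our own illustration (0-based positions):

```python
def cgm_transpose(Mflat, n, m, v):
    """Transpose an n-by-m matrix stored row-major in Mflat, across v processors."""
    N = n * m
    assert N % v == 0
    chunk = N // v
    inbox = [[] for _ in range(v)]
    # Step 1: route each item to the processor owning its transposed position.
    for pos in range(N):
        i, j = divmod(pos, m)   # a_ij sits at row-major position i*m + j ...
        tpos = j * n + i        # ... and belongs at position j*n + i of M^T
        inbox[tpos // chunk].append((tpos, Mflat[pos]))
    # Step 2: each processor places the received items locally.
    out = [None] * N
    for proc in inbox:
        for tpos, item in proc:
            out[tpos] = item
    return out

# 2 x 3 matrix [[1,2,3],[4,5,6]] -> 3 x 2 matrix [[1,4],[2,5],[3,6]]
print(cgm_transpose([1, 2, 3, 4, 5, 6], 2, 3, 2))  # [1, 4, 2, 5, 3, 6]
```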

Summary

Theorem and Figure summarize the performance of our EM-BSP algorithms for sorting, permutation and matrix transpose.

Theorem: Sorting, permutation and matrix transpose can be performed on a p-processor EM-CGM with M local memory in each processor in O(N/(pBD)) I/O operations, provided N ≥ v^(1+ε)·BD, for p = O(v) and constant ε > 0.

Proof: The proof follows from Theorem and the performance of the CGM algorithms for sorting, permutation and matrix transpose described in the preceding sections.

The Multisearch Problem

Overview

Earlier chapters described two simulation methods, one randomized and one deterministic, for obtaining EM-BSP algorithms from parallel algorithms for BSP models. In Chapter, our new optimality criteria were described. In the current section we describe research into multisearch algorithms for the EM-BSP models, in which we use the results of all three chapters. Thus, the primary purpose of this chapter is as a case study of the use and limitations of the simulation techniques of those chapters. We omit many of the details of the proofs in this chapter, as they can be found in the cited work.

In multisearch problems, a large number of queries are simultaneously processed and answered by navigating through a large data structure on a parallel computer. A closely related application is an Internet search engine. As input for the multisearch problem we consider a set of n queries and a balanced binary search tree T of size m. The search tree is distributed among the processors and stored in their local memories. For clarity and ease of exposition, we restrict our discussion to the case in which each processor has enough internal memory to keep its share of the queries completely in its internal memory. However, with minor changes, larger sets of queries can be accommodated in an I/O-optimal manner. Figure illustrates an example application of multisearch. In the example, a two-dimensional universe is divided into regions by lines in the plane. The queries are points, and the result of a query is the label of the region containing the point. Figure shows a tree built over the line segments which separate one region from another. The dots represent point queries. The queries navigate from the root of the tree to a particular leaf in order to determine a result. Examining the leaf level of the tree with the point queries superimposed, it is clear that sorting is insufficient for this problem (see the queries labelled q1 and q2, for instance).

(The reader is cautioned that our use of the variables m and n is different from the meanings often used in the external memory literature, where n = N/B and m = M/B.)

Figure: Overview of New EM Algorithms

Group A: Fundamental Algorithms. For each problem we list the PDM I/O complexity (notes a, b, c), the BSP model complexity (notes d, e, f), and the EM-BSP I/O complexity.

Sorting (see section):
  PDM I/O: Θ((N/(BD))·log_{M/B}(N/B)).
  BSP: O((N/v)·log N) computation, O(N/v) communication, O(1) rounds, O(N/v) memory.
  EM-BSP I/O: O(N/(pBD)).

Permutation (see section):
  PDM I/O: Θ(min{N/D, (N/(BD))·log_{M/B}(N/B)}).
  BSP: O(N/v) computation, O(N/v) communication, O(1) rounds, O(N/v) memory.
  EM-BSP I/O: O(N/(pBD)).

Matrix transpose (see section; note g):
  PDM I/O: Θ((N/(BD))·(log min{M, k, N/k, N/B})/(log(M/B))).
  BSP: O(N/v) computation, O(N/v) communication, O(1) rounds, O(N/v) memory.
  EM-BSP I/O: O(N/(pBD)).

Notes:
a. PDM I/O complexities as listed apply to general values for N, M, D and B. If the constraints required by our techniques are applied to these results, the term log_{M/B}(N/B) becomes a constant.
b. When the parameter D is quoted in the PDM I/O Complexity column, it refers to the number of disks overall, whereas the parameter D used in the EM-BSP I/O Complexity column refers to the number of disks on each of the p processors of the machine.
c. The PDM I/O complexities listed for this group apply also to multiple processors when the interconnection method is a shared RAM, hypercubic network, or cube-connected cycles.
d. The BSP running time is λ + g·μ + β·L, where λ is the parallel computation time and β is the number of BSP rounds; M is the peak local memory used by a processor of the BSP machine.
e. These results are subject to the conditions N ≥ v^(1+ε)·B and v ≥ D if the techniques of Chapter are used, and N ≥ v^(1+ε)·BD, p ≤ v, if the methods of Chapter are used. Each of the p real processors has O(k·N/p) memory, for integer k, 1 ≤ k ≤ v/p.
f. We use the following notation: μ is the amount of data communicated, λ is the parallel computation time, and β is the number of parallel I/O operations.
g. k and N/k denote the number of columns and rows, respectively, of the matrix, where N is the total number of elements.

Figure: Multisearch example. Point queries (e.g. q1, q2) select a region of a two-dimensional area divided by line segments.

Multisearch has been shown to be useful for solving a number of data structuring problems on a parallel machine, e.g. trapezoidal decomposition, next element search on hypercubes, multiple planar point location, interval trees, hierarchical representations of polyhedra on meshes, dictionaries on trees on an EREW-PRAM, maintaining balanced binary search trees on a BSP, and multiple point location in a class of hierarchical DAGs on a BSP and a BSP*. The BSP* multisearch algorithm due to Bäumker, Dittrich and Meyer auf der Heide, with improvements due to Dittrich, was the first optimal algorithm for multisearch on a BSP model for both small and large trees.

In external memory, multisearch has been studied for balanced binary trees and optimal trees. A major issue in external memory multisearch is deciding how to block the graph; blocking of graphs for external memory searching has been examined in the literature.

The batch filtering technique, described in Section, can be viewed as a multisearch technique that is I/O-optimal on a single processor machine when n ≥ m. For larger trees, however, batch filtering does more I/O than necessary, and it does not accommodate multiple processors.
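In outline, batch filtering pushes the whole batch of queries through the search structure one level at a time; the point of the real technique is that each block of a level then needs to be read from disk only once per batch rather than once per query. The following compact in-memory sketch of the routing logic is our own illustration (a complete binary search tree stored level by level), not the implementation from the cited work.

```python
def batch_filter(keys_by_level, queries):
    """Route all queries root-to-leaf through a complete binary search tree.

    keys_by_level[h] lists the split keys on level h, left to right; the
    return value gives, for each query, the index of the leaf it reaches.
    """
    node = [0] * len(queries)          # every query starts at the root
    for level in keys_by_level:        # advance the whole batch one level
        for q in range(len(queries)):
            # One comparison per level decides which child to descend into.
            go_right = queries[q] >= level[node[q]]
            node[q] = 2 * node[q] + (1 if go_right else 0)
    return node

# Split keys of a 3-level tree whose 8 leaves cover [0,10), [10,20), ...
tree = [[40], [20, 60], [10, 30, 50, 70]]
print(batch_filter(tree, [5, 25, 65]))  # [0, 2, 6]
```

As the text notes, sorting the queries alone is not enough; each query still has to follow its own root-to-leaf path.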

Our study of multisearch was prompted by its importance as a tool for various problems, but also because it highlights some of the issues regarding the optimality of algorithms under our EM-BSP models.

Section discusses work due to Rau-Chaplin and Dehne et al. on multisearch for the CGM model, and sketches our EM-CGM multisearch algorithm obtained by simulating the CGM algorithm.

Simulation of BSP multisearch gives us an efficient multisearch algorithm on the EM-BSP for small trees, discussed in Section.

While we are able to give a multisearch algorithm that is optimal and I/O-efficient for small trees, I/O-efficiency is not possible for large trees: there is a lower bound on the I/O complexity of multisearch on large trees (n < m) which exceeds the computational and communication costs. In Section, an I/O-optimal EM-BSP algorithm for large trees is sketched. While this algorithm is based on the same BSP multisearch algorithm as in Section, it is not derived via simulation but rather by hand-coding the I/O management at the points in the algorithm where it is necessary. The simulation approach involves swapping the contexts of the virtual processors of a BSP algorithm, and the contexts in this case would include space for the large search tree. This approach would therefore require the entire search tree to be read once for each superstep of the BSP algorithm. This is inefficient for large trees, and so the hand-coded approach is necessary for efficiency reasons. Details are available in the cited work.

Multisearch on the EM-CGM Model

Rau-Chaplin and Dehne et al. described the first efficient multisearch algorithm for a BSP model. Theorem describes the performance of their algorithm on a CGM machine for n ≥ m. This CGM multisearch algorithm does not accommodate larger trees efficiently.

Theorem (adapted from the cited work): Let Q be a set of n queries and let T be a search tree of size m. For n ≥ m, multisearch of Q against T can be done in time O((n/v)·log n + g·(n/v) + L) on a v-processor CGM machine with n/v ≥ v.

As mentioned earlier, multisearch is used as a subroutine for a number of computational geometry problems. In such a situation, the tree and the queries are often generated by a previous step of the higher-level algorithm. Such an example illustrates the usefulness of the CGM technique and of our external memory algorithms for the case n ≥ m. Using the results of Chapter, in particular Theorem, we obtain an EM-CGM algorithm with the following performance.

Corollary: Let Q be a set of n queries and let T be a search tree of size m. Multisearch of Q against T can be done in time O((n·log n)/p) + O(g·(n/p)) + O(G·(n/(pBD))) + O((v/p)·L) on a p-processor EM-CGM machine, for n ≥ m and n/v ≥ v, subject to the constraints of our simulation technique (p ≤ v, n ≥ v^(1+ε)·B and v ≥ D for fixed constant ε > 0), where each real processor has O(k·n/p) memory for arbitrary integer k, 1 ≤ k ≤ v/p.

(Dehne et al. use different terminology. For instance, they reverse our meanings of m and n, and they do not use the usual BSP-style expressions for stating time complexity.)

EM-BSP Multisearch

We now turn to multisearch on the EM-BSP model. We present two different kinds of results in the following. The first results (see Section) are applicable to smaller trees, as in the section above, and are obtained by applying the simulation techniques of earlier chapters to the BSP multisearch algorithm. Section then sketches a new multisearch algorithm for large trees, tailored to the EM-BSP; it does not rely on simulation, but is based on the same underlying BSP algorithm.

Theorem summarizes the performance of the BSP multisearch algorithm TreeSearch.

Theorem: Let T be a d-ary balanced search tree of size m, distributed among v processors by a preprocessing step. Let Q be a set of n queries, distributed evenly among the processors. Let ε > 0 and c be constants, and let b be the BSP* block size parameter, with b ≤ n/(v·log n) and d = O((n/(vb))^ε). Multisearch of Q against T using BSP algorithm TreeSearch uses O((n+m)/v) memory per processor and

    o((n/v)·log m) + O(g·(n/(vb))·log_d m + L·log_d v·log_d m)

running time, with probability 1 − n^{−c}. In particular, TreeSearch is optimal for L = o(n/(v·log v)) and g = o(b·log d).

EM-BSP Multisearch on a Small Tree

When the search tree is small, i.e., about the size of the query set, we can obtain a work-optimal and I/O-efficient algorithm by applying the simulation techniques of Chapter (Theorem) to the BSP multisearch algorithm of Theorem.

Theorem: Let ε > 0 be a constant. Given a d-ary tree T of size m = O(n·log n), distributed by suitable preprocessing, with d, b, B and D chosen as in the underlying BSP algorithm, multisearch for n queries against T can be done w.h.p. in time

    o((n·log m)/p) + O(g·(n/(pb))·log_d m + G·(m/(pBD)) + L·D·log(pD))

on a p-processor EM-BSP with internal memory of size O((n+m)/p + BD) per processor and external memory of size O(m/(pD)) per disk.

In particular, for m = O(n·log n), G = O(BD·(n·log m)/m), g = O(b·log n) and L = O((n·log m)/(p·D·log(pD))), the algorithm is work-optimal, communication-efficient and I/O-efficient.

Proof: We combine the two theorems noted above, observing that each virtual processor holds O((n+m)/v) data and so v = O(m).

The expression for running time in Theorem can be decomposed into four components, as follows. (Refined versions of the BSP multisearch algorithms presented in the original work can be found in later descriptions, on which we base the current work.)

1. The synchronization time L·D·log(pD) must be accounted to each of the computation, I/O and communication times. To be work-optimal, communication-efficient and I/O-efficient, we require that L·D·log(pD) = O((n·log m)/p).

2. Exclusive of synchronization time, the computation time is o((n·log m)/p) + O(m/p). For work-optimality we require that m/p = O((n·log m)/p).

3. The communication time is g·(n/(pb))·log_d m. To be communication-efficient we require that g·(n/(pb))·log_d m = O((n·log m)/p).

4. The I/O time is G·(m/(pBD)). To be I/O-efficient we require that G·(m/(pBD)) = O((n·log m)/p).

The overhead of the simulation with regard to work does not affect the asymptotic running time: EM-BSP multisearch is work-optimal for sufficiently small trees, i.e., m = O(n·log n). As in BSP multisearch, the communication time scales according to the number of EM-BSP processors, and is optimal for trees of size n^{O(1)}. The number of supersteps increases slightly compared to BSP multisearch. However, the internal memory can be kept small relative to m + n for a suitably large number of disks, which allows very large problem instances. Moreover, the external storage used per disk is optimal. The constraints on the block size B and the number of disks D are functions of n, which pose no practical limitations for suitably large problem instances.

EM-BSP Multisearch on a Large Tree

For trees of size ω(n·log n), the I/O time of the above algorithms becomes the dominating term of their running time complexity; I/O-efficiency is not possible for large ratios m/n.

For large trees, the binary tree T is partitioned into two parts, which are referred to as the upper part and the lower part, respectively. The upper part contains the nodes of all levels L_i for 0 ≤ i ≤ log n, and the lower part contains the nodes of levels L_i for log n < i ≤ log m. The root is on level L_0 and at height log m, while the leaves are on level L_{log m} and at height 0. Thus the upper part contains O(n) nodes, and the lower part contains the remaining O(m) nodes, in subtrees of size O(m/n) each.
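The node counts are easy to verify: the levels down to the cut hold 2^(log n + 1) − 1 ≈ 2n nodes, and everything below is the lower part. A quick check (with illustrative power-of-two values of our own choosing):

```python
import math

def partition(m, n):
    """Split a complete binary tree of m nodes at level log2(n).

    Returns (upper, lower): node counts of the two parts. Illustrative
    sketch; m + 1 and n are assumed to be powers of two, with n <= m.
    """
    cut = int(math.log2(n))       # last level kept in the upper part
    upper = 2 ** (cut + 1) - 1    # a complete tree of log2(n) + 1 levels
    return upper, m - upper

up, low = partition(m=2**20 - 1, n=2**10)
assert up == 2 * 2**10 - 1        # O(n) nodes in the upper part
assert low == (2**20 - 1) - up    # the remaining O(m) nodes lie below the cut
```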

In a preprocessing phase, a search tree T' with node degree d is constructed from T and is distributed among the processors using the layered block mapping. Within a processor, each supernode is broken into subtrees of size B, which are distributed among the disk drives. The tree is much too large to be stored completely in the internal memory. We assume that each processor has enough internal memory to keep its share of the queries completely in its internal memory; however, with straightforward changes, larger sets of queries can be accommodated in an I/O-optimal manner. Theorem gives the performance of the search phase of EM-BSP multisearch on a large tree (m > n).

Theorem: Let ε > 0 and c be constants. Given n queries and a d-ary tree T of size m, distributed among p processors by a preprocessing step. Let n ≥ D·b·p·log n and B ≤ d. Multisearch for n queries against T can be done on an EM-BSP with p processors and D disks in time

    o((n·log m)/p) + O((n·log(m/n))/(p·log B) + g·(n·log m)/(p·b·log d))
    + O(G·((n·log m)/(p·B·D) + (n·log(m/n))/(p·D·log B)) + L·(log p)/(log(n/(pb))·log d))

using O(n/p) internal memory per processor and O(m/(pD)) external memory per disk, with probability 1 − n^{−c} for constant c.

Proof (Sketch): EM-BSP multisearch is a variant of BSP multisearch, and inherits its properties regarding communication time, computation time and number of supersteps. External memory space consumption is optimal, as it requires O(m/(pD)) space on each of the pD disks. Unfortunately, the I/O costs dominate the overall running time for large trees, so EM-BSP multisearch is not I/O-efficient for large trees. However, the cited work shows a lower bound proving that the I/O time of EM-BSP multisearch is asymptotically optimal; in other words, this lower bound shows that I/O-efficiency is not feasible for large trees. In the light of this lower bound, EM-BSP multisearch is work-optimal, communication-efficient and I/O-optimal for suitable parameter constellations.

Summary

We have derived new parallel EM-BSP multisearch algorithms for static balanced binary trees. Our multisearch algorithms involve a single preprocessing phase, after which multiple search phases can be performed.

For search trees up to size O(n·log n), where n is the number of queries, we obtained EM-BSP and EM-CGM multisearch algorithms which are work-optimal, I/O-efficient and, in the case of multiple processors, communication-efficient. These algorithms are obtained via the simulation techniques described in the earlier chapters.

For larger trees, we obtained an EM-BSP algorithm which is simultaneously work-optimal and I/O-optimal. When multiple processors are present, the algorithm is also communication-efficient.

Batch filtering, described in Section, was introduced as a sequential EM technique for executing a collection of simultaneous queries in search structures modeled as planar layered directed acyclic graphs (DAGs). Batch filtering is therefore an EM multisearch technique. In this chapter we restricted our attention to multisearch in balanced binary trees. While batch filtering is I/O-optimal for sequential versions of these problems when m = O(n·log n), where n is the number of queries and m is the size of the data structure to be queried, we consider data structures larger than O(n·log n), and also parallel processing models of computation. Our EM multisearch techniques can therefore be viewed as extending the EM paradigm of batch filtering for balanced trees from the m = O(n) case to include also m = ω(n), and from the RAM model to BSP-like parallel computation models. We will refer to the extended technique as multifiltering, while recognizing that it may itself invite extensions to other data structures, such as hierarchical DAGs.

We also considered larger data structures than can be accommodated optimally by the earlier techniques. The work of the algorithm remains an important aspect of our complexity measure for external memory algorithms, but achieving c-optimality in I/O may not be possible due to the blocking factor of the disks. Our algorithms for EM-BSP multisearch are optimal in the sense defined earlier; depending on the size of the data structure, they are also either I/O-efficient or I/O-optimal.

We note that the simulation techniques introduced in the earlier chapters can be used to create efficient parallel EM algorithms from a large class of BSP algorithms. In the current chapter we used these results in the application of multisearch, but we also considered larger data structures than can be accommodated optimally by simulation.

The results obtained by the BSP simulation are the first parallel, work-optimal and I/O-efficient external memory algorithms for multisearch. The EM-BSP multisearch algorithm for large trees is the first multisearch algorithm for a realistic parallel external memory model that is efficient for large trees in terms of work, I/O and communication.


Other New External Memory Algorithms

The figures below list a number of other important problems arising in computational geometry, GIS and graph algorithms for which we report EM-CGM algorithms created by our technique, together with their I/O complexity and that of the previously best known algorithm for the problem. In the first of these figures we report the same I/O complexities as previously known, after accounting for our parameter restrictions; the I/O complexities we report in the second are not as competitive with previous results. In each case, however, our algorithm is scalable not only in terms of the number of disks per processor, but also in terms of the number of processors used. Previous algorithms were often not efficient in a multiprocessor environment, and in many cases it is not clear how they could be adapted to parallel disks.

The problems listed in Table are stated briefly below. Please consult the literature for applications and more complete definitions of these problems, and for related previous results on RAM and PRAM computation models.

- Given a simple polygon S, the triangulation problem is to partition the interior of S into a set of triangles by joining vertices of S with non-overlapping straight line segments.

- Let S be a set of line segments in the plane. The trapezoidation problem is to decompose the plane into a set of trapezoids, based on the arrangement of the line segments.

- A segment tree is a data structure used to organize a set of line segments and to support various kinds of queries against the set.

- Let S be a set of n non-intersecting line segments in the plane, and let Q be a set of m query points. The next element search problem is to find, for each query point q_i in Q, the line segment directly above q_i. The endpoint dominance problem is a special case of the next element search problem, in which the query set is composed of the endpoints of the line segments themselves.

- Let G be an embedding of a graph in the plane. The batched planar point location problem is to find, for each of a set of N query points q_i, 1 ≤ i ≤ N, the face of G which contains q_i.

- Given a set S of points in 2-dimensional (3-dimensional) space, the 2D (3D) convex hull problem is to find the convex polygon (polytope) which contains all points in S and whose vertices are points of S.

- Given a set S = {s_1, ..., s_N} of N points in two-dimensional space, the 2D Voronoi diagram problem is to find a partition of the plane into N regions R_1, ..., R_N such that, for each i, 1 ≤ i ≤ N, region R_i contains a single point s_i of S, and all the other points in R_i are closer to s_i than to any other point of S.

- Given a set S of N points in two-dimensional space, the Delaunay triangulation problem is to find a triangulation of the points in S such that, for each such triangle, the circle through its vertices does not contain any other point of S.

- Given a set S of n non-intersecting line segments in the plane, the lower envelope problem consists of computing the set of segment portions visible from below, i.e. from the point (0, −∞). The generalized lower envelope problem is similar, except that segments in S may intersect.

- Given a set R of isothetic rectangles, the measure problem is to compute the area covered by the union of R.

- Consider a set S of n points in 3-space. For a point v, let x(v), y(v) and z(v) denote the x-coordinate, y-coordinate and z-coordinate, respectively, of v. Point v dominates a point w iff x(v) ≥ x(w), y(v) ≥ y(w) and z(v) ≥ z(w). A point is maximal in S if it is not dominated by any other point of S. The 3D-maxima problem consists of determining the set of all maximal points in S.

- Given a set S of n points in the Euclidean plane, the all nearest neighbors problem is to determine, for each point v in S, its nearest neighbor NN_S(v) in S, where NN_S(v) is a point w in S \ {v} such that dist(v, w) ≤ dist(v, u) for all u in S \ {v}.

- Let S be a set of n points in the plane, with some weight w(v) assigned to each v in S. The 2D-weighted dominance counting problem consists of determining, for each v in S, the total weight of all points which are dominated by v.

- Let S be a set of r pairwise disjoint m-vertex polygons. The unidirectional separability problem consists of determining all directions d such that S is separable by a sequence of r translations in direction d, one for each polygon. The multidirectional separability problem consists of determining whether S is separable by a sequence of r translations in different directions.

The problems listed in Table are stated briefly below. Please consult the literature for applications and more complete definitions of these problems, and for related previous results on the RAM and PRAM computation models.

- Given a linked list of N objects, the list ranking problem is to compute, for each object, its distance from the beginning of the list.
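List ranking is the classic warm-up for this family of problems: with pointer jumping, each node's known distance doubles its reach every round, so O(log N) rounds suffice. The minimal in-memory sketch below is our own illustration (the parallel and external memory versions in the literature are more involved); it computes each node's distance to the tail, from which the rank from the head follows by subtraction.

```python
def list_rank(succ):
    """Rank each node of a linked list given as a successor array.

    succ[i] is the node after i, or None at the tail; returns dist[i],
    the number of links from i to the tail, by pointer jumping.
    """
    n = len(succ)
    dist = [0 if succ[i] is None else 1 for i in range(n)]
    nxt = list(succ)
    changed = True
    while changed:                      # O(log n) jumping rounds
        changed = False
        new_d, new_n = list(dist), list(nxt)
        for i in range(n):
            if nxt[i] is not None:      # jump over the current successor
                new_d[i] = dist[i] + dist[nxt[i]]
                new_n[i] = nxt[nxt[i]]
                changed = changed or new_n[i] is not None
        dist, nxt = new_d, new_n
    return dist

# List 3 -> 0 -> 2 -> 1 (tail): distances to the tail.
print(list_rank([2, None, 1, 0]))  # [2, 0, 1, 3]
```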

- Given a connected graph G = (V, E), the Euler tour problem is to find a cycle that traverses each edge of G exactly once.

- Given a rooted tree G = (V, E) and a pair of vertices u, v of G, the lowest common ancestor problem is to find the vertex w of G that is an ancestor of both u and v and is farthest from the root.

- The tree contraction technique is a method for constructing parallel algorithms on trees, working from the bottom up.

- The expression tree evaluation problem is to evaluate the result of an arithmetic expression represented as an expression tree.

- Given an undirected graph G = (V, E), a connected subset of vertices is a subset of vertices in which there is a path in G between each pair of vertices. The connected components problem is to find the maximal connected subsets of vertices in G.

- Given a planar graph G = (V, E), the spanning forest problem is to form a spanning tree for each connected component of G.

- Given a connected undirected graph G = (V, E), the ear decomposition problem is to find an ordered partition of E into r simple paths P_1, ..., P_r such that P_1 is a cycle and, for each i, 2 ≤ i ≤ r, P_i is a simple path whose endpoints belong to P_1 ∪ ... ∪ P_{i−1}, but with none of its internal vertices belonging to P_j for j < i. The open ear decomposition problem is similar, but none of the P_i for i ≥ 2 is a cycle.

- Given a connected undirected graph G = (V, E), the biconnected components problem is to find the maximal connected subsets of vertices of G which remain connected when any single vertex is removed.

Figure: Overview of New EM Algorithms (continued)

Group B: GIS and Computational Geometry Algorithms. For each problem we list the PDM I/O complexity (notes a, b), the BSP model complexity (notes c, d), and the EM-BSP I/O complexity.

Polygon triangulation; trapezoidal decomposition; segment tree construction; next element search on line segments (note e):
  PDM I/O: O((N/B)·log_{M/B}(N/B)).
  BSP: O((N·log N)/v) computation, O(N/v) memory.
  EM-BSP I/O: O((N·log N)/(pBD)).

Batched planar point location:
  PDM I/O: O(((N+K)/B)·log_{M/B}(N/B)).
  BSP: O((N·log N)/v) computation, O(N/v) memory.
  EM-BSP I/O: O((N·log N)/(pBD)).

2D and 3D convex hull; 2D Voronoi diagram; Delaunay triangulation (note f):
  PDM I/O: O((N/B)·log_{M/B}(N/B)).
  BSP: O((N·log N)/v) computation, O(N/v) memory.
  EM-BSP I/O: O((N·log N)/(pBD)).

Lower envelope of non-intersecting line segments:
  BSP: O((N·log N)/v) computation, O(N/v) memory.
  EM-BSP I/O: O((N·log N)/(pBD)).

Generalized lower envelope:
  As for the lower envelope, with an additional α(N) (inverse Ackermann) factor.

Area of the union of rectangles; 3D-maxima; 2D-nearest neighbors:
  PDM I/O: O((N/B)·log_{M/B}(N/B)).
  BSP: O((N·log N)/v) computation, O(N/v) memory.
  EM-BSP I/O: O((N·log N)/(pBD)).

2D-weighted dominance counting; uni-/multidirectional separability:
  BSP: O((N·log N)/v) computation, O(N/v) memory.
  EM-BSP I/O: O((N·log N)/(pBD)).

Notes:
a. PDM I/O complexities as listed apply to general values for N, M, D and B. If the constraints required by our techniques are applied to these results, the term log_{M/B}(N/B) becomes a constant.
b. When the parameter D is quoted in the PDM I/O Complexity column, it refers to the number of disks overall, whereas the parameter D used in the EM-BSP I/O Complexity column refers to the number of disks on each of the p processors of the machine.
c. The BSP running time is λ + g·μ + β·L, where λ is the parallel computation time, μ is the amount of data communicated, and β is the number of BSP rounds; M is the peak local memory used by a processor of the BSP machine.
d. These results are subject to the conditions N ≥ v^(1+ε)·B and v ≥ D if the techniques of Chapter are used, and N ≥ v^(1+ε)·BD, p ≤ v, if the methods of Chapter are used. Each of the p real processors has O(k·N/p) memory, for integer k, 1 ≤ k ≤ v/p.
e. The PDM I/O complexities for this group are documented for the single processor, single disk case. We are not aware of multidisk or multiprocessor extensions.
f. The PDM I/O complexities for these problems apply to a RAM or PRAM.

Figure: Overview of New EM Algorithms (continued)

Group C: Graph Algorithms. For each problem we list the PDM I/O complexity (notes a, b), the BSP model complexity (notes c, d), and the EM-BSP I/O complexity.

List ranking; Euler tour of a tree; lowest common ancestor; tree contraction; expression tree evaluation:
  PDM I/O: O((N/B)·log_{M/B}(N/B)).
  BSP: O((N/v)·log v) computation, O(log v) rounds, O(N/v) memory.
  EM-BSP I/O: O((N·log v)/(pBD)).

Connected components; spanning forest; ear and open ear decomposition; biconnected components (note e):
  PDM I/O: O(max{1, log log(V·B·D/E)}·(E/(DB))·log_{M/B}(V/B)).
  BSP: O(((V+E)/v)·log v) computation, O(log v) rounds, O((V+E)/v) memory.
  EM-BSP I/O: O(((V+E)·log v)/(pBD)).

Notes: a through d are as in the previous figure.
e. For a graph of V vertices and E edges.

Chapter

Experiments

The growing theory of external memory algorithms is less interesting to practitioners than a theoretician might expect. The main reasons lie in the difficulties which surround implementation of these algorithms on real machines. Often it is extremely time consuming to verify whether a theoretically attractive algorithm is also capable of being attractive in practice. A new algorithm may apparently perform poorly for many reasons, which may include:

- resource contention between the program and other tasks running concurrently;
- incorrect implementation of the algorithm;
- large constants in the algorithm's time complexity;
- inappropriate choice of algorithm parameters;
- conflicts between the assumptions of the model under which the algorithm was conceived and the realities of the machine on which it is implemented.

Some of these situations can themselves have a number of cases. For instance, the first item above may be caused by other application programs (which may be avoidable) or by tasks spawned by the operating system. The resources in contention can be internal memory, I/O buffers and devices, or a processing unit. Resolution, or even identification, of such a conflict may require insight into the internal workings of the particular operating system and the particular hardware devices used.

Recently there has been an increasing interest in implementation and experimental research work targeted to I/O efficient computation. Research work in this area includes:

- The TPIE (Transparent Parallel I/O Environment) project of Vengroff and Vitter, which aims to collect implementations of existing algorithms within a common framework, and to make development and validation of new implementations easier.
- Experiments by Chiang with four algorithms for the orthogonal segment intersection problem.
- Cormen et al. have reported on a number of implementation issues and results relating to I/O efficient algorithms, including FFT computations using parallel processors, and FFT permutations and sorting using the Parallel Disk Model.
- LEDA-SM, a project which aims to make the LEDA library of computational geometry algorithms and data structures compatible with external memory applications.

As part of the research for this dissertation, two implementation studies were undertaken. The first study was, to the author's knowledge, the first full implementation of the buffer tree external memory data structure due to Arge. This study is described in Section , and is also presented in .

The second implementation project, discussed in Section , involved implementation of the deterministic simulation described in Chapters  and . The implementation design and the experiments described are joint work with Kumanan Yogaratnam, and are also mentioned in .

Buffer Tree Case Study

Introduction

Motivated by the goal of constructing I/O efficient versions of commonly used internal memory data structures, Arge proposed the data structuring paradigm, and in particular the buffer tree. A buffer tree is an external memory search tree. It supports operations such as insert, delete, search, and deletemin, and it enables the transformation of a class of internal-memory algorithms to external memory algorithms by exchanging the data structures used. A large number of external memory algorithms have been proposed using the buffer tree data structure, including sorting, priority queues, range trees, segment trees, and time forward processing. These in turn are subroutines for many external memory graph algorithms, such as expression tree evaluation, centroid decomposition, least common ancestor, minimum spanning trees, and ear decomposition. There are a number of major advantages to the buffer tree approach. It applies to a large class of problems whose solutions use search trees as the underlying data structure. This enables the use of many normal internal memory algorithms, and hides the I/O specific parts of the technique in the data structures. Several techniques based on the buffer tree (e.g., time forward processing) are simpler than competitive EM techniques, and are of the same I/O complexity or better with respect to their counterparts.

In this section we present an implementation of the buffer tree, and show the flexibility and generality of the structure by implementing EM sorting and an EM priority queue. To test the efficiency of our implementation we use sorting as an example. For data sets larger than the available main memory, our implementation of buffer tree sort outperforms internal memory sort (e.g., qsort) by a large and increasing margin. We use an EM merge sort algorithm from TPIE to provide comparative performance results for the larger problem sizes. We observe that certain parameters suggested in  may not provide the best results in practice. By tuning these parameters we obtained improved results while maintaining the same asymptotic worst case I/O complexity. The buffer tree is a conceptually simple data structure, and it turned out that implementation of these applications based on the buffer tree was straightforward. Therefore we can support the claim that the buffer tree is a generic EM data structure that performs well in theory and practice.

Description of the Buffer Tree

We first describe the buffer tree data structure of Arge, with two update operations, namely insert and delete. Subsequently we will discuss how we can perform sorting and maintain a priority queue using a buffer tree.

Let N be the total number of update operations, M be the size of the internal memory, and B be the block size, and set m = M/B and n = N/B. The buffer tree is an (a,b)-tree, with a = m/4 and b = m, augmented with a buffer in each node of size m/2 blocks. Each node, with the exception of the root, has a fan-out (number of children) between m/4 and m. Each node also contains partitioning elements, or splitters, which delimit the range of keys that will be routed to each child. The number of splitters is one less than the fan-out of the node. The height of the buffer tree is O(log_m n); see Figure . Since the buffer tree is an extension of the (a,b)-tree, the computational complexity analyses of the various (a,b)-tree operations still apply. The buffers are used to defer operations, to allow their execution in a lazy manner, thus achieving the necessary blocking for performing operations efficiently in external memory. A buffer is full if it has more than m/2 blocks.

For any update operation, a request element is created, consisting of the record to be inserted or deleted, a flag denoting the type of the operation, and an automatically generated time stamp. Such request elements are collected in the internal memory until a block of B requests has been formed. The request elements, as a block, are inserted into the buffer of the root using one I/O. If the buffer of the root contains

[Figure : The buffer tree. The root and internal nodes each carry a buffer; the tree has height O(log_m n), with leaves of one disk block each.]

less than m/2 blocks, there is nothing else to be done in this step. Otherwise the buffer is emptied by a buffer-emptying process.
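The request-batching step just described can be sketched as follows. This is a minimal, in-memory illustration of the rule (collect a block of B request elements, then spend one I/O pushing them into the root buffer); the names and types are illustrative, not those of the thesis implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A request element as described above: the record (key plus data), a flag
// for the operation type, and an automatically generated time stamp.
enum class Op : std::uint32_t { Insert, Delete };

struct Request {
    std::int32_t  key;
    std::int32_t  data;
    Op            op;
    std::uint32_t timestamp;  // assigned in arrival order
};

// Requests are collected in internal memory until a block of B of them has
// formed; only then is one (simulated) I/O spent pushing the block into the
// buffer of the root.
class RootBufferWriter {
public:
    explicit RootBufferWriter(std::size_t B) : B_(B) {}

    // Returns true when this request completed a block and triggered a push.
    bool submit(std::int32_t key, std::int32_t data, Op op) {
        pending_.push_back(Request{key, data, op, nextStamp_++});
        if (pending_.size() == B_) {
            rootBuffer_.push_back(pending_);  // stands in for one block write
            pending_.clear();
            return true;
        }
        return false;
    }

    std::size_t blocksPushed() const { return rootBuffer_.size(); }

private:
    std::size_t B_;
    std::uint32_t nextStamp_ = 0;
    std::vector<Request> pending_;
    std::vector<std::vector<Request>> rootBuffer_;
};
```

Only one I/O per B requests reaches the root buffer; all finer-grained work is deferred.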

The buffer-emptying process at an internal node requires O(m) I/Os, since we load the buffer's blocks into the internal memory and distribute the elements among the (up to m) children of that node. A buffer-emptying process at a leaf may require rebalancing the underlying (a,b)-tree. An (a,b)-tree is rebalanced by performing a series of splits, in the case of an insertion, or a series of fuse and share operations, in the case of a delete. Before performing a rebalance operation, we ensure that the buffers for the corresponding nodes are empty. This is achieved by first doing the buffer-emptying process at the node involved. The deletion of a block may involve the initiation of several buffer-emptying processes. By using dummy blocks during the deletion process, a buffer-emptying process can be protected from interference by other processes. See  for details.

The analysis (i.e., the I/O complexity) of operations on the buffer tree is obtained by adapting the amortization arguments for (a,b)-trees. Each update element, on insertion into the root buffer, is given O((log_m n)/B) credits. Each block in the buffer of node v holds O(the height of the tree rooted at v) credits. For an internal node, its buffer is emptied only if it gets full, and moreover this requires O(m) I/Os. Therefore, ignoring the cost of rebalancing, the total cost of all buffer emptying on internal nodes is bounded by O(n log_m n) I/Os. The total number of rebalance operations required in an (a,b)-tree where b ≥ 2a, over K update operations on an initially empty (a,b)-tree, is bounded by O(K/(b/2 − a)). Therefore, for N update operations on an (m/4, m)-tree, the total number of rebalance operations is bounded by O(n/m). Moreover, each rebalance operation may require a buffer-emptying process, as well as updating the partitioning elements, and therefore may require up to O(m) I/Os. Thus the total cost of rebalancing is O(n) I/Os. The cost of emptying leaf nodes is bounded by the sorting operation. We summarize:


Theorem  (Arge) The total cost of an arbitrary sequence of N intermixed insert and delete operations on an initially empty buffer tree is O(n log_m n) I/O operations.
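In symbols, the accounting above combines as follows (a sketch using the bounds as reconstructed in this section, with n = N/B and m = M/B):

```latex
\underbrace{O\!\left(n \log_m n\right)}_{\text{buffer emptyings at internal nodes}}
\;+\;
\underbrace{O\!\left(\tfrac{n}{m}\right) \cdot O(m)}_{\text{rebalance operations } \times \text{ cost of each}}
\;=\; O\!\left(n \log_m n\right) \text{ I/Os.}
```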

Sorting. A buffer tree can be used to sort N items as follows. First insert the N items into the buffer tree, followed by an empty/write operation. This is accomplished by performing a buffer-emptying process on every node, starting at the root, followed by reporting the elements in all the leaves in the sorted order. This can be done within the complexity of computing the buffer tree data structure.

Corollary  (Arge) N elements can be sorted in O(n log_m n) I/O operations using the buffer tree.
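The sorting recipe of the corollary, insert everything and then force-empty, can be expressed against a minimal stand-in interface. The ToyBufferTree below merely illustrates the control flow (lazy inserts, deferred work at force-empty, then a left-to-right leaf scan); it is an in-memory toy, not the thesis implementation.

```cpp
#include <algorithm>
#include <vector>

// A toy in-memory stand-in for the buffer tree interface: insertions are
// merely buffered, "force empty" performs the deferred work (here, a sort),
// and the leaves can then be scanned left to right.
class ToyBufferTree {
public:
    void insert(int key) { pending_.push_back(key); }  // lazy
    void forceEmptyAllBuffers() {                      // deferred work
        std::sort(pending_.begin(), pending_.end());
    }
    const std::vector<int>& scanLeavesLeftToRight() const { return pending_; }
private:
    std::vector<int> pending_;
};

// The sorting recipe: insert all items, force-empty, scan the leaves.
template <typename Tree>
std::vector<int> bufferTreeSort(Tree& t, const std::vector<int>& in) {
    for (int x : in) t.insert(x);
    t.forceEmptyAllBuffers();
    return t.scanLeavesLeftToRight();
}
```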

The PDM compares competing EM algorithms according to the asymptotic number of I/O operations they require to solve a given problem of size N. By this model, the buffer tree sorting algorithm is optimal, as the number of I/O operations matches the lower bound Θ(n log_m n) for the sorting problem. In practice, however, other factors can also affect the running time. Cormen and Hirschl observe that many PDM applications are not I/O bound, which suggests that CPU time is an important factor to be considered. The EM-BSP, EM-BSP* and EM-CGM models incorporate both I/O and CPU time; see Chapter .

Priority Queues. A dynamic search tree can be used as a priority queue, since in general the leftmost leaf of the search tree contains the smallest element. We can use the buffer tree for maintaining a priority queue in external memory by permitting the update operation described previously for insertion into the priority queue, and adding a deletemin operation. It is not necessarily true that the smallest element is in the leftmost leaf in the buffer tree, as it could be in the buffer of any node on the path from the root to the leftmost leaf. In order to extract the minimum element, i.e., execute the deletemin operation, a buffer-emptying process must first be performed on all nodes on the path from the root to the leftmost leaf. After the buffer emptying, the leftmost leaf and the children of the leftmost node in the buffer tree consist of at least the (m/4)·B smallest elements. These elements can be kept in the internal memory, and at least (m/4)·B deletemins can be answered without doing any additional I/O. In order to obtain correct results for future deletemins, any new insertion/deletion must be checked first against these elements in the internal memory. This realization of a priority queue does not support the changing of priorities of elements already in the queue.
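A sketch of the caching rule just described: after a simulated path emptying, the smallest elements sit in an internal-memory heap; new requests at or below the cache ceiling go directly to that heap, and the remainder go to the (here simulated) buffer tree. The class name, the refill size, and the use of a multiset as a stand-in for the on-disk tree are illustrative assumptions, not the thesis code.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <set>
#include <vector>

class CachedMinPQ {
public:
    explicit CachedMinPQ(std::size_t refill) : refill_(refill) {}

    void insert(int key) {
        // New requests are first checked against the cached elements.
        if (!cache_.empty() && key <= ceiling_) cache_.push(key);
        else tree_.insert(key);   // stand-in for insertion into the buffer tree
    }

    int deleteMin() {             // precondition: !empty()
        if (cache_.empty()) refillCache();  // simulates the path emptying
        int k = cache_.top();
        cache_.pop();
        return k;
    }

    bool empty() const { return cache_.empty() && tree_.empty(); }

private:
    void refillCache() {
        for (std::size_t i = 0; i < refill_ && !tree_.empty(); ++i) {
            auto it = tree_.begin();  // smallest remaining key
            ceiling_ = *it;           // last one moved is the largest cached
            cache_.push(*it);
            tree_.erase(it);
        }
    }

    std::size_t refill_;
    int ceiling_ = 0;
    std::priority_queue<int, std::vector<int>, std::greater<int>> cache_;
    std::multiset<int> tree_;
};
```

The invariant is that every cached key is at most the ceiling and every key in the tree is at least the ceiling, so the cache top is always the global minimum.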

For a single disk, n is the number of disk blocks in the problem; m is the number of disk blocks that fit into the memory of size M.

Theorem  (Arge) The total cost of an arbitrary sequence of N insert, delete, and deletemin operations on an initially empty buffer tree is O(n log_m n) I/O operations.

Implementation of the Buffer Tree

The (a,b)-Tree. The buffer tree is an (a,b)-tree with buffers added to each tree node. One source of code for an (a,b)-tree is LEDA. It quickly became clear, however, that this code was designed specifically for internal memory usage, as nodes were linked by many pointers to support a wide range of higher level operations. Converting the various pointers of the (a,b)-tree implementation to external memory representations turned out to be time consuming and ineffective for a data structure that was required to be I/O efficient. Too many I/O operations were required to update a single field in a child or parent of the node in memory to give attractive EM performance.

The Buffers. Each node of the buffer tree has an associated buffer, which may contain between 0 and m/2 blocks of data which have not yet been inserted into the (a,b)-tree part of the buffer tree. The fan-out of an internal node is at most m. Therefore, for a buffer tree consisting of several levels, there may exist a large number of buffers, each up to (m/2)·B bytes in size.

Our implementation currently models each buffer as a Unix file. Due to restrictions on the number of Unix files that could be open at a time, each buffer file is closed after use and reopened when necessary. The time required by file open and close operations, as measured in our tests, was small. However, preliminary experiments suggest that this scheme may limit the effectiveness of asynchronous I/O, since the file close operation must wait for any outstanding I/O to complete. In this thesis we report primarily on our experiences using synchronous I/O.
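The one-file-per-buffer scheme can be sketched as below: each block push opens the node's file, appends one block, and closes it again, so that only one buffer file need be open at a time. The file naming and layout here are illustrative assumptions, not the thesis code.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Each node buffer is modelled as a file that is opened, appended to, and
// closed again on every block push (kept closed because of the limit on the
// number of simultaneously open files).
class FileBuffer {
public:
    explicit FileBuffer(std::string path) : path_(std::move(path)) {
        std::FILE* f = std::fopen(path_.c_str(), "wb");  // start empty
        if (f) std::fclose(f);
    }

    // One block push: open, append one block, close (synchronous I/O).
    bool pushBlock(const std::vector<char>& block) {
        std::FILE* f = std::fopen(path_.c_str(), "ab");
        if (!f) return false;
        std::size_t n = std::fwrite(block.data(), 1, block.size(), f);
        std::fclose(f);
        return n == block.size();
    }

    long sizeBytes() const {
        std::FILE* f = std::fopen(path_.c_str(), "rb");
        if (!f) return -1;
        std::fseek(f, 0, SEEK_END);
        long n = std::ftell(f);
        std::fclose(f);
        return n;
    }

private:
    std::string path_;
};
```

Because the close must wait for outstanding writes, this pattern serializes I/O, which is consistent with the limitation on asynchronous I/O noted above.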

Compatibility, Usability and Accessibility. We wanted our implementation to be compatible with the use of LEDA and with TPIE. The large number of algorithms and data structures available in LEDA forms an attractive context for implementation of internal memory algorithms, which are often components of a larger external memory system. For example, the external memory priority queue uses an internal memory priority queue as a component. The collection of efficient external memory techniques provided by TPIE forms an attractive workbench for building and testing new EM implementations. We chose C++ as our implementation language to preserve potential compatibility with these code libraries. In addition, we adopted the automated documentation tools from LEDA.

(Footnote: This may increase temporarily during a buffer emptying process.)

EM Sorting Using the Buffer Tree. A buffer tree can be made to sort a data set simply by inserting the data into the tree and then force-emptying the buffers. Our implementation allows the leaves to be read sequentially from left to right; thus the implementation of sorting was straightforward.

An EM Priority Queue Using the Buffer Tree. The buffer tree can be modified to construct an I/O-optimal priority queue with insert and deletemin operations in EM. Our implementation is obtained as follows:

- The leftmost leaf node of the buffer tree, together with its associated m/4 to m leaf blocks, is kept in internal memory instead of on disk. The (a,b)-tree nodes on the path from the root to the leftmost leaf are also kept in memory. For typical values of M, N, B, a and b, these (a,b)-tree nodes make a negligible impact on internal memory consumption.
- The data records in memory are organized using an appropriate internal memory priority queue. In our initial implementation this is a conventional heap, originally obtained from LEDA.
- Any requests that would normally be inserted into the root buffer of the buffer tree are first compared to the leftmost splitters of the (a,b)-tree nodes on the path from the root to the leftmost leaf. As argued above, this gives a small constant number of comparisons in practice. If a request would be routed to the cached data blocks, it is inserted directly into the internal heap. Otherwise it is inserted into the buffer tree in the normal fashion.
- The balance of the underlying (a,b)-tree is maintained in the normal fashion. If the leftmost leaf node underflows, sharing or fusing with siblings will occur. If it overflows, splitting may occur, as it would in an (a,b)-tree.

Testing Platform and Parameters. We performed most of our development and experimental work on a network of sixteen  MHz Pentiums, each with  MB of internal memory and a pair of  GB hard disks used exclusively for data storage. A central  GB hard disk is available via NFS for program storage. The processors each run the Linux operating system. Although we did not use it, except for NFS access to the central program storage, the processors are interconnected by a fast ethernet switch. We found that our experimental timings were reproducible between processors, and so we were able to run multiple timing tests independently, simultaneously, and reliably on this platform.

We chose 16 bytes as the record size for our tests. A record (sometimes we will call it an element) consisted of four integer (4-byte) fields: a key, an associated data field, a time stamp, and an operation type (insert/delete/query) field. We chose k_B = 4096 bytes, since this was the system page size. We chose k_M =  KB (m = ), which is small, but allowed us to choose manageable problem sizes from a disk space point of view, yet still apply stress to our algorithms.
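The record layout described above, four 4-byte integer fields, can be pinned down with a static assertion (the field names are illustrative):

```cpp
#include <cstdint>

// The 16-byte test record: four 4-byte integer fields.
struct Record {
    std::int32_t key;        // sort key
    std::int32_t data;       // associated data field
    std::int32_t timestamp;  // automatically generated
    std::int32_t opType;     // insert / delete / query
};

static_assert(sizeof(Record) == 16, "four 4-byte fields pack to 16 bytes");
```

With k_B = 4096 bytes, each disk block then holds 256 such records.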

Evaluating the Implementation

In order to obtain meaningful performance results, we attempted to control the following factors.

Contention with other processes or users of the machine, for machine resources such as memory, CPU, and disks. We performed the testing on dedicated machines, and so the results were not affected by other users. The Unix operating system spontaneously initiates various system tasks to perform routine maintenance and monitoring functions; thus it is not easy to avoid contending with these tasks for system resources. Smaller test runs may vary significantly due to these effects. However, on the larger test runs the influence of system tasks on the run time can generally be ignored.

Virtual memory effects, such as the transparent behaviour of the operating system in swapping parts of the program image between main memory and secondary storage. The swapping of portions of a task to disk, to make room for another activity, can occur without warning or notification. In our tests we attempted to minimize the likelihood of this occurring by choosing k_M, the problem size in bytes, to be much smaller than the physical memory size. For instance, on our Linux machines with  MB of physical memory, we performed the majority of our tests using k_M =  KB. Another tactic is to use the Linux mlockall service call to lock the application into memory. This seems to work reasonably well in some cases, and does allow the application to use a large fraction of the physical memory without fear of being affected by virtual memory effects. However, the requesting program must be running with root privileges for the request to be honoured.
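The mlockall tactic mentioned above looks like the following on Linux (a sketch; the call typically fails with EPERM or ENOMEM when the process lacks root privileges or exceeds its memory-lock limit):

```cpp
#include <cerrno>
#include <sys/mman.h>

// Pin all current and future pages of this process into physical memory, so
// that timing runs are not disturbed by swapping.  Returns 0 on success, or
// the errno value on failure.
int lockAllMemory() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) return errno;
    return 0;
}
```

A program using this would check the return value and fall back to the small-k_M tactic when the lock is refused.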

Comparison to Quicksort. We found that the buffer tree easily outperformed the built-in internal memory quicksort technique. A simple quicksort program was written using the built-in C function qsort. Figure  shows the results for a range of problem sizes. For larger input sets the recursive quicksort program ran out of stack space on our system, but by that time the internal sort was slowing due to

[Figure : Timings for Internal Quicksort and Buffer Tree Sort on Random Inputs; time in seconds vs. number of items (millions). For problem sizes larger than  million items the internal quicksort failed due to lack of internal memory.]

virtual memory effects, and the buffer tree was already outperforming it.

Tuning the Buffer Tree. We discovered that the value of b relative to m is important to the performance of buffer tree sort on random data. For buffers of size m/2, and a = m/4, b = m as suggested in , we obtain partial block sizes of approximately B/2 keys pushed to the next level, for each of perhaps m children of a node, whenever the parent's buffer is emptied. Reducing the fan-out while maintaining the buffer size increases the expected number of elements in each block. We found that b = m/8 (with a = b/4) gave the best performance in our tests. Smaller or larger values of b resulted in longer run times. Figure  shows running time curves for Buffer Tree Sort (BTS) and for TPIE Merge Sort (TMS). Results for BTS are shown for several values of b, where a = b/4 in all cases. Both TMS and BTS are running with synchronous I/O and single buffering. The TPIE MMB stream option is used. The buffer tree is using the iostream access method. BTS performance is best for b = m/8, and gets worse if b differs much from this value. BTS with b = m/8 has running times that change nearly linearly with the problem size, for the problem sizes shown.

We caution that our experiments focussed on finding support for the predictions of asymptotic behaviour of Buffer Tree Sort. The actual running times of BTS may be

[Figure : Timings for Buffer Tree Sort (iostream access method, with b = m, m/7, m/8, m/9) and TPIE Merge Sort; time in seconds vs. millions of data elements. TMS ran out of disk space at about  million elements, because it keeps its original data file; BTS does not require that all of the input data be available before it begins, and does not require a separate file for the original data.]

improved by further tuning, and the performance of TPIE Merge Sort may improve with other choices of parameters and options.

We found the increased speed with smaller b intriguing, and so we counted the number of block pushes performed by the algorithm. A block push occurs when data is pushed to a child buffer after the parent's buffer becomes full. It may consist of a partial block, a full block, or more than a block of request elements. Reducing the fan-out increases the expected size of the data in a block push, and therefore may reduce the number required, and the number of I/O operations as a result. Figure  shows the relationship between several fan-out values b and the number of block pushes, over a range of input sizes. The reduction in block pushes seems to be the major reason for the improvement in running time between b = m and b = m/8.
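The effect of the fan-out on push size can be seen with a little arithmetic. Assuming, as reconstructed above, that a buffer holding m/2 blocks (m·B/2 keys) is distributed among its node's children when emptied, the expected number of keys per push is:

```cpp
// Expected keys moved to one child per block push, when a buffer of m/2
// blocks (m*B/2 keys) is split among `fanout` children.  The m/2 occupancy
// figure is our reconstruction of the emptying threshold discussed above.
constexpr double keysPerPush(double m, double B, double fanout) {
    return (m * B / 2.0) / fanout;
}
```

With fan-out m this gives B/2 keys, i.e., half-empty block pushes; reducing the fan-out to m/8 raises the expected push to 4B keys (four full blocks), which is consistent with the drop in the number of block pushes observed.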

Nonlinearities in the Running Time. We observed that, contrary to predictions of the I/O model, for random input data our buffer tree sort implementation tended to have nonlinear run times as the problem size increased. Actually, we expect the running time to increase more than linearly, by a logarithmic factor. However, since the base of this logarithm is large, the predicted increase in running time is close to

[Figure : Number of Block Pushes for Buffer Tree Sort on Random Inputs, for fan-out values b = m, m/3, m/5, m/6, m/7, m/8, m/9; number of block pushes vs. millions of data elements.]

linear for the range of problem sizes considered.

Figure  shows a graph of problem size versus run time, for random input data and b = m. Also shown in this graph are curves for the various activities of the buffer tree, i.e., a breakdown of where this time is spent. The total running time appears to be increasing superlinearly with the problem size. Total Running Time is the sum of the running time for Insertions plus Force Empty All Buffers. Running time for Insertions is composed of the sum of Insertion: Empty Internal Buffers plus Insertion: Empty Leaf Buffers. Both of these seem to be more than linear with the problem size. However, referring to Figure , the number of block pushes is not increasing superlinearly.

Adjusting the parameter b in the buffer tree both reduced the number of I/O operations performed by BTS and apparently removed the nonlinear behaviour in our tests. Figure  shows the same graph as Figure , for b = m/8. In contrast to Figure , the Total Running Time curve is quite linear after about  million elements. The component curves in the figure are equally well behaved.

While the constant represented by the slope of the running time curve is larger for BTS than for TMS, we note that BTS is an online sorting technique, and therefore addresses a different situation than does TMS. See Figure .

Experiments with Parallel Disks. We experimented briefly with storing the buffer tree on multiple disks, by striping the buffers and leaves across two disks. We

[Figure : Breakdown of Buffer Tree activities for b = m: Total Running Time, Insertion, Force Empty Buffers, Insertion: Empty Internal Buffers, Insertion: Empty Leaf Buffers; time in seconds vs. millions of data elements.]

[Figure : Breakdown of Buffer Tree activities for b = m/8: Total Running Time, Insertion, Force Empty Buffers, Insertion: Empty Internal Buffers, Insertion: Empty Leaf Buffers; time in seconds vs. millions of data elements.]

[Figure : Pipelining sort with generation of inputs; Total Running Time, Insertion, Force Empty Buffers, and TPIE Merge Sort; time in seconds vs. millions of data elements. If the generation of the inputs is sufficiently time consuming, Buffer Tree Sort can provide a speed advantage over offline methods, by permitting the insertion time to be hidden by the time to generate its inputs. The time to Force Empty Buffers may then be the only time that remains visible.]

obtained a multiple disk driver (the PDM API) from Tom Cormen at Dartmouth College, and ported it from the DEC Alpha environment to Linux without much difficulty. To manage concurrent disk access, the PDM API requires a Posix threads implementation, which we obtained from Florida State University. Perhaps due to the large data cache maintained by Linux, we found that large data volumes were required before two parallel disks outperformed a single disk for a simple write-as-quick-as-you-can application. For buffer tree sort this would require a single block push to be very large. We tried increasing m to allow this, and did see marginally better performance with two disks for moderate values of n. Unfortunately, as n grew towards a more interesting size, we began to see our performance degrade, apparently due to virtual memory effects. We concluded that we needed more real memory for this sort of experiment.

Figure  shows running times for a single disk under the PDM API and C++ iostream access methods. The PDM API could be expected to be slightly slower, as it introduces some extra computation, such as its use of threads. This seems to be true in the case of b = m/8, but its ability to overlap computation with I/O (asynchronous

[Figure : Running Times of PDM and Iostream I/O Access Methods, for b = m and b = m/8; time in seconds vs. millions of data elements. The PDM uses asynchronous I/O, and this seems to give it an advantage for the larger fanouts.]

I/O) seems to allow it to outperform in the case b = m.

Conclusions and Lessons Learned

In this thesis we describe an implementation of a buffer tree, and two EM algorithms based on the buffer tree: an external memory tree-sort and an external memory priority queue.

While the buffer tree gives us an I/O-optimal sort, our timing studies of the implementation indicate that its performance is sensitive to some nonlinearities in the environment or algorithm. Experimental results show that these nonlinearities are reduced by an optimal choice of parameters.

Our tests on random input sets lead to an experimental determination of parameter values different from those originally suggested in the design of the data structure. Although the running times of our tree-sort implementation (BTS) with parameter b = m clearly show nonlinearities, b = m/8 produced a running time curve which is, for practical purposes, a straight line when the problem size is more than  million elements. The application was also heavily I/O bound. This supports the prediction of the algorithm and the model that the asymptotic running time is Θ(n log_m n) I/Os, and that the number of I/O operations is the dominant issue in the algorithm.

The nonlinear behaviour of BTS with parameter b = m was manifested to various degrees in some of the other fanouts which we tried. While we expected some effect on running time as this parameter was varied, the sensitivity to nonlinearity is troubling, and we do not rule out implementation decisions as a possible cause.

We conclude that (a) the buffer tree as a generic data structure appears to perform well in theory and practice, and (b) measuring I/O efficiency experimentally is an important topic that merits further attention.

We conclude by noting that this implementation work was intended primarily to test the validity of the asymptotic performance predictions of the buffer tree. Naturally, this was possible only to the extent that available disk space would allow. Nevertheless, we believe the results are useful, as we were able to perform experiments where the internal memory size was only a small fraction of the problem size. We believe that the timing results therefore confirm the asymptotic performance predictions for the buffer tree.

In terms of the absolute performance of our implementation, we believe a number of efficiency improvements could be made, including a custom implementation of buffers (using the file system for this likely introduces unnecessary overhead) and improved physical I/O (copying of buffers in the C++ I/O library).

Another area for further investigation arises from our desire to keep the effective internal memory size of the buffer tree small. In order to do this, we restricted the block size to only 4 kilobytes. While the work of Stevens suggests that such a block size achieves most of the efficiencies one expects by blocking I/O to disk (see Figure ), other researchers have suggested that larger block sizes are better. It has been suggested that the sensitivity to nonlinear increases in running time for a range of problem sizes is related to this small block size.

[Figure : Stevens' measurements on the effects of varying the block size; time in seconds vs. block size in bytes (0 to about 4500).]

Simulation Case Study

Introduction and Motivation

In this section we describe implementation issues arising from a case study of the deterministic simulation approach of Section .

The main objectives of this implementation project were as follows:

- explore the implementation issues raised in the simulation technique of Chapter , and determine if, in practice, there are any unforeseen difficulties;
- perform performance comparisons to validate the expectations (Lemma ) of linear I/O and linear running time as the problem size grows;
- investigate the issues involved in an automatic translation of CGM algorithms to EM-CGM algorithms.

In Section  we sketch a design of a runtime system that allows a parallel algorithm implemented in MPI (the Message Passing Interface standard) to be executed on a target machine with one or more processors, and one or more disks per processor. Utilizing the ideas and techniques of Chapter  (or, potentially, Chapters  or ), such a runtime library promises to allow large problems to be executed efficiently on a machine with limited internal memory, by ensuring that the disks are used effectively. Such a capability is interesting from at least two perspectives:

- from the perspective of obtaining good external memory algorithms, the practicality of the simulation techniques should be determined;
- from the perspective of parallel algorithms, the simulation idea promises efficient scalability, permitting larger problems to be executed efficiently on a given machine than would otherwise be the case. This idea is described more fully in .

In Sections  and  we describe some results of implementing an external memory version of a parallel algorithm for sorting.

We use the usual notation: N is the problem size, v is the number of virtual processors in the BSP algorithm, p is the number of real processors in the EM-BSP machine, B is the disk block size, b is the communication message size, each real processor has M = O(kN/v) internal memory (for an integer parameter k), and each real processor has D local disks.

An EM Runtime System for Parallel Algorithms

We rst give an overview of the design of a suitable runtime system to supp ort

execution of BSP algorithms on an EMBSP We describ e the following in terms of

the Linux op erating system although the issues and decisions taken should in most

cases apply equally well to other op erating systems We implement each virtual

as pro cessor as a separate pro cess or task and so we will refer to virtual pro cessors

pro cessestasks and viceversa In this particular context we use the terms virtual

processor and task interchangeably

In many cases, each virtual processor executes the same program code, or at least contains the program code for all roles that a virtual processor may be required to play. It is common practice for the program code to make decisions based on a unique label associated with every virtual processor. In our implementation, we therefore did not attempt to swap the program code of the virtual processors between internal memory and the disks.

One Real Processor

In the following we describe the operation of a single real processor executing the operations required by a number of virtual processors. At various points in the discussion, however, we mention issues that arise when multiple real processors are present. More discussion of the case of multiple real processors is presented in Section . The discussion is broken down into three parts: (1) implementation of virtual processors, (2) implementation of communication between virtual processors, and (3) implementation of multiple disks.


Implementation of virtual processors. We chose to represent the virtual processors of the CGM algorithm as threads, or lightweight tasks. A lightweight task is a task whose memory is shared with other tasks, and therefore its context does not have to be swapped individually by the operating system. We will use the terms thread and lightweight task interchangeably. In our implementation, the lightweight tasks representing virtual processors relinquish control of the CPU to the next virtual processor voluntarily, at predetermined points in their code. A virtual processor task might as well run to completion before any other virtual processor task begins execution. In fact, permitting more than k of them to compete for memory would violate a basic premise of our simulation technique, as it would lead the operating system to attempt to allocate more than O(kN/v) virtual memory, and the resulting swapping could cause fine-grained accesses to disk. Accordingly, we must manually swap the context of each virtual processor in and out of the local internal memory of a real processor.

The virtual processor tasks are represented by sequentially scheduled subtasks of a single preemptively scheduled background thread. By creating k such background threads, each permitting only a single virtual processor to be active at once, k virtual processors can be resident in the memory of a real processor concurrently, for 1 <= k <= v/p.

It is easy to determine when a virtual processor should surrender control of the CPU to the next one. At the end of each computation superstep, each virtual processor issues a communication request (part of the next communication superstep). We discuss the implementation (simulation) of these communication functions below. The yield function provides a means for passing control from one virtual processor to the next.

yield: This function is called by each virtual processor at the end of every computation superstep. In particular, yield is called at the beginning of every communication function. It acts as a barrier synchronization operation. It ensures that several actions occur:

1. The swappable context of the processor is written to the local disks, and then the related internal memory is freed.

2. The memory areas required for the context of the next virtual processor are allocated, and its context is read from the local disks.

3. The messages sent to the new processor in the previous superstep are read from the local disks.

4. Control of the CPU is passed to the new virtual processor.


Thus the round-robin scheduling of virtual processors within a background thread is accomplished via the yield function, which in turn uses the C library setjmp and longjmp functions. Using setjmp, a task can store its program counter and other system context in a specified spot. Using longjmp, the task can transfer control to a task context which was previously saved by setjmp. The stack pointer of a running task points to a stack frame stored on its system stack. A stack frame contains space for each local (dynamic) variable allocated implicitly by the current function of the running task. It also contains a return address to the function that called the current function of the running task. Other stack frames, for each of the functions in which the current function of the running task is nested, are also on the stack. This imposed some constraints on our implementation. The stack of the background thread cannot be swapped out of memory by the simulation, as it must be available to the operating system at all times.

We therefore must minimize the amount of data that is stored on this stack, not permitting dynamic (implicitly allocated) local variables in our virtual processor tasks, for instance. In our implementation, the stack contents cannot be part of the virtual processor context that is managed/swapped by the simulation technique. These considerations require us to maintain a minimal unswappable global context on the stack, plus a larger swappable local context for each virtual processor. The latter are managed according to the requirements of the simulation technique. The swappable context can consist only of that memory which is allocated explicitly by the user (BSP) code at execution time, i.e., via the C malloc call. In order to keep track of the size and components (memory areas) of a virtual processor, we require the code for the virtual processors to use special library calls (emmalloc/emfree) for memory allocation/deallocation requests. This permits us to determine the regions of memory to be swapped at the end of each superstep.
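The control transfer performed by yield can be illustrated with a small sketch. This is not the thesis implementation; it is a minimal, self-contained illustration of the pattern described above, in which each virtual processor runs one computation superstep and then longjmps back to a setjmp point in a round-robin scheduler. A full implementation would also swap contexts and messages to disk at each yield, and would need per-task stacks to resume a task in mid-function; the names run_schedule, em_yield and vp_superstep are invented for the example.

```c
#include <setjmp.h>

#define NUM_VP 3      /* virtual processors per background thread */
#define SUPERSTEPS 2  /* computation supersteps to simulate */

static jmp_buf scheduler_ctx;    /* saved context of the scheduler loop */
static int work_done[NUM_VP];    /* per-virtual-processor progress counter */

/* em_yield: end of a computation superstep -- return control to the
 * scheduler (this is where context swapping to disk would happen). */
static void em_yield(void) { longjmp(scheduler_ctx, 1); }

/* One computation superstep of virtual processor vp, ending in a yield. */
static void vp_superstep(int vp) {
    work_done[vp]++;   /* stand-in for the real computation */
    em_yield();        /* barrier point: give up the CPU */
}

/* Round-robin scheduler: run one superstep of each virtual processor,
 * SUPERSTEPS times; returns the total number of supersteps executed.
 * (volatile: these locals must survive the longjmp back to setjmp.) */
int run_schedule(void) {
    volatile int total = 0;
    for (volatile int s = 0; s < SUPERSTEPS; s++)
        for (volatile int vp = 0; vp < NUM_VP; vp++) {
            if (setjmp(scheduler_ctx) == 0)
                vp_superstep(vp);   /* control comes back via em_yield() */
            else
                total++;            /* the task yielded: superstep done */
        }
    return total;
}
```

Note that longjmp here always jumps up the stack to an active setjmp, which is well-defined C; resuming a suspended task below the scheduler frame is what requires the separately managed per-task contexts described above.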

emmalloc/emfree: In order to control the swapping of the local context of a virtual processor, we require that its local variables be allocated dynamically via a call to the EM library function emmalloc, and deallocated (if necessary) using a call to emfree. The EM implementation keeps track of each allocation and deallocation request in a data structure local to each real processor. It ensures that any such memory contents belonging to a virtual processor is swapped into the real memory prior to its use, and out to the local disks after the respective virtual processor has finished its computation superstep.
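As a sketch of the bookkeeping involved, emmalloc/emfree might be implemented along the following lines. This is a simplified, hypothetical version for a single virtual processor (a fixed-size table, no disk I/O); the real library would keep one such registry per virtual processor and use it to drive the swap-in/swap-out services, and em_swappable_bytes is an invented helper for the example.

```c
#include <stdlib.h>

/* Registry of the memory regions owned by one virtual processor, so
 * that the runtime knows exactly what to swap to disk at a yield. */
#define MAX_REGIONS 1024

typedef struct { void *ptr; size_t size; } Region;

static Region regions[MAX_REGIONS];
static int    num_regions = 0;
static size_t bytes_in_use = 0;

/* emmalloc: allocate memory and record the region for later swapping. */
void *emmalloc(size_t size) {
    if (num_regions == MAX_REGIONS) return NULL;
    void *p = malloc(size);
    if (p != NULL) {
        regions[num_regions].ptr  = p;
        regions[num_regions].size = size;
        num_regions++;
        bytes_in_use += size;
    }
    return p;
}

/* emfree: release the region and forget about it. */
void emfree(void *p) {
    for (int i = 0; i < num_regions; i++)
        if (regions[i].ptr == p) {
            bytes_in_use -= regions[i].size;
            free(p);
            regions[i] = regions[--num_regions];  /* swap-delete */
            return;
        }
}

/* Total data a context swap would move for this virtual processor. */
size_t em_swappable_bytes(void) { return bytes_in_use; }
```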

Figure shows the main services provided by the EM runtime library, organized into layers for purposes of explanation. In the application layer, the program code originally written for execution in each processor of a BSP machine is run as a series of processes on an EM-BSP machine. Requests for communication between processes, and implicit barrier synchronization of processes, are assumed to have been coded as MPI function calls. We also assume that memory allocation and deallocation requests have been coded as calls to our emmalloc and emfree library services.

Figure: Organization of EM library services; services in a layer are available to the layers above. Application layer: virtual processors/tasks 0, 1, 2, 3, ..., v-1. Communication layer: inter-task communication services (bcast, alltoall, gather, allgather, scatter). System services layer: task scheduling (yield); memory management (gmalloc, gfree); disk I/O services (swapin, swapout, readfromdisk, writefromdisk).

Implementation of communication between virtual processors. The exchange of messages between virtual processors of a parallel algorithm is modelled in our implementation by the writing of the message data to disk by the sender, and the reading of message data from disk by the receiver. In general, it may also be necessary to move message data from one real processor to another. We discuss the latter issue in Section . Here we focus on the case of a single real processor, and the movement of message data between internal memory and disk.

We pattern our implementation after the suite of communication services offered by the MPI communication standard. The functions described below do not cover the entire spectrum of services available in MPI, but they address the major functionality requirements, and they are sufficient for our case study. We believe that the missing pieces can be designed without major difficulties. Each EM library function replaces an MPI communication function of the same name. We omit details of the arguments of each function for simplicity.

alltoall: This function sends v different specified portions of a buffer to v different destination processors. It also implies a reception of v such messages in each processor by the beginning of the next computation superstep. On a single real processor, the external memory version of alltoall involves (i) writing the v outgoing buffer portions to the local disks, in parallel, as each virtual processor is simulated in turn; (ii) calling yield to pass control to the next virtual processor in the current superstep; (iii) reading the appropriate data from the disks, in parallel, just prior to the next computation superstep of each destination processor. The methodology we use here is the topic of most of Chapter . On multiple real processors, the EM implementation must use a real communication superstep after each simulation round to deliver messages to the correct real processor, where they must be received and stored on the local disks. This issue arises independently of the particular MPI function used, and is discussed further in Section .
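On a single real processor, steps (i)-(iii) amount to staging each source's outgoing portions in a per-destination message area and reading them back before the destination's next superstep. The toy sketch below models the disk-resident message areas with an in-memory array, assuming equal-sized portions for simplicity; the names em_alltoall_send and em_alltoall_recv are invented, and the actual library writes and reads these areas on the local disks in parallel.

```c
#define V 4           /* virtual processors (small example value) */
#define PORTION 2     /* ints per portion, equal-sized for simplicity */

/* msg_area[dest][src]: stands in for dest's message area on the disks. */
static int msg_area[V][V][PORTION];

/* Superstep t: virtual processor src "writes" its v outgoing portions
 * to the message areas of the destinations. */
void em_alltoall_send(int src, int sendbuf[V][PORTION]) {
    for (int dest = 0; dest < V; dest++)
        for (int i = 0; i < PORTION; i++)
            msg_area[dest][src][i] = sendbuf[dest][i];
}

/* Superstep t+1: virtual processor dest "reads" its incoming portions
 * back just before its next computation superstep. */
void em_alltoall_recv(int dest, int recvbuf[V][PORTION]) {
    for (int src = 0; src < V; src++)
        for (int i = 0; i < PORTION; i++)
            recvbuf[src][i] = msg_area[dest][src][i];
}
```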

bcast: This function broadcasts the contents of a buffer on a designated root node to each of the v processors. On a single real processor, the external memory implementation writes the contents of the buffer to the message areas of the destination processors on the local disks. Since there can be only one broadcasting processor in this superstep, the EM implementation should arrange to simulate only the broadcasting processor, and skip the others until the next superstep. For multiple real processors, the EM implementation should first conscript a virtual processor on each of the real nodes and perform a partial broadcast from the root node to each of these. Then the broadcast can proceed locally, in parallel, according to the technique used for a single real processor.

gather: This function causes each of the v processors to send a local buffer to a common designated destination processor. On a single real processor, the external memory implementation writes the outgoing message from a processor to a designated slot in the incoming message area of the destination processor on the local disks. The BSP algorithm being simulated must respect the fact that, overall, no more than cN/v data should be sent to any single destination, for some known fixed constant c. In the case of multiple real processors, the EM implementation should first conscript a virtual processor as an intermediary on each real node. Simultaneous local gathers should be done to these intermediaries, followed by a single MPI gather involving real communication between the intermediaries and the destination virtual processor.

Implementation of parallel disks. To implement parallel disk access, we also use lightweight tasks, one per physical disk. These tasks, which we refer to as disk drivers, do not require the services of the CPU for a prolonged period at a time. They need prompt service when real events, such as I/O completion or requests for I/O, occur, but the amount of computation required before they become blocked should be small. As in Section , the disk driver tasks were implemented as individual preemptively scheduled threads, reflecting their role as high-priority foreground processes.


This means that their access to the CPU is driven by events, such as the completion of an I/O request or the expiration of a time slice. Unlike the virtual processor tasks, the disk drivers do not relinquish control of the CPU voluntarily, except when they wait for an event to occur, such as arrival of a request to read or write a block, or arrival of a completion event on a previously posted I/O operation. The preemptively scheduled threads (background thread and disk drivers) were implemented by means of a Posix threads package included with the Linux operating system.
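A disk driver of this kind can be sketched with a Posix threads condition variable: the thread sleeps until a request is posted, services it, and signals completion. The version below is a hypothetical single-request driver writing blocks to an in-memory "disk"; the names disk_driver, disk_write and disk_stop are invented, and a real driver would keep a queue of requests and issue actual I/O system calls.

```c
#include <pthread.h>
#include <string.h>

#define BLOCK_SIZE 512

static char disk_image[BLOCK_SIZE * 64];      /* simulated disk */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;

static struct {
    const char *data;   /* block to write, or NULL if no request posted */
    int         block;  /* target block number */
    int         done;   /* set by the driver when the write completes */
    int         stop;   /* ask the driver thread to exit */
} req = { NULL, 0, 0, 0 };

/* The driver thread: blocked except when an event (request) arrives. */
void *disk_driver(void *arg) {
    (void)arg;
    pthread_mutex_lock(&mtx);
    for (;;) {
        while (req.data == NULL && !req.stop)   /* wait for an event */
            pthread_cond_wait(&cv, &mtx);
        if (req.stop) break;
        memcpy(disk_image + req.block * BLOCK_SIZE, req.data, BLOCK_SIZE);
        req.data = NULL;
        req.done = 1;
        pthread_cond_broadcast(&cv);            /* report completion */
    }
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Post a write request and block until the driver reports completion. */
void disk_write(int block, const char *data) {
    pthread_mutex_lock(&mtx);
    req.data = data; req.block = block; req.done = 0;
    pthread_cond_broadcast(&cv);
    while (!req.done)
        pthread_cond_wait(&cv, &mtx);
    pthread_mutex_unlock(&mtx);
}

/* Ask the driver thread to terminate. */
void disk_stop(void) {
    pthread_mutex_lock(&mtx);
    req.stop = 1;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mtx);
}
```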

Figure illustrates the relationships between the contexts for the various tasks. In the figure, we differentiate between two types of context: (1) Application context (see item G in the figure): this consists of all of the application's data that is required for it to execute; this is the usual meaning, as used in other chapters of this thesis. (2) System context (see item J in the figure): this is a small subset of the application context, consisting of the contents of the CPU registers of a running task (i.e., program counter, stack pointer, etc.); this is the context saved/restored by the setjmp/longjmp calls.

Multiple Real Processors

When multiple real processors are present, the EM-MPI functions (e.g., alltoall, bcast, etc.) not only must write the indicated data to disk, but must also send data to the necessary real processors, using the corresponding real MPI function. Because each original communication superstep is also broken up into two communication substeps by the BalancedRouting procedure (Section ), we can be assured that in each of these substeps a virtual processor sends the same amount of data to every other virtual processor (Theorem ). Since each real processor simulates the same number of virtual processors, the real processors also receive equal amounts of data from each other.
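The effect of the two balanced substeps can be sketched as follows: the portion from source s to destination d travels via intermediate processor (s + d) mod v, so in each substep every processor sends exactly one portion to every other processor. This is an illustration of the Valiant-style two-phase routing idea only, with one fixed-size portion per pair; the name balanced_route and the data layout are invented for the example, and this is not the thesis's BalancedRouting procedure itself.

```c
#define V 4   /* virtual processors (small example value) */

typedef struct { int src, dest, data; } Portion;

/* phase1[i][s]: portion that source s sends to intermediate i. */
static Portion phase1[V][V];
/* delivered[d][s]: portion from source s as finally received by d. */
static Portion delivered[V][V];

/* Two-phase balanced routing: data[s][d] is the payload from s to d.
 * Phase 1: each source sends exactly one portion to each intermediate.
 * Phase 2: each intermediate holds exactly one portion per destination
 * (the one with s + d congruent to i mod V) and forwards it. */
void balanced_route(int data[V][V]) {
    for (int s = 0; s < V; s++)
        for (int d = 0; d < V; d++) {
            int i = (s + d) % V;
            phase1[i][s] = (Portion){ s, d, data[s][d] };
        }
    for (int i = 0; i < V; i++)
        for (int s = 0; s < V; s++) {
            Portion p = phase1[i][s];
            delivered[p.dest][p.src] = p;
        }
}
```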

The p real processors require one real communication step per computation round, i.e., v/(pk) real communication steps per computation round of the original CGM algorithm. Because of the perfect balance in communication between the real processors, we can perform the corresponding real communication round at the end of each computation round of the real processors, before the current batch of k virtual processors is swapped out.

The steps in executing the EM library version of alltoall, for instance, are as follows (the other EM-MPI communication functions are similar). Recall that alltoall sends v not necessarily equal-sized portions of a buffer, one portion to each virtual processor. Let S represent the current set of k virtual processors in the memory of the current real processor.

1. Execute BalancedRouting to route the v specified buffer portions, using the real MPI alltoall function.

2. Write the v buffer portions received in step 1 to the local disks.

3. Call yield to pass control to the next set of k virtual processors in the current superstep.

4. Read the appropriate data from the disks, in parallel, i.e., just prior to the next computation superstep of the processors in S.

Figure: System design of the external memory library. A: context and code of the EM-BSP program; B: private context and code of the background thread; C: disk drivers' private data (code, stack); D: local disks 0, 1, ...; E: common data structures (request queue, semaphores, etc.); F: BSP algorithm code; G: BSP application context (swapped by EM library services); H: EM library services code and context; I: stack; J: FIFO queue of system contexts for waiting virtual processors.

A Deterministic EM-CGM Samplesort

We now present a CGM sorting algorithm due to Shi and Schaeffer, which we call CGMSampleSort. This algorithm has the following features:

1. It is asymptotically optimal in computation time for N >= v^3.

2. The load balancing between processors is nearly perfect in practice, and within a factor of two in theory.

3. It causes little memory and network contention.

We will use it as a basis for a parallel external memory sort.

Algorithm CGMSampleSort

Input: The N items to be sorted are distributed evenly among the internal memories of the v processors of a CGM machine. Each processor has a unique label between 0 and v - 1.

Output: The N items are sorted in the memories of the CGM processors. The smallest N/v items are located in the memory of processor 0, the next smallest N/v items are located in the memory of processor 1, etc.

1. The v processors each sort the N/v items in their local internal memories.

2. Each processor chooses v equally spaced samples from the items in its possession.

3. The processors each send their v samples to processor 0.

4. Processor 0 sorts the v^2 samples and chooses from them v - 1 equally spaced splitter elements.

5. Processor 0 sends the v - 1 splitters to each of the other processors.

6. Each processor divides its local elements into v buckets, determined by the splitter elements.

7. Each processor sends the contents of bucket i to processor i, for all 0 <= i < v.

8. Each processor merges the v sets of items now in its possession to produce the final sorted order.

End of Algorithm
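Steps 4 and 6 are the heart of the splitter mechanism: processor 0 keeps every v-th of the v^2 sorted samples, and each processor then locates the bucket of an item by binary search over the v - 1 splitters. The following is a minimal sketch of those two steps; the helper names choose_splitters and bucket_of are invented for the example.

```c
/* Choose v-1 equally spaced splitters from v*v sorted samples
 * (processor 0's job in Step 4 of CGMSampleSort). */
void choose_splitters(const int *samples, int v, int *splitters) {
    for (int i = 1; i < v; i++)
        splitters[i - 1] = samples[i * v - 1];   /* every v-th sample */
}

/* Return the bucket (destination processor) of item x, given the v-1
 * sorted splitters, by binary search (Step 6). Items equal to a
 * splitter go to the lower bucket. */
int bucket_of(int x, const int *splitters, int v) {
    int lo = 0, hi = v - 1;          /* bucket index range [0, v-1] */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (x <= splitters[mid]) hi = mid; else lo = mid + 1;
    }
    return lo;
}
```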

Lemma: No processor receives more than 2N/v items as a result of Step 7 of CGMSampleSort, provided that v^2 <= N/v.

Lemma: CGMSampleSort uses O((N/v) log N + L) computation time, O(g(N/v) + L) communication time, O(N/v) local memory per processor, and O(1) communication supersteps, provided that v^2 <= N/v.

Proof: Steps 2 and 4 require that v^2 <= N/v. The preceding lemma ensures that O(N/v) local memory is used by CGMSampleSort, and there are clearly a constant number of supersteps.

Computation time for Step 1 is O((N/v) log(N/v)). Processor 0 takes O(v^2 log v) time for Step 4. Computation times for Steps 2 and 5 are O(v), and for Steps 6 and 8 the computation times are O((N/v) log v), using the preceding lemma. Since v^2 <= N/v, the computation time overall is O((N/v) log N).

The only communication supersteps are Steps 3, 5 and 7. Communication time is O(g(N/v)) for Step 7, plus O(g v^2) for Steps 3 and 5, which is bounded by O(g(N/v) + L).

Lemma: CGMSampleSort can be simulated on a p-processor EM-CGM machine in O((N log N)/p + (v/p)L) computation time, O(g(N/p) + (v/p)L) communication time, and O(G(N/(pBD)) + (v/p)L) I/O time, provided that p <= v, N >= v^2 B, and B >= v.

Proof: We use the results of the preceding lemma and Theorem . There are O(1) supersteps in CGMSampleSort, so the corresponding superstep parameter in Theorem is O(1). The context size for a CGM algorithm is O(N/v). We assume k = 1. The slackness constraint required by the preceding lemma is v^2 <= N/v, which is implied by N >= v^2 B for B >= v.

Experimental Results

Preliminary implementation experiments with sorting support our expectations of linear running time. Figure shows running times for a CGM sorting algorithm on a single processor, (a) using virtual memory and LAM MPI (see ), and (b) converted to an EM-CGM algorithm by our deterministic simulation. Initially, the simulated algorithm runs more slowly than the LAM MPI algorithm. The main reason is that the simulated algorithm is performing I/O to swap contexts and deliver messages between the virtual processors, while the LAM MPI version does little or no I/O until the processors begin to exceed the available virtual memory. Once the virtual memory images of the virtual processors collectively get large enough, however, swapping begins to dominate the computation of the LAM MPI version. The simulated algorithm continues steadily, however, its running time apparently growing linearly with the problem size.

Figure: Running times for CGMSampleSort and EM-CGM SampleSort (elapsed time in seconds versus problem size). In each case, 8 virtual processors were simulated on a Pentium with one disk; in the CGM case, LAM MPI was used.

The anomalous point on the EM-CGM timing curve of Figure can be eliminated, and was included for purposes of discussion in Section below.

Discussion and Conclusions

The experiments done to date provide some confirmation of the value of our simulation technique in practice, but are clearly incomplete. More experimental work would be useful in this area, especially to determine the performance of the parallel EM cases.

Our results show that, for problems sufficiently larger than the available real memory, the EM-CGM algorithm can outperform the CGM algorithm by a wide and increasing margin. We note that the experiments were performed by implementing algorithm SeqCompoundSuperstep (Chapter ) without the message balancing step of algorithm BalancedRouting, or equivalent. As a result, we were forced to accommodate messages up to M in size. This resulted in the messaging matrix being vN items in size, rather than N items in size. We expect that the inclusion of the balancing step, while increasing the computation time slightly, should reduce the overall execution time by reducing the I/O requirements by nearly an order of magnitude when v = 8. The excessive size of the messaging matrix also limited the size of problem that we were able to accommodate, due to limited disk size, and we expect that experiments with larger problems would be instructive.

There appears to be an anomalous point on the EM-CGM timing curve of Figure . This point is both anomalous and reproducible. When the test runs are performed in a batch, one after the other, this observation can be reproduced. However, when the tests are run individually, separated in time, and with careful attention to the elimination of transient disk and memory congestion, we observe a running time more in line with our expectations. Such phenomena occur regularly in our tests, and can be attributed to the buffer cache management policies of the Linux operating system. As mentioned in Section , Linux uses any available main memory as a buffer cache, caching I/O buffers for potential future use. We also observed that writes to disk were faster than reads in many cases, suggesting that applications are allowed to continue after a write request when data had been written to the buffer cache but had not yet physically been written to disk. Naturally, this complicates the measurement process.

As mentioned above, the first-order effects of our techniques on performance were clear and observable. However, as in Section , we found that in practice it was difficult to quantify the performance effects of many of our attempts at fine-tuning the code, due to the memory management and I/O buffering practices of the Linux operating system. While the Linux buffer cache management is very effective on programs that are not I/O-aware, we believe that the advent of I/O-aware applications may justify a rethinking of the buffer cache management policies of Unix, to enable further performance gains.

Chapter

Conclusions and Future Work

It has been said that "Beauty is in the eye of the beholder". The same can be said about what is interesting and what is not. To me, the relationship between parallel algorithms and external memory algorithms is interesting, and well worth the effort that went into this research. That a body of existing work in parallel algorithms can now be put to use in the field of external memory algorithms is a satisfying effect that has only been partially explored in this thesis.

This work has demonstrated several ways to transform a BSP-like algorithm with appropriate attributes to an external memory algorithm that is efficient according to our model, and has identified a considerable number of example applications where suitable BSP-like algorithms are already known.

We believe that the ability of our simulations to accommodate a variable number k of virtual processors in the memory of a real processor (for 1 <= k <= v/p) allows our techniques to conform to a new memory-adaptive EM model recently proposed by Barve and Vitter. In this model, the available memory changes periodically and unpredictably, perhaps due to the demands of unrelated applications. More investigation of this idea, as it applies to our techniques, seems warranted.

Our new EM-BSP algorithms have the advantage that they are essentially quite simple compared to some previously known external memory algorithms for the same problems, and they accommodate multiple processors on the target machine. They have the limitation that the ratio of internal memory size M to problem size N should not be too small. This constraint comes rather directly from the slackness built into the parallel BSP algorithms from which our EM-BSP algorithms have generally been derived. However, it seems that in many cases this slackness constraint can be reduced at the cost of more communication and I/O rounds. This provides an opportunity for further research.

CHAPTER CONCLUSIONS AND FUTURE WORK

We argue, however, that even for very large problems the slackness constraint is not an impediment in practice, as it seems reasonable that the internal memory size should grow as some function of the external memory size. Even using our deterministic simulation from Chapter , terabyte-sized problems can be solved with an internal memory which is only a small fraction of the external memory size.

The fact that we use the term "simulation" to characterize our techniques raises the obvious question of their performance in practice. It should be emphasized that implementations of parallel algorithms can, in principle, be executed directly, by substituting a different runtime library for the original communications library. This was illustrated in a preliminary way by the design and experiments described in Section . The most interesting issue seems to be to what extent the overheads incurred during the swapping of the contexts of virtual processors can be reduced. In a purely automatic adaptation of existing parallel codes, the prospects for achieving comparable performance to a hand-coded external memory algorithm seem limited. However, there may be possibilities for the programmer of a parallel program to provide additional information to the runtime library, or even the compiler, in order to make such automatic efficient scalability achievable in practice. This is an area for future research.

Another interesting direction is to view parallel algorithms as a series of decompositions of a problem into subproblems which are independent and can be solved in any order, including simultaneously. These are algorithms which, we have argued, can be executed efficiently as external memory algorithms under the appropriate conditions. The resulting algorithms have the distinct advantage that they continue to be efficient when multiple processors are present. While it may turn out that the individual instances of parallel external memory algorithms which we cite in Chapter are efficient only in theory, and not when compared to algorithms hand-coded for the purpose, the paradigm of designing external memory algorithms from a parallel framework is compelling. This is an area which is receiving increasing attention.

The techniques presented here are analyzed only with respect to their asymptotic complexity, and we have not investigated the size of the constant factors. On the other hand, very few benchmark studies have been made to test the true practicality of some of the known external memory algorithms. Even without such experience, it seems clear that many of the existing external memory algorithms have not been designed for multiple processors, and may therefore perform poorly in such an environment. Further benchmark studies are indicated, both on the algorithms we propose here and on many previously known algorithms.

Furthermore, if an existing external memory algorithm for a particular problem can be shown to be intrinsically better than one obtained from a parallel algorithm, then perhaps the external memory algorithm can be used to produce a better parallel algorithm. The cross-fertilization of the two areas seems to me to be one of the most exciting areas for future research. If a better external memory algorithm already exists for a single processor machine, but is not efficient on any available parallel machine, perhaps it is not better: large problems often demand the sort of computing power that is most economically available via parallel processing. Scalability to multiple processors is an intrinsic feature of the external memory algorithms derived from our simulation techniques.

We have concentrated on the interaction between the main memory and the external disk system. Many of the same issues also arise between the cache memory and main memory layers of the memory hierarchy. This is an area which is becoming increasingly important, as increases in processor speed continue to outstrip the improvements in the speed of affordable main memory. The increasing discrepancy in speed seems likely to prompt the addition of more layers in the cache hierarchy. Perhaps, in particular, parallel algorithms can provide an algorithmic basis for using such caches more effectively.

Appendix A

Useful Probabilistic Results

In this Appendix we list a number of results that are used in Chapter .

A.1 Tail Estimates

We use the following tail estimates.

Lemma A.1: If X is a nonnegative random variable and r > 0, we have

    Pr[X >= u] <= E[e^{rX}] / e^{ru}.

Proof: We use the following Markov inequality: if X is any nonnegative random variable, then for all t > 0,

    Pr[X >= t] <= E[X] / t.

For any positive real r,

    Pr[X >= u] = Pr[e^{rX} >= e^{ru}].

Applying the above Markov inequality to the right-hand side, we have

    Pr[X >= u] <= E[e^{rX}] / e^{ru}.



Lemma A.2: Let X_1, ..., X_n be independent random variables with 0 <= X_i <= k. Then, for u >= e^2 and m = E[\sum_{i=1}^{n} X_i],

    Pr[ \sum_{i=1}^{n} X_i >= u m ] <= exp( -(m/(2k)) u \ln u ).

APPENDIX A USEFUL PROBABILISTIC RESULTS

Proof: Hoeffding showed that for this situation we have

    Pr[ \sum_{i=1}^{n} X_i >= u m ] <= ( e^{u-1} / u^u )^{m/k}.

We have

    ( e^{u-1} / u^u )^{m/k} = e^{(m/k)(u - 1 - u \ln u)} <= e^{-(m/(2k)) u \ln u}.

Finally, we have u \ln u - u + 1 >= (1/2) u \ln u if (1/2) \ln u >= 1, or u >= e^2.

Lemma A.3: Given x balls and y bins, if the balls are randomly and independently distributed to the bins, then no bin contains more than l(x/y) balls, with probability at least

    1 - e^{-(x/y) l + \ln y}.

Proof: The probability of the event that a particular bin receives exactly i balls is

    C(x, i) (1/y)^i (1 - 1/y)^{x-i} <= C(x, i) (1/y)^i.

Let X be the event that a particular bin receives more than k balls, and let Y denote the event that at least one of the bins receives more than k = l(x/y) balls. Thus

    Pr[X] <= \sum_{i > k} C(x, i) (1/y)^i.

Using the bound C(x, i) <= (xe/i)^i <= (xe/k)^i for i > k, we can conclude that

    Pr[X] <= (xe/(ky))^k \sum_{i >= 0} (x/(ky))^i.

For k = l(x/y), we note that x/(ky) = 1/l, and so

    \sum_{i >= 0} (x/(ky))^i = \sum_{i >= 0} (1/l)^i = l/(l - 1) <= 2 for l >= 2.

Thus, for k = l(x/y) and l >= e^2,

    Pr[X] <= 2 (xe/(ky))^k = 2 (e/l)^{l x/y} = 2 e^{(x/y)(l - l \ln l)}.

Recall that Y denotes the event that at least one of the y bins receives more than k = l(x/y) balls. Then

    Pr[Y] <= y Pr[X] <= 2 y e^{(x/y)(l - l \ln l)} = 2 e^{(x/y) l - (x/y) l \ln l + \ln y}.

Since l \ln l >= 2l for l >= e^2, the term (x/y) l \ln l absorbs 2(x/y) l together with the constant factor. So, for x and y functions of the problem size, and for constant l >= e^2,

    Pr[Y] <= e^{-(x/y) l + \ln y}.
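A quick Monte Carlo experiment illustrates the lemma: with x balls thrown independently into y bins, the maximum load stays very close to the mean x/y, far below l(x/y) for any moderately large constant l. The generator and the function name max_bin_load are invented for the illustration; a simple 64-bit linear congruential generator keeps the run deterministic.

```c
#include <stdlib.h>

/* Deterministic pseudo-random source for the experiment. */
static unsigned long long rng_state = 12345;
static unsigned long long rng_next(void) {
    rng_state = rng_state * 6364136223846793005ULL
              + 1442695040888963407ULL;
    return rng_state >> 33;   /* use the upper bits */
}

/* Throw x balls into y bins uniformly at random and return the
 * maximum bin load observed. */
int max_bin_load(int x, int y) {
    int *load = calloc((size_t)y, sizeof(int));
    int max = 0;
    for (int i = 0; i < x; i++) {
        int bin = (int)(rng_next() % (unsigned long long)y);
        if (++load[bin] > max) max = load[bin];
    }
    free(load);
    return max;
}
```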


A.2 Balancing Lemmas

Lemma A.4: Let R be the number of blocks in each bucket. Let X_{jk} be a random variable representing the number of tracks of disk k that belong to bucket j (a track belongs to bucket j when it contains a record of bucket j). Then, for any fixed j and k, we have the following:

    Pr[ X_{jk} >= l R/D ] <= e^{-l R/D}.

Proof: The proof is similar to one described in . Let g_t denote the number of disks written to from bucket j during write cycle t, for 1 <= t <= C, where C is the total number of write cycles used. We have

    \sum_{1 <= t <= C} g_t = R.

Let G_t be the number of tracks belonging to bucket j that are assigned to disk k in write cycle t. Since only one track can be written to any disk in a write cycle, G_t is restricted to the values 0 and 1. We have Pr[G_t = 1] = g_t/D and Pr[G_t = 0] = 1 - g_t/D. Let G_{G_t}(z) be the probability generating function for G_t:

    G_{G_t}(z) = Pr[G_t = 0] + Pr[G_t = 1] z = 1 + (g_t/D)(z - 1).

Let G_{X_{jk}}(z) be the probability generating function for X_{jk}. We can bound X_{jk} by the sum of independent random variables X_{jk} <= G_1 + G_2 + ... + G_C. For the purpose of bounding, let us consider that X_{jk} = G_1 + G_2 + ... + G_C. Since the sum of independent random variables corresponds to the product of the corresponding probability generating functions, we have

    G_{X_{jk}}(z) = G_{G_1}(z) G_{G_2}(z) ... G_{G_C}(z) = \prod_{1 <= t <= C} ( 1 + (g_t/D)(z - 1) ).

By the tail estimate Lemma A.1, we have

    Pr[ X_{jk} >= l R/D ] <= E[e^{r X_{jk}}] / exp(r l R/D)

for each r > 0. We can express the numerator, using the definitions of expected value and probability generating function, as

    E[e^{r X_{jk}}] = \sum_t Pr[X_{jk} = t] e^{rt} = G_{X_{jk}}(e^r) = \prod_{1 <= t <= C} ( 1 + (g_t/D)(e^r - 1) ).

By \sum g_t = R and convexity arguments, we can maximize this expression by setting g_t = R/C for each t. Thus

    E[e^{r X_{jk}}] <= ( 1 + (R/(DC))(e^r - 1) )^C.

Substituting this bound, we get

    Pr[ X_{jk} >= l R/D ] <= ( 1 + (R/(DC))(e^r - 1) )^C / exp(r l R/D).

From the bound 1 + a <= e^a for a >= 0, we can approximate the numerator and get, for r = \ln(l/2),

    Pr[ X_{jk} >= l R/D ] <= exp( (R/D)(e^r - 1) - r l R/D )
                          = exp( (R/D)( l/2 - 1 - l \ln(l/2) ) ).

Now, since

    l \ln(l/2) - l/2 + 1 >= l

for (approximately) l >= 9, and since R and D are functions of the problem size,

    Pr[ X_{jk} >= l R/D ] <= e^{-l R/D}

for constant l.

Lemma A allows us to exploit the indep endence of the random exp eriments

p erformed during each comp ound sup erstep in order to prove that the success prob

ability for the entire simulation is as large as for the simulation of a single comp ound

sup erstep

Lemma A Let X X X beindependent random variables such that Pr X

z i

lT exp l log l m and PrX lT exp l log l m where m ln x and

i

l for i z Let T xT be the worstcase size of X for any i that is

wc i

X T for i z Then

i wc

z

X

l log l m

Pr X e l zT e

i

i

log log l

c c

Pro of We identify two cases z xm and z xm for c

log m

P

z

First we consider Case The mean of the quantity X can be bounded

i

i

from ab ove as follows for suitable constant l and m ln x

z z

X X

E X lT PrX lT T PrX lT

i i wc i

i i

zlT exp l log l m zxT exp l log l m

zlT zT exp l log l m lnx

l log l m l log l m

zlT e zxT e

l log l mln x

zlT zT e

l zT A

P

z

We can bound the mean E X from b elow as follows

i

i

z z

X X

E X Pr X lT lT PrX lT T

i i i

i i

l zT A

P

z

Ho eding using k T xT m E X l zT From Lemma A

wc i

i

c

and z xm

z

X

l zT

Pr X e m exp e

i

xT

i

z

X

c

X e l zT exp l m Pr

i

i

APPENDIX A USEFUL PROBABILISTIC RESULTS

So

z

X

l log l m

Pr X e l zT e

i

i

and

z

X

l log l m

Pr X e l zT e

i

i

c

Now we consider Case We rep eat the exp erimentonlyz xm times where

log log l

c Thus we have with m ln x

log m

z

X

z

Pr X zlT exp l log l m

i

i

c

xm exp l log l m

exp l log l m c ln m lnx

l log l m

e

So

z

X

l log l m

X z l T e Pr

i

i 

APPENDIX A USEFUL PROBABILISTIC RESULTS

A Summary of Notation and Index

Symbol Meaning

b the communication blo ck size

B the disk blo ck size

D the number of disk drives on a real pro cessor

g the time required for the router to deliver a packet

of size b see BSP Mo del

g the time required for the router to deliver a packet

of unit size see BSP mo del

G the time required for a pro cessor to transfer D blo cks between

its lo cal disks and its lo cal memory

k the number of virtual pro cessors simulated concurrently by a

single real pro cessor

L the minimum time required for synchronizing the pro cessors

M the memory size of a real pro cessor

N the number of data items in the problem

p the number of real pro cessors

v the number of virtual pro cessors

v

v

k

the maximum communication volume sent or received

by a virtual pro cessor in a single sup erstep

the number of BSP sup ersteps in an algorithm

the maximum size of the context of a virtual pro cessor

Bibliography

P Agarwal L Arge M Murali K Varadara jan and J Vitter IOecient

algorithms for contourline extraction and planar graph blo cking In Proceedings

of the th Annual ACMSIAM Symposium on Discrete Algorithms

A Aggarwal B Alp ern A K Chandra and M Snir A mo del for hierarchical

memory In Proc ACM Symp on Theory of Computation pages

A Aggarwal A K Chandra and M Snir Hierarchical memory with blo ck

transfer In Proc IEEE Symp on Foundations of Comp Sci pages

A Aggarwal and J S Vitter The InputOutput complexity of sorting and

related problems Communications of the ACM

R Ahulja K Mehlhorn J Orlin and R Tarjan Faster algorithms for the

shortest path problem Journal of the ACM

A Alexandrov M F Ionescu K E Schauser andCScheiman BSP vs LogP

In Proc ACM Symp on Paral lel Algorithms and Architectures pages

B Alp ern L Carter E Feig and Selker The uniform memory hierarchymodel

of computation Algorithmica

L Arge The buer tree Anew technique for optimal IOalgorithms In Proc

Workshop on Algorithms and Data Structures LNCS pages

A complete version app ears as BRICS technical rep ort RS University of

Aarhus

L Arge Ecient ExternalMemory Data Structures and Applications PhD

thesis University of Aarhus FebruaryAugust

L Arge M Knudsen and K Larsen A general lower bound on the IO

complexity of comparisonbased algorithms In Proc Workshop on Algorithms

and Data Structures LNCS pages

BIBLIOGRAPHY

L Arge D E Vengro and J S Vitter Externalmemory algorithms for pro

cessing line segments in geographic information systems In Proc Annual Eu

ropean Symposium on Algorithms LNCS pages A complete

version to app ear in sp ecial issue of Algorithmica app ears as BRICS technical

rep ort RS University of Aarhus

M Atallah F Dehne R Miller A RauChaplin and JJ Tsay Multisearch

techniques for implementing data structures on a meshconnected computer

Journal of Paral lel and

D Bader D Helman and J Jaja Practical parallel algorithms for p ersonalized

communication and integer sorting Journal of Experimental Algorithmics

httpwwwjeaacmorg B ader Per sona lize d

R Barve and J S Vitter External memory algorithms with dynamically chang

ing memory allo cation Technical Rep ort CS Duke University June

R D Barve E F Grove and J S Vitter Simple randomized mergesort on

parallel disks Paral lel Computing June

A Baumker and W Dittrich Fully dynamic search trees for an extension of

the bsp mo del In Proc ACM Symp on Paral lel Algorithms and Architectures

pages

A Baumker W Dittrich and F M auf der Heide Truly ecient parallel

algorithms coptimal multisearch for an extension of the bsp mo del In Proc

Annual European Symposium on Algorithms pages

G Bilardi K T Herley A Pietracprina G Pucci and P Spirakis BSP vs

LogP In Proc ACM Symp on Paral lel Algorithms and Architectures pages

G Brassard and P Bratley Algorithmics Theory and Practice Prentice Hall

Englewood Clis New Jersey

E Caceres F Dehne A Ferreira P Flo cchini I Reiping N Santoro and

S Song Ecient parallel graph algorithms for coarse grained multicomputers

and bsp In Proc Int Col loquium Algorithms Languages and Programming

LNCS pages

A Chan F Dehne and A RauChaplin Coarse grained parallel next element

search In Proc International Paral lel Processing Symposium pages

BIBLIOGRAPHY

P M Chen E K Lee G A Gibson R H Katz and D A Patterson

RAID highp erformance reliable secondary storage ACM Computing Surveys

June

YJ Chiang Dynamic and IOEcient Algorithms for Computational Geom

etry and Graph Problems Theoretical and Experimental Results PhD thesis

Brown University August

YJ Chiang Exp eriments on the practical IO eciency of geometric algo

rithms Distribution sweep vs plane sweep In Proc Workshop on Algorithms

and Data Structures LNCS pages

YJ Chiang M T Go o drich E F Grove R Tamassia D E Vengro and

J S Vitter Externalmemory graph algorithms In Proc ACMSIAM Symp

on Discrete Algorithmspages

A Colvin and T H Cormen ViC A Compiler for VirtualMemory C Tech

nical Rep ort PCSTR Dartmouth College Computer Science Hanover

NH November

D Comer The ubiquitous Btree ACM Computing Surveys

T H Cormen Virtual Memory for Data Paral lel Computing PhD thesis De

partment of Electrical Engineering and Computer Science Massachusetts Insti

tute of Technology

T H Cormen Fast p ermuting on disk arrays Journal of Paral lel and Distributed

Computing

T H Cormen and M T Go o drich Position Statement ACM Workshop on

Strategic Directions in Computing Research Working Group on Storage IO for

LargeScale Computing ACM Computing Surveys A Dec

T H Cormen and M Hirschl Early Exp eriences in Evaluating the Parallel Disk

Mo del with the ViC Implementation Paral lel Computing

T H Cormen and D Kotz Integrating theory and practice in parallel le sys

tems In Proceedings of the DAGSPC Symposium pages Hanover

NH June Dartmouth Institute for Advanced Graduate Studies Revised

as Dartmouth PCSTR on

BIBLIOGRAPHY

T H Cormen C E Leiserson and R L Rivest Introduction to Algorithms

The MIT Press Cambridge Mass

T H Cormen and D M Nicol Performing OutofCore FFTs on Parallel Disk

Systems Paral lel Computing

T H Cormen T Sundquist and L F Wisniewski Asymptotically tight b ounds

for p erforming BMMC p ermutations on parallel disk systems SIAM Journal of

Computing

T H Cormen J Wegmann and D M Nicol Multipro cessor OutofCore FFTs

with Distributed Memory and Parallel Disks Technical Rep ort PCSTR

Dartmouth College Computer Science Hanover NH January

A Crauser K Mehlhorn E Althaus K Brengel T Buchheit J Keller

H Krone O Lamb ert R Schulte S Theil M Westphal and R Wirth On the

p erformance of LEDASM Technical Rep ort MPII MaxPlanckInstitut

fur Informatik Nov

D E Culler R M Karp D Patterson A Sahay E E Santos K E Schauser

R Subramonian and T von Eicken Logp A practical mo del of parallel com

putation Communications of the ACM Nov

F Dehne X Deng P Dymond A Fabri and A Khokhar A randomized

parallel d convex hull algorithm for coarse grained multicomputers In Proc

ACM Symp on Paral lel Algorithms and Architectures pages

F Dehne W Dittrich and D Hutchinson Ecient external memory algo

rithms by simulating coarsegrained parallel algorithms In Proc ACM Symp

on Paral lel Algorithms and Architectures pages

F Dehne W Dittrich D Hutchinson and A Maheshwari Parallel virtual

memory In Proc ACMSIAM Symp on Discrete Algorithms pages

F Dehne W Dittrich D Hutchinson and A Maheshwari Reducing IO com

plexity by simulating coarse grained parallel algorithms In Proc International

Symposium pages Paral lel Processing

F Dehne A Fabri and A RauChaplin Scalable parallel computational geom

etry for coarse grained multicomputers International Journal on Computational

Geometry

BIBLIOGRAPHY

F Dehne and A RauChaplin Implementing data structures on a hyp ercub e

multipro cessor and applications in parallel computational geometry Journal of

Paral lel and Distributed Computing

E W Dijkstra Co op erating sequential pro cesses In F Genuys editor Pro

gramming Languages pages Academic Press

W Dittrich Communication and IO Ecient Paral lel Data Structures PhD

thesis University of Paderb orn

W Dittrich D Hutchinson and A Maheshwari Blo cking in parallel multisearch

problems In Proc ACM Symp on Paral lel Algorithms and Architectures pages

H Edelsbrunner and M Overmars Batched dynamic solutions to decomp osable

searching problems Journal of Algorithms

R W Floyd Permuting information in idealized twolevel storage In Complexity

of Computer Calculations pages R Miller and J Thatcher Eds

Plenum New York

S Fortune and J Wyllie Parallelism in random access machines In Proc ACM

Symp on Theory of Computation pages

A Gerb essiotis and C Siniolakis Communication Ecient Data Structures on

the BSP Mo del with Applications in Computational Geometry In Proceedings

of EUROPARAugust

A V Gerb essiotis and L G Valiant Direct BulkSynchronous Parallel Algo

rithms Journal of Paral lel and Distributed Computing

J E Go o dman and J ORourke Handbook of Discrete and Computational

Geometry CRC Press Bo ca Raton FL

M Go o drich Communication ecient parallel sorting In Proc ACM Symp on

Theory of Computation

M T Go o drich JJ Tsay D E Vengro and J S Vitter Externalmemory

computational geometry In Proc IEEE Symp on Foundations of Comp Sci

pages

H Hellwagner Design considerations for scalable parallel le systems The

Computer Journal

BIBLIOGRAPHY

W Ho eding Probability inequalities for sums of b ounded random variables

American Statistical Association Journal pages

S Huddleston and K Mehlhorn A new data structure for representing sorted

lists Acta Informatica

D Hutchinson A Maheshwari JR Sack and R Velicescu Early exp eriences

in implementing the buer tree In Proceedings of the Workshop on Algorithm

Engineering httpwwwdsiuniveit wa e pro ceed ings

D Knuth The Art of Vol Sorting and Searching

AddisonWesley

LAM MPI httpwwwoscedulamhtm l

K Mehlhorn Data Structures and Algorithms Sorting and Searching

SpringerVerlag EATCS Monographs on Theoretical Computer Science

K Mehlhorn and S Naher LEDA A platform for combinatorial and geometric

computing Communications of the ACM

Message Passing Interface Standards web site httpwwwmpiforumorg

K Munagala and A Ranade IO complexity of graph algorithms Proc ACM

SIAM Symp on Discrete Algorithms pages

M H No dine M T Go o drich and J S Vitter Blo cking for external graph

searching Algorithmica

M H No dine and J S Vitter Deterministic distribution sort in shared and dis

tributed memory multipro cessors In Proc ACM Symp on Paral lel Algorithms

and Architectures pages

M H No dine and J S Vitter Paradigms for optimal sorting with multiple

disks In Proc of the th Hawaii Int Conf on Systems Sciences

M H No dine and J S Vitter Greed sort Optimal deterministic sorting on

parallel disks Journal of the ACM

Y N Patt The IO subsystema candidate for impro vement Guest Editors

Introduction in IEEE Computer

D A Patterson G Gibson and R H Katz A case for redundant arrays of

inexp ensive disks raid Proc ACM Conf on Management of Data pages

BIBLIOGRAPHY

W Paul U Vishkin and H Wagener Parallel dictionaries on trees In Proc

Int Col loquium Algorithms Languages and Programming LNCS pages

J T Poole Preliminary survey of IO intensive applications Technical Rep ort

CCSF Scalable IO Initiative Caltech Concurrent Sup ercomputing Facilities

Caltech

A RauChaplin On Paral lel Data Structures and Paral lel Geometric Applica

tions for Multicomputers PhD thesis Carleton University

J H ReifEditor Synthesis of Paral lel Algorithms Morgan Kaufman

C Ruemmler and J Wilkes An intro duction to disk driv e mo deling IEEE

Computer

Scalable IO Initiative Concurrent Sup ercomputing Consortium

httpwwwcacrcaltechedu SI O

H Shi and J Schaeer Parallel sorting by regular sampling Journal of Paral lel

and Distributed Computing

J Sib eyn and M Kaufmann Bsplike externalmemory computation In Proc

rd Italian Conference on Algorithms and Complexity LNCS pages

R W Stevens Advanced Programming in the Unix Environment Addison

Wesley Don Mills Ont Canada

L G Valiant A bridging mo del for parallel computation Communic ations of

the ACM August

D E Vengro A transparent parallel IO environment In Proc DAGS

Symposium on Paral lel Computation

D E Vengro TPIE User Manual and Reference Duke University

Available via WWW at httpwwwcsdukeedu dev

D E Vengro and J S Vitter Supp orting IOecient scientic computation

in TPIE In Proc IEEE Symp on Paral lel and Distributed Computing

e University Dept of Computer Science technical rep ort App ears also as Duk

CS

D E Vengro and J S Vitter Ecient d range searching in external memory

In Proc ACM Symp on Theory of Computation pages

BIBLIOGRAPHY

J S Vitter External memory algorithms Proc ACM Symp Principles of

Database Systems pages

J S Vitter and E A M Shriver Algorithms for parallel memoryI Twolevel

memories Algorithmica

J S Vitter and E A M Shriver Algorithms for parallel memory I I Hierarchical

multilevel memories Algorithmica

K Yogaratnam CGM Algorithms as EMCGM Algorithms Technical rep ort

Carleton University Scho ol of Computer Science Ottawa Ontario Canada

Undergraduate Honours Pro ject

N Zeh An externalmemory data structure for short

est path queries Masters thesis Fakultat fur Mathematik

und Informatik FriedrichSchillerUniversitat Jena Germany

httpwwwscscarletonca nz eh Publ icat ions th esis ps gz Novem b er