Evaluating Parallel Algorithms

Theoretical and Practical

Aspects

Lasse Natvig

Division of Computer Systems and Telematics

The Norwegian Institute of Technology

The University of Trondheim

NORWAY

November

Abstract

The motivation for the work reported in this thesis has been to lessen the gap between theory and practice within the field of parallel processing.

When looking for new and faster parallel algorithms for use in massively parallel systems, it is tempting to investigate promising alternatives from the large body of research done on parallel algorithms within the field of theoretical computer science. These algorithms are mainly described for the PRAM (Parallel Random Access Machine) model of computation.

This thesis proposes a method for evaluating the practical value of PRAM algorithms. The approach is based on implementing PRAM algorithms for execution on a CREW (Concurrent Read, Exclusive Write) PRAM simulator. Measurement and analysis of implemented algorithms on finite problems provide new and more practically oriented results than those traditionally obtained by asymptotical analysis (O-notation).

The evaluation method is demonstrated by investigating the practical value of a new and important parallel sorting algorithm from theoretical computer science, known as Cole's Parallel Merge Sort algorithm. Cole's algorithm is compared with the well known bitonic sorting algorithm due to Batcher. Cole's algorithm is asymptotically probably the fastest among all known sorting algorithms, and it is also cost optimal. Its O(log n) time consumption implies that it is faster than bitonic sorting, which is O(log^2 n) time, provided that the number of items to be sorted, n, is large enough. However, it is found that bitonic sorting is faster as long as n is less than about Giga Tera (i.e., 10^21) items. Consequently, Cole's logarithmic-time algorithm is not fast in practice.

The thesis also gives an overview of complexity theory for sequential and parallel computations, and describes a promising alternative for parallel programming called synchronous MIMD programming.

Preface

This is a thesis submitted to the Norwegian Institute of Technology (NTH) for the doctoral degree doktor ingeniør (dr.ing.). The reported work has been carried out at the Division of Computer Systems and Telematics, The Norwegian Institute of Technology, The University of Trondheim, Norway. The work has been supervised by Professor Arne Halaas.

Thesis Overview and Scope

About the contents

Chapter 1 starts by explaining how the likely proliferation of computers with a large number of processors makes it increasingly important to know the research done on parallel algorithms within theoretical computer science. It outlines how the work described in the thesis was motivated by a large gap between theory and practice within parallel computing, and it presents some aspects of this gap. The chapter also summarises the main contributions of my work.

Chapter 2 describes various concepts which are central when studying, teaching or evaluating parallel algorithms. It provides an introduction to parallel complexity theory, including themes such as a class of problems that may be solved very fast in parallel, and another class consisting of inherently serial problems. The chapter ends with a discussion of Amdahl's law, and of how the size of a problem is essential for the speedup that may be achieved by using parallel processing.

Chapter 3 is devoted to the so-called CREW PRAM (Concurrent Read, Exclusive Write, Parallel Random Access Machine) model. This is a model for parallel computations which is well known within theoretical computer science, and it is the main underlying concept for this work. After a general description of the model, it is explained how the model may be programmed in a high-level notation. The programming style which has been used, called synchronous MIMD programming, contains some properties which make programming a surprisingly easy task. These nice features are outlined. At the end of the chapter, a more technical description is given of how algorithms may be implemented on the CREW PRAM simulator prototype, which has been developed as part of the work.

Chapter 4 describes the technically most difficult part of the work: a detailed investigation of a well-known sorting algorithm from theoretical computer science, Cole's parallel merge sort. A top-down, detailed understanding of the algorithm is presented, together with a description of how it has been implemented. The implementation is probably the first complete implementation of this algorithm. A comparison of the implementation with several simpler sorting algorithms, based on measurements and analytical models, is presented.

Chapter 5 summarises what has been learned from doing this work, and gives a brief sketch of interesting directions for future research.

Appendix A gives a brief description of the CREW PRAM simulator prototype, which makes it possible to implement, test and measure CREW PRAM algorithms. The main part of the chapter is a user's guide for how to develop and test parallel programs on the simulator system as it is currently used at the Norwegian Institute of Technology.

Appendix B gives a complete listing of the implementation of Cole's parallel merge sort algorithm. In addition to being the fundamental documentation of the implementation, it is provided to give the reader the possibility of studying the details of medium-scale synchronous MIMD programming on the simulator system.

Issues given less attention

The thesis covers complexity theory, detailed studies of a complex algorithm, and the implementation of a simulator system. This variety has made it important to restrict the work. I have tried to separate the goals and the tools used to achieve these goals. It has been crucial to restrict the effort put into the design of the CREW PRAM simulator system. Prototyping and simple ad hoc solutions have been used when possible, without reducing the quality of the simulation results.

The pseudo code notation used to express parallel algorithms is not meant to be a new and better notation for this purpose. It is intended only to be a common-sense modernisation of the notation used in [Wyl] for high-level programming of the original CREW PRAM model.

Similarly, the language used to implement the algorithms as running programs on the simulator is not a proposal of a new, complete and better language for parallel programming, just a minor extension of the language used to implement the simulator.

Simplicity has been given higher priority than efficient execution of parallel programs in the development of the CREW PRAM simulator.

Even though nearly all the algorithms discussed are sorting algorithms, the work is not primarily about parallel sorting. It is not an attempt to find new and better parallel sorting algorithms, nor is it an attempt to survey the large number of parallel sorting algorithms. Sorting has been selected because it is an important problem with many well known parallel algorithms.

Acknowledgements

I wish to thank my supervisor, Professor Arne Halaas, for introducing me to the field of complexity theory, for constructive criticism, for continuing encouragement, and for letting me work on my own ideas.

The Division of Computer Systems and Telematics consists of a lot of nice and helpful people. Håvard Eidnes, Anund Lie, Jan Grønsberg and Tore E. Aarseth have answered technical questions, provided excellent computer facilities, and allowed me to consume an awful lot of CPU seconds. Gunnar Borthne, Marianne Hagaseth, Torstein Heggebø, Roger Midtstraum, Petter Moe, Eirik Knutsen, Øystein Nytrø, Tore Sæter, Øystein Torbjørnsen and Håvard Tveite have all read various drafts of the thesis and provided valuable comments. Thanks!

The work has been financially supported by a scholarship from The Royal Norwegian Council for Scientific and Industrial Research (NTNF).

Last but not least, I am grateful to my family: my parents, for teaching me to believe in my own work; my wife Marit, for doing most of the research in a much more important project, taking care of our children Ola and Simen; and Ola and Simen, who have been a continuing inspiration for finishing this thesis.

Trondheim, December

Lasse Natvig

Contents

Introduction
Massively Parallel Computation
Summary of Contributions
Motivation
Background
Parallel Complexity Theory: A Rich Source of Parallel Algorithms
The Gap Between Theory and Practice
Traditional Use of the CREW PRAM Model
Implementing On an Abstract Model: Maybe Worthwhile
Parallel Algorithms: Central Concepts
Parallel Complexity Theory
Introduction
Basic Concepts, Definitions and Terminology
Models of Parallel Computation
Important Complexity Classes
Evaluating Parallel Algorithms
Basic Performance Metrics
Analysis of Speedup
Amdahl's Law and Problem Scaling
The CREW PRAM Model
The Computational Model
The Main Properties of The Original PRAM
From Model to Machine: Additional Properties
CREW PRAM Programming
Common Misconceptions about PRAM Programming
Notation: Parallel Pseudo Pascal (PPP)
The Importance of Synchronous Operation
Compile Time Padding
General Explicit Resynchronisation
Parallelism Should Make Programming Easier
Introduction
Synchronous MIMD Programming Is Easy
CREW PRAM Programming on the Simulator
Step 1: Sketching the Algorithm in Pseudo-Code
Step 2: Processor Allocation and Main Structure
Step 3: Complete Implementation With Simplified Time Modelling
Step 4: Complete Implementation With Exact Time Modelling
Investigation of the Practical Value of Cole's Parallel Merge Sort Algorithm
Parallel Sorting
Parallel Sorting: The Continuing Search for Faster Algorithms
Some Simple Sorting Algorithms
Bitonic Sorting on a CREW PRAM
Cole's Parallel Merge Sort Algorithm: Description
Main Principles
More Details
Cole's Parallel Merge Sort Algorithm: Implementation
Dynamic Processor Allocation
Pseudo Code and Time Consumption
Cole's Parallel Merge Sort Algorithm Compared with Simpler Sorting Algorithms
The First Comparison
Revised Comparison Including Bitonic Sorting
Concluding Remarks and Further Work
Experience and Contributions
The Gap Between Theory and Practice
Evaluating Parallel Algorithms
The CREW PRAM Simulator
Further Work
A The CREW PRAM Simulator Prototype
A Main Features and System Overview
A Program Development
A Measuring
A A Brief System Overview
A User's Guide
A Introduction
A Getting Started
A Carrying On
A The Development Cycle
A Debugging PIL Programs Using simdeb
A Modelling Systolic Arrays and other Synchronous Computing Structures
A Systolic Arrays: Channels, Phases and Stages
A Example: Unit Time Delay
A Example: Delayed Output and Real Numbers
B Cole's Parallel Merge Sort in PIL
B The Main Program
B Variables and Memory Usage
B Colevar
B Colemem
B Basic Procedures
B Coleproc
B Coleproc
B Merging in Constant Time
B OrderMergeproc
B DoMergeproc
B MaintainRanksproc
C Batcher's Bitonic Sorting in PIL
C The Main Program
References
Index

Chapter 1

Introduction

There are two main communities of designers: the theoreticians and the practitioners. Theoreticians have been developing algorithms with narrow theory and little practical importance; in contrast, practitioners have been developing algorithms with little theory and narrow practical importance.

Clyde P. Kruskal, in Efficient Parallel Algorithms: Theory and Practice [Kru]

Massively Parallel Computation

There is no longer any doubt that parallel processing will be used extensively in the future to build the, at any time, most powerful general purpose computers. Parallelism has been used for many years, in many different ways, to increase the speed of computations. Bit-parallel arithmetic, multiple functional units, pipelined functional units and multiprocessing are all important techniques in this context [Qui].

Today, these and many other ways of parallelism are being exploited by a vast number of different computer architectures, in commercial products and in research prototypes. One can not predict in detail which computer architectures will be the dominating ones in the most powerful parallel general purpose computers of the future. In the eighties, the supercomputer market has been dominated by the pipelined computers, also called vector computers, such as those manufactured by Cray Research Inc. (USA) and similar computers from the Japanese companies Fujitsu, Hitachi and NEC [HJ]. However, the next ten years may change the picture.

At the SUPERCOMPUTING conference in New York in November, it seemed to be a widespread opinion among supercomputer users, computer scientists and computational scientists that massively parallel computers will outperform the pipelined computers and become the dominating architectural main principle in only a few years. The main argument for this belief is that the computing power (speed) of massively parallel computers has grown faster, and is expected to continue to grow faster, than the speed of pipelined computers. The situation is illustrated in Figure 1.1.

[Figure 1.1 sketches speed as a function of time for the two architecture classes: the curve for massively parallel computers crosses above the curve for vector computers at some point in the near future.]

Figure 1.1: Massively parallel computers are expected to be faster than vector computers in the near future.

The inertia caused by the existing supercomputer installations and software applications based on pipelined computers will result in a transitional phase, from the time when massively parallel computers are generally regarded to be more powerful, to the time when they are dominating the supercomputer market. Today, massively parallel computers are dominating in various special purpose applications requiring high performance, such as wave mechanics [GMB] and oil reservoir simulation [SIA].

(Systems with a sufficiently large number of processors were considered massively parallel for the purpose of Frontiers, the 3rd Symposium on the Frontiers of Massively Parallel Computation, held in October. This is consistent with earlier use of the term massive parallelism, see for instance [GMB].)

The point of intersection in Figure 1.1 is not possible to define in a precise and agreed upon manner. Danny Hillis, co-founder of Thinking Machines Corporation, the leading company in massively parallel computing, believes that the transition in computing power has already happened [Hil]. Other researchers would argue that the vector machines will be faster for general purpose processing for some years yet. However, the important thing is that there seems to be almost a consensus that the transition will take place in the near future.

The speed of the fastest supercomputers today is of the order of GFLOPS. This is expected to increase by a factor of about a thousand in the coming years, giving Tera (i.e., 10^12) FLOPS computers [Lun].

Consequences for research in parallel algorithms

The continuing increase in computing power and the adoption of massively parallel computing have at least two important implications for research and education in the field of parallel algorithms:

1. We will be able to solve larger problems. Larger problem instances increase the relevance of asymptotical analysis and complexity theory.

2. We will be able to use algorithms requiring a substantially larger number of processors. One of the characteristics of parallel algorithms developed in theoretical computer science is that they typically require a large number of processors. These algorithms will become more relevant in the future.

To conclude: the possible proliferation of massively parallel systems will make the research done on parallel algorithms within theoretical computer science more important to the practitioners in the future. The work reported in this thesis is an attempt to learn about this research from a practical, engineer-like point of view.

(GFLOPS: 10^9 floating point operations per second. The precise speed depends strongly on the specific application or benchmark used in measuring the speed.)

Summary of Contributions

The research reported in this thesis has contributed to the field of parallel processing and parallel algorithms in several ways. The main contributions are summarised here, to give a brief overview of the work and to motivate the reader for a detailed study.

- A new, or little known, direction of research in parallel algorithms is identified by raising the question: to what extent are parallel algorithms from theoretical computer science useful in practice?

- A possibly new method of investigating this kind of parallel algorithms is motivated and described. The method is based on evaluating the performance of implemented algorithms used on finite problems.

- The use of the method on a large example, Cole's parallel merge sort algorithm [Col], is demonstrated. This algorithm is theoretically very fast, and famous within the theory community. The method and the first results achieved were presented in a preliminary state at NIK (Norwegian Informatics Conference) in November [Nat]. One year after, it was presented at the SUPERCOMPUTING conference [Natb].

- A thorough and detailed top-down description of Cole's parallel merge sort algorithm is given. It should be helpful for those who want a complete understanding of the algorithm. The description goes down to the implementation level, i.e., it includes more details than other currently available descriptions.

- Parts of Cole's algorithm which had to be made clear to be able to implement it are described. These detailed results were presented at The Fifth International Symposium on Computer and Information Sciences in November [Nata].

- Probably the first, and possibly also the only, complete parallel implementation of Cole's parallel merge sort algorithm has been developed. It shows that the algorithm may be implemented within the claimed complexity bounds, and it gives the exact complexity constants for one possible implementation.

- A straightforward implementation of Batcher's O(log^2 n) time bitonic sorting [Bat] on the CREW PRAM simulator is found to be faster than Cole's O(log n) time sorting algorithm as long as the number of items to be sorted, n, is less than about 10^21. This shows that Cole's algorithm is without practical value.

- Prototype versions of the necessary tools for doing this kind of algorithm evaluation have been developed and documented. The tools have been used extensively for the work reported in this thesis, and they are currently being used for continued research in the same direction [Hag].

- The algorithms which are studied are implemented as high-level synchronous MIMD programs. This programming style is consistent with early and central theoretical work on parallel algorithms; however, it is seldom found in the current literature. Synchronous MIMD programming seems to give remarkably easy parallel programming. These aspects of the work were presented at the Workshop on the Understanding of Parallel Computation in Edinburgh, July [Natc].

- The prototype tools for high-level synchronous MIMD programming have been used in connection with the teaching of parallel algorithms. They have been used for several student projects in the courses Highly Concurrent Systems and Parallel Algorithms given by the Division of Computer Systems and Telematics at the Norwegian Institute of Technology.

- As background material, the thesis provides an introduction to parallel complexity theory, a field which has been one of the main inspirational sources of this work. It also contains a discussion of fundamental topics such as Amdahl's law and problem scaling, which have been widely discussed during the last few years.
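Of the two sorting algorithms compared above, Batcher's bitonic sort is by far the simpler. As a point of reference, a sequential Python rendering of the classic bitonic sorting network is sketched below; this is an illustration of the compare-exchange pattern only, not the synchronous MIMD (PIL) implementation used in the thesis.

```python
def bitonic_sort(a):
    """Sort a list in place using Batcher's bitonic network.

    The length of `a` must be a power of two. The two nested while
    loops enumerate the O(log^2 n) compare-exchange stages; on a
    parallel machine each inner `for` loop is one synchronous step.
    """
    n = len(a)
    k = 2
    while k <= n:                      # size of the bitonic sequences
        j = k // 2
        while j > 0:                   # distance between compared items
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The n/2 compare-exchange operations inside each stage are mutually independent, which is what makes the network attractive for a PRAM-style machine: one processor per pair gives one stage per synchronous step.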

Motivation

This section describes how the main goal of the reported work arose from a study of sequential complexity theory and of contemporary literature on parallel algorithms.

Two worlds of parallel algorithms

There are at least two main directions of research in parallel algorithms. In theoretical computer science, parallel algorithms are typically described in a high-level mathematical notation for abstract machines, and their performance is typically analysed by assuming infinitely large problems. On the other side, we have more practically oriented research, where implemented algorithms are tested and measured on existing computers and finite problems.

The main differences between these two approaches are also reflected in the literature. One of the first general textbooks on parallel algorithms is Designing Efficient Algorithms for Parallel Computers by Michael Quinn [Qui]. It gives a good overview of the field, and covers both practical and theoretical results. A more recent textbook, Efficient Parallel Algorithms by Gibbons and Rytter [GR], reports a very large research activity on parallel algorithms within theoretical computer science. Comparing these two books, it seems clear that parts of the theoretical work are not clearly understood, and sometimes neglected, by the practitioners.

Can the theory be used in practice?

During the spring semester I taught a course on parallel algorithms based on Quinn's book and other practically oriented literature. Later that year, I did a detailed study of the fundamental book on complexity theory by Garey and Johnson [GJ]. I found it difficult, but also very fascinating. It led me to some papers about parallel complexity theory, and suddenly a large new area of parallel algorithms opened up. This area contained topics such as Nick's Class of very fast algorithms, and a theory about inherently serial problems. However, I was a bit surprised by the fact that these very fundamental topics were seldom mentioned by the practitioners.

My curiosity was further stimulated when I got a copy of the book by Gibbons and Rytter. This book showed the existence of a large and well established research activity on theoretically very fast parallel algorithms. However, little was said about the practical value of the algorithms.

It then seemed natural to ask the following questions: Are the theoretically fast parallel algorithms simply not known by the practitioners? Is it the case that they are not understood? Or are they in general judged as without practical value?

These questions led to a curiosity about the possible practical use of parallel algorithms and other results from theoretical computer science. I wanted to learn about the border between theoretical aspects with and without practical use, as illustrated in Figure 1.2. Contributions in this direction might lessen the gap between theory and practice within parallel processing. In January, the importance of this goal was confirmed in the proceedings of the NSF-ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana]. There, several of the more prominent researchers in parallel algorithms argue that bridging the gap between theory and practice in parallel processing should be one of the main goals for future research.

[Figure 1.2 shows the theory side (complexity classes, asymptotical analysis, computational models, etc.) and the practice side (parallel computers, algorithm implementation, measuring, etc.), with this work placed on the border between them.]

Figure 1.2: The main goal: to learn about the possible practical value of parallel algorithms from theoretical computer science.

Background

This section describes some of the problems encountered when trying to evaluate the practical value of parallel algorithms from theoretical computer science. The discussion leads to the proposed method for doing such evaluations.

Parallel Complexity Theory: A Rich Source of Parallel Algorithms

In the past years there has been an increasing interest in parallel algorithms. In the field of parallel complexity theory, the so-called Nick's Class (NC) has been given much attention. A problem belonging to this complexity class may be solved by an algorithm with polylogarithmic running time (i.e., O((log n)^k) time, where k is a constant and n is the problem size) and a polynomial number of processors. Many parallel algorithms have recently been presented to prove membership in this class for various problems. This kind of algorithms is frequently denoted NC-algorithms. Although the main motivation for these new algorithms often is to show that a given problem is in NC, the algorithms may also be taken as proposals for parallel algorithms that are fast in practice.

In the search for fast parallel algorithms on existing or future parallel computers, the above-mentioned research provides a rich source of ideas and hints.
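A canonical example of an NC-algorithm is computing all prefix sums of n numbers by recursive doubling: with n processors it needs only O(log n) synchronous steps. The sketch below is an illustration invented for this text, not an algorithm from the thesis; it simulates the parallel steps sequentially.

```python
def prefix_sums(x):
    """Simulate the O(log n)-step parallel prefix (recursive doubling) scheme.

    Each iteration of the while loop corresponds to one synchronous
    parallel step in which all positions are updated simultaneously;
    ceil(log2 n) such steps suffice.
    """
    s = list(x)
    n = len(s)
    d = 1
    while d < n:
        # All reads logically precede all writes, as in one PRAM step.
        s = [s[i] + (s[i - d] if i >= d else 0) for i in range(n)]
        d *= 2
    return s
```

For example, `prefix_sums([1, 2, 3, 4])` produces the running sums in two doubling steps instead of the n - 1 steps of the obvious sequential loop.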

The Gap Between Theory and Practice

Unfortunately, it may be very hard to assess the practical value of NC-algorithms. Some of the problems are described below.

Asymptotical analysis

Order notation (see Chapter 2) is a very useful tool when the aim is to prove membership of complexity classes, or if asymptotic behaviour for some other reason is detailed enough. It makes it possible to describe and analyse algorithms at a very high level. However, it also makes it possible to hide, willingly or unwillingly, a lot of details which cannot be omitted when algorithms are compared for realistic problems of finite size.
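To make the point concrete, consider two hypothetical cost models of the kind that O-notation collapses: an O(log n) algorithm with a large hidden constant and an O(log^2 n) algorithm with a small one. The constants below are invented for illustration (they are not measurements from this thesis), but they show how easily the asymptotically slower algorithm wins for all realistic n.

```python
import math

C_LOG, C_LOG2 = 100.0, 1.0   # hypothetical hidden constants

def t_log(n):
    """Running time model for an O(log n) algorithm with a large constant."""
    return C_LOG * math.log2(n)

def t_log2(n):
    """Running time model for an O(log^2 n) algorithm with a small constant."""
    return C_LOG2 * math.log2(n) ** 2

def crossover():
    """Smallest power of two for which the O(log n) algorithm is faster."""
    n = 2
    while t_log(n) >= t_log2(n):
        n *= 2
    return n

# t_log(n) < t_log2(n) exactly when log2(n) > C_LOG / C_LOG2 = 100,
# so the asymptotic winner needs more than 2**100 items before it pays off.
```

With these constants the crossover lies at n = 2^101, far beyond any problem that will ever be stored in a machine, even though the O(log n) algorithm is the clear asymptotic winner.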

Unrealistic machine model

Few theoretical algorithms are described as implementations on real machines. Their descriptions are based on various computational models. A computational model is synonymous with an abstract machine. One such model is the CREW PRAM (Concurrent Read, Exclusive Write, Parallel Random Access Machine) model, see Chapter 3.

Details are ignored

Most NC-algorithms originating from parallel complexity theory are presented at a very high level and in a compact manner. One reason for this is probably that parallel complexity theory is a field that to a large extent overlaps with mathematics, where the elegance and advantages of compact descriptions are highly appreciated. A revised and more complete description of many of these algorithms may often be found, a year or two after the conference where the algorithm was presented, in journals such as Journal of the ACM or SIAM Journal on Computing. These descriptions are generally a bit more detailed, but they are still far from the granularity level that is required for implementation. There are certainly several advantages implied by using such high-level descriptions. It is probably the best way to present an algorithm to a reader who is starting on a study of a specific algorithm. However, there is evidently a large gap between such initial studies and implementation of an algorithm.

(Many of these algorithms may be found in proceedings from conferences such as FOCS, the IEEE Symposium on Foundations of Computer Science, and STOC, the ACM Symposium on Theory of Computing, and in journals such as SIAM Journal on Computing and Journal of the ACM.)

Traditional Use of the CREW PRAM Model

The certainly most used [Agg, Col, Kar, RS] theoretical model for expressing parallel algorithms is the PRAM (Parallel RAM), also called the CREW PRAM, proposed by Fortune and Wyllie [FW] and discussed in Chapter 3. Its simplicity and generality make it possible to concentrate on the algorithm, without being distracted by the obstacles caused by describing it for a more specific and realistic machine model. Its synchronous operation makes it easy to describe and analyse programs for the model.

This is exactly what is needed in parallel complexity theory, where the main focus is on the parallelism inherent in a problem.
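The CREW rule itself is small enough to state as executable code. The sketch below is a hypothetical illustration (not part of the simulator described later in the thesis) of one synchronous step: any number of processors may read the same cell, all reads complete before any write takes effect, and two writes to the same cell are rejected.

```python
def crew_step(memory, reads, writes):
    """Execute one synchronous CREW PRAM step.

    memory: a list of cells, modified in place.
    reads:  {processor: address}; concurrent reads of one cell are legal.
    writes: {processor: (address, value)}; writes must be exclusive.
    Returns the value each reading processor observed.
    """
    # All reads happen first, so every processor sees the old memory state.
    observed = {p: memory[a] for p, a in reads.items()}
    # Exclusive write: no two processors may target the same address.
    targets = [a for a, _ in writes.values()]
    if len(targets) != len(set(targets)):
        raise RuntimeError("exclusive-write violation")
    for a, v in writes.values():
        memory[a] = v
    return observed
```

For example, two processors may concurrently read cell 0 while one of them writes cell 1 in the same step, but two writes to one cell raise an error; a synchronous MIMD program is a sequence of such steps.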

Implementing On an Abstract Model: Maybe Worthwhile

Implementing algorithms on an abstract machine may be regarded as a paradox. One of the reasons for using abstract machines has traditionally been a wish to avoid implementation details.

(Another reason may be found in the call for papers for the FOCS Symposium: a strict page limit was enforced on the submitted papers. Considering the complexity of most of these algorithms, a compact description is therefore a necessity.)

(As an example, compare the descriptions of Cole's parallel merge sort algorithm in [Col] with the detailed description given later in this thesis.)

However, implementing parallel algorithms on a CREW PRAM model will provide a deeper understanding, and lessen the gap between theory and practice. The following may be achieved by implementing a theoretical parallel algorithm on a CREW PRAM model:

Adeeper understanding Implementation enforces a detailed study of

all asp ects of an algorithm

Condence in your understanding Correctness verication of large

parallel programs is generally very dicult In practice the b est way

of getting condent of ones own understanding of a complicated algo

rithm is to make an implementation that works correctly Of course

testing is no correctness pro of but elab orate and systematic testing

maygive a larger degree of condence than program verication which

also may contain errors

A good help in mastering the complexity involved in implementing a complicated parallel algorithm on a real machine. A CREW PRAM implementation may be a good starting point for implementing the same algorithm on a more realistic machine model. This is particularly important for complicated algorithms. Going directly from an abstract mathematical description to an implementation on a real machine will often be too large a step. The existence of a CREW PRAM implementation may reduce this to two smaller steps.

Insight into the practical value of an algorithm. How can an unrealistic machine model be used to answer questions about the practical value of algorithms? It is important here to differentiate between negative and positive results. If a parallel algorithm A turns out to be inferior compared with sequential and/or parallel algorithms for more realistic models, the use of an unrealistic parallel machine model for executing A will only strengthen the conclusion that A is inferior. On the other hand, a parallel algorithm can not be classified as a better alternative for practical use if it is based on a less realistic model.

These advantages are independent of whether a parallel computer that resembles the CREW PRAM model ever will be built.

A New Direction for Research on Parallel Algorithms

To summarise: the practitioners measure and evaluate implemented algorithms for solving finite problems on existing computers, while the theorists are analysing the performance of high-level algorithm descriptions executed on abstract machines for solving infinitely large problems.

The approach presented in this thesis is to evaluate implemented algorithms executed on an abstract machine for solving finite problems. In Figure

                             Abstract machines       Existing machines

    Infinitely large
    problems                 Theory                  Not possible

    Finite problems          This work               Practice

Figure : How the approach of algorithm evaluation presented in this thesis can be viewed as a compromise between the two main approaches used by the theory and practice communities.

this is identified as a third approach for evaluating parallel algorithms.

It is unlikely that the work reported here is the first of this kind, so I will not claim that the approach for evaluating the practical value of theoretical parallel algorithms is new. However, I have not yet found similar work. To be better prepared for evaluating the computer algorithms of the future, this direction of research should, in my view, be given more attention.

Chapter

Parallel Algorithms: Central Concepts

The first half of this chapter presents some of the more important concepts on the theory side, mainly from parallel complexity theory. The second half is devoted to issues that are central when evaluating parallel algorithms in practice.

Most of the material in this chapter is introductory in nature and provides the necessary background for reading the subsequent chapters. Parts of the text, such as Section  on Amdahl's law and Section  on superlinear speedup, do also attempt to clarify topics about which there has been some confusion in recent years.

Parallel Complexity Theory

"The idea met with much resistance. People argued that faster computers would remove the need for asymptotic efficiency. Just the opposite is true, however, since as computers become faster, the size of the attempted problems becomes larger, thereby making asymptotic efficiency even more important."

John E. Hopcroft in his Turing Award Lecture "Computer Science: The Emergence of a Discipline" [Hop]

This section presents those aspects of parallel complexity theory that are most relevant for researchers interested in parallel algorithms. It does not claim to be a complete coverage of this large, difficult and rapidly expanding field of theoretical computer science. Most of the concepts are presented in a relatively informal manner to make them easier to understand. The reader should consult the references for more formal definitions.

Not all of the material is used in the following chapters; it is mainly provided to give the necessary background, but also to introduce the reader to very central results and theory that to a large extent are unknown among the practitioners. Parallel complexity theory is a relatively new and very fascinating field; it has certainly been the main inspiration for my work.

Introduction

The study of the amount of computational resources that are needed to solve various problems is the study of computational complexity. Another kind of complexity, which has been given less attention within computer science, is descriptive or descriptional complexity. An algorithm which is very complex to describe has a high descriptional complexity, but may still have a low computational complexity [Fre, WW]. In this thesis, when I use the term complexity, I mean computational complexity.

The resources most often considered in computational complexity are running time or storage space, but many other units may be relevant in various contexts.

The history of computational complexity goes back to the work of Alan Turing in the 1930s. He found that there exist so-called undecidable problems that are so difficult that they can not be solved by any algorithm. The first example was the Halting Problem [GJ]. However, as described by two of the leading researchers in the field, Stephen A. Cook and Richard Karp [Coo, Kar], the field of computational complexity did not really start before the 1960s, and the terms complexity and complexity class were first defined by Juris Hartmanis and Richard Stearns in the paper "On the computational complexity of algorithms" [HS].

Complexity theory deals with computational complexity. The most central concept here is the notion of complexity classes. These are used to classify problems according to how difficult or hard they are to solve. The most difficult problems are the undecidable problems first found by Turing. The most well-known complexity class is the class of NP-complete problems. Other central concepts are upper and lower bounds on the amount of computing resources needed to solve a problem.



Footnote: As an example, the total number of messages or bits exchanged between the processors is often measured for evaluating algorithms in distributed systems [Tiw].

Parallel complexity theory deals with the same issues as traditional complexity theory for sequential computations, and there is a high degree of similarity between the two [And]. The main difference is that complexity theory for parallel computations allows algorithms to use several processors in parallel to solve a problem. Consequently, other computing resources, such as the number of processors and the degree of communication among the processors, are also considered. The field emerged in the late 1970s, and it has not yet reached the same level of maturity as complexity theory for sequential computations.

A highly referenced book about complexity theory is "Computers and Intractability: A Guide to the Theory of NP-Completeness" by Garey and Johnson [GJ]. It gives a very readable introduction to the topic and a detailed coverage of the most important issues with many fascinating examples, and it provides a list of more than 300 NP-complete problems. A more compact survey is the Turing Award Lecture "An Overview of Computational Complexity" given by Stephen A. Cook [Coo]. The Turing Award Lectures given by Hopcroft and Karp [Hop, Kar] are also very inspiring. Ian Parberry's book "Parallel Complexity Theory" [Par] gives a detailed description of most parts of the field, but Chapter  of Richard Anderson's PhD Thesis "The Complexity of Parallel Algorithms" [And] is perhaps more suitable as introductory reading for computer scientists.

Basic Concepts, Definitions and Terminology

To be able to discuss complexity theory at a minimum level of precision, we must define and explain some of the most central concepts. This section may be skipped by readers familiar with algorithm analysis and complexity.

Problems, algorithms and complexity

Garey and Johnson [GJ] define a problem to consist of a parameter description and a specification of the properties required of the solution. The description of the parameters is general: it explains their nature (structure), not their values. If we supply one set of values for all the parameters, we have a problem instance. As an example, consider sorting. A formal description of the sorting problem might be:

SORTING

parameter description: A sequence of items S = <a_1, a_2, ..., a_n>, and a total ordering <= which states whether a_j <= a_k is true for any pair of items a_j, a_k from the set S.

solution specification: A sequence of items S' = <a_k1, a_k2, ..., a_kn> which contains the same items as S and satisfies the given total ordering, i.e. a_k1 <= a_k2 <= ... <= a_kn.

If we restrict ourselves to the set of integers, which are totally ordered by our standard interpretation of <=, any finite specific set of finite integers makes one instance of the sorting problem.
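The two requirements of the solution specification, that S' contains exactly the items of S and that it satisfies the ordering, can be checked mechanically. The following Python sketch is an illustration added here, not part of the thesis; the function name is invented for the example.

```python
from collections import Counter

def is_valid_sorting_solution(s, s_prime):
    """Check the two requirements of the SORTING specification:
    (1) s_prime contains exactly the same items as s (a permutation),
    (2) s_prime satisfies the ordering a_k1 <= a_k2 <= ... <= a_kn."""
    same_items = Counter(s) == Counter(s_prime)
    ordered = all(s_prime[i] <= s_prime[i + 1]
                  for i in range(len(s_prime) - 1))
    return same_items and ordered

# One instance of the sorting problem: a specific finite set of integers.
instance = [5, 1, 4, 1, 3]
print(is_valid_sorting_solution(instance, sorted(instance)))  # True
print(is_valid_sorting_solution(instance, [1, 1, 3, 4]))      # False: item lost
```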

For most problems there exist some instances that are easy to solve and others that are more difficult to solve. The following definition is therefore appropriate:

Definition (Algorithm) [GJ, p. ]
An algorithm is said to solve a problem Π if that algorithm can be applied to any instance I of Π and is guaranteed always to produce a solution for that instance I.

We will see that some algorithms solve all problem instances equally well, while other algorithms have a performance that is highly dependent on the characteristics of the actual instance.

In general, we are interested in finding the best or most efficient algorithm for solving a specific problem. By most efficient, one means the minimum consumption of computing resources. Execution time is clearly the resource that has been given most attention. This is reasonable for sequential computations, especially today when most computers have large memories. For parallel computations, other resources will often be crucial. Most important in general are the number of processors used and the amount of communication.

The time used to run an algorithm for solving a problem usually depends on the size of the problem instance. A tool for comparing the efficiency of various algorithms for the same problem should ideally be a yardstick which is independent of the problem size. However, the relative performance of two algorithms will most often be significantly different for various sizes of the problem instance. This fact makes it natural to use a judging tool which has the size of the problem instance as a parameter. A complexity function is such a tool, and it is perhaps the most central concept in complexity theory.



Footnote: As an example, consider the concept of presortedness in sorting algorithms [Man].

Footnote: As we shall see in Section , complexity classes provide this kind of problem size independent comparison.

Definition (Time complexity function)
A time complexity function for an algorithm is a function of the size of the problem instances that, for each possible size, expresses the largest number of basic computational steps needed by the algorithm to solve any possible problem instance of that size.

It is common to represent the size of a problem instance as a measure of the input length needed to encode the instance of the problem. In this thesis I will use the more informal term problem size. (See the discussion of input length and encoding scheme in [GJ].) For most sorting algorithms it is satisfactory to represent the size of the problem instance as the number of items to be sorted, denoted n. For simplicity, we will use n to represent the problem size of any problem for the rest of this section.

Worst case, average case and best case

The definition above leads to another central issue: the difference between worst case, average case and best case performance. Even if we restrict ourselves to problem instances of a given fixed size, there may exist easy and difficult instances. Again, consider sorting. Many algorithms are much better at sorting a sequence of n numbers which to a large degree are in sorted order than a sequence in total disorder [McG]. The definition describes that a time complexity function represents an upper bound for all possible cases, and a more explicit term would therefore be worst case time complexity function. The worst case function provides a guarantee and is certainly the most frequently used efficiency measure. If not otherwise stated, this is what we mean by a time complexity function in this thesis.

Best case time complexity functions might be defined in a similar way. Average case complexity functions are of obvious practical importance. Unfortunately, they are often very difficult to derive.
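For a small fixed problem size, however, all three measures can be computed by brute force. The following sketch is an illustration added here (not from the thesis): it counts the comparisons insertion sort makes on every permutation of n = 6 items, so the minimum, maximum and mean of these counts are exactly the best case, worst case and average case comparison counts for that size.

```python
from itertools import permutations

def insertion_sort_comparisons(seq):
    """Return the number of comparisons insertion sort makes on seq."""
    a, comparisons = list(seq), 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            comparisons += 1          # compare a[j-1] with a[j]
            if a[j - 1] <= a[j]:
                break
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
    return comparisons

n = 6
counts = [insertion_sort_comparisons(p) for p in permutations(range(n))]
# Best case (sorted input): n - 1 = 5.  Worst case (reversed): n(n-1)/2 = 15.
print(min(counts), max(counts), sum(counts) / len(counts))
```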

Definition  states that time is measured in basic computational steps. It is therefore necessary to make assumptions about the underlying computational model that executes the algorithm. This important topic is treated in Section . For the following text it is appropriate to think of a basic computational step as one machine instruction.

Order notation and asymptotic complexity

There is an infinite number of different complexity functions. The concept is therefore not an ideal tool to make broad distinctions between algorithms. We need an abstraction of the exact complexity functions which makes them more convenient to describe and discuss at a higher level, and easier to compare, without losing their most important information. Order notation has shown to be a successful abstraction of this kind. When we say that f(x) = O(g(x)), we mean, informally, that f grows at the same rate or more slowly than g when x is very large.

Definition (Big-O) [Wil, p. ]
f(x) = O(g(x)) if there exist positive constants c and x0 such that |f(x)| <= c g(x) for all x >= x0.

The order notation provides a compact way of representing the growth rate of complexity functions. This is illustrated in Table , where we have shown the exact complexity functions and their corresponding order expressions for four artificial algorithms. Note that it is common practice to include only the fastest growing term. Algorithm D is a so-called constant time algorithm, whose execution time does not grow with the problem size.

Table : Examples on the use of big-O order notation

    Algorithm    complexity function    order expression
    A             n^2 +  n              O(n^2)
    B             n^2                   O(n^2)
    C             n +  n log n          O(n log n)
    D            constant               O(1)

It is important to realise that order expressions are most precise when they represent asymptotic complexity, i.e. when the problem size n becomes infinitely large. The error introduced by omitting the low order terms approaches zero as n goes to infinity. Note also that we omit the multiplicative constants: algorithms A and B are regarded as equally efficient. This is essential for making order notation robust to changes of technology and implementation details.

Footnote: At least in the theory community.

Footnote: f(x) = O(g(x)) is read as "f(x) is of order at most g(x)" or "f(x) is big-oh of g(x)".

Footnote: Consider algorithm C. It would be correct to say that its complexity function is O(n^2), since n^2 grows faster than n log n. However, that would be a weaker statement, and there is in general no reason for saying less than we know in this context.

Footnote: As done when describing A as O(n^2) instead of O(n^2 + n).
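Omitting the multiplicative constants is also why an asymptotically slower algorithm can win for finite problem sizes, which is the central theme of this thesis. The sketch below is an added illustration with two invented complexity functions, an O(n log n) one with a large constant factor and an O(n^2) one with constant 1, and it searches for the crossover point.

```python
import math

def t_nlogn(n):
    return 10 * n * math.log2(n)   # O(n log n), but with a large constant

def t_square(n):
    return n * n                   # O(n^2), with constant 1

# Find the first problem size where the asymptotically
# faster algorithm actually becomes the faster one.
n = 2
while t_square(n) < t_nlogn(n):
    n += 1
print(n)  # 59: below this size the O(n^2) algorithm wins
```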

Upper and lower bounds

A substantial part of computer science research consists of designing and analysing new algorithms that in some sense are more efficient than those previously known. Such a result is said to establish an upper bound. Most upper bounds focus on the execution time; they describe asymptotic behaviour, and they are commonly expressed by using the big-O notation.

Definition (Upper bound) [Akl, p. ]
An upper bound on time, U(n), is established by the algorithm that, among all known algorithms for the problem, can solve it using the least number of steps in the worst case.

Several sequential algorithms exist that sort n items in O(n log n) time. One example is the Heapsort algorithm invented by J. Williams [Wil]. Consequently, we know that sorting can be done at least that fast on sequential computers. Each single one of these algorithms implies the existence of an upper bound of O(n log n), but note that the order notation is too rough to assess which single algorithm corresponds to the upper bound.
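Any single O(n log n) sorting algorithm is enough to witness this upper bound. The following sketch uses Python's standard heapq module rather than Williams' original in-place formulation; it is an added illustration, not an implementation from the thesis.

```python
import heapq

def heapsort(items):
    """Sort via a binary heap: building the heap is O(n), and each of
    the n removals is O(log n), giving the O(n log n) upper bound."""
    heap = list(items)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heapsort([9, 4, 7, 1, 8]))  # [1, 4, 7, 8, 9]
```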

Lower bounds are much more difficult to derive, as expressed by Cook in [Coo]: "The real challenge in complexity theory, and the problem that sets the theory apart from the analysis of algorithms, is proving lower bounds on the complexity of specific problems."

Definition (Lower bound) [Akl, p. ]
A lower bound on time, L(n), tells us that no algorithm can solve the problem in fewer than L(n) steps in the worst case.

While an upper bound is a result from analysing a single algorithm, a lower bound gives information about all algorithms that have been or may be designed to solve a specific problem. If a lower bound L(n) has been proved for a problem Π, we say that the inherent complexity of Π is L(n).

A lower bound is of great practical importance when designing algorithms, since it clearly states that there is no reason for trying to design algorithms that would be more efficient.

Footnote: The possible variations in factors such as computational model, assumptions about the input size and characteristics, and charging of the different computational resources imply the existence of several algorithms that all, in some way, are most efficient for solving a general problem.

The influence of assumptions is illustrated by the following lower bound given by Akl in [Akl]. This bound is trivial, but highly relevant in many practical situations. See also "A Zero-Time VLSI Sorter" by Miranker et al. [MLK].

If input and/or output are done sequentially, then every parallel sorting algorithm requires Ω(n) time units.

It is common to use big-omega to express such lower bounds. Ω(n) in this context is read as "omega of n" or "of order at least n". It can informally be perceived as the opposite of O. As done by several authors (Harel [Har], Garey and Johnson [GJ]), we will stick to big-O notation for expressing orders of magnitude.

An algorithm with a running time that matches the lower bound is said to be time optimal.

Models of Parallel Computation

"Thus even in the realm of uniprocessors, one lives happily with high-level models that often lie. This should cool down our ambition for true to life (realistic) multiprocessor models."

Marc Snir in "Parallel Computation Models: Some Useful Questions" [Sni]

While the RAM (random access machine) computational model, introduced by Aho, Hopcroft and Ullmann in [AHU], has been dominating for sequential computations, we are still missing such a unifying model for parallel processing [Val, Sanb]. A very large number of models of parallel computation have been proposed. The state is further cluttered up by the fact that various authors in the field often use different names on the same model. They also often have different opinions on how a specific model should operate [Sanb].

This section does not aim at giving a complete survey of all models. It starts by shortly mentioning some classification schemes and surveys, and then introduces the PRAM model and its numerous variants.



Footnote: In fact, Ω is the negation of o, which is a more precise variant of O. Θ, often read as "of order exactly", is still more precise. Exact definitions of the five symbols O, o, Ω, ω and Θ that are used to compare growth rates may be found in [Wil].

Footnote: This is to avoid introducing excessive notation; Ω(f(n)) may be expressed as "of order at least" O(f(n)).

The PRAM model is the most central concept in this thesis and is therefore thoroughly described in Chapter . The section ends with a discussion of important general properties of models and some comments on a very interesting model recently proposed by Leslie Valiant [Val].

Computing Models: Taxonomies and Surveys

A very large number of models for parallel computing has been proposed, ranging from realistic message passing asynchronous multiprocessors to idealised synchronous models. These models, and all the various kinds of parallel computers that have been built, form a diverse and complicated dynamic picture. This mess has motivated the development of numerous taxonomies and classification schemes for models of parallel computing and real parallel computers.

Flynn's taxonomy

Without doubt, the most frequently used classification of parallel computers is Flynn's taxonomy, presented by Michael Flynn in [Fly]. The taxonomy classifies computers into four broad categories:

SISD (Single Instruction stream, Single Data stream):
The von Neumann model, the RAM model and most serial (i.e. uniprocessor) computers fall into this category. A single processor executes a single instruction stream on a single stream of data. However, SISD computers may use parallelism in the form of pipelining to speed up the execution in the single processor. The CRAY, one of the most successful supercomputers, is classified as a SISD machine by Hwang and Briggs [HB].

SIMD (Single Instruction stream, Multiple Data stream):
Models and computers following this principle have several processing units, each operating on its own stream of data. However, all the processing units are executing the same instruction from one instruction stream, and they may therefore be realised with a single control unit. Array processors are in this category. Classical examples are ILLIAC IV, ICL DAP and MPP [HB]. The Connection Machine [Hil, HS] is also SIMD. Most SIMD computers operate synchronously, using a single global clock.

Many authors classify the PRAM model as SIMD, but this is not consistent with the original papers defining that model (see Section ).

MISD (Multiple Instruction stream, Single Data stream):
This class is commonly described to be without examples. It can be perceived as several computers operating on a single data stream as a "macropipeline" [HB]. Pipelined vector processors, which commonly are classified as SISD, might also be put into the MISD class [AG].

MIMD (Multiple Instruction stream, Multiple Data stream):
These machines have several processors, each being controlled by its own stream of instructions and each operating on its own stream of data. Most multiprocessors fall into this category. The processors in MIMD machines may communicate through a shared global memory (tightly coupled) or by message passing (loosely coupled). Examples are Alliant, C.mmp, CRAY, CRAY X-MP, iPSC and NCube [AG, HJ].

Some authors use the term MIMD almost as synonymous with asynchronous operation [Akl, Ste, Dun]. It is true that most MIMD computers which have been built operate asynchronously, but there is nothing wrong with a synchronous MIMD computer. See also Section .

In practice, Flynn's taxonomy gives only two categories of parallel computers: SIMD and MIMD. Consequently, it has long been described as too coarse [Sny, AG].

Duncan's taxonomy and other classification schemes

Ralph Duncan [Dun] has very recently given a new taxonomy of parallel computer architectures. It is illustrated in Figure . Duncan's taxonomy is a more detailed classification scheme than Flynn's taxonomy. It is a high-level, practically (or machine) oriented scheme, well suited for classifying most contemporary parallel computers into a small set of classes. The reader is referred to Duncan's paper for more details.

Many other taxonomies have been proposed. Basu [Bas] has described a taxonomy which is tree structured in the same way as Duncan's taxonomy, but more detailed and systematic. Other classification schemes have been given by Handler (see for instance [HB], page ), Feng (see [HB], page ), Kuck (see [AG], page ), Snyder [Sny] and Treleaven (see [AG], page ).

    Synchronous
        Vector
        SIMD
            Processor array
            Associative memory
        Systolic

    MIMD
        Distributed memory
        Shared memory

    MIMD paradigm
        MIMD/SIMD
        Dataflow
        Reduction
        Wavefront

Figure : Duncan's taxonomy of parallel computer architectures [Dun].

Surveys

A good place to start reading about existing computers and models for parallel processing is Chapter  of "Designing Efficient Algorithms for Parallel Computers" written by Quinn [Qui]. Surveys are also given by Kuck [Kuc], Miklosko and Kotov [MK], Akl [Akl], DeCegama [DeC], and Almasi and Gottlieb [AG]. H. T. Kung, who is well known for his work on systolic arrays, has written a paper, "Computational models for parallel computers" [Kun], which advocates the importance of computational models. The scope of the paper is restricted to 1D (i.e. linear) processor arrays, but Kung describes  different computational models for this rather restricted kind of parallel processing. This illustrates the richness of parallel processing.

Theoretical models

The taxonomies and surveys mentioned so far are mainly representing parallel computers that have been built. Some of the more realistic models for parallel computation may be classified with these taxonomies. However, many of the more theoretic models do not fit into the schemes. Again, we notice the gap between theory and practice.

Footnote: For more detailed descriptions of the most important parallel computers that have been made, see Hwang and Briggs [HB] and Hockney and Jesshope [HJ].

The paper "Towards a complexity theory of synchronous parallel computation", written by Stephen Cook [Coo], describes many of the most central models for parallel computation that are used in the theory community. In Cook's terminology, these are uniform circuit families, alternating Turing machines, conglomerates, vector machines, parallel random access machines, aggregates and hardware modification schemes. The paper is highly mathematical, and the reader is referred to the paper for more details. "A Taxonomy of Problems with Fast Parallel Algorithms", also written by Cook [Coo], is mainly focusing on algorithms but does also give an updated view of the most important models. The paper gives an extensive list of references to related work.

The first part of "Routing, Merging and Sorting on Parallel Models of Computation" by Borodin and Hopcroft [BH], and "Parallel Machines and their Communication Theoretical Limits" by Reischuk [Rei], give overviews with a bias which is closer to practical computers. More specifically, they give more emphasis to the shared memory models, which are easier to program but less mathematically convenient.

Perhaps the most friendly introduction to theoretical models for parallel computation for computer scientists is Chapter  of James Wyllie's PhD Thesis "The Complexity of Parallel Computations" [Wyl]. Wyllie motivates and describes the PRAM model, which probably is the most used theoretical model for expressing parallel algorithms. It is described below.

The PRAM Model and its Variants

"Most parallel algorithms in the literature are designed to run on a PRAM."

Alt, Hagerup, Mehlhorn and Preparata in "Deterministic Simulation of Idealised Parallel Computers on More Realistic Ones" [AHMP]

"The standard model of synchronous parallel computation is the PRAM."

Richard Anderson in "The Complexity of Parallel Algorithms" [And]

Footnote: I have not studied all the details of this and some of the other mathematical papers which are referenced. The purpose of the short description and the references is to introduce the material and give exact pointers for further reading.

"The parallel random-access machine (PRAM) is by far the most popular model of parallel computation."

Bruce Maggs in "Beyond Parallel Random-Access Machines" [Mag]

The most popular model for expressing parallel algorithms within theoretical computer science is the PRAM model [Agg, Col, Kar, RS].

The Original PRAM of Fortune and Wyllie

The parallel random access machine (PRAM) was first presented by Steven Fortune and James Wyllie in [FW]. It was further elaborated in Wyllie's well-written PhD thesis "The Complexity of Parallel Computations" [Wyl]. The PRAM is based on random access machines (RAMs) operating in parallel and sharing a common memory. Thus it is, in a sense, the model in the world of parallel computations that corresponds to the RAM (Random Access Machine) model, which certainly is the prevailing model for sequential computations. The processors operate synchronously. The connection to the RAM model should not be a surprise, since John Hopcroft was the thesis advisor of James Wyllie.

Today, a large number of variants of the PRAM model are being used. The original PRAM model of Fortune and Wyllie, which is the main underlying concept of this work, is described in Chapter . Below, we describe its main variants and mention some of the other names that have been used for these.

The EREW, CREW and CRCW Models

Today, the most used name for the original PRAM model is probably CREW PRAM [SV]. CREW is an abbreviation for the very central concurrent read, exclusive write property. The main reason for this name is the possibility to explicitly distinguish the model from the two other variants: EREW PRAM and CRCW PRAM.

The EREW PRAM does not allow concurrent read from the global memory. It may be regarded as more realistic than the CREW PRAM, and some authors present algorithms in two variants: one for the CREW PRAM and another for the EREW PRAM model. The EREW PRAM algorithms are in general more complex; see for instance [Col].

Footnote: This means that several processors may, at the same time step, read the same variable (location) in global memory, but they may not write to the same global variable simultaneously.

The CRCW PRAM allows concurrent writing in the global memory and is less realistic than the CREW PRAM. It exists in several variants, mainly differing in how they treat simultaneous writes to the same memory location (see below). Reischuk [Rei] discusses the power of concurrent read and concurrent write. The ERCW variant, which is the fourth possibility, is in general not considered, because it seems more difficult to realise simultaneous writes than simultaneous reads [Rei].

The CREW PRAM model has probably become the most popular of these variants because it is the most powerful model which also is well defined; the CRCW variant exists in many subvariants.

The various CRCW models

One of the problems with the CRCW PRAM model is the use of several definitions for the semantics of concurrent writing to the global memory. The main subvariants for the handling of two or more simultaneous write operations to the same global memory cell are:

Common value only: An arbitrary number of processors may write to the same global memory location, provided that they all write the same value. Writing different values is illegal. [Fei, Akl, BH, Rei]

Arbitrary winner: One arbitrary processor of those trying to write to the same location succeeds. The values attempted written by the other processors are lost. This gives non-deterministic operation. [Fei, Rei, BH]

Ordered: The processors are ordered by giving them different priorities, and the processor with the highest priority wins. [Fei, Rei, Akl, BH]

Sum: The sum of the quantities written is stored. [Akl]

Maximum: The maximum of the values written is stored. [Rei]

Garbage: The resulting value is undefined. [Fei]
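The differences between these policies can be made concrete with a small simulation. The sketch below is only an added illustration of the write-conflict rules listed above, not a formal PRAM definition; the function and policy names are invented for the example.

```python
import random

def resolve_crcw_write(writes, policy):
    """Resolve one time step in which several processors attempt to
    write to the same global memory cell.  writes is a list of
    (priority, value) pairs; a lower priority number means a higher
    priority.  Returns the value left in the cell."""
    values = [value for _, value in writes]
    if policy == "common":
        if len(set(values)) != 1:
            raise ValueError("illegal: different values written")
        return values[0]
    if policy == "arbitrary":
        return random.choice(values)        # non-deterministic winner
    if policy == "ordered":
        return min(writes)[1]               # highest-priority processor wins
    if policy == "sum":
        return sum(values)
    if policy == "maximum":
        return max(values)
    if policy == "garbage":
        return None                         # resulting value is undefined
    raise ValueError("unknown policy")

writes = [(2, 5), (0, 7), (1, 3)]
print(resolve_crcw_write(writes, "ordered"))   # 7
print(resolve_crcw_write(writes, "sum"))       # 15
print(resolve_crcw_write(writes, "maximum"))   # 7
```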

PRAC, PRAM, WRAM, etc.: Terminology

Footnote: This is often called a write conflict.

The various variants of the PRAM model have been used under several different names. The following list is an attempt to reduce possible confusion and to mention some other variants.

PRAC corresponds to the EREW PRAM [BH].

PRAM is the original PRAM model of Fortune and Wyllie, and has often been given rather different names. In [QD] the term MIMD-TC-R is used, and in [Qui] SIMD-SM-R is used.

WRAM corresponds to the CRCW PRAM [BH, GR].

CRAM, ARAM, ORAM, and MRAM have been used by Reischuk [Rei] to denote, respectively, the Common value only, Arbitrary winner, Ordered, and Maximum variants of the CRCW PRAM model (see above).

DRAM is short for distributed random access machine, and has been proposed by Leiserson and Maggs as a more realistic alternative to the PRAM model [Mag].

General Properties of Parallel Computational Models

Motivation



A computing model is an idealised, often mathematical, description of an existing or hypothetical computer. There are many advantages of using computing models:

Abstractions simplify: When developing software for any piece of machinery, a computing model should make it possible to concentrate on the most important aspects of the hardware, hiding low-level details.

Common platform: A computing model should be an agreed-upon standard among programmers in a project team. This makes it easier to describe and discuss measured or experienced performance of various parts of a software system. A good model will increase software portability and also provide a common language for research and teaching [Sni]. See also the paragraph below about Valiant's bridging BSP model.

Analysis and performance models: In theoretical computer science, computational models have played a crucial role by providing a common base for algorithm analysis and comparison. (Note that the literature uses a wide variety of terms for this concept; examples are computing model, computer model, computation model, computational model, and model of computation.) The RAM model has been used as a common base for asymptotic analysis and comparison of sequential algorithms. For parallel computations, the PRAM model of Fortune and Wyllie [FW, Wyl] has made it possible for a typical analysis of parallel algorithms to concentrate on a small set of well defined quantities: the number of parallel steps (time), the number of processors (parallelism), and sometimes also the global memory consumption (space). More realistic models typically use a larger number of parameters to describe the machine, and on the other side of the spectrum we have special purpose performance models, which are highly machine dependent and often also specific to a particular algorithm. See for instance [AC, MB].

What is the right model?

The selection of the appropriate model for parallel computation is certainly one of the most widely discussed issues among researchers in parallel processing. This is clearly reflected in the proceedings of the NSF-ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana]. There are many difficult tradeoffs in this context. Some examples are easy high-level programming vs. efficient execution (e.g. shared memory vs. message passing [Bil]), general purpose model vs. special purpose model (e.g. PRAM vs. machine or algorithm specific models), easy to use vs. mathematical convenience (e.g. PRAM vs. boolean circuit families [Coo]), and easy to analyse vs. easy to build (e.g. synchronous vs. asynchronous).

Instead of arguing for some specific type of model as the best, we will describe important properties of a good model. Some of these properties are overlapping, and unfortunately, several of the desired properties are conflicting.

Easy to understand: Models that are difficult to understand, or complex in some sense (for instance by containing a lot of parameters, or by allowing several variants), will be less suitable as a common platform. Different ways of understanding the model will lead to different use, and to reduced possibilities of sharing knowledge and experience. The importance of this issue is exemplified by the PRAM model: in spite of being one of the simplest models for parallel computations, it has been understood as a SIMD model by many researchers, and as a MIMD model by others (see Section ).

Easy to use: Designing parallel algorithms is in general a difficult task. A good model should help the programmer to forget unnecessary low-level details and peculiarities of the underlying machine. The model should help the programmer to concentrate on the problem and the possible solution strategies; it should not add to the difficulties of the program design process. Simple, high level models are in general the easiest to use. They are well suited for teaching and for documentation of parallel algorithms. However, when designing software for contemporary parallel computers, one is often forced to use more complicated and detailed models to avoid losing contact with the realities. Synchronous models are in general easier to program than asynchronous models. This is reflected by the fact that some authors use the term chaotic models to denote asynchronous models [Gib]. Shared memory models seem to be generally more convenient for constructing algorithms [BH, Par].

Well defined: A good model should be described in a complete and unambiguous way. This is essential for acting as a common platform. The so-called CRCW PRAM model exists in many variants (see above), and is therefore less popular than the CREW PRAM model.

General: A model is general if it reflects many existing machines and more detailed models. The use of general models yields more portable results. The RAM model has been a very successful general model for uniprocessor machines.

Performance-representative: Marc Snir has written an interesting paper [Sni] discussing, at a general level, to what extent high level models lead to the creation of programs that are efficient on the real machine. Informally, good models should give a performance rating of theoretical algorithms (i.e. run on the model) that is closely related to the rating obtained by running the algorithms on a real computer [Sni]. See also [Sny]. In other words, the practically good algorithms should be obtainable by refining the theoretically good ones.

Realistic: Many researchers stress the importance of a computing model being feasible [Agg, Col, Sny]. Models that can be realised with current technology, without violating too many of the model assumptions, have many advantages. Above all, they give a model performance which is representative of real performance, as discussed above. With respect to feasibility, there is a great difference between models that are based on a fixed connection network of processors, and models based on the existence of a global (or shared) memory. The fixed connection models are much easier to realise, and are in fact the best way of implementing shared memory models with current technology [BH, RBJ]. Unfortunately, the fixed connection models are in general regarded as more difficult to program. Similarly, asynchronous operation of the processors is realistic, but generally accepted as leading to more difficult programming [Ste]. On the other hand, there are researchers arguing for more idealised and powerful models than the PRAM, which is the standard shared memory model [RBJ]. Note also that there are reasonable arguments for a model being far from current technology [Vis, Val]. This is explained in the following property and in the next paragraph.

Durable: Uzi Vishkin [Vis] argues that computing models should be robust in time, in contrast to technological feasibility, which rapidly keeps advancing. Again, the RAM model is an example. Changing models too often will greatly reduce the possibilities of sharing information and building on other work. However, machines will and should change; new technological opportunities continue to appear. In practice, this is an argument against realistic models, such as asynchronous fixed connection models (sometimes also called message passing models). A similar view has been expressed by Leslie Valiant [Val].

Mathematical convenience: Stephen Cook describes shared memory models as unappealing for an enduring mathematical theory, due to the arbitrariness in their detailed definition. He advocates uniform Boolean circuit families as more attractive for such a theory [Coo]. In my view, models based on circuit families are not suited for expressing large practical parallel systems. Most algorithm designers use the PRAM shared memory model, and it should be noted that it contains two drastic assumptions, which are introduced for mathematical convenience [Wyl]. Assuming synchronous operation of all the processors makes the notion of running time well defined, and is crucial for analysing the time complexity of algorithms. Assuming unbounded parallelism (i.e. that an unbounded number of processors is available) makes it possible to handle asymptotical complexity.
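To make the synchronous-operation assumption concrete, the following sketch (my own illustration, not part of the thesis) counts well-defined parallel steps for a CREW-style summation: one virtual processor per element, and in every step all reads logically precede all writes.

```python
def crew_parallel_sum(data):
    """Sum n values in ceil(log2 n) synchronous parallel steps, with one
    virtual processor per element. Each step works on a snapshot of the
    old memory, mimicking the PRAM's lock-step read/compute/write cycle."""
    mem = list(data)
    n = len(mem)
    steps = 0
    stride = 1
    while stride < n:
        old = list(mem)                    # all reads precede all writes
        for i in range(0, n, 2 * stride):  # the processors active this step
            if i + stride < n:
                mem[i] = old[i] + old[i + stride]
        stride *= 2
        steps += 1
    return mem[0], steps

print(crew_parallel_sum(list(range(8))))   # (28, 3): sum 28 in 3 steps
```

Because the processors are assumed to run in lock-step, "3 parallel steps" is a well-defined running time, exactly the quantity the PRAM analyses count.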

Valiant's Bridging BSP Model

Leslie Valiant has recently written a very interesting paper [Val] where he advocates the need for a bridging model for parallel computation, and proposes the bulk-synchronous parallel (BSP) model as a candidate for this role.

He attributes the success of the von Neumann model of sequential computation to the fact that it has been a bridge between software and hardware. On the one side, the software designers have been producing a diverse world of increasingly useful and resource demanding software assuming this model. On the other side, the hardware designers have been able to exploit new technology in realising more and more powerful computers providing this model.

Valiant claims that a similar standardised bridging model for parallel computation is required before general purpose parallel computation can succeed. It must imply convenient programming, so that the software people can accept the model over a long time. Simultaneously, it must be sufficiently powerful for the hardware people to continuously provide better implementations of the model. A realistic model will not act as a bridging model, because technological improvements are likely to make it old-fashioned too early. For the same reason, a bridging model should also be simple, and not defined at a too detailed level. There should be open design choices allowing better implementations without violating the model assumptions [Val].

Valiant's BSP model of parallel computation is defined as the combination of three attributes: a number of processing components, a message router, and a synchronisation facility. The processing components perform processing and/or memory functions. The message router delivers messages between processing components. The synchronisation facility is able to synchronise all, or a subset of, the processing components at regular intervals. The length of this interval is called the periodicity, and it may be controlled by the program. In the interval between two synchronisations, called a superstep, the processing components perform computations asynchronously. In this sense, the BSP model may be seen as a compromise between asynchronous and synchronous operation. Phillip B. Gibbons has expressed similar thoughts, and uses the term semi-synchronous for this kind of operation [Gib].

(An example from another field of computer science is the success of the relational model in the database world. This may be explained by the fact that the relational model has acted as a bridge between users of database systems and implementors. When proposed, the model was simple and general, high level and advanced. Database system designers have worked hard for a long time to be able to provide efficient implementations of the relational model. This example was pointed out to me by P. Thanisch [Tha].)

A computer based on the BSP model may be programmed in many styles, but Valiant claims that a PRAM language would be ideal. The programs should have so-called parallel slackness, which means that they are written for v virtual processors run on p physical processors, v > p. This is necessary for the compiler to be able to massage and assemble bulks of instructions executed as supersteps, giving an overall efficient execution. Valiant's BSP model is a totally new computing concept, requiring drastically new designs of compilers and runtime systems.

The BSP model can be realised in many ways, and Valiant outlines implementations using packet switching networks or optical crossbars as two alternatives [Val].
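The superstep structure described above can be sketched in a few lines. This is my own toy illustration of the BSP skeleton (the names bsp_run, local_step, etc. are invented), not code from Valiant's paper.

```python
def bsp_run(n_components, local_step, n_supersteps):
    """Run a toy BSP computation: in each superstep every processing
    component computes asynchronously on its local state and posts
    messages, which the router only delivers at the barrier that ends
    the superstep."""
    state = [{"data": i, "inbox": []} for i in range(n_components)]
    for _ in range(n_supersteps):
        outbox = []
        for pid, comp in enumerate(state):        # asynchronous phase
            outbox.extend(local_step(pid, comp))  # (destination, message)
        for dest, msg in outbox:                  # router + barrier
            state[dest]["inbox"].append(msg)
    return state

# Example step: absorb incoming values, then send the running total
# to the right-hand neighbour (cyclically, with 4 components).
def step(pid, comp):
    comp["data"] += sum(comp["inbox"])
    comp["inbox"] = []
    return [((pid + 1) % 4, comp["data"])]

final = bsp_run(4, step, n_supersteps=2)
print([c["data"] for c in final])   # [3, 1, 3, 5]
```

The point of the skeleton is that components never see each other's state within a superstep; all communication takes effect only at the barrier, which is what makes the model "semi-synchronous".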

Important Complexity Classes

There are three levels of problems. The simplest is what we meet in everyday life: solving a specific instance of a problem. The next level up is met by algorithm designers: making a general recipe for solving a subset of all possible instances of a problem. Above that, we have a metatheoretic level, where the whole structure of a class of problems is studied [Frea]. At this highly mathematical level, one is interested in various kinds of relationships between a large number of complexity classes; see for instance Garey and Johnson [GJ].

A complexity class can be seen as a practical interface between these two topmost levels. The complexity theorist classifies various problems, and also extends the general knowledge of the different complexity classes, while the algorithm designer may use this knowledge once parts of his practical problem have been classified.

(Reading this at the end of the work reported in this thesis was highly encouraging; most of the practical work has been the design and use of a prototype high level PRAM language, described in a later chapter.)

(It seems to me that Cole's parallel merge sort algorithm, described in a later chapter, may be a good example of a program with parallel slackness.)

P, NP, and NP-completeness

The most important complexity classes for sequential computations are P, NP, and the class of NP-complete problems, often abbreviated NPC [GJ, Har]. It is common to define complexity classes in terms of Turing machines and language recognition [GJ, RS], but more informal definitions will suffice here. In the discussion of reasonable parallelism below, we will see that these complexity classes are also highly relevant for parallel computations.

Definition (The class P):
The class P is the class of problems that can be solved on a sequential computer by a deterministic polynomial time algorithm.

Definition (Polynomial time algorithm) [GJ]:
A polynomial time algorithm is an algorithm whose time complexity function is O(f(n)) for some polynomial function f, where n is the problem size.

Definition (Exponential time algorithm) [GJ]:
An exponential time algorithm is an algorithm whose time complexity function can not be bounded by a polynomial function.

The word deterministic might have been omitted from the definition of the class P, since it corresponds to our standard interpretation of what an algorithm is. The motivation for including it becomes clear when we have defined the larger class NP, which also includes problems that can not be solved by polynomial time algorithms.

Definition (The class NP):
Standard formulation: The class NP is the class of problems that can be solved by a nondeterministic polynomial time algorithm.
Practical formulation: The class NP is the class of problems whose proposed solutions can be verified (to be a solution or not) by a deterministic polynomial time algorithm.
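The practical formulation can be illustrated with the well-known SUBSET-SUM problem (my own example, not from the thesis): no polynomial time algorithm is known for finding a suitable subset, yet a proposed solution, a so-called certificate, is checked in linear time.

```python
def verify_subset_sum(numbers, target, certificate):
    """Deterministic polynomial-time verifier for SUBSET-SUM:
    certificate is a list of indices claimed to select a subset
    of `numbers` summing to `target`."""
    if len(set(certificate)) != len(certificate):
        return False                       # indices must be distinct
    if any(i < 0 or i >= len(numbers) for i in certificate):
        return False                       # indices must be in range
    return sum(numbers[i] for i in certificate) == target

nums = [3, 34, 4, 12, 5, 2]
print(verify_subset_sum(nums, 9, [2, 4]))   # 4 + 5 == 9 -> True
print(verify_subset_sum(nums, 9, [0, 1]))   # 3 + 34 != 9 -> False
```

The verifier's existence is what places SUBSET-SUM in NP; the "magical" guessing stage of a nondeterministic algorithm corresponds to conjuring up a certificate that the verifier then accepts.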

Note that we avoided the term reasonable computer in the standard formulation above. This is because the concept nondeterministic algorithm intentionally contains a so-called guessing stage [GJ], which informally is assumed to guess the correct solution. This feature is magical [Har], and makes the properties of the underlying machine model irrelevant.

The most celebrated complexity class is the class of NP-complete problems. Informally, the class can be said to contain the hardest problems in NP. Though it is not proved, it is generally believed that none of these problems can be solved by polynomial time algorithms. Today, a large number of problems are known to be NP-complete [Har]. If you find an algorithm that



Figure: The complexity classes NP, P, and NPC.

solves one of these problems in polynomial time, you have really made a giant breakthrough in computer science. This should become clear from the definition of the class NPC.

Definition (NP-complete problems, NPC):
NPC is the class of NP-complete problems. A problem Π is NP-complete if (i) Π is in NP, and (ii) for any other problem Π′ in NP there exists a polynomial transformation from Π′ to Π.

A polynomial transformation from A to B is a deterministic polynomial time algorithm for transforming (or reducing) any instance I_A of A to a corresponding instance I_B of B, so that the solution of I_B gives the required answer for I_A. As a consequence (see the corresponding lemma in [GJ]), it can be proved that a polynomial time algorithm for solving problem B, combined with this polynomial transformation, yields a polynomial time algorithm for solving A. Thus, the definition states that a polynomial time algorithm for a single NP-complete problem implies that all problems in NP can be solved in polynomial time. The relationship between the classes NP, P, and the class NPC containing the NP-complete problems is shown in the figure above.

Reading part (ii) of the definition, one might think that much work is required to prove that a problem is NP-complete. Fortunately, it is not necessary to derive polynomial transformations from all other problems in NP. This and other aspects of NP-completeness will become clear in the discussion below, where we describe P-completeness and its strong similarities with NP-completeness.

Fast Parallel Algorithms and Nick's Class

With respect to parallelisation, the problems inside P can be divided into two main groups: the class NC, and the class of P-complete problems. The class NC contains problems which can be solved by fast parallel algorithms that use a reasonable number of processors.

Definition (Nick's class, NC):
Nick's Class (NC) is the class of problems that can be solved in polylogarithmic time, i.e. with time complexity O(log^k n) where k is a constant, using a polynomial number of processors, i.e. a number bounded by O(f(n)) for some polynomial function f, where n is the problem size.

NC is an abbreviation for Nick's Class [Coo], and is now the commonly used name for this complexity class, which was first identified and characterised by Nicholas Pippenger [Pip].

It is relatively easy to prove that a problem is in NC: once one algorithm with O(log^k n) time consumption, using no more than a polynomial number of processors, that solves the problem has been found, membership in NC is proved. Such an algorithm is often called an NC-algorithm. Thus, Batcher's O(log^2 n) time sorting network using n processors (see Section ) proves that sorting is in NC. Many other important problems are known to be in NC. Some examples are matrix multiplication, matrix inversion, finding the minimum spanning forest of a graph [Coo], and the shortest path problem [CM]. The book Efficient Parallel Algorithms, written by Gibbons and Rytter [GR], provides detailed descriptions of a large number of NC-algorithms.
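As a concrete taste of an NC-algorithm, here is a sketch (my own illustration in the spirit of such algorithms, not code from [GR]) of the classic pointer jumping technique: every node of a linked list learns its distance to the tail in O(log n) synchronous rounds, using one virtual processor per node.

```python
def pointer_jump_ranks(successor):
    """successor[i] is the next node after i; the tail points to itself.
    Returns (ranks, rounds), where ranks[i] is the distance from node i
    to the tail, computed by O(log n) pointer-jumping rounds."""
    n = len(successor)
    nxt = list(successor)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    rounds = 0
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        old_nxt, old_rank = list(nxt), list(rank)
        for i in range(n):                  # one processor per node
            if old_nxt[i] != i:
                rank[i] = old_rank[i] + old_rank[old_nxt[i]]
                nxt[i] = old_nxt[old_nxt[i]]   # jump over the successor
        rounds += 1
    return rank, rounds

# List 0 -> 1 -> 2 -> 3 (node 3 is the tail):
print(pointer_jump_ranks([1, 2, 3, 3]))   # ([3, 2, 1, 0], 2)
```

In each round every pointer distance doubles, so ceil(log2 n) rounds suffice: polylogarithmic time with a polynomial (here linear) number of processors, exactly the NC requirements.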

NC is robust

The robustness of Nick's Class is probably the main reason for its popularity. It is robust because it is, to a large extent, insensitive to differences between many models of parallel computation. NC will contain the same problems whether we assume the EREW, CREW or CRCW PRAM model, or some other models such as uniform circuits [Coo, And, Har, GR]. The robustness of NC allows us to ignore the polylogarithmic factors that separate the various models.

 

(This robustness is shown by various theoretical results describing how theoretical models may be simulated by more realistic models; see for instance [AHMP, MV, SV, Upf].)

More refined complexity classes, such as NC^1 and NC^2, may be defined within NC. NC^1 is used to denote the class of problems in NC that can be solved in O(log n) time, whereas NC^2 is the class of problems solvable in O(log^2 n) time. These classes are not robust to changes in the underlying model, and should therefore only be used together with a statement of the assumed model. Many other subclasses within NC are described in Stephen Cook's article A Taxonomy of Problems with Fast Parallel Algorithms [Coo].

Reasonable parallelism

Classifying a polynomial processor requirement as reasonable may, in general, need some explanation. Practitioners would for most problems say that a processor requirement that grows faster than the problem size might not be used in practice. However, in the context of complexity theory it is common practice to say that problems in P can be solved in reasonable time, as opposed to problems in NPC, which currently only can be solved in unreasonable (i.e. exponential) time. Similarly, we distinguish between reasonable (polynomial) and unreasonable (exponential) processor requirements.



NP-complete problems can, at least in theory, be solved in polynomial time if we allow an exponential number of processors to remove the magical nondeterminism (see [Har]). However, if we restrict ourselves to a reasonable number of processors, the NP-complete problems remain outside P. In this sense, the traditional complexity class P is very robust: P contains the same problems whether we assume one of the standard sequential computational models, or parallel processing on a reasonable number of processors. A polynomial number of processors may be simulated with a polynomial slowdown on a single processor. This implies that all problems in NC must be included in the class P.
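The simulation argument in the last two sentences can be sketched directly (my own illustration; the names are invented): executing each of the T parallel steps by stepping through all p processors in turn costs O(p T) sequential operations, a polynomial slowdown whenever p and T are polynomial.

```python
def simulate_sequentially(n_procs, n_steps, action, mem):
    """Run a synchronous parallel algorithm on one processor: within each
    parallel step, execute every processor's action in turn against a
    snapshot of memory, so the lock-step semantics are preserved."""
    seq_ops = 0
    for step in range(n_steps):
        snapshot = list(mem)               # reads of this step come first
        for pid in range(n_procs):         # round-robin over processors
            action(pid, step, snapshot, mem)
            seq_ops += 1
    return mem, seq_ops

# Toy action: every processor increments its own cell once per step.
def incr(pid, step, snapshot, mem):
    mem[pid] = snapshot[pid] + 1

result, ops = simulate_sequentially(4, 3, incr, [0, 0, 0, 0])
print(result, ops)   # [3, 3, 3, 3] 12  (4 processors * 3 steps)
```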

NC may be misleading

In theoretical computer science, a large number of papers have been published in the past years proving various subproblems to be in NC. These results are of course valuable contributions to parallel complexity theory. However, if you are searching for good and practical parallel algorithms, the titles of these papers may often be misleading if you do not know the terminology, or do not understand the limitations of complexity theory. For



(It is far from clear that this can be done in practice, due to inherent limitations of three-dimensional space [Coo, Har, Fis].)

(To see this, assume the opposite: a parallel algorithm using p1(n) processors that solves an NP-complete problem in p2(n) time, where p1(n) and p2(n) are both polynomials. Simulating this algorithm on a single processor would then give a polynomial time algorithm, of O(p1(n) p2(n)) time; the problem must be in P, and we must have NPC ⊆ P.)

instance, "Fast Parallel Algorithm for Solving ..." may denote an algorithm with O(log^k n) time consumption but with very large complexity constants, which make it slower in practice than well known and simpler algorithms for the same problem. Similarly, "Efficient Parallel Algorithm for Solving ..." may entitle a polylogarithmic time algorithm with a large polynomial processor requirement, resulting in a very high cost and a very low efficiency, if we use the definition of efficiency which is common in other parts of the parallel processing community (see page ). It is therefore possibly more appropriate to describe problems in NC as highly parallelisable [GR]. Further, the term NC-algorithm is more precise, and may often be less misleading than saying that an algorithm is fast or efficient.

It has been argued that too much focus has been put on developing NC-algorithms [Kar]. The search for algorithms with polylogarithmic time complexity has produced a lot of algorithms with a large, but polynomially bounded, number of processors. These algorithms typically have a very high cost compared with sequential algorithms for the same problems, and consequently low efficiency. Richard Karp has therefore proposed to define an algorithm as efficient if its cost is within a polylogarithmic factor of the cost of the best sequential algorithm for the same problem [Kar]. See also the recent work by Snir and Kruskal [Kru].

Inherently Serial Problems and P-Completeness

P-completeness

Consider the problems in P. The subset of these which are in NC may be regarded as easy to parallelise, and those outside NC as difficult (or hard) to parallelise. Proving that a problem is outside NC, i.e. proving a lower bound, is however very difficult.

An easier approach is to show problems to be at least as difficult to parallelise as other problems. The notion of P-completeness helps us in doing this. Just as the NP-complete problems are those inside NP which are hardest to solve in polynomial time, the class of P-complete problems are those inside P which are hardest to solve in polylogarithmic time using only a polynomial number of processors.

Definition (P-complete problems, PC) [GR]:
PC is the class of P-complete problems. A problem Π is P-complete if (i) Π

(Note that some early papers on NP-completeness used the term P-complete for those problems that it is common today to denote as NP-complete; see for instance [SG].)



Figure: The complexity classes NP, P, NPC, NC, and PC.



is in P, and (ii) for any other problem Π′ in P there exists an NC-reduction from Π′ to Π.

Definition (NC-reduction) [GR]:
An NC-reduction from A to B, written A ∝_NC B, is a deterministic, polylogarithmic time bounded algorithm, with a polynomially bounded processor requirement, for transforming (or reducing) any instance I_A of A to a corresponding instance I_B of B, so that the solution of I_B gives the required answer for I_A.

Consequently, if one single P-complete problem can be solved in polylogarithmic time on a polynomial number of processors, then all problems in P can be solved within the same complexity bounds, using the NC-reduction required by the definition.

Just as there is no known proof that P ≠ NP, nobody has been able to prove that NC ≠ P. However, most researchers believe that the problems in P that are hardest to parallelise, the P-complete problems, are outside NC. A P-completeness result therefore has the same practical effect as a lower bound: it discourages efforts for finding NC-algorithms for solving the problem. The relationship between the classes P, NC, and the class PC containing the P-complete problems is shown in the figure above, which also shows the three main complexity classes for sequential computations.

NC-reductions

An NC-reduction is technically different from the transformation used in classical definitions of P-completeness [GR]. As done by Gibbons and Rytter [GR], we will use the notion of NC-reductions to avoid introducing additional details about space complexity. Classical definitions use so-called logarithmic space reductions, and P-complete problems are often termed "logspace complete for P". However, by using the so-called parallel computation thesis, it can be shown that a logspace reduction corresponds to an NC-reduction, and they will be used as synonyms in the rest of the text.

An NC-reduction may informally be perceived as a fast problem reduction executable on a reasonable parallel model. It is interesting to note that Gibbons and Rytter [GR] have observed that NC-reductions typically work very locally. For example, an instance G of a graph problem may often be transformed (reduced) to a corresponding instance G′ of another graph problem by transforming the local neighbourhood of every node in G. This is intuitively not surprising: there exist many examples where the ability to do computations based on local information, instead of global (central) resources, significantly improves the possibilities of achieving highly parallel implementations.

A very important property of the NC-reducibility relation, denoted ∝_NC, is that it is transitive [Par, And]. As described below, this property is central in reducing the effort needed to prove a problem P-complete.

P-completeness proofs

Part (ii) of the definition of P-complete problems may make us believe that the work needed to prove a new problem P-complete is substantial, since we must provide an NC-reduction from every problem known to be in P. Fortunately, this is not true, due to the following observation.

Observation (a corollary in [Par])
If A is P-complete, B ∈ P, and A ∝_NC B, then B is P-complete.

Proof: We want to show that B is P-complete. Since B ∈ P, what remains to show is that for every problem Π in P there exists an NC-reduction from Π to B, i.e. Π ∝_NC B. Since A is known to be P-complete, the definition implies Π ∝_NC A. The assumption A ∝_NC B and the transitivity of ∝_NC then imply that Π ∝_NC B. The argument follows the proof of the corresponding lemma in Garey and Johnson [GJ], adapted here to P-completeness.

[Footnote: A logspace reduction is a transformation in the same sense as the polynomial transformation described earlier, but with a space consumption which is bounded by O(log n) [GJ, And].]

[Footnote: This result informally states that sequential memory space is equivalent to parallel time, up to polynomial differences [Har, And, Par].]

The observation above tells us that we may use the following, much more practical, approach to prove that a problem Π is P-complete:

1. Show that Π ∈ P.
2. Show that there exists an NC-reduction from some known P-complete problem to Π.

However, we still have a "chicken before the egg" problem: we simply cannot use this approach to prove some first problem to be P-complete.



The first problem shown to be P-complete was the PATH problem, presented by Stephen Cook in [Coo]. Shortly after, Jones and Laaser [JL] showed six other problems to be P-complete. Cook proved the P-completeness of the PATH problem by describing a generic logspace reduction of an arbitrary problem in P to the PATH problem. Similarly, Jones and Laaser provided a generic reduction for the UNIT RESOLUTION problem [JL].

Informally, such a generic reduction to e.g. the PATH problem is a description of how an arbitrary deterministic polynomial-time computation on a Turing machine may be simulated by solving a corresponding instance of the PATH problem. The generic reduction to the PATH problem invented by Cook is an extremely important result in the history of parallel complexity theory. This and other similar generic reductions are complicated mathematical descriptions, which we do not need to know in detail to use the P-completeness theory in practice.

In addition to the original papers [Coo, JL], interested readers are referred to [GR], which describes a generic reduction to the GENERABILITY problem. See also the very good description of Stephen Cook's seminal theorem (Cook's Theorem) found in [GJ]. This theorem provided the first NP-complete problem, SATISFIABILITY, by a generic reduction from an arbitrary problem in NP.

Part 1 of the proof procedure, to show that the problem is in P, is often a straightforward task: all we need is to show the existence of some deterministic polynomial-time uniprocessor algorithm solving the problem.

[Footnote: Called PATH SYSTEM ACCESSIBILITY by Garey and Johnson [GJ]. The problem is also described in [JL].]

[Footnote: One exception is linear programming, which was not shown to be in P until Khachian's work [Kha].]

To find an NC-reduction from some known P-complete problem is in general not so easy. A good way to start is to study documented P-completeness proofs of related problems. The book by Garey and Johnson [GJ] contains a lot of fascinating problem transformations and is highly recommended, in spite of the fact that the problem transformations reported there are polynomial reductions and not necessarily NC-reductions. However, Anderson reports that most of the transformations used in NP-completeness proofs can also be performed as NC-reductions [And].

What is difficult is to select a well-suited known P-complete problem, and to find a systematic way of mapping instances of that problem into instances of the problem we want to prove P-complete. Once the reduction has been found, it is often relatively easy to see that it can be performed by an NC-algorithm. That part is therefore omitted from most P-completeness proofs.

The Circuit Value Problem

Since the pioneering work by Cook, we have seen an increasing number of known P-complete problems. The most frequently used problem in P-completeness proofs is the circuit value problem (CVP) and its variants [And]. The circuit value problem was shown to be P-complete by Ladner [Lad].

Informally, an instance of the circuit value problem is a combinational circuit (i.e. a circuit without feedback loops) built from Boolean two-input gates, together with an assignment to its inputs. To solve the problem means to compute its output [Par]. The kind of gates allowed in CVP plays an important role: CVP is P-complete provided that the gates form a so-called complete basis [Par, And]. The most common set of gates is {AND, OR, NOT}. Two important variants of CVP are the monotone circuit value problem (MCVP), where we are restricted to using only AND and OR gates, and the planar circuit value problem (PCVP), where the circuit can be constructed in the plane without wire crossings. MCVP and PCVP were both proved P-complete by Goldschlager [Gol]. Other variants of CVP are described in [GSS, And, Par]. Note that small changes in a problem definition can have dramatic effects on the possibility of a fast parallel solution. As an example, both MCVP and PCVP are P-complete, but the CVP restricted to circuits that are both monotone and planar can be solved by an NC-algorithm.
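To make the problem concrete, the following is an illustrative Python sketch (the gate encoding is hypothetical, not taken from the references) of a straightforward sequential evaluator for a CVP instance:

```python
# Sequential evaluation of a circuit value problem (CVP) instance.
# A circuit is a list of gates in topological order. Each gate is
# ("INPUT", value) or (op, i, j), where i and j index earlier gates;
# NOT gates use only their first operand.

def evaluate_circuit(circuit):
    value = []                      # value[i] = output of gate i
    for gate in circuit:
        if gate[0] == "INPUT":
            value.append(bool(gate[1]))
        elif gate[0] == "AND":
            value.append(value[gate[1]] and value[gate[2]])
        elif gate[0] == "OR":
            value.append(value[gate[1]] or value[gate[2]])
        elif gate[0] == "NOT":
            value.append(not value[gate[1]])
        else:
            raise ValueError("unknown gate: %r" % (gate[0],))
    return value[-1]                # the circuit output is the last gate

# (x AND y) OR (NOT x), with x = True and y = False
example = [("INPUT", True), ("INPUT", False),
           ("AND", 0, 1), ("NOT", 0), ("OR", 2, 3)]
```

Evaluating gate by gate in topological order is exactly the kind of chained dependence that the P-completeness of CVP concerns: each gate value may depend on values computed earlier.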

P-complete problems: Examples

The set of problems proved and documented to be P-complete contains a growing set of interesting problems. Below is a short list containing some of the more important of these. The reader should consult the references for more detailed descriptions of the problems and their proofs; see also [GR].

- Maximum network flow, proved P-complete by Goldschlager et al. in [GSS].
- Linear programming, proved to be a member of P by Khachian [Kha], and proved P-complete or harder by Dobkin et al. in [DLR].
- General deadlock detection, proved P-complete by Spirakis in [Spi].
- Depth-first search, proved P-complete by John Reif in [Rei].
- Unification, proved P-complete by Dwork, Kanellakis and Mitchell [DKM]. Kanellakis describes in [Kan] how this result depends on the representation of the input.
- Unit resolution for propositional formulas, proved P-complete, among several other problems, by Jones and Laaser in [JL].
- Two-player game, proved P-complete by Jones and Laaser in [JL].

A P-complete problem is often termed inherently serial: it often contains a subproblem or a constraint which requires that any solution method must follow some sort of serial strategy. For some problems, such as two-player game, and perhaps also general deadlock detection, it is relatively easy to see the inherent sequentiality intuitively. For others, such as maximum network flow, it is much harder.

In general, it is difficult to claim that a problem is inherently serial, because in our context it means that all possible algorithms for solving it are inherently serial. In contrast, it is often easier to identify inherently serial algorithms. This is captured in the notion of P-complete algorithms, discussed in the following paragraph.



This is often termed Phard Informally it means that the problem may b e outside P

but if it can b e proved as member of P then Phardness implies Pcompleteness



It is more correct to term depth rst searchasaPcomplete algorithm see page

P-complete algorithms

A vast amount of research has been done on sequential algorithms. When designing a parallel algorithm for a specific problem, it is often fruitful to look at the best known sequential algorithms. Many sequential algorithms have fairly direct parallel counterparts.

Unfortunately, many of the most successful sequential algorithms are very difficult to translate (transform) into very fast parallel algorithms. Such algorithms may be termed inherently serial. They often use a strategy which bases decisions on accumulated information. A good example is J. B. Kruskal's minimum spanning tree algorithm [Kru], which is a typical greedy algorithm (see for instance [AHU]). Greedy algorithms are in general sequential in nature: they typically build up a solution set item by item, and the choice of which item to add depends on the previous choices.
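The serial flavour of the greedy strategy can be seen in a compact sketch of Kruskal's algorithm (an illustrative Python rendering, not reproduced from [Kru] or [AHU]):

```python
# Kruskal's greedy minimum spanning tree algorithm. Each iteration decides
# whether to accept an edge, and that decision depends on the union-find
# structure built by all previous decisions -- the serial dependency that
# makes the greedy strategy hard to parallelise directly.

def kruskal(n, edges):
    """n vertices numbered 0..n-1; edges is a list of (weight, u, v)."""
    parent = list(range(n))

    def find(x):                    # find set representative (path halving)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):   # consider edges by increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                # accept the edge iff it joins two components
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
# kruskal(4, edges) accepts (1,0,1), (2,1,2), (4,2,3) and rejects (3,0,2)
```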

In his thesis, The Complexity of Parallel Algorithms [And], Richard Anderson shows that greedy algorithms for several problems are inherently serial: he proves them to be P-complete. Anderson gives the following definition of a P-complete algorithm [And]:

    An algorithm A for a search problem is P-complete if the problem of computing the solution found by A is P-complete.

Another example of an inherently serial algorithm is depth-first search (DFS) [Tar], which is used as a subalgorithm in many efficient sequential algorithms for graph problems. John Reif has shown that the DFS algorithm is P-complete by proving that the DFS-ORDER problem is P-complete [Rei].

Fortunately, the fundamental difference between problems and algorithms makes finding an algorithm to be P-complete a less discouraging result. A P-complete algorithm does not say that the corresponding problem is inherently serial; it merely states that a different approach than parallelising the sequential algorithm must be attempted to possibly obtain a fast way to solve the problem with parallelism. For example, the natural greedy algorithm for solving the maximal independent set problem is P-complete, but a different approach can be used to solve the problem with an NC-algorithm [And, GS].



Informally describ ed this problem is to nd the depthrst search visiting order of

the no des in a directed graph
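Anderson's notion can be illustrated with the natural greedy algorithm for maximal independent sets (an illustrative Python sketch): computing exactly the solution found by this algorithm, the lexicographically first maximal independent set, is P-complete, even though some maximal independent set can be found by an NC-algorithm.

```python
# Greedy maximal independent set: scan the vertices in a fixed order and
# add a vertex whenever none of its neighbours is already in the set.
# Whether a vertex is added depends on all earlier decisions, so the
# solution found by this algorithm is inherently serial to compute.

def greedy_mis(order, adj):
    """order: vertices in scan order; adj: dict vertex -> set of neighbours."""
    chosen = set()
    for v in order:
        if not (adj[v] & chosen):   # depends on all earlier choices
            chosen.add(v)
    return chosen

# A path 0-1-2-3: the resulting set depends on the scan order.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
# greedy_mis([0, 1, 2, 3], adj) gives {0, 2}
# greedy_mis([1, 3, 0, 2], adj) gives {1, 3}
```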

Table: Summary of complexity theory for sequential and parallel computations.

  The main class of problems studied:
      sequential: NP;  parallel: P
  The easiest problems in the main class, i.e. the class of problems that may be solved efficiently:
      sequential: the class P;  parallel: the class NC
  The fundamental question:
      sequential: Is P = NP?;  parallel: Is NC = P?
  Definition of a problem that may be solved efficiently:
      sequential: the existence of a sequential algorithm solving the problem in polynomial time;
      parallel: the existence of a parallel algorithm solving the problem in polylogarithmic time using a polynomial number of processors
  The hardest problems in the main class, i.e. problems that probably cannot be solved efficiently:
      sequential: the NP-complete problems;  parallel: the P-complete problems
  A first problem in the hardest class:
      sequential: SATISFIABILITY (SAT);  parallel: PATH
  Technique for proving membership in the hardest class:
      sequential: P-reducibility;  parallel: NC-reducibility

NP-completeness vs. P-completeness

The table above shows the similarities between P-completeness and NP-completeness. It also gives a quick summary of the theory.

Evaluating Parallel Algorithms

Basic Performance Metrics

In the rest of this thesis, it is assumed that algorithms are executed on the CREW PRAM model. This assumption makes it possible to define the basic performance metrics for evaluating parallel algorithms.

Time, Processors and Space

Definition (Time)
The time (running time or execution time) for an algorithm executed on a CREW PRAM is the number of CREW PRAM clock periods elapsed from the first to the last CREW PRAM instruction used to solve the given instance of a problem. The time is expressed in CREW PRAM time units, or simply time units.

This definition corresponds to what many authors call parallel running time [Akl]. A large number of processors may perform the same or different CREW PRAM instructions in one CREW PRAM clock period. The definition makes it possible to use the same concept of time for parallel algorithms and for serial algorithms executed as uniprocessor CREW PRAM programs. Note that the definition implies that the time used by an algorithm strongly depends on the size, and possibly also the characteristics, of the actual problem instance.

On models which reflect parallel architectures based on message passing, it is common to differentiate between computational steps and routing steps (see for instance Akl [Akl]). As described in an earlier chapter, access to the global memory from the processors in a CREW PRAM is done by one single instruction, and it is therefore less important to differentiate between local computations and interprocessor communication.

[Footnote: More precisely, a CREW PRAM implementation of an algorithm.]

[Footnote: Each processor is assumed to have a rather simple and small instruction set. Nearly all instructions use one time unit; some few, such as divide, use more. The instruction-set time requirements are defined as parameters in the simulator, and are therefore easy to change. See Appendix A.]

Definition (Processor requirement)
The processor requirement of an algorithm is the maximum number of CREW PRAM processors that are active in the same clock period during execution of the algorithm for a given problem instance.

The processor requirement is central in the evaluation of parallel algorithms, since the number of processors used is a good representative of the hardware cost needed to do the computations.

Definition (Space)
The space used by a CREW PRAM algorithm is the maximum number of memory locations in the global memory allocated at the same time during execution of the algorithm for solving a given problem instance.

Note that we do not measure the space used in the local memories of the CREW PRAM processors. This is because it is easy to attach cheap memories to the local processors, while it is difficult and expensive to realise the global memory, which is accessible by all the processors. Little emphasis is put on analysing the space requirements of the algorithms evaluated in this thesis.

Cost Optimal Algorithms and Efficiency

It is often possible to obtain faster parallel algorithms by employing more processors. Many of the fastest algorithms require a huge number of processors, often far beyond what can be realised with present technology. It is therefore frequently the case that the "fastest" algorithm is not the "best" algorithm.

One way to bring the question of feasibility into the analysis is to measure or estimate the cost of the parallel algorithm.

Definition (Cost) [Akl, Qui]
The cost of a parallel algorithm is the product of the time (as defined above) and the processor requirement (as defined above). The cost is expressed in the number of CREW PRAM unit-time instructions.

Note that this definition assumes that all the processors are active during the whole computation. This is a simplification which makes the cost an upper bound on the total number of CREW PRAM instructions performed. In contrast to Akl [Akl], who defines cost to represent worst-case behaviour, we define cost to be dependent on the actual problem instance (cf. the definitions of time and processor requirement above).

There are many ways to define a parallel algorithm to be optimal for a given problem, depending on what resources are considered most critical. It is most common to say that a parallel algorithm is optimal, or cost optimal, if its cost for a given problem size is the same as the cost of the best known sequential algorithm for the same problem size. For sorting, this cost is known to be O(n log n) [Akl, Knu].

Definition (Optimal parallel sorting algorithm)
A parallel sorting algorithm is said to be optimal with respect to cost if its cost is O(n log n).

The fastest parallel algorithms often use redundancy, and are therefore not optimal with respect to cost. It is often desirable to have a figure which tells how good the utilisation of the processors is. This is achieved by the measure efficiency.

Definition (Efficiency)
The efficiency of a parallel algorithm is the running time of the fastest known sequential algorithm for solving a specific instance of a problem, divided by the cost of the parallel algorithm solving the same problem instance.

Efficiency may also be defined as the speedup divided by the processor requirement. For a discussion of the possibility of efficiency greater than one, see the treatment of superlinear speedup below.
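Taken together, these definitions reduce to simple arithmetic. The following illustrative Python sketch (hypothetical helper names; the instance-dependent quantities are passed as arguments) makes the relations explicit:

```python
# Basic performance metrics for a parallel algorithm on a given problem
# instance, following the definitions above:
#   cost       = parallel time * processor requirement
#   speedup    = sequential time / parallel time
#   efficiency = sequential time / parallel cost  ( = speedup / processors )

def cost(parallel_time, processors):
    return parallel_time * processors

def speedup(sequential_time, parallel_time):
    return sequential_time / parallel_time

def efficiency(sequential_time, parallel_time, processors):
    return sequential_time / cost(parallel_time, processors)

# Example: the fastest known sequential algorithm takes 1000 time units;
# the parallel algorithm takes 125 time units on 16 processors.
# Then speedup = 8, cost = 2000, efficiency = 0.5 (= 8/16).
```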

The term efficient is used in connection with parallel algorithms in many ways. Gibbons and Rytter [GR] define an efficient parallel algorithm to be an algorithm with polylogarithmic time consumption using a polynomial number of processors. According to this, an algorithm with an extensive use of processors may be termed efficient even though it may have a very low efficiency.

Richard Karp and others [Kar] have advocated an alternative meaning of efficient parallel algorithm, which does not allow parallel algorithms that are grossly wasteful of processors to be termed efficient. They define, informally, a parallel algorithm to be efficient if its cost is within a polylogarithmic factor of the cost (i.e. execution time) of the best sequential algorithm for the problem.

Analysis of Speedup

What is Speedup?

The most frequently used measure of the effect that can be obtained by parallelisation is speedup. The speedup expresses how much faster a problem can be solved in parallel than on a single processor. Many different definitions of speedup can be found in the literature on parallel processing. A precise definition of the most common use of speedup is given below.

Definition (Speedup)
Let T_1 be the execution time of the fastest known sequential algorithm for a given problem, run on a single processor of a parallel machine M. Let T_N be the execution time of a parallel algorithm for the problem, run on the same parallel machine M using N processors. The speedup achieved by the parallel algorithm for the problem run on N processors is defined as

    Speedup(N) = T_1 / T_N

Some variants of this definition should be mentioned. The definition given by Akl in [Akl] specifies that both the sequential and parallel execution times are worst-case execution times.

Miklosko in [MK] requires that the sequential algorithm must be the fastest possible algorithm for a single processor, which often is unknown. Comparing with the fastest known algorithm is more practical.

Quinn in [Qui] emphasises the use of a single processor of the parallel machine for execution of the sequential algorithm. The alternative is to use the fastest known serial computer, which may be regarded as giving a more correct figure for the speedup. However, few researchers in parallel processing have access to the most powerful serial computers.

One popular alternative exists that often gives better speedup results than the definition above. In this alternative definition, the sequential running time is measured or estimated as the time used by the parallel algorithm run on a single processor. The main error introduced is that the redundancy often existing in good parallel algorithms must in this case also be performed in the uniprocessor solution, giving a poorer sequential running time. The main motivation for this strategy is that it reduces the research to a study of only the parallel algorithm. One example of this alternative speedup definition can be found in [Moh].

[Footnote: It is implicitly assumed that all processors on this machine are of equal power.]

Superlinear Speedup: Is It Possible?

The term linear speedup often appears in the literature, and there is a discussion whether so-called superlinear speedup is possible or not. Quinn [Qui] defines linear speedup to be a speedup function S(N) of order exactly N, where N is the number of processors. Thus a speedup proportional to N, but not necessarily equal to N, also expresses linear speedup according to this definition. Some other authors only regard S(N) = N as linear speedup.

Superlinear speedup is used to denote the case, if it exists, where a problem is solved more than N times faster using N processors, i.e. a speedup greater than linear. Note that the chosen definition of linear speedup is crucial for such a discussion. Parkinson in [Par] gives a very simple example with speedup S(N) greater than N. According to the chosen definition of linear speedup, this example is, or is not, exhibiting superlinear speedup. See also [HM].

Amdahl's Law and Problem Scaling

This section starts with a description of the reasoning behind Amdahl's law, and it is shown that a recently announced alternative law is strongly related to Amdahl's law. This discussion leads to the concept of problem scaling. The theory is then illustrated by a simple example. The last part of the section shows why it is difficult to use Amdahl's law on practical algorithms. We see that the law is only useful as a coarse model for giving an upper bound on speedup. The discussion reveals various issues which should be covered in a more detailed analysis method.

Background

Gene Amdahl, in [Amd], gave a strong argument against the use of parallel processing to increase computer performance. Today this short article seems rather pessimistic with respect to parallelism, and his claims have become partly obsolete.

However, some of his predictions have been shown to be right over nearly two decades, and the reasoning is considered the origin of a general speedup formula called Amdahl's law.

In his article, Amdahl first describes that practical computations have a significant sequential part. The effect of increasing the number of processors to speed up the computation will soon be limited by this more and more dominating sequential part. By using Grosch's law, he then argues that more performance for the same cost is achieved by upgrading a sequential processor than by using multiprocessing. Amdahl further describes that it has always been difficult to exploit additional hardware in improving a sequential computer, but that the problems have been overcome. He predicts that this will continue to happen in the future.

In short, Amdahl considers a given computation having a fixed amount of operations that must be performed sequentially, and asks how the computation may be sped up by using several processors. This is called Amdahl reasoning in the following. If the sequential fraction of the computation is called s, then by Amdahl's law the maximum speedup that can be achieved by parallel processing is bounded by 1/s. This observation has often been used as a strong argument against the viability of massive parallelism.

The implicit assumption underlying Amdahl's law has been criticised as inappropriate in the past. An alternative, inverse Amdahl's law has been suggested by a research team from Sandia National Laboratories [Gus, GMB, San]. They have achieved very good speedup results for three practical scientific applications. They use an inverted form of Amdahl's argument to develop a new speedup formula for explaining their results. This new formula is also claimed to show that it in general is much easier to achieve efficient parallel performance than implied by Amdahl's paradigm.

In my opinion, this new formula, just as Amdahl's law, can be derived from a very standard speedup calculation. As will be shown, the two formulas are strongly related. Nevertheless, the new law looks much more optimistic with respect to the viability of parallelism. The reason for this is an implicit assumption which is hidden by the derivation of the law. Furthermore, standard speedup analysis gives the same, more optimistic, picture as the new formula, provided that the same assumptions are made. In this light, I think it is appropriate to discuss Amdahl's law and the alternative speedup formula suggested by the team from Sandia in more detail.

[Footnote: Amdahl found that a significant fraction of the operations are forced to be sequential; this is due to data management ("housekeeping") and to problem irregularities.]

[Footnote: Grosch's law states that the speed of computers is proportional to the square of their cost. It is discussed in [Qui].]

[Footnote: Amdahl assumed that the sequential part, and therefore also the parallel part, of the total computation are both unaffected by the number of processors used in the parallel execution. In retrospect, I think this implicit assumption is the crucial part of the article, and the reason why the law has been named after Amdahl. Several authors [Qui, Gus] refer to [Amd] as the source of Amdahl's law. However, the arguments of Amdahl are rather informal and are formulated as a belief (prophecy); no explicit law is defined.]

[Footnote: [GMB] uses the term massive parallelism for general-purpose MIMD systems with a sufficiently large number of autonomous floating-point processors.]

Amdahl's Law

Definition (Serial work, parallel work)
Consider a given algorithm. Let T_1 be the total time used for executing it on a single processor. The part of the time T_1 used on computations which must be performed sequentially is denoted s_w. The remaining part of the time T_1 is used on computations that may be done in parallel, and is denoted p_w. s_w is called the serial part of the uniprocessor execution time; it is sometimes called the serial work. Correspondingly, p_w is called the parallel part, or parallel work. By definition, s_w + p_w = T_1.

Definition (Serial fraction, parallel fraction)
Let s denote the serial fraction of the uniprocessor execution time, and p the parallel fraction of the uniprocessor execution time. s and p are given by

    s = s_w / (s_w + p_w),        p = p_w / (s_w + p_w)



Now consider executing the same algorithm on a parallel machine with N processors, each with the same capabilities as the single processor used above. The time used is T_N. Assume that the parallel work p_w can be split evenly among the N processors without any extra overhead. The execution time of this direct parallelisation is then T_N = s_w + p_w/N. The speedup achieved is given by

    S_Amdahl = T_1 / T_N = (s_w + p_w) / (s_w + p_w/N) = 1 / (s + p/N) = 1 / (s + (1 - s)/N)

This formula has often been used to plot the possible speedup as a function of N for fixed s, or as a function of s for fixed N (see the figure).

[Footnote: "Sequentially" here means that this part of the computation must be performed step by step, i.e. sequentially and alone. No other parts of the total computation may be executed while the serial part is performed.]
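The formula is easy to tabulate; a minimal illustrative sketch in Python (assuming the idealised even split of the parallel work):

```python
# Amdahl speedup: S(N) = 1 / (s + (1 - s)/N), for serial fraction s.
# As N grows, S(N) approaches the upper bound 1/s.

def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

# For s = 0.05 the bound is 1/s = 20, however many processors are used:
# amdahl_speedup(0.05, 10) is about 6.9, amdahl_speedup(0.05, 100) is
# about 16.8, and even with a million processors the speedup stays below 20.
```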

[Figure: Two interpretations of S_Amdahl: (a) speedup as a function of N for fixed s; (b) speedup as a function of s for fixed N.]

Inverted Amdahl's Law

Amdahl's law can be viewed as a result of considering the execution time of a given serial program run on a parallel machine [GMB]. The team at Sandia has derived a new speedup formula, here called S_Sandia, by inverting this reasoning: they consider how much time is needed to execute a given parallel program on a serial processor.

Definition
Consider a given parallel algorithm. Let T_N be the total time used for executing it on a parallel machine using N processors. The part of the time T_N used on computations which must be performed sequentially is denoted s_w′. The remaining part of the time T_N is used on computations that may be done in parallel, and is denoted p_w′. s_w′ is called the serial part of the N-processor execution time. Correspondingly, p_w′ is called the parallel part of the N-processor execution time. By definition, s_w′ + p_w′ = T_N.

Definition
Let s′ denote the serial fraction of the N-processor execution time, and p′ the parallel fraction of the N-processor execution time. s′ and p′ are given by

[Figure: Two interpretations of S_Sandia: (a) speedup as a function of N for fixed s′; (b) speedup as a function of s′ for fixed N.]

    s′ = s_w′ / (s_w′ + p_w′),        p′ = p_w′ / (s_w′ + p_w′)



 

On a single processor, the time used to execute the algorithm is T_1 = s_w′ + p_w′·N. This gives the following alternative speedup formula:

    S_Sandia = T_1 / T_N = (s_w′ + p_w′·N) / (s_w′ + p_w′) = s′ + p′·N = s′ + (1 − s′)N = N + (1 − N)s′

S_Sandia is shown as a function of N for fixed s′, and as a function of s′ for fixed N, in the figure.
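The inverted formula can be tabulated in the same way as S_Amdahl (an illustrative Python sketch; s_prime denotes the serial fraction s′ of the N-processor execution time):

```python
# Sandia (scaled) speedup: S(N) = s' + (1 - s')*N = N + (1 - N)*s',
# where s' is the serial fraction of the N-processor execution time.
# For fixed s' the speedup grows linearly with N -- no upper bound.

def sandia_speedup(s_prime, n):
    return s_prime + (1.0 - s_prime) * n

# sandia_speedup(0.05, 10) gives 9.55; sandia_speedup(0.05, 100) gives 95.05
```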

Discussion

At first sight it seems like this inverted Amdahl reasoning gives a new and much more optimistic law. S_Sandia expresses that speedup increases linearly with N for a fixed s′. The upper bound for possible speedup expressed by Amdahl's law has been removed. We may be led to believe that Amdahl's law is without practical value when we consider uniprocessor execution of a parallel computation, or, even worse, that the law is wrong.

It is important here to realise the effects of keeping s′ fixed, instead of keeping s fixed, when N is varied. This is in fact the only source of the difference between the two formulas. In Amdahl's reasoning, where we have a fixed uniprocessor execution time, keeping s fixed while N increases corresponds to computing a fixed amount of parallel work p_w faster and faster by using more processors. However, in the inverted reasoning, which assumes a fixed N-processor execution time, keeping s′ fixed while N increases corresponds to performing a larger and larger total parallel work in fixed time on the parallel system, while the serial work remains unchanged.

These two situations are rather different. Keeping s′ fixed when N increases corresponds to decreasing s, and thus lifting the upper bound in part (a) of the figure showing S_Amdahl. Similarly, keeping s fixed with varying N corresponds to increasing s′, and thus reducing the slope of the curve in part (a) of the figure showing S_Sandia.

Problem Scaling and Scaled Speedup

Discussion

Which sp eedup formula is correct Both are The team at Sandia has an

nounced their formula b ecause they think it b etter describ es parallelisation

eorts as it is done in practice Gustafson in Gus argues that Amdahls

law is most appropriate for academic research

One do es not take a xedsized problem and run it on various

numb ers of pro cessors except when doing academic research in

practice the problem size scales with the number of processors

Agreeing with this statement or not there certainly exist situations

where it is natural to scale the problem with the numb er of pro cessors avail

able Often the xed quantity is not the problem size but rather the amount



of time a user is willing to wait for an answer In handling three real prob

lems wave mechanics uid dynamics and b eam strain analysis the team



Footnote: Both for the direct parallelisation (leading to S_Amdahl) and the direct serialisation (giving S_Sandia) we have in general that s_w = s'_w and p_w = p'_w N. If we in the equation for S_Amdahl substitute s_w and p_w with s'_w and p'_w N, respectively, we get the new speedup formula, i.e. that for S_Sandia. Similarly, S_Amdahl can be directly derived from S_Sandia.



Footnote: This work is the time needed to execute the parallel computation on a single processor, i.e. p_w = p'_w N. Thus the parallel work increases linearly in N.

Footnote: This is seen directly from the definition of s' when s_w is fixed and p_w increases.



Footnote: A good example is weather forecasting, which illustrates both the fixed-sized model (Amdahl reasoning) and the scaled-sized model (inverted Amdahl reasoning). Assume that the weather prediction is guided by a simulation model of the atmosphere. Two important

at Sandia found it natural to increase the size of the problem when using more processors. This is what they call problem scaling.

In their work they found that it was the parallel part of the program that scales with the problem size. Therefore, as an approximation, the amount of work p_w that can be done in parallel varies linearly with the number of processors N. In other words, since p_w = p'_w N and p_w varies linearly with N, such problem scaling corresponds to keeping s' fixed. Therefore, under these conditions, the figure gives a correct picture of the possible speedup. The team at Sandia called this speedup measure scaled speedup, to reflect that it assumes problem scaling.

This section was written in December. Similar thoughts have later been expressed by several authors; see for instance the discussion in [HWG] and [ZG], and the detailed treatment given in [SN].

A Simple Example: Floyd's Algorithm

The intention of the following example is to illustrate parts of the theory described above. It also gives a background for a further discussion of Amdahl's law later in the thesis.



The famous Floyd-Warshall algorithm for the all pairs shortest path problem is shown in the figure below. The reader is referred to [CM] for further details about the algorithm. The variable names in the figure are the same as those used in [CM]. Other descriptions of the algorithm can be found in [AHU] and [Law].



The execution of this algorithm on a single processor requires O(n³) time. There are exactly n³ instances of the assignment in line 4. [CM] shows that the n² assignments given by line 4, for a fixed value of k in the outermost for-loop, are independent. Therefore they can be executed in any order, or simultaneously, i.e. in parallel. Thus the algorithm may

Footnote (continued): factors for the quality of the forecasting are how new the inputs (weather samples) to the simulation model are, and how detailed the simulation is. If the computer used for the simulation is upgraded by increasing the number of processors, there are principally two main alternatives for quality improvement of the forecasting. Amdahl reasoning: solve the same problem faster, i.e. use newer weather data in the simulation. Inverted Amdahl reasoning: solve a bigger problem within the same execution time, i.e. use a more detailed simulation model.



Footnote: The names Floyd's algorithm and the Warshall-Floyd algorithm do also appear in the literature. The algorithm was first presented by Robert W. Floyd in [Flo], and is based on a theorem on Boolean matrices published by Stephen Warshall in [War].

SEQUENTIAL procedure Floyd;
{ Initially, all d[i,j] = W[i,j] }
begin
1:  for k := 1 to n do
2:    for i := 1 to n do
3:      for j := 1 to n do
4:        d[i,j] := min(d[i,j], d[i,k] + d[k,j])
end;

Figure: Floyd's algorithm



be executed in O(n) time on a synchronous machine with O(n²) processors. See the corresponding program in [CM].
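For readers who want to experiment, the procedure of the figure transcribes directly into sequential Python (this transcription is ours, not taken from [CM]; 0-based indexing replaces the 1-based loops):

```python
def floyd(d):
    # d is an n x n matrix of path weights; initially d[i][j] = W[i][j].
    n = len(d)
    for k in range(n):             # line 1: the serial outer loop
        for i in range(n):         # line 2
            for j in range(n):     # line 3
                # line 4: for a fixed k, these n^2 assignments are
                # independent and could be executed in parallel
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d

INF = float("inf")
W = [[0, 3, INF],
     [INF, 0, 1],
     [2, INF, 0]]
print(floyd(W))   # [[0, 3, 4], [3, 0, 1], [2, 5, 0]]
```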

We will now use the theory presented in this section on the parallelisation described above for the algorithm Floyd. Assume that the loop control for one iteration of a for loop takes one time unit, and that one execution of line 4 requires four time units. The serial work s_w of the computation consists of line 1, taking n time units. The parallel work p_w consists of lines 2, 3 and 4, taking n², n³ and 4n³ time units, respectively. T_1(Floyd) = s_w + p_w = n + n² + 5n³. Assuming that the parallel part of the computation can be distributed evenly among N processors without extra overhead, we have T_N(Floyd) = s_w + p_w/N = n + (n² + 5n³)/N.

N w w

Now consider a fixed problem size, for instance n = 100. The serial fraction of the uniprocessor execution time is s = n/(n + n² + 5n³) = 1/(1 + n + 5n²) ≈ 0.00002. Hence we observe that this algorithm has a very small serial part, even for relatively small problem sizes, and that s → 0 as n → ∞. The speedup as a function of the number of processors used, N, is then given by Amdahl's law:

    S_Amdahl = N / (1 + s(N - 1))

This corresponds to the general speedup curve shown in part (a) of the figure. For N ≪ 1/s, the equation does express a nearly linear speedup, and for N = n² = 10⁴ we get S_Amdahl ≈ 0.83N ≈ 8300. Further, we see that the maximum speedup is limited by 1/s ≈ 50101, using an infinitely large number of processors for this fixed-sized problem.
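The figures above are easy to reproduce; the following sketch (our illustration, using the cost assumptions stated in the text) evaluates the serial fraction and the resulting Amdahl speedup for n = 100:

```python
def floyd_speedup(n, N):
    # Work counts under the stated cost assumptions:
    s_w = n                      # line 1, the serial work
    p_w = n**2 + 5 * n**3        # lines 2-4, the parallel work
    T1 = s_w + p_w
    TN = s_w + p_w / N
    return T1 / TN

n = 100
s = n / (n + n**2 + 5 * n**3)    # serial fraction of T_1
print(s)                         # roughly 2e-5
print(1 / s)                     # the limit 1 + n + 5n^2 = 50101
print(floyd_speedup(n, n**2))    # about 0.83 * n^2
```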

Let us now consider the speedup formula derived by the team at Sandia. We have s_w = n, and the time used to execute the parallel part of the computation using N processors is p_w/N = (n² + 5n³)/N. Further, the serial fraction of the parallel execution time is

    s' = n / (n + (n² + 5n³)/N)

When we used Amdahl's law above, we assumed a fixed problem size n, corresponding to a fixed value for the serial fraction s of the uniprocessor execution time. In this case we assume that s' is kept fixed, which requires that the problem size is scaled with the number of processors. The necessary relation between n and N can be found by solving the expression for s' with respect to n or N; for instance N = s'(n + 5n²)/(1 - s'). Now assume that this relation holds between N and n for some fixed value of s'. Using the Sandia formula, we get the speedup as a function of N:

    S_Sandia = s' + (1 - s')N

For large n, keeping s' fixed corresponds to scaling the problem size by setting n proportional to √N. In this case the formula gives S_Sandia ≈ (1 - s')N, i.e. a linear speedup, which corresponds to part (a) of the figure referred to above.



This example shows that a rather moderate problem scaling gives an unlimited speedup function. The upper bound for the speedup given by Amdahl's law occurs because only a limited number of processors can be used effectively on a fixed amount of parallel work. This upper bound is removed if we increase the amount of parallel work in correspondence with the number of processors used. It should therefore be no surprise that it is easier to obtain optimistic speedup results if we allow problem scaling.
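The effect of problem scaling on this example can be sketched numerically (our illustration; the scaling rule below uses the large-n approximation n proportional to √N derived from the expression for s', and the target value 0.5 for s' is an arbitrary choice):

```python
import math

def n_for(N, s1):
    # Large-n approximation of N = s1 * (n + 5 n^2) / (1 - s1),
    # solved for n.
    return math.sqrt(N * (1 - s1) / (5 * s1))

def serial_fraction_parallel(n, N):
    # s' as defined in the text.
    return n / (n + (n**2 + 5 * n**3) / N)

s1 = 0.5                        # an arbitrary target value for s'
for N in (100, 10000, 1000000):
    n = n_for(N, s1)
    # s' stays close to 0.5 while the Sandia speedup grows linearly.
    print(N, round(n, 1), round(serial_fraction_parallel(n, N), 3),
          s1 + (1 - s1) * N)
```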

Using Amdahl's Law in Practice

There are many problems associated with the use of Amdahl's law in practice. This section aims at illuminating some of these problems, not solving them.

Choice of parallelisation

Consider an arbitrary given sequential program and the question: What speedup can be achieved by parallel processing of this program? The first issue encountered is the choice of parallelisation. In this context, an optimal parallelisation may be defined as one which gives the best speedup curve. But what is the best speedup curve? Is it most important to obtain a high speedup for reasonably low values of N (tens or hundreds of processors), or should one strive for a parallelisation giving the highest asymptotic (N → ∞) speedup value?

Even if one could decide upon a definition of an optimal speedup curve, we would still have the generally difficult problem of finding the parallelisation giving this speedup, and of proving that it is optimal. However, there exists a large number of non-optimal parallelisations for most problems, and most of them are interesting in some manner. In the literature, the most usual situation is speedup analysis of a given parallelisation, which is the approach adopted here.

Calculation of serial and parallel work

Assume a given parallelisation of a sequential program. In the general case, identifying the amount of serial and parallel work will be a more complicated task than exemplified by the analysis of Floyd's algorithm above. The calculation of s_w and p_w for the described parallelisation of Floyd was straightforward, partly due to the simplicity of the algorithm, and partly due to some implicit and simplifying assumptions.

In the analysis above we assumed that all the work in lines 2 to 4 could be done in parallel. The amount of parallel work for line 4 was calculated by accumulating the effect of the three enclosing for-loops. Now suppose that each single execution of line 4 has to be performed serially. What is now the amount of serial work described by line 4? If we use the accumulated amount of work for line 4, which is 4n³, we get a serial work s_w = n + 4n³. If this is used together with the corresponding p_w = n² + n³, Amdahl's law gives a constant speedup of approximately 1.25 for large values of n and N. However, using n² processors makes Floyd into a simple loop with n iterations, each of constant time. Thus N = n² processors gives a speedup of order O(n²). Therefore, the ad hoc method for calculating s_w and p_w used above can not be used in the general case.
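The contradiction can be checked numerically (our sketch; the constant per-iteration cost of the resulting n-iteration loop is an assumed value):

```python
def naive_amdahl_bound(n):
    # Ad hoc split: line 1 and the 4n^3 units of line 4 counted serial.
    s_w = n + 4 * n**3
    p_w = n**2 + n**3            # lines 2 and 3
    return (s_w + p_w) / s_w     # limit of the speedup as N -> infinity

def speedup_with_n2_processors(n, c=6):
    # With n^2 processors, Floyd becomes a loop of n iterations, each
    # of some constant time c (c = 6 is an assumed value).
    T1 = n + n**2 + 5 * n**3
    return T1 / (c * n)

print(naive_amdahl_bound(1000))           # close to 1.25
print(speedup_with_n2_processors(1000))   # grows as O(n^2)
```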

These problems are caused by a mix of serial and parallel parts in the nested program structure. Branching and procedures will certainly introduce further difficulties in the calculation of s_w and p_w.

Another question which arises is to what extent it is correct to calculate one global measure of the parallel work p_w as the sum of the parallel work in the various parts of the program.

How many processors can effectively be exploited?

Now assume that the parallel work p_w has been calculated, and that overlapping of all parallel work is possible. According to the theory, in this case N = p_w processors may be used to execute the parallel work in one time unit. This is in most practical cases very unrealistic: distributing all the parallel work evenly among the N processors may be impossible. A more detailed analysis, which takes into account the varying number of processors which may be used in the execution of the various parts of the parallel work, is needed.

Chapter

The CREW PRAM Model

This chapter is devoted to the CREW PRAM model and how it may be programmed. The first section starts by describing the model as it was proposed for use in theoretical computer science by Fortune and Wyllie [FW, Wyl]. This description is then extended by some few details and assumptions that are necessary to make the model into a vehicle for implementation and execution of algorithms. The following two sections describe how the CREW PRAM model may be programmed, and a high level notation for expressing CREW PRAM programs is outlined. The last section ends the chapter by describing a top down approach that has shown to be useful for developing parallel programs on the CREW PRAM simulator system. The approach is described by an example.

The Computational Model

    Large parallel computers are also difficult to program. The situation becomes intolerable if the programmer must explicitly manage communication between processors.
    A. G. Ranade, S. N. Bhatt and S. L. Johnsson, in The Fluent Abstract Machine [RBJ]

    The reason is that most all parallel algorithms are described in terms of the PRAM abstraction, which is a practice that is not likely to change in the near future.
    Tom Leighton, in What is the Right Model for Designing Parallel Algorithms? [Lei]

[Figure: The CREW PRAM model. One finite program controls an unbounded number of processors P_1, P_2, ..., P_i, ...; each processor P_i has its own local memory L_i, and all processors are connected to a shared global memory.]

Figure: The CREW PRAM model

    Based on quite a few years of experience in designing parallel algorithms, the author believes that difficulties in designing parallel algorithms will make any parallel computer that does not support the PRAM model of parallel computation less than competitive, as a general purpose computer, with the strongest non-parallel computer that will become available at the same time.
    Uzi Vishkin, in PRAM Algorithms: Teach and Preach [Vis]

In this section we first describe the original CREW PRAM model as it was proposed by Fortune and Wyllie [FW, Wyl]. This is followed by a documentation of some additional properties that it has been necessary to define in order to make the CREW PRAM model into a vehicle for executing and measuring parallel algorithms. The main characteristics of the CREW PRAM have been collected in small definitions with the heading "CREW PRAM property". This has been done for easy reference.

The Main Properties of The Original PRAM

A CREW PRAM is a very simple and general model of a parallel computer, as outlined in the figure above. It consists of an unbounded number of processors

Instruction        Function
LOAD  operand      Transfer the operand to/from the
STORE operand      accumulator from/to memory
ADD   operand      Add/subtract the value of the
SUB   operand      operand to/from the accumulator
JUMP  label        Unconditional branch to label
JZERO label        Branch to label if accumulator is zero
READ  operand      See text
FORK  label        See text
HALT               See text

Figure: CREW PRAM instruction set. From [Wyl].

which are connected to a global memory of unbounded size. The processors are controlled by a single finite program.

Processors

A CREW PRAM has an unbounded number of equal processors. Each has an unbounded local memory, an accumulator, a program counter, and a flag indicating whether the processor is running or not.

Instruction set

In the original description of the PRAM model, the instruction set for the processors was informally outlined as shown in the figure above. This very small and simple instruction set was sufficiently detailed for Fortune and Wyllie's use of the PRAM as a theoretical model. Each operand may be a literal, an address, or an indirect address. Each processor may access either global memory or its local memory, but not the local memory of any other processor. Indirect addressing may be through one memory to access another. There are three instructions that need further explanation:

READ  Reads the contents of the input register specified by the operand and places that value in the accumulator of the processor executing the instruction.

FORK  If this instruction is executed by processor P_i, P_i selects an inactive processor P_j, clears P_j's local memory, copies P_i's accumulator into P_j's accumulator, and starts P_j running at the label which is given as part of the instruction.

HALT  This instruction causes a processor to stop running.

Global memory

A CREW PRAM has an unbounded global memory shared by all processors. Each memory location may hold an arbitrarily large non-negative integer.

CREW PRAM property: Concurrent read, exclusive write.
Simultaneous reads of a location in global memory are allowed, but if two or more processors try to write into the same global memory location simultaneously, the CREW PRAM immediately halts. [Wyl]

CREW PRAM property: Reads before writes.
Several processors may read a location while one processor writes into it; all reads are completed before the value of the location is changed. [Wyl]

CREW PRAM property: Simultaneous reads and writes.
An unbounded number of processors may write into global memory as long as they write to different locations. An unbounded number of processors may read any location at any time. (Implicit in [Wyl].)
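As an illustration of how a simulator might enforce these three properties within one synchronous step, consider the following sketch (our code, not the actual simulator):

```python
def crew_step(memory, reads, writes):
    # reads: list of addresses; writes: list of (address, value) pairs
    # issued by distinct processors in the same time step.
    addresses = [a for a, _ in writes]
    if len(addresses) != len(set(addresses)):
        # Exclusive write: two writes to one location halt the machine.
        raise RuntimeError("write conflict: the CREW PRAM halts")
    # Reads before writes: all reads see the old values.
    results = [memory[a] for a in reads]
    for a, v in writes:
        memory[a] = v
    return results

mem = {0: 7, 1: 0}
# Two processors read location 0 while a third writes it.
print(crew_step(mem, [0, 0], [(0, 42)]))   # [7, 7]; mem[0] becomes 42
```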

Local memory

Each processor has an unbounded local memory. Each memory location may hold an arbitrarily large non-negative integer.

Input and output

The input and output features of the original PRAM reflect its intended use in parallel complexity theory. The input to a CREW PRAM is placed in the input registers, one bit per register. When used as an acceptor, the CREW PRAM delivers its output (0 or 1) in the accumulator of processor P_1. When used as a transducer, the output is delivered in a designated area in the global memory.

Footnote: In some contexts, Wyllie permitted the input registers to hold integers as large as the size of the input object.

One finite program

All processors execute the same program. The program is defined as a finite sequence of labelled instructions from the instruction set shown above. Fortune and Wyllie allowed several occurrences of the same label, resulting in non-deterministic programs.

Operation

The CREW PRAM starts with all memory cleared, the input placed in the input registers, and with the first processor, P_1, storing the length of the input in its accumulator. The execution is started by letting P_1 execute the first instruction of the program.

CREW PRAM property: Synchronous operation.
At each step in the computation, each running processor executes the instruction given by its program counter in one unit of time, then advances its program counter by one, unless the instruction causes a jump. [Wyl]

CREW PRAM property: Processor requirement.
The number of processors required to accept or transform an input, i.e. to do a computation, is the maximum number of processors that were active (started by a FORK and not yet stopped by a HALT instruction) at any instant of time during the PRAM's computation. [Wyl]

From Model to Machine: Additional Properties

Difficulties Encountered When Implementing Algorithms on a Computational Model

The original work by Fortune and Wyllie does not give descriptions of the PRAM model which are detailed enough for our purpose, which is to simulate real parallel algorithms as if they were executed on a materialised PRAM. The simplicity of computational models is achieved by adopting technically unrealistic assumptions and by hiding a lot of uninteresting details.

Hidden details are typically how various aspects of the machine model should be realised. If it is commonly accepted that a certain function f may be realised within certain limits that do not affect the kind of analysis that is performed with the model, the details of how f is realised may be omitted. For instance, if the model is used for deriving order expressions for the running time of programs, the detailed realisation of the function f is uninteresting as long as it is known that it may be done in constant time. It is therefore no surprise that it is difficult to find detailed descriptions of how the CREW PRAM operates.

Nevertheless, some implicit help may be found in Wyllie's thesis [Wyl]. In addition to its contributions to complexity theory, it describes well how the PRAM may be programmed to solve various example problems. Although these descriptions are at a relatively high level, they indicate quite well how the PRAM was intended to be used for executing parallel algorithms.

Implementing parallel algorithms as complete and measurable programs made it necessary to make a lot of detailed implementation decisions. These decisions required modifications and extensions of the original CREW PRAM properties defined above. They are intended to be a common sense extension of Wyllie's CREW PRAM philosophy, and are presented in the following.

Properties of The Simulated CREW PRAM

In the current version of the simulator, the time consumption must, with a few exceptions, be specified by the programmer for the various parts of the program. In a more complete system, this should be done automatically by a compiler.

Extended instruction set

CREW PRAM property: Extended instruction set.
The original instruction set shown above is extended to a more complete, but still simple, instruction set. Nearly all instructions use one time unit; some few, such as divide, use more. The time consumption of each instruction is defined by parameters in the simulator, and is therefore easy to change.

Input and output

Our use of the CREW PRAM model makes it natural to read input from the global memory and to deliver output to global memory.

Operation

Non-deterministic programs are not allowed in the simulator.

CREW PRAM property: Cost of processor allocation.
Allocation of n processors on a CREW PRAM, initiated by one other processor, can be done by these n processors in at most C1 + C2 ⌊log n⌋ time units.

In the current version of the simulator, C1 and C2 have fixed values, obtained by implementing an asynchronous CREW PRAM algorithm for processor allocation. The algorithm uses the FORK instruction in a binary tree structured chain-reaction.
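A minimal sketch of why such a chain-reaction needs logarithmic time (our illustration; the real allocation algorithm of the simulator also distributes processor numbers and processor set addresses):

```python
def fork_rounds(n):
    # In every round, each already active processor executes one FORK
    # and activates one inactive processor, doubling the active count.
    active, rounds = 1, 0
    while active < n:
        active *= 2
        rounds += 1
    return rounds   # ceil(log2 n), matching the C1 + C2*log(n) cost form

for n in (1, 2, 100, 1024):
    print(n, fork_rounds(n))
```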

Processor set and number

When programming parallel algorithms, it is often very convenient to have direct access to the processor number of each processor. This feature was not included in the original description of the PRAM. In a PRAM used for developing and evaluating parallel algorithms, it seems natural to include such a feature by letting each processor have a local variable which holds the processor number.

There are also cases where the processors need to know the address in global memory of their processor set data structure. The processor set concept is introduced below; see also Appendix A. It is natural to assign this information to each processor when it is allocated.

CREW PRAM property: Processor set and processor number.
After allocation, each processor knows the processor set it currently is a member of, and its logical number in this set (numbered consecutively). The processor number is stored in a local variable called p.

The use of the simulator and related tools is documented in Appendix A.

CREW PRAM Programming

This section describes how the CREW PRAM model may be programmed in a high level, synchronous MIMD programming style. The style follows the original thoughts on PRAM programming described by James Wyllie in his thesis [Wyl]. We present a notation called Parallel Pseudo Pascal (PPP) for expressing CREW PRAM programs. The section starts with a description of some other opinions on the PRAM model and its programming; they are included to indicate that the state of the art is far from consistent and well-defined.

Footnote: One such case is the resynchronisation algorithm outlined later in the chapter.

Common Misconceptions about PRAM Programming

The PRAM model is SIMD

A common misconception about the CREW PRAM model is that it is a SIMD model according to Flynn's taxonomy [Fly]. In a recent book on parallel algorithms [Akl], the author writes about "shared-memory SIMD computers: This class is also known in the literature as the Parallel Random Access Machine (PRAM) model." It may be that a large part of the algorithms presented for the CREW PRAM model is SIMD style. Michael Quinn [Qui] and Gibbons and Rytter [GR] do also classify the PRAM model as SIMD.

Using the PRAM as a SIMD model is a choice made by the programmers, not a limitation enforced by the model. The CREW PRAM model implies a single program, which does not necessarily imply a single instruction stream. The crucial point here is the local program counters, as described in Wyllie's PhD thesis:

    [Wyl]: each processor of a PRAM has its own program counter and may thus execute a portion of the program different from that being executed by other processors. In the notation of [Fly], the PRAM is therefore a MIMD.

Stephen Cook expresses the same view:

    [Coo]: The PRAM of Fortune and Wyllie: different parallel processors can be executing different parts of their program at once, so it is multiple instruction stream.

The PRAM model is asynchronous

In [Sab], the author compares his own model for parallel programming with the PRAM model. He starts by stating that the processors in the PRAM model operate asynchronously, which certainly is not the case for the original PRAM model. Further, he writes about the PRAM model that it is confusing and hard to program. Probably, it is asynchronous shared memory multiprocessors the author considers to be the PRAM model.

The author states that the asynchronous MIMD power is overwhelming, and that PRAM programmers deal with this power by writing SIMD style programs for it; he claims that all parallel algorithms in the proceedings of the ACM Symposia on Theory of Computing (STOC) are of data parallel form, i.e. SIMD. This is not true. Algorithms in the proceedings of STOC and other similar symposia are typically described in a high-level mathematical notation, where the choice between SIMD and MIMD programming style to a large extent is left open to the programmer. Parts of an algorithm are most naturally coded in SIMD style, while other parts most naturally are expressed in MIMD style.

Notation: Parallel Pseudo Pascal (PPP)

Wyllie outlined a high level pseudo code notation called parallel pidgin algol for expressing algorithms on a CREW PRAM. The notation introduced here, called Parallel Pseudo Pascal (PPP), is intended to be a modernisation of Wyllie's notation. PPP is an attempt to combine the pseudo language notation called super pascal, used for sequential algorithms by Aho, Hopcroft and Ullman in [AHU], with the ability to express parallelism and processor allocation as it is done in parallel pidgin algol.

Informal specification of PPP

The PPP notation is outlined in the table below. Since PPP is pseudo code, it should only be used as a framework for expressing algorithms. The PPP programmer should feel free to extend the notation and use his or her own constructions (see the last statement in the list below).

A Statement may be any of the alternatives shown in the table, or a list of statements separated by semicolons and enclosed within begin ... end. Variables will normally not be declared; their definition will be implicit from the context, or explicitly described.

It is crucial to separate local variables from global variables. This is done in PPP by underlining global variables. Local variables and procedure names are written in italics.

Statements 1 to 6 should be self-explaining, and statement 9 should be familiar to all who have experienced pseudo code. Here, only statements 7 and 8 need some further explanation.

Table: Informal outline of the Parallel Pseudo Pascal (PPP) notation

1a. LocalVariableName := Expression
1b. GlobalVariableName := Expression
2.  if Condition then Statement { else Statement }
3.  while Condition do Statement
4.  for Variable := InitialValue to FinalValue do Statement
5.  procedure ProcedureName (FormalParameterList); Statement
6.  ProcedureName (ActualParameterList)
7.  assign ProcessorSpecification { to ProcSetName }
8.  for each processor ProcessorSpecification { where Condition } do Statement
9.  any other well defined statement

Processor allocation

The assign statement is used to allocate processors. An example might be:

    assign a processor to each element in the array A to AProcs

The to ProcSetName clause may be omitted in simple examples where only one set of processors is used. The assign statement is realised by the processor allocation procedure, which assigns local processor numbers to the processors. Thus processor no. i may refer to the i-th element in the global array A by A[p], since p = i. Remember that p is the local variable which always stores the processor number. The actual number of processors allocated by the statement may be run time dependent.

Processor activation

Statement type 8 is the way to specify that code should be executed in parallel. The processor specification may simply be the keyword in followed by the name of a processor set; in that case, all the processors in that processor set will execute the statement in parallel.

The where clause is used to specify a condition which must be true for a processor in the processor set if the statement is to be executed by that processor. A rather low-level PPP example, which might have been taken from a PPP program of an odd-even transposition sort algorithm, is shown in the figure below.

for each processor in AProcs where p is odd and p < n do begin
    temp := A[p+1];
    if temp < A[p] then begin
        A[p+1] := A[p]; A[p] := temp
    end
end

Figure: Example program demonstrating the where clause. A is a global array, temp and n are local variables, and p is the local variable holding the processor number.
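In sequential Python, the comparison-exchange phase of the figure can be mimicked as follows (our sketch; on the CREW PRAM the loop body is executed by all the odd-numbered processors simultaneously, not one after the other):

```python
def odd_phase(A, n):
    # A[1..n] holds the items (A[0] is unused padding, keeping the
    # 1-based indexing of the PPP example).
    for p in range(1, n, 2):        # the processors with p odd, p < n
        temp = A[p + 1]
        if temp < A[p]:
            A[p + 1] = A[p]
            A[p] = temp

A = [None, 4, 3, 2, 1]
odd_phase(A, 4)
print(A[1:])    # [3, 4, 1, 2]
```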



Footnote: In this case, a ProcessorSpecification may be a text which implicitly describes a number of processors to be allocated, or simply "IntegerExpression processors".

Footnote: Here, a ProcessorSpecification may be prose (describing text) or, more formally, "in ProcSetName".

The Importance of Synchronous Operation

Synchronous operation simplifies

Parallel programs are in general difficult to understand. However, requiring that the processors operate synchronously may often make it much easier to understand parallel algorithms. Consider the simple program in the figure below. This example is taken from [Wyl].

for each processor in ProcSet do
    if cond(p) then
        x := f(y)
    else
        y := g(x)

Figure: Example program with unpredictable behaviour

The processor activation statement describes that the if statement shall be executed in parallel by all processors in the set called ProcSet. Assume that ProcSet contains two processors. Further, assume that one processor evaluates the condition cond(p) to true, while the other evaluates it to false. This may happen because the condition may refer to variables local to each processor, or to the processor number. x and y are global variables.

Now consider the else clause. Depending on the relative times required to compute the functions f and g, either the old or the new value of x will be used in the else clause. Thus it is difficult to argue about the precise behaviour of this program. As Wyllie points out [Wyl], the code generated from this program is a legitimate program for the PRAM model, but this kind of programming is not recommended.

A program with more self-evident behaviour can easily be obtained, as shown in the figure below. It is here assumed that cond(p) is unaffected by the first if statement.

The wait statement describes an explicit synchronisation, which is necessary to guarantee that the old values of x and y are used in the computation of f and g. Such synchronisations are in general valuable tools for restricting the space of possible behaviours, and they make it easier to understand parallel programs. This observation probably led Wyllie to define what is the most important property of high-level programming of the PRAM.

for each processor in ProcSet do begin
    if cond(p) then
        tx := f(y)
    else
        ty := g(x);
    wait for both processors to reach here;
    if cond(p) then
        x := tx
    else
        y := ty
end

Figure: The example program transformed to predictable behaviour by using explicit resynchronisation (the wait statement). The wait statement may be omitted from a PPP program; see the text.

CREW PRAM property: Synchronous statements.
If each processor in a set P begins to execute a PPP statement S at the same time, all processors in P will complete their executions of S simultaneously. (Adapted from [Wyl].)

This surely is a severe restriction, which puts a lot of burden on the programmer. However, the property is crucial in making PPP programs understandable and easy to analyse. From now on, we will assume the existence of a PPP compiler that, whenever possible, automatically ensures that the PPP statements are executed as synchronous statements. Therefore we may omit the wait statement in the figure above, since the PPP compiler, on reading the first if statement, will produce code such that all processors in ProcSet simultaneously start executing the second if statement.

Given the property of synchronous statements, the time needed to execute a general (possibly compound) PPP statement S is the maximum time over all processors executing S, plus the time needed to resynchronise the processors before they leave the statement.

The need for resynchronisation after each high-level statement may seem to imply a lot of overhead associated with high-level programming. As will be shown below, this is not the case: a very simple resynchronisation code may be added automatically by the compiler in most cases.

Compile Time Padding

Ensuring the synchronous statements property may in most cases be done by so-called compile time padding. When the PPP compiler knows the execution time of the then and else clauses, inserting an appropriate amount of waiting (no-ops) in the shorter of these will make the whole if statement a synchronous statement.

Automatic compile time padding is in fact a necessity for making it possible to express algorithms in a synchronous high-level language. The exact execution times of the various constructs should be hidden in a high-level language; it is therefore neither desirable nor possible for the programmer to preserve the property of synchronous statements by hand.

Compile time padding should be used by the compiler wherever possible. It is simple, and it does not increase the execution time of the produced code: only the non-critical execution paths are padded until they have the same length as the critical (longest) execution path. The alternative, which is to insert code for explicit resynchronisation, may be simpler for the compiler, but should be avoided since it introduces a significant increase in the running time (see the section on explicit resynchronisation below).
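The padding rule described above can be sketched in a few lines. The following Python fragment is an illustration only (not part of the thesis's PPP compiler); it assumes, as the text does, that the compiler knows the cost of both clauses in time units:

```python
# Hypothetical sketch of compile-time padding for a two-way branch.
# Branch costs are assumed known to the compiler; the shorter clause
# is padded with no-ops so every processor leaves the IF at the same time.

def pad_branches(then_cost: int, else_cost: int) -> tuple[int, int]:
    """Return (then_pad, else_pad): number of no-ops appended to each clause."""
    longest = max(then_cost, else_cost)
    return longest - then_cost, longest - else_cost

then_pad, else_pad = pad_branches(then_cost=7, else_cost=3)
assert (then_pad, else_pad) == (0, 4)
# Both paths now cost max(7, 3) = 7 time units, so the if statement
# becomes synchronous without any run-time synchronisation code.
```

Note that the padding is computed entirely at compile time; the critical (longest) path is never lengthened.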

Helping the Compiler

There are cases where compile time padding cannot be used, but where the use of general processor resynchronisation can be avoided by clever programming. Consider the following example, taken from [Wyl]. The program in the first figure below is a simple-minded program to set an array A of nonnegative integers to all zeros.

In this example it is not possible for the compiler to know the execution time of the while loop. It must therefore insert code for general resynchronisation just after the while loop. This may lead to a substantial increase in the execution time.

However, if the programmer knows the value of the largest element in the array A, M = max_i A[i], this fact may be used to help the compiler, as shown in the program in the second figure below. Here the compiler knows that we will get M iterations of the for loop for all the processors. Further, it can easily make the if statement synchronous by inserting an else wait(t), where t is

(A less artificial example is given later in the thesis.)

    begin
        assign a processor to each element in A;
        for each processor do
            while A[p] > 0 do
                A[p] := A[p] - 1
    end

Figure: Example program which requires general resynchronisation to be inserted by a PPP compiler.

    begin
        assign a processor to each element in A;
        for each processor do
            for i := 1 to M do
                if A[p] > 0 then
                    A[p] := A[p] - 1
    end

Figure: Example program where the need for general resynchronisation has been removed.

the time used to execute the then clause. All processors will then simultaneously exit from the for loop.
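The effect of replacing the data-dependent while loop by a fixed bound M can be simulated directly. The Python sketch below is an illustration, not the thesis's PPP code; it models each synchronous step of all processors and counts the time units each one consumes:

```python
# Sketch: each "processor" decrements its array element once per iteration.
# With a data-dependent while loop the processors would finish at different
# times; iterating a fixed M = max(A) times lets every processor execute
# the same number of (possibly padded) steps and exit simultaneously.

def zero_array_lockstep(A):
    M = max(A)                      # assumed known to the programmer
    steps = [0] * len(A)            # time units consumed per processor
    for _ in range(M):              # same trip count for all processors
        for p in range(len(A)):     # simulate one synchronous step
            if A[p] > 0:
                A[p] -= 1
            # else: compiler-inserted wait(t), same cost either way
            steps[p] += 1
    return A, steps

A, steps = zero_array_lockstep([3, 0, 5, 2])
assert A == [0, 0, 0, 0]
assert len(set(steps)) == 1         # all processors used identical time
```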

Synchronous MIMD Programming: Example

Assume that we are given the task of making a synchronous MIMD program which computes the total area of a complex surface consisting of various objects. We will consider three cases that illustrate various aspects of synchronous MIMD programming.

Case 1

Assume that the surface consists of only three kinds of objects: Rectangle, Triangle or Trapezium. The program is sketched in PPP in the figure below.

The multiway branch statement demonstrates how the multiple instruction stream property may give faster execution and better



See for instance [HS] or [KR].

    { Each processor holds one object }
    if type of object is Rectangle then
        Compute area of Rectangle object
    else if type of object is Triangle then
        Compute area of Triangle object
    else
        Compute area of Trapezium object;
    { All processors should be here at the same time unit }
    Compute total sum    { O(log n) standard parallel prefix }

Figure: Synchronous MIMD program in PPP, case 1.

    Rectangle:  compute area, then wait
    Triangle:   compute area (no wait)
    Trapezium:  compute area, then wait

Figure: Compile time padding.

processor utilisation than on a SIMD machine. If the time used to compute the area of each kind of object is roughly equal, MIMD execution will be roughly three times faster than SIMD execution for this case with three different objects.

As expressed by the comment just before the final statement, the processors should leave the multiway branch statement simultaneously. In this case this is easily achieved by compile time padding, since the time used to compute the area of each kind of object may be determined by the compiler. If we assume that the most lengthy operation is to calculate the area of a Triangle, the effect of compile time padding can be illustrated as in the figure above.

Case 2

Now assume that each object is a polygon with a bounded, known number of edges. As shown in the figure below, the area of each object (polygon) is computed by a

    { Each processor holds one polygon }
    Area := ComputeArea(this polygon);
    { All processors should be here at the same time unit }
    Compute total sum    { O(log n) standard parallel prefix }

Figure: Synchronous MIMD program in PPP, case 2.

    { Each processor holds one polygon }
    Area := ComputeArea(this object);
    SYNC;    { O(log n) time }
    { All processors should be here at the same time unit }
    Compute total sum    { O(log n) standard parallel prefix }

Figure: Synchronous MIMD program in PPP, case 3.

function called ComputeArea. Knowing the maximum number of edges in a polygon, the compiler may calculate the maximum time used to execute the procedure, and produce code which will make all calls to the procedure use this maximum time.

In many cases the compiler may calculate the maximum time needed by a procedure and insert code to make the time used by the procedure a fixed, known quantity. This is important to provide high-level synchronous programming (see the CREW PRAM property Synchronous statements above).

Case 3

We now consider the case that each object is an arbitrarily complex polygon, i.e. with an unknown number of edges, and that the size of the most complex polygon is not known by the processors. In this situation it is impossible for the compiler to know the maximum execution time of ComputeArea. The best it can do is to insert an explicit synchronisation statement (SYNC) just after the return from the procedure. SYNC is a synchronisation mechanism provided by the CREW PRAM simulator, and it is described in the next subsection. It is able to synchronise n processors in O(log n) time. In this case the use of SYNC does not influence the time complexity of the computation, since the succeeding statement also requires O(log n) time.

(Technically, this enforcing of the maximum time consumption may be done by inserting dummy operations and/or iterations, or by measuring the time used by reading the global clock.)

General Explicit Resynchronisation

There are cases where general resynchronisation cannot be avoided. In general, the number of processors leaving a statement must match the number of processors which entered it. The number of entering processors is known at run time, so the task reduces to counting the number of processors which have finished the statement and are ready to leave it.

One of the main purposes of our CREW PRAM model simulator is to be a vehicle for polylogarithmic algorithms. Many such algorithms use at least O(n) processors, where n is the problem size. The use of a simple synchronisation scheme with a linear (O(n)) time requirement would make it impossible to implement parallel algorithms using explicit processor resynchronisation with sublinear time requirements.

However, as Wyllie claims, synchronisation of n processors may be done in O(log n) time. He does not describe the implementation details, but they are rather straightforward. One implementation is outlined in the next subsection.

Since processor synchronisation may be necessary in CREW PRAM programs, the time requirement modelling should use a realistic measure for the time used by resynchronisations. One will soon realise that the time needed by most resynchronisation algorithms depends on when the various processors start to execute the algorithm. Thus the time requirement cannot be computed as an exact function of solely the number of synchronising processors. One way to exactly model the time usage is to implement the synchronisation.

Knowing the size of the most complex polygon would make it possible to use the techniques described for case 2.

Synchronisation Algorithm

We will now outline the implementation of processor resynchronisation which is used in the CREW PRAM simulator. It should be noted that this is just one of many possible solutions. The subsection may be omitted by readers who are not interested in what happens below the high-level language PPP.

Assume that n processors finish a statement at unpredictable, different times. When all have finished, they should as fast as possible proceed simultaneously (synchronously) to the next statement. Before this happens, they must wait and synchronise.

As is often the case for O(log n) algorithms, the necessary computation may be done by organising the processors in a binary tree structure. Assume that the processors are numbered 1 to n. The waiting time may be used to let each processor count the number of processors in the subtree below (and including) itself which have finished the statement. This is done in parallel by all finished processors. The number of finished processors will bubble up to the root of the whole tree, which holds the total number of finished processors.

When all have started to execute the resynchronisation algorithm, the time used before this is known by the root processor is given by the time used to propagate the information in parallel from the leaves to the root of the tree. This is surely O(log n). A similar O(log n) time algorithm is the synchronisation barrier reported in [Lub].
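The logarithmic bound of this tree-based scheme is easy to check with a small model. The Python fragment below is illustrative only (it is not the simulator's implementation): it counts the rounds needed to combine n leaf counts up a binary tree, one level per time unit, after the last processor has arrived:

```python
import math

# Model of the tree-based resynchronisation: finished processors
# propagate counts towards the root, one tree level per time unit,
# so n processors synchronise in O(log n) time after the last arrival.

def sync_time_after_last_arrival(n: int) -> int:
    """Rounds needed to combine n leaf counts up a binary tree."""
    rounds = 0
    active = n
    while active > 1:               # each round halves the active nodes
        active = (active + 1) // 2
        rounds += 1
    return rounds

assert sync_time_after_last_arrival(1024) == 10   # == log2(1024)
assert sync_time_after_last_arrival(1000) == math.ceil(math.log2(1000))
```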

The figure below illustrates the performance of the resynchronisation algorithm for processor sets of various sizes (numbers of processors). For each size, ten cases have been measured. In each case, every processor uses a random number of time units (within a fixed range) before it calls the synchronisation procedure. The synchronisation time is measured as the time from when the last processor calls the synchronisation procedure until all the processors have finished the synchronisation.

Parallelism Should Make Programming Easier

This section describes some of the thoughts presented at the Workshop On The Understanding of Parallel Computation under the title Parallelism Should Make Programming Easier [Nat].

The motivation for the section is twofold. First of all, it gives a brief explanation of my opinion that parallelism may be used to make programming easier, and it briefly mentions some related work. Secondly, two examples of easy parallel programs are provided. These examples illustrate the use of the PPP notation introduced in the previous section.

Introduction

Parallelism should not add to the software crisis.

The so-called software crisis is still a major problem for computer industry and science. Contemporary parallel computers are in general regarded to be even more difficult to program than sequential computers. Nevertheless, the need for high performance has made it worthwhile to develop complex program systems for special applications on parallel computers.

The use of parallelism should not add to the software crisis. For general purpose parallel computing systems to become widespread, I think it is necessary to develop programming methods that strive for making parallel programming easier.

How can parallelism make programming easier?

There are at least two main reasons that the use of parallelism in future computers may make programming easier.

The main motivation for using parallel processing is to provide more computing power to the user. This power may, in general, make it possible for the programmer to use simpler and "dumber" algorithms, to some extent. It is well known that coding to achieve maximum efficiency is difficult and consuming of programmer time. Consequently, parallelism may increase the number of cases where we can afford to use simple brute force techniques. Also, increased computing power, or a large number of processors, may

The use of the PPP notation is the reason for placing this relatively high-level section between two technical sections.

Booch in [Boo]: "The essence of the software crisis is simply that it is much more difficult to build software systems than our intuition tells us it should be."

Figure: Performance of the resynchronisation algorithm provided by the CREW PRAM simulator (synchronisation time plotted against number of processors). One symbol is used to mark the average time of ten random cases; another symbol marks each individual random case. See the text.

make it more common to use redundancy in problem solving strategies, which again may imply conceptually simpler solutions. See the odd-even transposition sort example below.

Many of our programs try to reflect various parts of, or processes in, the real world. There is no doubt that the real world is highly parallel. Therefore, the possibility of using parallelism will in general make it easier to reflect real world phenomena and natural solution strategies. See the parallel insertion sort example below.

Related work

Most computer scientists would of course prefer easy programming. Consequently, several programming languages have been proposed that aim at providing this. In this paragraph I will mention some of these, to provide a few pointers to further reading. I will not judge the relative merits of the various approaches.

The August 1986 issue of IEEE Computer was titled Domesticating Parallelism, and contains articles describing several of these new languages: Linda [ACG], Concurrent Prolog [Sha], ParAlfl (a parallel functional language) [Hud], and others.

Linda consists of a small set of primitives (or operations) that is used by the programmer to construct explicitly parallel programs. It supports a style of programming that to a large extent makes it unnecessary to think about the coupling between the processors. See also [CG].

Concurrent Prolog is just one of several logic programming languages designed for parallel programming and execution. Numerous parallel variants of LISP have also been developed; one of the first was Multilisp [Gel, Hal].

Strand is a general purpose language for concurrent programming [FT], and is currently available on various parallel and uniprocessor computers.

UNITY, by Chandy and Misra [CM], is a new programming paradigm that clearly separates the program design, which should be machine independent, from the mapping to various architectures (serial or parallel). See also [BCK].

SISAL is a so-called single assignment language derived from the dataflow language VAL. Unlike dataflow languages, SISAL may be executed on sequential machines, shared memory multiprocessors and vector processors, as well as dataflow machines [OC]. See also [CF] and [Lan].

For traditional sequential programming, there is far from any consensus about which of the various programming paradigms is most successful in providing easy programming. We must therefore expect similar discussions for many years about the corresponding parallel programming paradigms.

A very important approach to avoid that the use of parallelism adds to the software crisis is simply to let the programmers continue to write sequential programs. The parallelising may be left to the compiler and the run-time system. A lot of work has been done in this area; see for instance [Pol], [Gup], [KM]. The method has the great advantage that old programs can be used on new parallel machines without rewriting.

Synchronous MIMD Programming Is Easy

The use of synchronous MIMD programming to implement the algorithms studied and described in this thesis was motivated by a wish to program in a relatively high-level notation which was consistent with the programming philosophy used by James Wyllie [Wyl]. It came as a surprise that high-level synchronous MIMD programming on a CREW PRAM model seems to be remarkably easy.

Easy programming

The CREW PRAM property Synchronous statements, described above, implies that the benefits of synchronous operation are obtained at the source code level, and not only at the instruction set level. It gives the advantages of SIMD programming, i.e. programs that are easier to reason about [Ste]. However, we have kept the flexibility of MIMD programming. This was illustrated by the example above, where a MIMD program was shown that may compute the area of three different kinds of objects simultaneously. Such natural and efficient handling of conditionals is not possible within the SIMD paradigm.

At first, making the statements synchronous may be perceived as a heavy burden on the programmer. However, my experience is that it is a relatively easy task after some training, and that it may be helped a lot with simple tools.

Note that the synchronous operation property may be violated by the programmer if wanted. Asynchronous behaviour can be modelled on a CREW PRAM, with the only restriction that the time difference between any two events must be a whole number of CREW PRAM time units. This discretisation should not be a problem in practice.

(On a SIMD system, nested conditionals can in principle reduce processor utilisation exponentially in the number of nesting levels [Ste].)

Easy analysis

Synchronous programs exhibit, in general, a much more deterministic behaviour than asynchronous programs. This implies easier analysis.

The process of making high-level statements synchronous does also simplify analysis. This is demonstrated by case 1 of the example above, where compile time padding was used to ensure that a computational task would use an equal amount of time in three different cases. Here, redundant operations (i.e. wait operations) simplify the analysis.

Making high-level statements synchronous may also make it necessary to code a procedure such that its time consumption is independent of its input parameters. A natural approach is then to code the procedure for solving the most general case, and to use this code on all cases. This removes the possibility of saving some processor time units on simple cases but, more importantly, it simplifies the programming of the procedure.

A consequence of synchronous statements is that the time consumption of an algorithm will typically be less dependent on the actual nature of the problem instance (such as presortedness [Man] for sorting). This makes it easier to use the algorithm as a substep of a larger synchronous program.

Easy debugging

    The main problems associated with debugging concurrent programs are increased complexity, the probe effect, nonrepeatability, and the lack of a synchronised clock.

    McDowell and Helmbold in Debugging Concurrent Programs [MH]

The debugging of the synchronous MIMD programs developed as part of the work reported in this thesis has been quite similar to debugging of uniprocessor programs. The synchronous operation, and the execution of the programs on a simulator, have eliminated most of the problems with concurrent debugging. The CREW PRAM model implicitly provides a synchronised clock known by all the processors. Synchronous operation will in general imply an increased degree of repeatability, and full repeatability is possible on the simulator system. The simulator does also provide monitoring of the programs without any disturbance of the program being evaluated, i.e. no probe effect.

Not all of these benefits would be possible to obtain on a CREW PRAM machine. However, a proper programming environment for such a machine should contain a simulator for easy debugging in the early stages of software testing.

Having experienced the benefits of synchronous MIMD programming after only a few months of training, and with the use of very modest prototype tools, I am convinced that this paradigm for parallel programming is worth further study in future research projects.

Example: Odd-Even Transposition Sort

The purpose of this example is to illustrate how parallelism, through increased computing power and redundancy, may give a faster and simpler algorithm. We know that comparison-based sorting on a sequential computer cannot be done faster than O(n log n) [Akl, Knu]. This is achieved by several algorithms, for instance Heapsort, invented by Williams [Wil]. The famous Quicksort algorithm, invented by C. A. R. Hoare [Hoa], is O(n log n) time on average, but O(n^2) in the worst case [AHU].

However, if we have n processors available for sorting n items, it is very easy to sort faster than O(n log n) time if we allow a higher cost than O(n log n): O(n) time sorting is easily obtained by several O(n^2) cost parallel sorting algorithms.

Perhaps the simplest parallel sorting algorithm is the odd-even transposition sort algorithm. The algorithm is now being attributed [Qui, Akl] to Howard Demuth and his PhD thesis from 1956; see [Dem]. The algorithm is so well known that I will only outline it briefly. Readers unfamiliar with the algorithm are referred to one of [Qui], [Akl] or [BDHM] for more detailed descriptions.

The odd-even transposition sort algorithm is based on allocating one processor to each element of the array which is to be sorted. See the example in the figure below, which shows the sorting of a small set of items. If n, the number of items to be sorted, is an even number, we need n/2 iterations. Each iteration consists of an odd phase followed by an even phase. The processors are numbered starting at 1. In the odd phase, each processor with an odd number compares its element in the array with the element allocated to the processor on the

Figure: Sorting numbers by odd-even transposition sort; two iterations are shown, each consisting of an odd phase followed by an even phase. The operation of a processor comparing two numbers is marked in the figure.

right side. In the example, each odd-numbered processor (1, 3, and so on) compares its own element with that of its right-hand neighbour. If a processor finds that the two elements are out of order, they are swapped. In even phases, the same is done by the even-numbered processors.

A PPP program for the algorithm is given in the figure below. Note that n processors are needed.

I would guess that most readers familiar with this algorithm, or similar parallel algorithms, will agree that it is simpler than heapsort and quicksort. In my opinion it is also simpler than straight insertion sort, which is commonly regarded to be one of the simplest uniprocessor sorting algorithms (see the next example).

Example: Parallel Insertion Sort

This example is provided to show how the use of parallelism may make it easier to reflect a natural solution strategy.

One of the simplest uniprocessor algorithms is straight insertion sort. This is the method used by most card players [Wir]. A good description is found in Bentley's book Programming Pearls [Ben].

The method is illustrated in the figure below. One starts with a sorted sequence of length one. The length of this sequence is increased by one

The reader may have observed that only n/2 processors are active at any time during the execution of the algorithm. A slightly different implementation that takes this into account is described in the section on simulator programming below.

    assign n processors to ProcArray;
    for each processor in ProcArray begin
        for Iteration := 1 to n/2 do begin
            if p is odd then
                if A[p] > A[p+1] then Swap(A[p], A[p+1]);
            if p is even then
                if A[p] > A[p+1] then Swap(A[p], A[p+1])
        end
    end

Figure: Odd-even transposition sort expressed in the PPP notation.
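The PPP program above can also be rendered as a small sequential simulation. The Python sketch below is an illustration only (not the thesis's PPP/PIL code); the swaps within each phase touch disjoint pairs, which is exactly what would allow them to run in parallel:

```python
# Odd-even transposition sort: in each of ceil(n/2) iterations, the "odd"
# comparisons (1-based pairs 1-2, 3-4, ...) form one phase and the "even"
# comparisons (2-3, 4-5, ...) the next. 1-based processor numbers map to
# 0-based Python indices.

def odd_even_transposition_sort(A):
    n = len(A)
    for _ in range((n + 1) // 2):
        for start in (0, 1):                  # odd phase, then even phase
            for p in range(start, n - 1, 2):  # disjoint pairs: these swaps
                if A[p] > A[p + 1]:           # could all run in parallel
                    A[p], A[p + 1] = A[p + 1], A[p]
    return A

assert odd_even_transposition_sort([5, 2, 9, 1, 7, 3]) == [1, 2, 3, 5, 7, 9]
```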

Figure: Straight insertion sort of numbers. The sorted part grows to the left of the unsorted part; the figure shows the start configuration and the state after each of the first iterations.

    T        for Next := 2 to n do begin
                 { Invariant: A[1..Next-1] is sorted }
    T            j := Next;
    F,M,I        while j > 1 and A[j-1] > A[j] do begin
                     Swap(A[j-1], A[j]);
                     j := j - 1
                 end
             end

Figure: Straight insertion sort with one processor [Ben].

number in each iteration. Each iteration can be perceived as consisting of four tasks:

T: Take next. This corresponds to picking up the card to the right of the sorted sequence in the figure above.

F: Find the position where this card should be inserted in the sorted sequence (the numbers to the left of it).

M: Make space for the new card in the sorted sequence.

I: Insert the new card at the freed position.

The algorithm is shown in the figure above. The code corresponds to the simplest version of the algorithm presented in Bentley's book [Ben]. To the left of each statement, the letters T, F, M and I have been used to show how the various parts of this code model the four main tasks just described. The three tasks of finding the right position, making space, and inserting the next card are all taken care of by the simple while loop.
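For comparison with the parallel version below, the Bentley-style sequential algorithm can be written directly in Python (an illustration; the thesis presents it in PPP):

```python
# Straight insertion sort with one processor, with the four tasks marked
# as in the discussion: T(ake), then F(ind)/M(ake space)/I(nsert), the
# last three all handled by the single while loop.

def insertion_sort(A):
    for next_pos in range(1, len(A)):      # T: take the next "card"
        j = next_pos
        while j > 0 and A[j - 1] > A[j]:   # F/M/I: one while loop
            A[j - 1], A[j] = A[j], A[j - 1]
            j -= 1
    return A

assert insertion_sort([4, 1, 3, 2]) == [1, 2, 3, 4]
```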

In my opinion, the straight insertion sort algorithm as performed by card players is easier to model and represent by using parallelism. See the figure below. Processor p is allocated to A[p]. The position of the next card to be inserted in the sorted subsequence is stored in the variable Next, which is local to each processor. The sorted subsequence is represented by the processors to the left of Next. Statement (a) ensures that the following code is performed only by the processors allocated to numbers in the sorted

            assign n processors to ProcArray;
            for each processor in ProcArray
    T           for Next := 2 to n do
    (a)             if p < Next then begin
                        NextVal := A[Next];
    F (b)               if NextVal < A[p] then begin
    M (c)                   A[p+1] := A[p];
    F (d)                   if NextVal >= A[p-1] then
    I (e)                       A[p] := NextVal
                        end
                    end

Figure: Parallel CREW PRAM implementation of straight insertion sort, expressed in PPP. See the text.

sequence. Each of these processors starts by reading the value of the next card into the local variable NextVal.

The two statements marked F model how the right position for inserting the next card is found. Statement (b) selects all cards in the sorted sequence with higher value than the next card. To make space for the next card, these are moved one position to the right, as described by statement (c). Statement (d) specifies that if the value of the next card is larger than or equal to the card on the left side (A[p-1]), then we have found the right position for inserting the card (statement (e)).

I will not claim that this program is easier to understand, but it seems to me that its behaviour does more clearly reflect the sorting strategy performed by card players. The pattern matching used by card players to find the right position for insertion, and especially the making of space for the next card, are typical parallel operations. I hope to find more striking examples in the future.



See also the VLSI parallel shift sort algorithm by Arisland et al. [AAN].
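The round-synchronous behaviour of the parallel insertion sort sketched above can be simulated in Python. This is an illustration of my reading of the PPP sketch, not the thesis's code; in particular, the minus-infinity sentinel for the leftmost processor's boundary test is an assumption (the PPP version leaves that detail to the text). Taking a snapshot before the writes models the CREW PRAM rule that all synchronous reads precede the writes of the same statement:

```python
# Round-synchronous simulation of the parallel insertion sort
# (0-based indices; one synchronous round inserts one new "card").

def parallel_insertion_sort(A):
    n = len(A)
    for nxt in range(1, n):
        next_val = A[nxt]                 # every processor reads A[Next]
        old = A[:]                        # snapshot: reads precede writes
        for p in range(nxt):              # processors in the sorted part
            if next_val < old[p]:
                A[p + 1] = old[p]         # M: shift one position right
                left = old[p - 1] if p > 0 else float("-inf")  # sentinel
                if next_val >= left:      # F: found the insertion point
                    A[p] = next_val       # I: insert the new card
    return A

assert parallel_insertion_sort([3, 1, 4, 1, 5, 9, 2, 6]) == [1, 1, 2, 3, 4, 5, 6, 9]
```

Note that each round writes each cell at most once (the shifts target distinct cells, and the insertion point is unique), so the exclusive-write restriction of the CREW PRAM is respected.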

CREW PRAM Programming on the Simulator

Introduction

This section outlines an approach for top-down development of synchronous MIMD programs in the CREW PRAM simulator environment. It is placed in this chapter since it describes CREW PRAM programming. However, it does refer to technical details of the CREW PRAM simulator, which is described in Appendix A. For a detailed study, Appendix A should be read first. Alternatively, use the index provided at the end of the thesis.

It should be noted that this is not an ideal approach for synchronous CREW PRAM programming, because it reflects limitations of the CREW PRAM simulator prototype used.

The recommended development approach

The approach for top-down development is a natural extension of the top-down stepwise refinement strategy often used when developing sequential algorithms. It is explained how early, incomplete versions of the program may be made and kept synchronous in a convenient manner. Further, it is described why the exact modelling of the time consumption should be postponed to the last stage in the development process.

The approach is illustrated by showing four possible steps (i.e. representations) in the development of an odd-even transposition sort CREW PRAM algorithm. More complicated algorithms will probably require a further splitting of one or several of these steps.

Step 1: Sketching the Algorithm in Pseudo-Code

The algorithm used as an example for outlining the development approach is a slightly different version of the odd-even transposition sort algorithm presented earlier in this chapter. In that version, only one half of the processors participate in each stage of the algorithm; the other half is waiting. That implementation, using n processors, was chosen because it is very similar to the odd-even transposition sort algorithms presented for a linear array of n processors, such as in [Akl] or [Qui].

However, when doing odd-even transposition sort on a CREW PRAM, there is no need to use more than n/2 processors. The n/2 processors may act as odd and even processors in an alternating style. Processor number

Figure: Processor allocation using n/2 processors for odd-even transposition sort of n elements (each processor number i is shown with the array element it is assigned to).

i may be assigned to the element in position 2i-1 of the array which is to be sorted (see the figure above). In even-numbered stages, processor i acts as an even-numbered processor assigned to element 2i, and compares this element with the element in position 2i+1. In odd-numbered stages, processor i acts as an odd-numbered processor assigned to element 2i-1, and compares that element with the element in position 2i. High-level pseudo-code for the algorithm, expressed in the PPP notation, is shown in the figure below.
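The alternating address scheme just described can be checked with a small Python simulation. This sketch reflects my reading of the allocation scheme (stage parity deciding between the 1-based pairs (2i-1, 2i) and (2i, 2i+1)); it is an illustration, not the thesis's PIL code:

```python
# n/2-processor odd-even transposition sort: in stage s, processor i
# (i = 1..n/2, 1-based) compares positions (2i-1, 2i) when s is odd and
# (2i, 2i+1) when s is even, so the same processors serve both phases.
# Positions are 1-based as in the text; Python lists are 0-based.

def odd_even_sort_half_processors(A):
    n = len(A)
    for stage in range(1, n + 1):             # n stages in total
        for i in range(1, n // 2 + 1):        # the n/2 "processors"
            left = 2 * i - 1 if stage % 2 == 1 else 2 * i   # 1-based
            if left + 1 <= n:                 # right neighbour exists?
                if A[left - 1] > A[left]:     # compare, 0-based indexing
                    A[left - 1], A[left] = A[left], A[left - 1]
    return A

assert odd_even_sort_half_processors([4, 3, 2, 1]) == [1, 2, 3, 4]
assert odd_even_sort_half_processors([9, 7, 5, 3, 1]) == [1, 3, 5, 7, 9]
```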

Step 2: Processor Allocation and Main Structure

A crucial part of every parallel algorithm is the processor allocation and activation. The exact number of processors that will be used should be expressed as a function of the problem size. Further, the main structure of the algorithm, including the activation of the processors, should be clear at this stage.

A possible first version of a program for our odd-even transposition sort algorithm is shown in the figure below. This is a complete program that may be compiled and executed on the simulator. It is written in a language called PIL (see Appendix A), which is based on SIMULA [BDMN, Poo].

The program demonstrates various essential features of PIL. The include statement includes a file called sorting, which contains the procedure GenerateSortingInstance. The assignment of processors and processor activation are done by statements which are quite similar to the PPP notation. The READ n FROM statement shows the syntax for reading from the global memory; the corresponding WRITE statement is shown later. T_ThisIs outputs the identification of the calling processor and is used for debugging. T_TO is a procedure for program tracing. The procedure call Use(1) specifies that each processor should consume one time

    CREW PRAM procedure OddEvenSort
    begin
        assign n/2 processors to ProcArray;
        for each processor in ProcArray begin
            Initialise local variables;
            for Iteration := 1 to n do begin
                Compute address of array element associated with this
                    processor during this iteration;
                Read that array element, and also its right neighbour
                    element if it exists;
                Compare the two elements just read, and swap
                    them if needed
            end
        end
    end

Figure: Pseudo-code (PPP) for odd-even transposition sort using n/2 processors.

    % oesNew1.pil
    % Odd Even Transposition sort with n/2 processors.
    #include
    PROCESSOR_SET ProcArray;
    INTEGER n; ! problem size ;
    ADDRESS addr;
    BEGIN_PIL_ALG
    n := 10;
    addr := GenerateSortingInstance(n, RANDOM);
    ASSIGN n//2 PROCESSORS TO ProcArray;
    FOR_EACH PROCESSOR IN ProcArray
    BEGIN ! parallel block ;
        INTEGER Iteration;
        ! Initialize local variables ;
        ! n is written on the UserStack by GenerateSortingInstance ;
        READ n FROM UserStackPtr-1;
        FOR Iteration := 1 TO n DO
        BEGIN
            T_ThisIs;
            IF IsOdd(Iteration) THEN T_TO(" Odd iteration")
            ELSE T_TO(" Even iteration");
            Use(1);
        END_FOR;
    END of parallel block;
    END_FOR_EACH PROCESSOR;
    END_PIL_ALG

Figure: Main structure of PIL program for our odd-even transposition sort algorithm. It contains processor allocation and activation.

unit before it finishes the last iteration of the FOR Iteration loop. For the purpose of this program, this is sufficient time modelling. For further details, consult Appendix A.

Step Complete Implementation With Simplied

Time Mo delling

The next main step in the development process should be to make a complete implementation of the algorithm. Normally, this should be done in several substeps. The resulting PIL program might be as shown in the two following figures. This version of the program implements the alternation of the processors between odd and even pairs by adjusting the value of ThisAddr in each iteration. The two values to be compared by every processor in each iteration are read by the procedure ReadTwoValues, and later processed by CompareAndSwapIfNeeded. ArrayPrint is a built-in procedure which prints a specified area of the global memory, in this case the array of sorted numbers.

Note that it is generally recommended to do only a simplified time modelling at this stage. The exact modelling of the time used by the parallel algorithm should be postponed to the last stage of the development. There is no need to go into full detail with respect to the time consumed before a complete and correct version of the algorithm has been made. However, whenever it is possible, all early versions of the program should be made synchronous. The absence of proper compile time padding makes it necessary to use manual methods for achieving synchronous programs. This is discussed below.

Making programs synchronous using SYNC

The currently used PIL compiler (pilc) is unfortunately not able to perform compile time padding to make the statements synchronous. Therefore, this dull work must be done by the PIL programmer. However, the burden may to a large extent be alleviated by using a stepwise refinement strategy on the time modelling.

Consider the IF statement in the procedure CompareAndSwapIfNeeded in the PIL program in Figure. The ELSE clause could have been omitted by the programmer if the compiler was able to do compile time padding. The compiler would then automatically insert the ELSE clause, to guarantee that all processors executing the IF statement would leave that statement

% oesNew2.pil
% Odd Even Transposition sort with n/2 processors,
% complete algorithm with simplified time modelling.
#include
PROCESSOR_SET ProcArray;
INTEGER n;
ADDRESS addr;
BEGIN_PIL_ALG
n := 10;
addr := GenerateSortingInstance(n, RANDOM);
ASSIGN n//2 PROCESSORS TO ProcArray;
FOR_EACH PROCESSOR IN ProcArray BEGIN ! parallel block;
    INTEGER n, Iteration;
    INTEGER ThisVal, RightVal;
    ADDRESS ThisAddr;
    #include
    ! n is written on the UserStack by GenerateSortingInstance ;
    READ n FROM UserStackPtr-1;
    ThisAddr := (UserStackPtr - 1 - n) + ((p*2)-1);
    FOR Iteration := 1 TO n DO BEGIN
        IF IsOdd(Iteration) THEN ThisAddr := ThisAddr - 1
        ELSE ThisAddr := ThisAddr + 1;
        ReadTwoValues;
        CompareAndSwapIfNeeded;
    END_FOR;
END of parallel block;
END_FOR_EACH PROCESSOR;
ArrayPrint(addr, n);

END_PIL_ALG

Figure: PIL main program for our odd-even transposition sort algorithm. The program contains a simplified time modelling which is sufficient to keep it synchronous. The included file oesNew2.proc is shown in Figure.

% oesNew2.proc
PROCEDURE ReadTwoValues;
BEGIN
    READ ThisVal FROM ThisAddr;
    IF ThisAddr + 1 >= UserStackPtr - 1 THEN BEGIN
        Use(t_READ);
        RightVal := INFTY;
    END
    ELSE READ RightVal FROM ThisAddr+1;
END_PROCEDURE ReadTwoValues;

PROCEDURE CompareAndSwapIfNeeded;
BEGIN
    IF RightVal < ThisVal THEN BEGIN
        WRITE RightVal TO ThisAddr;
        WRITE ThisVal TO ThisAddr+1;
    END
    ELSE ! Numbers compared are in right order, need not write;
        Wait(t_WRITE + t_WRITE); ! to preserve synchrony ;
END_PROCEDURE CompareAndSwapIfNeeded;

Figure: Procedures used in Figure.



simultaneously. The parameter value in the Wait call must be exactly equal to the time used in the THEN clause of the IF statement. It is necessary to recalculate this parameter each time a modification that changes the time consumption in some part of the (possibly nested) THEN clause is made. Such modifications occur frequently during development of an algorithm, and the recalculation of this parameter is typical compiler work.

The only motivation for the Wait call is to assure that all processors leave the IF statement simultaneously. This property may be achieved in an alternative way: insert a call to the general resynchronisation procedure SYNC just after the IF statement. This makes it possible to omit the ELSE part, and the code executed in the THEN clause may be changed without destroying the synchronous property of the IF statement, considered as one statement including the appended call to SYNC.

The use of SYNC will result in an increase in the time used by the algorithm. This is no problem, since the exact time modelling has been postponed to a later stage.
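On a real multiprocessor, the role of SYNC is played by a conventional barrier: each thread blocks until all have arrived, after which unequal branch times no longer matter. A minimal Python sketch of this idea (threading.Barrier standing in for SYNC; all names and the branch workloads are mine):

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)   # plays the role of SYNC
results = [0] * NUM_WORKERS

def worker(pid):
    # Branches of unequal length, as in an IF without a padded ELSE ...
    if pid % 2 == 0:
        results[pid] = sum(range(10_000))  # "long" THEN clause
    else:
        results[pid] = pid                 # "short" ELSE clause
    barrier.wait()  # ... are resynchronised here, whatever their cost.
    # From this point on, all workers proceed together again.

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

As with SYNC, the barrier costs extra time but frees the programmer from recomputing Wait parameters after every modification of the THEN clause.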



This parameter is written as t_WRITE + t_WRITE to reflect that it represents the time used by two succeeding WRITE statements. For the use of symbolic constants such as t_WRITE, see also Section.

In this simple example, the time used in the THEN clause is easy to calculate; however, most real examples will imply more elaborate calculations.

Strive for a correct ordering of the global events

Not all time consumption may be omitted from early versions. The crucial issue is to distinguish between local and global events. Local events are those occurring inside a processor; they cannot affect other processors, and they cannot be affected by other processors. Global events, such as accesses to the global memory, may affect (writes) or be affected by (reads) other processors.

Early versions of the program should use a simplified time modelling, but should try to obtain the same ordering of the global events that is expected in a complete implementation with exact time modelling.

For the correctness of an algorithm, it is irrelevant whether a set of completely local computations takes one time unit or x time units, as long as the order of all global events implied by the algorithm is kept unaltered.

The READ and the WRITE statements both take one time unit (see Section A). This time consumption is automatically modelled by the simulator. In addition, the simplified time modelling in Figure consists of one call to Wait (as discussed above) and a call Use(t_READ) in the procedure ReadTwoValues. The latter call ensures that all processors will leave that procedure synchronously.

Step: Complete Implementation With Exact Time Modelling

When a complete implementation of the algorithm has been tested and found correct, it is time to extend the program with more exact modelling of the time consumption. This is a relatively trivial task, which may be split into two phases.

Specifying time consumption with Use

A complete program with exact time modelling for our example algorithm is shown in Figure. The main difference between this program and the previous version (Figure) is that a lot of calls to the Use procedure have been introduced. Note that this explicit specification of the time used implies the advantage that the programmer is free to assume any particular instruction set, or other way of representing the time used by the program.

This would in general be very difficult for an asynchronous algorithm. For synchronous programs it will generally be much easier.

% oesNew3.pil
% Odd Even Transposition sort with exact time modelling.
#include
PROCESSOR_SET ProcArray;
INTEGER n;
ADDRESS addr;
BEGIN_PIL_ALG
n := 10;
addr := GenerateSortingInstance(n, RANDOM);
ClockReset; !****** Start of exact time modelling ;
ASSIGN n//2 PROCESSORS TO ProcArray;
FOR_EACH PROCESSOR IN ProcArray BEGIN ! of parallel block;
    INTEGER n, Iteration;
    INTEGER ThisVal, RightVal;
    ADDRESS ThisAddr;
    #include
    ! SetUp: ;
    Use(t_LOAD + t_SUB);
    READ n FROM UserStackPtr-1;
    Use((t_LOAD + t_SUB + t_SUB) + t_ADD + (t_LOAD + t_SHIFT + t_SUB) + t_STORE);
    ThisAddr := (UserStackPtr - 1 - n) + ((p*2)-1);
    Use(t_LOAD + t_SUB + t_JZERO);
    FOR Iteration := 1 TO n DO BEGIN
        Use(t_SHIFT + t_JZERO);
        IF IsOdd(Iteration) THEN BEGIN
            Use(t_SUB);
            ThisAddr := ThisAddr - 1;
        END
        ELSE BEGIN ! even Iteration;
            Use(t_ADD);
            ThisAddr := ThisAddr + 1;
        END;
        ReadTwoValues;
        CompareAndSwapIfNeeded;
        Use(t_LOAD + t_ADD + t_STORE + t_SUB + t_JZERO);
    END_FOR;
END of parallel block;
END_FOR_EACH PROCESSOR;
ArrayPrint(addr, n);

END_PIL_ALG

Figure: Complete PIL program for odd-even transposition sort with exact time modelling. The included file oesNew3.proc is shown in Figure.

% oesNew3.proc
PROCEDURE ReadTwoValues;
BEGIN
    READ ThisVal FROM ThisAddr;
    Use(t_ADD + t_IF);
    IF ThisAddr + 1 >= UserStackPtr - 1 THEN BEGIN
        Use(t_READ);
        RightVal := INFTY;
    END
    ELSE READ RightVal FROM ThisAddr+1;
END_PROCEDURE ReadTwoValues;

PROCEDURE CompareAndSwapIfNeeded;
BEGIN
    Use(t_LOAD + t_SUB + t_JNEG);
    IF RightVal < ThisVal THEN BEGIN
        WRITE RightVal TO ThisAddr;
        Use(t_ADD);
        WRITE ThisVal TO ThisAddr+1;
    END
    ELSE ! Numbers compared are in right order, need not write;
        Wait(t_WRITE + t_ADD + t_WRITE); ! to preserve synchrony ;
END_PROCEDURE CompareAndSwapIfNeeded;

Figure: Procedures used in Figure.

The following working rules may be useful when introducing more exact time modelling in a PIL program.

Use symbolic time constants.
When specifying time consumption as a parameter to Use or Wait, the programmer should try to make also this part of the code as readable as possible. Using predefined constants such as t_ADD, t_SHIFT etc. is a simple way to document how the programmer imagines that the various parts of the high level PIL program would have been represented in CREW PRAM machine code. Also, using such sums of symbolic constants instead of so-called magic numbers makes it easier to do the necessary modifications to the time modelling when changes to the PIL program are made.

In larger programs, it may be practical to define constants that reflect the time used by various logical subparts.
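The same rule carries over directly to any host language. A hypothetical Python sketch of a Use-style clock with named instruction times (the constant values below are invented for illustration; they are not the values defined by the simulator):

```python
# Invented instruction times, in CREW PRAM time units (illustrative only).
t_LOAD, t_STORE, t_ADD, t_SUB, t_SHIFT, t_JZERO = 1, 1, 1, 1, 1, 2

class Clock:
    """Accumulates modelled time, like the simulator's Use procedure."""
    def __init__(self):
        self.time = 0

    def use(self, units):
        self.time += units

clock = Clock()
# Loop-control overhead expressed as a sum of symbolic constants,
# not as a "magic number":
clock.use(t_LOAD + t_ADD + t_STORE + t_SUB + t_JZERO)
```

If one instruction time later changes, only the constant definition is touched, exactly the maintainability argument made above.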

Cluster local time consumption

Bearing in mind the dierence b etween lo cal and global events dis

cussed at page the time used on several subsequent lo cal compu

tations may in general b e collected into one call to Use Consider for

instance the Use call at the end of the FOR lo op in Figure which

mo dels the time used to control that lo op

Work systematical ly from time zero

It is natural to intro duce the exact time mo delling by following the

co de structure from execution time zero in a top down depth rst

approach

Do small changes and test systematically.
Since the time used in various parts of a parallel system may affect the order of the global events, it may indirectly affect the correctness of the algorithm. Experience has shown that minimal changes of the timing may cause disastrous changes to the program. Therefore, the introduction of the exact time modelling should be split into several substeps, with systematic testing in between.



These constants are defined in the simulator source file times.sim, stored in the src directory (see Appendix A). They are used in central parts of the simulator and should therefore not be changed.

Replacing SYNC by CHECK

If exact time modelling has been introduced properly, together with the necessary Wait calls for achieving synchronous operation, it should be possible to remove all explicit resynchronisation which has been inserted as discussed above. Simply removing all the calls to SYNC may be too optimistic: it is likely that you will do future modifications to your program that may change the time used. Therefore, farsighted programmers will replace SYNC by CHECK. CHECK checks whether the processors are synchronous at the given point, and may detect cases where synchrony has been lost due to erroneous program modifications. Replacing calls to SYNC with calls to CHECK should be done one by one, with testing after each replacement.

Closing remarks

The reader should have noticed that exact time modelling requires the specification of a lot of boring details that are strictly related to the details of the program implementing the algorithm. Considering the amount of changes normally done during the development of a program, it should be clear that a lot of unnecessary work may be avoided by postponing the exact time modelling to the very end of the program development. Please note that most of these details would have been taken care of by a proper PIL compiler.

Chapter

Investigation of the Practical Value of Cole's Parallel Merge Sort Algorithm

A fundamental choice is whether to base the message routing strategy of such an emulation on sorting, or to use a method that does not require sorting; and, if sorting is to be used, whether there exists an O(log n)-time sorting method for which the constant hidden by the big-O is not excessive.

Richard M. Karp, in A Position Paper on Parallel Computation [Kar]

This chapter describes the investigation of Cole's parallel merge sort algorithm. It starts by describing why this parallel algorithm is an important contribution to theoretical computer science. The practical value of Cole's algorithm is evaluated by comparing its performance with various simple sorting algorithms. The performance of these simple algorithms is summarised in Section.

The main part of the chapter is a top down and complete explanation of Cole's algorithm, followed by a description of how it has been implemented on the CREW PRAM simulator. This includes details that are not available in earlier documentation of the algorithm. The chapter ends by presenting the results from the comparison of Cole's algorithm with the simpler algorithms.

Parallel Sorting

As an aside, it is interesting to speculate on what are the all time most important algorithms. Surely the arithmetic operations on decimal numbers are basic. After that, I suggest fast sorting and searching.

Stephen A. Cook, in his Turing Award Lecture An Overview of Computational Complexity [Coo]

Parallel Sorting: The Continuing Search for Faster Algorithms

The literature on parallel sorting is very rich; a published bibliography contains nearly four hundred references [Ric]. Nevertheless, new and interesting parallel sorting algorithms are still appearing.

Within theoretical computer science, there are mainly two computational models that have been considered for parallel sorting algorithms: the circuit model and the PRAM model. An early and important result for the circuit model was the odd-even merge and bitonic merge sorting networks presented by Batcher [Bat]. A bitonic merge sort network sorts n numbers in O(log² n) time; see for instance [Akl].

The AKS sorting network

The first parallel sorting algorithm using only O(log n) time was presented by Ajtai, Komlos and Szemeredi [AKS]. This algorithm is often called the three Hungarians' algorithm, or the AKS-network. The original variant of the AKS-network used O(n log n) processors and was therefore not cost-optimal (recall the definition of cost optimality). However, Leighton showed that the AKS-network can be combined with a variation of the odd-even network to give an optimal sorting network with O(log n) time on O(n) processors [Lei]. Leighton points out that the constant of proportionality of this algorithm is immense, and that other parallel sorting algorithms will be faster as long as n is smaller than a problem size that probably never will occur in practice.



The total number of elementary particles in the observable part of the universe has been estimated to about 10^80 [Sag].

In spite of being commonly thought of as a purely theoretical achievement, the AKS-network was a major breakthrough: it proved the possibility of sorting in O(log n) time, and implied the first cost optimal parallel sorting algorithm, described by Leighton. The optimal asymptotical behaviour initiated a search for improvements closing the gap between its theoretical importance and its practical use. One such improvement is the simplification of the AKS-network done by Paterson [Pat]. However, the algorithm is still very complicated, and the complexity constants remain impractically large [GR].

Cole's CREW PRAM sorting algorithm

The PRAM model is more powerful than the circuit model. Even the weakest of the PRAM models, the EREW PRAM, may implement a sorting circuit such as Batcher's network without loss of efficiency.

Also for the PRAM model there has been a search for an O(log n) time parallel sorting algorithm with optimal cost. Richard Cole presented a new parallel sorting algorithm, called parallel merge sort, with this complexity [Col]. This was an important contribution, since Cole's algorithm is the second O(log n) time, O(n) processor sorting algorithm (the first was the one implied by the AKS-network). Further, it is claimed to have complexity constants which are much smaller than those of the AKS-network. To make it possible to evaluate its practical value, it was necessary to develop an implementation of the algorithm. This work is described in the following sections.

Are algorithms using more than n processors practical?

Many researchers would argue that sorting algorithms requiring more than n processors to sort n items are of little practical interest. A reasonable general assumption is that the number of processors p should be smaller than n. However, there are situations where p > n should be considered. Imagine that you are given the task of making an extremely fast sorting algorithm for up to n numbers on a Connection Machine with more than n processors. Reading the description of Cole's O(log n) time algorithm in the book on efficient parallel algorithms by Gibbons and Rytter [GR], it is far from evident that Cole's algorithm should not be considered.

It is interesting in this context to read the recent work of Valiant on the BSP model [Val] (see Section), where he advocates the need for algorithms with parallel slackness. Only the future can tell us whether



Figure: Insertion sort algorithm [Ben], demonstrating worst, average and best case behaviour (x-axis: number of items sorted; y-axis: execution time). The best case for this algorithm is a sorted sequence, and the worst case is an input sequence sorted in the opposite order. Each small dot shows the time used to sort one of ten random input sequences for each problem size, and the average of these ten samples is also marked. The random cases were generated by drawing numbers from a uniform distribution.

algorithms with very high virtual processor requirements will be widely used.

Some Simple Sorting Algorithms

This section summarises the performance of the three simple sorting algorithms described earlier. For all sorting algorithms evaluated in this thesis, the execution time is measured from when the algorithm starts, with the input sequence placed in the global memory, until it finishes, with the sorted sequence placed in the global memory. The time is measured in number of CREW PRAM time units.

Insertion Sort

Perhaps the simplest sequential sorting algorithm is the straight insertion sort algorithm outlined in Section (see Figure). A

Table: Performance data for a CREW PRAM implementation of the parallel insertion sort algorithm presented in Figure. Time and cost are given in kilo CREW PRAM time units and kilo CREW PRAM unit-time instructions, respectively. The numbers of read/write operations to/from the global memory are given in kilo locations.

Problem size n    time    processors    cost    reads    writes

tuned version (the second program from the top of the page in [Ben]) has been implemented on the CREW PRAM simulator. The execution time of the algorithm is strongly dependent on the degree of presortedness in the problem instance. This is illustrated in Figure.
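For reference, straight insertion sort in a common Python formulation (a sketch of the standard algorithm, not a transcription of the tuned program from [Ben]):

```python
def insertion_sort(a):
    """Straight insertion sort: O(n) on already sorted input (best case),
    O(n^2) on reverse-sorted input (worst case)."""
    for i in range(1, len(a)):
        item = a[i]
        j = i - 1
        while j >= 0 and a[j] > item:  # shift larger items one place right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = item
    return a
```

The inner while loop terminates immediately on sorted input and runs to the front of the array on reverse-sorted input, which is exactly the best/worst case spread visible in the measurements.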

Parallel Insertion Sort

The parallel version of the insertion sort algorithm described in Section (see Figure) has also been implemented on the simulator. Its performance is summarised in Table.

Odd-Even Transposition Sort

Table summarises the performance of the implementation of the odd-even transposition sort algorithm using n/2 processors, which was developed and described in Section.

Bitonic Sorting on a CREW PRAM

This section gives a brief description of a CREW PRAM implementation of Batcher's bitonic sorting network, and its performance.

Table: Performance data measured from execution of OddEvenSort on the CREW PRAM simulator. Time and cost are given in kilo CREW PRAM time units and kilo CREW PRAM unit-time instructions, respectively. The numbers of read/write operations to/from the global memory are given in kilo locations.

Problem size n    time    processors    cost    reads    writes

Algorithm Description and Performance

Batcher's bitonic sorting network [Bat] for sorting n = 2^m items consists of m(m + 1)/2 columns, each containing n/2 comparators (comparison elements). A detailed description of the algorithm may be found in Akl's book on parallel sorting [Akl]. In this section we assume that n is a power of 2. A natural emulation on a CREW PRAM is to use n/2 processors, which are dynamically allocated to the one active column of comparators as it moves from the input side to the output side through the network. The global memory is used to store the sequence when the computation proceeds from one step (i.e., comparator column) to the next. The main program and its time consumption are shown in Figure and Table.

When discussing time consumption of PPP programs, we will in general use the following notation.

Definition: t_i(n) denotes the time used on one single execution of statement i of the discussed program, when the problem size is n. t_{j-k}(n) is a shorthand notation for the sum Σ_{i=j}^{k} t_i(n).



The possibility of sorting several sequences simultaneously in the network by use of pipelining is sacrificed by this method. This is not relevant in this comparison, since Cole's algorithm does not have a similar possibility.

CREW PRAM procedure BitonicSort;
begin
    assign n/2 processors;
    for each processor do begin
        Initiate processors;
        for Stage := 1 to log n do
            for Step := 1 to Stage do begin
                EmulateNetwork;
                ActAsComparator;
            end
    end
end

Figure: Main program of a CREW PRAM implementation of Batcher's bitonic sorting algorithm.

Table: Time consumption of BitonicSort.

Table: Performance data measured from execution of BitonicSort on the CREW PRAM simulator. Time and cost are given in kilo CREW PRAM time units and kilo CREW PRAM unit-time instructions, respectively. The numbers of read/write operations to/from the global memory are given in kilo locations.

Problem size n    time    processors    cost    reads    writes

EmulateNetwork is a procedure which computes the addresses in the global memory corresponding to the two input lines for that comparator in the current Stage and Step. ActAsComparator calculates which of the two possible comparator functions should be performed by the processor (comparator) in the current Stage and Step, performs the function, and writes the two outputs to the global memory. Both procedures are easily performed in constant time.

The main program shown in Figure implies that the time consumption of the implementation may be expressed as

    T_BitonicSort(n) = t_{1-2}(n) + t_3(n) · log n + t_{4-6}(n) · (log n (log n + 1))/2

Note that this equation gives an exact expression for the time consumption for any value of n, where n is a power of 2.
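The Stage/Step structure above corresponds to the usual iterative formulation of the bitonic network. A compact sequential Python emulation (my own sketch, for n a power of 2; each pass of the inner loop plays the role of one comparator column):

```python
def bitonic_sort(a):
    """Sort a list whose length is a power of two by emulating
    Batcher's bitonic network, one comparator column at a time."""
    n = len(a)
    stage_size = 2
    while stage_size <= n:          # Stage: size of the bitonic sequences
        step = stage_size // 2
        while step >= 1:            # Step: comparator distance in this column
            for i in range(n):
                j = i ^ step        # partner input line of the comparator
                if j > i:
                    ascending = (i & stage_size) == 0
                    if (a[i] > a[j]) == ascending:
                        a[i], a[j] = a[j], a[i]
            step //= 2
        stage_size *= 2
    return a
```

The two while loops run log n · (log n + 1)/2 times in total, matching the column count in the time expression above.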

Cole's Parallel Merge Sort Algorithm

Description

Cole [Col] has invented an ingenious version of parallel merge sorting, in which this merging is done in O(1) parallel time.

Alan Gibbons and Wojciech Rytter, in Efficient Parallel Algorithms [GR]

Richard Cole presented two versions of the parallel merge sort algorithm: one for the CREW PRAM model, and a more complex version for the EREW PRAM model [Col]. Throughout this thesis we are considering the CREW variant,

without the modification presented at the bottom of the page in [Col]. The EREW variant is substantially more complex.

A revised version of the original paper was published in the SIAM Journal on Computing [Col]. This paper has been used as the main reference for my study and implementation of the CREW PRAM variant of the algorithm. Cole also wrote a technical report about the algorithm [Col], but has informally expressed that it is less good than the SIAM paper [Col].

This section explains Cole's CREW PRAM parallel merge sort algorithm. Section describes the most important decisions that were made during the implementation. This is followed by pseudo code for the main parts of the algorithm, along with an exact analysis of the time consumption.

Motivation for the Description

Many researchers in the theory community refer to Cole's algorithm as simple. However, the effort needed to get the level of understanding that is required for implementing the algorithm is probably counted in days, and not hours, for most of us. It is therefore natural to use a top down approach in describing the algorithm. The reader may consult the SIAM paper [Col] for some further details, or an alternative presentation. The description found in the book Efficient Parallel Algorithms by Gibbons and Rytter [GR] is not completely consistent with Cole's paper, but may add some understanding.

Main Principles

Cole's parallel merge sort assumes n distinct items. These items are distributed one per leaf in a complete binary tree; thus it is assumed that n is a power of 2. At each node in the tree, the items in the left and right subtrees of that node are merged. The merging starts at the leaves and proceeds towards the top of the tree. Before the merging in a node is completed, samples of the items received so far are sent further up the tree. This makes it possible to merge at several levels of the tree simultaneously, in a pipelined fashion.

As a motivation, let us consider two other parallel sorting algorithms based on an n-leaf complete binary tree.

Two Simpler Ways of Merging In a Binary Tree

One processor in each node

Perhaps the simplest way to sort n items initially distributed on the n leaf nodes of a binary tree is to assign one processor to each node, and to merge subsequences level by level, starting at the bottom of the tree. At the first stage, the nodes one level above the leaf nodes will merge two sequences of length 1 into one sequence of length 2. In the next stage, the nodes one level higher up will merge two such sequences into one sequence of length 4, and so on. The degree of parallelism is restricted to the number of nodes in the active level of the tree. Worse, though, as the algorithm proceeds towards the top of the tree, the work to be done in each node increases, while the parallelism becomes smaller. At the top node, two sequences of length n/2 are merged by one single processor. This last stage alone takes O(n) time, and this form of tree-based merge sort is definitely not fast.
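A sequential Python sketch of this level-by-level scheme makes the bottleneck explicit: the last iteration is a single merge of two lists of length n/2 (helper names are mine; heapq.merge plays the role of one node's merging):

```python
from heapq import merge

def tree_merge_sort(items):
    """Merge level by level up a complete binary tree: n leaves,
    then n/2 sorted pairs, n/4 sorted quadruples, ..., one sorted list.
    Assumes len(items) is a power of two."""
    level = [[x] for x in items]          # the leaf nodes
    while len(level) > 1:
        # All nodes on one level could run "in parallel", but the final
        # iteration is one merge of two length-n/2 lists: O(n) time.
        level = [list(merge(level[i], level[i + 1]))
                 for i in range(0, len(level), 2)]
    return level[0]
```

Each while-loop iteration corresponds to one level of the tree becoming active.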

Parallel processing inside each node

Cole gives a motivating example that is similar to the algorithm just sketched [Col]. The computation proceeds up the tree level by level, from the leaves to the root. Each node merges the sorted sets computed at its children. But, as Cole points out, merging in a node may be done by the O(log log n) time, n-processor merging algorithm of Valiant [Val]. In this case we also have parallelism inside the nodes at the active level. However, the merging is still done level by level, resulting in O(log n log log n) time consumption.

Cole's Observation: The Merges at the Different Levels of the Tree Can be Pipelined

Borodin and Hopcroft have proved that there is an Ω(log log n) lower bound for merging two sorted arrays of n items using n processors [BH]. Therefore, as Cole points out, it is not likely that the level by level approach may lead to an O(log n) time sorting algorithm.

Cole's algorithm is based on an O(log n) time merging procedure, which is slower than the merging algorithm of Valiant. The main principle is described by the following simpler, but similar, merging procedure:

Cole's log n merging procedure:

The problem is to merge two sorted arrays of n items. We proceed in log n stages. In the ith stage, for each array, we take a sorted sample of 2^i items, comprising every (n/2^i)th item in the array. We compute the merge of these two samples. [Col]
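Read literally, the procedure can be sketched as follows in Python; here every stage simply re-merges its samples from scratch, whereas Cole's point is that, given the previous stage's result, each stage can be done in O(1) parallel time:

```python
def staged_merge(a, b):
    """Merge two sorted length-n arrays (n a power of two) in log n stages.
    Stage i takes every (n // 2**i)-th item of each array as a sample and
    merges the two samples; the last stage merges the full arrays."""
    n = len(a)
    step = max(n // 2, 1)    # step = n / 2**i for i = 1, 2, ..., log n
    result = []
    while step >= 1:
        sample_a = a[step - 1::step]   # every step-th item (1-indexed)
        sample_b = b[step - 1::step]
        result = sorted(sample_a + sample_b)  # the stage merge, done naively here
        step //= 2
    return result
```

The sketch performs the same stages as the quoted procedure, but without the constant-time trick that makes Cole's version fast.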

Cole made two key observations.

Merging in constant time:
Given the results of the merge from the (i − 1)st stage, the merge in the ith stage can be done in O(1) time.

Since we have log n levels in the tree, using this merging procedure in the level by level approach gives an O(log² n) time sorting algorithm. The second observation explains how the use of a slower merging procedure may result in a faster sorting algorithm.

The merges at the different levels of the tree can be pipelined:
This is possible since merged samples made at level l of the tree may be used to provide samples of the appropriate size for merging at the next level above l, without losing the O(1) time merging property.

This may need some further explanation. Consider node u at level l (see Figure). In the level by level approach, as soon as nodes v and w store a sorted sequence containing all the items in the subtrees rooted at v and w, the merging may start at level l. The merging in node u takes log k stages, where k is the size of the sorted sequences in nodes v and w. Each of these stages may be done in constant time, due to the systematical way

Figure: An arbitrary node u at level l, its two child nodes v and w, and its parent node x at the level above.



of sampling the sorted sequences in v and w. When node u has completed the merging, the merging may start at node x.

In the pipelined approach, the nodes are allowed to start much earlier on the merging procedure. Once node u contains enough items, it starts to send samples to its parent node x. These samples are taken from the sequence stored in node u, which in this case may be only a sample of the items initially stored in the leaves of the subtree rooted at u. Node x follows the same merging procedure: as soon as it contains enough items, it starts sending samples to its parent node, and so on. It is shown in [Col] that this way of letting the nodes start to merge sorted subsequences from their left and right child nodes, before these two nodes have finished their merging, may be implemented without destroying the possibility of doing each merge stage in constant time.

More Details

This section gives a more detailed description of Cole's parallel merge sort algorithm. It attempts to provide a thorough understanding by also explaining some details, vital for an implementation, which are not given in Cole's description [Col].

In stage i, the samples from v and w may be merged in constant time because the array currently stored in node u, which was obtained in stage i − 1, is a so-called good sampler for the new samples from v and w. The term good sampler is used by Gibbons and Rytter [GR], and reflects that the old array may be used to split the new array into roughly equal sized parts, in the following manner. Informally: if the items in the new and bigger array are inserted at their right positions (with respect to sorted order) in the old array, a maximum of three items from the new array will be inserted between any two neighbour items in the old array. Full details in [Col].

The Task of EachNodeu

Eachnodeu should pro duce a list Lu which is the sorted list containing

the items distributed initially at the leaves of the subtree with u as its top

no de ro ot During the progress of the algorithm eachnode u stores an

array Upu which is a sorted subset of Lu Initially all leaf no des y have

Upy Ly At the termination of the algorithm the top no de of the

whole tree t has UptLt which is the sorted sequence In the progress

of the algorithm Upu will generally contain a sample of the items in Lu

Upu SampleUpv SampleUpw and NewUpu

Before we proceed, we need some simple definitions. A node u is said to be external if |Up(u)| = |L(u)|; otherwise it is an inside node. Initially, only the leaf nodes are external. In each stage of the algorithm, a new array called NewUp(u) is made in each inside node u. NewUp(u) is formed in the following manner (see Figure):

Phase 1: Samples of the items in Up(v) and Up(w) are made and stored in the arrays SampleUp(v) and SampleUp(w), respectively.

Phase 2: NewUp(u) is made by merging SampleUp(v) and SampleUp(w) with the help of the array Up(u).

For all nodes, the NewUp(u) array made in stage i is the Up(u) array at the start of stage i + 1. At external nodes, Phase 2 is not performed, since these nodes have already produced their list L(u).

Phase 1: Making Samples

The samples SampleUp(u) are made in parallel by all nodes u, according to the following rules:

If u is an inside node, SampleUp(u) is the sorted array comprising every fourth item in Up(u). If Up(u) contains fewer than four items, SampleUp(u) becomes empty.

At the first stage when u is external, SampleUp(u) is made in the same way as for inside nodes. At the second stage as external node, SampleUp(u) is every second item in Up(u), and in the third stage SampleUp(u) is every item in Up(u). SampleUp(u) is always in sorted order.
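The sampling rules above may be sketched as follows. This is a Python sketch, not the actual simulator code of the implementation; in particular, exactly which positions are picked (here positions 4, 8, 12, ... counted from 1) is an assumption.

```python
def sample_up(up, stage_as_external=0):
    """Make SampleUp(u) from the sorted array Up(u).

    stage_as_external: 0 means u is an inside node; 1 means first stage
    as external (same rate as inside nodes); 2 -> every second item;
    3 -> every item.  The positions picked are an assumption.
    """
    rate = {0: 4, 1: 4, 2: 2, 3: 1}[stage_as_external]
    # Slicing keeps sorted order; fewer than `rate` items gives an
    # empty sample, exactly as stated in the rules above.
    return up[rate - 1::rate]
```

Note that an inside node with |Up(u)| < 4 produces an empty sample, as required.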

Figure: Computation of NewUp(u) in two main phases. Phase 1 (MakeSamples): SampleUp(v) and SampleUp(w) are made from Up(v) and Up(w). Phase 2 (MergeWithHelp): they are merged into NewUp(u) with the help of Up(u).

The algorithm has 3 log n stages

Initially, only the leaves are external nodes. In the third stage as external, a node v has used all the items in L(v) to produce SampleUp(v) in Phase 1, and its parent node u has merged SampleUp(v) and SampleUp(w) into NewUp(u) in Phase 2. Node u has now received all the items in L(u), and the child nodes v and w have finished their tasks and may terminate. In the next stage, the parent node u will have Up(u) = L(u), and this is the first stage as external for node u.

We see that the level with external nodes moves up one level every third stage. After 3 log n stages, the top node t becomes external; it has Up(t) = L(t), and the algorithm may stop (see the corresponding lemma in [Col]).

The size of Up(u) is doubled in each stage

The size of the samples made by the external nodes doubles in each stage, because the sample rate is halved in each stage. As a consequence, the size of Up(u) in the inside nodes one level higher also doubles in each stage. At this level, the sample rate is fixed (every fourth item) as long as the nodes are inside nodes. But since |Up(u)| doubles in each stage, the samples made at this level will also double in each stage. By induction, we see (intuitively) that the size of Up(u) doubles in each stage for each inside node with nonempty Up(u).

Phase 2: Merging in O(1) Time

The most complicated part of Cole's algorithm is the merging of SampleUp(v) and SampleUp(w) into NewUp(u) in O(1) time. It constitutes the major part of the algorithm description in [Col], and a large fraction of the code in the implementation. The merging is described in the following; proofs and some more details are found in [Col].

Using ranks to compute NewUp(u)

The merging is based on maintaining a set of ranks. Assume that A and B are sorted arrays. Informally, we define A to be ranked in B, denoted A → B, if we for each item e in A know how many items in B are smaller than or equal to e. Knowing A → B means that we know where to insert an item from A in B to keep B sorted.

When using this concept in an implementation, it is necessary to make the following distinction. The rank of an item x in a sorted array S is denoted R(x, S). When using R(x, S) to represent the correct position (with respect to sorted order) of x in S, it is important to know whether x is in S or not, as made clear in the following observation.

Observation

Assume that the positions of the items in the sorted array S are numbered 1, 2, ..., |S|. If x ∈ S, R(x, S) represents the correct position of x in S. However, if x ∉ S, R(x, S) gives the number of items in S which are smaller than x; thus, if x is to be inserted in S, its correct position will be R(x, S) + 1.
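With 1-indexed positions and distinct items (as assumed throughout the algorithm), the rank function and the observation above can be expressed directly. A minimal Python sketch:

```python
import bisect

def rank(x, s):
    """R(x, S): the number of items in the sorted array S that are <= x."""
    return bisect.bisect_right(s, x)

# Observation: if x is in S, R(x, S) is the (1-indexed) position of x in S;
# if x is not in S, its correct insertion position would be R(x, S) + 1.
```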

At the start of each stage, we know Up(u), SampleUp(v) and SampleUp(w). In addition, we make the following assumption.

Assumption

At the start of each stage, for each inside node u with child nodes v and w, we know the ranks Up(u) → SampleUp(v) and Up(u) → SampleUp(w).

The making of NewUp(u) may be split into two main steps: the merging of the samples from its two child nodes, and the maintaining of the Assumption. The calculation of the ranks constitutes a large fraction of the total time consumption of Cole's algorithm. However, as we will see, they are crucial for making it possible to merge in constant time.

Phase 2, Step 1: Merging

We want to merge SampleUp(v) and SampleUp(w) into NewUp(u) (see Figure). This is easy if we, for each item in SampleUp(v) and SampleUp(w), know its position in NewUp(u). The merging is then done by simply writing each item to its right position.

Consider an arbitrary item e in SampleUp(v) (see Figure). We want to compute the position of e in NewUp(u), i.e., the rank of e in NewUp(u), denoted R(e, NewUp(u)). Since NewUp(u) is made by merging SampleUp(v) and SampleUp(w), we have

R(e, NewUp(u)) = R(e, SampleUp(v)) + R(e, SampleUp(w))

The problem of merging has been reduced to computation of R(e, SampleUp(v)) and R(e, SampleUp(w)). Since e is from SampleUp(v), R(e, SampleUp(v)) is just the position of e in SampleUp(v). Computing the rank of e in SampleUp(w) is much more complicated. Similarly, for items e in SampleUp(w) it is easy to compute R(e, SampleUp(w)), but difficult to compute R(e, SampleUp(v)).
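Given the cross ranks, Step 1 is just one write per item; each loop iteration below is independent and could be executed by its own processor on a CREW PRAM. This is a Python sketch with hypothetical argument names, not the implementation itself:

```python
def merge_by_ranks(sample_v, sample_w, rank_in_other_v, rank_in_other_w):
    """Merge two sorted arrays of distinct items when the cross ranks
    are known.  rank_in_other_v[k] is R(sample_v[k], SampleUp(w)), and
    rank_in_other_w[k] is R(sample_w[k], SampleUp(v))."""
    new_up = [None] * (len(sample_v) + len(sample_w))
    # R(e, NewUp(u)) = R(e, SampleUp(v)) + R(e, SampleUp(w)):
    # own (1-indexed) position plus rank in the other array.
    for pos, (e, r_other) in enumerate(zip(sample_v, rank_in_other_v), start=1):
        new_up[pos + r_other - 1] = e
    for pos, (e, r_other) in enumerate(zip(sample_w, rank_in_other_w), start=1):
        new_up[pos + r_other - 1] = e
    return new_up
```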

Figure: The use of ranks in Cole's O(1) time merging procedure. Knowing the position of the item e in SampleUp(v), R(e, SampleUp(v)) = r1, and its position in SampleUp(w), R(e, SampleUp(w)) = r2, the position of e in NewUp(u), R(e, NewUp(u)), is simply r1 + r2 in this example.

 

The computation of the ranks SampleUp(v) → SampleUp(w) and SampleUp(w) → SampleUp(v) is termed computation of cross ranks. In the mass of details and levels which follows, the Figure showing the main structure of one stage may be used as an aid for keeping track of where we are in the algorithm.

Computing Cross Ranks

Up(u) and the Assumption may be of great help in computing the cross ranks. Consider the Figure and the item e from SampleUp(v). We want to find the position of e in SampleUp(w), as if it were to be inserted, according to sorted order, in that array.

Note that it is only the situation where e is from SampleUp(v) that is described in this and the following sections. The other case, e from SampleUp(w), is handled in a completely symmetric manner.

First, for each item e in SampleUp(v), we compute its rank in Up(u). This computation is done for all the items in SampleUp(v) in parallel, and is described as Substep 1 below.



R(e, Up(u)) gives us the items d and f in Up(u) which would straddle e

(Let x, y and z be three items with x ≤ z. We say that x and z straddle y if x ≤ y and y ≤ z, i.e., x ≤ y ≤ z.)

One stage in Cole's parallel merge sort algorithm:
    Phase 1: Making Samples
    Phase 2: Merging in O(1) time
        Step 1: Merging
            Computing cross ranks
                Substep 1
                Substep 2
            Make NewUp(u)
        Step 2: Maintaining Ranks
            from sender node
            from other node

Figure: Main structure of one stage in Cole's algorithm.

if e had to be inserted in Up(u). Further, the Assumption gives R(d, SampleUp(w)) and R(f, SampleUp(w)), the right positions for inserting d and f in SampleUp(w). These ranks are called r and t in the Figure. Informally, since e ∈ [d, f⟩, the right position for e in SampleUp(w) is bounded by the positions (ranks) r and t. It can be shown that r and t always specify a range of at most three positions, and the right position, R(e, SampleUp(w)), may thus be found in constant time. This is explained in more detail as Substep 2 below.

Before the two substeps are explained in more detail, we mention an analogy which may help one's understanding. Imagine that you are supposed to navigate to a certain position (R(e, SampleUp(w))) far away in an unknown area (SampleUp(w)). You know where you are, and have a detailed map covering the entire area. The normal, smart way to solve such a problem is to split it into two succeeding steps: coarse and fine navigation. Coarse navigation is done by using an imagined high-level map where many details have been omitted. This map (Up(u)) is simpler than the full detail map (SampleUp(w)), but it is good enough (a good sampler) for navigating to a small area (given by r and t) containing the exact position. Fine navigation is then done by detailed searching in this small area, comparing e with the elements given by r and t.


Figure: Substep 1: Computing SampleUp(v) → Up(u) by using Up(u) → SampleUp(v).

Substep 1: Computing SampleUp(v) → Up(u)

SampleUp(v) → Up(u) is computed by using Up(u) → SampleUp(v). Consider an arbitrary item i1 in Up(u) (see Figure). The range [i1, i2⟩, where i2 is the item following i1 in Up(u), is defined to be the interval induced by item i1. We want to find the set of items in SampleUp(v) which are contained in [i1, i2⟩, denoted I(i1).

The items in I(i1) have rank in Up(u) equal to b, where b is the position of i1 in Up(u), and the array positions are numbered starting with 1, as shown in the figure. Once I(i1) has been found, a processor associated with item i1 may assign the rank b to the items in the set. Simultaneously, a processor associated with i2 should assign the rank c to I(i2).

 

Computing I(i1)

The precise calculations necessary for finding I(i1) are not explained in Cole's SIAM paper [Col]. However, we must understand how it can be done in order to implement the algorithm. We want to find all items x in SampleUp(v) with R(x, Up(u)) = b. These items must satisfy

i1 ≤ x < i2

Let item(j) denote the item stored in position j of SampleUp(v). The Assumption gives R(i1, SampleUp(v)) = r and R(i2, SampleUp(v)) = s, which implies

item(r) ≤ i1 and item(s) ≤ i2

We have the following observation.

Observation

(See Figure.) If there exist items in SampleUp(v) between, and not including, positions r and s, they must have R(·, Up(u)) = b.

Proof: Let item(r+1) denote an item which is to the right of item(r) and to the left of item(s) (see Figure). The definition of the rank r implies that i1 < item(r+1). Hence R(item(r+1), Up(u)) ≥ b. Let item(s−1) denote an item which is to the left of item(s) and to the right of item(r). The requirement of distinct items implies item(s−1) < item(s), and the definition of the rank s implies item(s) ≤ i2. Distinct items then give item(s−1) < i2, so we have R(item(s−1), Up(u)) < c. Since b = c − 1, it follows that all items in SampleUp(v) in positions r+1, ..., s−1 have rank b in Up(u). □

The items in positions r and s must be given special treatment; we have:

item(r) = i1 ⟹ R(item(r), Up(u)) = b
item(r) < i1 ⟹ R(item(r), Up(u)) < b
item(s) = i2 ⟹ R(item(s), Up(u)) = c
item(s) < i2 ⟹ R(item(s), Up(u)) < c, which here means R(item(s), Up(u)) = b

To conclude: I(i1) is found by comparing item(r) with i1 and item(s) with i2. A processor associated with item i1 may do these simple calculations, and assign the rank b to all items in I(i1), in constant time. This is possible because it can be proved that I(y), for any item y in Up(u), will contain at most three items.
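The calculation of I(i1) outlined above can be written out as follows. This is a sequential Python sketch of the idea; in the real algorithm each interval is O(1) work for the processor associated with it, and the sentinel items ±∞ play the role of imagined boundary items:

```python
def ranks_sample_in_up(up_u, sample_v, up_ranks_in_sample):
    """Substep 1: compute SampleUp(v) -> Up(u) from Up(u) -> SampleUp(v).

    up_u, sample_v: sorted arrays; up_ranks_in_sample[b] = R(up_u[b], sample_v)
    (0-indexed array positions, ranks as defined in the text).
    Returns a list with R(x, Up(u)) for every x in sample_v.
    """
    out = [None] * len(sample_v)
    # Imagined items -inf/+inf delimit the first and last induced interval.
    bounds = [float("-inf")] + list(up_u) + [float("inf")]
    ranks = [0] + list(up_ranks_in_sample) + [len(sample_v)]
    for b in range(len(bounds) - 1):
        lo, hi = bounds[b], bounds[b + 1]   # interval [lo, hi) induced by lo
        r, s = ranks[b], ranks[b + 1]       # ranks of lo and hi in sample_v
        # Only a few candidate positions need testing (cover property).
        for j in range(max(r - 1, 0), s):
            if lo <= sample_v[j] < hi:
                out[j] = b                  # b items of Up(u) are <= sample_v[j]
    return out
```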

Substep 2: Computing SampleUp(v) → SampleUp(w)

See Figure. As described above, R(e, Up(u)), which was computed in Substep 1, gives the straddling items d and f in Up(u). Further, the Assumption gives the ranks r and t. We want the exact position of e if it had to be inserted in SampleUp(w). Which items in SampleUp(w) must be compared with e? The question is answered by the following two observations.



Cole proves that Up(u) is a so called cover for SampleUp(v) [Col]. (Cover is the same concept as good sampler, mentioned in an earlier footnote.) Up(u) is a cover for SampleUp(v) if each interval [i1, i2⟩ induced by an item i1 from Up(u) contains at most 3 items from SampleUp(v) (see Figure). Note that the definition allows an imagined item i1 = −∞ to the left of the first element of Up(u), and similarly i2 = +∞ to the right of the last element.



Figure: Substep 2: Computing SampleUp(v) → SampleUp(w) by using SampleUp(v) → Up(u).

Observation

All items in SampleUp(w) to the left of, and including, position r are smaller than item e.

Proof: Let item(x) denote the item in SampleUp(w) at position x. Since d and f straddle e, we have d ≤ e. The rank r implies that item(r) ≤ d, so we have item(r) ≤ e. The assumption of distinct items implies that SampleUp(v) and SampleUp(w) never will contain the same element, hence e ≠ item(r). This implies that item(r) < e. □

Observation

All items in SampleUp(w) to the right of position t are larger than item e.

Proof: The rank t implies that f < item(t+1), and we know that e < f. Therefore e < item(t+1). □

How many items must be compared with e? The two observations tell us that e must be compared with the items from SampleUp(w) with positions in the range [r+1, t]. Since Up(u) is a cover for SampleUp(w), we know that [d, f⟩ contains at most three items from SampleUp(w). But the set of items in SampleUp(w) contained in [d, f⟩ is not necessarily the same as the items with positions in the range [r+1, t]. However, we can prove the following observation by using the cover property of [d, f⟩.

Observation

(See Figure.) Item e must be compared with at most three items from SampleUp(w), starting with item(r+1), going to the right, but not beyond item(t). In other words, the items item(r+i) for 1 ≤ i ≤ min(3, t − r).

Proof: First, let us consider which items in SampleUp(w) may be contained in [d, f⟩. We have four possible cases:

1. d = item(r) ⟹ item(r) ∈ [d, f⟩
2. d ≠ item(r) ⟹ item(r) < d ⟹ item(r) ∉ [d, f⟩
3. f = item(t) ⟹ item(t) ∉ [d, f⟩
4. f ≠ item(t) ⟹ item(t) < f ⟹ item(t) ∈ [d, f⟩

which imply four cases for the placement of [d, f⟩ in SampleUp(w):

(i) item(r) ∈ [d, f⟩ and item(t) ∈ [d, f⟩
(ii) item(r) ∈ [d, f⟩ and item(t) ∉ [d, f⟩
(iii) item(r) ∉ [d, f⟩ and item(t) ∉ [d, f⟩
(iv) item(r) ∉ [d, f⟩ and item(t) ∈ [d, f⟩

Case (i): The cover property implies that |{item(r), ..., item(t)}| ≤ 3. By deleting item(r), we get |{item(r+1), ..., item(t)}| ≤ 2.

Case (ii): The cover property gives |{item(r), ..., item(t−1)}| ≤ 3, which implies |{item(r+1), ..., item(t)}| ≤ 3.

Case (iii): |{item(r+1), ..., item(t−1)}| ≤ 3, so |{item(r+1), ..., item(t)}| ≤ 4. At first sight, one may think that this case makes it necessary to compare e with 4 items in SampleUp(w). However, this case does only occur when item(t) = f (see case 3 above), and since e < f, we know that e < item(t), so there is no need to consider item(t). We need only consider the items in {item(r+1), ..., item(t−1)}, which contains at most 3 items.

Case (iv): The cover property implies |{item(r+1), ..., item(t)}| ≤ 3.

Thus we have shown that e must be compared with at most three items in all possible cases. □

Since these (at most three) items in SampleUp(w) are sorted, two comparisons are sufficient to locate the correct position of e. Therefore, R(e, SampleUp(w)) can be computed in O(1) time.



In this case, the Figure is not correct: at most one item, item(r+1), may exist between item(r) and item(t).
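Substep 2 thus amounts to a constant number of comparisons per item. A Python sketch (positions are 1-indexed in the prose, 0-indexed in the code; the argument names are hypothetical):

```python
def rank_in_other_sample(e, sample_w, r, t):
    """R(e, SampleUp(w)) for an item e of SampleUp(v).

    r = R(d, SampleUp(w)) and t = R(f, SampleUp(w)), where d and f are
    the items of Up(u) straddling e (known from Substep 1 and the
    Assumption).  Only the items item(r+1) .. item(r + min(3, t - r))
    need examining, so the work is O(1) regardless of the array sizes.
    """
    rnk = r                               # everything up to position r is < e
    for j in range(r, min(t, r + 3)):     # 0-indexed candidate positions
        if sample_w[j] <= e:
            rnk = j + 1                   # item(j+1) <= e, so the rank grows
    return rnk
```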

When Substep 2 has been done in parallel for each item e in SampleUp(v) and SampleUp(w), every item knows its position (rank) in NewUp(u), by the equation R(e, NewUp(u)) = R(e, SampleUp(v)) + R(e, SampleUp(w)), and may write itself to the correct position of that array.

Phase 2, Step 2: Maintaining Ranks

This is only half the story. We assumed (in the Assumption above) that the ranks Up(u) → SampleUp(v) and Up(u) → SampleUp(w) are available at the start of each stage. We must therefore compute NewUp(u) → NewSampleUp(v) and NewUp(u) → NewSampleUp(w) at the end of each stage, so that the assumption is valid at the start of the next stage. (NewSampleUp(x) is the SampleUp(x) that will be produced in the next stage.)

Doing this in O(1) time is about as complicated as the merging described above. We explain the computation of NewUp(u) → NewSampleUp(v); NewUp(u) → NewSampleUp(w) is computed in an analogous way.

Consider an item e in NewUp(u). The computation of R(e, NewSampleUp(v)) is split into two cases. The first case occurs when e came from SampleUp(v) (or, during the computing of R(e, NewSampleUp(w)), when e came from SampleUp(w)). We term this case computation of rank in NewSampleUp from sender node, and it is described below. The other case is called computation of rank in NewSampleUp from other node, and is described thereafter.

First, we solve the from sender node case for all items in NewUp(u). The results of this computation are then used to solve the from other node case.

Computing the Rank in NewSampleUp From Sender Node

We know Up(u) → SampleUp(v) and Up(u) → SampleUp(w). NewUp(u) contains exactly the items in SampleUp(v) and SampleUp(w). Therefore, the rank R(·, NewUp(u)) for all items in Up(u) is easily computed as the sum of R(·, SampleUp(v)) and R(·, SampleUp(w)). This is done for all nodes in parallel.

SampleUp(v) is made from Up(v) by taking every fourth item in that array. (Every fourth item is the normal case; an implementation must also handle the two other sampling rates, which may occur when node v is external.) Thus, knowing the position of an item in SampleUp(v), it is easy to find the position of the corresponding (same) item in Up(v). Since we have just computed Up(v) → NewUp(v), we may compute SampleUp(v) → NewUp(v) by going through Up(v).

For each item in SampleUp(v), we have now found R(·, NewUp(v)). Similarly as for SampleUp(v), NewSampleUp(v) is made by taking every fourth item from NewUp(v). Informally, R(·, NewSampleUp(v)) is therefore simply R(·, NewUp(v)) divided by four.

At this point, we know SampleUp(v) → NewSampleUp(v), and that the item e in NewUp(u) came from SampleUp(v). To find R(e, NewSampleUp(v)), it remains to locate the corresponding (same) item e in SampleUp(v). This may easily be done if we, for each item in NewUp(u), record the sender address (the position of e in SampleUp(v)).
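Informally, the "divided by four" looks as follows. A Python sketch; that the sample occupies positions 4, 8, 12, ... of NewUp(v) is the same assumption as before, and sample_rate would be 2 or 1 near external nodes:

```python
def rank_in_new_sampleup(pos_in_new_up, sample_rate=4):
    """From sender node: R(e, NewSampleUp(v)) for the item e at 1-indexed
    position pos_in_new_up of NewUp(v).

    With NewSampleUp(v) taken as the items at positions sample_rate,
    2*sample_rate, ... of NewUp(v), the number of sample items <= e
    is just an integer division (distinct items assumed).
    """
    return pos_in_new_up // sample_rate
```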

Computing the Rank in NewSampleUp From Other Node

See Figure, which illustrates the case when e is from SampleUp(w). We want to find the correct position of item e if it had to be inserted in NewSampleUp(v). The following assumption will help us.

Assumption

(See Figure.) In each stage, at the start of Step 2 (Maintaining Ranks), for each item e in NewUp(u) that came from SampleUp(w), we know the straddling items d and f from SampleUp(v). (And vice versa for items from SampleUp(v).)

The item d is the first item in NewUp(u) from the other node (SampleUp(v)) to the left of e, and similarly f is the first item to the right. Further, we know the ranks r and t of d and f in NewSampleUp(v); they were computed in the from sender node case. Compare this situation with Substep 2 described above: the two figures illustrate the same method for computing the correct position of e in NewSampleUp(v) and SampleUp(w), respectively. R(e, NewSampleUp(v)) is found by comparing item e with at most three items in NewSampleUp(v) (recall the Observation above). Since these three items are sorted, two comparisons will suffice.

It remains to explain how to make the Assumption valid. Consider the earlier Figure, which illustrates the calculation of R(e, SampleUp(w)) for an item e in SampleUp(v). (Do not consider the items d and f in that figure; they have a slightly different meaning than d and f in the Assumption above.)

(The same precautions as mentioned in the previous footnote must be taken here. In addition, we must remember to use the sampling rate that will be used in the next stage.)

Figure: Maintaining ranks, the so called from other node case. The straddling items d and f, stored at the end of Step 1, are used to compute the rank of e in NewSampleUp from the other node. The position of item e in NewSampleUp(v) is bounded by the ranks r and t, which are computed in the from sender node case.

Since we know R(e, SampleUp(w)) (still the earlier Figure), we know exactly the two items from the other set that straddle e. For an item e from SampleUp(w), a completely symmetric calculation gives R(e, SampleUp(v)), and therefore implicitly the two items d and f from SampleUp(v) that straddle e. These are the items d and f needed in the figure above. Thus, the Assumption is kept valid if we, at the end of Step 1 (the making of NewUp(u)), record the positions (ranks) of d and f in NewUp(u).

Cole's Parallel Merge Sort Algorithm: Implementation

This section describes those aspects of the implementation of Cole's parallel merge sort algorithm that are most relevant to parallel processing. Processor activation, how the processors are used in parallel, the communication of local variables between the processors, and CREW PRAM programming are emphasised. Programming details that are well known from sequential programming are given less attention. The text assumes a reasonable knowledge of the algorithm description given in the preceding section.

Dynamic Processor Allocation

A key to the implementation of Cole's algorithm is to understand how the processors should be allocated to the work which has to be done in the various nodes of the tree during the computation.

Cole's description [Col] omits many details concerning how the work is distributed on the processors. Program designers may regard this lack of detail as a deficiency, but they should not fail to see its advantages. Cole's description is detailed enough to explain the main principles of the processor usage. Still, it is at a sufficiently high level to leave the programmer free to decide on the details of how the processors are used.

The main principles of the processor usage are described in this section. The full details of the processor usage in the actual implementation are explained in a later section.

A growing number of active levels

When designing the processor allocation, it is natural to start by finding out where and when there is work to do. As described above, at the end of every third stage, the lowest active level moves one level up towards the top. Further, the first samples received by an inside node u will have size equal to one. Thus |Up(u)| = 2 in the first stage when node u is active. Node u will not produce samples during this stage, because Up(u) is too small. However, in the next stage, the size of Up(u) will have been doubled, and node u will produce samples for its parent node. We see that the highest active level will move one level upwards every second stage.

Because of the difference in "speed" of the lowest and highest active levels in the tree, the total number of active levels will increase during the computation, until the top level has been reached. This is illustrated in the Figure.

The Number of Items in the Arrays

Cole gives a detailed analysis of the number of items in the Up, NewUp and SampleUp arrays. The size of the arrays will be different from stage to stage. Consider a node u, with nonempty Up(u), and child nodes v and w which are not external. Since |Up(u)| doubles in each stage, we have

Figure: The active levels of the binary tree used by Cole's algorithm for a small example, from the leaf nodes up to the root node, stage by stage. M represents that inside nodes at that level perform merging in the stage; (i) represents that nodes at that level are in their ith stage as external nodes.



jUpv j jUpw j jNewUp uj jSampleUp v j jSampleUp w j



 

jUpv j which gives jUpuj jUpv j Also the number of nodes at

 

us level is the numb er of no des at v s level divided by This implies that



the total size of the Up arrays at us level is of the total size of the Up

arrays at v s level if v is not external This is also true for the rst stage

in which no des at v s level are external But for the second stage the no des

at v s level pro duce samples bypicking every second item which implies



jUpuj jUpv j Similarly for the third stage wehave jUpuj jUpv j



To summarise

Observation

The total size of the Up arrays at u's level is 1/4 of the total size of the Up arrays at v's level if v is not external, or if v is in its first stage as external. The fraction is 1/2 if v is in its second stage as external, and it is 1 if v is in its third stage as external.



This gives the following upper bound for the total size of the Up arrays:

Σ_u |Up(u)| ≤ n + n + (1/4)n + (1/16)n + ⋯ (= 2⅓ n)

(Σ_u denotes a summation over all active nodes u. The right part of the equation has been put into parentheses to remind the reader that the = sign is only valid when n is infinitely large.)

The total size of the SampleUp arrays for a given level equals 1/4 of the total size of the Up arrays at the same level in the normal case. As before, we have two exceptions, given by the different sampling rates used when nodes

are in their second or third stage as external. The third stage as external gives the largest total size, so we have the following upper bound:

Σ_u |SampleUp(u)| ≤ n + (1/4)n + (1/16)n + ⋯ (= 1⅓ n)

The NewUp arrays contain the items in the SampleUp arrays. They are located one level higher up in the tree, but the total size becomes the same:

Σ_u |NewUp(u)| = Σ_u |SampleUp(u)|

A Sliding Pyramid of Processors

Cole describes that one should have one processor standing by each item in the Up, NewUp and SampleUp arrays. This strategy implies a processor requirement which is bounded above by 2⅓n + 1⅓n + 1⅓n, i.e. slightly less than 5n.

Consider the Up arrays and the expression given for their size above. There are n processors (array items) at the lowest active level, a maximum of n processors at the next level above, a maximum of (1/4)n processors at the level above that, and so on. This may be viewed as a pyramid of processors. Each time the lowest active level moves one level up, the pyramid of processors follows, so that we still have n processors at the lowest active level. Similarly, the NewUp processors and SampleUp processors may be viewed as pyramids of processors that slide up towards the top of the tree during the progress of the algorithm.

Processor Allocation Snapshot

Since the processor allocation is so crucial for the whole implementation, we illustrate it by a simple example with a small problem size n. In this case, the equations above imply a maximum processor requirement which, as discussed above, may only occur in every third stage. Nevertheless, since the main goal for the implementation was to obtain a simple and fast program, not to minimise the processor requirement, we allocate this maximum number of processors once, at the start of the program.

We get three pyramids of processors, as shown in the Figure. The distribution into levels is static; thus, the same processor is always the lowest

Figure: Cole's algorithm: the Up processors, SampleUp processors and NewSampleUp processors distributed into levels.

numbered processor used for the Up arrays one level above the external nodes.

The snapshot Figure shows how these processor pyramids are allocated to the binary tree nodes during one stage. The nodes in the binary tree are numbered as depicted in the Figure. The snapshot illustrates several situations which must be handled.

Not all levels of the processor pyramids are active in every stage (recall the Figure of active levels); the set of active levels changes during the computation. All nodes that are situated at active levels are called active nodes. In some stages, the size of the arrays is smaller than the maximum size at that level. Since we have allocated enough processors to cover the maximum size, there may exist levels with some active and some passive processors, e.g. the bottom level of the SampleUp pyramid in the snapshot. Since the processor pyramids slide towards the top of the binary tree during the computation, the number of nodes covered by a certain level of a processor pyramid will vary. Consequently, the processors must be distributed to the nodes dynamically. As an example, the lowest level of the Up processors may be allocated to four nodes in one stage, and distributed on only two nodes three stages later. Further, since the size of the arrays at a certain level may change between two stages, the allocation of processors to array items within a node must be done at the start of each stage; an Up processor may thus be allocated to different items, in different nodes, in two succeeding stages.

Figure: Snapshot of processor allocation taken during one stage of the example above. For each processor (see the previous figure), it is shown which node the processor is allocated to. Legend: i indicates that the processor is allocated to node i; P indicates that the processor is passive.


Figure: Standard numbering of nodes in the binary tree used throughout this thesis.

CREW PRAM procedure ColeMergeSort;
begin
  Compute the processor requirement (NoOfProcs);
  Allocate working areas;
  Push addresses of working areas and other facts on the stack;
  assign NoOfProcs processors;
  for each processor do begin
    Read facts from the stack;
    InitiateProcessors;
    for Stage := 1 to log n do begin
      ComputeWhoIsWho;
      if Stage > 1 then CopyNewUpToUp;
      MakeSamples;
      MergeWithHelp;
    end;
  end;
end;

Figure: Main program of Cole's parallel merge sort expressed in Parallel Pseudo Pascal (PPP).

Pseudo Code and Time Consumption

Cole's succinct description of the algorithm is at a sufficiently high level to give the programmer freedom to choose between the SIMD or MIMD [Fly] programming paradigms. The algorithm has been programmed in a synchronous MIMD programming style, as proposed for the PRAM model by Wyllie [Wyl]. It is written in PIL, and the program is executed on the CREW PRAM simulator.

We will use the PPP notation to express the most important parts of the implementation in a compact and readable manner. A complete PIL source code listing is provided in Appendix B.

The Main Program

The main program of Cole's parallel merge sort algorithm is shown in the figure above. When the program starts, one single CREW PRAM processor is running, and the problem instance and its size n are stored in the global

Table: Time consumption for the statements in the main program of ColeMergeSort. The entries for some statements are valid only for the later stages; the time used is shorter for some of the early stages (see the Performance section). The t(n) entries are constants, in some cases plus terms proportional to floor(log n) or floor(log NoOfProcs).

memory. The first statements are executed by this single processor alone.

Computing the processor requirement

The maximum processor requirement, here denoted NoOfProcs, is given by the sum of the maximum size of the Up(u), NewUp(u), and SampleUp(u) arrays for all nodes u during the computation. Thus we have

NoOfProcs = Sum over u of |Up(u)| + Sum over u of |NewUp(u)| + Sum over u of |SampleUp(u)|

For a given finite n, the exact calculation of NoOfProcs is done by replacing the inequality sign with equality in the size equations and summing terms as long as n is not smaller than the divisor. A simple way to implement these calculations is to use divide-by loops.

The resulting time consumption for this statement as a function of the problem size is given in the table above as t(n).
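The divide-by loops may be sketched as follows. This is a hypothetical Python sketch, not the PIL code: the exact size equations are not reproduced here, so we simply assume that each pyramid contributes a geometrically decreasing sum of maximum array sizes (halving per level).

```python
def geometric_sum(n, divisor_factor=2):
    # Divide-by loop: sum the terms n div d for d = 1, 2, 4, ...
    # as long as n is not smaller than the divisor.
    total, d = 0, 1
    while n // d >= 1:
        total += n // d
        d *= divisor_factor
    return total

def no_of_procs(n):
    # Illustrative only: assume the Up, NewUp, and SampleUp pyramids
    # each contribute one such sum of maximum array sizes.
    return 3 * geometric_sum(n)
```

For n = 8 the sketch gives 3 * (8 + 4 + 2 + 1) = 45; the real equations use other divisors and constants, but the looping technique is the same.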

Working areas and address broadcasting

When the total number of processors has been computed, it is possible to allocate working areas of the appropriate size in the PRAM global memory. These are dedicated areas which are used to exchange various kinds of information between the processors (see the memory map figure below).

Name of working area                        No. of integer locations
problem instance                            n
problem size (n)
Up, SampleUp, and NewUp arrays              NoOfProcs
ExchangeArea                                NoOfProcs
NodeAddressTable                            n
NodeSizeTable                               n
parameters to SetUp

Figure: Memory map for the implementation of Cole's algorithm. NoOfProcs locations are used to store the contents of the Up, SampleUp, and NewUp arrays: one integer location for each array item (processor). The ExchangeArea is used to communicate array items between the processors. The NodeAddressTable and NodeSizeTable are used to help the processors find the address and size of the arrays of other nodes in the tree.

The next statement describes how central facts, such as the addresses of the working areas, are broadcast from the single master processor. It is done by pushing values on the user stack in the PRAM global memory. When all NoOfProcs processors have been allocated and activated, they may read these data in parallel from the stack. Due to the concurrent-read property of the CREW PRAM, this kind of broadcasting is easily performed in O(1) time.
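The broadcasting pattern may be sketched as follows; a hypothetical Python sketch in which threads stand in for CREW PRAM processors and a shared list stands in for the user stack (the names and values are invented for illustration).

```python
from concurrent.futures import ThreadPoolExecutor

shared_stack = []  # stands in for the user stack in PRAM global memory

def master_setup():
    # The single master processor pushes the central facts once.
    shared_stack.extend([("working_area_base", 1000), ("n", 8)])

def worker(pid):
    # Every processor reads the same locations; on a CREW PRAM all
    # these concurrent reads happen in the same O(1) step.
    facts = dict(shared_stack)
    return pid, facts["n"]

master_setup()
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(worker, range(4)))
```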

Processor allocation and activation

The processors are allocated with the standard procedure described in the CREW PRAM properties. Since NoOfProcs is O(n), the expression given for t(n) in the table is O(log n).

The desire for developing a simple and fast implementation also influenced the structure of processor allocation and activation.

Implementing sliding processors

Recall the description given earlier of how the processors should be allocated to the binary computation tree. When implementing this, it is crucial to split the necessary computation into a static and a dynamic part. The dynamic part, called ComputeWhoIsWho (described below), is executed at the beginning of each stage. Consequently, it must be done in O(1) time to avoid spoiling the overall O(log n) time performance of the algorithm. This is possible since the most difficult part need only be computed once, initially. The static part is represented by the InitiateProcessors statement.

InitiateProcessors, which is executed by all processors in parallel, starts by computing the address in the global memory of the Up, SampleUp, or NewUp array item associated with the processor. It reads the problem size n, computes the number of stages, and decides its own processor type (Up, SampleUp, or NewUp). Then the Up processors read one input number each and place it in their own Up array item.

Then it is time to perform the static part of the processor sliding. It is done level by level for each processor type. It is calculated which level in the pyramid the processor is assigned to. This is counted in number of levels (zero or more) above the lowest active level, which always contains the external nodes. Thus, whether a processor is assigned to an inside node or an external node is also known at this point. Finally, the local processor number at that level (numbered from the left) is computed.

Again, the level-by-level approach is implemented by divide-by loops, giving the logarithmic terms for t(n) shown in the table.
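Assuming that a pyramid's levels halve in size (n processors at the lowest level, n/2 one level up, and so on), the static level computation may be sketched like this (hypothetical Python; the function name and conventions are ours, not PIL's):

```python
def level_and_local_index(p, n):
    """Sketch of the static part of processor sliding.

    Processors 0..n-1 sit at the lowest level, the next n//2 one level
    above, and so on. Returns (number of levels above the lowest level,
    local processor number at that level, counted from the left),
    computed by a divide-by loop.
    """
    level, size, base = 0, n, 0
    while p >= base + size:
        base += size      # skip all processors at the levels below
        size //= 2        # each level holds half as many processors
        level += 1
    return level, p - base
```

For n = 8, processors 0..7 get level 0, processors 8..11 level 1, 12..13 level 2, and 14 level 3.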

The Main Loop: Implementation Details

The log n stages each consist of four main computation steps.

ComputeWhoIsWho

ComputeWhoIsWho starts by calculating the lowest and highest active level in the binary computation tree. Together with the level offset computed initially (InitiateProcessors), this gives the exact level for each processor in the current stage. Knowing the stage and the level, it is possible for each Up, SampleUp, and NewUp processor to compute the size of the Up, SampleUp, or NewUp array in each node at that level. This again makes it possible for each processor to calculate the node number and item number within the array for that node, and whether the processor is active or passive during the stage. All the calculations may be done in O(1) time.

CopyNewUpToUp

CopyNewUpToUp transfers the array contents and ranks computed for the NewUp arrays in the previous stage to the corresponding Up arrays in the current stage. The procedure is quite simple, but it will be described in some detail to illustrate how data typically are communicated between the processors.

The procedure is outlined in the figure below. It assumes that the NodeAddressTable (see the memory map) for each node in the tree stores the address of the first item in the NewUp array for that node during the previous stage. The local variable CorrNewUpItem is used by each Up processor to hold the address of the NewUp processor in the previous stage which corresponds to the Up item associated with the Up processor in the current stage.

The processor activation in the first for-statement is realised by selecting all active Up processors which are assigned to an inside node. In every third stage,



This assumption is made valid by letting each pro cessor which is assigned to the rst

item of a NewUp array store the address of that item in the No deAddressTable at the end

of each stage



Ie the address in the dedicated part of the global memory used to store the Up

SampleUp and NewUp arrays see Figure

CREW PRAM procedure CopyNewUpToUp;
begin
  address CorrNewUpItem;
  for each processor assigned to an Up array element, where the
      NewUp array for the same node was updated in the
      previous stage, do begin
    CorrNewUpItem :=
      NodeAddressTable[this node] + this item no;
    Up[this processor] := NewUp[CorrNewUpItem];
  end;
  for each processor of type NewUp do
    ExchangeArea[this processor] := RankInLeftSUP;
  for each processor assigned to an Up array element, where the
      NewUp array for the same node was updated in the
      previous stage, do
    RankInLeftSUP := ExchangeArea[CorrNewUpItem];
  { Transfer RankInRightSUP in the same manner }
end;

Figure: CopyNewUpToUp outlined in PPP.

when the external nodes are in their first stage as external, the Up processors assigned to the external nodes should also be selected, since these nodes were inside nodes in the previous stage.

Next, the Up processors compute CorrNewUpItem in parallel. The concurrent-read property of the CREW PRAM is exploited, since several Up processors are assigned to the same node. The copying of all NewUp arrays to the Up arrays is then done simultaneously by the Up processors. The possibility of doing many writes simultaneously to different locations of the global memory is used. No write conflict will occur, because every Up processor is assigned to one unique Up array element.

Knowing the mapping from an Up processor in the current stage to the corresponding NewUp processor in the previous stage, it is straightforward to communicate the ranks computed at the end of each stage to keep the assumption above valid. In the description of maintaining ranks, it was shown how to compute NewUp(u) -> NewSampleUp(v) and NewUp(u) -> NewSampleUp(w). For each NewUp and Up processor, these two ranks are stored in the local variables RankInLeftSUP and RankInRightSUP, respectively. Thus Up(u) -> SampleUp(v) and Up(u) -> SampleUp(w) will be known in the current stage if we copy these two variables from each NewUp processor to the corresponding Up processor.

The ExchangeArea (see the memory map) is allocated for this kind of communication. First, each NewUp processor writes the value of RankInLeftSUP to the unique location associated to that processor. Then, since we already know the mapping, the Up processors simply read the new value of RankInLeftSUP from the appropriate position in the ExchangeArea. RankInRightSUP is transferred in the same way.

CopyNewUpToUp is performed in constant time on a CREW PRAM.
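The constant-time handoff through the ExchangeArea may be sketched as follows (hypothetical Python; the sequential loops stand in for the parallel write and read steps, and all names are ours):

```python
def copy_new_up_to_up(new_up_ranks, corr_new_up_item):
    """new_up_ranks[p] is the RankInLeftSUP held by NewUp processor p.
    corr_new_up_item[q] is the ExchangeArea slot of the NewUp item
    corresponding to Up processor q. Returns RankInLeftSUP per Up
    processor."""
    exchange_area = {}
    for p, rank in enumerate(new_up_ranks):
        exchange_area[p] = rank               # one parallel write step
    return [exchange_area[corr_new_up_item[q]]  # one parallel read step
            for q in range(len(corr_new_up_item))]
```

Every NewUp processor writes to its own slot and every Up processor reads one slot, so both steps are conflict-free and take constant time on a CREW PRAM.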

MakeSamples

MakeSamples is a procedure performed by the Up and SampleUp processors in cooperation to produce the samples SampleUp(x) for all active nodes x. This is a straightforward task (see the algorithm description) and can easily be done in constant time.

First, the Up processors write the address of the first item of the Up array in each active node to the NodeAddressTable. Then every active

(SUP is short for SampleUp, and NUP for NewUp.)

CREW PRAM procedure DoMerge;
begin
  for each processor of type Up or SampleUp do
    ComputeCrossRanks;
  for each processor of type SampleUp do begin
    RankInParentNUP := RankInThisSUP + RankInOtherSUP;
    Compute SLrankInParentNUP and SRrankInParentNUP;
    Find address of the NewUp array in the parent node
      and store it in ParentNUPaddr;
    WriteAddress := ParentNUPaddr + RankInParentNUP;
    NewUp[WriteAddress] := own value;
    Record the sender address for each new NewUp item
      in the ExchangeArea;
  end;
end;

Figure: DoMerge outlined in PPP.

SampleUp processor calculates the SampleRate to be used to produce the samples at that level. Every SampleUp processor knows the node x and item number (index) i within SampleUp(x) it is assigned to. It reads the address of the first item in Up(x) from the NodeAddressTable, and uses the SampleRate to find the address of the item in Up(x) which corresponds to item i in SampleUp(x). Finally, and in parallel, each active SampleUp processor copies this Up array item to its own SampleUp array item.
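Assuming a fixed sampling convention (every SampleRate-th item of Up(x); the exact offset convention is not fixed by the text), MakeSamples may be sketched as follows in hypothetical Python:

```python
def make_samples(up, sample_rate):
    # Item i of SampleUp corresponds to one item of Up per sample_rate
    # items; here we take the last item of each group of sample_rate
    # (an assumed convention, chosen only for illustration).
    return [up[sample_rate * (i + 1) - 1]
            for i in range(len(up) // sample_rate)]
```

Each SampleUp processor computes one such address and performs one copy, so the whole step is done in constant time in parallel.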

Merging in Constant Time: Implementation Details

Having read the entire algorithm description, it should be no surprise that the implementation of the constant-time merging phase constitutes the major part of the implementation. MergeWithHelp is split into two parts: DoMerge, which performs the first step of the algorithm, and MaintainRanks, which performs the second step.

DoMerge

DoMerge is outlined in the figure above. Its main task is to compute the position of each SampleUp array item in the NewUp array of its parent node. This rank is stored in the variable RankInParentNUP. It is given as the sum of the rank of the item in the SampleUp array which it itself is a part of, and its rank in the SampleUp array of its sibling node. These two ranks are stored in the variables RankInThisSUP and RankInOtherSUP, respectively, and they are computed by the procedure ComputeCrossRanks.

Postpone the second statement for a moment. After having computed RankInParentNUP, the SampleUp processors must get the address of the NewUp array of the parent node. It is done by using the NodeAddressTable in the same way as in the procedure CopyNewUpToUp described above. Then each SampleUp processor knows the right address in the NewUp array for writing its own item value. Notice that all the items in all the NewUp arrays are updated simultaneously. We do not get any write conflicts, but surely the simultaneous writing of n values performed by n different processors represents an intensive utilisation of the CREW PRAM global memory.
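The conflict-free simultaneous writes may be sketched as follows (hypothetical Python with 0-indexed positions; ranks_left[i] plays the role of RankInOtherSUP for the i-th item of the left child):

```python
def do_merge_write(left_sup, right_sup, ranks_left, ranks_right):
    # Every SampleUp item is written to position
    # RankInThisSUP + RankInOtherSUP of the parent's NewUp array.
    # All target positions are distinct, so the writes are
    # conflict-free and could all happen in one exclusive-write step.
    new_up = [None] * (len(left_sup) + len(right_sup))
    for i, v in enumerate(left_sup):
        new_up[i + ranks_left[i]] = v
    for j, v in enumerate(right_sup):
        new_up[j + ranks_right[j]] = v
    return new_up
```

With left_sup = [2, 6], right_sup = [4, 8] and the cross ranks [0, 1] and [1, 2], the result is the merged sequence [2, 4, 6, 8].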

Before DoMerge terminates, the SampleUp processors must also store the sender address for each NewUp item just written. This information is needed in MaintainRanks. The sender address of an item is represented by the item no. (index) in the SampleUp array it was copied from, and it is stored in the ExchangeArea. If the SampleUp item is situated in a left child node, the address is stored as a negative number.

Now let us return to the postponed statement. Repeat the assumption stated earlier: for each SampleUp array item we should know the straddling items from the other SampleUp array. In the terms used in the merge figure: for each item e, we must know d and f. These positions are possible to compute now, because RankInParentNUP has been computed by all the SampleUp processors.

The calculation is based on the two variables SLindex and SRindex, computed by ComputeCrossRanks. For an item e in SampleUp(v), SLindex and SRindex are the positions in SampleUp(w) of the two items in SampleUp(w) that would straddle e if e was inserted according to sorted order in SampleUp(w). Thus, in general, we have SLindex = RankInOtherSUP and SRindex = RankInOtherSUP + 1.



(SL is short for straddling on the left side, and SR for straddling on the right side.)

Again we use the ExchangeArea. All SampleUp processors write their own value of RankInParentNUP to it. Then SLindex is used to find the address in the ExchangeArea of the SampleUp item from the sibling node that would straddle this item on the left side. The position of this item in NewUp of the parent node, called SLrankInParentNUP, is then read from the ExchangeArea. Similarly, SRindex gives the address for reading SRrankInParentNUP, and we have reached our goal, which was to know the positions of d and f for each item e in NewUp(u). Note that SLrankInParentNUP and SRrankInParentNUP are stored locally in each processor and need not be communicated to other processors before they are used in MaintainRanks.
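With 0-indexed arrays, the straddling positions follow directly from the rank of an item in the sibling array; a hypothetical Python sketch:

```python
import bisect

def straddling_positions(value, other_sample_up):
    # For an item e, find the positions in the sibling's (sorted)
    # SampleUp array of the two items that would straddle e if it were
    # inserted in sorted order: the last item below it and the first
    # item not below it.
    r = bisect.bisect_left(other_sample_up, value)
    return r - 1, r
```

In the implementation these positions are not searched for; they are obtained in O(1) time from RankInOtherSUP, which has already been computed.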

ComputeCrossRanks

ComputeCrossRanks is outlined in the figure below. It is performed by the Up and SampleUp processors in cooperation. As explained in the algorithm description, RankInThisSUP is easily found. However, the calculation of RankInOtherSUP is rather complicated, and it is split into two substeps.

The pseudo code for the first substep describes that all left child nodes v are handled first, and then all right child nodes w. In other cases, such as in DoMerge, both SampleUp arrays are handled in parallel. The reason for the reduced parallelism in this case is that the work represented by these statements is done by the Up processors in their common parent node.

A description of the code that implements the first substep follows. Each Up processor starts by writing its value of Up(u) to the ExchangeArea. Then, in addition to knowing r, the Up processors may read the value s from the ExchangeArea. Further, in addition to knowing its own value i, it reads the value of the next item stored in the Up array. The address of the SampleUp(v) array is found by using the NodeAddressTable. Now the situation is exactly as in the algorithm description, and each



(The variable RankInThisSUP stores R(e, SampleUp(x)), where e is in SampleUp(x) and x = v or x = w. The rank of an item e in the SampleUp array of its sibling node is stored in RankInOtherSUP.)

(An exception, which is much simpler than the general case, occurs in nodes with SampleUp arrays containing only one item. This implies that there is no Up array in the parent node to help us in computing the rank in the SampleUp array of the sibling node. However, since that array also contains only one item, RankInOtherSUP may be computed by doing one single comparison.)

(Can you suggest a possible strategy for handling both child nodes in parallel?)

CREW PRAM procedure ComputeCrossRanks;
begin
  for each processor of type SampleUp do
    RankInThisSUP := position (index) of item in own array;
  for each processor of type Up do begin
    { Substep 1 }
    For each item e in SampleUp(v), compute its rank in Up(u);
    For each item e in SampleUp(w), compute its rank in Up(u);
    { RankInParentUP is now stored in the ExchangeArea }
  end;
  for each processor of type Up or SampleUp do begin
    { Substep 2 }
    if processor type is SampleUp then
      Read RankInParentUP from the ExchangeArea;
    if processor type is Up then begin
      Store address/size into the NodeAddress/SizeTable;
      Store RankInRightSUP into the ExchangeArea;
    end;
    if processor type is SampleUp in a left child node then
      Find r and t;
    if processor type is Up then
      Store RankInLeftSUP into the ExchangeArea;
    if processor type is SampleUp in a right child node then
      Find r and t;
    if processor type is SampleUp then begin
      Read the (at most three) values in positions r to t
        from the other SampleUp array;
      LocalOffset := position of own value among these;
      RankInOtherSUP := r + LocalOffset;
    end;
  end;
end;

Figure: ComputeCrossRanks outlined in PPP.

Up processor considers the candidates in positions r to s according to the rules described in the algorithm description. When the Up processor which represents item i has found that an item in SampleUp(v) is contained in the interval induced by i, denoted I(i), it assigns the index of i to the variable RankInParentUP of that SampleUp item. This assignment is done indirectly, by writing the value to the ExchangeArea. At the start of the second substep, the SampleUp processors read this value.

The SampleUp item pointed to by s for one item i is the same as the item pointed to by r for the next item. Consequently, these border items are considered by two processors. Therefore, one might think that our strategy for updating RankInParentUP may lead to write conflicts. However, this will not occur, since the Up processor associated with an item x only assigns to the value RankInParentUP for those items in SampleUp(v) which are contained in I(x), and every SampleUp item is a member of exactly one such set I(x).

The second substep is performed by the SampleUp and Up processors in cooperation. Again, we handle all left child nodes before all right child nodes. The left child nodes must access the local Up variable RankInRightSUP to find the values of r and t, and the right child nodes must access RankInLeftSUP. The SampleUp processors use the NodeAddressTable and the variable RankInParentUP to locate the items d and f. The values of r and t are then read from the ExchangeArea at the positions corresponding to d and f.

When all SampleUp processors have calculated r and t, they may do the rest of the substep in parallel. They read the values of the items in positions r to t in the SampleUp array of their sibling node, and compare their own value with these (at most three) values. The position with respect to sorted order is found and stored in LocalOffset. RankInOtherSUP is then given as r + LocalOffset.
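This final comparison step may be sketched as follows (hypothetical Python, 0-indexed: we assume the slice other_sup[r:t] holds the few candidate items, at most three in the algorithm):

```python
def rank_in_other_sup(own_value, other_sup, r, t):
    # Compare the own value with the candidate items between the
    # straddling positions, and add the local offset to r.
    window = other_sup[r:t]
    local_offset = sum(1 for v in window if v < own_value)
    return r + local_offset
```

Since the window holds a bounded number of items, each SampleUp processor finds its cross rank with a constant number of comparisons.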

MaintainRanks

As described in the algorithm description, the computation of NewUp(u) -> NewSampleUp(v) and NewUp(u) -> NewSampleUp(w) is split into two cases, here represented by the procedures RankInNewSUPfromSenderNode and RankInNewSUPfromOtherNode. The figure below shows the most important parts of the "from sender node" case. Both procedures are executed by the Up, SampleUp, and NewUp processors in cooperation.

The first part of the procedure is performed by the

CREW PRAM procedure RankInNewSUPfromSenderNode;
begin
  { Performed in parallel by Up, SampleUp, and NewUp processors }
  if processor type is Up then begin
    RankInThisNUP := RankInLeftSUP + RankInRightSUP;
    ExchangeArea[this processor] := RankInThisNUP;
  end;
  if processor type is SampleUp then begin
    By help of the sample rate used in this stage at this level,
      find the position in the Up array corresponding to this
      SampleUp item;
    Use this position to read the correct value of RankInThisNUP
      from the ExchangeArea;
    { RankInThisNUP stores SampleUp(x) -> NewUp(x) }
    Compute RankInThisNewSUP from RankInThisNUP by
      considering the sample rate that will be used in the next
      stage to make NewSampleUp from NewUp;
    { RankInThisNewSUP stores SampleUp(x) -> NewSampleUp(x) }
    ExchangeArea[this processor] := RankInThisNewSUP;
  end;
  if processor type is NewUp then
    if this item came from SampleUp(v) then
      Read the value of RankInLeftSUP valid for the next
        stage from the position in the ExchangeArea
        corresponding to its sender item in SampleUp(v)
    else { item came from SampleUp(w) }
      Read the value of RankInRightSUP valid for the next
        stage from the position in the ExchangeArea
        corresponding to its sender item in SampleUp(w);
end;

Figure: RankInNewSUPfromSenderNode outlined in PPP. Here we have assumed that the processor activation has been done at the caller place, and not inside the procedure.

Up processors. Since NewUp(u) contains exactly the items in SampleUp(v) and SampleUp(w), Up(u) -> NewUp(u) is easily computed, as expressed in the first statement. The ExchangeArea is used to transfer these values to the SampleUp processors.

In the second part of the procedure, the SampleUp processors compute SampleUp(x) -> NewSampleUp(x), x = v or w. Note that both the left and right child nodes are handled in parallel. Each SampleUp item (processor) calculates its corresponding Up item in the same node, so that SampleUp(x) -> NewUp(x) is known after that statement has been executed. Since NewSampleUp(x) is made from NewUp(x), we get SampleUp(x) -> NewSampleUp(x), which is stored in the local variable RankInThisNewSUP and written to the ExchangeArea.

The last part of RankInNewSUPfromSenderNode is performed by the NewUp processors. Remember that we used the ExchangeArea to store the sender address of each NewUp item at the end of procedure DoMerge. This information is read by the NewUp processors at the very start of MaintainRanks, before RankInNewSUPfromSenderNode is called. We describe the SampleUp(v) case; the other case contains different but quite symmetric code. Each NewUp item (processor) uses its sender address to find its corresponding item in SampleUp(v). Since the ExchangeArea stores SampleUp(x) -> NewSampleUp(x) for all nodes x, an arbitrary item e in NewUp(u) that came from SampleUp(v) may indirectly read R(e, NewSampleUp(v)) from the ExchangeArea, so that NewUp(u) -> NewSampleUp(v) becomes known for all items in u that came from v. This rank is stored in the variable RankInLeftSUP, and is copied from the NewUp to the Up processors at the beginning of the next stage by CopyNewUpToUp.

Procedure RankInNewSUPfromOtherNode is outlined in the figure below. Almost all the work is done by the NewUp processors.

The procedure starts by transferring the variables SLrankInParentNUP and SRrankInParentNUP, which were computed in DoMerge, from the SampleUp processors to the NewUp processors. The values r and t are communicated from the processors holding items d and f to the processor holding item e. Again, the ExchangeArea is used, and we get a reduced

(These statements give a typical example of how the MIMD property of the CREW PRAM model may be used to reduce the execution time and increase the processor utilisation.)

CREW PRAM procedure RankInNewSUPfromOtherNode;
begin
  { Performed in parallel by Up, SampleUp, and NewUp processors }
  Communicate SLrankInParentNUP from the SampleUp processors
    to the NewUp processors through the ExchangeArea;
  Communicate SRrankInParentNUP from the SampleUp processors
    to the NewUp processors through the ExchangeArea;
  { Each NewUp processor knows the positions of d and f }
  Use the NodeAddress/SizeTable to let each NewUp processor know
    the address/size of the NewUp array in the other child node;
  For each NewUp item e that came from SampleUp(w), get the
    values of RankInLeftSUP from the NewUp items d and f that
    came from SampleUp(v);
  For each NewUp item e that came from SampleUp(v), get the
    values of RankInRightSUP from the NewUp items d and f that
    came from SampleUp(w);
  { Each NewUp processor knows the values of r and t }
  Compute the local offset of each NewUp item e within the
    items in positions r to t in the NewSampleUp
    array of its other child node; { See the text }
  The rank of NewUp item e in the NewSampleUp array of the other
    node is now given by r and the local offset just computed,
    and it is stored in RankInLeftSUP or RankInRightSUP;
end;

Figure: RankInNewSUPfromOtherNode outlined in PPP.

degree of parallelism, as implicitly expressed by the two communication statements. Further, note that the ranks r and t correspond to the "from sender node" case, which has just been solved.

For each item e in NewUp(u), we must locate the items in the positions r to t in NewSampleUp(v) or NewSampleUp(w). This is not trivial, because the NewSampleUp arrays do not exist. However, we know that NewSampleUp(x) will be made by the procedure MakeSamples from NewUp(x) in the next stage. Thus, a given item in NewSampleUp(v) or NewSampleUp(w) may be located in NewUp(v) or NewUp(w) if we take into account the sampling rate that will be used at that level (one level below) in the next stage.

This should explain how it is possible to implement this statement, and why we bothered to store the addresses of the NewUp arrays earlier. Finally, each NewUp item (processor) that came from SampleUp(w) (SampleUp(v)) computes its rank in NewSampleUp(v) (NewSampleUp(w)), and stores this value in RankInLeftSUP (RankInRightSUP).

Performance

The time used to perform a stage is somewhat shorter for the six first stages than the numbers listed in the time-consumption table. This is because some parts of the algorithm do not need to be performed when the sequences are very short. However, for all stages after the sixth, the time consumed is as given by the constants in the table; the first stages take a fixed total number of time units. The total time used by ColeMergeSort on n distinct items (n = 2^m, m an integer) may be expressed as

T(ColeMergeSort, n) = t(n) + ... + t(n) log n

The O(1)-time merging performed by MergeWithHelp constitutes the major time consumption of the algorithm. Of the time used by MergeWithHelp, a large part is needed to compute the cross ranks [Col], and nearly as much is used to maintain the ranks [Col]. Note that the execution time of Cole's parallel merge sort algorithm is independent of the actual problem instance for a given problem size. The equation above makes it possible to calculate the time consumption of the algorithm as an exact figure for any problem size which is a power of two.

Table: Performance data measured from execution of ColeMergeSort on the CREW PRAM simulator. Time and cost are given in kilo CREW PRAM time units and kilo CREW PRAM unit-time instructions, respectively. The numbers of read/write operations to/from the global memory are given in kilo locations.

Problem size n | time | processors | cost | reads | writes

The table shows the performance of Cole's algorithm on various problem sizes.

Figure: Outline of results expected when comparing Cole's parallel merge sort with straight insertion sort. The sketch plots log T(n) against log n, with one curve for Cole's algorithm and one for insertion sort; the crossing point is at n = N.

Cole's Parallel Merge Sort Algorithm Compared with Simpler Sorting Algorithms

The First Comparison

When starting on this work, I expected that Cole's algorithm was so complicated that it, in spite of its known O(log n) time complexity, would be slower than simpler parallel or sequential algorithms unless we assumed very large problem sizes. I planned to show that even straight insertion sort executed on one single processor would perform better than Cole's algorithm as long as n < N, and I expected N to be so large that it would imply a processor requirement for Cole's algorithm in the range of hundreds of thousands or millions of processors. Then I could have claimed that Cole's algorithm is of little practical value.

The kind of results that I expected is illustrated in Figure …. The point where Cole's algorithm becomes faster than 1-processor insertion sort for worst case problems (i.e., problems making insertion sort an O(n²) algorithm) is marked with a circle in the figure. The corresponding problem size is denoted N. Since insertion sort is very simple, I expected that the relatively high descriptional complexity of Cole's algorithm would imply N to be as large as … or even larger.

However, the results for the time consumption of Cole's algorithm were much better than expected. Figure … shows the time used to sort n integers by Cole's algorithm compared with 1-processor insertion sort and n-processor odd-even transposition sort. Note that we have logarithmic scale on both axes.

Cole's algorithm is the CREW PRAM algorithm described in the previous section, with performance data as summarised in Table …. Insertion sort is the algorithm which was described in Section …; its performance was described in Section …. The simplest version of the odd-even transposition sort algorithm was described in Section …. Here we are studying the version developed in Section …, which requires only n processors to sort n items. The performance of that CREW PRAM implementation was reported in Section …. The times used by Cole's algorithm and odd-even transposition sort are independent of the actual problem instance for a fixed problem size. However, insertion sort requires O(n²) time in the worst case and O(n) time in the best case, both shown in the figure.

We see that Cole's algorithm becomes faster than insertion sort (worst case) at N = …. Also, Cole's algorithm becomes faster than odd-even transposition sort, which is very simple and uses O(n) processors, for problem sizes at about … items.

These somewhat surprising results (Figure …) were presented at the Norwegian Informatics Conference [Nat]. It was a preliminary comparison which did not support my expectations. The practical value of Cole's algorithm was still an open question, and it was clear that further investigations were needed.

In retrospect, being surprised by the obtained results can be considered an underestimation of the difference between polynomials, such as n² and n, and log n, for large and even modest values of n.
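That difference can be made concrete with a small sketch: even when the logarithmic-time bound carries a constant factor a thousand times larger, it overtakes a quadratic bound at a very modest problem size. The constants below are illustrative only, not the measured CREW PRAM constants.

```python
import math

def crossover(c_log, c_quad):
    """Smallest n >= 2 where c_log * log2(n) <= c_quad * n**2."""
    n = 2
    while c_log * math.log2(n) > c_quad * n * n:
        n += 1
    return n

# A factor-1000 handicap still lets the logarithmic bound win early.
print(crossover(1000.0, 1.0))  # prints 80
```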

Revised Comparison Including Bitonic Sorting

A natural candidate for further investigations was Batcher's bitonic sorting, described in Section …. Its relative simplicity, together with an O(log² n) time complexity, may explain claims such as "nobody beats bitonic sorting" [San].

In addition to implementing bitonic sorting, all the compared algorithm implementations were revised and some improvements were done. The modelling of the time consumption in the various implementations was also revised in order to give a fair comparison. The final results, shown in Figure …,



Figure: Comparison of Cole's algorithm with O(n) and O(n²) time algorithms [Nat] (execution time plotted against problem size). Legend: Cole's algorithm, O(log n); odd-even transposition sort, O(n); insertion sort worst case, O(n²); and insertion sort best case, O(n).

Table: Performance data for the CREW PRAM implementations of the studied sorting algorithms. Problem size n is …. Time and cost are given in kilo CREW PRAM time units and kilo CREW PRAM unit-time instructions, respectively. The numbers of read/write operations to/from the global memory are given in kilo locations.

Algorithm | time | processors | cost | reads | writes
Cole | | | | |
Bitonic | | | | |
Odd-Even | | | | |
Insert-worst | | | | |
Insert-average | | | | |
Insert-best | | | | |

Table: Same table as above, but with problem size n = ….

Algorithm | time | processors | cost | reads | writes
Cole | | | | |
Bitonic | | | | |
Odd-Even | | | | |
Insert-worst | | | | |
Insert-average | | | | |
Insert-best | | | | |

were presented at the SUPERCOMPUTING conference [Natb]. We see that bitonic sorting is the fastest in this comparison over a large range of n, starting at about ….

Tables … and … show some central performance data for small test runs (n = … and n = …). The two rightmost columns show the total number of read operations from the global memory and the total number of write operations to the global memory.

The implementation of ColeMergeSort counts about … PIL lines and was developed and tested in about … days of work (see Appendix B). The corresponding numbers for BitonicSort are about … PIL lines and roughly … days of work. The listing is given in Appendix C.



Figure: Time consumption in number of CREW PRAM time units, measured by running the parallel sorting algorithms on the CREW PRAM simulator for various problem sizes n (horizontal axis). Note the logarithmic scale on both axes. Legend: Cole's algorithm, O(log n); bitonic sorting, O(log² n); odd-even transposition sort, O(n); insertion sort worst case, O(n²); and insertion sort best case, O(n).





















Figure: Comparison of time consumption in the CREW PRAM implementations of Cole's parallel merge sort algorithm and Batcher's bitonic sorting algorithm (execution time plotted against log n; the two curves are marked with distinct symbols).

Bitonic Sorting Is Faster In Practice

As shown in Sections … and …, exact analytical models have been developed for the time consumption in the implementations of Cole's algorithm and bitonic sorting. The models have been checked against the test runs, and have been used to find the point where the CREW PRAM implementation of Cole's O(log n) time algorithm becomes faster than the CREW PRAM implementation of the O(log² n) time bitonic algorithm. See Figure ….

The results are summarised in Table …. The table shows time and processor requirements for the two algorithms for n = …k and n = …k, for the last value of n making bitonic sorting faster than Cole's algorithm, and for the first value of n making Cole's algorithm the faster one. The exact numbers of time units consumed by the algorithms for the various problem sizes, shown in the table, are not especially important. What is important is that the method of algorithm investigation described

Table: Calculated performance data for the two CREW PRAM implementations.

Algorithm | n | time | processors
ColeMergeSort | …k | |
BitonicSort | …k | |
ColeMergeSort | …k | |
BitonicSort | …k | |
ColeMergeSort | … | |
BitonicSort | … | |
ColeMergeSort | … | |
BitonicSort | … | |

in this thesis makes it possible to do such exact calculations.

We see that our straightforward implementation of Batcher's bitonic sorting is faster than the implementation of Cole's parallel merge sort as long as the number of items to be sorted, n, is less than …. In other words, for Cole's algorithm to become faster than bitonic sorting, the sorting problem instance must contain close to … Giga Tera items; this is about as large as the number of milliseconds since the Big Bang [Har]. This comparison strongly indicates that Batcher's well known and simple O(log² n) time bitonic sorting is faster than Cole's O(log n) time algorithm for all practical values of n. The huge value of n also gives room for a lot of improvements to the implementation of Cole's algorithm before it beats bitonic sorting for practical problem sizes. There are also good possibilities to improve the implementation of bitonic sorting.
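The shape of this crossover follows directly from hedged running-time models of the two implementations: with T_Cole(n) = c1 · log2 n and T_Bitonic(n) = c2 · (log2 n)², bitonic sorting is faster exactly while log2 n < c1/c2, i.e. while n < 2^(c1/c2). A sketch with illustrative constants (not the derived ones):

```python
def crossover_exponent(c_cole, c_bitonic):
    """log2 of the problem size at which a c_cole*log2(n) time model
    catches up with a c_bitonic*(log2(n))**2 model.  Both constants
    are illustrative placeholders."""
    return c_cole / c_bitonic

# A constant ratio of 70 already pushes the crossover to n = 2**70,
# on the order of 10**21 items.
print(2 ** crossover_exponent(70.0, 1.0))
```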

In fact, Cole's algorithm is even less practical than depicted by the comparison of execution time: it requires about … times as many processors as bitonic sorting, and it makes far more extensive use of the global memory. This is outlined in the following subsection.

(The problem size making Cole's algorithm faster than bitonic sorting requires about … processors. If we assume that each processor occupies … cubic meters, this corresponds roughly to filling the whole volume of the Earth with processors.)

Comparison of Cost and Memory Usage

For both algorithms we have derived exact expressions for the execution time and processor requirement as functions of n. Consequently, the cost of running each of the two implementations on a problem of size n is known:

Cost(ColeMergeSort, n) = T(ColeMergeSort, n) × NoOfProcs(n)

where T(ColeMergeSort, n) is given in Equation … and NoOfProcs(n) is given by Equations … to ….

For bitonic sorting we have

Cost(BitonicSort, n) = T(BitonicSort, n) × n

where T(BitonicSort, n) is given in Equation ….

Using Equation … and Equation …, it is found that bitonic sorting has a lower cost as long as n is less than …, another extremely large number.

This comparison further strengthens the conclusion made above: Cole's algorithm is not a good choice for practical problem sizes.
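The cost comparison has the same structure as the time comparison: cost is execution time multiplied by the number of processors. A sketch, again with illustrative stand-in models rather than the derived expressions:

```python
import math

def cost(time_of, procs_of, n):
    """Cost = execution time * number of processors
    (in unit-time instructions)."""
    return time_of(n) * procs_of(n)

# Illustrative stand-ins only: logarithmic time with ~3n processors
# versus log-squared time with n processors.
def cole_cost(n):
    return cost(lambda m: 70.0 * math.log2(m), lambda m: 3 * m, n)

def bitonic_cost(n):
    return cost(lambda m: math.log2(m) ** 2, lambda m: m, n)

n = 1 << 20
print(cole_cost(n) > bitonic_cost(n))  # larger constants dominate here
```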

Global memory access pattern

The CREW PRAM simulator offers the possibility of logging all references to the global memory. Since the global memory is the most unrealistic part of the CREW PRAM model, it may be interesting to monitor how much an algorithm utilises the global memory. An algorithm with an intensive use of the memory may be more difficult to implement on more realistic models without a global memory. Figures … and … illustrate the memory usage of the CREW PRAM implementations of bitonic sorting and Cole's algorithm, respectively. Again, it is clearly demonstrated that bitonic sorting is the best alternative for practical use.

In the figures, plots along a horizontal line depict a memory location accessed by several processors. Plots along a vertical line depict the various memory locations accessed by a single processor.
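The logging facility can be mimicked by a toy global memory that records every access as a (processor, address, operation) triple; this is a sketch of the idea, not the simulator's actual interface:

```python
from collections import defaultdict

class GlobalMemory:
    """Toy CREW-style global memory that logs each access."""

    def __init__(self):
        self.cells = defaultdict(int)
        self.log = []            # (processor, address, "R" or "W")

    def read(self, proc, addr):
        self.log.append((proc, addr, "R"))
        return self.cells[addr]

    def write(self, proc, addr, value):
        self.log.append((proc, addr, "W"))
        self.cells[addr] = value

mem = GlobalMemory()
mem.write(0, 5, 42)              # processor 0 writes location 5
print(mem.read(1, 5))            # processor 1 reads it back: 42
reads = sum(1 for _, _, op in mem.log if op == "R")
print(reads)                     # one read operation logged
```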



(Imagine filling the whole universe with neutrons such that there is no remaining empty space; the required number of neutrons is estimated to be about … [Sag].)



Figure: Access to the global memory (global memory address plotted against processor number) during execution of BitonicSort for sorting … numbers. Reads and writes are marked with distinct symbols.



Figure: Access to the global memory (global memory address plotted against processor number) during execution of ColeMergeSort for sorting … numbers. Reads and writes are marked with distinct symbols.

Chapter

Concluding Remarks and Further Work

"The lack of symmetry is that the power of a computational model is robust in time, while the notion of technological feasibility is not. This latter notion is fluid, since technology keeps advancing and offering new opportunities. Therefore, it makes sense to fix the model of computation (after carefully selecting one) and study it, and at the same time look for efficient implementations."
Uzi Vishkin in PRAM Algorithms: Teach and Preach [Vis]

It has been a very interesting job to do the research reported in this thesis. I have found theoretical computer science to be a rich source of fundamental and fascinating results about parallel computing. During the work, new and interesting concepts and issues have continuously popped up.

Experience and Contributions

The Gap Between Theory and Practice

The main goal for the work has been to learn about how to evaluate the practical value of parallel algorithms developed within theoretical computer science.

Asymptotically large problem sizes do not occur in practice. By evaluating the performance of implemented algorithms for finite problem sizes, the effect of the size of the complexity constants becomes visible. My investigation of Cole's algorithm has shown that Batcher's bitonic sorting algorithm is faster than Cole's parallel merge sort algorithm for all practical values of n. This is in contrast to the asymptotical behaviour, which says that Cole's algorithm is a factor of O(log n) faster.

In my view, there is probably a relation between high descriptional complexity and large complexity constants. However, the computational complexity is not a good indicator for the size of the complexity constants. Recall that Cole's algorithm is considered to be a simple algorithm within the theory community. The majority of the parallel algorithms recently proposed within theoretical computer science have a larger descriptional complexity than Cole's algorithm. Consequently, my investigation of Cole's algorithm has confirmed the belief that promising theoretical algorithms can be of limited practical importance.

This work should not be perceived as criticism of theoretical computer science. In the Introduction, I argued that parallel algorithms from the theory world, and topics such as complexity theory, will become increasingly important in the future. However, my work has indicated that practitioners should take results from theoretical computer science with a grain of salt. Careful studies should be performed to find out what is hidden by high-level descriptions and asymptotical analysis.

Evaluating Parallel Algorithms

A new method for evaluating PRAM algorithms

The proposed method for evaluating PRAM algorithms is based on implementing the algorithms for execution on a CREW PRAM simulator. I have not found the use of this or similar methods described in the literature. The main advantage of the method is that algorithms may be compared for finite problems. As shown by the investigation of Cole's algorithm, such implementations make visible information about performance that is hidden behind the Big-Oh curtain.

Implementation takes time

I originally planned to investigate several parallel algorithms from theoretical computer science. It was far from clear how much work would be needed to do the implementations. Only Cole's algorithm has been implemented so far. Marianne Hagaseth is currently working on an implementation of the well known theoretical sorting algorithm presented by Shiloach and Vishkin in [SV] [Hag].

For several reasons, implementing theoretical algorithms takes a lot of time. First of all, the algorithms are in general rather complex. They are described at a relatively high level, and some "uninteresting" details may have been omitted. However, to be able to implement, you have to know absolutely all details. This makes it necessary to do a thorough and complete study of the algorithm until all parts are clearly understood. When the algorithm has been clearly described and understood, several implementation decisions remain.

The first implementation of Cole's algorithm

During the detailed study of Cole's algorithm, some missing details were discovered. This made it necessary to do a few completions of the original description (Section …) [Nata]. In the development of the CREW PRAM implementation of Cole's algorithm, the most difficult task was to program the dynamic processor allocation in the form of a "sliding pyramid" of processors (Section …).

To my knowledge, my implementation of Cole's algorithm is the only implemented sorting algorithm using O(log n) time with asymptotically optimal cost [Col, Rud, Zag]. This was reported in my paper Logarithmic Time Cost Optimal Parallel Sorting is Not Yet Fast in Practice, presented at the SUPERCOMPUTING conference in November [Natb].

A lot may be learned from medium-sized test runs

Massively parallel algorithms with polylogarithmic running time are often complex. One might think that evaluation of such algorithms would require processing of very large problem instances. So far, this has not been the case. In studying the relatively complex Cole's algorithm, some hundreds of processors and small-sized memories have been sufficient to enlighten the main aspects of the algorithm. In many cases, the need for brute force, i.e. huge test runs, may be reduced by the following working rules:

1. The size of the problem instance is used as a parameter to the algorithm, which is made to solve the problem for all possible problem sizes.

2. Elaborate testing is performed on all problem sizes that are within the limitations of the simulator.

3. A detailed analysis of the algorithm is performed. The possibility of making such an analysis with a reasonable effort depends strongly on the fact that the algorithm is deterministic and synchronous.

4. The analysis is confirmed with measurements from the test cases.

(Marco Zagha has recently implemented a simplified version of the Reif-Valiant flashsort algorithm [RV] on the Connection Machine; however, he has expressed that the implementation is far from O(log n) time [Zag].)

Together, this will often make it possible to use the analytical performance model, by extrapolation, for problem sizes beyond the limitations of the simulator.
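The working rules can be sketched end to end: confirm a derived analytical model against the sizes the simulator can handle, then extrapolate. The measurements and constants below are synthetic, for illustration only.

```python
import math

def model(n, a=100.0, b=40.0):
    """Analytical performance model T(n) = a + b * log2(n);
    a and b are synthetic placeholders for derived constants."""
    return a + b * math.log2(n)

# "Measurements" from small runs (synthetic: generated from the model).
measured = {2 ** k: model(2 ** k) for k in range(4, 10)}

# Rule: confirm the analysis on every size within the simulator's limits...
assert all(abs(model(n) - t) < 1e-9 for n, t in measured.items())

# ...then extrapolate far beyond them.
print(model(2 ** 40))  # prints 1700.0
```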

The CREW PRAM Simulator

About the prototype

Several critical design choices were made to reduce the work needed to provide high-level parallel programming and the CREW PRAM simulator. The PIL language was defined to be a simple extension of SIMULA [BDMN, Poo], and it was decided to implement the simulator by using the DEMOS simulation package [Bir]. A program in PIL is translated to pure SIMULA, which is included in the description of a processor object in the DEMOS model of the simulator. Before execution, the simulator program, including the translated PIL program, is compiled by the SIMULA compiler (see Section A…).

This is not an efficient solution, but it has several advantages. The translation from PIL to SIMULA is easy, and all the nice features of SIMULA are automatically available in PIL. Perhaps most important, executing the PIL programs as SIMULA programs together with the simulator code makes it possible to use the standard SIMULA source code level debugger [HT] on the parallel programs and their interaction with the simulator.

Simplicity was given higher priority than efficient execution of the PIL programs. In retrospect, this was the right choice. The bottleneck in investigating Cole's algorithm was the program development time, not the time used to run the program on the simulator (see the previous section about medium-sized test runs). In addition, the use of the simulator prototype has given us a lot of ideas about how a more efficient and elegant simulator system should be designed.

Synchronous MIMD programming

The simplicity of the CREW PRAM model, combined with the very nice properties of synchronous computations, makes the development, analysis and debugging of parallel algorithms a relatively easy task.

Synchronous MIMD programming implies redundant operations, such as compile time padding. However, I believe it is more important to make parallel programming easy than to utilise every clock period of every processor. Indirectly, easy programming may also imply more efficient programs, because the programmer may get a better overview of all aspects of a complex software system, including performance.

Synchronous MIMD programming, which has been claimed to be easy, should be considered a byproduct of this work. However, it may be one of the most interesting parts for future research. In this context, it was very encouraging to hear the keynote address at the SUPERCOMPUTING conference held by Danny Hillis [Hil]. Hillis argued that the computer industry producing massively parallel computers is searching for a programming paradigm that combines the advantages of SIMD and MIMD programming. Similar thoughts have been expressed by Guy Steele [Ste].

According to Hillis, it is at present unknown what such a MIMD/SIMD combination would look like. The synchronous MIMD programming style used in the work reported in this thesis is such a combination. To conclude: synchronous MIMD programming is a very promising paradigm for parallel programming that should be further investigated.

Use in education

The CREW PRAM simulator system has been used in the courses Highly Concurrent Algorithms and Parallel Algorithms, which are given by the Division of Computer Systems and Telematics (IDT) at the Norwegian Institute of Technology (NTH). It has been used for experimenting with various parallel algorithms, as well as for modelling of systolic arrays and neural networks.

(The main advantage of the SIMD paradigm is that the synchronous operation of the processors implies easier programming. MIMD has the advantage of more flexibility and better handling of conditionals.)

Further Work

"How far can hard- and software developments proceed independently? When should they be combined? Parallelism seems to bring these matters to the surface with particular urgency."
Karen Frenkel in [Freb]

"What is the right division of labour between programmer, compiler and online resource manager? What is the right interface between the system components? One should not assume that the division that proved useful on multiprogrammed uniprocessors is the right one for parallel processing."
Marc Snir in Parallel Computation Models: Some Useful Questions [Sni]

It is my hope to continue working with many of the topics discussed in this thesis. Below, I present a short list of natural continuations of the work:

- Continuing the study of parallel algorithms from theoretical computer science, to learn more about their practical value. A very large number of candidate problems and algorithms exist.

- Extending the study of parallel sorting algorithms. Two important alternatives are the Reif-Valiant flashsort algorithm [RV] and the adaptive bitonic sorting algorithm described by Bilardi and Nicolau [Bil]. The flashsort algorithm is a so-called probabilistic algorithm which sorts in O(log n) time with high probability. It has a rather high descriptional complexity, but is expected to be very fast [Col].

- Improving the simulator prototype and the PIL language. Studying what kind of tools and language features would make synchronous parallel programming easier.

- Extending the simulator to the EREW and CRCW variants of the PRAM model. Implementing the possibility of charging a higher cost for accesses to the global memory.

- Defining a more complete language for synchronous MIMD programming, and implementing a proper compiler with compile time padding.

- Leslie Valiant's recent proposal of the BSP model as a bridging model for parallel computation was presented in Section …. The use of this model raises many questions. Is a PRAM language ideal? How should PRAM programs be mapped to the BSP model? How much of the mapping should be done at compile time, and what should be left to runtime? How is the BSP model best implemented? Much interesting research remains before we know the (conditional) answers to these questions.

Appendix A

The CREW PRAM Simulator Prototype

"While a PRAM language would be ideal, other styles may be appropriate also."
Leslie Valiant in A Bridging Model for Parallel Computation [Val]

The main motivation for developing the CREW PRAM simulator was to provide a vehicle for evaluation of synchronous CREW PRAM algorithms. It was made for making it possible to implement algorithms in a high level language, with convenient methods for expressing parallelism and for interacting with the simulator to obtain measurement data and statistics.

The most important features for algorithm implementation and evaluation, and the realisation of the simulator prototype, are briefly outlined in Section A…. A much more detailed description may be found in the User's Guide in Section A…. The appendix ends with a description of how systolic arrays and similar synchronous computing structures may be modelled by using the simulator.

A… Main Features and System Overview

A… Program Development

Wyllie [Wyl] proposed a high level pseudo code notation called "parallel pidgin algol" for expressing algorithms for the PRAM model. Inspired by this work, a notation called PIL has been defined. The PIL notation (Parallel Intermediate Language) is based on SIMULA [BDMN, Poo] with a small set of extensions, which is necessary for making it into a language well suited for expressing synchronous MIMD parallel programs. Nearly all features found in SIMULA may be used in a PIL program.

(SIMULA is a high-level object-oriented programming language developed in Norway. It was made available in … but is still modern.)

The main features for program development are:

- Processor allocation. A PIL program may contain statements for allocating a given number of processors and binding these to a processor set. A processor set is a variable used to hold these processors, and may be used in processor activations.

- Processor activation. A processor activation specifies that all processors in a given processor set are to execute a specific part of the PIL program.

- Using global memory. The simulator contains procedures for allocating and accessing global memory. The procedures ensure true CREW PRAM parallelism and detection of write conflicts.

- Program tracing. A set of procedures for producing program tracing in a compact manner is provided.

- SYNC and CHECK. These are two built-in procedures which have shown to be very helpful during the top-down implementation of complicated algorithms. SYNC causes the calling processors to synchronise, so that they all leave the statement simultaneously. SYNC has been implemented as a binary tree computation using only O(log n) time (see Section …). CHECK checks whether the processors are synchronous at the place of the call in the program, i.e. that the processors are doing the same call to CHECK simultaneously.

- File inclusion, conditional compilation and macros. This is implemented by preprocessing the PIL program with the standard C preprocessor, cpp.
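The O(log n) behaviour of SYNC can be illustrated by counting the rounds of a binary combining tree, in which pairs of still-active processors merge each round. This is a sketch of the idea only; the real SYNC runs as a CREW PRAM computation inside the simulator.

```python
def tree_barrier_rounds(p):
    """Rounds a binary-tree barrier needs to synchronise p processors:
    each round halves (rounding up) the number of active subtrees."""
    rounds, active = 0, p
    while active > 1:
        active = (active + 1) // 2
        rounds += 1
    return rounds

print(tree_barrier_rounds(8), tree_barrier_rounds(1000))  # prints 3 10
```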

Perhaps the greatest advantage of defining the PIL notation as an extension of SIMULA is that it implies the availability of the SIMULA debugger simdeb [HT]. This is a very powerful and flexible debugger that operates on the source code level. It is easy to use for debugging of parallel PIL programs. As an example, consider the following command:

at …: if p = … AND Time > … THEN stop;

This command specifies that control should be given to the debugger when source line number … is executed by processor number …, if the simulated time has exceeded ….

A Measuring

The simulator provides the following features for pro ducing data necessary for doing

algorithm evaluation

- Global clock. The synchronous operation of the CREW PRAM implies that one global time concept is valid for all the processors. The global clock may be read and reset. The default output from the simulator gives information about the exact time of all events which produce any form of output.

- Specifying time consumption. In the current version of the simulator, the time used on local computations inside each processor must be explicitly given in the PIL program by the user. This makes it possible for the user to choose an appropriate level of granularity for the time modelling. It also implies the advantage that the programmer is free to assume any particular instruction set or other way of representing time consumption. However, future versions of the CREW PRAM simulator system should ideally contain the possibility of automatic modelling of the time consumption, performed by the compiler.

- Processor and global memory requirement. The number of processors used in the algorithm is explicitly given in the PIL program and may therefore easily be reported together with other performance data. The amount of global memory used may be obtained by reading the user stack pointer.

- Global memory access statistics. The simulator counts and reports the number of reads and the number of writes to the global memory. Later versions may contain more advanced statistics.

- DEMOS data collection facilities. The general and flexible data collection facilities provided in DEMOS are available: COUNT may be used to record events, TALLY may be used to record time independent variables (with various statistics, such as the mean and the estimated standard deviation, maintained over the samples), and HISTOGRAM provides a convenient way of displaying measurement data.
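A minimal sketch of these measuring features, written in Python for illustration: the method names mirror the PIL built-ins (ClockRead, ClockReset, Use, and the read/write counters), and the one-time-unit cost per READ and WRITE matches what is stated later in this appendix, but the internals are assumptions, not the simulator's code.

```python
class MeasuredPram:
    """Toy model of the simulator's measuring features: a global clock
    that can be read and reset, plus counters for global-memory reads
    and writes (names mirror the PIL built-ins; internals are a sketch)."""
    def __init__(self):
        self.time = 0
        self.n_reads = 0
        self.n_writes = 0
        self.memory = {}

    def use(self, time_units):   # explicit local-computation cost
        self.time += time_units

    def read(self, addr):        # one time unit per READ
        self.n_reads += 1
        self.time += 1
        return self.memory.get(addr, 0)

    def write(self, value, addr):  # one time unit per WRITE
        self.n_writes += 1
        self.time += 1
        self.memory[addr] = value

    def clock_reset(self):
        self.time = 0

pram = MeasuredPram()
pram.write(42, addr=0)
pram.use(5)
assert pram.read(0) == 42
print(pram.time, pram.n_reads, pram.n_writes)  # -> 7 1 1
```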

Figure A shows the general format of a main (PIL) program for testing an algorithm on a series of problem instances. When generating problem instances, the random drawing facilities in DEMOS, which sample from different statistical distributions, may be very useful.

Functions such as processor allocation and processor resynchronisation are realised in the simulator by simulating real CREW PRAM algorithms for these tasks, to obtain realistic modelling of the time consumption.

A A Brief System Overview

When developing the CREW PRAM simulator prototype it was a major goal to make a small but useful and easily extensible system. This had to be done within a few man-months, so it would be impossible to do everything from scratch.



(Footnote: DEMOS is a nice and powerful package for discrete event simulation, written in SIMULA [Bir].)

    PROCESSOR_SET ProcessorSet; INTEGER n; ADDRESS addr;
    ...
    FOR n := 2 TO 512 DO
    BEGIN
        addr := GenerateProblemInstance(n);
        ... the problem instance is now placed in global memory
        ClockReset;
        ... Start of evaluated algorithm
        ... initial sequential part of algorithm
        ASSIGN n PROCESSORS TO ProcessorSet;
        FOR_EACH PROCESSOR IN ProcessorSet;
            ... PIL statements executed in parallel by all the
            ... processors in ProcessorSet
        END_FOR_EACH PROCESSOR;
        ... End of evaluated algorithm
        TimeUsed := ClockRead;
        ... sequential code for checking and/or reporting the results
        ... obtained by running the algorithm
        ... sequential code for gathering and reporting performance data
        ... from the last test run
    END_FOR (loop doing several test runs);
    ... sequential code for reporting statistics about all the test runs

Figure A: General format of a PIL program for running a parallel algorithm on several problem instances. In this case the algorithm is assumed to use one set of n processors, where n is the size of the problem instance.
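The test-run loop of the figure can be sketched in ordinary Python as follows. The instance generator, the size range and the stand-in "algorithm" are hypothetical placeholders; the structure (generate instance, reset clock, run, read clock, collect results) is what mirrors the PIL skeleton above.

```python
import random

def evaluate(algorithm, sizes, seed=0):
    """Run `algorithm` on a series of random problem instances and collect
    (n, time) pairs, mirroring the measurement loop of the figure above.
    `algorithm` must return the simulated (or measured) time used."""
    random.seed(seed)
    results = []
    for n in sizes:
        instance = [random.randint(0, 2**16) for _ in range(n)]  # generate
        time_used = algorithm(instance)                          # run + clock
        results.append((n, time_used))
    return results

# A stand-in "algorithm" whose reported time is just the input size:
print(evaluate(lambda inst: len(inst), sizes=[2, 4, 8]))
# -> [(2, 2), (4, 4), (8, 8)]
```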













Figure A: CREW PRAM Simulator implementation overview. (The diagram shows the layers of the implementation: the PIL program is translated by the PIL compiler into procsim, the part of the CREW PRAM simulator describing each processor; the simulator is built on the DEMOS package, which is written in SIMULA, running under UNIX on SUN hardware.)

The main implementation principle used in the simulator is outlined in Figure A. The CREW PRAM simulator is written in SIMULA [BDMN, Poo, DH] using the excellent DEMOS discrete event simulation package [Bir]. The PIL program is compiled to SIMULA/DEMOS and included in the file procsim, which is part of the simulator source code. procsim describes the behaviour of each CREW PRAM processor. Thus, during algorithm simulation, each processor object (in the SIMULA sense) executes its own copy of the PIL program.

A great advantage implied by this implementation strategy is that all the powerful features of the SIMULA programming language and the DEMOS package are available in the PIL program. Further, the source level SIMULA debugger simdeb [HT] may be used for studying and interacting with the parallel PIL program under execution.

This is certainly a resource demanding implementation. However, experience has shown, as described in Section , that the prototype is able to simulate problem instances that are large enough to unveil the most interesting aspects of the evaluated algorithm.

During the development of the CREW PRAM implementation of Cole's algorithm, all except one of the errors were found by testing the program on problem instances consisting of only a few numbers. The last error was found by testing the implementation on a larger instance; however, the same error was also visible when sorting only a few numbers.

Executing Cole's algorithm on the larger instances requires a correspondingly large number of CREW PRAM processors, and the simulations take from CPU seconds up to several CPU hours on a SUN workstation. Bitonic sorting is simpler and requires fewer processors, which makes the simulations faster: the largest instances, corresponding to a large number of CREW PRAM processors, are sorted in a few CPU hours.

Implementing the simulator in SIMULA/DEMOS implies an open-ended, easily extensible system. The CREW PRAM simulator may easily be modified into an EREW PRAM simulator.

A Users Guide

This section describes how to use the CREW PRAM simulator as it is currently installed at the Division of Computer Systems and Telematics (IDT), The Norwegian Institute of Technology (NTH).

A Introduction

The CREW PRAM simulator includes a simple programming environment that makes it convenient to develop and experiment with synchronous MIMD parallel algorithms. Parallel algorithms are in general more difficult to develop, understand and test than sequential algorithms. However, the property of synchronous operation, combined with features in the simulator, removes a lot of these difficulties. The parallel algorithms may be written in a high-level SIMULA-like language, and a powerful symbolic debugger may be used on the parallel programs.

(Footnote: This is a simple transformation, done by a pipeline of filters written in gawk (GNU awk).)

A Parallel Algorithms, CREW PRAM, PPP and PIL

When studying parallel algorithms it is desirable to use a machine model that is simple and general, so that one may concentrate on the algorithms and not on the details of a specific machine architecture. The CREW PRAM satisfies this requirement.

Further, when expressing algorithms it has become common practice to use so called pseudo code. In Section  a notation called PPP (Parallel Pseudo Pascal) was presented.

Just as other kinds of pseudo code, PPP may not (yet) be compiled into executable code. A notation at a lower level is therefore needed. The motivation behind the work that has resulted in the CREW PRAM simulator is the study of a special kind of parallel algorithms, not the writing of a compiler. Therefore, the programming notation which has been adopted is a compromise between the following two requirements: (1) It should be as similar to PPP as possible. (2) It must be easy to transform into a format that may be executed with the simulator. The notation is called PIL, which is short for Parallel Intermediate Language. It is described in Section A.

A System Requirements and Installation

The CREW PRAM simulator may be used on most of IDT's SUN computers running the UNIX operating system. Note that the run command may very well be executed locally on a SUN workstation; this reduces your chance of becoming unpopular among the other users.

We assume that you are using the UNIX command interpreter called csh. First you must add the following line at the end of the .cshrc file in your login directory:

    set path = ( cmd $path )

Then make sure that your current directory is your login directory, and type

    source .cshrc

to make the change active.

Use cd to move to a directory where you want to install this version of the CREW PRAM simulator. Type the following two commands, which will install the software in a new directory called ver:

(Footnote: UNIX is a trademark of AT&T.)

Figure A: Part of a simple demo program (left) and part of the produced output (right). (The program allocates a processor set ProcSetA with ASSIGN ... PROCESSORS TO ProcSetA and, inside FOR_EACH PROCESSOR IN ProcSetA, each processor calls Use and prints "Hello World" with T_TO; the output shows the "Hello World" lines together with the simulated time.)

mkdir ver

sigynhomelassedringreleasesversioncmdinstall ver

A Getting Started

Without going into details, this section shows how a very simple example program may be compiled and executed with the simulator.

The current installation of the CREW PRAM simulator has a directory called pil, which contains a collection of PIL programs that illustrate various aspects of the CREW PRAM simulator and parallel algorithms. Consider the very simple parallel algorithm in the file demo.pil. The main part of this program and its output is shown in Figure A.

Be sure that your current directory is src. To compile the PIL program stored in the file called demo.pil under the pil directory, give the command

    pilc demo

After compiling you must use the link command: just type link. Now you are ready to execute the program: type run.

A Carrying On

In this section we describe most of the features in the CREW PRAM simulator, the related tools and the PIL language.

As an example we will use a simple PIL program. Odd-Even Transposition Sort is one of the simplest parallel algorithms for sorting. Descriptions of the algorithm may be found in [Akl] or [Qui]. A PIL program that implements this algorithm may be found in the file oes.pil under the pil directory.

Make a copy of oes.pil to another filename so that you do not destroy the original version. In the following we assume that your copy was named oddeven.pil.
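For reference, a sequential Python simulation of Odd-Even Transposition Sort is sketched below. On a CREW PRAM each compare-exchange within a round would be performed by a separate processor in parallel, so the algorithm runs in O(n) rounds; here the rounds are simply executed one after the other.

```python
def odd_even_transposition_sort(a):
    """Odd-even transposition sort: n rounds, alternately comparing the
    pairs (0,1),(2,3),... and (1,2),(3,4),...  On a PRAM all the
    compare-exchanges of one round happen in parallel; this sequential
    sketch performs them one by one within each round."""
    a = list(a)
    n = len(a)
    for rnd in range(n):
        start = rnd % 2              # even rounds: pairs (0,1),(2,3),...
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([5, 1, 4, 2, 3]))  # -> [1, 2, 3, 4, 5]
```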

A The PIL Language: Background

The PIL (Parallel Intermediate Language) notation is based on SIMULA, with a small set of extensions which are necessary for making it into a language well suited for expressing synchronous MIMD parallel programs.

Consequently, nearly all features found in SIMULA may be used in a PIL program. The main exceptions are mentioned in this appendix. The CREW PRAM simulator is implemented with good help of the simulation package DEMOS. DEMOS offers a lot of powerful features, which all are made available through the PIL language. In the context of analysing and measuring parallel algorithms, a large set of procedures for gathering and reporting statistics may be very helpful.

A recent book on SIMULA is An Introduction to Programming in SIMULA by R. J. Pooley [Poo]; an older one is SIMULA BEGIN, written by Graham Birtwistle and the three Norwegians Ole-Johan Dahl, Bjørn Myhrhaug and Kristen Nygaard, who invented SIMULA [BDMN].

DEMOS is written in SIMULA and is described in the book DEMOS - A System for Discrete Event Modelling on Simula, written by Graham Birtwistle [Bir]. This book also gives a small introduction to SIMULA, sufficient for making the reader able to utilise all the powerful features in DEMOS.

The current implementation of the CREW PRAM simulator is based on the SIMULA system offered by Lund Software House AB, Sweden. Specific details of this version of the SIMULA language may be found in [DH, HT].

A Including Files, Macros and Conditional Compiling

Three useful features are made available by letting the PIL compiler run the standard UNIX C preprocessor on the input file. They are explained in the following. Further documentation of these features may be found in [KR], or just type man cpp on any UNIX terminal. Note that it is not possible to pass parameters to the cpp command in this version of pilc.

#include

The reader who is familiar with the C programming language will recognise the syntax used for including files. oddeven.pil includes three files: sorting, oes.var and oes.proc. It is good practice to split your algorithm into several logical units. When the PIL compiler pilc encounters the line

    #include "sorting"

it will look for a file named sorting in the pil directory. If it is found, the contents are copied into the program at the place of the include statement.

#define

oddeven.pil contains one macro, named SHOWodd, which is defined by the #define command. This macro is simply used to make a shorter name for the more precise procedure call to LetShowArray.

Macros may be used to make programs more readable; however, complicated macros, and especially macros with parameters, should be used with care. Macros are in general dangerous.

A few names (ADDRESS, CONSTANT and INFTY) have been defined as macros in a file called pil.h, which is included by the PIL compiler. These names are nothing more than syntactic sugar.

#ifdef, #else, #endif etc.

It is often desirable to operate with several versions of a program with only slight changes between two or more of the versions. Two versions may be stored as two very similar files. However, it is normally much better to let the difference be visible inside one copy of the program. Conditional compiling makes this possible.

Study the lines beginning with # around SHOWodd and SHOWeven in our sample file. Insert the comment character and one space in front of the line #define VERBOSE, recompile, link and run, and see what happens. (SIMULA lines beginning with the comment character followed by a space are interpreted as comments by the SIMULA compiler.)

Excessive use of this feature may make the programs unreadable.

A Program Structure, Procedures and Variable Declarations

When you start the CREW PRAM simulator, one CREW PRAM processor will automatically start execution at the top of the CREW PRAM program. This processor is sometimes called the master processor, even though it is equal to all the other CREW PRAM processors. If you do not allocate and activate processors (discussed in the following two subsections), the program will be executed as a sequential program by this single processor.

Procedures

As for sequential programs, it is important to structure your program by use of procedures. Procedures may be nested. Top-down development and a hierarchy of procedures is warmly recommended, since PIL programs are not easier to develop than traditional sequential programs.

Procedures in PIL have the same format as in SIMULA, with the following exception: In the current version, all procedures must begin with a BEGIN statement and end with a statement of the form

    END_PROCEDURE ProcedureName;

(Footnote: This procedure is defined in the file oes.proc. More on procedures in Section A.)

(Footnote: In standard SIMULA, procedures may consist of a single statement not enclosed by the SIMULA keywords BEGIN and END. This is not allowed in PIL.)

ProcedureName may be any text, but it is recommended to use the name of the procedure.

Variables: make them as local as possible

Declaration of processor local variables follows the SIMULA syntax. All variables which are declared at the outermost level in the PIL program will be local to each processor. To illustrate this, consider a program that consists of two logically different parts, A and B. Part A is executed by processor set A, and part B is executed by the B processors. Part A only uses the variable varA, and B uses only varB. In the current version, all the processors in processor sets A and B will have their own copy of both varA and varB, even though the processors in B do not use varA, and vice versa.

This is not an ideal situation. If the programmer, in the code executed by the A processors, erroneously uses varB, no error message will occur, as one should expect from a proper compiler. The A processors will use their own local copy, which will contain the value zero, at least the first time it is erroneously used.

In large programs with several kinds of processors and many variables, this oddity makes the program less readable with respect to variable usage. In the current implementation it also makes the data part of each processor object unnecessarily large, which in turn slows down the simulation.

However, this problem may to a large extent be alleviated by making variables local where it is possible. Remember that variables in SIMULA, and therefore also in PIL, may be declared as local for an arbitrary block, in addition to local inside a procedure.

As an example, compare oes.pil with the file oesNew.pil shown in Figure  at page . In the latter program, both procedures and variables have been moved inside the parallel part (FOR_EACH PROCESSOR) of the program. This programming style is generally recommended.

Variables in the global memory

Global variables, i.e. variables placed in the CREW PRAM global memory, are rather restricted in the current version of PIL. Section A explains how integer locations in the global memory are allocated and accessed by using the PIL READ and WRITE statements. These statements require the specification of a global memory address. Variables used to hold such addresses should be declared as of type ADDRESS.

(Footnote: The reason for this odd feature is that the implementation of the simulator is strongly based on letting each processor, as a SIMULA object, execute its own copy of the whole PIL program.)

(Footnote: In a way it gives the same problems as traditionally encountered in large sequential programs with a lot of global variables; it is difficult to read from the code which variables are used, or may be used, in various parts of the code.)

A Processor Allocation

All parallel PIL programs must contain at least one statement for processor allocation. The format is:

    ASSIGN Number PROCESSORS TO ProcSet;

This statement allocates Number new processors in a processor group, or set, which is given the name ProcSet. The term processor set is used to denote such a collection of processors. The master processor mentioned above is a processor set containing only one processor, and it is automatically allocated when starting the simulator.

Number may be any expression resulting in an integer. Note, however, that in the current version this expression must be written as one text string without any whitespace characters. ProcSet is the name of a variable which must be declared as type PROCESSOR_SET. See the declaration of the variable ProcSet in the file oes.var, included from oddeven.pil.

A request for processor allocation can only be issued by the single (master) processor which executes the outermost level of the main program. However, the allocation is done by the allocated processors in cooperation, in a binary tree fashion. As a result, allocating n processors takes O(log n) time units. See Section .

Any number of processors may be allocated to a processor set, and any number of processor sets may be allocated and used in a program.
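The doubling behaviour behind the O(log n) allocation time can be sketched in Python. The model (in each step, every already-active processor activates one more, following the binary tree description above) is an illustration of the growth pattern only; the exact step costs in the simulator are not reproduced here.

```python
def allocate_steps(number):
    """Tree-fashion allocation modelled as a doubling process: in each
    step every already-active processor activates one more, so the active
    set doubles until `number` processors are reached.  A sketch of the
    growth pattern, not the simulator's actual allocation code."""
    active, steps = 1, 0
    while active < number:
        active = min(2 * active, number)
        steps += 1
    return steps

print(allocate_steps(8), allocate_steps(1000))  # -> 3 10
```

Allocating a thousand processors thus costs on the order of ten steps, not a thousand.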

A Processor Activation

When a set of processors has been allocated, they may be specified to execute code in parallel by use of so called processor activation. Processor activations must be placed at the outermost level of the main program, with the exception that they may be done inside (possibly nested) FOR and/or WHILE loops. Currently, processor activation may not be placed inside a general BEGIN END block. For further details, study the file LoopTest.pil in the pil directory. The format is:

    FOR_EACH PROCESSOR IN ProcSet;
        AnyCode
    END_FOR_EACH PROCESSOR;

AnyCode will be executed by all the processors in ProcSet in parallel, and it is important to remember that each of these processors will operate on its own set of variables. The processors in ProcSet are numbered, and each processor knows its own number through the local variable p. This variable should not be changed by the PIL program.

The single (master) processor, which is dedicated to executing the outermost level of the main program, will wait outside while AnyCode is executed.

(Footnote: Whitespace is common UNIX terminology for space (blank), tab and newline.)

AnyCode may contain variable and procedure declarations if it is written as a SIMULA block (BEGIN ... END). If it does not contain declarations, it may be written as a list of statements without the enclosing BEGIN ... END. The END_FOR_EACH PROCESSOR statement checks that the processors are synchronous when they leave AnyCode. If not, a warning (SYNCHRONY LOST) is given, see page .

Nesting of processor activations is not allowed. Currently, only one processor set may be active at the same time.
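The semantics of FOR_EACH PROCESSOR (the same code run once per processor, each on its own private set of variables, with p read-only) can be sketched in Python as follows. Numbering the processors from 1 is an assumption of the sketch.

```python
def for_each_processor(proc_set_size, any_code):
    """Model of FOR_EACH PROCESSOR: every processor p in the set runs the
    same code on its own private variables (a fresh dict per processor).
    The master only collects the locals afterwards; p itself is not to
    be changed by the code."""
    all_locals = []
    for p in range(1, proc_set_size + 1):   # numbering from 1 is an assumption
        local_vars = {"p": p}               # each processor's own variables
        any_code(local_vars)
        all_locals.append(local_vars)
    return all_locals

# Each processor stores twice its own number in a local variable:
result = for_each_processor(3, lambda v: v.update(twice=2 * v["p"]))
print([v["twice"] for v in result])  # -> [2, 4, 6]
```

Note that the sketch runs the processors one after the other; the real simulator interleaves them under one global clock.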

A Using the Global Memory

All processors in a CREW PRAM are connected to a global memory. This memory must be used for all kinds of communication between the processors. The global memory is simply a large array of integers. On the simulator, the global memory has fixed size and may be regarded as consisting of two parts: the user stack and the system stack. The system stack is used to store the processor set data structures. The user stack is used to store global variables allocated dynamically in the PIL program. It grows towards higher addresses.

The number of locations read/written from/to the global memory is measured by the simulator and reported at the end of a PIL program execution.

Allocating memory

Global memory must be allocated explicitly before it can be used, see the procedure MemUserMalloc in Section A. In our sample program, memory is allocated in the procedure GenerateSortingInstance in the file sorting.

Local memory should not be allocated explicitly in a PIL program. Just declare your variables, and the SIMULA system does the remaining work.

Accessing global memory

Any processor may read from any location in the global memory at any time. The format is:

    READ VarName FROM AddressExpression;

The contents of the location in the global memory given by AddressExpression is read and placed in the integer variable called VarName. AddressExpression may be an arbitrary integer expression. In the current version, AddressExpression and VarName must not contain any whitespace characters. The PIL keyword READ must be the first text symbol on the line, and the whole statement must be written on a single line (but the line may be longer than usual). In the current version, a READ statement takes one time unit, regardless of the complexity of AddressExpression.

Any processor may write any value to any location in the global memory at any time. But if two or more processors write to the same global memory location in the same time unit, a write conflict will occur. This is an illegal operation on a CREW PRAM, and the CREW PRAM simulator will give an appropriate error message and stop. The format for writing to the global memory is:

    WRITE Value TO AddressExpression;

Value may be any integer expression. Note that for every time unit, all writes to global memory occur after all reads from the memory (see the CREW PRAM property at page ). In the current version, a WRITE statement takes one time unit, regardless of the complexity of AddressExpression. The syntax of the WRITE statement is similar to that of the READ statement, as described above.

Just as standard SIMULA statements, these statements must not end with a semicolon if they are placed just before the SIMULA keyword ELSE.
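The two rules just stated, reads before writes within a time unit and at most one write per location per time unit, can be captured in a small Python sketch. This is an illustration of the CREW semantics, not the simulator's implementation.

```python
class CrewMemory:
    """Global memory with CREW semantics for one simulated time unit:
    all READs are served first, then all WRITEs are applied, and two
    writes to the same address in the same step are a write conflict."""
    def __init__(self, size):
        self.cells = [0] * size

    def step(self, reads, writes):
        """reads: list of addresses; writes: list of (value, addr) pairs.
        Returns the values read.  Raises on a write conflict."""
        results = [self.cells[a] for a in reads]   # reads see the old values
        targets = [a for _, a in writes]
        if len(targets) != len(set(targets)):
            raise RuntimeError("WRITE conflict: two processors wrote "
                               "the same location in the same time unit")
        for value, addr in writes:
            self.cells[addr] = value
        return results

mem = CrewMemory(4)
# Concurrent reads of cell 0 are fine; the write lands after the reads:
print(mem.step(reads=[0, 0], writes=[(7, 0)]))  # -> [0, 0], the old value
print(mem.cells[0])                             # -> 7
```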

A FOR and WHILE Loops

A FOR loop should be written according to the following format:

    FOR LoopVar := FirstVal TO LastVal DO
    BEGIN
        AnyCode
    END_FOR;

The keyword TO is used instead of the corresponding STEP 1 UNTIL of SIMULA, to make PIL closer to the PPP syntax. The whole FOR ... DO must be contained in a single line. In the current version, LoopVar, FirstVal and LastVal must be written as one text string without any whitespace characters, and the assignment operator must be enclosed by spaces.

WHILE loops have a very similar form:

    WHILE LoopCondition DO
    BEGIN
        AnyCode
    END_WHILE;

The LoopCondition may contain whitespace characters; however, the whole WHILE ... DO statement must be contained in a single line, and DO must be the last symbol on that line.

A Built-In Procedures and Variables

The CREW PRAM simulator offers various built-in procedures and variables which will be used in most PIL programs.

Interaction with the simulator

- Use(TimeUnits);
  This procedure is used to represent time used by the calling processor. TimeUnits is an arbitrary integer expression.

- Wait(TimeUnits);
  Does exactly the same as Use, but should be used to make it possible to distinguish idle time from active work.

- IntegerVar := ClockRead;
  ClockRead returns the current time as an integer value. Remember that the synchronous property of a CREW PRAM implies that this clock may be perceived as one global clock which is always correct for all the processors.

- ClockReset;
  This call sets the clock to zero. It is typically called after having done uninteresting initialisation work etc., just before the computation which should be measured starts.

- SYNC;
  This procedure makes all the processors in the active processor set synchronise, so that they all return from the call to SYNC simultaneously. This explicit resynchronisation is implemented as a binary tree computation with O(log n) time consumption, as outlined in Section . Its use together with the CHECK procedure in a recommended top-down approach for CREW PRAM programming is discussed in Section . In the current version of the simulator it is assumed that SYNC is called by all the processors in the active processor set.

- CHECK;
  Checks whether all the processors in the active processor set are synchronous at the place in the program where CHECK is called. See also the description of the procedure SYNC above.

- Warning(Text);
  Prints WARNING followed by Text. The program execution continues.

- Error(Text);
  Prints ERROR followed by Text, and stops the program execution.

Using the global memory

See also Section A.

- AddressVar := MemUserMalloc(IntegerExpr);
  Allocates IntegerExpr consecutive memory locations on top of the user stack in the global memory. The address of the first location is returned.

- MemUserFree(IntegerExpr);
  Frees IntegerExpr consecutive memory locations from the top of the user stack in the global memory.

- ArrayPrint(StartAddressExpression, Size);
  Prints Size consecutive locations in global memory, starting at (and including) StartAddressExpression.

- PUSH(IntegerExpr);
  Allocates a new location on top of the user stack and writes the value of IntegerExpr in that location.

- IntegerVar := POP;
  POP returns the value stored in the location on top of the user stack, and frees the location.

- UserStackPtr
  This is a globally accessible variable of type INTEGER which points to the next free location on the user stack.

- MemN_Reads and MemN_Writes
  These are DEMOS objects of type COUNT which measure the number of READ operations and WRITE operations to the global memory. The READ count may be reset by inserting MemN_Reads.RESET in your PIL program, and its current value may be reported by MemN_Reads.REPORT. Similarly for WRITE operations.

The procedures PUSH and POP may be used together with the UserStackPtr to pass values from a single processor to a large number of processors in a very efficient way. See the file test.pil for a small example of how this may be done to implement parallel parameter passing, or study the main program and the procedure SetUp in the file systolicdemo.pil, explained in Section A, for a more realistic example.
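The parameter-passing idea can be sketched in Python: the master pushes one value, every processor computes the same address just below the stack pointer and reads it concurrently, which is legal on a CREW PRAM since concurrent reads are allowed. The class is an illustration whose method names are only assumed to match the PIL built-ins.

```python
class UserStack:
    """Sketch of parallel parameter passing via the user stack: the
    master PUSHes a value; all processors then derive the same address
    from the stack pointer and READ it concurrently (legal on a CREW
    PRAM).  Illustrative only."""
    def __init__(self):
        self.cells = []

    @property
    def ptr(self):            # next free location, like UserStackPtr
        return len(self.cells)

    def push(self, value):
        self.cells.append(value)

    def pop(self):
        return self.cells.pop()

stack = UserStack()
stack.push(99)                       # master publishes one parameter
addr = stack.ptr - 1                 # every processor computes this address
received = [stack.cells[addr] for _ in range(8)]  # 8 concurrent reads
stack.pop()                          # master frees the location again
print(received)  # -> [99, 99, 99, 99, 99, 99, 99, 99]
```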

Generating random data and collecting statistics

A set of global variables has been included in the CREW PRAM simulator to make it easy to use the DEMOS random number generators and the DEMOS facilities for collecting statistical data. These global variables make it possible to create the specific DEMOS objects when they are needed (probably done by the master processor) and to use them from any processor, without interfering with the simulated time or other quantities that are measured by the simulator.

These variables are of SIMULA type REF(DemosClassName) and are named Global(DemosClassName)(IndexVal). DemosClassName may be one of COUNT, TALLY or HISTOGRAM, with IndexVal in the given range, or one of CONSTANT, NORMAL, NEGEXP, UNIFORM, ERLANG, EMPIRICAL, RANDINT, POISSON or DRAW, with IndexVal in the given range.

An extensive example is found in the file sync.pil in the pil directory. Consult also the DEMOS book [Bir].

Tracing and output

Some programmers have the opinion that making code for producing output in SIMULA requires a lot of typing. Also, when debugging programs, it is convenient to be able to trace out the values of variables during the program execution. It should be possible to turn such program tracing off without removing the code which produces the tracing. It may be useful to keep it even after you believe you have made your last change to the program: if a later modification introduces a new bug, you will probably have good use for the old tracing code.

The CREW PRAM simulator contains a set of simple procedures which may be a good help in making this kind of tracing. The routines may also very well be used for normal output from the program. All these procedures start with T_.

- T_Off;
  Turns tracing off. Tracing procedures called after this procedure will do nothing.

- T_On;
  Turns tracing on. T_On and T_Off operate on a global flag (BOOLEAN T_Flag) which controls tracing for all processors.

- T_ThisIs;
  Prints out the variable name of the processor set and the processor number for the processor performing this procedure. This is very useful if you have lost control over your processors.

- T_Time;
  Reports the current simulation time.

- T_TITITO(Text, IntegerExpr, Text, IntegerExpr, Text);
  This is the most complicated of a set of similar procedures used to trace out values from the program in a compact manner.

- T_TITIO, T_TITI, T_TITO, T_TIO, T_TBO, T_TI, T_TO, T_T
  These procedures are all similar to T_TITITO. The part of the name following the underscore character is intended to be a short way to specify the parameters which must be given to the procedure: T is short for Text, I for Integer, and B for Boolean. A procedure with a name ending in O calls OUTIMAGE before it returns.

All lines in the simulator output produced by the T_ procedures start with a fixed character in the first column, making them easy to filter away from the other output.
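The filtering convention can be sketched in Python; the '*' marker character below is a placeholder, since the actual character used by the simulator is not reproduced here.

```python
def t_to(text, tracing_on=True, prefix="*"):
    """Sketch of the T_ trace procedures: every trace line starts with a
    fixed marker character in the first column, so trace output can be
    filtered from normal output.  The '*' marker is an assumption."""
    if not tracing_on:
        return None           # tracing turned off: produce nothing
    return f"{prefix} {text}"

lines = [t_to("Hello World"), "normal output", t_to("p=3 done")]
trace_only = [ln for ln in lines if ln and ln.startswith("*")]
print(trace_only)  # -> ['* Hello World', '* p=3 done']
```

A one-line grep on the marker character then separates trace lines from the program's real output.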

Miscellaneous

- IsOdd(IntegerExpr)
  Returns TRUE if IntegerExpr evaluates to an odd number.

- IsEven(IntegerExpr)
  Returns TRUE if IntegerExpr evaluates to an even number.

- log(IntegerExpr)
  Returns the value of log2(IntegerExpr).

- power(IntegerExpr)
  Returns the value of 2^IntegerExpr.

(Footnote: I/O in SIMULA is buffered; OUTIMAGE empties the output buffer.)
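Python equivalents of these helpers are immediate. Note that log is taken here to be the integer (floor) base-2 logarithm, an assumption consistent with its use in binary-tree computations.

```python
def is_odd(x):  return x % 2 == 1
def is_even(x): return x % 2 == 0

def log2_int(x):
    """Floor of the base-2 logarithm of a positive integer
    (assumed semantics of the PIL built-in log)."""
    return x.bit_length() - 1

def power2(x):
    """2 raised to the power x (the PIL built-in power)."""
    return 1 << x

print(is_odd(7), is_even(10), log2_int(1024), power2(10))
# -> True True 10 1024
```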

A The Development Cycle

It is now time to give some more details of the edit-compile-link-run cycle.

A pilc

The PIL compiler, pilc, consists of two main phases. The first phase produces a SIMULA file, and the second phase compiles this file together with the source code for the simulator. The first phase is implemented as a UNIX pipeline, whose most important parts are the C preprocessor cpp and a filter called PILfix. These interior details are mentioned here to explain why error messages from the pilc command may come from at least three main sources. This will now be demonstrated by introducing some small errors in our sample file oddeven.pil.

cpp errors

Insert a space before the s in sorting in the include statement at the beginning of the file, and try to compile the file with pilc. The output will then contain a line looking like

    tmpppl: Can't find include file sorting

(Footnote: tmpppl is a temporary file used by pilc; the number given is the line number in that file.)

Early error messages like this seldom come alone. In our example we also get an error message from the SIMULA compiler. In general, it is seldom worthwhile to invest too much effort in understanding the subsequent error messages.

When encountering error messages from cpp, remember that man cpp is available on your terminal.

Before you proceed, remove the error introduced above.

(Footnote: PILfix does the main work of translating PIL specific features into SIMULA code. PILfix is itself a pipeline of three simple filters, all written in the pattern scanning and processing language gawk (GNU awk).)

(Footnote: This and all line numbers in later examples are very dependent on the actual version of the simulator and the example file you are using. Please do not expect them to match what you get on your screen.)

PILfix errors

During the progress of the compilation, PILfix produces an indented listing of the procedures found in the program. This is valuable information when looking for errors in the block structure of the program.

In the current version, not all syntax errors that may occur in PIL programs are found and reported by PILfix. However, most errors are detected in the next phase, the SIMULA compiler.

Insert a space after the n in the ASSIGN statement of oddeven.pil and compile. We get the following error message:

PILfix ERROR at internal line no.
PIL line was: ASSIGN n PROCESSORS TO ProcSet
Error message: ASSIGN statement, improper syntax

Other errors found by PILfix are demonstrated by test files in the pil directory.

simula errors

simula is the name of the SIMULA compiler DH It is run on the simulator

source les in the src directory which includes the le PILalgsim whichis

pro duced by the subphases discussed ab ove It is seldom necessary for the PIL

programmer to lo ok at PILalgsim

Most error messages from simula are selfexplanatory and will not b e discussed

here However there is one kind of error which often o ccurs in SIMULA and PIL

programs that may pro duce a frightening amount of error messages This typ e of

error is often called error in block structure First let us demonstrate howsuchan

error unwillingly maybeintro duced in your program

The traditional format for comments in SIMULA is the keyword COMMENT followed by any number of lines, until the comment is ended with a ";". If you forget to terminate the comment by ";", the comment will delete all code up till the next ";". This may cause an error message, but may also result in a faulty program where important parts are silently omitted.

In our sample file, remove the ";" at the end of the comment just before the FOR statement. This will produce a lot of error messages referring to names in other files than your program (PILalg.sim):

Error in file util.sim (at the beginning of the listing)
Error: Too few ends, extra end inserted (at the end)

The file name util.sim refers to one of the CREW PRAM simulator source files.



(Footnote: The SIMULA version used by the CREW PRAM simulator allows "!" instead of COMMENT.)

Whenever this kind of error message occurs, the reason is probably an error in the block structure of the SIMULA (PIL) program. In this case, the last part of the error listing tells the reason; this does not generally happen.

The following habit is warmly recommended for all block-oriented languages, to avoid this kind of error: when typing a comment of the form COMMENT ... ; or ! ... ;, or when typing a BEGIN ... END structure, always type the termination part of the syntax before you type the insides.

How to map line numbers given by the SIMULA system to line numbers in your PIL algorithm is discussed under the run command below.

A link

Since the current version of the CREW PRAM simulator compiles the PIL program together with the simulator as one source file, it is very seldom that you get errors from the link command that have not been reported by the compiler.

A run

SIMULA has a powerful run time system, and the PIL programmer takes advantage of this: together with a run time error message, the source line number where the failure occurred will be printed.

Line numbers

The line numbers in error messages from the SIMULA compiler or run time system will not match the line numbers in the PIL program. The reason is that the PIL program is translated to a larger SIMULA program, which is included in the simulator code.

However, it is still easy to find the PIL line corresponding to a SIMULA line. The file pram.lis in the src directory contains the listing produced by pilc and corresponding to the code executed by run. Find the line number in this file where the error occurred. In most cases you will, by inspection of pram.lis, understand where the corresponding line in the PIL program is. For large programs it may be practical to use an editor with two buffers, one containing your PIL program and one containing pram.lis. Then the editor will probably do the pattern matching faster and more reliably.

Error messages from the simulator

We will now briefly explain the run time error messages from the simulator.

(Footnote: A brute force method for locating such errors is a systematic and repeated exclusion of parts of the program, followed by recompilation. The include statement may be helpful in doing this.)

(Footnote: There are in fact two columns with line numbers in this listing; the one to the right is the right one for this purpose.)

- WRITE CONFLICT
This occurs if two or more processors perform a write operation on the same location (address) in the global memory. The CREW PRAM simulator will stop the program at the end of the time unit when the conflict occurred. See the example file test.pil.

- SYNCHRONY LOST
This is given as a warning from the simulator when the CHECK procedure detects that not all processors in the active processor set are synchronous. The SIMULA source line number where CHECK was called is included in the message. An example is given in test.pil. Note that the CHECK procedure is called as part of the END_FOR_EACH PROCESSOR statement.

- insufficient memory
The CREW PRAM simulator implementation is based on allocating memory dynamically as storage is needed during the execution. When running PIL programs using a large number of processors, the following message may appear:

Runtime Error at line (Block at line)
No storage freed by garbage collector

The lower limit for the number of processors giving this run time error message may be made larger by using the -m option for the SIMULA compiler. Consult [DH] and make your own version of the pilc command, but do not forget that you are using a multi-user system.

A Debugging PIL Programs Using simdeb

Perhaps the greatest advantage of executing the PIL program as a SIMULA program is that it implies the availability of the SIMULA debugger simdeb [HT]. Notice that the debugger operates at the SIMULA source code level.

To help the PIL programmer getting started with the debugger, we will now give some examples of commands that have proven useful. The various commands may be combined, as shown in some of the examples.

The SIMULA debugger is started on your last compiled and linked PIL program by issuing the command debug.

help
This command produces a listing of the available debugger commands.

help output
This is the format for obtaining more information about a specific command, in this case the output command.

proceed
Starts execution of the program.

at LineNo stop
The at command is used to define that some debugger actions should take place when the program executes the specified source line number LineNo. In this case we want the program to stop at LineNo, i.e. the command defines a breakpoint. LineNo should be a line number in pram.lis, as discussed in Section A.

output VarName
Prints the value of the specified variable.

output ObjectRef.Attribute
Prints the value of the specified attribute.

output ObjectRef all
Prints all attributes in the object pointed to by ObjectRef.

at LineNo if p = ... then stop
Defines a breakpoint exclusively for one processor in the current processor set.

at LineNo if IntTime > ... then chain proceed
Displays the call chain at LineNo when the simulator time has passed the given value, and proceeds the execution.

at LineNo if p = ... then output all proceed
Prints the value of all variables in the current block for the given processor and proceeds.

trace on FirstLine LastLine
Makes the debugger print each SIMULA source line executed in the range given by the two line numbers.

Note that most of the commands may be abbreviated (ou = output, pro = proceed, etc.). Though this debugger is very powerful, nothing can replace the value of fresh air, systematic work, a lot of time, and a cup of coffee.

A Modelling Systolic Arrays and other Synchronous Computing Structures

"I know of only one compute-bound problem that arises naturally in practice for which no systolic solution is known, and I cannot prove that a systolic solution is impossible."
— H. T. Kung in Why Systolic Arrays [Kun]

This section gives some hints on how systolic arrays and similar systems may be simulated with the help of the CREW PRAM simulator. The text explains how various kinds of systolic arrays may be modelled. It is hoped that the reader will understand how these ideas may be adopted for simulation of other synchronous, globally clocked computing structures.

A Systolic Arrays, Channels, Phases and Stages

The CREW PRAM simulator was originally made for the sole purpose of simulating synchronous CREW PRAM algorithms. However, a few small extensions to the simulator, a PIL package and two examples have been made for demonstrating how the simulator can be used for modelling systolic arrays. The adopted solutions are rather ad hoc; more elegant solutions are possible by doing more elaborate redesigns of the simulator.

The file systool.pil in the pil directory contains PIL code that provides some help in simulating systolic arrays. Its use will be documented by two very simple examples. First, it is necessary to explain some central concepts.

Systolic arrays

An important paper on systolic arrays is H. T. Kung's Why Systolic Architectures [Kun]. It is a comprehensive, compact and readable introduction to the topic.

A systolic array (or system) is a collection of synchronous processors, often called processing elements or cells, connected in a regular pattern. Information in a systolic system flows between cells in a pipelined fashion. Each cell is designed to perform some specialised operation. When implementing systolic systems in VLSI or similar technologies, it is important to have simple and repeated cells which are interconnected in a regular pattern with mostly short (local) communication paths.

Such modularity and regularity is not a necessity when simulating on the CREW PRAM simulator. Nevertheless, it may be of help in mastering the complexity involved in designing parallel systems.

Channels

Systolic arrays pass data between processing elements on channels. These are normally unidirectional, with one source cell and one destination cell. When simulating a systolic array on the CREW PRAM model, it is natural to let each channel correspond to a specific location in the global memory. This memory area must be allocated before use. Further, each processing element must know the address of every channel it uses during the computation. The channels offered by the simulator may be used in both directions, and they may have multiple readers and multiple writers. If two or more cells write (send) data to the same channel in the same stage, the result will be unpredictable.

(Footnote: This is only true for integer channels. When using channels for passing real numbers, the current version of the simulator provides a set of preallocated channels.)

Three phases in a stage

A systolic system performs a number of computational steps, here called stages. One stage consists of every processing element doing the following three phases:

1. ReadPhase: Each cell reads data from its input lines (channels).

2. ComputePhase: Each cell performs some computation. This is often a simple operation, or just a delayed copying (transmitting) of data.

3. WritePhase: Each cell writes new data values, computed in the previous phase of this stage, into its output lines (channels).

All cells will finish stage i before they proceed to stage i+1. In other words, we have synchronous operation.

These three phases should be considered as logical phases. In an implementation it is irrelevant whether phase 1 of stage i+1 is overlapped with phase 3 of stage i, as long as the data availability rule given below is not violated. Further, phases 2 and 3 may be implemented as overlapping, or as one phase.

Data availability rule: The data written in phase 3 of stage i cannot be read in phase 1 earlier than in the following stage (i.e. in stage i+1 and later stages).
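The stage discipline and the data availability rule can be sketched with a double-buffered channel update. This is a hypothetical illustration in Python, not the simulator's own code (which is SIMULA/PIL): writes made during stage i are buffered and only become visible to reads from stage i+1 onwards.

```python
# Sketch of a synchronous stage loop obeying the data availability rule:
# a value written in the write phase of stage i is first readable in the
# read phase of stage i+1.

def run_stages(cells, channels, n_stages):
    """cells: list of functions f(readable_channels) -> dict of writes."""
    current = dict(channels)          # values readable in this stage
    for stage in range(1, n_stages + 1):
        pending = {}                  # values written during this stage
        for cell in cells:
            inputs = dict(current)    # ReadPhase: sees only old values
            pending.update(cell(inputs))  # Compute + WritePhase, buffered
        current.update(pending)       # writes become visible in stage i+1
    return current

# A two-cell pipeline copying a value one channel to the right per stage.
cells = [
    lambda ch: {"c1": ch["c0"]},
    lambda ch: {"c2": ch["c1"]},
]
result = run_stages(cells, {"c0": 7, "c1": 0, "c2": 0}, n_stages=2)
# After two stages the value on c0 has propagated to c2.
```

Because each cell reads only the snapshot taken at the start of the stage, it makes no difference in which order the cells execute within a stage, which is exactly the freedom the rule is meant to guarantee.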

A Example: Unit Time Delay

In most (if not all) systolic arrays, the new values computed from the inputs at the start of stage i are written on the output lines at the end of the same stage, and are available in the next stage. We say that all the cells operate with unit time delay, and one time unit is the same as the duration of one stage.

The example file systolicdemo.pil models a simple linear array working in this manner. The structure is outlined in Figure A. The array does nothing more than copying the received data from left to right in a pipelined fashion. The regularity of this example makes it relatively easy to let n, the number of cells in the linear array, be a parameter to the model. In the simulation program, each kind of cell is described as a procedure, see Figure A.

This artificial example has one special processing element for producing input to the array (InputProducer), and one for receiving the output from the last cell in the array (OutputConsumer). They are not a part of the systolic array, but a convenient way of modelling its environments.

The channels which are used for communication between the cells should be used by the following procedures:

- SEND(ChannelAddressExpression, IntegerExpression)

- IntegerVariable := RECEIVE(ChannelAddressExpression)

[Figure: a linear array of cells (Cell 1, Cell 2, ..., Cell n), each a LinearCell, with an InputProducer feeding the first cell and an OutputConsumer reading the output of the last.]

Figure A: Linear array simulated in the file systolicdemo.pil

- IntegerVariable := RECEIVE_ZEROFY(ChannelAddressExpression)
The value stored at the channel is set to zero after it has been passed to the caller of this procedure.

It is also possible to use READ and WRITE on the channels (see Section A), but it is not recommended.
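The channel operations above can be mimicked in a few lines. This is a hypothetical Python sketch, not the simulator's implementation; the methods mirror SEND, RECEIVE and RECEIVE_ZEROFY, and a second send to the same channel within one stage is flagged (for ordinary memory locations the simulator reports this as a WRITE CONFLICT; for channels the text above only promises an unpredictable result).

```python
# Sketch of channels as locations in a shared memory, with per-stage
# detection of multiple writers to the same channel.

class Channels:
    def __init__(self, size):
        self.mem = [0] * size
        self.written = set()          # channels written in the current stage

    def new_stage(self):
        self.written.clear()

    def send(self, addr, value):
        if addr in self.written:
            raise RuntimeError(f"write conflict on channel {addr}")
        self.written.add(addr)
        self.mem[addr] = value

    def receive(self, addr):
        return self.mem[addr]

    def receive_zerofy(self, addr):
        value = self.mem[addr]
        self.mem[addr] = 0            # channel is cleared after the read
        return value

ch = Channels(4)
ch.send(2, 99)
assert ch.receive(2) == 99
assert ch.receive_zerofy(2) == 99 and ch.receive(2) == 0
```

Multiple readers are unproblematic (this is a CREW model); only concurrent writers within one stage need the bookkeeping.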

Measuring the length of the phases

As described above, each stage consists of three phases. To achieve proper synchronisation of the array, the user must specify how the operation performed in each stage is distributed on the three phases. This must be done by inserting the procedure call StartComputePhase at the end of the read phase, and the call StartWritePhase at the end of the computation phase; see the procedure LinearCell in Figure A.

This makes it possible for the package systool.pil to measure the length of the three phases. At the end of the computation, a call to the procedure ReportSystolic will produce a listing of the longest and shortest read, compute and write phase performed, and also the processor numbers of the processing elements which performed those phases. This information may be used to localise bottlenecks in the system. Remember that the processing elements should operate in synchrony. The length of a stage must be the same for all cells, simple or complicated. The structure may be made faster by moving work from the most time consuming cells to underutilised cells.
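The kind of bookkeeping that such a report needs can be sketched as follows. This is a hypothetical Python illustration of ReportSystolic-style output, not the package's own code: per cell we record the length of each phase, then report the extremes per phase, which points directly at the bottleneck cells.

```python
# Sketch: find the longest and shortest read/compute/write phase and the
# cells that performed them, as a bottleneck-hunting aid.

def report(phase_lengths):
    """phase_lengths: {cell_no: {"read": t, "compute": t, "write": t}}"""
    rows = []
    for phase in ("read", "compute", "write"):
        times = {cell: t[phase] for cell, t in phase_lengths.items()}
        longest = max(times, key=times.get)
        shortest = min(times, key=times.get)
        rows.append((phase, times[longest], longest, times[shortest], shortest))
    return rows

stats = {1: {"read": 2, "compute": 10, "write": 2},
         2: {"read": 2, "compute": 3, "write": 2}}
for phase, tmax, cmax, tmin, cmin in report(stats):
    print(f"{phase:7s} longest {tmax} (cell {cmax}), shortest {tmin} (cell {cmin})")
```

Here the compute phase of cell 1 dominates; since every stage lasts as long as the slowest cell, moving work from cell 1 to cell 2 would shorten every stage.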

(Footnote: It is probably the length of the compute phase that is most interesting. Receiving and sending of data are often done in parallel.)

In this context, time is measured in number of CREW PRAM simulator time

[Figure: a LinearCell (Cell i) with input channel inp on the left and output channel out on the right.]

PROCEDURE LinearCell;
BEGIN
  INTEGER val;
  val := RECEIVE(inp);
  StartComputePhase;
  ! This element has nothing to compute;
  StartWritePhase;
  SEND(out, val);
END_PROCEDURE LinearCell;

Figure A: Simple cell in a linear systolic array (left) and its description (right)

units, and is specified by the user by calls to the Use procedure. The time used by the procedures for accessing the channels described above is defined by the constants t_SEND etc. at the start of the file systool.pil. At the same place you may find constants that specify the maximum lengths that are tolerated for each of the three phases. If your system exceeds one of these limits, an appropriate error message will be given. The limits may be changed by the user.

The modelling of time consumed in the phases may be done at various degrees of detail, and it may be omitted. However, the calls to StartComputePhase and StartWritePhase may not be omitted.

Describing the processing elements (cells)

The procedure LinearCell describes the operation, performed in each stage, of each cell in the linear array. Therefore, variables declared in the procedure are new (initialised to zero) at the start of each stage. Data items that must live during the whole computation must be declared outside the procedures, and before the keyword BEGIN_PIL_ALG. These variables are local to each processing element. See for example the variable CellNo in the file systolicdemo.pil, which stores the number of each cell in the linear array.

The variables inp and out are also of this kind; they store the address of the input channel and output channel of every cell in the linear array. The initialisation of these variables for each processing element corresponds to building the hardware. This is done by the procedure called SetUp, which is called by all processing elements.

SetUp may require knowledge of global variables such as the size of the linear array, n. This may be done by passing the value on the UserStack; see the calls to PUSH in the main program and the first part of the SetUp procedure.

General program structure

Figure A outlines the main parts of a general systolic array simulation program using systool.pil. It is hoped to be self-explanatory. The reader should note that the IF tests in the main loop (FOR Stage := ...) must be disjunctive. These tests are used to select what kind of cell a processor should simulate during the stage; one processor may only act as one cell type during a stage.

Figure A outlines only one of several possible ways of structuring a PIL program for simulating systolic arrays. You should not do changes to the structure shown in Figure A unless you really need to do it, and only after having studied the details in systool.pil.

Please run the example systolicdemo.pil and study the output. Important parts of the output are the lines showing the start of each stage and the contents on the communication channels. The latter is produced by the procedure TraceChannels. This procedure should be tailored to the actual structure being simulated. The CREW PRAM simulator time information (TIME) is of less value, since we have defined a new time unit (stage) at one level higher.

A Example: Delayed Output and Real Numbers

In the previous example, every cell used exactly one stage to produce new outputs from the inputs. There are cases where this restricts the degree of abstraction that may be used in a model.

Delayed output

Consider the linear array of the previous example. This structure delivers an element as output n stages after it was received as input. If this is the sole function of what we want to model, it should be possible to represent it by a single processing element with delay n stages, instead of a linear array consisting of n unit delay cells.

This may be achieved by offering an additional parameter, delay, in the SEND procedure. The procedure call SEND(addr, val, delay) will do the same as the SEND described above in Section A, but the data element will be available on the channel at the start of stage i + delay, and as long as it is not overwritten by another SEND call or set to zero by a RECEIVE_ZEROFY call.

Once we have this possibility, we may use a single processor to represent a composite structure which makes several computations of varying complexity and delivers output with differing delays. This possibility may be helpful in top down development of systolic arrays and similar structures.
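A delayed channel of this kind is easy to model with a queue of pending deliveries. The following is a hypothetical Python sketch (not the systool.pil implementation): a value sent in stage i with delay d becomes visible on the channel at the start of stage i + d, so one cell can stand in for a d-stage pipeline.

```python
# Sketch of a channel supporting SEND with a delay parameter: the value
# sent in stage i with delay d first appears at the start of stage i + d.

class DelayedChannel:
    def __init__(self):
        self.value = 0
        self.pending = {}             # arrival stage -> value

    def send(self, stage, value, delay=1):
        self.pending[stage + delay] = value

    def begin_stage(self, stage):
        if stage in self.pending:     # delivery due at this stage
            self.value = self.pending.pop(stage)

    def receive(self):
        return self.value

ch = DelayedChannel()
ch.send(stage=1, value=42, delay=3)   # models a 3-stage delay element
for stage in range(2, 5):
    ch.begin_stage(stage)
# The value first appears at the start of stage 4 (= 1 + 3).
assert ch.receive() == 42
```

With delay=1 this reduces to the unit-time-delay behaviour of the previous example.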

If you insert

#define DELAYS_USED

(Footnote: Especially, be very careful when changing the value of the variable Stage, and the placement of StartSystolic, EndSystolic or StartReadPhase. Do not forget StartComputePhase or StartWritePhase, and do not change their order.)



(Footnote: This information may be removed by using the filter removeTIME on the output produced by the run command; just type run | removeTIME.)

#include "systool.pil"
PROCESSOR_SET ProcSetName;
... declaration of variables for the cells

PROCEDURE SetUp;
BEGIN
  ... read eventual parameters from the UserStack
  ... initialize variables and channel-addresses for the various cell types
END_PROCEDURE SetUp;

PROCEDURE CellDescription-1;   ... one for each cell type
BEGIN
  ... declaration of variables used during one stage
  ... actions in read phase (i.e. calls to RECEIVE)
  StartComputePhase;
  ... actions in compute phase
  StartWritePhase;
  ... actions in write phase (i.e. calls to SEND)
END_PROCEDURE CellDescription;
... more cell descriptions

BEGIN_PIL_ALG
  ... assign processors to ProcSetName;
  ... allocate (integer) channels
  ... PUSH eventual parameters on UserStack;
  FOR_EACH PROCESSOR IN ProcSetName
    SetUp;
    CHECK;
    StartSystolic;
    FOR Stage := 1 STEP 1 UNTIL ... DO
    BEGIN
      StartReadPhase;
      IF p = 1 THEN T_TIO("****** Start of stage no", Stage);
      IF p = ... THEN CellDescription-1;
      IF p = ... THEN CellDescription-2;
      ...
    END_FOR;
    EndSystolic;
    ReportSystolic;
  END_FOR_EACH PROCESSOR;
END_PIL_ALG

Figure A: General format of program for simulation of systolic arrays using systool.pil

before the place where systool.pil is included, you will get access to a version of the SEND procedure that requires a delay given as parameter:

- SEND(ChannelAddressExpression, IntegerExpression, DelayExpression)
DelayExpression must evaluate to an integer >= 1. The value 1 corresponds to unit time delay, as discussed in the previous example.

- The procedures for receiving data are as before.

It is also possible to use READ and WRITE on the channels, but it is not recommended.

Using real numbers

The CREW PRAM simulator was originally made for handling integers only. Upon request, a simple and ad hoc extension has been made to make it possible to operate on real numbers.

A fixed number of preallocated channels for passing real numbers is provided by the simulator. In the current version, channels are provided with a fixed range of addresses. The only difference from the ordinary integer channels is that the real channels cannot be allocated, or operated upon by READ and WRITE.

In the current version, real numbers may only be used together with delays. If you insert

#define REALS_USED

before the place where systool.pil is included, you will get access to the following procedures for passing real numbers between the processing elements:

- SEND_REAL(ChannelAddressExpression, RealExpression, DelayExpression)
The delay must be >= 1.

- RealVariable := RECEIVE_REAL(ChannelAddressExpression)

- RealVariable := RECEIVE_REAL_ZEROFY(ChannelAddressExpression)

In addition, we have:

- T_TR(Text, RealExpression)
Prints the text followed by the value of RealExpression.

- RealArrayPrint(ChannelStartAddressExpression, Size)
Prints the contents of Size consecutive real channels, starting at (and including) ChannelStartAddressExpression.

(Footnote: WRITE corresponds to SEND with delay 1.)

Both delayed output, channels passing real numbers, and zerofying are demonstrated in the example file systolicdemo.pil. This artificial example is an extension of the previous example. We have introduced one real channel going down from each of the n cells in the linear array. Further, we have introduced a new cell type, RealCell, used in two instances. See the ASCII figure at the start of the file systolicdemo.pil.

RealCell is used to illustrate receiving with and without zerofying. LinearCell has been extended to compute a real number which is sent downwards. The instances of LinearCell demonstrate various delays.

(Footnote: Note how time (Stage), state (val) and sender (CellNo) may be encoded in a number.)
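The footnote's idea of packing the stage, the value and the sender's cell number into a single number can be sketched as follows. This is a hypothetical illustration (in Python, with an arbitrary choice of two decimal digits per small field), not the encoding used in systolicdemo.pil:

```python
# Sketch: encode (stage, val, cell_no) into one number and recover the
# fields, assuming val and cell_no each fit in two decimal digits.

def encode(stage, val, cell_no):
    return stage * 10_000 + val * 100 + cell_no

def decode(number):
    return number // 10_000, (number // 100) % 100, number % 100

packed = encode(stage=7, val=42, cell_no=3)
assert decode(packed) == (7, 42, 3)   # fields recovered intact
```

Decimal fields keep the packed number human-readable in trace output; bit-shifting would serve equally well when readability does not matter.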

Appendix B

Cole's Parallel Merge Sort in PIL

This is a complete listing of Cole's parallel merge sort algorithm implemented in PIL for execution on the CREW PRAM simulator.

B.1 The Main Program

! Cole.pil
! Cole's Parallel Merge Sort algorithm.
! Complete implementation in PIL on the CREW PRAM simulator, with detailed
! time modelling.
! Author: Lasse Natvig (Email: lasse@idt.unit.no)
! Started:        Finished:        No known errors.
! Last changed:   (improvements)
! Copyright (C) Lasse Natvig, IDT/NTH, Trondheim, Norway
!
! Note that, in general, UP is short for Up array, SUP is short for
! SampleUp array, and NUP is short for NewUp array;

#include "Sorting"
#include "Cole.var"
#include "Cole.mem"    ! specifies memory usage;
#include "Cole.proc"
#include "Cole.proc"

BEGIN_PIL_ALG
! Loop for testing the algorithm on several problem instances;
FOR mm := ... TO ... DO
BEGIN
  nn := power(mm);
  addr := GenerateSortingInstance(nn, RANDOM);
  ClockReset;
  READ nn FROM UserStackPtr;
  UPprocs := UPsize(nn);    ! See Cole.proc;
  SUPprocs := SUPsize(nn);  ! See Cole.proc;
  Use(tLOAD + tADD + tADD + tSTORE);
  NoOfProcessors := UPprocs + SUPprocs + SUPprocs;
  Use(tMalloc + tSTORE + tLOAD + tSHIFT + tSUB + tMalloc + tSTORE);
  UPstart := MemUserMalloc(NoOfProcessors);  ! storage for Arrays;
  addr := MemUserMalloc(NoOfProcessors);     ! storage for ExchangeArea;
  addr := MemUserMalloc(nn);                 ! storage for NodeAddressTable;
  addr := MemUserMalloc(nn);                 ! storage for NodeSizeTable;
  ! Push variables on the user stack as parameters to the SetUp procedure;
  Use(tPUSH);
  PUSH(addr); PUSH(addr); PUSH(addr);
  PUSH(UPprocs); PUSH(SUPprocs); PUSH(UPstart);
  ASSIGN NoOfProcessors PROCESSORS TO ProcSet;
  FOR_EACH PROCESSOR IN ProcSet
    SetUp;  ! Read pushed variables and initialize local variables;
    Use(tFOR);
    FOR Stage := ... TO LastStage DO
    BEGIN
      ! Since nothing is done in Stage ... and ...
      ! (if not convinced, study an example);
      Use(tIF);
      IF Stage ... THEN
        ComputeWhoIsWho;
      Use(tIF + tSUB + tJNEG);
      ! There is no new NUP array to copy in stage ... or ...
      ! (if not convinced, study an example);
      IF Stage ... OR Stage ... THEN
        CopyNUPtoUP;  ! phase ... of current stage;
      ! There is no need to make samples in Stage ... or ...
      ! (if not convinced, study an example);
      Use(tIF + tSUB + tJNEG);
      IF Stage ... OR Stage ... THEN
        MakeSamples;  ! phase ... of current stage;
      MergeWithHelp;
      ! Store the addresses of the NUP arrays made in this stage.
      ! This is used by CopyNUPtoUP at the beginning of the next stage;
      StoreArrayAddresses(NUP);
      Use(tFOR);
    END_FOR;
    IF ProcessorType = NUP AND active AND node ... AND index ... THEN
      ArrayPrint(ThisAddr, size);
  END_FOR_EACH PROCESSOR;
  T_TITITOR("Cole.pil sorted", nn, "integers in", ClockRead, "ticks");
  T_TIO("NoOfProcessors", NoOfProcessors);
END_FOR;
END_PIL_ALG

B.2 Variables and Memory Usage

B.2.1 Cole.var

! Cole.var;
! Variables used only by master;
PROCESSOR_SET ProcSet;
INTEGER mm;
INTEGER nn;
ADDRESS addr, addr, addr;

! Variables used by all processors;
INTEGER m;             ! number of levels in complete binary tree;
INTEGER n;             ! number of integers which are to be sorted (problem size);
INTEGER LastStage;     ! no of last stage in algorithm;
INTEGER ProcessorType; ! UP, SUP or NUP;
INTEGER CONSTANT UP = ..., SUP = ..., NUP = ...;
INTEGER UPprocs, SUPprocs;  ! number of UP and SUP (= NUP) processors;
INTEGER NoOfProcessors;
BOOLEAN InsideNode;    ! TRUE if this processor is allocated to an inside node;
ADDRESS UPstart;       ! address of the first element of the UP arrays;
ADDRESS ThisAddr;      ! address of element in UP/SUP/NUP permanently allocated
                       ! to this processor;
ADDRESS ExchangeAreaStart;     ! start of table in global memory used to
                               ! exchange locally stored values between processors;
ADDRESS NodeAddressTableStart; ! start of table used to store the address of the
                               ! UP, SUP or NUP array for each node during the
                               ! current stage;
ADDRESS NodeSizeTableStart;    ! start of table used to store the size of the
                               ! UP, SUP or NUP array for each node during the
                               ! current stage (only used for SUP arrays);

! Variables which are updated in each stage;
INTEGER Stage;         ! current stage no in algorithm, up to LastStage;
INTEGER HAL;           ! Highest Active Level in current Stage;
INTEGER LAL;           ! Lowest Active Level in current Stage;
INTEGER ExternalState; ! the number of stages this level has been external,
                       ! including the current stage;
INTEGER node;          ! no of node in complete binary tree which this processor
                       ! currently is allocated to;
INTEGER level;         ! level containing that node;
! NB: node and level may contain wrong values during stages when the processor
! is not active, i.e. the variable active (see below) is false;
INTEGER above;         ! the placement of the node corresponding to this processor,
                       ! measured in no of levels above the external level;
INTEGER ProcNoAtLevel; ! no of this processor among the processors of this kind
                       ! (UP, SUP or NUP) at this level;
INTEGER size;          ! size of UP, SUP or NUP array for nodes of this kind at
                       ! this level;
INTEGER index;         ! the no of the element served by this processor in the
                       ! UP, SUP or NUP array associated with this node
                       ! (element no in array for this node served by the processor);
BOOLEAN active;        ! TRUE if this processor is active in the current stage;
INTEGER RankInLeftSUP;
INTEGER RankInRightSUP;

B.2.2 Cole.mem

! Cole.mem;
COMMENT
GLOBAL MEMORY USAGE in Cole.pil:

location 1:    1st number
location 2:    2nd number
location 3:    3rd number      (the n numbers to be sorted)
...
location n:    nth number
location n+1:  problem size n
location n+2:  UP, SUP and NUP arrays (contains NoOfProcessors elements,
               one for each processor)
               ExchangeArea (contains one element for each processor;
               MUST be placed just after the UP/SUP/NUP arrays)
               NodeAddressTable
               NodeSizeTable
               addr, addr, addr (parameters to SetUp)
               UPprocs
               SUPprocs
               UPstart
               UserStackPtr (next free location)
end of COMMENT;

B.3 Basic Procedures

B.3.1 Cole.proc

! Cole.proc;

INTEGER PROCEDURE LowestNodeNo(level); INTEGER level;
BEGIN
  LowestNodeNo := power(level);
END_PROCEDURE LowestNodeNo;
INTEGER CONSTANT tLowestNodeNo = tLOAD + tpower;
! The time used is modelled at the caller place;

INTEGER PROCEDURE HighestNodeNo(level); INTEGER level;
BEGIN
  HighestNodeNo := power(level + 1) - 1;
END_PROCEDURE HighestNodeNo;
INTEGER CONSTANT tHighestNodeNo = tLOAD + tADD + tpower + tSUB;
! The time used is modelled at the caller place;

INTEGER PROCEDURE NoOfNodes(level); INTEGER level;
BEGIN
  NoOfNodes := power(level);
END_PROCEDURE NoOfNodes;
INTEGER CONSTANT tNoOfNodes = tLOAD + tpower;
! The time used is modelled at the caller place;

PROCEDURE StoreArrayAddresses(type); INTEGER type;
BEGIN
  Use(tStoreArray - tWRITE);
  IF ProcessorType = type AND active AND index ... THEN
    WRITE ThisAddr TO NodeAddressTableStart + node
  ELSE
    Wait(tWRITE);
END_PROCEDURE StoreArrayAddresses;

PROCEDURE StoreArraySizes(type); INTEGER type;
BEGIN
  Use(tStoreArray - tWRITE);
  IF ProcessorType = type AND active AND index ... THEN
    WRITE size TO NodeSizeTableStart + node
  ELSE
    Wait(tWRITE);
END_PROCEDURE StoreArraySizes;

INTEGER CONSTANT tStoreArray = tIF + tLOAD + tADD +
                               tLOAD + tSUB + tADD + tWRITE;

PROCEDURE StoreProcessorValue(val); INTEGER val;
BEGIN
  Use(tLOAD + tADD);
  WRITE val TO ExchangeAreaStart + p;
END_PROCEDURE StoreProcessorValue;
INTEGER CONSTANT tStoreVal = tLOAD + tADD + tWRITE;

INTEGER PROCEDURE UPsize(n); INTEGER n;
! Calculates the maximum size (in no of elements) of all the UP arrays:
! n + n/2 + ... (see Cole, p. ...);
BEGIN
  INTEGER size;
  Use(tLOAD + tSTORE + tSHIFT + tSTORE + tWHILE);
  size := n;
  n := n/2;  ! may be done by shift operation;
  WHILE n ... DO
  BEGIN
    Use(tLOAD + tADD + tSTORE + tSHIFT + tSTORE + tWHILE);
    size := size + n;
    n := n/2;  ! may be done by shift operation;
  END_WHILE;
  Use(tSTORE);
  UPsize := size;
END_PROCEDURE UPsize;

INTEGER PROCEDURE SUPsize(n); INTEGER n;
! Calculates the maximum size (in no of elements) of all the SUP arrays
! (see Cole, p. ...);
BEGIN
  INTEGER size;
  Use(tSTORE + tWHILE);
  size := ...;
  WHILE n ... DO
  BEGIN
    Use(tLOAD + tADD + tSTORE + tSHIFT + tSTORE + tWHILE);
    size := size + n;
    n := n/2;  ! may be done by shift operation;
  END_WHILE;
  Use(tSTORE);
  SUPsize := size;
END_PROCEDURE SUPsize;

PROCEDURE SetUp;
! Initialization of local processor variables which are unchanged during
! the computation;
BEGIN
  PROCEDURE InitProcs;
  ! Initialize processors level by level;
  BEGIN
    INTEGER ARRAY ProcNoofKind(...);
    PROCEDURE InitProcKind(type); INTEGER type;
    BEGIN
      INTEGER Initiated;
      INTEGER HowMany;
      INTEGER LevelsAbove;
      Use(tIF + tSTORE);
      IF type = NUP THEN     ! Phase done by the NUP processors is not;
        LevelsAbove := 1     ! performed at external nodes, thus they are;
      ELSE                   ! used one level higher up, see Cole p. ...;
        LevelsAbove := 0;
      Use(tLOAD + tSUB + tSTORE + tLOAD + tSTORE);
      Initiated := ProcNoofKind(type);
      HowMany := n;
      Use(tWHILE);
      WHILE HowMany ... DO
      BEGIN
        Use(tIF + tAND + tLOAD + tADD + tIF);
        IF p > Initiated AND p <= Initiated + HowMany THEN
        BEGIN  ! Processor p is of this category;
          Use(tLOAD + tSTORE + tIF + tSTORE + tLOAD + tSUB + tSTORE);
          above := LevelsAbove;
          InsideNode := above ...;
          ProcNoAtLevel := p - Initiated;
        END
        ELSE
          Wait(tLOAD + tSTORE + tIF + tSTORE + tLOAD + tSUB + tSTORE);
        Use(tLOAD + tADD + tSTORE +
            tIF + tIF + tSHIFT + tSTORE);
        Initiated := Initiated + HowMany;
        LevelsAbove := LevelsAbove + 1;
        IF type = UP THEN
          HowMany := IF HowMany = n THEN ... ELSE HowMany/2
        ELSE  ! SUP or NUP;
          HowMany := HowMany/2;
        Use(tWHILE);
      END_WHILE;
    END_PROCEDURE InitProcKind;
    Use(tSTORE + tLOAD + tADD + tSTORE);
    ProcNoofKind(UP) := 0;
    ProcNoofKind(SUP) := UPprocs;
    ProcNoofKind(NUP) := UPprocs + SUPprocs;
    InitProcKind(UP);
    InitProcKind(SUP);
    InitProcKind(NUP);
  END_PROCEDURE InitProcs;

BODY OF SetUp

READ UPstart FROM UserStackPtr

READ SUPprocs FROM UserStackPtr

READ UPprocs FROM UserStackPtr

READ NodeSizeTableStart FROM UserStackPtr

READ NodeAddressTableStart FROM UserStackPtr

READ ExchangeAreaStart FROM UserStackPtr

UsetLOAD tADD tADD tSTORE

NoOfProcessors UPprocs SUPprocs SUPprocs

set up the address of the element in UPSUPNUP array corresponding

to this processor

UsetLOAD tADD tSTORE

ThisAddr UPstart p

READ n FROM UPstart

Use tlog tSTORE tMULT tSTORE

m logn

LastStage m

Use tIF tLOADtADDtIFtSTORE tLOADtADDtIFtSTORE

Time used in longest path modelled

IF p UPprocs THEN

ProcessorTypeUP

ELSE IF p UPprocs SUPprocs THEN

ProcessorType SUP

IF p UPprocs SUPprocs THEN

ProcessorType NUP

read input and place in own array element

UsetIF

IF ProcessorType UP THEN

BEGIN

INTEGER myNumber

UsetLOAD tSUB

READ myNumber FROM ThisAddrn

WRITE myNumber TO ThisAddr

END

ELSE

WaittLOAD tSUB tREAD tWRITE

InitProcs

ENDPROCEDURE SetUp

PROCEDURE ComputeWhoIsWho

PROCEDURE ComputeWhoIsWho

BEGIN

PROCEDURE ComputeSizeOfArrays

BEGIN

INTEGER FirstStageAtThisLevel

UsetIF tIF

IF ProcessorType UP THEN

BEGIN

UsetIF

tLOAD tSUB tMULT tADD tSTORE

tIF tLOAD tSUB tpower tMULT tMIN tSTORE

IF level m THEN size

ELSE

BEGIN

FirstStageAtThisLevel m level

maximum UParray size at this level is powermlevel

IF Stage FirstStageAtThisLevel THEN

size MIN powerStage FirstStageAtThisLevel

powerm level

ELSE

size

END

WaittLOAD tAND tJNEG

END

ELSE

IF ProcessorType SUP THEN

BEGIN

UsetIF tLOAD tAND tJNEG

tLOAD tSUB tMULT tADD tSTORE

tIF tLOAD tSUB tpower tMULT tMIN tSTORE

IF level m AND Stage THEN size

ELSE

BEGIN

FirstStageAtThisLevel m level

maximum SUParray size at this level is powermlevel ie the

same as the maximum UP array size SUP UP at last active stage

IF Stage FirstStageAtThisLevel THEN

size MIN powerStage FirstStageAtThisLevel

powermlevel

ELSE

size

END

Wait

END

ELSE

BEGIN ProcessorType NUP

Knows that above therefore level must be m

UsetLOAD tSUB tMULT tADD tSTORE

tIF tLOAD tSUB tpower tMULT tMIN tSTORE

FirstStageAtThisLevel m level

maximum NUParray size at this level is powermlevel ie the

same as the maximum UP array size NUP UP in the next stage

IF Stage FirstStageAtThisLevel THEN

size MIN powerStage FirstStageAtThisLevel

powermlevel

ELSE

size

WaittIF tLOAD tAND tJNEG

END

ENDPROCEDURE ComputeSizeOfArrays

INTEGER NoOfActive No of active processors of this kind at this level

start of body in ComputeWhoIsWho

UsetLOAD tSUB tDIV tSUB tSTORE

tLOAD tSUB tSHIFT tSUB tSTORE

tLOAD tDIV tSTORE tIF tSTORE

LAL m Stage

HAL m Stage

ExternalState REMStage

IF ExternalState THEN ExternalState

UsetLOAD tSUB tSTORE

level LAL above

ComputeSizeOfArrays

UsetIF tLOAD tNoOfNodes tMULT tSTORE

IF level THEN

NoOfActive NoOfNodeslevel size

ELSE

NoOfActive

UsetIF

tLOAD tLowestNodeNo tLOAD tSUB tDIV tADD tSTORE

tLOAD tSUB tDIV tMULT tSUB tSTORE

tIF tSTORE

IF size THEN

BEGIN

node LowestNodeNolevel ProcNoAtLevel size

index ProcNoAtLevel ProcNoAtLevel size size

active ProcNoAtLevel NoOfActive

END

ELSE

BEGIN

node index active FALSE

END

ENDPROCEDURE ComputeWhoIsWho

PROCEDURE MakeSamples

PROCEDURE MakeSamples

BEGIN

PROCEDURE Reducerate INTEGER rate

BEGIN

INTEGER UPval size

ADDRESS addr UPeltAddr

UsetLOAD tSUB

READ addr FROM NodeAddressTableStartnode

UsetLOAD tADD tSUB tMULT tSUB tADD tIF

UPeltAddr addr size ratesize index

IF UPeltAddr addr THEN

BEGIN

READ UPval FROM UPeltAddr

WRITE UPval TO ThisAddr

ie to corresponding SUPaddress

END

ELSE

ErrorSUP couldn't fetch UPelement

ENDPROCEDURE Reduce

INTEGER CONSTANT tReduce tLOAD tSUB tR

tLOAD tADD tSUB tMULT tSUB tADD tIF tR tW

INTEGER SampleRate

StoreArrayAddressesUP

UsetIF tLOAD tAND

IF ProcessorType SUP AND active THEN

BEGIN

UsetCONST tSTORE tIF tIF tSTORE

SampleRate

IF NOT InsideNodeTHEN

BEGIN

IF ExternalState THEN SampleRate

IF ExternalState THEN SampleRate

END

ReduceSampleRate

END

ELSE

WaittCONST tSTORE tIF tIF tSTORE tReduce

ENDPROCEDURE MakeSamples
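Reduce lets each active SUP processor fetch one UP element at a stride of SampleRate. Done sequentially, taking every rate-th element of an array can be sketched as follows (a sketch assuming samples sit at positions rate, 2*rate, ... of the UP array, as in Cole's sampling; the function name is mine):

```python
def make_sample(up, rate):
    # keep every rate-th element: positions rate, 2*rate, ...
    # (1-based), i.e. indices rate-1, 2*rate-1, ... (0-based)
    return up[rate - 1 :: rate]
```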

includeOrderMergeproc

PROCEDURE CopyNUPtoUP

PROCEDURE CopyNUPtoUP

Copies the NUP array made in the previous stage to the UP array

for this stage Performed by the UPprocessors

BEGIN

ADDRESS addr For Up array elements this variable represents the

address of the corresponding NUP array element in the

previous stage

INTEGER val

Observation Copying of NUP to UP should only be done at inside nodes

except at stage when it should also be done at external nodes

because these nodes were inside nodes in the previous stage

UsetIF tLOAD tAND

tLOAD tSUB tDIV tJZERO tLOAD tOR

IF ProcessorType UP AND active AND REMStage OR InsideNodeTHEN

BEGIN

UsetLOAD tADD

READ addr FROM NodeAddressTableStartnode

UsetLOAD tADD tSUB tSTORE

addr addrindex

READ val FROM addr

WRITE val TO ThisAddr

END

ELSE

WaittLOAD tADD tSUB tSTORE tREAD tREAD tWRITE

At the end of the main loop the array addresses are stored for the

NUParrays in the previous stage When CopyNUPtoUP is called at the

start of this stage the NodeAddressTable therefore gives

the mapping from nodeno to NUPprocessor as the NUP processors were

allocated to the NUParrays in the previous stage It is not necessary

to use the NodeSizeTable since the UParray size for a node in stage i

is the same as the size of NUParray in the same node in stage i

Thus the information found in the NodeAddressTable is enough for an UP

processor to find its corresponding NUPprocessor from the previous stage

Transfer RankInLeftSUP from NUP to UP processors

UsetIF

IF ProcessorType NUP THEN

StoreProcessorValueRankInLeftSUP See Coleproc

ELSE

WaittStoreVal

UsetIF tLOAD tSUB tADD assumes that the result of the

subcondition evaluated above was stored locally

IF ProcessorType UP AND active AND REMStage OR InsideNode THEN

READ RankInLeftSUP FROM ExchangeAreaStartaddrUPstart

ELSE

WaittREAD

Transfer RankInRightSUP from NUP to UP processors

UsetIF

IF ProcessorType NUP THEN

StoreProcessorValueRankInRightSUP

ELSE

WaittStoreVal

UsetIF tLOAD tLOAD assumes that results were stored locally above

IF ProcessorType UP AND active AND REMStage OR InsideNode THEN

READ RankInRightSUP FROM ExchangeAreaStartaddrUPstart

ELSE

WaittREAD

ENDPROCEDURE CopyNUPtoUP

B Coleproc

Coleproc

procedures used in O merging only

INTEGER PROCEDURE RateInNextStage

BEGIN

INTEGER rate

INTEGER ExtState

InsideNode in this Stage implies InsideNode in next Stage OR

first stage as external node in next stage In both these cases

the sample rate is

UsetIF tLOAD tADD tDIV tSTORE tIF tSTORE

IF InsideNode THEN

rate

ELSE

BEGIN

NOT InsideNode in this Stage implies NOT InsideNode in next stage

ExtState REMStage

IF ExtState THEN rate corresponds to ExtState

IF ExtState THEN rate

END

RateInNextStage rate

ENDPROCEDURE RateInNextStage
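RateInNextStage encodes Cole's sampling rule: an inside node always samples every 4th item, while a node that has become external samples every 4th, then every 2nd, then every item during its remaining active stages. A hedged sketch of that rule (the names and the external-state encoding below are assumptions, not taken from the listing):

```python
def sample_rate(inside_node, external_state):
    # inside nodes: every 4th item; external nodes step
    # through the rates 4, 2, 1 in their three final stages
    if inside_node:
        return 4
    return {0: 4, 1: 2}.get(external_state, 1)
```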

INTEGER CONSTANT tRateInNextStage tIF

tLOAD tADD tDIV tSTORE tIF tSTORE

INTEGER PROCEDURE RateInNextStageBelow

Computes the sample rate that will be used on the level below this in the next stage

BEGIN

INTEGER rate

INTEGER ExtState

UsetIF tLOAD tSUB tLOAD tADD tDIV tSTORE

tIF tSTORE

IF LAL level THEN also the level below must be internal

rate

ELSE

BEGIN

ExtState REMStage

IF ExtState THEN rate corresponds to ExtState

IF ExtState THEN rate

END

RateInNextStageBelow rate

ENDPROCEDURE RateInNextStageBelow

INTEGER CONSTANT tRateInNextStageBelow tIF tLOAD tSUB

tLOAD tADD tDIV tSTORE tIF tSTORE

INTEGER PROCEDURE CompareWithval val val val

INTEGER val val val val

BEGIN

NOTE It is ASSUMED that val val val

INTEGER pos

UsetIF tIF tSTORE

IF val val THEN

BEGIN

IF val val THEN

pos

ELSE

pos

END

ELSE

BEGIN

IF val val THEN

pos

ELSE

pos

END

CompareWith pos

ENDPROCEDURE CompareWith
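CompareWith locates a value relative to three sorted values and returns its offset, used as LocalOffset by the callers. Assuming the returned position counts how many of the three values are smaller than val, a Python sketch is:

```python
def compare_with(val, v1, v2, v3):
    # assumes v1 <= v2 <= v3; returns how many of
    # v1..v3 are smaller than val (a value in 0..3)
    if val <= v2:
        return 0 if val <= v1 else 1
    return 2 if val <= v3 else 3
```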

INTEGER CONSTANT tCompareWith tIF tIF tSTORE

B Merging in Constant Time

B OrderMergeproc

OrderMergeproc

PROCEDURE MergeWithHelp

BEGIN

INTEGER SLrankInParentNUP SL is short for StraddleLeft

INTEGER SRrankInParentNUP SR is short for StraddleRight

These must be declared here since they are computed

in DoMerge and used in MaintainRanks

INTEGER CONSTANT tTryCandidates

tLOAD tSUB tADD tSTORE tIF

tREAD tIF tSUB tAND tLOAD tADD tWRITE

tLOAD tSUB tADD tSTORE

tREAD tIF tLOAD tADD tWRITE

tLOAD tADD tSTORE tLOAD tSUB tADD tSTORE tIF

tLOAD tADD tWRITE

tWRITE tLOAD tADD tSTORE tIF tLOAD tADD

INTEGER CONSTANT tSubStep tIF

tLOAD tSHIFT tSTORE tLOAD tSTORE tADD

tStoreVal tIF tLOAD tSUB tADD tR tR tIF

tR tLOAD tADD tR tTryCandidates

INTEGER CONSTANT tfindrandt tLOAD tSHIFT tADD tR

tLOAD tADD tSUB tSTORE tLOAD tADD

tR tLOAD tSHIFT tADD tR

tLOAD tADD tIF tLOAD tADD tR

INTEGER CONSTANT tsscase tIF tStoreVal

tIF tLOAD tSHIFT tJZERO tfindrandt

INTEGER CONSTANT tsscase tIF tStoreVal

tIF tLOAD tSHIFT tJNEG tfindrandt

INTEGER CONSTANT tReadValues tIF tABS tLOAD tSUB tADD

tR tIF tADD tLOAD tR tIF tADD tLOAD tR

INTEGER CONSTANT tSubStep tIF tLOAD tADD

tR tStoreArray tsscase tsscase

tIF tR tStoreVal

tIF tLOAD tSHIFT tSUB tReadValues

tSTORE tADD tSTORE tCompareWith

INTEGER CONSTANT tComputeCrossRanks tIF tLOAD tSTORE

tIF tLOAD tSUB tJZERO tSubStep tSubStep

includeDoMergeproc

INTEGER CONSTANT tDoMerge tDoMergePart tDoMergePart

includeMaintainRanksproc

BEGIN BODY of MergeWithHelp

Main Assumption Ranks UPu SUPv and UPu SUPw are known

these are stored in RankInLeftSUP and RankInRightSUP

BEGIN

StoreArrayAddressesSUP

StoreArraySizesSUP

There is no need to perform DoMerge in Stage or If not convinced

study an example

Assumes that this test is combined with similar test in Colepil mainprog

IF Stage OR Stage THEN

BEGIN

UsetIF tLOAD tJZERO tAND tAND

IF ProcessorType UP AND size AND active AND InsideNodeOR

ProcessorType SUP AND size AND active AND node OR

ProcessorType NUP AND size AND active AND InsideNode THEN

DoMerge

ELSE

WaittDoMerge

END

END

There is no need to perform MaintainRanks in Stage and

RankInLeftRightSUP are needed for the first time when the size of UPu

is and the size of SUPuchild This occurs for the first time in

Stage Consider an example

UsetIF tSUB tJNEG

IF Stage OR Stage THEN

BEGIN

MaintainRanks

END

ENDPROCEDURE MergeWithHelp

B DoMergeproc

DoMergeproc

PROCEDURE DoMerge

BEGIN

NOTE that StoreArraySizesSUP and StoreArrayAddressesSUP are done

just before the call to DoMerge in MergeWithHelp

This is assumed in SubStep

INTEGER RankInParentNUP RankInThisSUP RankInOtherSUP

INTEGER SLindex SL d and SR f in fig p in Cole

INTEGER SRindex These are the indices of the straddling items in the

other SUP array sibling They are computed in

procedure ComputeCrossRanks

PROCEDURE ComputeCrossRanks

BEGIN

INTEGER RankInParentUP

PROCEDURE SubStep

PROCEDURE SubStep performed by active inside UPprocessors

BEGIN

INTEGER r s i i

PROCEDURE TryCandidatesr addr

INTEGER r ADDRESS addr

BEGIN

INTEGER CandidateVal

ADDRESS SUPelementAddr points to the SUParray element

in global memory whose RankInParentUP should be set to the

index position of own element of the executing UPprocessor

The assigned value is passed indirectly by using the

ExchangeArea The address of an element in the ExchangeArea

for a processor p is given by the address of its element in the

UPSUPNUP array the value of NoOfProcessors

See figure refWorkAreas

itemr

UsetLOAD tSUB tADD tSTORE tIF

SUPelementAddr addr r

IF r THEN

BEGIN

READ CandidateVal FROM SUPelementAddr

UsetIF tLOAD tADD

IF CandidateVal i THEN

WRITE index TO SUPelementAddrNoOfProcessors

ELSE

WaittWRITE

END

ELSE

WaittREAD tIF tLOAD tADD tWRITE

Note that r and s may be the same position In that case we must

have that s i since r s r i and i i Therefore the

element in position r s has rank b in UPu it is i

This case is handled as itemr above

items

UsetLOAD tSUB tADD tSTORE

SUPelementAddr addr s

READ CandidateVal FROM SUPelementAddr

UsetIF tSUB tAND tLOAD tADD

IF CandidateVal i AND CandidateVal i THEN

WRITE index TO SUPelementAddrNoOfProcessors

ELSE

WaittWRITE

itemr

UsetLOAD tADD tSTORE

tLOAD tSUB tADD tSTORE tIF tLOAD tADD

rr

SUPelementAddr addr r

IF r s THEN

WRITE index TO SUPelementAddrNoOfProcessors

ELSE

WaittWRITE

itemr

UsetLOAD tADD tSTORE tIF tLOAD tADD

rr

SUPelementAddr SUPelementAddr

IF r s THEN

WRITE index TO SUPelementAddrNoOfProcessors

ELSE

WaittWRITE

ENDPROCEDURE TryCandidates

tTryCandidates is defined in OrderMergeproc

Substep starts here

for each element e in SUPv compute its rank in UPu

Find the interval ii induced by this UPelement Then find the

items in SUPv contained in ii Assign ranks to these elements

The interval is given by the values i and i where i is the value

stored at ThisAddr for this UParray processor and i is the value

stored at ThisAddr if not index size for the same UParray

processor If index size ThisAddr is the last element for this

UParray element i INFTY and s the position of the last

element in SUPv

See refSubstepDescr in cotex

UsetIF

IF ProcessorType UP THEN

BEGIN

See Figure refSubstep in cotex

ADDRESS addr

INTEGER LeftChild

UsetLOAD tSHIFT tSTORE tLOAD tSTORE

LeftChild node

r RankInLeftSUP

StoreProcessorValueRankInLeftSUP

The NodeSizeTable should now contain the size of the SUParray

for each node Similarly the addresses of the SUParrays in the

NodeAddressTable See the start of the body of MergeWithHelp

UsetIF tLOAD tSUB tADD

IF index size THEN

READ s FROM ExchangeAreaStartp

ELSE Right neighbour processor does not exist lets point to the

last element in SUPv

READ s FROM NodeSizeTableStartLeftChild

READ i FROM ThisAddr

UsetIF

IF index size THEN

READ i FROM ThisAddr

ELSE

BEGIN

UsetREAD i INFTY

END

UsetLOAD tADD

READ addr FROM NodeAddressTableStartLeftChild

TryCandidatesr addr

END of ProcessorType UP

ELSE

WaittLOAD tSHIFT tSTORE tLOAD tSTORE tStoreVal

tIF tLOAD tSUB tADD tR tR tIF

tR tLOAD tADD tR tTryCandidates

Do the same for each element e in SUPw Note that this

computation must be done in two phases one for SUPv and

one for SUPw which cannot be done in parallel since the

work in both phases is done by the UPu processors which

are common for SUPv and SUPw

for each element e in SUPw compute its rank in UPu

Assumes that this test is combined with the one above

IF ProcessorType UP THEN

BEGIN

ADDRESS addr

INTEGER RightChild

UsetLOAD tSHIFT tADD tSTORE tLOAD tSTORE

RightChild node

r RankInRightSUP

StoreProcessorValueRankInRightSUP

The NodeSizeTable should still contain the size of the

SUParray for each node Similarly the addresses of the

SUParrays in the NodeAddressTable

UsetIF tLOAD tSUB tADD

IF index size THEN

READ s FROM ExchangeAreaStartp

ELSE

READ s FROM NodeSizeTableStartRightChild

READ i FROM ThisAddr

UsetIF

IF index size THEN

READ i FROM ThisAddr

ELSE

BEGIN

UsetREAD i INFTY

END

UsetLOAD tADD

READ addr FROM NodeAddressTableStartRightChild

TryCandidatesr addr

END of ProcessorType UP

ELSE

WaittLOADtSHIFTtADDtSTORE tLOADtSTORE tStoreVal

tIF tLOAD tSUB tADD tR tR tIF

tR tLOAD tADD tR tTryCandidates

ENDPROCEDURE SubStep

tSubStep is defined in DoMergeproc

PROCEDURE SubStep

PROCEDURE SubStep

BEGIN

for each element e in SUPvw compute its rank in SUPwv

see fig Cole p and refSubstep in cotex

INTEGER r t

PROCEDURE findrandt

BEGIN

INTEGER dProcNo ParentUPaddr ParentSize

find the processor no which holds d

UsetLOAD tSHIFT tADD

READ ParentUPaddr FROM NodeAddressTableStartnode

UsetLOAD tADD tSUB tSTORE tLOAD tADD

dProcNo ParentUPaddr RankInParentUP UPstart

RankInLeftRightSUP is now stored in ExchangeArea

READ r FROM ExchangeAreaStartdProcNo

UsetLOAD tSHIFT tADD

READ ParentSize FROM NodeSizeTableStartnode

UsetLOAD tADD tIF tLOAD tADD

IF RankInParentUP ParentSize THEN

BEGIN

tsize

UsetREAD

END

ELSE

READ t FROM ExchangeAreaStartdProcNo

ENDPROCEDURE findrandt

tfindrandt is defined in OrderMergeproc

UsetIF tLOAD tADD

IF ProcessorType SUP THEN

READ RankInParentUP FROM ExchangeAreaStartp

ELSE

WaittREAD

StoreArrayAddressesUP

StoreArraySizesUP

Case e in SUPv find rank in SUPw

tsscase starts here

UsetIF

IF ProcessorType UP THEN

StoreProcessorValueRankInRightSUP

ELSE

WaittStoreVal

UsetIF tLOAD tSHIFT tJZERO

IF ProcessorType SUP AND REMnode THEN

BEGIN

left node SUPv

findrandt

END

ELSE

Waittfindrandt

tssCase ends here

Case e in SUPw find rank in SUPv

tsscase starts here

UsetIF

IF ProcessorType UP THEN

StoreProcessorValueRankInLeftSUP

ELSE

WaittStoreVal

UsetIF tLOAD tSHIFT tJNEG

IF ProcessorType SUP AND REMnode THEN

BEGIN

right node SUPw

findrandt

END

ELSE

Waittfindrandt

tssCase ends here

The rest of Case and Case is handled in parallel

UsetIF

IF ProcessorType SUP THEN

BEGIN

Remember that node has no sibling node

INTEGER val val val val LocalOffset

READ val FROM ThisAddr

StoreProcessorValueval

BEGIN

PROCEDURE ReadValuessize

INTEGER size

BEGIN

UsetIF tABS tLOAD tSUB tADD tADD

IF r ABSsize THEN

for the ABS see the second call to ReadValues below

BEGIN

r corresponds to the value INFTY

UsetREAD val INFTY

END

ELSE

READ val FROM ThisAddrindexsizer

UsetIF tADD tLOAD Assumes that the long address

expression was stored locally in a register above

IF r t THEN

READ val FROM ThisAddrindexsizer

ELSE

BEGIN

val INFTY UsetREAD

END

UsetIF tADD tLOAD

IF r t THEN

READ val FROM ThisAddrindexsizer

ELSE

BEGIN

val INFTY UsetREAD

END

ENDPROCEDURE ReadValues

UsetIF tLOAD tSHIFT tSUB

IF REMnode THEN

ReadValuessize this is left child read from right child

ELSE

ReadValuessize this is right child read from left child

END

Use tSTORE tADD tSTORE

LocalOffset CompareWithval val val val

RankInOtherSUP r LocalOffset

Store the positions in the other SUP of the two straddling items

these values are needed in MaintainRanks

Use tSTORE tADD tSTORE

SLindex RankInOtherSUP

SRindex RankInOtherSUP

END

ELSE

WaittREAD tStoreVal tIF tLOAD tSHIFT tSUB

tReadValues tSTORE tADD tSTORE

tCompareWith

ENDPROCEDURE SubStep

Body of ComputeCrossRanks

BEGIN BODY of ComputeCrossRanks

UsetIF tLOAD tSTORE

IF ProcessorType SUP THEN

RankInThisSUP index

Compute rank in other SUP

see also the processor selection before the call to DoMerge and

ComputeCrossRanks

UsetIF tLOAD tSUB tJZERO

IF ProcessorType SUP AND size THEN

BEGIN

Boundary case

Since the size of the SUParray in this node there is no

UParray in the parent node to help us in computing the rank

in the other SUP as done by the routines SubStep and SubStep

However since the SUParrays are so small it is easy to compute

this rank in constant time by using only one comparison

INTEGER ThisVal OtherVal

ADDRESS OtherAddr

UsetIF tSHIFT tLOAD tADD

IF REMnode THEN

READ OtherAddr FROM NodeAddressTableStartnode

ELSE

READ OtherAddr FROM NodeAddressTableStartnode

READ OtherVal FROM OtherAddr

READ ThisVal FROM ThisAddr

UsetIF tSTORE

IF ThisVal OtherVal THEN

RankInOtherSUP

ELSE

RankInOtherSUP

UsetLOAD tSTORE tADD tSTORE

SLindex RankInOtherSUP

SRindex RankInOtherSUP

Note the sign below

WaittSubStep tSubStep

tIF tSHIFT tLOAD tADD

tR tR tR tIF tSTORE

tLOAD tSTORE tADD tSTORE

END of ProcessorType SUP and size

ELSE

BEGIN

SubStep

SubStep

END

ENDPROCEDURE ComputeCrossRanks

Body of DoMerge

BEGIN body of DoMerge

We want the position of each SUPelement in the NUParray of its

parent node This position is computed and stored in the variable

RankInParentNUP It is given by the sum of the rank of the element in

SUPv and SUPw The procedure ComputeCrossRanks computes these

values and they are stored in RankInThisSUP and RankInOtherSUP upon

return from the procedure

ADDRESS WriteAddress

see also the processorselection before the call to DoMerge

tDoMergePart starts here

UsetIF

IF ProcessorType UP OR ProcessorType SUP THEN

ComputeCrossRanks

ELSE

WaittComputeCrossRanks

UsetIF tLOAD tADD tSTORE

IF ProcessorType SUP THEN

RankInParentNUP RankInThisSUP RankInOtherSUP

The values SLrankInParentNUP and SRrankInParentNUP are only used

in MaintainRanks which is not called before Stage and not in

Stage

UsetIF tLOAD tSUB tJZERO tLOAD tSUB tJNEG

IF ProcessorType SUP AND Stage OR Stage THEN

BEGIN

Compute SLrankInParentNUP and SRrankInParentNUP

Knows SLindex SRindex and RankInParentNUP for all SUPprocessors

StoreProcessorValueRankInParentNUP

Note that the padding used here is not robust be careful if modifying

UsetIF

IF SLindex THEN

BEGIN

UsetSTORE

SLrankInParentNUP

WaittIF tLOADtSHIFT tLOADtSUBtADDtSTORE tR

tSTORE

END

ELSE

BEGIN

INTEGER SLaddr

Compute processor no address corresponding to SLindex

NoOfProcessors is added because we want the address

in the ExchangeArea

UsetIF tLOADtSHIFT tLOADtSUBtADDtSTORE

IF REMnode THEN This is a left child mark

SLaddr ThisAddr index size SLindex NoOfProcessors

ELSE This is a right child

SLaddr ThisAddr index size SLindex NoOfProcessors

READ SLrankInParentNUP FROM SLaddr

END

UsetIF

IF SRindex size THEN

BEGIN

UsetLOAD tSHIFT tADD tSTORE

SRrankInParentNUP size

WaittIF tLOADtSHIFT tLOADtSUBtADDtSTORE tR

tLOAD tSHIFT tADD tSTORE

END

ELSE

BEGIN

INTEGER SRaddr

UsetIF tLOADtSHIFT tLOADtSUBtADDtSTORE

IF REMnode THEN

SRaddr ThisAddr index size SRindex NoOfProcessors

ELSE

SRaddr ThisAddr index size SRindex NoOfProcessors

READ SRrankInParentNUP FROM SRaddr

END

END of ProcessorType SUP AND Stage

ELSE

WaittStoreVal

tIF tIF tLOADtSHIFT

tLOADtSUBtADDtSTORE tR

DO THE MERGE

tDoMergePart starts here

StoreArrayAddressesNUP

UsetIF

IF ProcessorType SUP THEN

BEGIN

ADDRESS ParentNUPaddr

INTEGER val

copy the SUP element to its right position in the NUP array

of its parent node

READ val FROM ThisAddr

UsetLOAD tSHIFT tADD

READ ParentNUPaddr FROM NodeAddressTableStartnode

UsetLOAD tADD tSUB tSTORE

WriteAddress ParentNUPaddrRankInParentNUP

RankInThisSUP for the first element in a set If RankInOtherSUP

we will have RankInParentNUP which is the correct position of the

element because it is in the set We must subtract to get the correct

address above as usual because ParentNUPaddr is the address of element

no and not the address of element no which does not exist

WRITE val TO WriteAddress

END

ELSE

WaittREAD tLOAD tSHIFT tADD tREAD

tLOAD tADD tSUB tSTORE tWRITE

Record the SenderAddress of each NUPelement in the ExchangeArea

The information is used in MaintainRanks A negative SenderAddress

indicates that the item is from the left child a positive address

indicates that it is from the right child

assumed combined with the if test above

IF ProcessorType SUP THEN

BEGIN

INTEGER SenderAddress

UsetLOAD tSTORE

SenderAddress index

Use the address in the ExchangeArea corresponding to the processor

allocated to the NUPelement just written

Use tIF tSHIFT tLOAD tADD tSUB

IF REMnode THEN node is left child

WRITE SenderAddress TO WriteAddressNoOfProcessors

ELSE

WRITE SenderAddress TO WriteAddressNoOfProcessors

END of ProcessorType SUP

ELSE

WaittLOAD tSTORE tIF tSHIFT

tLOAD tADD tSUB tWRITE

ENDPROCEDURE DoMerge
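DoMerge places each SUP element at the position given by the sum of its rank in its own SUP array and its rank in the sibling SUP array. The underlying idea, merging two sorted arrays by cross ranks, can be sketched sequentially in Python (the function name is mine, and bisect stands in for the parallel rank computation done with the parent's UP array as help):

```python
import bisect

def merge_by_cross_ranks(a, b):
    # final position of each item = its own index
    # + its rank in the other sorted array
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        out[i + bisect.bisect_left(b, x)] = x
    for j, y in enumerate(b):
        out[j + bisect.bisect_right(a, y)] = y
    return out
```

The left/right asymmetry of bisect_left and bisect_right resolves ties so that every output slot is filled exactly once.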

INTEGER CONSTANT tDoMergePart tIF tIF tLOAD tADD tSTORE

tIF tLOAD tSUB tJZERO tLOAD tSUB tJNEG

tComputeCrossRanks tStoreVal tIF tIF tLOADtSHIFT

tLOADtSUBtADDtSTOREtR

INTEGER CONSTANT tDoMergePart tStoreArray tIF

tREAD tLOAD tSHIFT tADD tREAD

tLOAD tADD tSUB tSTORE tWRITE

tLOADtSTORE tIFtSHIFT tLOADtADDtSUB tWRITE

B MaintainRanksproc

MaintainRanksproc

PROCEDURE MaintainRanks

BEGIN

Computes the ranks UPu SUPv and UPu SUPw for the next

stage This is NUPu NewSUPv and NUPu NewSUPw for this

stage These ranks are stored in the variables RankInLeftSUP and

RankInRightSUP in the NUP processors on return from the procedure

See Cole p

INTEGER SourceProcNo Every NUParray element corresponds to a

SUParray element this variable gives the SUParray processor no

The variable is declared here because it is computed in the first

of the following two procedures and used in both of them

PROCEDURE RankInNewSUPfromSenderNode

PROCEDURE RankInNewSUPfromSenderNodeSender

INTEGER Sender

BEGIN

INTEGER RankInThisNUP

The two cases that the element e came from SUPv or SUPw

are handled in parallel

compute UPv NUPv

UsetIF

IF ProcessorType UP THEN

NOTE that active NUP and UP processors at the same node occur at

stage when MaintainRanks is not performed and at stage and all

later stages BUT if NUPnode does not exist as active processors

any longer the reason is that UPnode has reached its maximum size

and since NUPnode UPnode there is no need to store more than

UPnode and it is possible to compute UPvNUPv even if NUPv

does not exist

BEGIN

UsetIF tLOAD tSUB tpower tLOAD tADD tSTORE

IF size powerm level THEN

BEGIN

UPv NUPv

RankInThisNUP index

END

ELSE

BEGIN

RankInThisNUP RankInLeftSUP RankInRightSUP

END

StoreProcessorValueRankInThisNUP mark

Store the address of the UParrays needed below

StoreArrayAddressesUP

END of ProcessorType UP

ELSE

WaittIF tLOAD tSUB tpower tLOAD tADD tSTORE

tStoreVal tStoreArray

tPart ends tPart starts

for each element in SUPv or SUPw find the corresponding

element in UPv

What element in UP is this SUP element a copy of

UsetIF

IF ProcessorType SUP THEN

BEGIN

ADDRESS UPaddr

INTEGER UPeltIndex

INTEGER rate

INTEGER UParraySize

INTEGER RankInThisNewSUP

compute samplerate which was used in making SUP from UP

UsetSTORE tIF tIF tSTORE

rate

IF NOT InsideNode THEN

BEGIN

IF ExternalState THEN rate

IF ExternalState THEN rate

END

find UParray of same node

UsetLOAD tADD

READ UPaddr FROM NodeAddressTableStartnode

find the position in UParray corresponding to this SUPelement

UsetLOAD tSUB tSHIFT tADD tSTORE

UPeltIndex index rate

The processor no of the UParray element corresponding to this

SUP element is UPaddr UPeltIndex UPstart

The ExchangeArea now contains UPv NUPv see mark

UsetLOAD tADD tSUB

READ RankInThisNUP FROM ExchangeAreaStartUPaddrUPeltIndexUPstart

We now have for each SUPx element

RankInThisNUP SUPxNUPx for x v or x w

NewSUP is made from NUP under the names SUP from UP in the next

stage we must compute the sample rate that will be used in that stage

rate RateInNextStage

NewSUPx is every rateth element in NUPx

UsetIF tLOAD tSUB tSHIFT tADD tSTORE

IF rate THEN

RankInThisNewSUP RankInThisNUP rate

this is SUPx NewSUPx save this value and the address

of the SUParrays to make it possible for the NUP processors

to get the correct value below

StoreProcessorValueRankInThisNewSUP

StoreArrayAddressesSUP

END of ProcessorType SUP

ELSE

Wait tSTORE tIF tIF tSTORE tLOAD tADD tR

tLOAD tSUB tSHIFT tADD tSTORE

tLOAD tADD tSUB tR tRateInNextStage

tIF tLOAD tSUB tSHIFT tADD tSTORE

tStoreVal tStoreArray

tPart ends here

Let NUPu processors get their value of the rank

NUPu NewSUPvw if their NUPelt came from SUPvw

UsetIF

IF ProcessorType NUP THEN

BEGIN

ADDRESS SUPxAddr

UsetIF

IF Sender THEN

BEGIN from left child

UsetLOAD tSHIFT tADD

READ SUPxAddr FROM NodeAddressTableStartnode

UsetLOAD tSUB tSTORE tADD

SourceProcNo SUPxAddr Sender UPstart

READ RankInLeftSUP FROM ExchangeAreaStartSourceProcNo

END

ELSE

BEGIN

IF Sender THEN Error sender

from right child

UsetLOAD tSHIFT tADD

READ SUPxAddr FROM NodeAddressTableStartnode

UsetLOAD tSUB tSTORE tADD

SourceProcNo SUPxAddr Sender UPstart

READ RankInRightSUP FROM ExchangeAreaStartSourceProcNo

END

END

ELSE

WaittIF tLOAD tSHIFT tADD tREAD

tLOAD tSUB tSTORE tADD tREAD

Remember that it was the values for the next stage of

RankInLeftRightSUP which were computed above

ENDPROCEDURE RankInNewSUPfromSenderNode

INTEGER CONSTANT tPart tIF

tIF tLOAD tSUB tpower tLOAD tADD tSTORE

tStoreVal tStoreArray

INTEGER CONSTANT tPart

tIF tSTORE tIF tIF tSTORE tLOAD tADD tR

tLOAD tSUB tSHIFT tADD tSTORE

tLOAD tADD tSUB tR tRateInNextStage

tIF tLOAD tSUB tSHIFT tADD tSTORE

tStoreVal tStoreArray

INTEGER CONSTANT tRankInNewSUPfromSenderNode tPart tPart

tIF tIF tLOAD tSHIFT tADD tREAD

tLOAD tSUB tSTORE tADD tREAD

PROCEDURE RankInNewSUPfromOtherNode

PROCEDURE RankInNewSUPfromOtherNodeSender

INTEGER Sender

BEGIN

INTEGER r t

INTEGER LocalOffset

PROCEDURE ComputeLocalOffsetOtherNUPaddr rate

ADDRESS OtherNUPaddr

INTEGER rate

BEGIN

INTEGER OwnVal val val val

Get the maximum elements in positions r t and t

See fig Cole p

This procedure is similar to ReadValues in DoMerge but the address

calculation is different because we must use rate to obtain the

correct mapping between NewSUPx and NUPx see mark below

Item x is read from position OtherNUPaddr xrate

UsetLOAD tSHIFT tADD

READ val FROM OtherNUPaddrrrate

UsetADD tIF tLOAD tSHIFT tADD

IF r t THEN

BEGIN

READ val FROM OtherNUPaddrrrate

END

ELSE

BEGIN

UsetREAD val INFTY

END

UsetADD tIF tLOAD tSHIFT tADD

IF r t THEN

BEGIN

READ val FROM OtherNUPaddrrrate

END

ELSE

BEGIN

UsetREAD val INFTY

END

READ OwnVal FROM ThisAddr

LocalOffset CompareWithOwnVal val val val

ENDPROCEDURE ComputeLocalOffset

INTEGER dRank fRank

INTEGER NUPvAddr NUPwAddr

INTEGER NUPvANDwSize

The elements d and f are the two elements from the other SUParray

which straddle e e this element See fig in Cole p

dRank is the rank of the element d in NUPu similarly for fRank

These values are computed in the procedure DoMerge In that

procedure they are stored in the variables SLrankInParentNUP and

SRrankInParentNUP

Pass first dRank then fRank from the SUP to the NUPprocessors

SourceProcNo was calculated above in RankInNewSUPfromSenderNode

tpart starts here

UsetIF

IF ProcessorType SUP THEN

StoreProcessorValueSLrankInParentNUP

ELSE

WaittStoreVal

UsetIF tLOAD tADD

IF ProcessorType NUP THEN

READ dRank FROM ExchangeAreaStartSourceProcNo

ELSE

WaittREAD

UsetIF

IF ProcessorType SUP THEN

StoreProcessorValueSRrankInParentNUP

ELSE

WaittStoreVal

UsetIF tLOAD tADD

IF ProcessorType NUP THEN

READ fRank FROM ExchangeAreaStartSourceProcNo

ELSE

WaittREAD

This code is needed because NewSUPx does not exist Instead we

find a given NewSUPx item in NUPx by considering the sample

rate that will be used in the next stage one level below to make

NewSUPx from NUPx

In the code below see mark and mark every NUPprocessor

must know the address of the NUParrays if they exist of its two

children If they do not exist the address of the UParrays of the

children must be known

If there are active NUP processors in node x there are active NUP

processors in the left and right child of x if the size of the NUP

array at node x is smaller than MaxNUPsizexlevel

Assumes that size and address of NUP procs are stored

in the Size and Address Table see mark below

tpart starts here

UsetIF tLOAD tSUB tpower tSHIFT tSUB tJNEG

IF ProcessorType = NUP AND size < power(m - level) THEN
BEGIN
  NUP-array should exist in right and left child
  Use(tIF + tLOAD + tSHIFT + tADD)
  IF Sender THEN
    READ NUPvAddr FROM NodeAddressTableStart + node
  ELSE
    READ NUPwAddr FROM NodeAddressTableStart + node
  Use(tLOAD)  Assumes use of offset computed above
  READ NUPvANDwSize FROM NodeSizeTableStart + node
END
ELSE
  Wait(tIF + tLOAD + tSHIFT + tADD + tR + tLOAD + tR)
StoreArrayAddresses(UP)
StoreArraySizes(UP)
Use(tIF)  Assumed combined with the IF statement above
IF ProcessorType = NUP AND size >= power(m - level) THEN
BEGIN
  UP-array should exist in right and left child;
  UP-array is used as NUP array
  Use(tIF + tLOAD + tSHIFT + tADD)
  IF Sender THEN
    READ NUPvAddr FROM NodeAddressTableStart + node
  ELSE
    READ NUPwAddr FROM NodeAddressTableStart + node
  Use(tLOAD)
  READ NUPvANDwSize FROM NodeSizeTableStart + node
END
ELSE
  Wait(tIF + tLOAD + tSHIFT + tADD + tR + tLOAD + tR)

tpart3 starts here
Use(tIF)
IF ProcessorType = NUP THEN
BEGIN
  INTEGER rate
  PROCEDURE findrandt
  BEGIN
    Knows that d is to the left of e (the position of e is index),
    and that f is to the right of e; d, e and f are all inside the
    same NUP(x) array.
    The address of the first element in this array
    for this node is ExchangeAreaStart + p - index. Therefore
    the address of element no. pos is pos
    added to this value.
    Use(tIF + tLOAD + tADD + tSUB + tADD)
    IF dRank > 0 THEN
      READ r FROM ExchangeAreaStart + p - index + dRank
    ELSE
    BEGIN
      Use(tREAD)
      r := 0
    END
    Use(tIF + tLOAD + tSUB + tADD)  Assumed combined with above
    IF fRank < size (size of own NUP-array) THEN
      READ t FROM ExchangeAreaStart + p - index + fRank
    ELSE
    BEGIN
      t := NUPvANDwSize // rate
      Wait(tREAD)  Assumed not simpler than the THEN-clause
    END
  ENDPROCEDURE findrandt

  r and t are the ranks of d and f in the NewSUP for the same
  node as d and f came from (SUP). These values have just been
  computed in procedure RankInNewSUPfromSenderNode.
  The rate used in the next stage has been computed before
  by the SUP-processors in procedure RankInNewSUPfromSenderNode.
  However, in this case we need the rate that will be used in the
  next stage in the CHILD nodes of this node.
  Use(tSTORE)
  rate := RateInNextStageBelow
  StoreProcessorValue(RankInLeftSUP)

  Use(tIF)
  IF Sender THEN
  BEGIN
    This is from right child (SUP(w)); d and f from the left
    findrandt
  END
  ELSE
    Wait(tfind)
  StoreProcessorValue(RankInRightSUP)
  Use(tIF)
  IF Sender THEN
  BEGIN
    This is from left child (SUP(v)); d and f from the right
    findrandt
  END
  ELSE
    Wait(tfind)

  Use(tIF)
  IF Sender THEN
  BEGIN  (mark)
    r and t point to elements in NewSUP(v). This is the case
    shown in Cole's figure. Note that NewSUP does not exist yet.
    However, we know that NewSUP(v) will be made by the procedure
    MakeSamples from NUP(v) in the next stage. Thus the positions r and t
    in NewSUP(v) implicitly give the positions in NUP(v) for the
    elements which must be compared with e.
    If the sample rate used to make NewSUP from NUP is rate, the position
    of the element corresponding to r in NUP(v) is given by r*rate.
    NUPvAddr was read above (mark).
    ComputeLocalOffset(NUPvAddr, rate)
  END  of Sender
  ELSE
  BEGIN
    IF Sender THEN
    BEGIN
      r and t point to elements in NewSUP(w);
      see comments for the case Sender above.
      NUPwAddr was read above (mark).
      ComputeLocalOffset(NUPwAddr, rate)
    END
    ELSE
      Error("Should not occur")
  END

END  of ProcessorType = NUP
ELSE
  Wait(tSTORE + tRateInNextStageBelow
       + tStoreVal + tIF + tfind + tIF + tComputeLocalOffset)
Can be appended to IF-test above
IF ProcessorType = NUP THEN
BEGIN
  Use(tIF + tLOAD + tADD + tSTORE)
  IF Sender THEN
  BEGIN
    This item is from left; other child is right
    RankInRightSUP := r + LocalOffset
  END
  ELSE
  BEGIN
    This item is from right; other child is left
    RankInLeftSUP := r + LocalOffset
  END
END
ELSE
  Wait(tIF + tLOAD + tADD + tSTORE)

ENDPROCEDURE RankInNewSUPfromOtherNode
INTEGER CONSTANT tpart1 = tIF + tStoreVal + tIF + tLOAD + tADD
                          + tREAD
INTEGER CONSTANT tpart2 =
    tIF + tLOAD + tSUB + tpower + tSHIFT + tSUB + tJNEG
    + tIF + tLOAD + tSHIFT + tADD + tR + tLOAD + tR
    + tStoreArray + tIF
    + tIF + tLOAD + tSHIFT + tADD + tR + tLOAD + tR
INTEGER CONSTANT tComputeLocalOffset = tLOAD + tSHIFT + tADD
    + tR + tADD + tIF + tLOAD + tSHIFT + tADD
    + tR + tADD + tIF + tLOAD + tSHIFT + tADD
    + tR + tR + tCompareWith
INTEGER CONSTANT tfind = tIF + tLOAD + tADD + tSUB + tADD
    + tREAD + tIF + tLOAD + tSUB + tADD + tREAD
INTEGER CONSTANT tpart3 = tIF
    + tSTORE + tRateInNextStageBelow + tStoreVal + tIF + tfind
    + tIF + tComputeLocalOffset
    + tIF + tLOAD + tADD + tSTORE
INTEGER CONSTANT tRankInNewSUPfromOtherNode = tpart1 + tpart2
    + tpart3
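The sampling relationship exploited above can be restated compactly: NewSUP(x) does not exist yet, but it will be made from NUP(x) by regular sampling, so a rank r in NewSUP maps to position r*rate in NUP. The following Python fragment is purely illustrative (the function names are not part of the PIL listing):

```python
def make_samples(nup, rate):
    """Form NewSUP from NUP by regular sampling: every rate-th element."""
    return nup[rate - 1::rate]

def position_in_nup(r, rate):
    """A (1-indexed) rank r in NewSUP corresponds to position r * rate in NUP."""
    return r * rate

nup = [2, 3, 5, 7, 11, 13, 17, 19]
new_sup = make_samples(nup, 4)            # [7, 19]
# the element at rank 1 of NewSUP is the element at position 1*4 of NUP
assert nup[position_in_nup(1, 4) - 1] == new_sup[0]
```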

Body of MaintainRanks
BODY OF MaintainRanks
It is only necessary to perform MaintainRanks for nodes with a NUP-array
which is smaller than the maximum size of NUP-arrays at the same level,
i.e., MergeWithHelp must be performed also in the next stage. This maximum
size is given by power(m - level).
INTEGER SenderAddr
Use(tIF + tAND + tLOAD + tSUB + tpower + tSUB + tJNEG)
IF ProcessorType = NUP AND active AND size < power(m - level) THEN
BEGIN
  read sender address of this element
  Use(tLOAD + tADD)
  READ SenderAddr FROM ExchangeAreaStart + p
END
ELSE
BEGIN
  SenderAddr := 0  this is done for robustness
  Wait(tLOAD + tADD + tREAD)
END
Use(tIF + tAND + tOR + tLOAD + tAND + tOR)
Assumed combined with the test above and simplified
IF (ProcessorType = UP AND active) OR
   (ProcessorType = SUP AND active AND size < power(m - level)) OR
   (ProcessorType = NUP AND active AND size < power(m - level)) THEN
BEGIN
  RankInNewSUPfromSenderNode(SenderAddr)
END
ELSE
  Wait(tRankInNewSUPfromSenderNode)
StoreArrayAddresses(NUP)  (mark)
StoreArraySizes(NUP)
Use(tIF)  Assumes that the result of the same test was stored above
IF (ProcessorType = UP AND active) OR
   (ProcessorType = SUP AND active AND size < power(m - level)) OR
   (ProcessorType = NUP AND active AND size < power(m - level)) THEN
BEGIN
  RankInNewSUPfromOtherNode(SenderAddr)
END  of processor-selection
ELSE
  Wait(tRankInNewSUPfromOtherNode)
The ranks NUP(u) in NewSUP(v) are stored in RankInLeftSUP for the
NUP processors, and similarly in RankInRightSUP for NUP(u) in NewSUP(w).
The addresses of the NUP-arrays are stored at the end of this stage (see
the main loop), and this info is used by the UP-processors to read (copy)
these ranks in the procedure CopyNUPtoUP at the start of the next stage.
ENDPROCEDURE MaintainRanks
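What MaintainRanks achieves, stripped of the PRAM timing machinery, is that each NUP(u) element learns its rank in the (future) NewSUP arrays of the two children. A sequential Python sketch of the result is given below; it describes the outcome, not the constant-time-per-processor computation above, and bisect is used only for brevity:

```python
import bisect

def rank(x, sorted_arr):
    """Rank of x in a sorted array: the number of elements <= x."""
    return bisect.bisect_right(sorted_arr, x)

def maintain_ranks(nup_u, new_sup_v, new_sup_w):
    """For each element of NUP(u), its rank in the left child's NewSUP(v)
    and in the right child's NewSUP(w) -- cf. RankInLeftSUP / RankInRightSUP."""
    return [(rank(e, new_sup_v), rank(e, new_sup_w)) for e in nup_u]

ranks = maintain_ranks([5, 10], [1, 6, 8], [4, 9, 12])
assert ranks == [(1, 1), (3, 2)]
```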

Appendix C

Batcher's Bitonic Sorting in PIL

This is a complete listing of Batcher's bitonic sorting algorithm [Bat, Akl],
implemented in PIL for execution on the CREW PRAM simulator.

C.1 The Main Program

Bito.pil
Bitonic sorting with detailed time modelling.
A bitonic sorting network is emulated by using n processors to act
as the active column of comparators in each pass during each stage.
Author: Lasse Natvig. Email: lasse@idt.unit.no
Last changed (improvements):
Copyright (C) Lasse Natvig, IDT/NTH, Trondheim, Norway
#include "Sorting"
PROCESSORSET ComparatorColumn
INTEGER n, m
ADDRESS addr
BEGINPILALG

FOR m TO DO

BEGIN

n := power(m)
addr := GenerateSortingInstance(n, RANDOM)
ClockReset
Use(tPUSH)
PUSH(addr)
PUSH(n)
ASSIGN n PROCESSORS TO ComparatorColumn
FOREACH PROCESSOR IN ComparatorColumn
BEGIN
  INTEGER n, m, Stage, Pass
  ADDRESS Addr, UpInAddr, LowInAddr
  INTEGER UpInVal, LowInVal
  INTEGER GlobalNo  Global Comparator No.
  INTEGER sl  sequence length, is always a power of two
  PROCEDURE ComputeInAddresses
  BEGIN
    INTEGER sn  sequence number
    ADDRESS sa  sequence address
    INTEGER LocalNo  Local Comparator No. inside this sequence

    Use(tIF + MAX(tLOAD + tpower + tSTORE,
                  tIF + tLOAD + tSHIFT + tSTORE))
    IF Pass = 1 THEN
      sl := power(Stage)
    ELSE IF Pass > 1 THEN
      sl := sl // 2
    Use(tLOAD + tSHIFT + tSHIFT + tSTORE
        + tLOAD + tSHIFT + tADD + tSTORE
        + tLOAD + tSHIFT + tMOD + tSTORE)
    sn := GlobalNo // sl
    sa := Addr + sn*sl
    LocalNo := MOD(GlobalNo, sl)
    Use(tIF + MAX(tLOAD + tADD + tSTORE
                  + tLOAD + tSHIFT + tADD + tSTORE,
                  tLOAD + tSHIFT + tSUB + tJNEG + tSTORE
                  + tIF + MAX(tLOAD + tSHIFT + tADD + tSTORE
                              + tLOAD + tSHIFT + tADD + tADD + tSTORE,
                              tLOAD + tSHIFT + tADD + tSTORE
                              + tLOAD + tSHIFT + tSUB + tSTORE)))
    This way of modelling the time used gives easy analysis, because every call
    to the procedure ComputeInAddresses takes the same amount of time.
    Since the variable Pass always will have the same value for all processors,
    the case Pass = 1 occurs for all processors simultaneously. Therefore a
    faster execution might be obtained by placing appropriate Use statements
    in the THEN and ELSE part of IF Pass = 1.
    However, for the test IF UpperPart, time modelling with
    Use(tIF + MAX(ThenClause, ElseClause)) is most appropriate, since we
    simultaneously have processors in both branches and must wait for
    the slowest branch to finish.
    IF Pass = 1 THEN
    BEGIN
      UpInAddr := sa + LocalNo
      LowInAddr := UpInAddr + sl // 2
    END
    ELSE
    BEGIN
      BOOLEAN UpperPart
      UpperPart := LocalNo < sl // 2
      IF UpperPart THEN
      BEGIN
        UpInAddr := sa + LocalNo
        LowInAddr := UpInAddr + sl // 2
      END
      ELSE
      BEGIN
        LowInAddr := sa + LocalNo
        UpInAddr := LowInAddr - sl // 2
      END
    END
  ENDPROCEDURE ComputeInAddresses
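The time-modelling rule discussed in the comments above (charge MAX over the branches when processors sit in both branches at once, but only the taken branch when all processors agree) can be summarised as a tiny cost function. The constants below are hypothetical stand-ins; the real instruction times are simulator parameters defined elsewhere in the thesis.

```python
# Hypothetical instruction time (the real value is a simulator parameter)
tIF = 1

def if_cost_synchronous(t_then, t_else):
    """Processors are in both branches simultaneously:
    everyone must wait for the slower branch to finish."""
    return tIF + max(t_then, t_else)

def if_cost_uniform(t_taken):
    """All processors take the same branch: only that branch is charged."""
    return tIF + t_taken

assert if_cost_synchronous(5, 3) == 6
assert if_cost_uniform(3) == 4
```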

PROCEDURE Comparator

BEGIN

  PROCEDURE Swap(a, b)
  NAME a, b; INTEGER a, b
  BEGIN
    INTEGER Temp
    Temp := a; a := b; b := Temp
  ENDPROCEDURE Swap
  INTEGER CONSTANT tSwap = tLOAD + tSTORE
  INTEGER Type  See Fig. in Selim G. Akl's book on parallel sorting
  BEGIN  Compute type
    INTEGER Interval
    Use(tLOAD + tSUB + tpower + tSTORE
        + tIF + tLOAD + tSHIFT + tSTORE)
    Interval := power(Stage)  this is from Akl's book
    IF IsEven(GlobalNo // Interval) THEN
      Type := 0
    ELSE
      Type := 1
  END
  Use(tIF + tIF + tSwap)
  IF Type = 0 THEN
  BEGIN
    IF LowInVal < UpInVal THEN
      Swap(LowInVal, UpInVal)
  END
  ELSE
  BEGIN
    IF LowInVal > UpInVal THEN
      Swap(LowInVal, UpInVal)
  END
  Use(tLOAD + tSHIFT + tADD)

WRITE UpInVal TO AddrGlobalNo

WRITE LowInVal TO AddrGlobalNo

ENDPROCEDURE Comparator

MAIN PROGRAM

SetUp

  Use(tLOAD)
  READ n FROM UserStackPtr
  Use(tSUB)
  READ Addr FROM UserStackPtr
  Use(tLOAD + tlog + tSTORE + tLOAD + tSUB + tSTORE)
  m := log(n)
  GlobalNo := p
  MAIN LOOP
  Use(tSTORE + tIF)
  FOR Stage := 1 TO m DO
  BEGIN
    Use(tSTORE + tIF)
    FOR Pass := 1 TO Stage DO
    BEGIN
      ComputeInAddresses
      READ UpInVal FROM UpInAddr
      READ LowInVal FROM LowInAddr
      Comparator
      Use(tLOAD + tADD + tSTORE + tIF)
    ENDFOR
    Use(tLOAD + tADD + tSTORE + tIF)
  ENDFOR
END
ENDFOREACH PROCESSOR

TTITIOP bitonic n ClockRead

TTIP bitonicCost n OUTTEXT

OUTINTnClockRead OUTIMAGE

TTO

FreeProcs(ComparatorColumn)
BEGIN
  INTEGER dummy  to avoid warning
  dummy := POP
  dummy := POP
END

ENDFOR

ENDPILALG
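The stage/pass structure of the listing (m = log2(n) stages, stage s consisting of s passes, with comparator direction alternating between blocks) can be emulated sequentially. The following Python sketch follows the standard textbook formulation of Batcher's network rather than the PIL variable names, so details such as the address computation differ from the listing above:

```python
def bitonic_sort(a):
    """Sequential emulation of Batcher's bitonic sorting network."""
    n = len(a)
    assert n and n & (n - 1) == 0, "the network assumes n is a power of two"
    k = 2
    while k <= n:                 # one 'stage' per doubling of sorted-run length
        j = k // 2
        while j >= 1:             # one 'pass' per halving of comparator distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0   # direction alternates per k-block
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

assert bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]) == [1, 2, 3, 4, 5, 6, 7, 8]
```

Every comparison in one inner `for` loop is independent of the others, which is exactly why each pass can be executed in one step by a column of comparator processors, giving the O(log^2 n) parallel time analysed in the thesis.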

Bibliography

[AAN] Kjell Øystein Arisland, Anne Cathrine Aasbø, and Are Nundal. VLSI parallel shift sort algorithm and design. INTEGRATION, the VLSI journal.
[AC] Loyce M. Adams and Thomas W. Crockett. Modeling Algorithm Execution Time on Processor Arrays. IEEE Computer, July.
[ACG] Sudhir Ahuja, Nicholas Carriero, and David Gelernter. Linda and Friends. IEEE Computer, August.
[AG] George S. Almasi and Allan Gottlieb. Highly Parallel Computing. The Benjamin/Cummings Publishing Company, Inc.
[Agg] Alok Aggarwal. A Critique of the PRAM Model of Computation. In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[AHMP] H. Alt, T. Hagerup, K. Mehlhorn, and F. P. Preparata. Deterministic Simulation of Idealized Parallel Computers on More Realistic Ones. In Mathematical Foundations of Computer Science, Bratislava, Czechoslovakia. Lecture Notes in Computer Science.
[AHU] A. Aho, J. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, Reading, Massachusetts.
[AHU] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. Data Structures and Algorithms. Addison-Wesley Publishing Company, Reading, Massachusetts.
[Akl] Selim G. Akl. Parallel Sorting Algorithms. Academic Press Inc., Orlando, Florida.
[Akl] Selim G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall International Inc., Englewood Cliffs, New Jersey.
[AKS] M. Ajtai, J. Komlós, and E. Szemerédi. An O(n log n) sorting network. Combinatorica.

[Amd] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computer capabilities. In AFIPS Joint Computer Conference Proceedings.
[And] Richard Anderson. The Complexity of Parallel Algorithms. PhD thesis, Stanford.
[Bas] A. Basu. Parallel processing systems: a nomenclature based on their characteristics. IEE Proceedings, May.
[Bat] K. E. Batcher. Sorting networks and their applications. In Proc. AFIPS Spring Joint Computer Conference.
[BCK] R. Bagrodia, K. M. Chandy, and E. Kwan. UC: A Language for the Connection Machine. In Proceedings of SUPERCOMPUTING, New York, November.
[BDHM] Dina Bitton, David J. DeWitt, David K. Hsiao, and Jaishankar Menon. A Taxonomy of Parallel Sorting. Computing Surveys, September.
[BDMN] Graham M. Birtwistle, Ole-Johan Dahl, Bjørn Myhrhaug, and Kristen Nygaard. SIMULA BEGIN, second edition. Van Nostrand Reinhold Company, New York.
[Ben] Jon L. Bentley. Programming Pearls. Addison-Wesley Publishing Company, Reading, Massachusetts.
[BH] A. Borodin and J. E. Hopcroft. Routing, Merging and Sorting on Parallel Models of Computation. Journal of Computer and System Sciences.
[Bil] Gianfranco Bilardi. Some Observations on Models of Parallel Computation. In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[Bir] Graham M. Birtwistle. DEMOS: A System for Discrete Event Modelling on Simula. The MacMillan Press Ltd., London.
[Boo] Grady Booch. Software Engineering With ADA. The Benjamin/Cummings Publishing Company, Inc.
[CF] David Cann and John Feo. SISAL versus FORTRAN: A Comparison Using the Livermore Loops. In Proceedings of SUPERCOMPUTING, New York, November.
[CG] Nicholas Carriero and David Gelernter. Applications experience with Linda. In Proc. ACM/SIGPLAN Symp. on Parallel Programming.
[CM] K. Mani Chandy and Jayadev Misra. Parallel Program Design: A Foundation. Addison-Wesley Publishing Company, Reading, Massachusetts.

[Col] Richard Cole. Parallel Merge Sort. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS).
[Col] Richard Cole. Parallel Merge Sort. Technical Report, Computer Science Department, New York University, March.
[Col] Richard Cole. Parallel Merge Sort. SIAM Journal on Computing, August.
[Col] Richard Cole. The APRAM: Incorporating Asynchrony into the PRAM Model. In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[Col] Richard Cole, New York University. Private communication, March and April.
[Coo] S. A. Cook. An Observation on Time-Storage Tradeoff. Journal of Computer and System Sciences.
[Coo] S. A. Cook. Towards a complexity theory of synchronous parallel computation. L'Enseignement Mathématique, XXVII.
[Coo] Stephen A. Cook. An Overview of Computational Complexity. Communications of the ACM, June. Turing Award Lecture.
[Coo] Stephen A. Cook. The Classification of Problems which have Fast Parallel Algorithms. In Foundations of Computation Theory, Proceedings of the International FCT-Conference, Borgholm, Sweden, August. Lecture Notes in Computer Science, Berlin: Springer Verlag.
[Coo] Stephen A. Cook. A Taxonomy of Problems with Fast Parallel Algorithms. Inform. and Control. This is a revised version of [Coo].
[DeC] Angel L. DeCegama. The technology of parallel processing: Parallel processing architectures and VLSI hardware, Volume I. Prentice-Hall Inc., Englewood Cliffs, New Jersey.

[Dem] Howard B. Demuth. Electronic Data Sorting. IEEE Transactions on Computers, April.
[DH] Hans Petter Dahle and Per Holm. SIMULA User's Guide for UNIX. Technical report, Lund Software House AB, Lund, Sweden.
[DKM] C. Dwork, P. C. Kanellakis, and J. C. Mitchell. On the Sequential Nature of Unification. Journal of Logic Programming.
[DLR] D. Dobkin, R. J. Lipton, and S. Reiss. Linear programming is log-space hard for P. Information Processing Letters.
[Dun] Ralph Duncan. A Survey of Parallel Computer Architectures. IEEE Computer, February.
[Fei] Dror G. Feitelson. Optical Computing: A Survey for Computer Scientists. The MIT Press, London, England.
[Fis] David C. Fisher. Your Favorite Parallel Algorithms Might Not Be as Fast as You Think. IEEE Transactions on Computers.
[Flo] R. W. Floyd. Algorithm 97: shortest path. Communications of the ACM. The algorithm is based on the theorem in [War].
[Fly] Michael J. Flynn. Very High-Speed Computing Systems. In Proceedings of the IEEE, December.
[Frea] Karen A. Frenkel. Complexity and Parallel Processing: An Interview With Richard Karp. Communications of the ACM, February.
[Freb] Karen A. Frenkel. Special Issue on Parallelism. Communications of the ACM, December.
[FT] Ian Foster and Stephen Taylor. Strand: New Concepts in Parallel Programming. Prentice-Hall Inc., Englewood Cliffs, New Jersey.
[FW] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. In Proceedings of the ACM Symposium on Theory of Computing (STOC). ACM, New York, May.
[Gel] David Gelernter. Domesticating Parallelism. IEEE Computer, August. Guest Editor's Introduction.
[Gib] Phillip B. Gibbons. Towards Better Shared Memory Programming Models. In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[GJ] Michael R. Garey and David S. Johnson. COMPUTERS AND INTRACTABILITY: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., New York. If you find this book difficult, you might try [Wil] or [RS].
[GMB] John L. Gustafson, Gary R. Montry, and Robert E. Benner. Development of Parallel Methods for a 1024-Processor Hypercube. SIAM Journal on Scientific and Statistical Computing, July.

[Gol] L. M. Goldschlager. The monotone and planar circuit value problems are log space complete for P. SIGACT News.
[GR] Alan Gibbons and Wojciech Rytter. Efficient Parallel Algorithms. Cambridge University Press, Cambridge.
[GS] Mark Goldberg and Thomas Spencer. A new parallel algorithm for the maximal independent set problem. SIAM Journal on Computing, April.
[GSS] Leslie M. Goldschlager, Ralph A. Shaw, and John Staples. The maximum flow problem is log space complete for P. Theor. Comput. Sci., October.
[Gup] Rajiv Gupta. Loop Displacement: An Approach for Transforming and Scheduling Loops for Parallel Execution. In Proceedings of SUPERCOMPUTING, New York, November.
[Gus] John L. Gustafson. Reevaluating Amdahl's Law. Communications of the ACM, May.
[Hag] Marianne Hagaseth. Very fast PRAM sorting algorithms (preliminary title). Master's thesis, Division of Computer Systems and Telematics, The Norwegian Institute of Technology, February. In preparation.
[Hal] Robert Halstead. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages, October.
[Har] David Harel. ALGORITHMICS: The Spirit of Computing. Addison-Wesley Publishing Company, Reading, Massachusetts.
[HB] Kai Hwang and Faye A. Briggs. Computer Architecture and Parallel Processing. McGraw-Hill Book Company, New York.
[Hil] W. Daniel Hillis. The Connection Machine. Scientific American, June.
[Hil] W. Daniel Hillis. The Fastest Computers. Keynote Address at the SUPERCOMPUTING conference, New York, November. Notes from the presentation.
[HJ] R. W. Hockney and C. R. Jesshope. Parallel Computers: architecture, programming and algorithms, 2nd edition. Adam Hilger, Bristol.
[HM] David P. Helmbold and Charles E. McDowell. Modeling Speedup(n) greater than n. In Proceedings of the International Conference on Parallel Processing, vol. III.
[Hoa] C. A. R. Hoare. Quicksort. Computer Journal.
[Hop] John E. Hopcroft. Computer Science: The Emergence of a Discipline. Communications of the ACM, March. Turing Award Lecture.
[HS] J. Hartmanis and R. E. Stearns. On the computational complexity of algorithms. Trans. Amer. Math. Soc., May.
[HS] W. Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Communications of the ACM, December.
[HT] Per Holm and Magnus Taube. SIMDEB User's Guide for UNIX. Technical report, Lund Software House AB, Lund, Sweden.
[Hud] Paul Hudak. Para-Functional Programming. IEEE Computer, August.
[HWG] Michael Heath, Patrick Worley, and John Gustafson. Once Again, Amdahl's Law (and Author's Response). Communications of the ACM, February. Technical correspondence from Heath and Worley, with the response from Gustafson.

[JL] N. Jones and W. Laaser. Complete problems for deterministic polynomial time. Theoretical Computer Science.
[Kan] Paris C. Kanellakis. Logic Programming and Parallel Complexity. In Jack Minker, editor, Foundations of Deductive Databases and Logic Programming. Morgan Kaufmann Publishers Inc., Los Altos, California.
[Kar] Richard M. Karp. Combinatorics, Complexity, and Randomness. Communications of the ACM, February. Turing Award Lecture.
[Kar] Richard M. Karp. A Position Paper on Parallel Computation. In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[Kha] L. G. Khachian. A polynomial time algorithm for linear programming. Dokl. Akad. Nauk SSSR. In Russian; translated to English in Soviet Math. Dokl.
[KM] Ken Kennedy and Kathryn S. McKinley. Loop Distribution with Arbitrary Control Flow. In Proceedings of SUPERCOMPUTING, New York, November.
[Knu] D. E. Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley Publishing Company, Reading, Massachusetts.
[KR] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language. Prentice-Hall Inc., New Jersey.
[KR] Richard M. Karp and Vijaya Ramachandran. A Survey of Parallel Algorithms for Shared-Memory Machines. Technical Report, Computer Science Division, University of California, Berkeley, March.
[Kru] J. B. Kruskal. On the shortest spanning subtree of a graph and the travelling salesman problem. Proceedings of the American Mathematical Society.
[Kru] Clyde P. Kruskal. Efficient Parallel Algorithms: Theory and Practice. In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[Kuc] David J. Kuck. A Survey of Parallel Machine Organization and Programming. ACM Computing Surveys, March.
[Kun] H. T. Kung. Why Systolic Architectures? IEEE Computer, January.
[Kun] H. T. Kung. Computational models for parallel computers. In Scientific applications of multiprocessors. Prentice-Hall, Englewood Cliffs, New Jersey.
[Lad] R. E. Ladner. The circuit value problem is log space complete for P. SIGACT News.
[Lan] Asgeir Langen. Parallelle programmeringsspråk (Parallel programming languages). Master's thesis, Division of Computer Systems and Telematics, The Norwegian Institute of Technology, The University of Trondheim, Norway. In Norwegian.
[Law] Eugene L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, New York.
[Lei] Tom Leighton. Tight Bounds on the Complexity of Parallel Sorting. In Proceedings of the Annual ACM Symposium on Theory of Computing, May. ACM, New York.
[Lei] Tom Leighton. What is the Right Model for Designing Parallel Algorithms? In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[Lub] Boris D. Lubachevsky. Synchronization barrier and tools for shared memory parallel programming. In Proceedings of the International Conference on Parallel Processing, vol. II.
[Lun] Stephen F. Lundstrom. The Future of Supercomputing: The Next Decade and Beyond. Technical Report, November. PARSA. Documentation from tutorial on SUPERCOMPUTING.

[Mag] Bruce Maggs. Beyond parallel random-access machines. In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[Man] Heikki Mannila. Measures of Presortedness and Optimal Sorting Algorithms. IEEE Transactions on Computers, April.
[MB] Peter Moller and Kallol Bagchi. A Design Scheme for Complexity Models of Parallel Algorithms. In Proceedings of the IEEE Annual Parallel Processing Symposium, California.
[McG] Robert J. McGlinn. A Parallel Version of Cook and Kim's Algorithm for Presorted Lists. Software: Practice and Experience, October.
[MH] Charles E. McDowell and David P. Helmbold. Debugging Concurrent Programs. ACM Computing Surveys.
[MK] J. Miklosko and V. E. Kotov, editors. Algorithms, Software and Hardware of Parallel Computers. Springer Verlag, Berlin.
[MLK] G. Miranker, L. Tang, and C. K. Wong. A "Zero-Time" VLSI Sorter. IBM J. Res. Develop., March.
[Moh] Joseph Mohan. Experience with Two Parallel Programs Solving the Traveling Salesman Problem. In Proceedings of the International Conference on Parallel Processing. IEEE, New York.
[MV] Kurt Mehlhorn and Uzi Vishkin. Randomized and Deterministic Simulations of PRAMs by Parallel Machines with Restricted Granularity of Parallel Memories. Acta Informatica.
[Nat] Lasse Natvig. Comparison of Some Highly Parallel Sorting Algorithms With a Simple Sequential Algorithm. In Proceedings of NIK (Norsk Informatikk Konferanse, Norwegian Informatics Conference), Stavanger, November.
[Nata] Lasse Natvig. Investigating the Practical Value of Cole's O(log n) Time CREW PRAM Merge Sort Algorithm. In Proceedings of The Fifth International Symposium on Computer and Information Sciences (ISCIS V), Cappadocia, Nevsehir, Turkey, October/November.
[Natb] Lasse Natvig. Logarithmic Time Cost Optimal Parallel Sorting is Not Yet Fast in Practice! In Proceedings of SUPERCOMPUTING, New York, November.
[Natc] Lasse Natvig. Parallelism Should Make Programming Easier. Presentation at the Workshop On The Understanding of Parallel Computation, Edinburgh, July.
[OC] Rodney R. Oldehoeft and David C. Cann. Applicative Parallelism on a Shared-Memory Multiprocessor. IEEE Software.
[Par] D. Parkinson. Parallel efficiency can be greater than unity. Parallel Computing.
[Par] Ian Parberry. Parallel Complexity Theory. John Wiley and Sons Inc., New York.
[Pat] M. S. Paterson. Improved sorting networks with O(log n) depth. Technical report, Dept. of Computer Science, University of Warwick, England. Research Report RR.
[Pip] N. Pippenger. On simultaneous resource bounds (preliminary version). In Proc. IEEE Foundations of Computer Science.
[Pol] Constantine D. Polychronopoulos. PARALLEL PROGRAMMING AND COMPILERS. Kluwer Academic Publishers, London.
[Poo] R. J. Pooley. An Introduction to Programming in SIMULA. Blackwell Scientific Publications, Oxford, England.
[QD] Michael J. Quinn and Narsingh Deo. Parallel Graph Algorithms. ACM Computing Surveys, September.
[Qui] Michael J. Quinn. Designing Efficient Algorithms for Parallel Computers. McGraw-Hill Book Company, New York.
[RBJ] Abhiram G. Ranade, Sandeep N. Bhatt, and S. Lennart Johnsson. The Fluent Abstract Machine. In Proceedings of the Fifth MIT Conference on Advanced Research in VLSI.
[Rei] J. Reif. Depth first search is inherently sequential. Information Processing Letters.
[Rei] Rüdiger K. Reischuk. Parallel Machines and their Communication Theoretical Limits. In STACS.
[Ric] D. Richards. Parallel sorting: a bibliography. ACM SIGACT News.
[RS] V. J. Rayward-Smith. A First Course in Computability. Blackwell Scientific Publications, Oxford.
[RS] John H. Reif and Sandeep Sen. Randomization in parallel algorithms and its impact on computational geometry. In LNCS.
[Rud] Larry Rudolph, Hebrew University, Israel. Private communication, May and June.
[RV] John H. Reif and Leslie G. Valiant. A Logarithmic Time Sort for Linear Size Networks. Journal of the ACM, January.

[Sab] Gary W. Sabot. The Paralation Model: Architecture-Independent Parallel Programming. The MIT Press, Cambridge, Massachusetts.
[Sag] Carl Sagan. Kosmos. Universitetsforlaget, Norwegian edition.
[San] Sandia National Laboratories. Scientists set speedup record, overcome barrier in parallel computing. News Release from Sandia National Laboratories, Albuquerque, New Mexico, March.
[Sana] Jorge L. C. Sanz, editor. Opportunities and Constraints of Parallel Computing. Springer-Verlag, London. Papers presented at the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing, San Jose, California, December. ARC: IBM Almaden Research Center; NSF: National Science Foundation.
[Sanb] Jorge L. C. Sanz. The Tower of Babel in Parallel Computing. In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[SG] Sartaj Sahni and Teofilo Gonzales. P-Complete Approximation Problems. Journal of the ACM, July.
[Sha] Ehud Shapiro. Concurrent Prolog: A Progress Report. IEEE Computer, August.
[SIA] SIAM. Both Gordon Bell Prize Winners Tackle Oil Industry Problems, May. SIAM: The Society for Industrial and Applied Mathematics.
[SN] Xian-He Sun and Lionel M. Ni. Another View on Parallel Speedup. In Proceedings of SUPERCOMPUTING, New York, November.
[Sni] Marc Snir. Parallel Computation Models: Some Useful Questions. In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[Sny] Lawrence Snyder. A Taxonomy of Synchronous Parallel Machines. In Proceedings of the International Conference on Parallel Processing.
[Spi] Paul Spirakis. The Parallel Complexity of Deadlock Detection. In Mathematical Foundations of Computer Science, Bratislava, Czechoslovakia. Lecture Notes in Computer Science.
[Ste] Guy L. Steele. Making Asynchronous Parallelism Safe for the World. In Proceedings of POPL.
[SV] Yossi Shiloach and Uzi Vishkin. Finding the Maximum, Merging, and Sorting in a Parallel Computation Model. Journal of Algorithms.
[SV] Larry Stockmeyer and Uzi Vishkin. Simulation of Parallel Random Access Machines by Circuits. SIAM Journal on Computing, May.
[Tar] R. E. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing.
[Tha] Peter Thanisch, Department of Computer Science, University of Edinburgh. Private communication, July.
[Tiw] Prasoon Tiwari. Lower Bounds on Communication Complexity in Distributed Computer Networks. Journal of the ACM, October.
[Upf] Eli Upfal. A Probabilistic Relation Between Desirable and Feasible Models of Parallel Computation. In Proceedings of the ACM Symposium on Theory of Computation (STOC).
[Val] L. Valiant. Parallelism in comparison problems. SIAM Journal on Computing.
[Val] Leslie G. Valiant. A Bridging Model for Parallel Computation. In Proceedings of the Computer Conference, Charleston, California, April.
[Vis] Uzi Vishkin. PRAM Algorithms: Teach and Preach. In Proceedings of the NSF/ARC Workshop on Opportunities and Constraints of Parallel Computing [Sana].
[War] Stephen Warshall. A Theorem on Boolean Matrices. Journal of the ACM. This and [Flo] are commonly considered as the source for the famous Floyd-Warshall algorithm.
[Wil] J. W. J. Williams. Heapsort (Algorithm 232). Communications of the ACM.
[Wil] Herbert S. Wilf. Algorithms and Complexity. Prentice-Hall International Inc., Englewood Cliffs, New Jersey.
[Wir] Niklaus Wirth. Algorithms + Data Structures = Programs. Prentice-Hall, Englewood Cliffs, New Jersey.
[WW] K. Wagner and G. Wechsung. Computational Complexity. D. Reidel Publishing Company, Dordrecht, Holland.
[Wyl] J. C. Wyllie. The Complexity of Parallel Computations. PhD thesis, Dept. of Computer Science, Cornell University.
[Zag] Marco Zagha, Carnegie Mellon University. Private communication, November.
[ZG] Xiaofeng Zhou and John Gustafson. Bridging the Gap Between Amdahl's Law and Sandia Laboratory's Result (and Author's Response). Communications of the ACM, August. Technical correspondence from Zhou with the response from Gustafson.

Index

Bound continued cover

upp er or lower in Coles algorithm

Bridging mo del

Abstract machine

Brute force techniques

Activation of pro cessors

Builtin pro cedures

Active levels

Bulk synchronous parallel mo del

in Coles algorithm

Active no des

Cell

in Coles algorithm

description of

Active pro cessors

in systolic array

in Coles algorithm

Channel

ADDRESS

for integers

ADDRESS

for reals

AKS sorting network

in systolic array

Allo cation

pro cedures for using it

of memory

of pro cessors

CHECK

Amdahl reasoning

CHECK

Amdahls law

CHECK

using it in practice

checking for synchrony

Analysis

Circuit mo del

asymptotical

Circuit value problem

ARAM

CVP

ArrayPrint

ClockRead

ASSIGN statement

ClockReset

ASSIGN statement

Coles algorithm

placement of

in PPP

Asymptotic complexity

main principles

Asymptotical analysis

time consumption

Average case

Coles merging pro cedure

Coles parallel merge sort algorithm

Barrier synchronisation

Batchers bitonic sorting

Coles parallel merge sort

Best case

implementation

BigO

Comments

BigOmega

in PIL

Bitonic sorting

Communicationbetween pro cessors

time consumption

Blo ck structure

using channels

error in

Compilation of PIL programs

Blo ck

Compile time padding

variables lo cal in

Complexity class Bound

Crossranks Complexity

in Coles algorithm computational

CVP

descriptional

Complexity function

Complexity theory

Data collection facilities

in DEMOS parallel

debug command Computational mo del

ComputeCrossRanks

debug command

in Coles algorithm

Debugging

ComputeWhoIsWho

Debugging PIL programs

Declaration in Coles algorithm

pro cessor set

Computing mo del

define

Computing resources

denition of macro

Conditional compilation

Delayed output

CONSTANT

in systolic array

Co oks theorem

DEMOS

CopyNewUpToUp

DFSorder problem

in Coles algorithm

DoMerge

Cost

in Coles algorithm

comparison

DRAM

Cost optimal algorithm

Duncans taxonomy

cpp

cpp error message

Eciency

CRAM

Ecient

CRCW PRAM

Enco ding scheme

variants

END FOR

CREW PRAM

FOR EACH PROCESSOR END

clo ck p erio d

statement

global memory

END PROCEDURE

global memory usage

EREW PRAM

input and output

Error

instruction set

Error in blo ck structure

lo cal memory

Error message

CREW PRAM mo del

from the SIMULA compiler

CREW PRAM

from the simulator

processor allocation

from cpp

processor number

processor set

from PILfix

CREW PRAM processors

Events

CREW PRAM program

ordering of

CREW PRAM programming

Execution

CREW PRAM simulator

start of

Execution time

clock

Exponential time algorithm

installation

interaction with

Fast parallel algorithms

performance

File

real numbers

inclusion of

system requirements

versions of

time

Floyd's algorithm

CREW PRAM

Flynn's taxonomy

synchronous operation

FOR EACH PROCESSOR IN statement

time unit

link command

FOR loop

Freeing

link command

of memory

link command

Local events

Gap

Local memory

between theory and practice

CREW PRAM

Local variables

GENERABILITY problem

log

GFLOPS

Loop

Global events

FOR

Global memory

WHILE

access count

Lower bound

Global memory access pattern

Global memory access statistics

Macro definition

Global memory

MaintainRanks

access to

in Cole's algorithm

CREW PRAM

MakeSamples

insufficient

in Cole's algorithm

reading from

Massively parallel computer

time used by access

Massive parallelism

write conflict

Master processor

writing to

MCVP

Global reads

MemN Reads

number of

MemN Writes

Global variables

Memory

Global writes

allocation

number of

Memory conflict

Good sampler

Memory

in Cole's algorithm

freeing

global

Halting problem

local

Memory usage

ifdef

comparison

include

MemUserFree

file inclusion

MemUserMalloc

INFTY

MergeWithHelp

Inherently serial problem

in Cole's algorithm

Input length

MIMD

Insertion sort

MISD

parallel

Model

Instance

bridging

of problem

mathematical convenience

Instruction set

of computation

CREW PRAM

Model of parallel computation

Insufficient global memory

Model

Inverted Amdahl reasoning

performance representative

Inverted Amdahl's law

realistic

IsEven

trade-offs

IsOdd

Modelling time consumption

MRAM

Length of phase

in systolic array

NC

Line numbers

Linear speedup may be misleading

PIL

NC-algorithm

DEMOS features

practical value

example programs

NC-reduction

line number

Nick's Class

linking

robustness

output

Nondeterministic algorithm

Parallel Intermediate

NoOfProcs

Language

in Cole's algorithm

preprocessor

NP

procedures

NPC

PIL program

NP-complete problem

debugging

NP-completeness

development approach

Number in processor set, p

executing

parallel

Odd-even transposition sort

sequential

PIL

Optimal algorithm

SIMULA features

Optimal parallel sorting algorithm

variables

pilc command

ORAM

pilc command

Order notation

pilc command

PILfix

P

pilh

p

Pipelined computer

Padding

Pipelined merging

for making programs

Polylogarithmic running time

synchronous

Polynomial time algorithm

Parallel complexity theory

Polynomial transformation

Parallel computation model

POP

Parallel insertion sort

power

Parallel programming

PPP

development approach

PPP (Parallel Pseudo Pascal)

the PIL language

Parallel Pseudo Pascal

PRAC

Parallel slackness

PRAM

CRCW

Parallel work

CREW

Parallelisation

EREW

Parameter passing in parallel

PRAM model

Passive processors

pramlis

in Cole's algorithm

Preprocessor for PIL

PATH problem

Presortedness

P-complete algorithm

Printing memory area

P-complete problem

Probe effect

P-completeness

Problem

proof

inherently serial

PCVP

Problem instance

Phase

Problem parameters

in systolic array

Problem

length of

representation

PIL

Problem scaling

comments

Problem size

PIL compiler

Problems

phases

Ranks (continued)

Problems (continued)

in Cole's algorithm

inherently serial

Read from global memory

NP-complete

READ statement

Procedure

READ statement

local in block

operation count

Procedures

Reading the clock

in PIL

Real numbers

Processing element

in systolic array

description of

RealArrayPrint

in systolic array

RECEIVE

Processor activation

RECEIVE REAL

in PPP

RECEIVE REAL ZEROFY

placement of

RECEIVE ZEROFY

restrictions

Receiving from channel

Processor allocation

removeTIME

in Cole's algorithm

Resetting the clock

in PPP

Resources

Processor number

used in computation

CREW PRAM

Resynchronisation

in processor set

Resynchronisation of processors

printing it

Robustness of Nick's Class

Processor requirement

run command

in Cole's algorithm

run command

discussion

run command

SET PROCESSOR

Running time

Processor set

CREW PRAM

SATISFIABILITY problem

declaration of

Scaled speedup

number in, p

SEND

Processor

SEND

synchronisation

SEND REAL

Processors

Serial work

communication between

SIMD

CREW PRAM

simdeb

Program structure

simdeb

Program

SIMULA

start of execution

SISD

Program structure

Size

using systoolpil

of problem

Program tracing

Software crisis

Program

Space

versions of

Speedup

Programming

linear

CREW PRAM

scaled

on the simulator

superlinear

Pseudo code

Stage

PUSH

in systolic array

PUSH

Straddle

Pyramid of processors

in Cole's algorithm

in Cole's algorithm

Straight insertion sort

Supercomputer

RAM model

Superlinear speedup

Ranks

SYNC

SYNC

Time

SYNC

waiting

SYNC

timessim

timessim

instead of compile-time padding

T Off

T On

SYNC

Top-down development

removing

SYNC

Tracing

replace by CHECK

of PIL programs

Synchronisation

turn off

turn on

Synchronisation barrier

T ThisIs

Synchronous computing structure

T Time

Synchronous MIMD programming

T TITITO

T TR

Synchronous operation

Synchronous programming

Unbounded parallelism

Synchronous statements

Unification problem

SYNCHRONY LOST

UNIT RESOLUTION problem

System stack

Upper bound

Use

Systolic array

time used

data availability

Use

delayed output

Use

example

used in systolic array

phase

User stack

stage

UserStackPtr

unit time delay

UserStackPtr

systoolpil

systoolpil

Variable

program structure

local

making it local

T procedures

Versions

T procedures

of a program

Taxonomy

Duncan's

Wait

Flynn's

Wait

T Flag

Wait

Theoretical computer science

Warning

Theory and practice

Weather forecasting

gap

WHILE loop

Time

Whitespace character

Time complexity function

Working areas

Time constants

in Cole's algorithm

Time consumption

Worst case

Time delay

WRAM

in systolic array

Write conflict

Time modelling

WRITE CONFLICT

exact

WRITE statement

rules for making

WRITE statement

simplified

operation count

Time unit

Write to global memory

Time used

for accessing the global memory