Job Scheduling Strategies for Networks of Workstations

B. B. Zhou¹, R. P. Brent¹, D. Walsh², and K. Suzaki³

¹ Computer Sciences Laboratory, Australian National University, Canberra, ACT, Australia
² CAP Research Program, Australian National University, Canberra, ACT, Australia
³ Electrotechnical Laboratory, Umezono, Tsukuba, Ibaraki, Japan

Abstract. In this paper we first introduce the concepts of utilisation ratio and effective speedup and their relations to the system performance. We then describe a two-level scheduling scheme which can be used to achieve good performance for parallel jobs and good response for interactive sequential jobs, and also to balance both parallel and sequential workloads. The two-level scheduling can be implemented by introducing on each processor a registration office. We also introduce a loose gang scheduling scheme. This scheme is scalable and has many advantages over existing explicit and implicit coscheduling schemes for scheduling parallel jobs under a time-sharing environment.

1 Introduction

The trend of parallel computer developments is toward networks of workstations, or scalable parallel systems. In this type of system each processor, having a high-speed processing element, a large memory space and the full functionality of a standard operating system, can operate as a stand-alone workstation for sequential computing. Interconnected by high-bandwidth and low-latency networks, the processors can also be used for parallel computing. To establish a truly general-purpose and user-friendly system, one of the main problems is to provide users with a single system image. By adopting the technique of distributed shared memory, for example, we can provide a single addressing space for the whole system, so that communication for transferring data between processors is completely transparent to the client programs. In this paper we discuss another very important issue relating to the provision of a single system image, that is, effective job scheduling strategies for both sequential and parallel processing on networks of workstations.

Many job scheduling schemes have been introduced in the literature, and some of them implemented on commercial parallel systems. These scheduling schemes for parallel systems can be classified into either space sharing, or time sharing, or a combination of both. With space sharing a system is partitioned into subsystems, each containing a subset of processors. There are boundary lines laid between subsystems, and so only processors of the same subsystem can be coordinated to solve problems assigned to that subsystem. During the computation each subsystem is allocated to only a single job at a time.

The space partition can be either static or adaptive. With static partitioning the system configuration is determined before the system starts operating. The whole system has to be stopped when the system needs to be reconfigured. With adaptive partitioning, processors in the system are not divided before the computation. When a new job arrives, a job manager in the system first locates idle processors and then allocates a certain number of those idle processors to that job according to some processor allocation policies, e.g., those described in the literature. Therefore, the boundary lines are drawn during the computation and will disappear after the job is terminated. Normally, static partitioning is used for very large systems, while adaptive partitioning is adopted in systems or subsystems of small to medium size. One disadvantage of space partitioning is that short jobs can easily be blocked by long ones for a long time before being executed. However, in practice short jobs usually demand a short turnaround time. To alleviate this problem, jobs can be grouped into classes and special treatment given to the class of short jobs. However, this can only partially solve the problem. Thus time sharing needs to be considered.

Many scheduling schemes for time-sharing of a parallel system have been proposed in the literature. They may be classified into two basic types. The first one is local scheduling. With local scheduling there is only a single queue on each processor. Except for higher or lower priorities being given, processes associated with parallel jobs are not distinguished from those associated with sequential jobs. The method simply relies on the existing local scheduler on each processor to schedule parallel jobs. Thus there is no guarantee that the processes belonging to the same parallel job can be executed at the same time across the processors. When many parallel programs are simultaneously running on a system, processes belonging to different jobs will compete for resources with each other, and then some processes have to be blocked when communicating or synchronising with non-scheduled processes on other processors. This effect can lead to a great degradation in overall system performance. One method to alleviate this problem is to use two-phase blocking, which is also called implicit coscheduling. In this method a process waiting for communication spins for some time, in the hope that the process to be communicated with on the other processor is also scheduled, and then blocks if a response has not been received. The reported experimental results show that for parallel workloads this scheduling scheme performs better than simple local scheduling. However, the problem is that the scheduling policy is based on communication requirements, so it tends to give special treatment to jobs with a high frequency of communication demands. The policy is also independent of service times. The performance of parallel computation is thus unpredictable.
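The spin-then-block idea behind two-phase blocking can be sketched as follows. This is a minimal single-machine sketch in Python; the `spin_time` budget and the use of a `threading.Event` to stand in for the peer's response are our illustrative assumptions, not details from the paper:

```python
import threading
import time

def two_phase_wait(event, spin_time=0.001):
    """Two-phase (spin-then-block) wait: spin briefly in the hope that the
    peer process is currently scheduled on its processor, then yield the
    CPU and block if no response has arrived in time."""
    deadline = time.monotonic() + spin_time
    while time.monotonic() < deadline:   # phase 1: busy-wait
        if event.is_set():
            return True                  # response arrived while spinning
    return event.wait()                  # phase 2: block until the response

# Toy usage: the "peer" responds after 5 ms, longer than the 1 ms spin
# budget, so the waiter falls through to the blocking phase.
reply = threading.Event()
threading.Timer(0.005, reply.set).start()
assert two_phase_wait(reply, spin_time=0.001) is True
```

The spin budget is the tuning knob: a budget of roughly two context-switch times keeps fine-grained communication cheap while still releasing the CPU under long waits.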

The second type of scheduling scheme for time sharing is coscheduling, or gang scheduling, which may be a better scheme for adopting a short-job-first policy. Using this method, a number of parallel programs are allowed to enter a service queue as long as the system has enough memory space. The processes of the same job will run simultaneously across the processors for only a certain amount of time, which is called a scheduling slot. When a scheduling slot is ended, the processors will context-switch at the same time to give the service to the processes of another job. All programs in the service queue take turns to receive the service in a coordinated manner across the processors. Thus programs never interfere with each other, and short jobs are likely to be completed more quickly. There are also certain drawbacks associated with coscheduling. A significant one is that it is designed only for parallel workloads. For networks of workstations we need an effective scheduling strategy for both sequential and parallel processing. The simple coscheduling technique is not a suitable solution.

Future networks of workstations should provide a programming-free environment to general users. By providing a variety of high-performance computing libraries for a wide range of applications, plus user-friendly interfaces for access to those libraries, parallel computing will no longer be considered just as clients' special requests, but will become a natural and common phenomenon in the system. Along with many other critical issues, therefore, highly effective job management strategies are required for the system to meet various clients' requirements and to achieve high efficiency of resource utilisation. Because of the lack of efficient job scheduling strategies, most networks of workstations are currently used exclusively either as an MPP for processing parallel batch jobs, or as a group of separate processors for interactive sequential jobs. The potential power of this type of system is not exploited effectively, and the system resources are not utilised efficiently, under these circumstances.

In this paper we discuss some new ideas for effectively scheduling both sequential and parallel workloads on networks of workstations. To achieve a desired performance for a parallel job on a network of workstations with a variety of competitive background workloads, it is essential to provide a sustained ratio of CPU utilisation to the associated processes on each processor, to allocate more processors to the job if the assigned utilisation ratio is small, and then to coordinate the execution across the processors. We first introduce the concepts of utilisation ratio and effective speedup and their relations to the system performance in Section 2. In this section we also argue that, because the resources in a system are limited, one cannot guarantee every parallel job a sustained CPU utilisation ratio in a time-sharing environment. One way to solve the problem is to give short jobs sustained utilisation ratios to ensure a short turnaround time, while to each large job we allocate a large number of processors and assign a utilisation ratio which can vary in a large range according to the current system workload, so that small jobs will not be blocked and the resource utilisation can be kept high. We then present in Section 3 a two-level scheduling scheme which can be used to achieve good performance for parallel jobs and good response for interactive sequential jobs, and also to balance both parallel and sequential workloads. The two-level scheduling can be implemented by introducing on each processor a registration office, which is described in Section 4. We discuss a scalable coscheduling scheme, loose gang scheduling, in Section 5. This scheme requires both global and local job managers. It is scalable because the coscheduling is mainly controlled by local job managers on each processor, so that frequent signal-broadcasting for simultaneous context switch across the processors is avoided. Using a global job manager, we believe that the system can work more efficiently than those using only local schedulers. With a local job manager on each processor, the system will become more flexible and more effective in handling complicated situations than those adopting only the conventional gang scheduling policy. Finally, the conclusions are given in Section 6.

2 Utilisation Ratio and Effective Speedup

Assuming that the overall computational time for a parallel job on p dedicated processors is T_d(p), the conventionally defined speedup is then obtained as

    S_d(p) = T_d / T_d(p),

where T_d is the computational time for the job on a single dedicated processor. This speedup can only be achieved by using dedicated processors. It may be impossible to achieve on a network of workstations, because there a parallel job usually has to time-share resources with other sequential/parallel jobs. If we provide a sustained ratio of CPU utilisation for a job on each processor and use more processors, however, we can still achieve the desired performance in terms of time.

Define the utilisation ratio ρ, for 0 < ρ ≤ 1, as the ratio of CPU utilisation for a given job on each processor. With a given ρ, the job on a processor can on average obtain a service time ρT in each unit of time T. In our scheduling strategy each parallel job will be assigned a utilisation ratio, which is usually determined based on the current system working conditions. Different ratios can also be given on different processors for naturally unbalanced parallel jobs, to achieve better system load-balancing.

Assume that the same utilisation ratio ρ is assigned to a parallel job across all the associated processors, and that the job's processes are gang scheduled. The turnaround time T_e(p) for that job can then be calculated as

    T_e(p) = T_d(p) / ρ,

where T_d(p) is the computational time obtained on p dedicated processors. Defining the effective speedup S_e(p) as the ratio of T_d and T_e(p), we then have

    S_e(p) = T_d / T_e(p) = ρ T_d / T_d(p) = ρ S_d(p),

where S_d(p) is the conventional speedup obtained on p dedicated processors.

To achieve a desired performance, we may set a performance target k and require

    T_d ≥ k T_e(p),

or

    S_e(p) ≥ k.

If the effective speedup for a given job is lower than that target, the performance will be considered unacceptable. From the relations above we can obtain

    S_d(p) ≥ k / ρ.

Using the above inequality, we can easily determine how many processors should be allocated to a given job in order to achieve a desired performance when a particular ρ is given. Assuming, for example, k = 1 and ρ = 1/4, S_d(p) must be greater than or equal to 4; allocating 8 processors or more to that job can then achieve the desired performance if S_d(8) ≥ 4. When the current system workload is not heavy, we may need fewer processors to achieve the same performance. If there are several idle processors, we may set ρ = 1/2 in the above example; then only 4 processors may be required if S_d(4) ≥ 2.

In practice the exact speedup S_d(p) may not be known, except for those programs in standard general-purpose parallel computing libraries. Thus the values can only be approximate in those cases. However, good approximations can often be obtained. For example, the results of the Linpack Benchmark can be used as a good approximation for problems of matrix computation.
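The allocation rule above is straightforward to compute. The following is a small Python sketch; the target k, the ratio ρ, and the speedup models passed in are illustrative assumptions rather than values from the paper:

```python
def required_dedicated_speedup(k, rho):
    """Minimum conventional speedup S_d(p) so that the effective speedup
    S_e(p) = rho * S_d(p) meets the performance target k."""
    return k / rho

def processors_needed(k, rho, s_d, max_p=1024):
    """Smallest p <= max_p with s_d(p) >= k / rho, where s_d is an
    estimate of the conventional speedup; None if the target is
    unreachable within max_p processors."""
    target = required_dedicated_speedup(k, rho)
    for p in range(1, max_p + 1):
        if s_d(p) >= target:
            return p
    return None

# Illustrative, perfectly scalable speedup model (not from the paper).
linear = lambda p: p
assert required_dedicated_speedup(1.0, 0.25) == 4.0
assert processors_needed(1.0, 0.25, linear) == 4
```

With a sublinear model the same target simply pushes p higher, which is exactly the trade-off the inequality expresses: a smaller assigned ρ must be compensated by more processors.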

The utilisation ratios of the existing jobs may be decreased whenever a new job enters the system to time-share the resources. The problem is how to ensure a sustained ratio of CPU utilisation for each job, so that the performance can be predictable in a time-sharing environment. Since the resources in a system are limited, the answer to this question is simply that we cannot guarantee every job a sustained ratio when the system workload is heavy.

One way to solve the above problem is to adopt the following scheme. First we set a limit to the length of each scheduling round T, or a limit to the number of jobs in the system. A common misunderstanding about time-sharing for parallel jobs is that good performance will be obtained as long as parallel jobs can enter the system and start operation quickly. As we mentioned previously, however, the resources in a system are limited, so good performance just cannot be guaranteed if the length of the scheduling round is unbounded. Consider a simple example in which several large jobs are time-sharing the resources in a round-robin manner. In this case the conventional gang scheduling simply fails to produce good performance in terms of turnaround time.

Even with the limit on the length of each scheduling round, short jobs can still be blocked for a long time. We therefore adopt a scheduling policy in which small jobs are given sustained utilisation ratios to ensure a short turnaround time, while each large job is assigned a large number of processors but given a utilisation ratio which can vary in a large range according to the current system workload. In this way we believe that small jobs will not be blocked, the resource utilisation can be kept high, and reasonably good performance for large jobs may also be obtained.

Based on the above ideas, a multi-class time/space sharing system has been designed. A detailed description of this system is beyond the scope of this paper; interested readers may refer to the literature for more details.

3 Two-Level Scheduling

It can be seen from the previous section that our scheduling strategy is based on the utilisation ratios assigned to parallel jobs. In this section we introduce a two-level scheduling scheme for balancing the workloads of both sequential and parallel processing.

At the top level, or global level, gang scheduling (or the loose gang scheduling scheme to be discussed in the next section) is adopted to coordinate parallel computing. Each scheduling round T is divided into time slots. An example of the time distribution for different processes on each processor is shown in Fig. 1. In the figure, time slot t_s^{(i)} is allocated only to sequential processes associated with sequential jobs, while slot t_p^{(i)} is assigned to a single parallel process associated with a parallel job. A parallel process may share its time slots with sequential processes through the scheduling at the bottom level, or local level. However, no parallel processes will share the same time slots. This is to avoid many different types of parallel jobs competing for resources at the same time, and thus to guarantee that each parallel process can obtain its proper share of resources. The relation between a scheduling round and those time slots satisfies the following equation:

    Σ_{i=1}^{n} t_p^{(i)} + T_s = T,

where T_s = Σ_{i=1}^{m} t_s^{(i)} is the total time dedicated to sequential jobs in a scheduling round, and is distributed to gain good response for interactive clients.

[Figure: one scheduling round of length T, in which parallel slots t_p^{(1)}, ..., t_p^{(4)} alternate with sequential slots t_s^{(1)}, ..., t_s^{(4)}.]

Fig. 1. The time distribution in a scheduling round.

The width of each time slot is determined by the corresponding utilisation ratio ρ_p^{(i)} or ρ_s^{(i)}. We can then calculate the width of each time slot as

    t_p^{(i)} = ρ_p^{(i)} T   and   T_s = ρ_s T,

where ρ_s = Σ_{i=1}^{m} ρ_s^{(i)}.

There are many ways to distribute T_s. For example, each slot for a parallel process can be followed by a small slot for sequential processes, with T_s uniformly distributed across the whole scheduling round. Then

    t_s^{(i)} = T_s / n.

We can also distribute T_s proportionally to the width of each time slot for parallel processes, that is,

    t_s^{(i)} = ( ρ_p^{(i)} / Σ_{j=1}^{n} ρ_p^{(j)} ) T_s.

The calculation for proportional distribution is a bit more complicated than that for uniform distribution. However, it is useful when a proper-share policy, which will be described later in the section, is applied at the local level.
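The two distributions above can be sketched as follows. This is a minimal Python sketch; the example ratios are illustrative assumptions:

```python
def slot_widths(rho_p, rho_s, T=1.0, proportional=True):
    """Compute parallel slot widths t_p[i] and sequential slot widths
    t_s[i] for one scheduling round of length T.

    rho_p: utilisation ratios of the n parallel processes on this processor.
    rho_s: combined utilisation ratio of all sequential processes.
    """
    t_p = [r * T for r in rho_p]              # t_p[i] = rho_p[i] * T
    T_s = rho_s * T                           # total time for sequential jobs
    if proportional:                          # t_s[i] proportional to rho_p[i]
        total = sum(rho_p)
        t_s = [r / total * T_s for r in rho_p]
    else:                                     # uniform: t_s[i] = T_s / n
        t_s = [T_s / len(rho_p)] * len(rho_p)
    return t_p, t_s

# Example: three parallel processes and 40% sequential load in a unit round.
t_p, t_s = slot_widths([0.2, 0.3, 0.1], rho_s=0.4)
assert abs(sum(t_p) + sum(t_s) - 1.0) < 1e-12   # slots fill the round exactly
```

Either way the slots exactly fill the round; the two variants differ only in how the sequential time T_s is carved up among the gaps.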

Different local policies can be adopted to schedule processes within each time slot. In those time slots dedicated to sequential processing, the conventional local scheduling schemes of any standard operating system will be good enough. In the following we discuss how to schedule processes in each time slot t_p^{(i)} in which parallel processing is involved.

To ensure that a parallel process can obtain its assigned share of CPU utilisation, the whole slot t_p^{(i)} may be dedicated just to the associated parallel process. In that case a very high priority will be given, and the process simply does busy-waiting, or spins, during communication/synchronisation, so that no other processes can disturb its execution within each associated time slot. One problem associated with this policy is that the performance of sequential jobs, especially of those which demand good interactive response, may be significantly affected. Therefore its use will be treated as a special case in the environment of networks of workstations, to satisfy certain clients' special requests.

To prevent great performance degradation of sequential interactive jobs, the implicit coscheduling scheme can be adopted. However, a potential problem is that the execution of a parallel process may be disturbed by several sequential processes, and then it is possible that certain parallel processes may not receive their proper shares in their associated time slots.

The above problem may be alleviated by adopting a proper-share policy. In this policy we do not consider the individual shares allocated to each sequential job. Except for special ones, e.g., multimedia workloads, which may be treated in the same way as parallel jobs to achieve constant-rate services, only a combined share of sequential processes t_s^{(i)} is considered. Each distributed time slot for sequential processes t_s^{(i)} is also integrated with its associated time slot t_p^{(i)} to form a single time slot of width t^{(i)}, that is,

    t^{(i)} = t_s^{(i)} + t_p^{(i)}.

In each integrated time slot, implicit coscheduling is applied to support both parallel and sequential processing. When its allocated share is not used up in time t_p^{(i)}, a parallel process can still obtain services till the end of the integrated time slot t^{(i)}, even though t^{(i)} is longer than t_p^{(i)}. When a parallel process has consumed its share before the end of an integrated time slot, however, it will be blocked, and the services in the remaining time slot are then dedicated to sequential processes. With this policy, parallel processes, and sequential processes as a whole, may be guaranteed to obtain their proper shares during the computation.

Similar to schemes described in the literature, the policy may be realised by applying the proportional-share technique, which was originally used for real-time applications. However, our scheduling scheme is much simpler and easier to implement, because only the proper share of a single parallel process is considered against a combined share of sequential processes in each time slot.

Now the problem is how to distribute the total time T_s allocated for processing sequential jobs. The uniform distribution given above is easy to calculate. However, the resulting t_s^{(i)} may be too small to compensate for the lost share of parallel processes which have large ρ_p^{(i)}'s. Therefore the proportional distribution may be a more proper one.

Normalising T, that is, setting T = 1, the equation relating the scheduling round to the time slots becomes

    Σ_{i=1}^{n} ρ_p^{(i)} + ρ_s = 1.

Using the equations above, we obtain

    t^{(i)} = t_p^{(i)} + t_s^{(i)} = ρ_p^{(i)} / Σ_{j=1}^{n} ρ_p^{(j)}.

The width of an integrated time slot t^{(i)} can thus be obtained directly, and there is no need to explicitly calculate the t_s^{(i)}'s.
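Under the normalisation T = 1, the integrated slot widths therefore follow from the parallel ratios alone. The following small Python sketch cross-checks this shortcut against the explicit t_p^{(i)} + t_s^{(i)} computation; the example ratios are illustrative assumptions:

```python
def integrated_slot_widths(rho_p):
    """Integrated slot widths t[i] for a normalised round (T = 1) under
    proportional distribution of the sequential time T_s = 1 - sum(rho_p):
    t[i] = rho_p[i] / sum(rho_p)."""
    total = sum(rho_p)
    return [r / total for r in rho_p]

# Example: parallel ratios 0.2, 0.3, 0.1 leave rho_s = 0.4 for sequential work.
rho_p = [0.2, 0.3, 0.1]
rho_s = 0.4
t = integrated_slot_widths(rho_p)
assert abs(sum(t) - 1.0) < 1e-12          # integrated slots cover the round
# Cross-check against t[i] = t_p[i] + t_s[i] computed explicitly.
t_explicit = [r + r / sum(rho_p) * rho_s for r in rho_p]
assert all(abs(a - b) < 1e-12 for a, b in zip(t, t_explicit))
```

The cross-check holds precisely because the ratios sum to one after normalisation, which is what makes the explicit t_s^{(i)} computation unnecessary.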

4 Registration Office

When a parallel process has used up its time slot, it will be preempted at the global level and another parallel process dispatched. After being dispatched, parallel processes may time-share resources with sequential processes on each processor. Just like sequential processes, parallel processes will then be either in the running state, or in the ready and blocked states, which is controlled by a local scheduler. Because in our two-level scheduling the execution of parallel processes is controlled at both the global and local levels, special care has to be taken to avoid potential scheduling conflicts; for example, the global scheduler may want to preempt a parallel process which is currently not in the running state. To solve this problem we introduce a registration office on each processor.

The registration office is constructed by using a linked list, as shown in Fig. 2. When a parallel job is initiated, each associated process will enter the local sequential queueing system in the same way as sequential processes on the corresponding processor. Just like sequential processes, parallel processes can be either in the running state, or in the ready state requesting service, or in the blocked state during communication/synchronisation.

[Figure: a registration office on a processor P. An office manager, with a timer and a scheduling algorithm, moves a servant along a linked list of nodes between head H and tail T; each node points to a parallel process and is marked IN or OUT.]

Fig. 2. The organisation of a registration office.

Every parallel process, however, has to be registered in the registration office; that is, on each processor the linked list will be extended with a new node which has a pointer pointing to the process just being initiated. Similarly, when a parallel job is terminated, it has to check out from the office; that is, the corresponding node on each processor will be deleted from the linked list.

As we discussed in the previous section, certain parallel processes may be assigned a very high priority so that they can occupy the whole time slots allocated to them. In that case the execution of sequential workloads can seriously deteriorate. To alleviate this problem we may introduce certain time slots t_s^{(i)} which are dedicated to sequential jobs only. This can be done by introducing dummy nodes in the linked list. A dummy node is the same type of node as the others in the linked list, except that its pointer points to NULL (the constant zero) instead of a real parallel process. It appears as if a dummy parallel process were associated with that node. When a service is given to that dummy parallel process, the whole time slot will be dedicated to sequential processes.

There is a servant working in the office. When the servant comes to a place, or node, in the linked list, the process associated with that node can receive services, or be dispatched. When a process is dispatched, it will be marked out. Other processes which are not dispatched will be marked in. In practice a process may be blocked if it is marked in. Therefore, a parallel process can come out of the blocked status only if it is ready for service (controlled by the local scheduler) and the event out occurs (controlled by the top-level scheduler). By letting only one parallel process be marked out on each processor at any time, we can guarantee that only one parallel process time-shares resources with sequential processes in each time slot.
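A registration office of this kind can be sketched as follows. This is a minimal Python sketch: a plain list stands in for the linked list, the movement algorithm is simple round-robin, and the job names are illustrative assumptions:

```python
class Node:
    """A registration-office node; `process` is None for a dummy node,
    whose whole slot is dedicated to sequential work."""
    def __init__(self, process=None):
        self.process = process
        self.marked_out = False   # "out" = dispatched, "in" = not dispatched

class RegistrationOffice:
    """Per-processor registry of parallel processes; a servant marks at
    most one of them "out" (dispatched) at any time."""
    def __init__(self):
        self.nodes = []           # conceptually a linked list with head/tail
        self.servant = 0          # current position of the servant

    def register(self, process):          # job initiation: extend the list
        self.nodes.append(Node(process))

    def check_out(self, process):         # job termination: delete the node
        self.nodes = [n for n in self.nodes if n.process != process]
        self.servant = 0 if not self.nodes else self.servant % len(self.nodes)

    def next_slot(self):
        """Timer tick: mark the current process in, move the servant
        (round-robin here), and mark the newly chosen process out."""
        if not self.nodes:
            return None
        self.nodes[self.servant].marked_out = False
        self.servant = (self.servant + 1) % len(self.nodes)
        chosen = self.nodes[self.servant]
        chosen.marked_out = True
        return chosen.process             # None means a sequential-only slot

office = RegistrationOffice()
office.register("J1"); office.register(None); office.register("J2")
slots = [office.next_slot() for _ in range(3)]
assert slots == [None, "J2", "J1"]  # the dummy node yields a sequential slot
```

Registering `None` plays the role of a dummy node; when the servant reaches it, no parallel process is marked out and the slot falls entirely to sequential processes.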

When a time slot is ended for the current parallel process, the servant will move to a new node. The parallel process associated with that node can then be serviced next. However, the movement of the servant is totally controlled by an office manager, which has a timer to determine when the servant is to move, and an algorithm to determine which node the servant is to move to. The algorithm can be a simple one, such as the conventional round-robin. To obtain a high system throughput, however, other more sophisticated scheduling schemes may also be considered. The timer is to ensure that processes can obtain their allocated service times, that is, the t_p^{(i)}'s or t_s^{(i)}'s, in each scheduling round.

The use of registration offices is similar to that of the two-dimensional matrix adopted in conventional coscheduling. Each column of the matrix corresponds to a time slot, and each row to a processor. The coscheduling is then controlled based on that matrix. It is easy to see that the linked list on each processor plays the same role as a row of that matrix in coscheduling parallel processes. However, the key difference is that our two-level scheduling scheme allows both parallel and sequential jobs to be executed simultaneously.

5 Loose Gang Scheduling

The conventional gang scheduler is centralised: the system has a central controller. At the end of each time slot the controller broadcasts a message to all processors, containing the information about which parallel workload will receive a service next. The centralised system is easy to implement, especially when the scheduling algorithm is simple. However, frequent signal-broadcasting for simultaneous context switch across the processors may degrade the overall system performance on machines such as networks of workstations, and space-sharing policies may not easily be adopted to enhance the efficiency of resource utilisation. Because in our system there is a registration office on each processor, we can adopt a loose gang scheduling policy to alleviate these problems.

In our system there is a global job manager. It is used to monitor the working conditions of each processor, to locate and allocate processors, to assign utilisation ratios to parallel jobs, and to balance parallel and sequential workloads. We believe that resources in networks of workstations cannot be utilised efficiently without an effective global job manager. This global job manager is also able to broadcast signals for the purpose of synchronisation, to coordinate the execution of parallel jobs. However, the signals need not be frequently broadcast for simultaneous context switch between time slots across the processors. They are sent only once after each scheduling round, or even after many scheduling rounds, to adjust the potential skew of the corresponding time slots (or simply the time skew) across the processors caused by using local job managers on each processor.

There is a local job manager on each processor. It is used to monitor, and report to the global job manager, the working conditions on that processor. It also takes orders from the global job manager to properly set up its registration office and to coordinate the execution of parallel jobs with other processors. With the help of the global job manager, effective coscheduling is guaranteed by using local job managers on each processor.

[Figure: the time-space plane of one scheduling round T (time axis) over five processors (space axis), partitioned among six jobs; the allocated regions differ in shape from processor to processor.]

Fig. 3. The time-space allocation for six jobs on five processors.

In the following we give a simple example which demonstrates more clearly the effectiveness of the loose gang scheduling scheme, and which also presents another way of deriving the registration office for the scheme.

Our simple example considers the execution of six jobs on five processors. We assume that the time-space allocation has already been done, that is, the number of processors and the utilisation ratio have been assigned for each job, as depicted in Fig. 3. For various reasons, such as those described in the previous sections, the shapes of the time-space allocation may not be the same for each job, as indicated in the figure. This makes it very difficult for a centralised controller to coschedule jobs. However, the problem can easily be solved by adopting our loose gang scheduling.

On each processor we run a local job manager, and we also set up a scheduling table which is given by the global job manager. Parallel processes are then scheduled according to this scheduling table. In our example there are three different scheduling tables, as shown in Fig. 4(a). The processes and the lengths of their allocated time slots in a scheduling round are listed in each table in an ordered manner. It is easy to see that, if the processors are synchronised at the beginning of each scheduling round (it is also possible that the processors are synchronised only once every many scheduling rounds) and local job managers schedule parallel processes according to the given scheduling tables, the correct coscheduling across the processors is then guaranteed.
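The local side of this scheme can be sketched as follows. This is a minimal Python sketch; the table contents, echoing the six-job example, and the slot lengths are illustrative assumptions:

```python
def local_schedule(table, rounds=1):
    """Replay a per-processor scheduling table for a number of rounds.

    `table` is an ordered list of (job, slot_length) pairs handed down by
    the global job manager; the local manager simply cycles through it,
    so no per-slot broadcast from a central controller is needed.
    """
    trace = []
    for _ in range(rounds):
        for job, length in table:
            trace.append((job, length))   # dispatch `job` for `length` units
    return trace

# Three illustrative tables for a round of length 1.0 (cf. the six-job
# example): J6 occupies the same offset on P1 and P2, J3 on P2 and P3.
tables = {
    "P1": [("J1", 0.5), ("J2", 0.25), ("J6", 0.25)],
    "P2": [("J3", 0.5), ("J4", 0.25), ("J6", 0.25)],
    "P3": [("J3", 0.5), ("J5", 0.5)],
}
# Every table fills exactly one scheduling round, so processors that are
# synchronised at the start of a round stay coscheduled throughout it.
assert all(abs(sum(l for _, l in t) - 1.0) < 1e-12 for t in tables.values())
```

Because each table sums to the same round length, the processes of a multi-processor job land in the same wall-clock window on every processor without any per-slot broadcast; only the slow drift of the local clocks needs an occasional correction from the global manager.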

Because both the content and the size of each table vary from time to time during the computation, it is quite natural to implement the scheduling tables using linked lists, which results in our registration office. The registration office on processor 1 is depicted in Fig. 4(b). Note that each node in the linked list has a pointer which points at the corresponding process, so that any unnecessary search for parallel processes can be avoided.

[Figure: (a) the ordered scheduling tables for the processors, listing jobs and their slot lengths, e.g. J1, J2, J6 on processor 1; (b) the registration office on processor 1, a linked list from head H to tail T whose nodes point to the processes of J1, J2 and J6.]

Fig. 4. (a) The scheduling tables assigned for each processor, and (b) the registration office on processor 1.

With the collaboration of the global and local job managers, the system can work correctly and effectively. A potential disadvantage of the loose gang scheduling is that there is an additional cost for executing the coscheduling algorithm on each processor. In practice, however, the time slots t_p^{(i)} (or t^{(i)}) are usually on the order of seconds, so the extra cost for running a process for coscheduling will be relatively very small.

6 Conclusions

In this pap er we discussed some new ideas for eectively scheduling b oth parallel

and sequential workloads on networks of workstations

To achieve a desired performance in a system with a variety of competing background workloads, the key is to assign a sustained CPU utilisation ratio on each processor to a parallel job so that the performance becomes predictable. Because the resources in a system are limited, however, we cannot guarantee that every job will be given a sustained utilisation ratio. One way to solve this problem is to assign small parallel jobs a sustained ratio of resource utilisation, while each large parallel job is allocated a large number of processors and assigned a utilisation ratio which can vary over a wide range according to the current system workload. Thus small jobs are not blocked by larger ones and a short turnaround time is guaranteed, high efficiency of resource utilisation can be achieved, and reasonably good performance for large jobs may also be obtained.
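As a concrete illustration of the sustained-ratio idea, a utilisation ratio can be mapped onto whole time slots within a per-processor scheduling round. The function name, round length, and sample ratios below are our own assumptions for illustration, not values taken from the paper.

```python
def slots_for_ratio(ratio, round_slots):
    """Number of time slots a job with the given sustained CPU
    utilisation ratio receives in one scheduling round."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("utilisation ratio must lie in [0, 1]")
    return int(ratio * round_slots)


# A small job guaranteed a sustained ratio of 0.25 in a 20-slot round:
print(slots_for_ratio(0.25, 20))  # 5

# A large job whose ratio varies with the background workload receives
# a varying number of slots round by round:
print([slots_for_ratio(r, 20) for r in (0.1, 0.4, 0.8)])  # [2, 8, 16]
```

The small job's allocation stays fixed regardless of load, which is what makes its turnaround time predictable, while the large job's allocation expands or shrinks with the free capacity.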

To balance the workloads for both sequential and parallel processing, we introduced a two-level scheduling scheme. At the global level parallel jobs are coscheduled so that they can obtain their proper shares without interfering with each other, and they can also be coordinated across the processors to achieve high efficiency in parallel computation. At the local level many different policies, e.g., busy-waiting (or spinning) and implicit coscheduling (or two-phase blocking), can be considered to schedule both parallel and sequential processes. We introduced a proper-share policy for effectively scheduling processes at the local level. By adopting this policy we can obtain good performance for each parallel job and also maintain good response for interactive sequential jobs. The two-level scheduling can be implemented by adopting a registration office on each processor. The organisation of the registration office, which is also described in , is simple, and its main purpose is to effectively schedule parallel processes at both global and local levels.

We also introduced a loose gang scheduling scheme to coschedule parallel jobs across the processors. This scheme requires both global and local job managers. The coscheduling is mainly controlled by the local job manager on each processor, so frequent signal-broadcasting for simultaneous context switches across the processors is avoided. There is only a little extra work for the global job manager to adjust potential time skew. The name loose gang has two meanings: first, the coscheduling is achieved mainly by the local job managers rather than by a central controller alone; and second, parallel processes may time-share their allocated time slots with sequential processes. Since both global and local job managers play effective roles in job scheduling, we think this may lead the way to good strategies for efficiently scheduling both parallel and sequential workloads on networks of workstations.
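The division of labour between the local and global managers can be sketched as follows. This is a simplified model under our own assumptions (class names, the four-slot round, and the majority-vote skew rule are illustrative, not taken from the paper): each local manager advances its slot pointer on its own timer, and the global manager only occasionally re-aligns the processors.

```python
from collections import Counter


class LocalJobManager:
    """Advances its own slot pointer on a local timer; no broadcasts."""
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.current = 0

    def tick(self):
        # One local context switch per time-slot boundary.
        self.current = (self.current + 1) % self.num_slots


class GlobalJobManager:
    """Occasionally re-aligns the processors to correct time skew."""
    def __init__(self, managers):
        self.managers = managers

    def adjust_skew(self):
        # One possible rule: move every processor to the majority slot.
        target = Counter(m.current for m in self.managers).most_common(1)[0][0]
        for m in self.managers:
            m.current = target


managers = [LocalJobManager(num_slots=4) for _ in range(3)]
for m in managers:
    m.tick()                  # every processor switches locally
managers[2].tick()            # one processor has drifted a slot ahead
GlobalJobManager(managers).adjust_skew()
print([m.current for m in managers])  # back in step: [1, 1, 1]
```

Because the per-switch work is purely local, the global manager's traffic grows only with the (infrequent) skew corrections, not with every context switch, which is what makes the scheme scalable.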

A new system based on these ideas is currently under construction on a distributed-memory parallel machine, the Fujitsu AP, at the Australian National University.

References

1. T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias and M. Snir, SP system architecture, IBM Systems Journal.
2. S. V. Anastasiadis and K. C. Sevcik, Parallel application scheduling on networks of workstations, Journal of Parallel and Distributed Computing.
3. T. E. Anderson, D. E. Culler, D. A. Patterson and the NOW team, A case for NOW (networks of workstations), IEEE Micro, Feb.
4. R. H. Arpaci, A. C. Dusseau, A. M. Vahdat, L. T. Liu, T. E. Anderson and D. A. Patterson, The interaction of parallel and sequential workloads on a network of workstations, Proceedings of the ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, May.
5. A. C. Arpaci-Dusseau and D. E. Culler, Extending proportional-share scheduling to a network of workstations, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, June.
6. M. Crovella, P. Das, C. Dubnicki, T. LeBlanc and E. Markatos, Multiprogramming on multiprocessors, Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, Dec.
7. J. J. Dongarra, Performance of various computers using standard linear equations software, Technical Report CS, Computer Science Department, University of Tennessee, Nov.
8. A. C. Dusseau, R. H. Arpaci and D. E. Culler, Effective distributed scheduling of parallel workloads, Proceedings of the ACM SIGMETRICS International Conference.
9. D. G. Feitelson and L. Rudolph, Gang scheduling performance benefits for fine-grained synchronisation, Journal of Parallel and Distributed Computing, Dec.
10. D. Ghosal, G. Serazzi and S. K. Tripathi, The processor working set and its use in scheduling multiprocessor systems, IEEE Transactions on Software Engineering, May.
11. A. Gupta, A. Tucker and S. Urushibara, The impact of operating system scheduling policies and synchronisation methods on the performance of parallel applications, Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May.
12. K. Li, IVY: a shared virtual memory system for parallel computing, Proceedings of the International Conference on Parallel Processing.
13. S.-P. Lo and V. D. Gligor, A comparative analysis of multiprocessor scheduling algorithms, Proceedings of the International Conference on Distributed Computing Systems, Sept.
14. V. K. Naik, S. K. Setia and M. S. Squillante, Performance analysis of job scheduling policies in parallel supercomputing environments, Proceedings of Supercomputing, Nov.
15. V. K. Naik, S. K. Setia and M. S. Squillante, Processor allocation in multiprogrammed distributed-memory parallel computer systems, IBM Research Report RC.
16. J. K. Ousterhout, Scheduling techniques for concurrent systems, Proceedings of the Third International Conference on Distributed Computing Systems, May.
17. E. Rosti, E. Smirni, L. Dowdy, G. Serazzi and B. M. Carlson, Robust partitioning policies of multiprocessor systems, Performance Evaluation.
18. S. K. Setia, M. S. Squillante and S. K. Tripathi, Analysis of processor allocation in multiprogrammed, distributed-memory parallel processing systems, IEEE Transactions on Parallel and Distributed Systems, April.
19. I. Stoica, H. Abdel-Wahab, K. Jeffay, S. Baruah, J. Gehrke and C. G. Plaxton, A proportional share resource allocation algorithm for real-time, time-shared systems, IEEE Real-Time Systems Symposium, Dec.
20. K. Suzaki, H. Tanuma, S. and Y. Ichisugi, Design of combination of time sharing and space sharing for parallel task scheduling, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, Nov.
21. C. A. Waldspurger and W. E. Weihl, Deterministic proportional-share resource management, Technical Report MIT/LCS/TM, MIT Laboratory for Computer Science, June.
22. J. Zahorjan and E. D. Lazowska, Spinning versus blocking in parallel systems with uncertainty, Proceedings of the IFIP International Seminar on Performance of Distributed and Parallel Systems, Dec.
23. B. B. Zhou, X. Qu and R. P. Brent, Effective scheduling in a mixed parallel and sequential computing environment, Proceedings of the Euromicro Workshop on Parallel and Distributed Processing, Madrid, Jan.
24. B. B. Zhou, R. P. Brent, D. Walsh and K. Suzaki, A multi-class time-space sharing system, Tech. Rep., DCS and CS Lab, Australian National University, in process.