Task Management Issues in Distributed Systems

Ahilan Anantha, Maki Sugimoto, Andreas Suryawan, Peter Tran

University of California, San Diego

November 21, 1998

Abstract

One of the main goals of distributed systems is allowing idle processing resources to be utilized. To accomplish this, there must be mechanisms to distribute tasks across machines. We examine the task management mechanisms provided by several distributed operating systems, and analyze their effectiveness.

1 Introduction

A major motivation for constructing a distributed operating system is to perform coordination of decentralized resources in order to raise the utilization of the system as a whole. It is through the management of tasks that a system is able to optimize the parallelism being offered, thereby increasing utilization.

There are quite a few interesting attributes of distributed operating systems, and notable techniques used in handling these. We examine the following attributes and techniques of task management:

1. Ownership of CPU resources

2. Homogeneous vs heterogeneous environment

3. Remote execution/process migration

4. Namespace transparency

5. Load information and control manager

Design choices made in the systems surveyed reflect a set of tradeoffs considered by the system architect: complexity, residual dependencies, performance, and transparency. [7] We analyze how each operating system copes with these conflicting factors to provide efficiency and maintainability. Finally, we consider improvements to these systems.

2 Techniques and Attributes of Task Management

2.1 Ownership of CPU Resources

The operating systems we discuss in this paper fall into two basic classes of environments: (1) those where machines are "owned" by particular users, such that a machine's processing may be used only when the owner is not using it, and (2) those where there is no notion of ownership of machines; all processors are available for use by all users. Operating systems of the first class are typically designed for environments of graphical workstations.

The Sprite and Condor operating systems are designed for the lab graphical workstation environment. There is no separate CPU server cluster; the workstations themselves make up the distributed system. A user is expected to interact with the operating system by way of a windowing system, which is a highly CPU and memory intensive interactive process. Interactive processes have randomly fluctuating loads because they act in response to user activity, which is itself randomly fluctuating. At the same time, interactive processes have minimum delay requirements because users require real-time response. Windowing systems pose an even greater problem because they require a large percentage of processing power, while traditional text mode interaction is fairly lightweight. In order to satisfy these delay requirements, it becomes necessary to reserve the maximum amount of CPU resources that a windowing system would require.

In non-distributed systems, the user of a graphical workstation is expected to actively control which processes may run on the system to satisfy his delay requirements. If the user of a graphical workstation has ownership of all the user processes that can hog the system resources, he can suspend or terminate the processes that prevent the usability of the console. However, if other users were permitted to easily run processes on remote systems, the console user would lose the ability to control the interactive response time.

For this reason, these operating systems give a second-class status to remote processes. Remote processes are only allowed to utilize the resources of a workstation if the workstation is not already busy serving its console user. The CPU resources can be taken back from remote processes if the console user desires them. As such, the console user can be considered the owner of a workstation's CPU resources. Sprite and Condor will only permit remote processes to run on a system when the system is idle, and will evict remote processes once a user starts using the console.

The other class of distributed systems consists of environments of dedicated CPU servers, data servers, and graphical terminals. The bulk of the processing power in these environments is contained in the CPU server. The computers with graphical displays are essentially graphical terminals: they have sufficient processing ability to run the windowing system processes but require no more. All other CPU intensive processes are executed remotely on a CPU server. No user "owns" the CPU server; every user gets a guaranteed share of its resources. Conversely, no remote processes would be allowed on a graphical terminal. The work of trying to determine whether a graphical workstation is idle is unnecessary, since the graphical terminals would have minimal processing resources to offer. All the resources of a graphical terminal can thus be reserved for the windowing system.

Clouds, Alpha, MOSIX, Plan 9, and Solaris MC fall under this category. Solaris MC provides a process migration mechanism for server machines, allowing processes to be migrated across server machines in cases where one server must be brought down for maintenance. In these cases, migrated processes can be permitted to be inefficient and to tax the resources of the hosting system, but the necessity of maintaining services across server disconnections outweighs these factors. Solaris MC therefore suggests the need for a distinction between servers, which are prepared to offer the resources to accept migrated processes, and user workstations, which are not willing to accept the burden of migrated processes. Clouds, Alpha, and Plan 9 are alike in that they consist of separate high-end CPU and data servers. A MOSIX system consists of a large number of commodity workstations, all of which may play an equal part in serving data and processing.

2.2 Remote Execution

Remote execution and process migration are the techniques used in distributed systems to share CPU resources. Remote execution is the ability to create processes on remote machines. Process migration is the ability to relocate processes between nodes in mid-execution.

Plan 9 supports remote execution on CPU servers explicitly specified by the user. Processes cannot be migrated; therefore, remotely executed processes spend their entire lifespan on the remote CPU server, from creation to termination. Condor also supports remote execution only.

Sprite supports remote execution through the mechanism of process migration. A request for remote execution of a new process on Sprite would need to be reduced to creating the process locally and attempting to migrate the process soon after. In the case of an immediate remote execution, local memory need not be allocated for the code and data if an idle computer is available to start with. Sprite, however, doesn't provide this optimization: all execution begins locally.
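Sprite's reduction of remote execution to "create locally, then migrate" can be sketched with a toy model in which each node is just a list of running processes. All function and node names here are invented for illustration; they are not Sprite's actual interfaces.

```python
# Toy model: each node is a list of the processes it currently runs.
# Names (migrate, sprite_remote_exec) are invented for illustration;
# they are not Sprite's actual API.

def migrate(nodes, src, dst, proc):
    """Relocate an already-created process from one node to another."""
    nodes[src].remove(proc)
    nodes[dst].append(proc)

def sprite_remote_exec(nodes, home, proc, idle_node=None):
    """Sprite-style remote execution: create locally, migrate soon after."""
    nodes[home].append(proc)                   # all execution begins locally
    if idle_node is not None:                  # migration is only a request;
        migrate(nodes, home, idle_node, proc)  # it is denied if no host is idle

nodes = {"home": [], "idle-host": []}
sprite_remote_exec(nodes, "home", "simulation", idle_node="idle-host")
print(nodes)  # {'home': [], 'idle-host': ['simulation']}
sprite_remote_exec(nodes, "home", "editor")  # no idle host: stays home
print(nodes)  # {'home': ['editor'], 'idle-host': ['simulation']}
```

The point of the sketch is the ordering: the process always exists on the home machine first, which is why Sprite forgoes the optimization of allocating memory only on the destination.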

Process migration is only a request that an application can make. Processes are only permitted to execute on remote workstations that are idle. If there are no idle machines, then the request may be denied and the process would continue to execute on the local machine. Therefore, a process cannot be expected to begin execution on a remote system either.

When the computer ceases to be idle, all remote processes must be evicted. Therefore, every process must have the notion of a home machine, which is the machine from which the user invoked the process.

Solaris MC provides remote execution and process migration mechanisms. Users must explicitly call the rexec system call to carry out remote execution. Unlike Plan 9, the destination node need not be specified. Process migration is assumed to be used mainly for off-loading processes from a node being shut down for maintenance.

MOSIX is the only operating system that makes use of process migration for the purpose of load balancing. Any user process can be migrated at any time to any available node transparently.

2.2.1 Thread Migration in Object-Based Systems

Object-based distributed systems have a different way of organizing resources: they represent them with passive objects. Objects encapsulate code and data. An object's code is executed using a procedural interface called invocation. Objects are large-grained in that they have their own virtual-address space, and there is relatively large overhead associated with the invocation and storage of an object. For these reasons, objects generally implement storage and execution of large-grained data and programs.

Clouds is an example of an object-based distributed operating system. A thread in Clouds is a path of execution made up of a series of calls to object methods. Each call is referred to as an invocation that the object responds to. An object by itself is passive. When a thread invokes an object's method, the thread enters that object's virtual-address space and begins execution. Each invocation of a method is called a segment of the thread that invokes it. This segmentation of threads is how Clouds provides distributed execution.

Note how this is different from the traditional model of processes and process migration. There, a process executes within one virtual-address space unless it is migrated to another node. Migration is expensive, and it is expected not to occur more than once or twice.

In Clouds, migration takes place with object granularity. That is, as a thread proceeds, it may invoke objects on different nodes. A thread's path of execution necessarily crosses through all of these nodes. It is the strategic placement of objects that would be used as a mechanism for load balancing.

Since there is no address space associated with each of the threads, objects on remote or local nodes can be invoked with the same semantics. Threads can cross node boundaries with the minimum penalty of network overheads.

2.3 Namespace Transparency

A required feature of distributed systems is hiding from a process the fact that it is executing remotely or locally. This transparency should also be maintained with regard to the user. The user should be able to interact with the process in the same way as in the local case, regardless of where the code is executing.

Maintaining this transparency requires changes to the traditional operating system model. Transparency refers to the distributed operating system giving each process a single, uniform view of resources, including the filesystem and I/O devices, regardless of which computer it is running on. The design and implementation of the space of accessible resources, or namespace, directly affects the management of the distribution of processes.

Many operating systems achieve this transparency by the enforcement of a uniform, global namespace. The filesystem will appear the same to every process on every node. One solution follows from mounted filesystems in traditional UNIX: a namespace is constructed as a union of mounted file systems.

Object-based distributed systems provide a different solution. There is no notion of a traditional filesystem, only objects. Resources are encapsulated in objects, and naming takes place with object granularity. This provides a flat namespace. At the system level, all objects are identified by a globally unique bit string. A user-level name service is provided to translate user-registered names to system-level names.

Sprite employs the uniform global namespace model. File servers provide domains, similar to UNIX filesystems, that are mounted as subdomains of each other, with one domain selected as the topmost, root, domain. This view of the domain hierarchy is the same for every computer in the cluster. This can be contrasted with Sun's NFS, where every client may choose the local mount point of a remote filesystem. In Sprite, the remote file server decides the mount point all the clients must use. Among other advantages, this guarantees that every file in the distributed filesystem has a single globally defined pathname. This makes it possible to migrate programs that attempt to manipulate files. [7]

Solaris MC also employs the uniform global namespace model. The Solaris MC file system, which is built on top of the existing Solaris file system, interposes all file operations and forwards them to the server where the file actually resides. Any process can open a file located anywhere in the system using the same pathname, thus allowing programs to be located on arbitrary nodes. [4]

The object-oriented semantics of Clouds provide a flat namespace along with global accessibility: any thread can reference any object. This is also essentially Clouds' mechanism for distributed shared memory. Only through invocation is access to an object's data allowed; input and output parameters are pass-by-value only. This protects the internal environment of an object. Capabilities-based protection is provided for controlling global accesses to objects. [1]

Plan 9 has an interesting policy for managing name spaces. Every client process can have a local namespace, which has the same semantics as a localized filesystem interface. User-level servers in Plan 9 have the ability to "export" filesystem interfaces to their clients. In fact, these exported filesystem interfaces are the primary means by which Plan 9 servers export all their resources. Some of the objects in these name spaces may refer to globally distinct files in the distributed filesystem, but some may refer to a local copy of a global resource.

For example, the same global Plan 9 filesystem can be used by clients of different processor architectures. A user on different systems may refer to a binary executable using a common pathname, such as /bin/date, but the actual binary file that is utilized will depend on the processor architecture. Devices stored in /dev will refer to devices in the local name space. Some of these devices may refer to actual kernel-recognized devices, or they may refer to pseudo-device interfaces which user-level servers export. For example, a window in the Plan 9 windowing system exports the devices /dev/mouse, /dev/bitblt, and /dev/cons, which refer to the mouse, bitmapped display interface, and character mode console interface. Each window will export the same devices in its name space, but the actual device files are local copies of the pseudo devices exported by the Plan 9 windowing system. The windowing system will multiplex accesses to the actual physical devices.

To support the ability to run processes on remote servers, and have them appear to be running locally, Plan 9 provides the ability to export the local name space to a remotely executing process. The remotely executing process will then have the same view of the filesystem as it would if it had been executing locally. And it would have access to the same devices (real or virtual) as on the local system, because these would also be exported as part of the local name space. [6]

2.4 Homogeneous vs Heterogeneous Environments

Many distributed operating systems can be run in a heterogeneous environment. However, all of the operating systems that allow process migration have imposed the requirement that all computers involved in process migration have the same processor architecture.

The primary obstacle to heterogeneity is that the execution state of a process is highly architecture-dependent. When the source and destination systems are of the same architecture, the code and data segments, registers, stack, and heap can simply be copied without any changes. With differing processor architectures, all of these might need to be significantly modified. Such modification is likely to be expensive and will add significant complexity to the system.

The operating systems we've discussed that support process migration (MOSIX, Sprite, Solaris MC) have the requirement that all machines accepting migrated processes be of the same processor type.

Clouds' distributed execution model also does not explicitly support heterogeneity. Objects on the data servers are stored in a single machine language, so heterogeneous CPU servers would require the machine code to be converted from one language to another. This is a complication that would break the symmetry of the Clouds system, and it would be expensive to carry out.

Many of these operating systems will permit a data server to be of a different processor architecture, since migration would never take place there.

Plan 9's CPU servers can be heterogeneous. Each program is compiled beforehand for the architecture it is intended to be executed on. This prohibits the implementation of process migration in Plan 9.

2.5 Load Information and Control Manager

Distributed operating systems, by their nature, pool processing resources together. Access to common processing resources must be mediated by some entity or entities. The determination of which process will execute on which processor we term task distribution management, and the entities that make this determination we term task distribution managers. Task distribution decision making may be centralized onto one manager or decentralized onto many managers.

One disadvantage of centralized management is its inherent lack of scalability. The overhead associated with maintaining all the load information and making choices among all the nodes grows with the number of nodes. Another disadvantage is that the failure of the central node brings down the whole mechanism.

We can decentralize this decision making by giving a number of nodes the ability to act as task distribution managers. Each manager would control a partition of the nodes in the system. In this configuration, each managing node essentially becomes the central manager for a smaller distributed system [9]. It can make its own decisions to utilize processors in its partition of participating machines.

A task distribution manager accumulates the load information of the nodes in the partition it controls, and uses this information to choose the processor where a task should run.

In the Sprite system, every Sprite machine runs a background process called the "load-average daemon", which monitors the usage of the machine. When the machine appears idle, the daemon notifies the "central migration server" that the machine is prepared to accept migrated processes. User processes that invoke migration call a standard library routine, Mig_RequestIdleHosts, to obtain a list of idle hosts, and then reference the host identifier in the migrate process system call. The central migration server maintains the database in virtual memory, to avoid the overhead of remote filesystem operations. The load-average daemons and the library routine Mig_RequestIdleHosts communicate with the server using a message protocol. Sprite decides that a machine is idle if and only if (a) it has had no keyboard or mouse input for at least 30 seconds, and (b) there are, on average, fewer runnable processes than processors. This decision was made purely heuristically; originally the input threshold was 5 minutes. The Sprite designers chose not to determine the most efficient utilization of idle hosts, because there were plenty of idle hosts available. [7]
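Sprite's two-part idle test can be stated compactly. The function below is a sketch of that heuristic with illustrative names, not Sprite's actual daemon code; the thresholds are the ones quoted above.

```python
# Sketch of the Sprite load-average daemon's idle test: a machine is idle
# iff (a) no keyboard/mouse input for at least 30 seconds and (b) there
# are, on average, fewer runnable processes than processors.

INPUT_IDLE_THRESHOLD = 30.0  # seconds; originally 5 minutes in Sprite

def machine_is_idle(seconds_since_input, avg_runnable, n_processors):
    return (seconds_since_input >= INPUT_IDLE_THRESHOLD
            and avg_runnable < n_processors)

# A workstation whose user just typed is never offered to remote processes.
print(machine_is_idle(seconds_since_input=2.0, avg_runnable=0.1, n_processors=1))
# An untouched workstation with a near-empty run queue is offered.
print(machine_is_idle(seconds_since_input=600.0, avg_runnable=0.4, n_processors=1))
```

Both conditions must hold: a machine left alone but already saturated with batch work fails test (b) and is not reported to the central migration server.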

MOSIX is fully decentralized; every node acts as a task distribution manager. At regular intervals, each node sends information about its available resources to a randomly chosen partition of nodes [9]. Each node therefore only maintains load information for a random partition of nodes, and will choose nodes among this set as the destination of a process migration. The use of randomness supports scaling and dynamic configuration [9].

3 Tradeoff Comparisons

The design of distributed operating systems involves making tradeoffs among four factors: transparency, residual dependencies, performance, and complexity. Perfect transparency would mean that both the user and the process act the same way toward a remotely executing process as toward a local one; neither the user nor the process need be aware of the fact that a process has been migrated. If remote execution leaves residual dependencies, that means the source machine must continue to provide services to the remotely executing process. By performance, we mean that the remote execution mechanism should induce only minimal overheads in processing and allocation: the delay associated with initiating remote execution, or migrating a process, should be low, and remotely executing processes should perform as efficiently as locally executing ones. The complexity of the remote execution mechanism becomes important because it could potentially affect every piece of the operating system kernel. Depending on the relative importance of the remote execution mechanism to the designers of these operating systems, complexity may be limited for maintainability. [7]

These factors conflict with each other. High transparency is likely to require more complexity and residual dependencies. Residual dependencies affect performance because of the high delays associated with forwarding. A fast migration process may involve the use of residual dependencies to avoid the transfer of state; this can reduce the performance of the execution of the remote process. [7]

3.1 Sprite

The Sprite operating system guarantees transparency to remotely executing processes. The user can interact with a migrated process in the same manner as before migration took place. The user can continue to provide input to a process and receive output from it in an identical way. The user can also control the execution of the process using the same job control mechanisms provided for controlling local processes. No distinction is made between locally executing and migrated processes when using these job control mechanisms. However, Sprite requires the user-level application to initiate process migration. So for an application to take advantage of process migration, it not only must be aware of migration, but it must also determine when to request migration of subprocesses. Sprite does not automatically migrate processes except for eviction.

Sprite transfers most of the state associated with a process, but still retains some residual dependencies. Sprite transfers virtual memory, open file handles, and execution state. Accesses to files and memory are the most intensive operations, so the elimination of residual dependencies in these areas tremendously improves performance. By restricting migration to the case of homogeneous processor architectures, the execution state transfer becomes simple. Forwarding is required for access to local I/O devices. For message channels between processes, the source machine must arrange to route messages for the migrated process. All signals are forwarded from the source machine. The state transfer and state forwarding mechanisms are implemented transparently. The only visible effect would be a reduction of performance when state forwarding is used instead of state transfer. In one case Sprite is not able to provide transparency, and that is for access to memory-mapped I/O devices. Sprite simply forbids the migration of processes that use memory-mapped I/O. [7]

3.2 MOSIX

MOSIX's process migration mechanism is very similar to Sprite's. Both rely on a common file system to avoid the need to forward file operations. The virtual address space and execution state are transferred. However, unlike Sprite, only active pages are offloaded from the source machine. MOSIX doesn't store the backing store on the distributed file system, so the source system must be consulted when bringing in pages from the backing store. This is different from what occurs in Sprite: Sprite flushes all dirty pages back to the file server, and the migrated process will page fault on every page it accesses, loading all pages from the file server. MOSIX's mechanism thus involves a residual dependency that Sprite's does not, since the source machine must serve requests for virtual memory pages throughout the execution of the migrated process. However, in Sprite those pages would need to be demand-loaded from the file server anyway. And MOSIX reduces the number of page faults during the initial stages of the migrated program's execution, because the active pages are transferred before execution begins. [9]

Unlike Sprite, the MOSIX kernel actively migrates processes using a load balancing algorithm rather than forcing the user-level application to make the request to migrate. This adds more transparency to the process migration mechanism: not only are processes unaware of being migrated, the user also does not need to be aware of the need to migrate. This should also result in greater performance. Since the kernel gathers load information, it will automatically know when a processor becomes idle. In Sprite, user-level programs can only ask which processors are idle at a given time; they cannot arrange to be notified when processors become idle.

3.3 Plan 9

Plan 9 supports transparency for remotely executing processes with the ability to export the local name space to the remote process. Since Plan 9 doesn't support process migration, state transfer is not required; virtual memory is only allocated on the remote system. However, all references to files and devices are forwarded back to the local system using a network RPC protocol called 9P. The local system runs a program called exportfs, which translates 9P calls into system calls on the local machine. A forwarding mechanism is unavoidable for access to local devices. But other distributed systems do not forward accesses to files on distributed file systems to the source system; the data server can be referenced directly. Plan 9's mechanism sacrifices performance for the simplicity of its residual dependencies. [6]

3.4 Clouds

The object-based paradigm of Clouds has many advantages over traditional systems. The view of the system presented to the user, a uniform, flat namespace of objects, is conceptually simple. Object method invocation also provides the simplicity of procedural semantics.

The overhead associated with invocation, however, can incur a substantial performance penalty. There is no shared memory between objects; input and output parameters are pass-by-value. In a thread which produces many invocations, the execution time can become dominated by the copying of parameters necessary for each invocation.

The manner in which a thread invokes a method, enters an object's virtual address space, and then continues execution in the invoked object has the benefit of not creating residual dependencies. The only information needed by the invokee is the set of input parameters from the originating object. [1]
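The pass-by-value invocation cost described above can be made concrete with a toy sketch in which every invocation copies its parameters into, and its result out of, the object's private space (simulated here with `deepcopy`). The class and method names are invented for illustration and are not the real Clouds interface.

```python
import copy

# Toy sketch of Clouds-style invocation: objects share no memory, so every
# invocation copies input and output parameters by value. `CloudsObject`
# and `invoke` are illustrative names, not the real Clouds interface.

class CloudsObject:
    def __init__(self, methods):
        self._methods = methods  # name -> callable run "inside" the object

    def invoke(self, method, *args):
        # Copy inputs in, run the thread segment, copy the result back out.
        local_args = copy.deepcopy(args)
        result = self._methods[method](*local_args)
        return copy.deepcopy(result)

counter = CloudsObject({"add": lambda xs: sum(xs)})
data = [1, 2, 3]
print(counter.invoke("add", data))  # 6; `data` itself is never shared
```

Each `invoke` pays two copies; a thread making many invocations with large parameters spends its time in exactly this copying, which is the performance penalty noted above.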

3.5 Condor

Condor is a software system that runs on top of a UNIX kernel. This provides ease of portability and simplifies the task of operating system design. Complicated features of the operating system not directly related to the distributed aspect, such as device driver support, can be borrowed from the underlying operating system. This reduction in implementation complexity comes at the expense of system performance: placing the distributed mechanisms outside the kernel incurs execution overhead and delay in passing load statistics and in load sharing decisions [11].

When a job is submitted to a remote machine, the user is not required to have an account on that machine. The participating machines agree to allow other users to gain access and use the machines whenever the machines are idle. Since the users have no accounts on the remote machines, they cannot gain access to the remote machines' filesystems. However, when executing a process, it is possible that the process needs to read from and write to a local file. Therefore, Condor needs to provide access to the local filesystem. Condor, then, remotely executes system calls on the home machine when the running process needs access to the filesystem. This residual dependency on the home machine induces communication overhead for file operations.

3.6 Solaris MC

One of the primary goals of Solaris MC is to integrate distributed features into an existing operating system, namely Solaris, with maximum compatibility. The distributed structure of the system is transparent to users and applications. Byte-level compatibility is guaranteed for existing applications. Facilities to utilize remote CPU resources are provided. To minimize the increase in system complexity, modifications to the Solaris kernel are kept to a minimum. In exchange for achieving transparency and minimum complexity, performance is sacrificed. For example, since all file operations and system calls are interposed by the Solaris MC layer, the performance of local file operations and local system calls will be lower. [4]

4 Further Improvements

4.1 Load Balancing Algorithms

An important goal of a distributed system goes beyond the ability to simply share resources. System designers are faced with doing this sharing efficiently and in a way that maximally utilizes resources. Response time and total throughput are the driving forces behind this work. The question then is how to balance the work load across all nodes in the system. Load balancing is also referred to as global scheduling.

Utilizing idle CPU resources via process migration has been discussed in an earlier section. With the exception of MOSIX, the operating systems surveyed generally only provide the migration mechanism, and leave the policy implementation up to the application layer. The operating systems that provide this policy use algorithms that process load information gathered from each node to determine the destination when migrating a process. A standard load metric is the average number of tasks in the ready queue of a processor.

Gathering accurate load information from each node is the primary problem in load balancing. There are two basic models for storing this information: centralized and decentralized. In the centralized model, a central server is used as storage for the load information, and is therefore given the responsibility for scheduling decisions. A fully decentralized system distributes this responsibility among the individual nodes.

Polling to gather load information leads to a great number of messages being transmitted as requests and responses. The problem of high message traffic also occurs when having nodes broadcast their load information. This approach is not scalable. [2]

Lau et al. proposed a solution to the problem of messaging overhead for load information transmission in the decentralized scheduler model. This solution involves the use of anti-tasks and load state vectors. An anti-task is a special type of message that is passed among the computational nodes. The path of an anti-task is determined by the load state vector. An anti-task contains a table in which the entries are the load state values of the nodes that the anti-task has visited. Each of these entries is time-stamped and contains a visited flag. Each node has a table with the same structure, minus the visited flag. The table that the node maintains is called the load state vector, and the table on the anti-task side is called the anti-task's trajectory. When an anti-task visits a node, the information in the two tables is shared to make sure that each contains the most up-to-date information.

Using minimum and maximum threshold load values, a node is categorized as being in a light, normal, or heavy workload state. Lau et al. devised an algorithm, taking into account the information in the trajectory plus the visited flags, which causes anti-tasks to travel spontaneously towards the most heavily loaded nodes. The total information presented by arriving anti-tasks gives the heavily loaded node a highly accurate view of the global state, increasing the chance that the node makes a good load balancing decision. [3]

4.2 Heterogeneous Process Migration

Marvin M. Theimer and Barry Hayes discuss an approach to migrating processes across heterogeneous processor architectures in their paper "Heterogeneous Process Migration by Recompilation". Since it is not possible to migrate the actual execution state of a process, which is machine dependent, the authors propose a method of constructing an equivalent machine-independent state, which can be migrated. However, the approach can only work if the program is itself machine independent.

The technique requires that the compiler generate machine-independent intermediate code along with the machine language code. The machine-independent code describes operations on an abstract machine, while the machine language code describes operations for a physical machine. Compilers can be expected to optimize the machine code for a particular processor type, such that the internal states of the machine-independent code and the machine language code will correspond only at a subset of execution points of a program. Such points are called migration points. When a migration is requested, the program will continue to execute until the next migration point. To keep delays in migration small, we would like to have as many migration points as possible. If we allow a migration delay in the range of seconds, there is room for millions of machine instructions to execute before we would need a migration point. It is necessary for all procedures in the call stack to have reached a state corresponding to an abstract state for the execution point to be considered a migration point.

Once we have reached such a migration point, we must generate an abstract program state. Compilers must generate source-level symbol tables describing the locations of every global and procedure-local variable; this is essentially what debugging features describe. We would use the same technique that a source-level debugger uses to gather the state of all global and procedure-local variables. Now that global and call stack data have been accounted for, we must find the state of the heap. The heap needs to be traced, following each pointer variable in the global and call stack data to find the transitive closure of the objects they point to. We must also be able to interpret every field of a heap object, because data representation conversions may be necessary across platforms; the size and representations of integers and floating point numbers often differ.
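The heap trace described above is a reachability computation. A minimal sketch, using nested dictionaries as a stand-in for heap objects (the representation is ours, not the paper's):

```python
# Sketch: tracing the heap as the transitive closure of objects reachable
# from the global and call-stack pointer variables. Each toy "object" maps
# field names to either plain values or other objects (i.e., pointers).

def trace_heap(roots):
    """Return every heap object reachable from the root pointers."""
    reachable, worklist = [], list(roots)
    while worklist:
        obj = worklist.pop()
        if any(o is obj for o in reachable):   # already traced this object
            continue
        reachable.append(obj)
        for field_value in obj.values():       # interpret every field
            if isinstance(field_value, dict):  # a pointer to another object
                worklist.append(field_value)
    return reachable

# a and b point to each other; c is unreachable from a
a, b, c = {}, {}, {"x": 1}
a["next"] = b
b["prev"] = a
assert len(trace_heap([a])) == 2
```

In a real migrator, the per-field walk is also where the data-representation conversions mentioned above would be applied.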

After accumulating this abstract state, we construct a "migration program" which initializes itself with the machine-independent state and proceeds to execute the rest of the code. This program is recompiled for the target system, and then migrated.

This approach can only be guaranteed to work, in the general case, for languages that do not themselves allow machine-dependent code to be written. The authors believe their approach will work for Modula-2, type-safe Cedar, and Lisp. [8]

This paper preceded the development of Java. Java is designed such that source code is converted into a machine-independent byte code. This byte code runs on top of a Java virtual machine process. Remote execution of Java bytecode will of course require no recompilation, but migrating a Java process requires transferring the state of the virtual machine. The Java virtual machine satisfies exactly the requirements of the abstract machine described by the authors. It is no longer necessary to determine migration points; every execution point in Java bytecode is a migration point. The same mechanism for locating and recording the values of all global, procedure-local, and heap data can be applied to the Java virtual machine.
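As a toy illustration of the migration-program idea (our sketch, not Theimer and Hayes's system), a computation can keep its machine-independent state in a serializable structure, stop at a migration point, and be resumed from that state elsewhere:

```python
import json

# Sketch: the loop's machine-independent state lives in a plain dict,
# is serialized portably (JSON here) at a "migration point", and a fresh
# run on the target initializes itself from that state and continues.

def sum_squares(n, state=None):
    state = state or {"i": 0, "total": 0}
    while state["i"] < n:
        state["total"] += state["i"] ** 2
        state["i"] += 1
        if state["i"] == n // 2:                 # a migration point
            return {"migrated": json.dumps(state)}
    return state["total"]

checkpoint = sum_squares(10)                     # runs until the migration point
resumed = json.loads(checkpoint["migrated"])     # "arrives" at the target machine
assert sum_squares(10, resumed) == sum(i * i for i in range(10))
```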

5 Conclusion

In this paper, we have analyzed distributed operating systems that offer widely varied approaches. On one end of the spectrum, we have Clouds: an operating system designed from the ground up to be distributed. Its programming model is drastically different from the standard methodologies used today, because the programming model has been reinvented to support the idea of disjoint distributed resources. Clouds requires a total rethinking of program design, but at the same time provides the most simple and efficient distributed computing model. Clouds avoids the complexities and overhead associated with transferring state; Clouds only transfers procedure arguments across distributed objects. If distributed computing is the primary feature a programmer is seeking, then the opportunity cost of adapting programs from the traditional and familiar model could be justified. However, a general-purpose programmer who doesn't depend on distributed computing abilities would see Clouds' peculiarities as a nuisance.

On the other end of the spectrum, we have distributed systems like Solaris MC and Condor. Both of these provide distributed computing through a user-level layer that sits above a UNIX operating system. Very minimal changes to the kernel, if any, need be made. The layer provides transparent distributed task management by using existing features of the operating system. The result is a large amount of overhead and a significant reduction in performance. However, maintaining the distributed operating system is simplified. These layered systems can rely on the vendor of the general-purpose operating system to maintain the most volatile components of operating systems, such as support for new hardware devices. Since the underlying operating system caters to a more general user base, the burden of providing operating system support for the small community of distributed computer system users is greatly alleviated.

However, these layer-based distributed systems do have severe performance issues. It would be convenient if there were a way to implement these distributed system features within a popular kernel and still be able to manage this code separately from the rest of the kernel. Sprite attempted to provide an efficient process migration mechanism in its kernel, but chose not to automate it. The main reason for this was that there were many different goals of the Sprite project, only one of which was distributed task management. The Sprite designers wanted to minimize the effect process migration would have on other developing parts of the operating system kernel. So they prevented the kernel from actively migrating processes, allowing developers to test parts of the operating system independently of the effects of process migration.

MOSIX attempts to provide distributed system features by designing kernel extensions to popular operating systems, such as BSD/OS and Linux. The MOSIX developers produce source code patches for particular versions of the Linux kernel, thereby allowing MOSIX to be built as a kernel module. The efficiency of kernel-mode distributed task management support is coupled with the advantage of integration with a widely used and maintained mainstream operating system.

In conclusion, we predict that the approach that MOSIX takes is the most likely to be successful in integrating distributed task management features into mainstream operating systems.

References

[1] Partha Dasgupta, Richard J. LeBlanc, Mustaque Ahamad, Umakishore Ramachandran, "The Clouds Distributed Operating System," IEEE Computer, Volume 24, 1991.

[2] Marvin M. Theimer, Keith A. Lantz, "Finding Idle Machines in a Workstation-based Distributed System," IEEE Trans. on Parallel and Distributed Systems, 1988.

[3] Sau-Ming Lau, Qin Lu, Kwong-Sak Leung, "Dynamic Load Distribution Using Anti-Tasks and Load State Vectors," IEEE Trans. on Parallel and Distributed Systems, 1988.

[4] Yousef A. Khalidi, Jose Bernabeu, Vlada Matena, Ken Shirriff, and Moti Thadani, "Solaris MC: A Multicomputer OS," Proceedings of the 1996 USENIX Conference, January 1996.

[5] Ken Shirriff, "Building Distributed Process Management on an Object-Oriented Framework," USENIX 1997.

[6] Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey, and Phil Winterbottom, "Plan 9 from Bell Labs," 1995.

[7] John K. Ousterhout, Frederick Douglis, "Transparent Process Migration: Design Alternatives and the Sprite Implementation," Software: Practice and Experience, August 1991.

[8] Marvin M. Theimer, Barry Hayes, "Heterogeneous Process Migration by Recompilation," IEEE 11th Int'l Conference on Distributed Computing Systems, 1991.

[9] Amnon Barak, Oren La'adan, "The MOSIX Multicomputer Operating System for High Performance Cluster Computing," 1997.

[10] K.G. Shin and C.-J. Hou, "Design and Evaluation of Effective Load Sharing in Distributed Real-Time Systems," IEEE Trans. on Parallel and Distributed Systems, vol. 5, no. 7, July 1994.

[11] Chao-Ju Hou, Kang G. Shin, "Implementation of Decentralized Load Sharing in Networked Workstations Using the Condor Package," 1994.