Task Management Issues in Distributed Systems
Ahilan Anantha, Maki Sugimoto, Andreas Suryawan, Peter Tran
University of California, San Diego
November 21, 1998
Abstract

One of the main goals of distributed systems is allowing idle processing resources to be utilized. To accomplish this, there must be mechanisms to distribute tasks across machines. We examine the task management mechanisms provided by several distributed operating systems, and analyze their effectiveness.

1 Introduction

A major motivation for constructing a distributed operating system is to coordinate decentralized resources in order to raise the utilization of the system as a whole. It is through the management of tasks that a system is able to optimize the parallelism being offered, thereby increasing utilization.

There are quite a few interesting attributes of distributed operating systems, and notable techniques used in handling them. We examine the following attributes and techniques of task management:

1. Ownership of CPU resources

2. Homogeneous vs heterogeneous environment

3. Remote execution/process migration

4. Namespace transparency

5. Load information and control manager

Design choices made in the systems surveyed reflect a set of tradeoffs considered by the system architect: complexity, residual dependencies, performance, and transparency. [7] We analyze how each operating system copes with these conflicting factors to provide efficiency and maintainability. Finally, we consider improvements to these systems.

2 Techniques and Attributes of Task Management

2.1 Ownership of CPU Resources

The operating systems we discuss in this paper fall into two basic classes of environments: (1) those where machines are "owned" by particular users, such that a machine's processing may be used only when the owner is not using it, and (2) those where there is no notion of ownership of machines, and all processors are available for use by all users. Operating systems of the first class are typically designed for environments of graphical workstations.

The Sprite and Condor operating systems are designed for the lab graphical workstation environment. There is no separate CPU server cluster; the workstations themselves make up the distributed system. A user is expected to interact with the operating system by way of a windowing system, which is a highly CPU and memory intensive interactive process. Interactive processes have randomly fluctuating loads because they act in response to user activity, which itself fluctuates randomly. At the same time, interactive processes have minimum delay requirements because users require real-time response.
Windowing systems pose an even greater problem because they require a large percentage of processing power, while traditional text mode interaction is fairly lightweight. In order to satisfy these delay requirements, it becomes necessary to reserve the maximum amount of CPU resources that a windowing system would require.

In non-distributed systems, the user of a graphical workstation is expected to actively control which processes may run on the system in order to satisfy his delay requirements. If the user of a graphical workstation has ownership of all the user processes that can hog the system resources, he can suspend or terminate the processes that prevent the usability of the console. However, if other users were permitted to easily run processes on remote systems, the console user would lose the ability to control the interactive response time.

For this reason, these operating systems give second-class status to remote processes. Remote processes are only allowed to utilize the resources of a workstation if the workstation is not already busy serving its console user. The CPU resources can be taken back from remote processes if the console user desires them. As such, the console user can be considered the owner of a workstation's CPU resources.

Sprite and Condor will only permit remote processes to run on a system when the system is idle, and will evict remote processes once a user starts using the console.

The other class of distributed systems consists of environments of dedicated CPU servers, data servers, and graphical terminals. The bulk of the processing power in these environments is contained in the CPU servers. The computers with graphical displays are essentially graphical terminals; they have sufficient processing ability to run the windowing system processes but require no more. All other CPU-intensive processes are executed remotely on a CPU server. No user "owns" the CPU server; every user gets a guaranteed share of its resources. Conversely, no remote processes would be allowed on a graphical terminal. The work of trying to determine whether a graphical workstation is idle is unnecessary, since the graphical terminals would have minimal processing resources to offer. All the resources of a graphical terminal can thus be reserved for the windowing system.

Clouds, Alpha, MOSIX, Plan 9, and Solaris MC fall under this category. Solaris MC provides a process migration mechanism for server machines, allowing processes to be migrated across server machines in cases where one server must be brought down for maintenance. In these cases, migrated processes can be permitted to be inefficient and to tax the resources of the hosting system, because the necessity of maintaining services across server disconnections outweighs these factors. Solaris MC therefore suggests the need for a distinction between servers, which are prepared to offer the resources to accept migrated processes, and user workstations, which are not willing to accept the burden of migrated processes. Clouds, Alpha, and Plan 9 are alike in that they consist of separate high-end CPU and data servers. A MOSIX system consists of a large number of commodity workstations, all of which may play an equal part in serving data and processing.

2.2 Remote Execution

Remote execution and process migration are the techniques used in distributed systems to share CPU resources. Remote execution is the ability to create processes on remote machines. Process migration is the ability to relocate processes between nodes in mid-execution.

Plan 9 supports remote execution on CPU servers explicitly specified by the user. Processes cannot be migrated; therefore remotely executed processes spend their entire lifespan on the remote CPU server, from creation to termination. Condor also supports remote execution only.
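The distinction between the two mechanisms can be pictured as a pair of conceptual primitives: remote execution binds a program to a node before it starts, while migration moves an already running process. The following C sketch is purely illustrative; the types and functions shown (node_id, remote_exec, migrate) are hypothetical and do not correspond to the actual interfaces of any of the systems surveyed here.

    /* Illustrative sketch only: hypothetical primitives, not the actual API
     * of any system surveyed here.  It shows the conceptual difference
     * between remote execution (placement chosen before the process starts)
     * and process migration (an already running process is moved).         */
    #include <stdio.h>

    typedef int node_id;
    typedef int proc_id;

    /* Create a process on `target`; it lives there from creation to exit
     * (the Plan 9 / Condor model). */
    static proc_id remote_exec(node_id target, const char *path)
    {
        printf("create %s on node %d\n", path, target);
        return 1;  /* stub process identifier */
    }

    /* Move a running process to `dest` in mid-execution
     * (the Sprite / MOSIX / Solaris MC model). */
    static int migrate(proc_id victim, node_id dest)
    {
        printf("move process %d to node %d\n", victim, dest);
        return 0;
    }

    int main(void)
    {
        proc_id p = remote_exec(7, "/bin/sim");  /* placement decided up front   */
        migrate(p, 3);                           /* placement revised at runtime */
        return 0;
    }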
Sprite supports remote execution through the mechanism of process migration. A request for remote execution of a new process on Sprite is reduced to creating the process locally and attempting to migrate it soon after. In the case of an immediate remote execution, local memory need not be allocated for the code and data if an idle computer is available to start with; Sprite, however, does not provide this optimization, and all execution begins locally.

Process migration is only a request that an application can make. Processes are only permitted to execute on remote workstations that are idle. If there are no idle machines, the request may be denied and the process continues to execute on the local machine. A process therefore cannot be expected to begin execution on a remote system either.

When the computer ceases to be idle, all remote processes must be evicted. Therefore, every process must have the notion of a home machine, which is the machine from which the user invoked the process.

Solaris MC provides both remote execution and process migration mechanisms. A user must explicitly call the rexec system call to carry out remote execution. Unlike Plan 9, the destination node need not be specified. Process migration is assumed to be used mainly for off-loading processes from a node being shut down for maintenance.

MOSIX is the only operating system that makes use of process migration for the purpose of load balancing. Any user process can be migrated at any time to any available node transparently.

2.2.1 Thread Migration in Object-Based Systems

Object-based distributed systems have a different way of organizing resources: they represent them with passive objects. Objects encapsulate code and data. An object's code is executed using a procedural interface called invocation. Objects are large-grained in that they have their own virtual-address space, and there is relatively large overhead associated with the invocation and storage of an object. For these reasons, objects generally implement storage and execution of large-grained data and programs.

Clouds is an example of an object-based distributed operating system. A thread in Clouds is a path of execution made up of a series of calls to object methods. Each call is referred to as an invocation that the object responds to. An object by itself is passive. When a thread invokes an object's method, the thread enters that object's virtual-address space and begins execution. Each invocation of a method is called a segment of the thread that invokes it. This segmentation of threads is how Clouds provides distributed execution.

Note how this differs from the traditional model of processes and process migration. There, a process executes within one virtual-address space unless it is migrated to another node. Migration is expensive, and it is expected not to occur more than once or twice.

In Clouds, migration takes place with object granularity. That is, as a thread proceeds, it may invoke objects on different nodes. A thread's path of execution necessarily crosses through all of these nodes. It is the strategic placement of objects that would be used as a mechanism for load balancing.

Since there is no address space associated with each thread, objects on remote or local nodes can be invoked with the same semantics. Threads can cross node boundaries with the minimum penalty of network overheads.

2.3 Namespace Transparency

A required feature of distributed systems is hiding from a process the fact that it is executing remotely or locally. This transparency should also be maintained with regard to the user. The user should be able to interact with the process in the same way as in the local case, regardless of where the code is executing.

Maintaining this transparency requires changes to the traditional operating system model. Transparency refers to the distributed operating system giving each process a single, uniform view of resources, including the filesystem and I/O devices, regardless of which computer it is running on. The design and implementation of the space of accessible resources, or namespace, directly affects the management of the distribution of processes.
Many operating systems achieve this transparency by the enforcement of a uniform, global namespace. The filesystem will appear the same to every process on every node. One solution follows from mounted filesystems in traditional UNIX: a namespace is constructed as a union of mounted file systems.

Object-based distributed systems provide a different solution. There is no notion of a traditional filesystem, only objects. Resources are encapsulated in objects, and naming takes place with object granularity. This provides a flat namespace. At the system level, all objects are identified by a globally unique bit string. A user-level name service is provided to translate user-registered names to system-level names.

Sprite employs the uniform global namespace model. File servers provide domains, similar to UNIX filesystems, that are mounted as subdomains of each other, with one domain selected as the topmost, or root, domain. This view of the domain hierarchy is the same for every computer in the cluster. This can be contrasted with Sun's NFS, where every client may choose the local mount point of a remote filesystem. In Sprite, the remote file server decides the mount point all the clients must use. Among other advantages, this guarantees that every file in the distributed filesystem has a single globally defined pathname, which makes it possible to migrate programs that manipulate files. [7]

Solaris MC also employs the uniform global namespace model. The Solaris MC file system, which is built on top of the existing Solaris file system, interposes all file operations and forwards them to the server where the file actually resides. Any process can open a file located anywhere in the system using the same pathname, thus allowing programs to be located on arbitrary nodes. [4]

The object-oriented semantics of Clouds provides a flat namespace along with global accessibility: any thread can reference any object. This is also essentially Clouds' mechanism for distributed shared memory. Access to an object's data is allowed only through invocation, and input and output parameters are pass-by-value only. This protects the internal environment of an object. Capabilities-based protection is provided for controlling global accesses to objects. [1]

Plan 9 has an interesting policy for managing name spaces. Every client process can have a local namespace, which has the same semantics as a localized filesystem interface. User-level servers in Plan 9 have the ability to "export" filesystem interfaces to their clients. In fact, these exported filesystem interfaces are the primary means by which Plan 9 servers export all their resources. Some of the objects in these name spaces may refer to globally distinct files in the distributed filesystem, but some may refer to a local copy of a global resource.

For example, the same global Plan 9 filesystem can be used by clients of different processor architectures. A user on different systems may refer to a binary executable using a common pathname, such as /bin/date, but the actual binary file that is utilized will depend on the processor architecture. Devices stored in /dev will refer to devices in the local name space. Some of these devices may refer to actual kernel-recognized devices, or they may refer to pseudo-device interfaces which user-level servers export. For example, a window in the Plan 9 windowing system exports the devices /dev/mouse, /dev/bitblt, and /dev/cons, which refer to the mouse, bitmapped display interface, and character mode console interface. Each window will export the same devices in its name space, but the actual device files are local copies of the pseudo devices exported by the Plan 9 windowing system. The windowing system will multiplex accesses to the actual physical devices.

To support the ability to run processes on remote servers and have them appear to be running locally, Plan 9 provides the ability to export the local name space to a remotely executing process. The remotely executing process will then have the same view of the filesystem as it would have had executing locally. And it will have access to the same devices, real or virtual, as on the local system, because these are also exported as part of the local name space. [6]
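The idea of a per-process namespace built as a union of bindings can be made concrete with a small table-lookup sketch. This is a conceptual illustration only; the binding structure, the example servers, and the resolve function below are hypothetical and do not reflect Plan 9's actual data structures or mount semantics.

    /* Conceptual sketch only: a per-process namespace represented as an
     * ordered table of path-prefix bindings, in the spirit of the
     * per-process name spaces described above.  The structures and names
     * here are hypothetical, not Plan 9's implementation.                 */
    #include <stdio.h>
    #include <string.h>

    struct binding {
        const char *prefix;   /* e.g. "/dev" or "/bin"                  */
        const char *server;   /* which file server answers this prefix  */
    };

    /* One table per process; more specific entries shadow general ones. */
    static const struct binding ns[] = {
        { "/dev", "window system (local pseudo-devices)" },
        { "/bin", "global file server (arch-specific binaries)" },
        { "/",    "global file server" },
    };

    /* Resolve a pathname by longest matching prefix. */
    static const char *resolve(const char *path)
    {
        size_t best = 0;
        const char *srv = "unknown";
        for (size_t i = 0; i < sizeof ns / sizeof ns[0]; i++) {
            size_t n = strlen(ns[i].prefix);
            if (strncmp(path, ns[i].prefix, n) == 0 && n >= best) {
                best = n;
                srv = ns[i].server;
            }
        }
        return srv;
    }

    int main(void)
    {
        printf("/dev/mouse -> %s\n", resolve("/dev/mouse"));
        printf("/bin/date  -> %s\n", resolve("/bin/date"));
        return 0;
    }

Exporting the local name space to a remote process amounts to shipping (or serving) a table like this one, so that the remote process resolves the same names to the same servers as it would at home.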
2.4 Homogeneous vs Heterogeneous Environments

Many distributed operating systems can be run in heterogeneous environments. However, all of the operating systems that allow process migration have imposed the requirement that all computers involved in process migration have the same processor architecture.

The primary obstacle to heterogeneity is that the execution state of a process is highly architecture dependent. When the source and destination systems are of the same architecture, the code and data segments, registers, stack, and heap can simply be copied without any changes. With differing processor architectures, all of these might need to be significantly modified. Such modification is likely to be expensive and will add significant complexity to the system.

The operating systems we've discussed that support process migration (MOSIX, Sprite, Solaris MC) have the requirement that all machines accepting migrated processes be of the same processor type.

Clouds' distributed execution model also does not explicitly support heterogeneity. Objects on the data servers are stored in a single machine language, so heterogeneous CPU servers would require the machine code to be converted from one language to another. This is a complication that would break the symmetry of the Clouds system, and it would be expensive to carry out.

Many of these operating systems will permit a data server to be of a different processor architecture, since migration would never take place there.

Plan 9's CPU servers can be heterogeneous. Each program is compiled beforehand for the architecture on which it is intended to execute. This prohibits the implementation of process migration in Plan 9.

2.5 Load Information and Control Manager

Distributed operating systems, by their nature, pool processing resources together. Access to common processing resources must be mediated by some entity or entities. The determination of which process will execute on which processor we term task distribution management, and the entities that make this determination we term task distribution managers. Task distribution decision making may be centralized onto one manager or decentralized onto many managers.

One disadvantage of centralized management is its inherent lack of scalability. The overhead associated with maintaining all the load information and making choices among all the nodes grows with the number of nodes. Another disadvantage is that the failure of the central node brings down the whole mechanism.

We can decentralize this decision making by giving a number of nodes the ability to act as task distribution managers. Each manager would control a partition of the nodes in the system. In this configuration, each managing node essentially becomes the central manager for a smaller distributed system [9]. It can make its own decisions to utilize processors in its partition of participating machines.

A task distribution manager accumulates the load information of the nodes in the partition it controls, and uses this information to choose the processor where a task should run.

In the Sprite system, every Sprite machine runs a background process called the "load-average daemon", which monitors the usage of the machine. When the machine appears idle, the daemon notifies the "central migration server" that the machine is prepared to accept migrated processes. User processes that invoke migration call a standard library routine, Mig_RequestIdleHosts, to obtain a list of idle hosts, and then reference the chosen host identifier in the migrate-process system call. The central migration server maintains its database in virtual memory, to avoid the overhead of remote filesystem operations. The load-average daemons and the library routine Mig_RequestIdleHosts communicate with the server using a message protocol. Sprite decides that a machine is idle if and only if (a) it has had no keyboard or mouse input for at least 30 seconds, and (b) there are, on average, fewer runnable processes than processors. This decision was made purely heuristically; originally the input threshold was 5 minutes. The Sprite designers chose not to determine the most efficient utilization of idle hosts, because there were plenty of idle hosts available. [7]
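The client side of this arrangement can be pictured with a short sketch. Only the routine name Mig_RequestIdleHosts comes from the Sprite description above; the signatures, the stub behaviour, and the mig_migrate_proc wrapper are assumptions made for illustration, not the actual Sprite interfaces.

    /* Illustrative sketch with stubbed-out calls.  Only the routine name
     * Mig_RequestIdleHosts is taken from the Sprite description above; the
     * signatures and the mig_migrate_proc wrapper are assumptions.        */
    #include <stdio.h>

    typedef int host_id;
    typedef int proc_id;

    /* Stub: ask the central migration server for up to `max` idle hosts. */
    static int Mig_RequestIdleHosts(host_id *hosts, int max)
    {
        if (max < 1) return 0;
        hosts[0] = 42;          /* pretend host 42 reported itself idle */
        return 1;
    }

    /* Stub standing in for the migrate-process system call. */
    static int mig_migrate_proc(proc_id pid, host_id destination)
    {
        printf("migrating process %d to host %d\n", pid, destination);
        return 0;
    }

    int main(void)
    {
        host_id hosts[8];
        int n = Mig_RequestIdleHosts(hosts, 8);
        if (n <= 0) {
            puts("no idle hosts: process keeps running locally");
            return 0;
        }
        /* The request is advisory: the chosen host may refuse, or may later
         * evict the process when its console user returns.                 */
        return mig_migrate_proc(1234, hosts[0]);
    }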
MOSIX is fully decentralized; every node acts as a task distribution manager. At regular intervals, each node sends information about its available resources to a randomly chosen partition of nodes [9]. Each node therefore maintains load information only for a random partition of the nodes, and will choose nodes among this set as the destination of a process migration. The use of randomness supports scaling and dynamic configuration [9].

3 Tradeoff Comparisons

The design of distributed operating systems involves making tradeoffs among four factors: transparency, residual dependencies, performance, and complexity. Perfect transparency would mean that both the user and the process act the same way toward a remotely executing process as toward a local one; neither the user nor the process need be aware of the fact that a process has been migrated. If remote execution leaves residual dependencies, the source machine must continue to provide services to the remotely executing process. By performance, we mean that the remote execution mechanism should induce only minimal overheads in processing and allocation: the delay associated with initiating remote execution, or migrating a process, should be low, and remotely executing processes should perform as efficiently as locally executing ones. The complexity of the remote execution mechanism becomes important because it could potentially affect every piece of the operating system kernel. Depending on the relative importance of the remote execution mechanism to the designers of these operating systems, complexity may be limited for the sake of maintainability. [7]

These factors conflict with each other. High transparency is likely to require more complexity and residual dependencies. Residual dependencies affect performance because of the high delays associated with forwarding. A fast migration process may involve the use of residual dependencies to avoid the transfer of state; this can reduce the performance of the execution of the remote process. [7]

3.1 Sprite

The Sprite operating system guarantees transparency to remotely executing processes. The user can interact with a migrated process in the same manner as before migration took place. The user can continue to provide input to a process and receive output from it in an identical way. The user can also control the execution of the process using the same job control mechanisms provided for controlling local processes; no distinction is made between locally executing and migrated processes when using these job control mechanisms. However, Sprite requires the user-level application to initiate process migration. So for an application to take advantage of process migration, it not only must be aware of migration but must also determine when to request migration of subprocesses. Sprite does not automatically migrate processes except for eviction.

Sprite transfers most of the state associated with a process, but still retains some residual dependencies. Sprite transfers virtual memory, open file handles, and execution state. File and memory accesses are the most intensive operations, so eliminating residual dependencies in these areas tremendously improves performance. By restricting migration to the case of homogeneous processor architectures, the execution state transfer becomes simple. Forwarding is required for access to local I/O devices. For message channels between processes, the source machine must arrange to route messages for the migrated process. All signals are forwarded from the source machine.
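One of these residual dependencies, signal forwarding by the home machine, can be sketched as follows. The structure and functions shown are hypothetical illustrations of the general idea, not Sprite's actual implementation.

    /* Illustrative sketch of home-machine signal forwarding, one of the
     * residual dependencies described above.  The types and functions are
     * hypothetical, not Sprite's implementation.                          */
    #include <stdio.h>

    struct proc_record {
        int pid;
        int migrated;       /* nonzero if the process now runs elsewhere */
        int current_host;   /* where it runs now                         */
    };

    /* Stub: ship the signal over the network to the host now running pid. */
    static void send_remote_signal(int host, int pid, int sig)
    {
        printf("forward signal %d for pid %d to host %d\n", sig, pid, host);
    }

    /* Stub: ordinary local delivery. */
    static void deliver_local_signal(int pid, int sig)
    {
        printf("deliver signal %d to local pid %d\n", sig, pid);
    }

    /* The home machine keeps an entry for each emigrated process and
     * forwards signals addressed to it, so senders need not know where
     * the process currently runs.                                        */
    static void post_signal(struct proc_record *p, int sig)
    {
        if (p->migrated)
            send_remote_signal(p->current_host, p->pid, sig);
        else
            deliver_local_signal(p->pid, sig);
    }

    int main(void)
    {
        struct proc_record p = { 1234, 1, 7 };
        post_signal(&p, 15);   /* the signal follows the migrated process */
        return 0;
    }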
The state transfer and state forwarding mechanisms are implemented transparently. The only visible effect would be a reduction of performance when state forwarding is used instead of state transfer. In one case Sprite is not able to provide transparency, and that is for access to memory-mapped I/O devices: Sprite simply forbids the migration of processes that use memory-mapped I/O. [7]

3.2 MOSIX

MOSIX's process migration mechanism is very similar to Sprite's. Both rely on a common file system to avoid the need to forward file operations. Virtual address space and execution state are transferred. However, unlike Sprite, only active pages are off-loaded from the source machine. MOSIX does not store the backing store on the distributed file system, so the source system must be consulted when bringing in pages from the backing store. This is different from what occurs in Sprite: Sprite flushes all dirty pages back to the file server, so the migrated process will page fault on every page it accesses and load all pages from the file server. MOSIX's mechanism involves a residual dependency that Sprite does not, since the source machine must serve requests for virtual memory pages throughout the execution of the migrated process. However, in Sprite those pages would need to be demand loaded from the file server anyway. And MOSIX reduces the number of page faults during the initial stages of the migrated program's execution, because the active pages are transferred before execution begins. [9]

Unlike Sprite, the MOSIX kernel actively migrates processes using a load balancing algorithm, rather than forcing the user-level application to make the request to migrate. This adds more transparency to the process migration mechanism. Not only are processes unaware of being migrated, the user also does not need to be aware of the need to migrate. This should also result in greater performance. Since the kernel gathers load information, it will automatically know when a processor becomes idle. In Sprite, user-level programs can only ask which processors are idle at a given time; they cannot arrange to be notified when processors become idle.

3.3 Plan 9

Plan 9 supports transparency for remotely executing processes with the ability to export the local name space to the remote process. Since Plan 9 does not support process migration, state transfer is not required; virtual memory is only allocated on the remote system. However, all references to files and devices are forwarded back to the local system using a network RPC protocol called 9P. The local system runs a program called exportfs, which translates 9P calls into system calls on the local machine. A forwarding mechanism is unavoidable for access to local devices. But other distributed systems do not forward accesses to files on distributed file systems to the source system; the data server can be referenced directly. Plan 9's mechanism sacrifices performance for the simplicity of its residual dependencies. [6]

3.4 Clouds

The object-based paradigm of Clouds has many advantages over traditional systems. The view of the system presented to the user, a uniform, flat namespace of objects, is conceptually simple. Object method invocation also provides the simplicity of procedural semantics.

The overhead associated with invocation, however, can incur a substantial performance penalty. There is no shared memory between objects. Input and output parameters are pass-by-value. In a thread which produces many invocations, the execution time can become dominated by the copying of parameters necessary for each invocation.

The manner in which a thread invokes a method, enters an object's virtual address space, and then continues execution in the invoked object has the benefit of not creating residual dependencies. The only information needed by the invokee is the set of input parameters from the originating object. [1]
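A pass-by-value invocation interface of this kind can be sketched as follows. The names (object_id, invoke) and the marshalling shown are hypothetical illustrations of the idea, not the actual Clouds system interface.

    /* Conceptual sketch of object invocation with pass-by-value parameters.
     * The names are hypothetical; this is not the real Clouds interface.  */
    #include <stdio.h>
    #include <string.h>

    typedef int object_id;

    /* Input and output parameters are copied in full: the caller and the
     * invoked object never share memory, which is why no residual
     * dependency on the caller's node is left behind.                     */
    static int invoke(object_id obj, const char *method,
                      const void *in, size_t in_len,
                      void *out, size_t out_len)
    {
        printf("thread enters object %d, runs %s (%zu bytes in, %zu out)\n",
               obj, method, in_len, out_len);
        (void)in;
        memset(out, 0, out_len);   /* stub: pretend the method produced output */
        return 0;
    }

    int main(void)
    {
        int deposit = 100;
        int balance = 0;
        /* One segment of the thread: it executes on whichever node holds
         * object 7; only the copied parameters cross the node boundary.   */
        invoke(7, "account.deposit", &deposit, sizeof deposit,
               &balance, sizeof balance);
        return 0;
    }

The cost noted above is visible here: every invocation pays for copying its parameters, so a thread that makes many fine-grained invocations spends much of its time marshalling.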
3.5 Condor

Condor is a software system that runs on top of a UNIX kernel. This provides ease of portability and simplifies the task of operating system design. Complicated features of the operating system not directly related to the distributed aspect, such as device driver support, can be borrowed from the underlying operating system. This reduction in implementation complexity comes at the expense of system performance: placing the distributed mechanisms outside the kernel incurs execution overhead and delay in passing load statistics and in making load sharing decisions [11].

When a job is submitted to a remote machine, the user is not required to have an account on the remote machine. The participating machines agree to allow other users to gain access and use the machines whenever the machines are idle. Since the users have no accounts on the remote machines, they cannot gain access to the remote machines' filesystems. However, an executing process may need to read and write a local file, so Condor needs to provide access to the local filesystem. Condor therefore remotely executes system calls on the home machine when the running process needs access to the filesystem. This residual dependency on the home machine induces communication overhead for file operations.

3.6 Solaris MC

One of the primary goals of Solaris MC is to integrate distributed features into an existing operating system, namely Solaris, with maximum compatibility. The distributed structure of the system is transparent to users and applications. Byte-level compatibility is guaranteed for existing applications. Facilities to utilize remote CPU resources are provided. To minimize the increase in system complexity, modifications to the Solaris kernel are kept to a minimum. In exchange for achieving transparency and minimum complexity, performance is sacrificed: for example, since all file operations and system calls are interposed by the Solaris MC layer, the performance of local file operations and local system calls will be lower. [4]

4 Further Improvements

4.1 Load Balancing Algorithms

An important goal of a distributed system goes beyond the ability to simply share resources. System designers are faced with doing this sharing efficiently and in a way that maximally utilizes resources. Response time and total throughput are the driving forces behind this work. The question then is how to balance the workload across all nodes in the system. Load balancing is also referred to as global scheduling.

Utilizing idle CPU resources via process migration has been discussed in an earlier section. With the exception of MOSIX, the operating systems surveyed generally provide only the migration mechanism, and leave the policy implementation up to the application layer. The operating systems that provide this policy use algorithms that process load information gathered from each node to determine the destination when migrating a process. A standard load metric is the average number of tasks in the ready queue of a processor.

Gathering accurate load information from each node is the primary problem in load balancing. There are two basic models for storing this information: centralized and decentralized. In the centralized model, a central server is used as the storage site of the load information, and therefore is given the responsibility for scheduling decisions. A fully decentralized system distributes this responsibility among the individual nodes.

Another issue in load balancing is the method of transmission of the load information. One method is to have nodes broadcast this information from time to time. Another method uses polling of the nodes for this information.
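A minimal sketch of the broadcast model makes these pieces concrete: each node periodically announces its ready-queue length, and a manager picks the least loaded node as a migration destination. This is a generic illustration of the idea, not the algorithm of any surveyed system; the function names and table are hypothetical.

    /* Generic illustration of broadcast-style load sharing; not the
     * algorithm of any particular system surveyed here.                */
    #include <stdio.h>

    #define NNODES 4

    /* Load table kept by a task distribution manager: one entry per node,
     * holding the last announced load (average ready-queue length).      */
    static double load[NNODES];

    /* Called when a node's periodic broadcast arrives. */
    static void on_broadcast(int node, double ready_queue_len)
    {
        load[node] = ready_queue_len;
    }

    /* Pick the destination for the next migrated task: least loaded node. */
    static int pick_destination(void)
    {
        int best = 0;
        for (int i = 1; i < NNODES; i++)
            if (load[i] < load[best])
                best = i;
        return best;
    }

    int main(void)
    {
        on_broadcast(0, 2.5);
        on_broadcast(1, 0.0);   /* an idle node */
        on_broadcast(2, 1.0);
        on_broadcast(3, 3.0);
        printf("migrate next task to node %d\n", pick_destination());
        return 0;
    }

The message cost of keeping such a table current is exactly the problem taken up next.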
Polling to gather load information leads to a great number of messages being transmitted as requests and responses. The problem of high message traffic also occurs when having nodes broadcast their load information. This approach is not scalable. [2]

Lau et al. proposed a solution to the problem of messaging overhead for load information transmission in the decentralized scheduler model. This solution involves the use of anti-tasks and load state vectors. An anti-task is a special type of message that is passed among the computational nodes; the path of an anti-task is determined by the load state vector. An anti-task contains a table in which the entries are the load state values of the nodes that the anti-task has visited. Each of these entries is time-stamped and contains a visited flag. Each node has a table with the same structure, minus the visited flag. The table that the node maintains is called the load state vector, and the table on the anti-task side is called the anti-task's trajectory. When an anti-task visits a node, the information in the two tables is shared to make sure that each table contains the most up-to-date information.

Using minimum and maximum threshold load values, a node is categorized as being in a light, normal, or heavy workload state. Lau et al. devised an algorithm that takes into account the information in the trajectory (the load state information plus the visited flags) and causes anti-tasks to travel spontaneously towards the most heavily loaded nodes. The total information presented by arriving anti-tasks gives the heavily loaded node a highly accurate view of the global state, increasing the chance that the node makes a good load balancing decision. [3]

4.2 Heterogeneous Process Migration

Marvin M. Theimer and Barry Hayes discuss an approach to migrating processes across heterogeneous processor architectures in their paper "Heterogeneous Process Migration by Recompilation". Since it is not possible to migrate the actual execution state of a process, which is machine dependent, the authors propose a method of constructing an equivalent machine-independent state, which can be migrated. However, the approach can only work if the program is itself machine independent.

The technique requires that the compiler generate machine-independent intermediate code along with the machine language code. The machine-independent code describes operations on an abstract machine, while the machine language code describes operations for a physical machine. Compilers can be expected to optimize the machine code for a particular processor type, such that the internal states of the machine-independent code and the machine language code will correspond only at a subset of execution points of a program. Such points are called migration points. When a migration is requested, the program will continue to execute until the next migration point. To keep delays in migration small, we would like to have as many migration points as possible. If we allow a migration delay in the range of seconds, there is room for millions of machine instructions to execute before we would need a migration point. For an execution point to be considered a migration point, it is necessary for all procedures in the call stack to have reached a state corresponding to an abstract state.

Once we have reached such a migration point, we must generate an abstract program state. Compilers must generate source-level symbol tables describing the locations of every global and procedure-local variable; this is essentially what debugging features describe. We would use the same technique that a source-level debugger uses to gather the state of all global and procedure-local variables. Now that global and call stack data have been accounted for, we must find the state of the heap. The heap needs to be traced, following each pointer variable in the globals and the call stack to find the transitive closure of the objects they point to. We must also be able to interpret every field of a heap object, because data representation conversions may be necessary across platforms (the sizes and representations of integers and floating-point numbers often differ).
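How a compiler-chosen migration point might be used at run time can be sketched as follows. This is an illustration of the idea only; the flag, the capture_abstract_state routine, and the placement of checks are hypothetical and are not the mechanism implemented in the paper.

    /* Illustration only: a compiler could plant a cheap check at each point
     * where the machine state and the abstract-machine state coincide.
     * The names below are hypothetical.                                   */
    #include <stdio.h>

    static volatile int migration_requested = 0;  /* set when a move is asked for */

    /* Stub: walk symbol tables and the heap, emitting a machine-independent
     * description of globals, stack variables, and reachable heap objects. */
    static void capture_abstract_state(void)
    {
        puts("capturing machine-independent program state...");
    }

    /* The check the compiler would emit at every migration point. */
    static void migration_point(void)
    {
        if (migration_requested) {
            capture_abstract_state();
            /* the "migration program" would now be rebuilt and resumed on
             * the target architecture                                      */
        }
    }

    int main(void)
    {
        for (long i = 0; i < 3; i++) {
            /* ... millions of ordinary instructions may run here ... */
            migration_point();   /* state is only consistent at these points */
            if (i == 1)
                migration_requested = 1;   /* simulate an external request */
        }
        return 0;
    }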
After accumulating this abstract state, we construct a "migration program" which initializes itself with the machine-independent state and proceeds to execute the rest of the code. This program is recompiled for the target system, and then migrated.

This approach can only be guaranteed to work, in the general case, for languages that do not themselves allow machine-dependent code to be written. The authors believe their approach will work for Modula-2, type-safe Cedar, and Lisp. [8]

This paper preceded the development of Java. Java is designed such that source code is converted into a machine-independent byte code, which runs on top of a Java virtual machine process. Remote execution of Java bytecode will of course require no recompilation, but migrating a Java process requires transferring the state of the virtual machine. The Java virtual machine satisfies exactly the requirements of the abstract machine described by the authors. It is no longer necessary to determine migration points; every execution point in Java bytecode is a migration point. The same mechanism for locating and recording the values of all global, procedure-local, and heap data can be applied to the Java Virtual Machine.

5 Conclusion

In this paper, we have analyzed distributed operating systems that offer widely varied approaches. On one end of the spectrum, we have Clouds: an operating system designed from the ground up to be distributed. Its programming model is drastically different from standard methodologies used today, because the programming model has been reinvented to support the idea of disjoint distributed resources. Clouds requires a total rethinking of program design, but at the same time provides the simplest and most efficient distributed computing model. Clouds avoids the complexities and overhead associated with transferring state; it only transfers procedure arguments across distributed objects. If distributed computing is the primary feature a programmer is seeking, then the opportunity cost of adapting programs from the traditional and familiar model could be justified. However, a general-purpose programmer who does not depend on distributed computing abilities would see Clouds' peculiarities as a nuisance.

On the other end of the spectrum, we have distributed systems like Solaris MC and Condor. Both of these provide distributed computing through a user-level layer that sits above a UNIX operating system. Very minimal changes, if any, to the kernel need be made. The layer provides transparent distributed task management by using existing features of the operating system. The result is a large amount of overhead and a significant reduction in performance. However, the maintenance of the distributed operating system is simplified. These layered systems can rely on the vendor of the general-purpose operating system to maintain the most volatile components of an operating system, such as support for new hardware devices. Since the underlying operating system caters to a more general user base, the burden of providing operating system support for the small community of distributed computer system users is greatly alleviated.

However, these layer-based distributed systems do have severe performance issues. It would be convenient if there were a way to implement these distributed system features within a popular kernel and still be able to manage this code separately from the rest of the kernel. Sprite attempted to provide an efficient process migration mechanism in its kernel, but chose not to automate it. The main reason for this was that there were many different goals of the Sprite project, only one of which was distributed task management. The Sprite designers wanted to minimize the effect process migration would have on other developing parts of the operating system kernel. So they prevented the kernel from actively migrating processes, so that they could allow developers to test parts of the operating system independently of the effects of process migration.
MOSIX attempts to provide distributed system features by designing kernel extensions to popular operating systems, such as BSD/OS and Linux. The MOSIX developers produce source code patches for particular versions of the Linux kernel, thereby allowing MOSIX to be built as a kernel module. The efficiency of kernel-mode distributed task management support is coupled with the advantage of integration with a widely used and maintained mainstream operating system.

In conclusion, we predict that the approach that MOSIX takes is the most likely to be successful in integrating distributed task management features into mainstream operating systems.

References

[1] Partha Dasgupta, Richard J. LeBlanc, Mustaque Ahamad, Umakishore Ramachandran, "The Clouds Distributed Operating System," IEEE Computer, Volume 24, 1991.

[2] Marvin M. Theimer, Keith A. Lantz, "Finding Idle Machines in a Workstation-based Distributed System," IEEE Trans. on Parallel and Distributed Systems, 1988.

[3] Sau-Ming Lau, Qin Lu, Kwong-Sak Leung, "Dynamic Load Distribution Using Anti-Tasks and Load State Vectors," IEEE Trans. on Parallel and Distributed Systems, 1988.

[4] Yousef A. Khalidi, Jose Bernabeu, Vlada Matena, Ken Shirriff, and Moti Thadani, "Solaris MC: A Multicomputer OS," Proceedings of the 1996 USENIX Conference, January 1996.

[5] Ken Shirriff, "Building Distributed Process Management on an Object-Oriented Framework," USENIX 1997.

[6] Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey, and Phil Winterbottom, "Plan 9 from Bell Labs," 1995.

[7] John K. Ousterhout, Frederick Douglis, "Transparent Process Migration: Design Alternatives and the Sprite Implementation," Software--Practice & Experience, August 1991.

[8] Marvin M. Theimer, Barry Hayes, "Heterogeneous Process Migration by Recompilation," IEEE 11th Int'l Conference on Distributed Computing Systems, 1991.

[9] Amnon Barak, Oren La'adan, "The MOSIX Multicomputer Operating System for High Performance Cluster Computing," 1997.

[10] K.G. Shin and C.-J. Hou, "Design and Evaluation of Effective Load Sharing in Distributed Real-Time Systems," IEEE Trans. on Parallel and Distributed Systems, vol. 5, no. 7, July 1994.

[11] Chao-Ju Hou, Kang G. Shin, "Implementation of Decentralized Load Sharing in Networked Workstations Using the Condor Package," 1994.