<<

Evolving to a Migrating Mo del

Bryan Ford Jay Lepreau

University of Utah

Abstract

Wehave mo died Mach to treat crossdomain remote pro cedure call RPC as a single

entity instead of a sequence of op erations With RPC thus elevated we improved

the transfer of control during RPC bychanging the thread mo del Like most op erating systems

Mach views threads as statically asso ciated with a single task with two threads involved in an

RPC An alternate mo del is that of migrating threads in which during RPC a single thread

abstraction moves b etween tasks with the logical owofcontrol and server co de is passively

executed Wehave compatibly replaced Machs static threads with migrating threads in an

attempt to isolate this asp ect of op erating system design and implementation The key elementof

our design is a decoupling of the thread abstraction into the execution context and the schedulable

thread of control consisting of a chain of contexts A key element of our implementation is that

threads are now based in the kernel and temp orarily make excursions into tasks via up calls

The new system provides more precisely dened semantics for thread manipulation and additional

control op erations allows scheduling and accounting attributes to follow threads simplies kernel

co de and improves RPC p erformance Wehave retained the old thread and IPC interfaces for

backwards compatibility with no changes required to existing client programs and only a minimal

change to servers as demonstrated by a functional single server and clients The logical

complexity along the critical RPC path has b een reduced by a factor of nine Lo cal RPC doing

normal marshaling has sp ed up by factors of We conclude that a migratingthread mo del

is sup erior to a static mo del that kernelvisible RPC is a prerequisite for this improvement and

that it is feasible to improve existing op erating systems in this manner

Intro duction and Overview

We b egin by dening and explaining four concepts that are key to this pap er They are kernel and user

threads remote procedurecal l static threadsandmigrating threadsWe explain howkernel threads interact

in implementing RPC and the dierence b etween implementing RPC with static and migrating threads

Threads As the term is used in most op erating systems and thread packages conceptually a thread is

a sequential ow of control In traditional Unix a single pro cess contains only a single kernelprovided

thread Mach and many other mo dern op erating systems supp ort multiple threads p er pro cess p er task in

Mach terminology called kernel threads They are distinguished from user threadsprovided by userlevel

thread packages which implementmultiple threads of control atop kernelprovided threads by manipulation

of the program counter and stack from userspace In the rest of this pap er we use the term thread to

refer to a kernel thread unless qualied

In most op erating systems a thread includes muchmorethantheowofcontrol For example in

Mach a thread also i is the schedulable entity with priority and scheduling p olicy attributes ii

contains resourceaccounting statistics suchasaccumulated CPU time iii contains the execution context of

a computationthe state of the registers program counter stackpointer and references to the containing

task and designated exception handler iv provides the point of thread control visible to user programs



through a thread control port

This researchwas sp onsored in part by the HewlettPackard ResearchGrants Program and by the OSF Research Institute



In Mach a port is a kernel entity that is a capabilitya communicationchannel and a name If one has the name of a p ort

RPC Remote procedurecal l as the name suggests mo dels the pro cedure call abstraction but is imple

mented b etween dierent tasks The ow of control is temp orarily moved to another lo cation the pro ce

dure b eing called and later returned to the original p oint and continued RPC can b e used b etween remote

computing no des but is often used b etween tasks on the same no de local RPC This pap er fo cuses only on

the lo cal case in the rest of the pap er for brevitywe use the unqualied term RPC to refer only to lo cal

RPC more restricted than the usual use of the term When a thread in a client task needs service provided

by another task such as op ening a le it bundles up a packet of data containing everything the service

provider the server or le system in this case needs to pro cess the request This is known as marshaling a

message The client thread then invokes the kernel to copy the message into the servers address space and

allow the server to start pro cessing it Some time later the server returns its results to the client in another

message allowing the clienttocontinue pro cessing

It is imp ortant to note that although virtually all mo dern op erating systems provide an RPC abstraction

at some level there is a continuum in that supp ort Some OSs supp ort RPC as a fully kernelvisible entity

Amo eba some provide a kernel interface and sp ecial optimizations for the combined message send and

receiveinvolved in RPC but fundamentally understand only oneway message passing Mach while some

supp ort RPC only through libraries layered on other interpro cess communication IPC facilities Unix

Static Threads The dierence b etween static and migrating threads lies in the way the control transfer

between client pro cessing and server pro cessing is implemented In RPC based on static threads twoentirely

separate threads are involved one conned to or static in the client task and the other conned to the

server task When the clientinvokes the kernel to start an RPC the kernel not only copies the clients

message into the server but also puts the clients thread to sleep and wakes up a thread in the server to

pro cess the message This thread known as a servicethreadwas created previously by the server for the sole

purp ose of waiting around for RPC requests and pro cessing them When the service thread is nished with

the request it invokes the kernel whichputstheserver thread back to sleep and wakes up the client thread

again In switching control from one thread to another a full context switch is involveda change of address



mappings task thread stack registers priority etc including an invo cation of the kernel scheduler

With RPC based on static threads the servers computational resources the right to use the CPU are

used to provide service to the client In the ob jectoriented world this is known as an active ob ject

mo del b ecause a server ob ject contains threads that actively provide service

Migrating Threads If RPC is fully visible to the kernel an alternate mo del of control transfer can b e

implemented Migrating threads allows threads to move from one task to another as part of their normal

functioning In this mo del during an RPC the kernel do es not blo ck the clientthreaduponitsIPCkernel

call but instead arranges for it to continue executing in the servers co de No service thread needs to b e

awakened by the kernelinstead for the purp oses of RPC the server is merely a passive rep ository for

co de to b e executed by client threads It is for this reason that in the ob jectbased world this is called the

passive ob ject mo del Only a partial context switchisinvolvedthe kernel switches address space and

some subset of the CPU registers such as the user stackpointer but do es not switch threads or priorities

and never involves the scheduler There is no server thread state registers stack to restore The clients

own computational resources rights to the CPU are used to provide services to itself Note that binding in

this mo del can b e very similar to that in a thread switching mo del and will b e detailed in Section

Although most op erating systems supp ort RPC using the static thread mo del whether kernelvisible or

not it is imp ortant to note that this is not the case for al l system services All systems using the pro cess



mo del for execution of their own kernel co de eg Unix Mach Chorus Amo eba actually migrate

the users thread into the kernel address space during a kernel call No context switchtakes placeonly the

stack and privilege level are changedand the user threads resources are used to provide services to itself

one can p erform op erations on the ob ject it represents



Sometimes this invo cation of the scheduler is heavily optimized and inlined into the IPC path but it is still there



The pro cess mo del is in contrast to the interrupt mo del as exemplied by the op erating system in whichkernel

co de must explicitly save state b efore p otentially blo cking

Static vs Migrating Threads In actuality there is a continuum b etween these two mo dels For example

in some systems such as QNX certain client thread attributes such as priority can b e passed along to

inherited by the servers thread Or a service thread may retain no state b etween clientinvo cations

only providing resources for execution as in the Peregrine RPC system Thus it can b ecome imp ossible

to precisely classify every thread and RPC implementation However most systems clearly lie towards one

end of the sp ectrum or the other

Providing Migrating Threads on Mach

Mach uses a static thread mo del and a thread contains all of the attributes outlined in the Threads

paragraph ab ove In our work we decoupled these semantic asp ects of the thread abstraction into two groups

and added a new abstraction the activation stack which records the clientserver relationships resulting from

RPCs A thread is now only i the logical owofcontrol represented byastackofactivations in tasks and

ii the schedulable entity with priority and resource accounting attributes An activation represents i

the execution context of a computation including the task whose co de it is executing its exception handler

program counter registers and stackpointer and ii the uservisible p ointofcontrol

The abstraction exp orted to user co de that corresp onds to the old thread abstraction is nowwhatwe

internally call the activation This is not only what makes sense for the needs of user programs but also

provides compatibility with original Mach

The real thread as dened ab ove the schedulable entity is no longer sub ordinate to a task By making

RPC a single identiable entity to the kernel and explicitly recording the relationship b etween individual

activations in the activation stack which result from RPC wehaveelevated RPC to an entity fully visible

to and supp orted bythekernel instead of a sequence of message passing op erations Our thread abstraction

now more closely mo dels the original conceptual basis of a thread a logical ow of control It turns out

that elevating the thread and RPC abstractions also enhances control lability b ecause the kernel can now

take more elab orate and precise actions on a single activation or on the entire thread For example it can

propagate information alerts along the chain of activations Another b enet of intro ducing to the kernel

the notion of an intertask RPC is that a numb er of aggressive IPC optimizations b ecome p ossible This

was one of our original motivations but many other b enets have since surfaced

Outline of Goals and Benets

Our original goals in this pro ject were several i change Mach s thread mo del to a migrating one

ii retain backwards compatibility and iii enable p erformance improvements via RPC optimizations not

p ossible with static threads During the design and implementation we discovered wecouldachievemuch

more iv ordinary messagebased fully marshaled RPC b ecame much faster v thread controllabilitywas

enhanced vi kernel co de b ecame much simpler vii an applesapples comparison of static vs migrating

threads was achieved and viii several other advantages discussed b elow b ecame evident

In the rest of this pap er we describ e this work in detail We rst discuss related work then cover the

advantages of a migrating thread mo del describ e our kernel implementation and interface including discus

sion of the thread controllability issues examine how RPC works in the new system and mention how the

Unix server could b e changed to b etter leverage migrating threads Finallywe present the implementation

status and preliminary results outline future work and the conclusions wedraw from this work

Related Work

Most op erating systems use a static thread mo del but there are a numb er of exceptions Suns Spring

op erating system supp orts a migrating thread mo del very similar to ours although it uses dierent termi

nology Springs shuttle corresp onds to our thread and their thread corresp onds to our activation

Spring addresses the controllability issues but did not have to b e concerned with backwards compatibility

Alphawas probably the rst system to fully adopt migrating threads It is oriented to realtime con

straints and its migrating thread abstraction is esp ecially imp ortant for carrying along scheduling exception

handling and resource attributes In b oth of these systems a thread can migrate across no des in a distributed

environment and indeed Alphas term for a migrating thread is a distributed thread Psyche is a single

addressspace system that supp orts migrating threads The Lightweight RPC systemonTaos exploited

migrating threads control transfer as a critical part of its design but fo cused on highp erformance lo cal

RPC and included additional data transfer optimizations This makes it dicult to isolate the b enets of the

improved control transfer Ob jectoriented systems have traditionally distinguished b etween active and

passive ob jects corresp onding to static and migrating thread mo dels Clouds exemplies a passive

ob ject migrating thread mo del while Emerald as we do provides b oth active and passive ob jects

supp ort for b oth styles of execution Chorus can use only threadswitching b etween userlevel tasks

but b etween tasks running in the kernels protection domain it has message handlers which op erate in a

migrating thread mo del

We b elieve that many of these migrating threads systems did not fully address the attendant controlla

bility issuesdid not fully supp ort debugging or need to provide Unix semantics for example

Scheduler activations are kernel threads with sp ecial supp ort for userlevel scheduling Scheduler

activations are concerned primarily with the b ehavior of kernel threads within a protection domain while

our work deals with thread b ehavior across protection domains As such these works are largely orthogonal

and theoretically could b e combined in the same system but we do not deal with this issue here

Except for LRPC on Taos all of the existing systems supp orting migrating threads were designed this

way from the start They are all dierent from traditional op erating systems in manyways other than thread

mo del To our knowledge heretofore the thread mo del issue itself has not b een separated out and examined

comparing all of p erformance functionality and simplication Our goal is to do this by comparing the

two thread mo dels in the same op erating system providing information fo cused on the thread mo del By

implementing migrating threads on Machwe also demonstrate how an existing op erating system with

static threads can b e adapted to migrating threads

Motivation

A migrating thread approach has several advantages which are outlined in this section The ma jorityof

the b enets are linked to use with RPC and are describ ed rst But there are also controllabilityadvantages

for threads during al l kernel interaction and these are outlined in section In the context of the Alpha

OS also discusses many advantages oered by migrating threads

Remote Pro cedure Call

Many of the advantages of migrating threads stem from their use in conjunction with RPC Migrating

threads provide a more appropriate underlying abstraction on which to build RPC interfaces than do static

threads Many of the problems with static threads stem from the semantic gap b etween the control mo del

a pro cedure call abstraction within a single thread of control and the mechanism used to implement the

mo deltwo threads executing in dierent tasks Using migrating threads for RPC provides b enets in

p erformance functionality and in ease of implementation Since RPC is very frequently used esp ecially

in newer based op erating systems where most internal system interactions are based on RPC

this asp ect of the system can b e of great imp ortance in determining the p erformance and functionalityof

the system as a whole

Invo cation Eciency

For RPCs to b e p erformed in the static thread mo del two threads one in each task must synchronize

in the kernel Two threadtothread context switches are required during the op eration one on call and one

on return However in the migrating thread mo del the entire RPC can b e p erformed by just one thread

that temp orarily moves into the server task p erforms the requested op eration and then returns to the client

task with the results No synchronization rescheduling or full context switch need b e done

Thread migration also p ermits optimizations such as those done in LRPC and in other exibly struc

tured or shared address space systems eg Lipto FLEX and Mach InKernel Servers In these

systems there is some degree of interdomain memory sharing or protection relaxation thus blurring domain

b oundaries RPC implemented by threads that migrate from one domain to another can takeadvantage of

this b oundary blurring providing many optimizations in argument passing and stack handling At the limit

RPC implemented in a migrating thread mo del could b e sp ecialized to a simple pro cedure call

These advantages apply in general to any service invocation mechanism not just large servers invoked

through RPC For example in an ob jectbased environment invo cation of relatively negrained ob jects is

prohibitively inecient if all ob jects must b e active With passive ob jects it is more feasible to apply similar

invo cation abstractions to b oth medium and coursegrained ob jects

While the existence of fast ecient based on static threads demonstrates that high p erfor

mance is p ossible in that mo del such systems often imp ose semantic restrictions that distort their implemen

tation towards a migrating thread mo del For example QNX a commercial realtime op erating system

supp orts only unqueued synchronous direct pro cesstopro cess message passing with priority inheritance

this design makes it a de facto migrating threads system Other microkernels such as L retain the full

semantics of static threads but to achieve high p erformance must imp ose severe restrictions on the exibility

of scheduling and other asp ects of the system not directly related to RPC

Thread Attributes and Realtime Service

In the static thread mo del when a client task p erforms an RPC control is transferred to an entirely

dierent thread that has its own scheduling parameters such as execution priorityaswell as other attributes

such as resource limits Unless sp ecic actions are taken the attributes of the thread in the server will b e

completely unrelated to those of the client thread This can cause the classic problems of starvation and

priority inversion when a highpriority client is unfairly made to comp ete with lowpriority clients that

are accessing the same server On the other hand if the client thread migrates into the server to p erform

the op eration all such attributes can b e prop erly maintained with no extra eort Obviously this issue is

of particular imp ortance to systems providing realtime service

A related advantage is in resource accounting which can b e made more accurate since the work done in

a server on b ehalf of a client can automatically b e so attributed

Interruptions during RPC

Often due to asynchronous conditions it is desired to interrupt an RPC in which a client is blo cked

either temp orarily or p ermanentlyTo do this cleanly in the static thread mo del it is not enough merely to

ab ort the message sendreceive op eration b ecause the server will continue pro cessing the request without

any indication that the client no longer desires its completion If some entitywants to ab ort an RPC in

which a thread is blo cked it must nd the server to which the RPC is directed knowhowtointeract with

that server enough to send it a request to ab ort an RPC op eration and provide the server with some kind

of identication sp ecifying which RPC is to b e ab orted This usually proves to b e a complex and dicult

pro cess In addition every server that maybeaccessedmust supp ort these ab ort op erations This can b e

dicult to guarantee in practice esp ecially if any usermo de task can set itself up as a server and allow

other user threads to make RPCs to it as Mach allows Migrating threads on the other hand provide

achannel through which standardized requests for interruption can b e propagated

Server Simplication

In the case of p ersonalityservers that emulate monolithic op erating systems such as Unix or OS we

exp ect migrating threads to simplify the server b ecause the original op erating system on which the server is

based is likely to have used a limited migrating thread mo del in which threads migrate into the monolithic

kernel for system calls Maintaining this mo del in the p ersonality server should achieve greater co de reuse

and simplify the handling of interruptions thread management and control mechanisms suchas

Unix signals

We also exp ect migrating threads to simplify RPC service in servers due to the anticipated simpler

management of activation p o ols than thread p o ols as describ ed in Section

Kernel RPC Path Simplication

As later shown by our results in Section migrating threads greatly simplify the kernel RPC path

as well RPC paths based on migrating threads tend to b e short and ow naturally while optimized RPC

paths based on static threads are often long convoluted and contain innumerable tests

Thread Controllability and Kernel Simplication

A migrating threads implementationgives other controllability and simplication b enets unrelated to

RPC In a static thread mo del threads are often intended to b e completely control lable resources Ideallyin

this mo del anyentity with appropriate privilege such as a program holding a thread control p ort in Mach

is able arbitrarily to stop a thread and mo dify its state at any time Conceptually threads execute

only usermo de instructions and therefore there is never a time when system integrity could b e violated by

manipulation of the thread

Unfortunately this mo del in its purest form do es not work in real op erating systems Threads must b e

able to invokekernellevel co de in order to communicate with other entities in the system if they are to do

anything more than pure computation Since a thread executing unknown kernelcodemay not b e arbitrarily

manipulated the mo del of complete controllabilitymust break down somewhat it must b e p ossible to defer

or reject thread control op erations when necessary

Traditional op erating systems havevarious ways of working around this problem which usually work

but are often complex inconsistent and unpredictable For example Mach provides a thread control

op eration which ab orts a system call in which the target thread is blo cked so that the thread can b e

manipulated However manykernel op erations cannot b e ab orted in a transparent restartable way so the

entity trying to control the thread mayhavetowait an arbitrary length of time or retry an arbitrary number

of times b efore it can safely do so If this is the case who is really b eing controlledthe target thread or

the thread trying to control it

Since the complete controllability mo del is not realistic anyway reducing the ambitiousness of the mo del

to allow for migrating threads provides more precisely dened semantics for thread manipulation In fact

by forcing the b oundaries of controllability to b e explicitly dened and recording the owofcontrol across

tasks additional thread control mechanisms such as crossdomain alerts can b e provided bythekernel

Dening the b oundaries of control also makes all control mechanisms much simpler to implement as shown

in Section

Kernel Implementation

In this section we describ e the underlying structure of our implementation of migrating threads in the

Mach microkernel Many of the techniques we used could b e similarly applied to other traditional

multithreaded op erating systems such as monolithic Unix kernels

Thread Implementation

Conceptually a traditional Mach user thread started executing in a particular task and o ccasionally

trapp ed into the kernel to communicate with outside entities The kernel later returned from the system

call and resumed the user co de The initial and normal lo cation of a thread was in and threads

only visited the kernel o ccasionally to request services

In our migrating thread implementation the situation is in a sense reversed A thread starts executing

as a purely kernelmo de entity and later makes an up callinto user space to run user co de Conceptually

the kernel is home base for all threads the only time userlevel co de is executed is during temp orary

excursions into a task A thread executing in user mo de is asso ciated with the task in which it is currently

running but a thread running in the kernel is not tightly asso ciated with any userlevel task

While a thread in the kernel can nowmake up calls into user space the traditional kerneluser interface

is still preserved Once a thread is executing in user space it can make calls back to the kernel in the form

of traps and exceptions Alternatively the kernel can make further up calls into the same or a dierentuser

task This redenition of the kerneluser interface is the primary mechanism supp orting migrating threads

in our implementation

A distinction should b e made b etween the kernel and what we refer to as glue co de The kernel is

conceptually a protection domain muchlike a userlevel task in which threads can execute wait migrate in

and out and so on its primary distinction is that it is sp ecially privileged and provides basic system control

services Glue co de is the lowlevel highly systemdep endent co de that enacts the transitions between

all protection domains b oth user and kernel The distinction b etween the kernel and glue co de is often

overlo oked b ecause b oth typ es of co de usually execute in sup ervisor mo de and are often linked together in

a single binary image However this do es not necessarily have to b e the case for example in QNX

the K microkernel consists of essentially nothing but glue co de while the kernel prop er is placed in a

sp ecially privileged but otherwise ordinary pro cess It will b ecome clear in later sections that even though

the kernel and glue co de may still b e lump ed together in the presence of migrating threads the distinction

between them b ecomes extremely imp ortant

Control Abstractions and Mechanism

Even in the static thread mo del in practice the goal of complete controllability of threads cannot b e fully

realized While the case of a thread b eing in the kernel can to some extentbeworked around as a sp ecial

case with the addition of migrating threads the controllabilityissuemust b e more carefully considered

Now not only is the kernel out of b ounds for thread control but in order to maintain protection b etween

tasks threads that have migrated to other protection domains may also b e uncontrollable For example if

one thread in a client migrates intoaserver for an RPC it would not b e p ermissible for another thread in

the same client to stop or manipulate the CPU state of the rst thread while executing server co de

To provide controllability and protection at the same time we split the concept of a thread into two

parts the part used by the scheduler and the part providing explicit control The rst whichwe still

refer to as the thread migrates b etween tasks and enters and leaves the kernel The second a usermo de

invo cation or activation remains p ermanently xed to a particular task Arbitrary control is p ermitted only

onaspecic activation not on the thread as a whole

Whenever a thread migrates into a task including the initial up call from the kernel on thread creation

an activation is added to the top of the threads activation stack When a thread returns from a migration

the corresp onding activation is p opp ed o the activation stack This new kernelvisible abstraction the stack

or chain of activations helps provide controllability

Activations are created either implicitly during thread creation or explicitly by servers exp ecting to

receive incoming migrating threads An explicitly created activation is unoccupied until a thread migrates

into the task and activates it

Within the kernel control of activations is implemented primarily with asynchronous procedurecal lsor

APCs similar to asynchronous traps ASTs in monolithic kernels When returning from the kernel into an

activation glue co de checks for APCs attached to the activation and if present calls them For example to

susp end an activation an APC is attached to that activation which will blo ckuntil resumed Previously

Mach dealt with thread susp ension as part of the scheduler adding more complexity to its alreadycomplex

state machine nowthescheduler knows nothing ab out one thread susp ending another Instead the kernels

ordinary blo cking mechanism is used in which a thread only susp ends itself

Kernel Stack Management

Since the activation chain can b e broken at anypoint all linkage information b etween activations is

stored in the activations themselvesandasinglekernel stack is sucient for the entire thread This is also

required for it to b e p ossible to do task migration across no des in a distributed system b ecause state held on

akernel stack cannot b e easily encapsulated for transp ort This explicit saving of state is generically known

as using a continuation although our implementation is very dierent from the way continuations havebeen

used in Mach in the past In particular we conne continuations purely to glue transition co de all

highlevel kernel co de uses an ordinary pro cess mo del

Controllability Semantics Interface and Implementation

In this section we describ e the semantics of thread control op erations the interface to those op erations

and some asp ects of the implementation We b elieve our approach could b e similarly applied to other

traditional multithreaded op erating systems

Thread Control Interface

In the original Machkernel threads were exp orted to usermo de programs in the form of thread control

ports through which control op erations could b e invoked In our system while threads still exist the control

abstraction presented to userlevel co de is instead the activation control port This can work b ecause the old

thread execution abstraction exp orted to the user was b ound to a single task like activations are now We

maintain compatibility with existing Machcodeby making activation control p orts direct replacements for

thread p orts at the binary levelall system calls which previously exp ected or returned thread p orts now

use activation p orts instead For compatibility at the source level appropriate synonyms are provided

Alerts

abort In our migrating threads implementation wehaveprovided the functionalityofMach s thread

call which ab orts an inprogress kernel op eration and returns control to user co de However wehave provided

it in a cleaner and more general form An alert is a form of asynchronous message passed from a client to the

kernel or a server it is calling asking the callee to ab ort the requested op eration and return control to the

client as so on as p ossible Alerts are primarily an informationpassing mechanism supp orted by the kernel in

a uniform way They do not in themselves provide control over threads b ecause they have no forcefulness

alerts are merely requests not demands We currently implement only a p olling interface for a server

to discover an alert but probably will also provide an exception interface in the future Alerts are much like

those in Spring The Alert and TestAlert facilities of Taos are analogous but apparently do not

op erate crossdomain

By default new activations added to a threads stackhave alerts blo cked to preventinterference with an

unwary servers functioning The kernel is already capable of honoring most alerts and new servers written

to work with migrating threads can b e designed to honor them to o In eect wehaveprovided a generic

interruption request mechanism whichworks uniformly for b oth migrating RPCs and kernel calls

We also provide another op eration which rst generates an alert at the target activation then breaks the

chain returning control to the client immediatelyThisworks much like termination discussed b elow

Susp ension

In Mach thread semantics the basic purp ose of susp ending a thread is to prevent it from executing

any more usermode instructions until it is resumed Therefore susp ending a tasks threads turns that

task into a passiveentity allowing its address space and other state to b e examined or mo died without

interference from its threads It is not required that all of the threads computation b e immediately stopp ed

as long as that computation do es not implicitly reference the threads task For example explicit device

IO could b e allowed to pro ceed while the thread is susp ended but kernel copyin and copyout op erations

which implicitly aect the tasks address space could not

Our current implementation allows suchkernel activitytoproceedWe do not exp ect this to b e a problem

in practice but in any case it should b e solved as a sideeect of related work In that work we are further

separating kernel co de from glue co de describ ed earlier When an activation is susp ended the kernel

ensures that neither usermo de or glue instructions will b e executed in that activation If the thread is

executing elsewhere it will not b e aected until it attempts to return to the susp ended activation

To maintain correct susp ension semantics in this mo del implicit references made bykernel system calls

to the callers task must b e conned to glue co de This is relativelyeasytodoinMach b ecause most kernel

calls are implemented as generic RPCs to kernel ob jects only the lowlevel RPC co de needs to ob ey the



control semantics not the actual co de implementing the kernel calls

Termination

The termination mechanism in our implementation is illustrated in Figure If an activation not at the

top of a threads activation stack is terminated or if the thread is in a kernel call at the time then the

thread splits into two separate threads with identical scheduling parameters One thread is left with the

top part of the activation stack ab ove the terminated activation and the other thread is given the b ottom

part The thread with the upp er segmentcontinues executing in the topmost server uninterrupted ensuring

that thread termination do es not violate protection An alert is automatically propagated upward through



Note that traditional Unix susp ension semantics are stronger than Mach susp ension semantics they imply that kernel

op erations are actually completed or ab orted b efore the thread is susp ended On Mach b oth with static and migrating threads

implementing Unix susp ension semantics involves other Machcontrol op erations in addition to simple Mach susp ension

Figure Activation Termination Task Failed Task Task

error abort return request

Kernel

this thread providing a hint that the work b eing done is probably no longer of value The thread given the

b ottom part of the activation stack returns to its nowtopmost activation with an appropriate error co de



This is essentially the same as the termination mechanism used in Spring

In our system not only can a thread b e split when running in user mo de in a more recentactivation than

the terminated activation but also when the thread is executing in the kernel on b ehalf of the terminated

activation Previously attempting to terminate a thread could p otentially blo ck for some time while the

caller waited for the victim to leave the kernel or otherwise get to a clean p oint In the new mo del

terminating an activation is always an immediate op eration if the terminated activation happ ens to b e

calling the kernel then it is left b ehind to nish whatever op eration it was p erforming and quietly self

destruct afterwards The issue of glue and kernel op erations implicitly aecting the threads task is dealt

with as describ ed ab ove under Susp ension

Our termination mechanism required careful planning of kernel data structures and lo cking mechanisms

in particular the line b etween the kernel and glue co de had to b e dened preciselyOnceworked out however

this technique not only added additional controllability but considerably simplied the implementation of

control mechanisms in the kernel as weshow in Section

CPU State

The original Mach design provided thread op erations which in the ideal complete controllability

mo del would allow a threads entire CPU state to b e saved restored examined and mo died at any time All

CPU state op erations were provided bytwo primitives thread get state and thread set state dened

to pro duce consistent results only while the target thread was susp ended However b ecause of the problems

with the complete controllability mo del many of the things for which the CPU state control mechanisms

are commonly thought to b e useful in fact cannot b e reliably implemented in b ounded time in Mach

For example encapsulation of a tasks state for checkp ointing or transp ortation to another no de either may

require the controlling thread to wait an arbitrarily long time or else requires ab orting kernel op erations

yielding p otentially inaccurate state

Since the existing CPU state control op erations are already problematic and it would b e dicult to

achieve complete backward compatibility with them wechose to structure these op erations in our migrating

threads implementation to t current uses of these op erations in particular those made by the Unix server



We are considering providing alternate termination semantics which fully preservethesynchronousnatureofRPCThe

terminated activation would b e merely spliced out so control is returned to the earlier activations only after those later in the

chain have exited It may not b e p ossible to guarantee these semantics over a partitionable network however

and emulator and by application that create their own threads and control them in straightforward ways

Mach requires thread abort to b e called on a thread just b efore examining or setting its state

unless the thread has just b een created Otherwise the state op eration could work with stale information

pro ducing useless results Under migrating threads ab orting an activation b efore manipulating its state

is not strictly required If not done the CPU state op erations wait patiently until the thread is in the

target activation without interfering with its functioning Thus wehave lo osened the restrictions on these

op erations while maintaining backwards compatibility with their only valid usage in original Mach

Scheduling Parameters

The nal Mach thread control op erations that must b e mapp ed to activation op erations are those

managing thread scheduling parameters suchaspriorityscheduling p olicy and CPU usage statistics Unlike

the op erations describ ed ab ove these op erations are still conceptually p erformed on threads rather than

activations However the original Mach thread control p orts havebecomeactivation p orts raising the

question of how the interface for these op erations should b e handled

Since every active activation is attached to exactly one thread in our current implementationwe exp ort

thread op erations as op erations on activations and in the kernel redirect the op erations to the attached

thread However this raises a protection problem since any activation in the thread can mo dify the global

scheduling state For example a server could lower a threads maximum priority which cannot b e raised

without sp ecial privileges while pro cessing an RPC leaving the client with a crippled thread up on return

In our initial implementation this is not a problem in practice b ecause the Unix server is trusted by all clients

A b etter solution would b e to provide an activation op eration which forbids future activations higher in the

stack from changing global thread state Thread state could still b e manipulated from that activation or

lower assuming it has not also b een forbidden at a lower level

Task Control Interface

Most task control op erations work the same way under migrating threads as in the original Machde

sign Others were mo died in straightforward ways to match the new thread mo del In particular the

threads call now returns a list of the activation p orts of the task instead of thread p orts When a task

task is susp ended resumed or terminated all of the activations within it instead of threads are similarly

susp ended resumed or terminated

Migrating RPC

Once the basic kernel mechanism for supp orting migrating threads was in place it remained to demon

strate its eect on RPC p erformance and complexity Because our fo cus at this p oint is primarily on control

transfer during RPC our initial implementation retains the original Mach data transfer interface based on

marshaling and unmarshaling done by usermo de stubs In this section we describ e this RPC system as

well as the changes required to servers to make them supp ort migrating RPC No changes were required to

make them run with traditional RPC since the kernel itself is almost completely backward compatible

Clientside

From the clients p oint of view RPC semantics including binding are unmo died Existing binaries

with normal mach msg calls are supp orted the kernel checks the message options to make sure they sp ecify a

true RPC and checks the destination p ort to ensure that the server is capable of handling migrating RPCs

In practice almost all MIGgenerated mach msg calls meet these requirements so most clients automatically

make use of migrating RPC Note that the data can still contain p ort rights and outofline memory

Serverside

Initializing a server to supp ort migrating RPCs is done in nearly the same way as in servers supp orting

only threadswitching RPC In accomplishing the ma jor p ortion of binding the server exp orts send rights

to clients exactly as b efore In addition the server must create one or more uno ccupied activations each

containing a p ointer to a stack in its own address space and in the nal p ortion of binding the entry p oint



of its normal dispatch function Providing this information to the kernel can b e encapsulated within a

function eg in the cthreads package

Traditional staticthread RPC is still supp orted automatically A large p o ol of server threads is no longer

needed but at least one must still exist to pro cess o ccasional asynchronous messages b ecause in the current

implementation this is used as a fallbackmechanism when migrating RPC cannot b e used

When a migrating RPC is made into the server the kernel allo cates an uno ccupied activation from the

servers p o ol copies the incoming message onto the server stack and makes an up call into the server task

to the dispatch routine This MIGgenerated routine is identical to the one used to dispatch traditional

messages except that it returns through a sp ecial kernel entryp oint On return the kernel do es not need to

do any securitychecks or p ort manipulation and the reply p ort provided by the clientinthe mach msg call

is never used at all

If a migrating RPC is attempted and the kernel discovers that there are no activations currently available

in our initial implementation the kernel falls back to the normal message path causing a normal message

to b e queued to the p ort This is not ideal and we plan to detect when the last available activation is

ab out to b e used for a migrating RPC and instead of immediately making the requested RPC temp orarily

sidetrack and make a sp ecial notication up call into the server At this p oint the server can create more

activations if it deems this desirable If it do es it returns them to the kernel and the original RPC can

pro ceed Otherwise it returns immediately and the RPC blo cks until a stack is freed

When this is implemented server management of an activation p o ol should b e substantially more straight

forward than management of a thread p o ol for several reasons The resources will b e allo cated on demand

by client threads themselves instead of resource needs having to b e predicted in advance by the server The

server can use some simple decay function to deallo cate activations whicharecheap for the kernel to manage

since they are simply passive data structures in contrast to kernel threads In contrast with a thread p o ol

a server is separated by the kernel wall from clients requestsif inadequate numb ers of server threads

are present client messages build up in the servers queues without its knowledge The server has a more

complex job as it attempts to keep the numberofthreadswaiting for requests equal to or slightly greater

than the numb er of pro cessors it has to keep trackofthenumb er of threads waiting for messages running

in the server and blo cked on outgoing RPCs or kernel calls The last asp ect is particularly awkward b ecause

it requires surrounding every such blo cking call with op erations to wake up and manage server threads If it

do es not deadlo ck can result Finally substantial complexity is due to multiplexing cthreads over kernel



threads which cannot b e done with migrating threads However if it is found necessary to limit the

total numb er of executing threads in a particular server in order to avoid saturation due to excessivekernel

context switching some of this simplication will not b e present

Userlevel Thread Issues

The most imp ortant issue with migrating RPC is that the userlevel threads and synchronization package

most widely used on Mach cthreads has signicant limitations in the presence of migrating RPC

Server Thread Management The cthreads presents a signicant problem to the server of a

migrating RPC Servers use cthreads to multiplex user threads on top of kernel threads replacing kernel

mo de context switches with much faster userlevel context switches whenever p ossible However one of the

main assumptions made by the userlevel threads package is that all of the kernel threads on whichitis

running its userlevel threads are interchangeablethat one kernel thread can b e used for an op eration just

as well as another This assumption can b e satised in a static thread mo del although in the pro cess it

makes realtime monitoring and control of server threads dicult

In a migrating thread mo del however kernel threads migrating in from clients are not interchangeable

they mayhave dierent priorities and other attributes Even ignoring this the returntokernel after an

RPC has b een pro cessed must b e done on the same kernel thread that the RPC came in on In general

trying to multiplex threads in this manner loses one of the main advantages of our design providingakernel



When programming to the RPC stub generator interface as is usually done this is not a change in binding semantics



Skeptics should insp ect the OSF servers ux server loop and related co de where the outlined complexitywas found

necessary for p erformance and safety

entity the activation stack which represents a particular piece of work in progress ie an entire logical

thread of control Therefore multiplexing a servers userlevel threads on top of incoming kernel threads is

not appropriate In cthreadsmultiplexing can easily b e avoided by wiring the userlevel thread

However some sp eed is lost in the elimination of userlevel thread multiplexing b ecause synchronization

op erations in the server sometimes now require kernellevel context switches instead of userlevel context

switches Measuring real applications including on multipro cessors will b e necessary b efore wecanbe

sure the gains from b etter RPC p erformance are not outweighed by this additional cost We b elievethat

the sp eed advantage of userlevel context switching is not as signicantintypical RPC servers as it is in

computeintensive applications which are the traditional b enchmarks for thread implementations In well

designed servers providing system functions we susp ect that internal contention can b e minimized so

that the imp ortance of RPC sp eed outweighs that of context switchspeedWepoint out that in many

commercial microkernelbased systems including QNX Chorus and KeyKOS OS servers do not

generally multiplex userlevel threads over multiple kernel threads Instead these systems either provide

multithreading purely with kernel threads or their functions are suciently decomp osed so that eachserver

can b e based on a single kernel thread requiring no internal synchronization However until there is more

extensive p erformance analysis of servers using migrating RPC losing userlevel threads when servicing

RPCs remains a concern

Note that it is only for guest threads migrating in from other tasks that userlevel thread multiplexing

is a problem threads native to the server can still use some kind of userlevel thread system or even a

sp ecialized multiplexing mechanism suchasscheduler activations

A More Appropriate Synchronization System Since cthreads can no longer multiplex userlevel

threads on kernel threads in servers it should b e replaced with a synchronization library b etter optimized

to provide synchronization over kernel threads Also kernelvisible synchronization will b e necessary to

fully implement priority inheritance as we discuss in a longer pap er We are planning a replacement for

cthreads that provides synchronization primitives in a userlevel library but in co op eration with the kernel

The Unix Server

To function on the new kernel using traditional RPC no changes were necessary to the OSF single

server and emulator or to the libraries they use To supp ort migrating RPC a few changes were required

Initiallywechose ways whichhave minimal impact on existing co de but b etter cleaner mechanisms can

b e provided in the longer term The server was mo died to invoke the new setup function in the cthreads

library and to wire incoming cthreads The existing complex management of the servers thread p o ol

while basically no longer used was retained We made no mo dications to the emulator Since weare

providing backwardscompatible semantics for thread manipulation no mo dications were needed to the

existing complex co de for handling Unix signals

Even when our implementation is tuned we do not anticipate a large p erformance improvement in the

single server to a large extent b ecause it was written with the assumption that RPC is very exp ensive

Therefore the server avoids RPC as much as p ossible instead resorting to other approaches like shared

memory pages whose p erformance is not enhanced by migrating RPC However our initial goal is not

primarily to show p erformance improvement but to demonstrate the gain in simplicity and cleanliness

provided by migrating threads and how migrating threads can b e implemented in a backwardcompatible

way in an existing op erating system

Desirable Mo dications Weanticipate that the Unix server could b e made simpler with two mo dica

tions that take advantage of migrating threads One is emulating Unix signals under Mach and is describ ed

in Another is that b ecause Mach provides no standard way of propagating ab ort requests into

RPCs the Unix server must manually handle all Unix system call interruptions such as those caused by

p ending signals It ought to b e considerably simplied by taking advantage of the propagating ab ort op er

ations nowprovided by the kernel This would also makeinterruption semantics naturally extend to other

servers in the system such as ones installed byMachsp ecic application programs running under Unix

Results

Status

Wehave completed the kernel implementation of the system describ ed in this pap er An unmo died

emulatorbased Unix server runs normally on the new microkernel using traditional threadswitching RPC

A server mo died to provide activations so that it uses migrating RPC runs multiuser Unix signals

including C and Z are working

Exp erimental Environment

All timings were collected on a single HP with MB RAM This machine has a Mhz PA

RISC pro cessor K ochip Icache K ochip Dcache entry ITLB and entry DTLB with a

page size of K The caches are directmapp ed and virtually addressed with a cache miss cost of ab out

cycles RPC test times were collected by reading the PAs clo ck register which increments every cycle and

can b e read in user mo de Other times were obtained from the Unix server

The system software is our p ort of the Machkernel version NMK and the emulatorbased OSF

single server version b The compiler is GCC u with full optimization

RPC Path Breakdown and Analysis

To analyze the times sp eedup in null RPC presented in the section wecounted instructions by

hand along the kernels null RPC path These counts are broken down bytyp e of pro cessing in Table

along with the relative oldnew ratio and absolute numb er of instructions improvement in each category

of the improvement is due to the inverted serverkernel interface since the kernel is now calling the

server rather than vice versa the kernel no longer needs to save and restore the servers registers on every

RPC results from the kernels rsthand knowledge of RPC it no longer needs to create translate

and consume reply p orts in order to match a reply to its request comes directly from the optimized

control transfer of migrating threads switching activations is much simpler than switching threads of

the improvement is due to simpler data management particularly the elimination of the need to maintain a

temp orary message buer in the kernels address space due to direct copy from source to destination It is



arguable whether this asp ect is related to migrating threads

It is interesting that nearly half of the cost of migrating RPC now resides at the kernelclient b oundary

more than half if measured by memory op erationssee b elow Therefore further improvements to other

parts of the kernel RPC path will probably lead to only minimal overall sp eedup That up calls are so cheap

in comparison esp ecially in memory op erations mayhave implications for system structuring in a system

supp orting migrating threads

Table Null RPC Path Breakdown and Improvement

Improvement Improvement

Instruction Count Instructions LoadsStores

Sw

Stage Switch Migrate Sw Mg Sw Mg

Mg

Kernel entryexit Client side

Kernel entryexit Server side

Port translation

ThreadActivation switch

Message copy

Entire kernel path

The last two columns of Table show the improvement for each stage as measured bythenumber of

loadstore op erations which should contribute disprop ortionately to total cycles due to memory subsystem



At rst sight this asp ect may seem completely orthogonalHowever on the switching path the message copyin is at the

very b eginning while the copyout into the destination is at the end separated byhundreds of lines of complex co de Combining

these into one direct copy op eration would b e very dicult On the much simpler and shorter migrating path direct copywas

easy to supp ort Note that even with null RPC some copying is involved the word message header

costs We observe that the p ercentage improvements in instruction count and memory op erations are ap

proximately equal for each stage of RPC This suggests that instruction countisavalid measure of the

relative contribution of each stage to the overall p erformance gain

Examination of the context switch co de in the old optimized RPC path explains much of its cost the

kernel essentially executes a p ortion of the scheduler sp ecially handco ded inline Numerous constraints

must b e satised b oth old and new threads must b e in just the right states run and wait queues must

b e maintained correctlylocks on p orts threads IPC spaces and other data structures must b e taken and

released in the right order to avoid deadlo cks timers are manipulated interrupt levels are changed resources

acquired along the waymust b e carefully tracked to ensure that it will b e p ossible to unroll everything if

for some reason the computation falls o the optimized path

Table shows the instruction mix for each path broken into three categories total instructions

loadsstores and branches The migrating path has a somewhat higher p ercentage of loads and stores

vs presumably due to the fact that the basic memoryintensive asp ects of IPCregister saving

and restoring memory copying and data structure traversalare less obscured by computational overhead

The relative incidence of branch instructions is muchlower however vs This along with

the ninefold reduction in total numberofbranch instructions reects the lower logical complexity of the

migrating path

Table Null RPC Path Instruction Mix

Switch Migrate

Stage All LoadStore Branch All LoadStore Branch

Kernel entryexit Client side

Kernel entryexit Server side

Port translation

ThreadActivation switch

Message copy

Entire kernel path

We exp ect that our results on the RPC path would in general extend to other architectures b esides the

PARISC Although sometimes dicult most architectures can achieve the single direct copy when the date

are contiguous for example by temp orary mapping Changing the interrupt prioritylevel IPL is done

four times on the switching path due to scheduler involvement but not at all on the migrating path while

IPL changes are cheap on the PARISC they are very exp ensive on some other architectures making

migrating threads esp ecially imp ortant on them The unavoidable cost of address space switching is much

higher on some architectures whichwould lead to a lower improvement ratio but even then we exp ect the

b enets to b e considerable

Micro and Macro Benchmark Results

Measurements of the costs of crosstask migrating and traditional switching RPC are presented in Table

The columns on the left include only the kernel costs of RPC while the ones on the right include b oth kernel

and user marshaling costs obtained in another set of runs On this machine a null lo cal RPC now sp ends

less than microseconds in the kernel The sp eedup from migrating threads varies with parameter size

from a factor of for null RPC a factor of for K of data to a factor of for long inline marshaled

data This factor of comes from the fact that the data is copied three times in the switching pathonce

during marshaling and twice in the kernelbut only twice on the complete migrating path

One interesting observation is that the numb er of cycles p er instruction CPI is considerably worse on

the migrating path than on the original path Webelieve that some or

all of this is due to two factors the higher p ercentage of loadstore instructions as describ ed ab ove and

the fact that the instructions on the handco ded migrating path were not carefully scheduled and optimized

like the C compiler did to most of the old path Therefore more careful co ding of the migrating RPC path

could somewhat lower CPI More investigation of the CPI dierence is warranted

Table RPC Times in Cycles

Kernel Time Kernel Time User Marshaling

Test Switching Migrating Ratio Switching Migrating Ratio

Null RPC

In

K In

 

  

K In

 

The measurementofthekernel time for K migrating RPC showed severe side eects of the HPs

directmapp ed cache At the top is our original measurement a suspiciously low time resulting in a rather

unb elievable  sp eed improvement Below it is the same measurement taken after shifting the message

buers slightly so that the cache lines would conict resulting in an improvementof below the factor of

twowewould exp ect due to the data b eing copied once instead of twice This demonstrates the imp ortance

of cache eects in data transfer and deserves further investigation in the future

As a preliminary test of overall p erformance impact we measured the time for a make of the gas

assembler Under migrating threads the elapsed time went from to seconds an improvementof

ab out The link phase to ok ab out seconds A link of a larger program the HP linker itself improved

from seconds to seconds an improvementofWe b elieve this greater improvement is due to ld

having a higher ratio of system calls to computation

One area where weslowdown is in RPCs to the kernel These do not currently migrate since wehavent

changed the kernel to provide activations on its p orts We do not exp ect doing so to b e dicult When that

is done all messages which originally would have used the optimized path true RPCs should b e migrating

We exp ect a tuned implementation to achievemoreoverall sp eedup and if other RPC optimizations

enabled by migrating threads are p erformed signicantly more sp eedup

Kernel Co de Simplication

Conned Controllability Making threads indep endent of tasks and uncontrollable outside of user mo de

greatly reduced co de complexityinanumb er of areas The source le containing most thread control

op erations was reduced by more than half from K to K In the new K source le supp orting activations

supp ort for control op erations account for only ab out K This simplication largely resulted from cleaner

management of thread susp ension resumption and termination The original Mach thread control

mechanisms had to makenumerous tests for sp ecial cases such as a thread manipulating itself or two threads

trying to control each other simultaneously p ossibly causing kernel deadlo ck Now that controllabilityis

conned within welldened b oundaries as it must b e to supp ort migrating threads these tricky cases never

o ccur b ecause kernel co de is always out of b ounds

The task managementcodewas reduced from K to K for similar reasons the cleaner mo del simplied

lo cking and eliminated many sp ecialcase situations such as the case of a thread terminating its own task

Migrating RPC On the original switching path the p ort translation and context switchcodewere mostly

written in machineindep endent C co de while the other categories were PAsp ecic assembly language On

the migrating RPC path the machineindep endent parts b ecame so trivial that it was easier to inline them

into the assembly language path than to go to the trouble of interfacing with a highlevel language The

entire lines of complex C co de comprising the optimized RPC path plus ab out handco ded assembler

instructions were replaced with ab out assembler instructions The resultant simplications in logical

complexity a factor of nine are evident from the gures presented in Section

The rest is for allo cation and freeing of activations and other administration unrelated to the control op erations

Although some sp eedup presumably resulted from assembly co ding webelieveitwas slight and only feasible due to the

simplicity of the migrating path Those who dier are invited to insp ect the original message path and try to handco de it

without changing its semantics

Memory Use

In the original microkernel in general only a few kernel stacks K each were required p er pro cessor due

to the continuations mechanism At the b eginning of this pro ject we disabled continuations in order to

simplify our work this immediately raised kernel memory use to one kernel stack p er thread However with

migrating threads there are nowfarfewer threads in the system and kernel stacks are still asso ciated with

threads instead of activations While running our multiuser b enchmarks we observekernel physical memory

use to b e at most K greater than in the original system However we are in the pro cess of reintro ducing

continuations under the new mo del so even this increase should b e temp orary

Regarding server use at this writing we statically allo cate a large numb er of activations

and the old thread p o ol remains in the server Therefore ab out three times as muchVMisusedas

b efore MB vs MB When weremove the thread p o ol server VM use should b e ab out the same as in

the old system b ecause for the most part each server threaduser stack b ecomes one server activationuser

stack Of course clients are unaected b ecause they are unmo died

Future Work

This work enables many further improvements to Mach and Machservers as well as raising areas for

further research Providing an appropriate replacement for the cthreads synchronization primitives is

imp ortant in order to make a fair evaluation of the impact of relying on kernellevel context switches Our

earlier work on moving trusted servers into the kernels protection domain and address space INKS

used adho c thread migration By reworking the thread abstraction from scratch our new system solves all

of the problems encountered

The NORMA NO Remote Memory Access version of Machallows IPC b etween dierentnodes

of a distributed memory multipro cessor implemented in the microkernel Extending the migrating thread

system to encompass RPC b etween no des should b e done The issues involved have already b een explored

in depth in Alpha

Going in a dierent direction our work allows improvements in Machs supp ort for realtime systems At

the implementation level wehave largely decoupled two p ortions of the thread abstraction the schedulable

entity priorityscheduling p olicies etc from the thread of control the chain of activations This makes

it feasible to decouple them entirely enabling a full implementation of priority inheritance

The Mach message format imp oses unnecessary overhead on migrating RPC The migrating thread

mo del enables other designs which could provide much higher p erformance such as LRPC In cases where

protection domains have b een merged much of the copying can b e avoided

The migrating RPC mechanism can also b e used in thread exception pro cessing This will allowa no

emulator server such as OSFMK to do more ecient argumentcopying We b elieve migrating RPC

can also b e leveraged by making the Mach pager interface synchronous with a thread servicing its own page

faults This requires security to b e explicitly provided when untrusted pagers are involved

The OSF Research Institute is adopting our co de and is planning to makemany of the ab oveimprove

ments in an Intel base Our co de will also b e available to interested parties

Conclusion

We draw three main conclusions from our work First bychanging the thread mo del of an existing

op erating system and evaluating the twoversions weshow that a migrating thread mo del is sup erior to a

static mo del Migrating threads provide sup erior functionality p erformance and co de simplication In the

area of functionality thread migration i provides more precisely dened semantics for thread manipulation

and additional control op erations ii allows scheduling and other attributes to follow threads esp ecially

imp ortant for realtime systems In p erformance thread migration i improves the p erformance of ordinary

RPC and ii enables a multitude of aggressive RPC optimizations esp ecially in systems under current

research which provide crossdomain memory or addressspace sharing However thread migration do es have

the p otential p erformance disadvantage of not allowing userlevel threads that service RPCs to b e multiplexed

atop multiple kernel threads In reducing implementation complexity thread migration simplies i kernel

co de and we exp ect ii server co de In each of these areas our implementation and measurements have

demonstrated the rst b enet while p otential gains from the second seem evident but have not yet b een

realized through full implementation in our system

Secondly since migrating threads requires that the kernel treat lo cal RPC as an identiable semantic

entitywe conclude that op erating system kernels should directly supp ort the RPC abstraction

Our third main conclusion is that it is feasible to improve at least some existing op erating systems

bychanging their thread mo del from static to migrating Even in the case of Mach which has an

unusually rich threadmanipulation interface weshowthatthisfarreaching change can b e made while

retaining backward compatibility and with only mo derate implementation eort A key elementofthat

implementation is basing threads in the kernel which temp orarily make excursions into tasks via up calls

Acknowledgements

We esp ecially thank Mike Hibler for his exp ert help with the implementation as well as discussion of

controllability and signal issues We thank Douglas Orr for varied help and Greg Minshall for his careful

review of an earlier draft We had helpful discussions with many memb ers of the OSF Research Institute

ab out many asp ects of the system and with the HP Labs Brevix group ab out controllability The anonymous

referees Brian Bershad RichDraves and Alessandro Forin provided many useful comments on earlier drafts

and we thank Mike Jones and Je Mogul for all of that plus their patient shepherding

References

Thomas E Anderson Brian N Bershad Edward D Lazowska and Henry M LevyScheduler acti

vations Eectivekernel supp ort for the userlevel management of parallelism ACM Transactions on

Computer Systems February

J S Barrera A fast Machnetwork IPC implementation In Proc of the Second USENIX Mach

Symposium pages

Brian N Bershad Thomas E Anderson Edward D Lazowska and Henry M LevyLightweight remote

pro cedure call ACM Transactions on Computer Systems February

Andrew D Birrell An intro duction to programming with threads Technical Rep ort SRC DEC

Systems Research Center January

A PBlack N Huchinson E Jul H Levy and L Carter Distribution and abstract typ es in Emerald

IEEE Trans on Software Engineering SE

Alan C Bomb erger and Norman Hardy The KeyKOS nanokernel architecture In Proc of the USENIX

Workshop on Microkernels and Other Kernel Architectures pages Seattle WA April

John B Carter Bryan Ford Mike Hibler Ravindra Kuramkote Jerey Law Jay Lepreau Douglas B

Orr Leigh Stoller and Mark Swanson FLEX A to ol for building ecient and exible systems In

Proc Fourth Workshop on Workstation Operating Systems Octob er

D R Cheriton The V distributed system Communications of the ACM March

Roger S Chin and Samuel T Chanson Distributed ob jectbased programming systems ACM Com

puting Surveys March

David D Clark The structuring of systems using up calls In Proc of the th ACM Symposium on

Operating Systems Principles pages Orcas Island WA Decemb er

Raymond K Clark E Douglas Jensen and Franklin D Reynolds An architectural overview of the

Alpha realtime distributed kernel In Proc of the USENIX Workshop on Microkernels and Other

Kernel Architectures pages Seattle WA April

Michael Condict Personal communication November

Sadegh Davari and Lui Sha Sources of unb ounded priorityinversions in realtime systems and a

comparative study of p ossible solutions ACM Operating Systems Review April

Richard P Draves Brian N Bershad Richard F Rashid and Randall W Dean Using continuations

to implement thread managementandcommunication in op erating systems In Proc of the th ACM

Symposium on Operating Systems Principles Asilomar CA Octob er

Peter Druschel Larry L Peterson and Norman C Hutchinson Beyond microkernel design Decou

pling mo dularity and protection in Lipto In Proc of the th International Conference on Distributed

Computing Systems pages Yokohama Japan June

Partha Dasgupta et al The design and implementation of the Clouds distributed op erating system

Computing Systems Winter

Bryan Ford Mike Hibler and Jay Lepreau Notes on thread mo dels in Mach Technical Rep ort

UUCS University of Utah Department April

Bryan Ford and Jay Lepreau Evolving Mach to use migrating threads Technical Rep ort UUCS

University of Utah November

Graham Hamilton and Panos Kougiouris The Spring nucleus a microkernel for ob jects In Proc of

the Summer USENIX Conference pages Cincinnati OH June

Dan Hildebrand An architectural overview of QNX In Proc of the USENIX Workshop on Micro

kernels and Other Kernel Architectures pages Seattle WA April

DB Johnson and W Zwaenep o el The Peregrine highp erformance RPC system Software Practice

and Experience February

Jay Lepreau Mike Hibler Bryan Ford and Je Law Inkernel servers on Mach Implementation

and p erformance In Proc of the Third USENIX Mach Symposium pages April

Jo chen Liedtke Improving IPC bykernel design In Proc of the th ACM Symposium on Operating

Systems Principles Asheville NC December

Op en Systems Foundation and Carnegie Mellon Univ MACH Kernel Interface

Simon Patience Redirecting system calls in Mach An alternative to the emulator In Proc of the

Third USENIX Mach Symposium pages Santa Fe NM April

M Rozier V Abrossimov F Armand I Boule M Gien M Guillemont F Herrmann C Kaiser

S Langlois PLeonard and W Neuhauser The Chorus distributed op erating system Computing

Systems Decemb er

Michael L Scott Thomas J LeBlanc and Brian D Marsh Design rationale for Psyche a general

purp ose multipro cessor op erating system In Proc of the International ConferenceonParal lel

Processing pages August

Daniel Sto dolsky J Bradley Chen and Brian N Bershad Fast interrupt priority managementin

op erating system kernels In Proc of the Second USENIX Workshop on Microkernels and Other

Kernel Architectures San Diego CA September

Vrije Universiteit Amsterdam NL The AmoebaReference Manual Programming Guide rpc

manual page ftpcsvunlamo ebamanualspropsZ

Author Information

Bryan Ford is an undergraduate in Computer Science at the University of Utah His current ma jor research

interest is improving Mach but he pursues other interests including data compression languages graphics and

music He is the author of several widely used packages for the Amiga including the XPK compression package and

the MultiPlayer music program Bryan is the designer and primary implementor of Mach migrating threads

Jay Lepreau is Assistant Director of the Center for Software Science a research group within Utahs Computer

Science Departmentwhichworks in many asp ects of systems software He has worked with Unix since and

has served as cochair of the USENIX conference and on numerous other USENIX program committees His

group has made signicantcontributions to the BSD and GNU software distributions His current researchinterests

include exible system structuring with op erating system language linking and runtime comp onents

The authors addresses are Center for Software Science Department of Computer Science University of Utah

They can b e reached electronicall y at fbafordlepreaugcsutahedu

Unix is a trademark of USL OSF is a trademark of the Op en Software Foundation OS is a trademark of IBM