Optimistic Incremental Specialization: Streamlining a Commercial Operating System*

Calton Pu, Tito Autrey, Andrew Black, Charles Consel†, Crispin Cowan, Jon Inouye, Lakshmi Kethana, Jonathan Walpole, and Ke Zhang

Department of Computer Science and Engineering
Oregon Graduate Institute of Science & Technology
([email protected])

* This research is partially supported by ARPA grant N00014-94-1-0845, NSF grant CCR-9224375, and grants from the Hewlett-Packard Company and Tektronix.
† Also University of Rennes/IRISA (France).

Abstract

Conventional operating system code is written to deal with all possible system states, and performs considerable interpretation to determine the current system state before taking action. A consequence of this approach is that kernel calls which perform little actual work take a long time to execute. To address this problem, we use specialized operating system code that reduces interpretation for common cases, but still behaves correctly in the fully general case. We describe how specialized operating system code can be generated and bound incrementally as the information on which it depends becomes available. We extend our specialization techniques to include the notion of optimistic incremental specialization: a technique for generating specialized kernel code optimistically for system states that are likely to occur, but not certain. The ideas outlined in this paper allow the conventional kernel design tenet of "optimizing for the common case" to be extended to the domain of adaptive operating systems. We also show that aggressive use of specialization can produce in-kernel implementations of operating system functionality with performance comparable to user-level implementations.

We demonstrate that these ideas are applicable in real-world operating systems by describing a re-implementation of the HP-UX file system. Our specialized read system call reduces the cost of a single-byte read by a factor of 3, and an 8 KB read by 26%, while preserving the semantics of the HP-UX read call. By relaxing the semantics of HP-UX read we were able to cut the cost of a single-byte read system call by more than an order of magnitude.

1 Introduction

Much of the complexity in conventional operating system code arises from the requirement to handle all possible system states. A consequence of this requirement is that operating system code tends to be "generic", performing extensive interpretation and checking of the current environment before taking action. One of the lessons of the Synthesis operating system [25] is that significant gains in efficiency can be made by replacing this generic code with specialized code. The specialized code performs correctly only in a restricted environment, but it is chosen so that this restricted environment is the common case.

By way of example, consider a simplified file system interface in which open takes a path name and returns an "open file" object. The operations on that object include read, write, close, and seek. The method code for read and write can be specialized, at open time, to read and write that particular file, because at that time the system knows, among other things, which file is being read, which process is doing the reading, the file type, the file system block size, whether the inode is in memory, and if so, its address, etc. Thus, a lot of the interpretation of file system data structures that would otherwise have to go on at every read can be done once at open time. Performing this interpretation at open time is a good idea if read is more common than open, and in our experience with specializing the Unix file system, loses only if the file is opened for read and then never read.

Extensive use of this kind of specialization in Synthesis achieved improvement in kernel call performance ranging from a factor of 3 to a factor of 56 [25] for a subset of the Unix system call interface. However, the performance improvements due directly to code specialization were not separated from the gains due to other factors, including the design and implementation of a new kernel in assembly language, and the extensive use of other new techniques such as lock-free synchronization and feedback.

This paper describes work done in the context of the Synthetix project, which is a follow-on from Synthesis. Synthetix extends the Synthesis results in the following respects. First, Synthetix defines a conceptual model of specialization. This model defines the basic elements and phases of the specialization process. Not only is this model useful for deploying specialization in other systems, it has also proved essential in the development of tools to support the specialization process.

Second, Synthetix introduces the idea of incremental and optimistic specialization. Incremental specialization allows specialized modules to be generated and bound as the information on which they depend becomes available. Optimistic specialization allows modules to be generated for system states that are likely to occur, but not certain.
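The open-time binding described in the file system example above can be shown in miniature. The following C sketch is purely illustrative (file_t, generic_read, fast_read, and the in-memory "data" buffer are all invented names, not Synthesis or HP-UX code): open performs the interpretation once and binds a read method specialized for a regular, in-memory file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum ftype { FT_REGULAR, FT_DEVICE };

typedef struct file file_t;
struct file {
    enum ftype  type;
    size_t      block_size;
    size_t      offset;
    const char *data;          /* stands in for the buffer cache */
    size_t      size;
    size_t    (*read)(file_t *f, char *buf, size_t n);  /* bound at open time */
};

/* Generic path: re-interprets the file state on every call. */
static size_t generic_read(file_t *f, char *buf, size_t n)
{
    if (f->type != FT_REGULAR) return 0;   /* interpretation ...        */
    if (f->offset >= f->size) return 0;    /* ... repeated on each read */
    if (n > f->size - f->offset) n = f->size - f->offset;
    memcpy(buf, f->data + f->offset, n);
    f->offset += n;
    return n;
}

/* Specialized path: the checks above were done once, at open time. */
static size_t fast_read(file_t *f, char *buf, size_t n)
{
    if (n > f->size - f->offset) n = f->size - f->offset;
    memcpy(buf, f->data + f->offset, n);
    f->offset += n;
    return n;
}

/* "open": performs the interpretation once and binds the method. */
void file_open(file_t *f, const char *data, size_t size, enum ftype t)
{
    f->type = t;
    f->block_size = 8192;
    f->offset = 0;
    f->data = data;
    f->size = size;
    f->read = (t == FT_REGULAR) ? fast_read : generic_read;
}
```

Binding the method pointer at open time is the essence of the technique: the per-call checks in generic_read disappear from the common path, while any other kind of file still falls back to the generic routine.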

Finally, we show how optimistic, incremental specialization can be applied to existing operating systems written in conventional programming languages. In this approach, performance improvements are achieved by customizing existing code without altering its semantics. In contrast, other extensible operating systems allow arbitrary customizations to be introduced at the risk of altering the system semantics.

The long-term goal of Synthetix is to define a methodology for specializing existing system components and for building new specialized components from scratch. To explore and verify this methodology we are manually specializing some small, but representative, operating system components. The experience gained in these experiments is being used to develop tools towards automating the specialization process. Hence, the goal of our experiments is to gain an understanding of the specialization concept rather than simply to optimize the component in question.

This paper illustrates our approach by applying specialization in the context of a commercial Unix operating system and the C programming language. Specifically, it focuses on the specialization of the read system call in HP-UX [12] while retaining standard HP-UX semantics. Since read is representative of many other Unix system calls and since HP-UX is representative of many other Unix systems, we expect the results of our study to generalize well beyond this specific implementation.

The remainder of the paper is organized as follows. Section 2 elaborates on the notion of specialization, and defines incremental and optimistic specialization. Section 3 describes the application of specialization to the HP-UX read system call. Section 4 analyses the performance of our implementation. Section 5 discusses the implications of our results, as well as the key areas for future research. Related work is discussed in Section 6. Section 7 concludes the paper.

2 What is Specialization?

Program specialization, also called partial evaluation (PE), is a program transformation process aimed at customizing a program based on parts of its input [13, 30]. In essence, this process consists of performing constant propagation and folding, generalized to arbitrary computations.

In principle, program specialization can be applied to any program that exhibits interpretation, that is, any program whose control flow is determined by the analysis of some data. In fact, this characteristic is common in operating system code. Consider the read example of Section 1. At each invocation, read does extensive data structure analysis to determine facts such as whether the file is local or remote, the device on which it resides, and its block size. Yet these pieces of information are invariant, and can be determined when the file is opened. Hence, the data structure analysis could be factorized by specializing the read code at open time with respect to the available invariants. Since the specialized read code only needs to consist of operations that rely on varying data, it can be more efficient than the original version.

Operating systems exhibit a wide variety of invariants. As a first approximation, these invariants can be divided into two categories depending on whether they become available before or during runtime. Examples of invariants available before runtime include processor cache size, whether there is an FPU, etc. These invariants can be exploited by existing PE technology at the source level. Other specializations depend on invariants that are not known until runtime, and hence rely on a specialization process that can take place at runtime. In the context of operating systems, it is useful for specialization to take place both at compile time and at run time.

Given a list of invariants, which may be available either statically or dynamically, a combination of compile-time and run-time PE should be capable of generating the required specialized code. For example, the Synthesis kernel [28] performed the (conceptual) PE step just once, at runtime during open. It is in principle possible to apply the run-time partial evaluator again at every point where new invariants become known (i.e., some or all of the points at which more information becomes available about the bindings that the program contains). We call this repeated application of a partial evaluator incremental specialization [15].

The discussion so far has considered generating specialized code only on the basis of known invariants, i.e., bindings that are known to be constant. In an operating system, there are many things that are likely to be constant for long periods of time, but may occasionally vary. For example, it is likely that files will not be shared concurrently, and that reads to a particular file will be sequential. We call these assumptions quasi-invariants. If specialized code is generated, and used, on the assumption that quasi-invariants hold most of the time, then performance should improve. However, the system must correctly handle the cases where the quasi-invariants do not hold.

Correctness can be preserved by guarding every place where quasi-invariants may become false. For example, suppose that specialized read code is generated based on the quasi-invariant "no concurrent sharing". A guard placed in the open system call could be used to detect other attempts to open the same file concurrently. If the guard is triggered, the read routine must be "unspecialized", either to the completely generic read routine or, more accurately, to another specialized version that still capitalizes on the other invariants and quasi-invariants that remain valid. We call the process of replacing one version of a routine by another (in a different stage of specialization) replugging. We refer to the overall process of specializing based on quasi-invariants as optimistic specialization. Because it may become necessary to replug dynamically, optimistic specialization requires incremental specialization.

If the optimistic assumptions about a program's behavior are correct, the specialized code will function correctly. If one or more of the assumptions become false, the specialized code will break, and it should be replugged. This transformation will be a win if specialized code is executed many times, i.e., if the savings that accrue from the optimistic assumption being right, weighted by the probability that it is right, exceed the additional costs of the replugging step, weighted by the probability that it is necessary (see Section 4 for details).

The discussion so far has described incremental and optimistic specialization as forms of runtime PE. However, in the operating system context, the full cost of code generation must not be incurred at runtime. The cost of runtime code generation can be avoided by generating code templates statically and optimistically at compile time. At kernel call invocation time, the templates are simply filled in and bound appropriately [14].
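The guard-and-replug cycle for a quasi-invariant can be made concrete with a small C sketch (struct qfile and all function names here are invented for illustration): reads are specialized on the assumption of sequential access, and an lseek-style repositioning is the guard that unspecializes back to the generic path.

```c
#include <assert.h>
#include <stddef.h>

struct qfile {
    size_t offset;
    size_t next_addr;     /* cached address of the next byte to copy      */
    int    sequential;    /* quasi-invariant: reads continue sequentially */
};

static int interp_steps;  /* counts per-call interpretation, for illustration */

/* Specialized path: trust the cached address, no per-call interpretation. */
static size_t seq_read(struct qfile *f, size_t n)
{
    f->next_addr += n;
    f->offset += n;
    return n;
}

/* Generic path: re-derive the address from the offset on every call. */
static size_t slow_read(struct qfile *f, size_t n)
{
    interp_steps++;                    /* offset -> block -> address ...  */
    f->next_addr = 0x1000 + f->offset; /* toy address translation         */
    f->next_addr += n;
    f->offset += n;
    return n;
}

size_t qread(struct qfile *f, size_t n)
{
    return f->sequential ? seq_read(f, n) : slow_read(f, n);
}

/* Guard: seeking falsifies the quasi-invariant, so replug to generic code. */
void qseek(struct qfile *f, size_t pos)
{
    f->sequential = 0;
    f->offset = pos;
}
```

The point of the sketch is the asymmetry: as long as the quasi-invariant holds, every read skips the interpretation entirely; the one-time guard cost is paid only when the assumption is actually violated.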

3 Specializing HP-UX read

To explore the real-world applicability of the techniques outlined above, we applied incremental and optimistic specialization to the HP-UX 9.04 read system call. read was chosen as a test case because it is representative of many other Unix system calls, being a variation of the BSD file system [24]. The HP-UX implementation of read is also representative of many other Unix implementations. Therefore, we expect our results to be applicable to other Unix-like systems. HP-UX read also poses a serious challenge for our technology because, as a frequently used and well-understood piece of commercial operating system code, it has already been highly optimized. To ensure that our performance results are applicable to current software, we were careful to preserve the current interface and behavior of HP-UX read in our specialized read implementations.

3.1 Overview of the HP-UX read Implementation

To understand the nature of the savings involved in our specialized read implementation it is first necessary to understand the basic operations involved in a conventional Unix read implementation. Figure 1 shows the control flow for read assuming that the read is from a normal file and that its data is in the buffer cache. The basic steps are as follows:

1. System call startup: privilege promotion, switch to kernel stack, and update user structure.

2. Identify the file and file system type: translate the file descriptor number into a file descriptor, then into a vnode number, and finally into an inode number.

3. Lock the inode.

4. Identify the block: translate the file offset value into a logical (file) block number, and then translate the logical block number into a physical (disk) block number.

5. Find the virtual address of the data: find the block in the buffer cache containing the desired physical block and calculate the virtual address of the data from the file offset.

6. Data transfer: copy necessary bytes from the buffer cache block to the user's buffer.

7. Process another block?: compare the total number of bytes copied to the number of bytes requested; go to step 4 if more bytes are needed.

8. Unlock the inode.

9. Update the file offset: lock file table, update file offset, and unlock the file table.

10. System call cleanup: update kernel profile information, switch back to user stack, privilege demotion.

Figure 1: HP-UX read Flow Graph

The above tasks can be categorized as either interpretation, traversal, locking, or work. Interpretation is the process of examining parameter values and system states and making control flow decisions based on them. Hence, it involves activities such as conditional and case statement execution, and examining parameters and other system state variables to derive a particular value. Traversal can be viewed as preparation for interpretation since it is concerned with discovering system state information that has been stored previously (i.e., in data structures). Traversal is basically a matter of dereferencing and includes function calling and data structure searching. Locking includes all synchronization-related activities. Work is the fundamental task of the call. In the case of read, the only work is to copy the desired data from the kernel buffers to the user's buffer.

Ideally, all of the tasks performed by a particular invocation of a system call should be in the work category. If the necessary information is available early enough, code in all other categories can be factored out prior to the call. Unfortunately, in the case of read steps 1, 2, 4, 5, 7, and 10 consist mostly of interpretation and traversal, and steps 3, 8, and most of 9 are locking. Only step 6 and a small part of 9 can be categorized as work.

Section 3.2 details the specializations applied to read to reduce the non-work overhead. Section 3.3 outlines the code necessary to guard the optimistic specializations. Section 3.4 describes the mechanism for switching between various specialized and generic versions of a system call implementation.
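Steps 4 through 7 above form the per-block loop of the call. The following toy C sketch (invented structures, with 8-byte blocks standing in for 8 KB and an identity block map standing in for the real inode block pointers) labels each task with its category, which makes visible how little of the loop is actual work.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLKSZ 8

/* toy buffer cache: four resident 8-byte blocks */
static char cache[4][BLKSZ] = { "aaaaaaaa", "bbbbbbbb", "cccccccc", "dddddddd" };

static int locked;   /* stands in for the inode lock */

size_t toy_read(size_t *offset, char *buf, size_t n)
{
    size_t copied = 0;
    locked = 1;                                      /* step 3: locking        */
    while (copied < n) {                             /* step 7: interpretation */
        size_t lblk = *offset / BLKSZ;               /* step 4: interpretation */
        size_t pblk = lblk;                          /* step 4: traversal      */
        char  *src  = cache[pblk] + *offset % BLKSZ; /* step 5: traversal      */
        size_t run  = BLKSZ - *offset % BLKSZ;       /* bytes left in block    */
        if (run > n - copied) run = n - copied;
        memcpy(buf + copied, src, run);              /* step 6: work           */
        copied  += run;
        *offset += run;                              /* step 9: work           */
    }
    locked = 0;                                      /* step 8: locking        */
    return copied;
}
```

Only the two lines marked "work" move the user's data; everything else is state discovery that, for a given open file, yields the same answers on every call.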

3.2 Invariants and Quasi-Invariants for Specialization

Figure 2 shows the flow graph for our specialized version of read (referred to here as is_read). Steps 2, 3, and 8 have been eliminated, steps 4 and 5 have been eliminated for small sequential reads, and the remaining steps have been optimized via specialization. is_read is specialized according to the invariants and quasi-invariants listed in Table 1. Only fs_constant is a true invariant; the remainder are quasi-invariants.

The fs_constant invariant states that file system constants such as the file type, file system type, and block size do not change once the file has been opened. This invariant is known to hold because of Unix file system semantics. Based on this invariant, is_read can avoid the traversal costs involved in step 2 above. Our is_read implementation is specialized, at open time, for regular files residing on a local file system with a block size of 8 KB. It is important to realize that the is_read code is enabled, at open time, for the specific file being opened and is transparently substituted in place of the standard read implementation. Reading any other kind of file defaults to the standard HP-UX read.

It is also important to note that the is_read path is specialized for the specific process performing the open. That is, we assume that the only process executing the is_read code will be the one that performed the open that generated it. The major advantage of this approach is that a private per-process per-file read call has well-defined access semantics: reads are sequential by default.

Specializations based on the sequential_access quasi-invariant can have huge performance gains. Consider a sequence of small (say 1 byte) reads by the same process to the same file. The first read performs the interpretation, traversal and locking necessary to locate the kernel virtual address of the data it needs to copy. At this stage it can specialize the next read to simply continue copying from the next virtual address, avoiding the need for any of the steps 2, 3, 4, 5, 7, 8, and 9, as shown in Figure 2. This specialization is predicated not only on the sequential_access and no_fp_share quasi-invariants, but also on other quasi-invariants such as the assumption that the next read won't cross a buffer boundary, and the buffer cache replacement code won't have changed the data that resides at that virtual memory address. The next section shows how these assumptions can be guarded.

The no_holes quasi-invariant is also related to the specializations described above. Contiguous sequential reading can be specialized down to contiguous byte-copying only for files that don't contain holes, since hole traversal requires the interpretation of empty block pointers in the inode.

The no_inode_share and no_fp_share quasi-invariants allow exclusive access to the file to be assumed. This assumption allows the specialized read code to avoid locking the inode and file table in steps 3, 8, and 9. They also allow the caching (in data structures associated with the specialized code) of information such as the file pointer. This caching is what allows all of the interpretation, traversal and locking in steps 2, 3, 4, 5, 8 and 9 to be avoided.

In our current implementation, all invariants are validated in open, when specialization happens. A specialized read routine is not generated unless all of the invariants hold.

Figure 2: is_read Flow Graph

(Quasi-)Invariant   Description                           Savings
fs_constant         Invariant file system parameters.     Avoids step 2.
no_fp_share         No file pointer sharing.              Avoids most of step 9 and allows caching
                                                          of file offset in file descriptor.
no_holes            No holes in file.                     Avoids checking for empty block pointers
                                                          in inode structure.
no_inode_share      No inode sharing.                     Avoids steps 3 and 8.
no_user_locks       No user-level locks.                  Avoids having to check for user-level locks.
read_only           No writers.                           Allows optimized end-of-file check.
sequential_access   Calls to is_read inherit file offset  For small reads, avoids steps 2, 3, 4, 5,
                    from previous is_read calls.          7, 8, 9.

Table 1: Invariants for Specialization
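The open-time test that gates specialization can be sketched as a single predicate over the conditions in Table 1. The struct and field names below are invented for illustration; they are not the HP-UX data structures.

```c
#include <assert.h>

/* snapshot of the state examined at open time (illustrative only) */
struct open_state {
    int      is_regular, is_local;   /* fs_constant: file and fs type */
    unsigned block_size;             /* fs_constant: must be 8 KB     */
    int      has_holes;              /* no_holes                      */
    int      inode_shared;           /* no_inode_share                */
    int      fp_shared;              /* no_fp_share                   */
    int      user_locks;             /* no_user_locks                 */
    int      open_for_write;         /* read_only                     */
};

/* A specialized is_read path is set up only if every invariant and
 * quasi-invariant in Table 1 holds at open time. */
int can_specialize(const struct open_state *s)
{
    return s->is_regular && s->is_local && s->block_size == 8192
        && !s->has_holes && !s->inode_shared && !s->fp_shared
        && !s->user_locks && !s->open_for_write;
}
```

Because the predicate is conjunctive, a single violated condition forces the open to fall back to the standard read path; this mirrors the all-or-nothing validation described above.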

3.3 Guards

Since specializations based on quasi-invariants are optimistic, they must be adequately guarded. Guards detect the impending invalidation of a quasi-invariant and invoke the replugging routine (Section 3.4) to unspecialize the read code. Table 2 lists the quasi-invariants used in our implementation and the HP-UX system calls that contain the associated guards.

Quasi-Invariant     HP-UX system calls that may invalidate invariants
no_fp_share         creat, dup, dup2, fork, sendmsg
no_holes            open
no_inode_share      creat, fork, open, truncate
no_user_locks       lockf, fcntl
read_only           open
sequential_access   lseek, readv, is_read itself, and buffer cache block replacement

Table 2: Quasi-Invariants and Their Guards

Quasi-invariants such as read_only and no_holes can be guarded in open since they can only be violated if the same file is opened for writing. The other quasi-invariants can be invalidated during other system calls which either access the file using a file descriptor from within the same or a child process, or access it from other processes using system calls that name the file using a pathname. For example, no_fp_share will be invalidated if multiple file descriptors are allowed to share the same file pointer. This situation can arise if the file descriptor is duplicated locally using dup, if the entire file descriptor table is duplicated using fork, or if a file descriptor is passed through a Unix domain socket using sendmsg. Similarly, sequential_access will be violated if the process calls lseek or readv.

The guards in system calls that use file descriptors are relatively simple. The file descriptor parameter is used as an index into a per-process table; if a specialized file descriptor is already present then the quasi-invariant will become invalid, triggering the guard and invoking the replugger. For example, the guard in dup only responds when attempting to duplicate a file descriptor used by is_read. Similarly, fork checks all open file descriptors and triggers replugging of any specialized read code.

Guards in calls that take pathnames must detect collisions with specialized code by examining the file's inode. We use a special flag in the inode to detect whether a specialized code path is associated with a particular inode. (The in-memory version of HP-UX's inodes is substantially larger than the on-disk version; the specialized flag only needs to be added to the in-memory version.)

The guards for the sequential_access quasi-invariant are somewhat unusual. is_read avoids step 5 (buffer cache lookup) by checking to see whether the read request will be satisfied by the buffer cache block used to satisfy the preceding read request. The kernel's buffer cache block replacement mechanism also guards this specialization to ensure that the preceding buffer cache block is still in memory.

With the exception of lseek, triggering any of the guards discussed above causes the read code to be replugged back to the general purpose implementation. lseek is the only instance of respecialization in our implementation; when triggered, it simply updates the file offset in the specialized read code.

To guarantee that all invariants and quasi-invariants hold, open checks that the vnode meets all the fs_constant and no_holes invariants and that the requested access is only for read. Then the inode is checked for sharing. If all invariants hold during open then the inode and file descriptor are marked as specialized and an is_read path is set up for use by the calling process on that file. Setting up the is_read path amounts to allocating a private per-file-descriptor data structure for use by the is_read code, which is sharable. The inode and file descriptor markings activate all of the guards atomically since the guard code is permanently present.

3.4 The Replugging Algorithm

Replugging components of an actively running kernel is a non-trivial problem that requires a paper of its own, and is the topic of ongoing research. The problem is simplified here for two reasons. First, our main objective is to test the feasibility and benefits of specialization. Second, specialization has been applied to the replugging algorithm itself. For kernel calls, the replugging algorithm should be specialized, simple, and efficient.

The first problem to be handled during replugging is synchronization. If a replugger were executing in a single-threaded kernel with no system call blocking in the kernel, then no synchronization would be needed. Our environment is a multiprocessor, where kernel calls may be suspended. Therefore, the replugging algorithm must handle two sources of concurrency: (1) interactions between the replugger and the process whose code is being replugged and (2) interactions among other kernel threads that triggered a guard and invoked the replugging algorithm at the same time. Note that the algorithm presented here assumes that a coherent read from memory is faster than a concurrency lock: if specialized hardware makes locks fast, then the specialized synchronization mechanism presented here can be replaced with locks.

To simplify the replugging algorithm, we make two assumptions that are true in many Unix systems: (A1) kernel calls cannot abort (take an unexpected path out of the kernel on failure), so we do not have to check for an incomplete kernel call to is_read, and (A2) there is only one thread per process, so multiple kernel calls cannot concurrently access process level data structures.

The second problem that a replugging algorithm must solve is the handling of executing threads inside the code being replugged. We assume (A3) that there can be at most one thread executing inside specialized code. This is the most important case, since in all cases so far we have specialized for a single thread of control. This assumption is consistent with most current Unix environments. To separate the simple case (when no thread is executing inside code to be replugged) from the complicated case (when one thread is inside), we use an "inside-flag". The first instruction of the specialized read code sets the inside-flag to indicate that a thread is inside. The last instruction in the specialized read code clears the inside-flag.

To simplify the synchronization of threads during replugging, the replugging algorithm uses a queue, called the holding tank, to stop the thread that happens to invoke the specialized kernel call while replugging is taking place. Upon completion of replugging, the algorithm activates the thread waiting in the holding tank. The thread then resumes the invocation through the unspecialized code.

For simplicity, we describe the replugging algorithm as if there were only two cases: specialized and non-specialized. The paths take the following steps:

1. Check the file descriptor to see if this file is specialized. If not, branch out of the fast path.

2. Set inside-flag.

3. Branch indirect. This branch leads to either the holding tank or the read path. It is changed by the replugger.

Read Path:

1. Do the read work.

2. Clear inside-flag.

Holding Tank:

1. Clear inside-flag.

2. Sleep on the per-file lock to await replugger completion.

3. Jump to standard read path.

Replugging Algorithm:

1. Acquire per-process lock to block concurrent repluggers. It may be that some guard was triggered concurrently for the same file descriptor, in which case we are done.

2. Acquire per-file lock to block exit from holding tank.

3. Change the per-file indirect pointer to send readers to the holding tank (this changes the action of the reader at step 3 so no new threads can enter the specialized code).

4. Spin-wait for the per-file inside-flag to be cleared. Now no threads are executing the specialized code.

5. Perform incremental specialization according to which invariant was invalidated.

6. Set the file descriptor appropriately, including indicating that the file is no longer specialized.

7. Release per-file lock to unblock any thread in the holding tank.

8. Release per-process lock to allow other repluggers to continue.

The way the replugger synchronizes with the reader thread is through the inside-flag in combination with the indirection pointer. If the reader sets the inside-flag before a replugger sets the indirection pointer then the replugger waits for the reader to finish. If the reader takes the indirect call into the holding tank, it will clear the inside-flag, which will tell the replugger that no thread is executing the specialized code. Once the replugging is complete the algorithm unblocks any thread in the holding tank and they resume through the new unspecialized code.

In most cases of unspecialization, the general case, read, is used instead of the specialized is_read. In this case, the file descriptor is marked as unspecialized and the memory is_read occupies is marked for garbage collection at file close time.

4 Performance Results

The experimental environment for the benchmarks was a Hewlett-Packard 9000 series 800 G70 (9000/887) dual-processor server [1] running in single-user mode. This server is configured with 128 MB of RAM. The two PA7100 [16] processors run at 96 MHz and each contains one MB of instruction cache and one MB of data cache.

Section 4.1 presents an experiment to show how incremental specialization can reduce the overhead of the read system call. Sections 4.2 through 4.4 describe the overhead costs of specialization. Section 4.5 describes the basic cost/benefit analysis of when specialization is appropriate.

4.1 Specialization to Reduce read Overhead

The first microbenchmark is designed to illustrate the reduction in overhead costs associated with the read system call. Thus the experiment has been designed to reduce all other costs, to wit:

- all experiments were run with a warm file system buffer cache (the use of specialization to optimize the device I/O path and make better use of the file system buffer cache is the subject of a separate study currently underway in our group)

              1 Byte    8 KB    64 KB
read            3540    7039    39733
is_read          979    5220    35043
Savings         2561    1819     4690
% Improvement    72%     26%      12%
Speedup        x3.61   x1.35    x1.13
copyout [4]      170    3400       NA

Table 3: Overhead Reduction: HP-UX read versus is_read costs in CPU cycles

Figure 3: Overhead Reduction: 1 Byte HP-UX read versus is_read costs in CPU cycles

Figure 4: Overhead Reduction: 8 KB HP-UX read versus is_read costs in CPU cycles

Figure 5: Overhead Reduction: 64 KB HP-UX read versus is_read costs in Thousands of CPU cycles

- the same block was repeatedly read to achieve all-hot CPU data caches

- the target of the read is selected to be a page-aligned user buffer to optimize copyout performance

The program consists of a tight loop that opens the file, gets a timestamp, reads N bytes, gets a timestamp, and closes the file. Timestamps are obtained by reading the PA-RISC's interval timer, a processor control register that is incremented every processor cycle [20].

Table 3 numerically compares the performance of HP-UX read with is_read for reads of one byte, 8 KB, and 64 KB, and Figures 3 through 5 graphically represent the same data. The table and figures also include the costs for copyout to provide a lower bound on read performance. [5] In all cases, is_read performance is better than HP-UX read. For single byte reads, is_read is more than three times as fast as HP-UX read, reflecting the fact that specialization has removed most of the overhead of the read system call. Reads that cross block boundaries lose the specialization of simply continuing to copy data from the position of the previous access in the current buffer cache block (the sequential quasi-invariant), and thus suffer a performance loss. 8 KB reads necessarily cross block boundaries, and so the 8 KB is_read has a smaller performance gain than the 1 Byte is_read. For larger reads, the performance gain is not so large because the overall time is dominated by data copying costs rather than overhead. However, there are performance benefits from specialization that occur on a per-block basis, and so even 64 KB reads improve by about 12%.

[4] Large reads break down into multiple block-sized calls to copyout.

[5] copyout performs a safe copy from kernel to user space while correctly handling page faults and segmentation violations.

4.2 The Cost of the Initial Specialization

The performance improvements in the read fast path come at the expense of overhead in other parts of the system. The most significant impact occurs in the open system call, which is the point at which the is_read path is generated. open has to check 8 invariant and quasi-invariant values, for a total of about 90 instructions and a lock/unlock pair. If specialization can occur it needs to allocate some kernel memory and fill it in. close needs to check if the file descriptor is or was specialized and, if so, free the kernel memory. A kernel memory alloc takes 119 cycles and free takes 138 cycles.

The impact of this work is that the new specialized open call takes 5582 cycles compared to 5227 cycles for the standard HP-UX open system call. In both cases, no inode traversal is involved. As expected, the cost of the new open call is higher than the original. However, notice that the increase in cost is small enough that a program that opens a file and reads it once can still benefit from specialization.

4.3 The Cost of Nontriggered Guards

The cost of guards can be broken down into two cases: the cost of executing them when they are not triggered, and the cost of triggering them and performing the necessary replugging. This sub-section is concerned with the first case.

Guards are associated with each of the system calls shown in Table 2. As noted elsewhere, there are two sorts of guards. One checks for specialized file descriptors and is very cheap; the other checks for specialized inodes. Since inodes can be shared, they must be locked to check them. The lock expense is only incurred if the file passes all the other tests first. A lock/unlock pair takes 145 cycles. A guard requires 2 temporary registers, 2 loads, an add, and a compare, 11 cycles, and then a function call if it is triggered. It is important to note that these guards do not occur in the data transfer system calls, except for readv, which is not frequently used.

In the current implementation, guards are fixed in place (and always perform checks) but they are triggered only when specialized code exists. Alternatively, guards could be

inserted in-place when associated specialized code is generated. Learning which alternative performs better requires further research on the costs and benefits of specialization mechanisms.

4.4 The Cost of Replugging

There are two costs associated with replugging. One is the overhead added to the fast path in is_read: checking if the file is specialized and calling read if not, writing the inside-flag bit twice, and making the indirect function call with zero arguments otherwise. A timed microbenchmark shows this cost to be 35 cycles.

The second cost of replugging is incurred when the replugging algorithm is invoked. This cost depends on whether there is a thread already present in the code path to be replugged. If so, the elapsed time taken to replug can be dominated by the time taken by the thread to exit the specialized path. The worst case for the read call occurs when the thread present in the specialized path is blocked on I/O. We are working on a solution to this problem which would allow threads to "leave" the specialized code path when initiating I/O and rejoin a replugged path when I/O completes, but this solution is not yet implemented.

In the case where no thread is present in the code path to be replugged, the cost of replugging is determined by the cost of acquiring two locks, one spinlock, checking one memory location and storing to another (to get exclusive access to the specialized code). To fall back to the generic read takes 4 stores plus address generation, plus storing the specialized file offset into the system file table, which requires obtaining the File Table Lock and releasing it. After incremental specialization two locks have to be released. An inspection of the generated code shows the cost to be about 535 cycles assuming no lock contention. The cost of the holding tank is not measured since that is the rarest subcase and it would be dominated by spinning for a lock in any event.

Adding up the individual component costs, and multiplying them by their frequencies, we can estimate the guarding and replugging overhead attributed to each is_read. If we assume that 100 is_read invocations happen for each of the guarded kernel calls (fork, creat, truncate, open, close and replugging), then less than 10 cycles are added as guarding overhead to each invocation of is_read.

4.5 Cost/Benefit Analysis

Specialization reduces the execution costs of the fast path, but it also requires additional mechanisms, such as guards and replugging algorithms, to maintain system correctness. By design, guards are located in low frequency execution paths, and in the rare case of quasi-invariant invalidation, replugging is performed. We have also added code to open to check if specialization is possible, and to close to garbage collect the specialized code after replugging. An informal performance analysis of these costs and a comparison with the gains is shown in Figure 6.

In Equation 1, Overhead includes the cost of guards, the replugging algorithm, and the increase due to initial invariant validation, specialization and garbage collecting for all file opens and closes. Each Guard_i (in different kernel calls) is invoked f_i times. Similarly, Replug is invoked f_Replug times. A small part of the cost of synchronization with the replugger is borne by is_read (the setting and resetting of inside-flag), but overall is_read is much faster than read (Section 4). In Equation 2, f_is is the number of times is_read is invoked and f_TotalRead is the total number of invocations to read the file. Specialization wins if the inequality in Equation 2 is true.

    Overhead = Σ_syscall f_i · Guard_i + Open + Close + f_Replug · Replug    (1)

    Overhead + f_is · is_read < (f_TotalRead - f_is) · read    (2)

Figure 6: Cost/Benefit Analysis: When is it Beneficial to Specialize?

The following sections outline a series of microbenchmarks to measure the performance of the incrementally and optimistically specialized read fast path, as well as the overhead associated with guards and replugging.

5 Discussion

The experimental results described in Section 4 show the performance of our current is_read implementation. At the time of writing this implementation was not fully specialized: some invariants were not used and, as a result, the measured is_read path contains more interpretation and traversal code than is absolutely necessary. Therefore, the performance results presented above are conservative. Even so, the results show that optimistic specialization can improve the performance of both small and large reads.

At one end of the spectrum, assuming a warm buffer cache, the performance of small reads is dominated by control flow costs. Through specialization we are able to remove, from the fast path, a large amount of code concerned with interpretation, data structure traversal and synchronization. Hence, it is not surprising that the cost of small reads is reduced significantly.

At the other end of the spectrum, again assuming a warm buffer cache, the performance of large reads is dominated by data movement costs. In effect, the control-flow overhead, still present in large reads, is amortized over a large amount of byte copying. Specializations to reduce the cost of byte copying will be the subject of a future study.

Specialization removes overhead from the fast path by adding overhead to other parts of the system: specifically, the places at which the specialization, replugging and guarding of optimistic specializations occur. Our experience has shown that generating specialized implementations is easy. The real difficulty arises in correctly placing guards and making policy decisions about what and when to specialize and replug. Guards are difficult to place because an operating system kernel is a large program, and invariants often correspond to global state components which are manipulated at numerous sites. Therefore, manually tracking the places where invariants are invalidated is a tedious process. However, our experience with the HP-UX file system has brought us an understanding of this process which we are now using to design a program analysis capable of determining a conservative approximation of the places where guards should be placed.

Similarly, the choice of what to specialize, when to specialize, and whether to specialize optimistically are all non-trivial policy decisions. In our current implementation we made these decisions in an ad hoc manner, based on our expert knowledge of the system implementation, semantics and common usage patterns. This experiment has prompted us to explore tools based on static and dynamic analyses of kernel code, aimed at helping the programmer decide when the performance improvement gained by specialization is likely to exceed the cost of the specialization process, guarding, and replugging.

5.1 Interface Design and Kernel Structure

From early in the project, our intuition told us that, in the most specialized case, it should be possible to reduce the cost of a read system call that hits in the buffer cache. It should be little more than the basic cost of data movement from the kernel to the application's address space, i.e., the cost of copying the bytes from the buffer cache to the user's buffer. In practice, however, our specialized read implementation costs considerably more than copying one byte. The cost of our specialized read implementation is 979 cycles, compared to approximately 235 cycles for entering the kernel, fielding the minimum number of parameters, and safely copying a single byte out to the application's address space.

Upon closer examination, we discovered that the remaining 744 cycles were due to a long list of inexpensive actions including stack switching, masking floating point exceptions, recording kernel statistics, supporting ptrace, initializing kernel signal handling, etc. The sum of these actions dominates the cost of small byte reads.

These actions are due in part to constraints that were placed upon our design by an over-specification of the UNIX read implementation. For example, the need to always support statistics-gathering facilities such as times requires every read call to record the time it spends in the kernel. For one byte reads, changing the interface to one that returns a single byte in a register instead of copying data to the user's address space, and changing the semantics of the system call to not support statistics and profiling, eliminates the need for many of these actions.

To push the limits of a kernel-based read implementation, we implemented a special one-byte read system call, called readc, which returns a single byte in a register, just like the getc library call. In addition to the optimizations used in our specialized is_read call, readc avoids switching stacks, omits ptrace support, and skips updating profile information. The performance of the resulting readc implementation is 65 cycles. Notice that aggressive use of specialization can lead to a readc system call that performs within a factor of two of a pure user-level getc, which costs 38 cycles in HP-UX's stdio library. This result is encouraging because it shows the feasibility of implementing operating system functionality at kernel level with performance similar to user-level libraries. Aggressive specialization may render unnecessary the popular trend of duplicating operating system functionality at user level [2, 19] for performance reasons.

Another commonly cited reason for moving operating system functionality to user level is to give applications more control over policy decisions and operating system implementations. We believe that these benefits can also be gained without duplicating operating system functionality at user level. Following an open-implementation (OI) philosophy [22], operating system functionality can remain in the kernel, with customization of the implementation supported in a controlled manner via meta-interface calls [23]. A strong lesson from our work and from other work in the OI community [22] is that abstractly specified interfaces, i.e., those that do not constrain implementation choices unnecessarily, are the key to gaining the most benefit from techniques such as specialization.

6 Related Work

Our work on optimistic incremental specialization can be viewed as part of a widespread research trend towards adaptive operating systems. Micro-kernel operating systems [5, 6, 11, 9, 21, 27, 29] were an early example of this trend, and improve adaptiveness by allowing operating system functionality to be implemented in user-level servers that can be customized and configured to produce special-purpose operating systems. While micro-kernel-based architectures improve adaptiveness over monolithic kernels, support for user-level servers incurs a high performance penalty.

To address this performance problem Chorus [29] allows modules, known as supervisor actors, to be loaded into the kernel address space. A specialized IPC mechanism is used for communication between actors within the kernel address space. Similarly, Flex [8] allows dynamic loading of operating system modules into the Mach kernel, and uses a migrating threads model to reduce IPC overhead.

One problem with allowing applications to load modules into the kernel is loss of protection. The SPIN kernel [4] allows applications to load executable modules, called spindles, dynamically into the kernel. These spindles are written in a type-safe programming language to ensure that they do not adversely affect kernel operations.

Object-oriented operating systems allow customization through the use of inheritance, invocation redirection, and meta-interfaces. Choices [7] provides generalized components, called frameworks, which can be replaced with specialized versions using inheritance and dynamic linking. The Spring kernel uses an extensible RPC framework [18] to redirect object invocations to appropriate handlers based on the type of object. The Substrate Object Model [3] supports extensibility in the AIX kernel by providing additional interfaces for passing usage hints and customizing in-kernel implementations. Similarly, the Apertos operating system [31] supports dynamic reconfiguration by modifying an object's behavior through operations on its meta-interface.

Other systems, such as the Cache kernel [10] and the Exo-kernel [17], address the performance problem by moving even more functionality out of the operating system kernel and placing it closer to the application. In this "minimal-kernel" approach extensibility is the norm rather than the exception.

Synthetix differs from the other extensible operating systems described above in a number of ways. First, Synthetix infers the specializations needed even for applications that have never considered the need for specialization. Other extensible systems require applications to know which specializations will be beneficial and then select or provide them.

Second, Synthetix supports optimistic specializations and uses guards to ensure the validity of a specialization and automatically replug it when it is no longer valid. In contrast, other extensible systems do not support automatic replugging and support damage control only through hardware or software protection boundaries.

Third, the explicit use of invariants and guards in Synthetix also supports the composability of specializations: guards determine whether two specializations are composable. Other extensible operating systems do not provide support to determine whether separate extensions are composable.

Like Synthetix, Scout [26] has focused on the specialization of existing systems code. Scout has concentrated on networking code and has focused on specializations that minimize code and data caching effects. In contrast, we have focused on parametric specialization to reduce the length of various fast paths in the kernel. We believe that many of the techniques used in Scout are also useful in Synthetix, and vice versa.

7 Conclusions

This paper has introduced a model for specialization based on invariants, guards, and replugging. We have shown how this model supports the concepts of incremental and optimistic specialization. These concepts refine previous work on kernel optimization using dynamic code generation in Synthesis [28, 25]. We have demonstrated the feasibility and usefulness of incremental, optimistic specialization by applying it to file system code in a commercial operating system (HP-UX). The experimental results show that significant performance improvements are possible even when the base system is (a) not designed specifically to be amenable to specialization, and (b) already highly optimized. Furthermore, these improvements can be achieved without altering the semantics or restructuring the program.

Our experience with performing incremental and optimistic specialization manually has shown that the approach is worth pursuing. Based on this experience, we are developing a program specializer for C and a collection of tools for assisting the programmer in identifying opportunities for specialization and ensuring consistency.

8 Acknowledgements

Bill Trost of Oregon Graduate Institute and Takaichi Yoshida of Kyushu Institute of Technology performed the initial study of the HP-UX read and write system calls, identifying many quasi-invariants to use for specialization. Bill also implemented some prototype specialized kernel calls that showed promise for performance improvements. Luke Hornof of University of Rennes contributed to discussions on specialization and tools development. Finally, we would like to thank the SOSP referees for their extensive and thorough comments and suggestions.

References

[1] Thomas B. Alexander, Kenneth G. Robertson, Dean T. Lindsey, Donald L. Rogers, John R. Obermeyer, John R. Keller, Keith Y. Oka, and Marlin M. Jones II. Corporate Business Servers: An Alternative to Mainframes for Business Computing. Hewlett-Packard Journal, 45(3):8-30, June 1994.

[2] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Transactions on Computer Systems, 10(1):53-79, February 1992.

[3] Arindam Banerji and David L. Cohn. An Infrastructure for Application-Specific Customization. In Proceedings of the ACM European SIGOPS Workshop, September 1994.

[4] Brian N. Bershad, Stefan Savage, Przemyslaw Pardyak, Emin Gun Sirer, Marc Fiuczynski, David Becker, Susan Eggers, and Craig Chambers. Extensibility, Safety and Performance in the SPIN Operating System. In Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 1995.

[5] D. L. Black, D. B. Golub, D. P. Julin, R. F. Rashid, R. P. Draves, R. W. Dean, A. Forin, J. Barrera, H. Tokuda, G. Malan, and D. Bohman. Microkernel operating system architecture and Mach. In Proceedings of the Workshop on Micro-Kernels and Other Kernel Architectures, pages 11-30, Seattle, April 1992.

[6] F. J. Burkowski, C. L. A. Clarke, Crispin Cowan, and G. J. Vreugdenhil. Architectural Support for Lightweight Tasking in the Sylvan Multiprocessor System. In Symposium on Experience with Distributed and Multiprocessor Systems (SEDMS II), pages 165-184, Atlanta, Georgia, March 1991.

[7] Roy H. Campbell, Nayeem Islam, and Peter Madany. Choices: Frameworks and Refinement. Computing Systems, 5(3):217-257, 1992.

[8] John B. Carter, Bryan Ford, Mike Hibler, Ravindra Kuramkote, Jeffrey Law, Jay Lepreau, Douglas B. Orr, Leigh Stoller, and Mark Swanson. FLEX: A Tool for Building Efficient and Flexible Systems. In Proceedings of the Fourth Workshop on Workstation Operating Systems, pages 198-202, Napa, CA, October 1993.

[9] David R. Cheriton. The V Distributed System. Communications of the ACM, 31(3):314-333, March 1988.

[10] David R. Cheriton and Kenneth J. Duda. A Caching Model of Operating System Kernel Functionality. In Symposium on Operating Systems Design and Implementation (OSDI), pages 179-193, November 1994.

[11] David R. Cheriton, M. A. Malcolm, L. S. Melen, and G. R. Sager. Thoth, A Portable Real-Time Operating System. Communications of the ACM, 22(2):105-115, February 1979.

[12] Frederick W. Clegg, Gary Shiu-Fan Ho, Steven R. Kusmer, and John R. Sontag. The HP-UX Operating System on HP Precision Architecture Computers. Hewlett-Packard Journal, 37(12):4-22, December 1986.

[13] C. Consel and O. Danvy. Tutorial notes on partial evaluation. In ACM Symposium on Principles of Programming Languages, pages 493-501, 1993.

[14] C. Consel and F. Noel. A general approach to run-time specialization and its application to C. Report 946, Inria/Irisa, Rennes, France, July 1995.

[15] C. Consel, C. Pu, and J. Walpole. Incremental specialization: The key to high performance, modularity and portability in operating systems. In Proceedings of ACM Symposium on Partial Evaluation and Semantics-Based Program Manipulation, Copenhagen, June 1993.

[16] Eric DeLano, Will Walker, and Mark Forsyth. A High Speed Superscalar PA-RISC Processor. In COMPCON 92, pages 116-121, San Francisco, CA, February 24-28, 1992.

[17] Dawson R. Engler, M. Frans Kaashoek, and James O'Toole Jr. Exokernel: An Operating System Architecture for Application-level Resource Management. In Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 1995.

[18] Graham Hamilton, Michael L. Powell, and James G. Mitchell. Subcontract: A flexible base for distributed programming. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 69-79, Asheville, NC, December 1993.

[19] Kieran Harty and David R. Cheriton. Application-controlled physical memory using external page-cache management. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), pages 187-197, Boston, MA, October 1992.

[20] Hewlett-Packard. PA-RISC 1.1 Architecture and Instruction Set Reference Manual, second edition, September 1992.

[21] Dan Hildebrand. An Architectural Overview of QNX. In Proceedings of the USENIX Workshop on Micro-kernels and Other Kernel Architectures, pages 113-123, Seattle, WA, April 1992.

[22] Gregor Kiczales. Towards a new model of abstraction in software engineering. In Proc. of the IMSA'92 Workshop on Reflection and Meta-level Architectures, 1992. See http://www.xerox.com/PARC/spl/eca/oi.html for updates.

[23] Gregor Kiczales, Jim des Rivieres, and Daniel G. Bobrow. The Art of the Metaobject Protocol. MIT Press, 1991.

[24] Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley Publishing Company, Reading, MA, 1989.

[25] H. Massalin and C. Pu. Threads and input/output in the Synthesis kernel. In Proceedings of the Twelfth Symposium on Operating Systems Principles, pages 191-201, Arizona, December 1989.

[26] David Mosberger, Larry L. Peterson, and Sean O'Malley. Protocol Latency: MIPS and Reality. Report TR 95-02, Dept of Computer Science, University of Arizona, Tucson, Arizona, April 1995.

[27] S. J. Mullender, G. van Rossum, A. S. Tanenbaum, R. van Renesse, and H. van Staveren. Amoeba: A Distributed Operating System for the 1990's. IEEE Computer, 23(5), May 1990.

[28] C. Pu, H. Massalin, and J. Ioannidis. The Synthesis kernel. Computing Systems, 1(1):11-32, Winter 1988.

[29] M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrman, C. Kaiser, S. Langlois, P. Leonard, and W. Neuhauser. Overview of the Chorus distributed operating system. In Proceedings of the Workshop on Micro-Kernels and Other Kernel Architectures, pages 39-69, Seattle, April 1992.

[30] P. Sestoft and A. V. Zamulin. Annotated bibliography on partial evaluation and mixed computation. In D. Bjørner, A. P. Ershov, and N. D. Jones, editors, Partial Evaluation and Mixed Computation. North-Holland, 1988.

[31] Yasuhiko Yokote. The Apertos Reflective Operating System: The Concept and Its Implementation. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pages 414-434, Vancouver, BC, October 1992.