Optimistic Incremental Specialization:
Streamlining a Commercial Operating System

Calton Pu, Tito Autrey, Andrew Black, Charles Consel†, Crispin Cowan,
Jon Inouye, Lakshmi Kethana, Jonathan Walpole, and Ke Zhang

Department of Computer Science and Engineering
Oregon Graduate Institute of Science & Technology
This research is partially supported by ARPA grant N00014-94-1-0845, NSF grant CCR-9224375, and grants from the Hewlett-Packard Company and Tektronix.
† Also University of Rennes/IRISA (France).

Abstract

Conventional operating system code is written to deal with all possible system states, and performs considerable interpretation to determine the current system state before taking action. A consequence of this approach is that kernel calls which perform little actual work take a long time to execute. To address this problem, we use specialized operating system code that reduces interpretation for common cases, but still behaves correctly in the fully general case. We describe how specialized operating system code can be generated and bound incrementally as the information on which it depends becomes available. We extend our specialization techniques to include the notion of optimistic incremental specialization: a technique for generating specialized kernel code optimistically for system states that are likely to occur, but not certain. The ideas outlined in this paper allow the conventional kernel design tenet of "optimizing for the common case" to be extended to the domain of adaptive operating systems. We also show that aggressive use of specialization can produce in-kernel implementations of operating system functionality with performance comparable to user-level implementations.

We demonstrate that these ideas are applicable in real-world operating systems by describing a re-implementation of the HP-UX file system. Our specialized read system call reduces the cost of a single byte read by a factor of 3, and an 8 KB read by 26%, while preserving the semantics of the HP-UX read call. By relaxing the semantics of HP-UX read we were able to cut the cost of a single byte read system call by more than an order of magnitude.

1 Introduction

Much of the complexity in conventional operating system code arises from the requirement to handle all possible system states. A consequence of this requirement is that operating system code tends to be "generic", performing extensive interpretation and checking of the current environment before taking action. One of the lessons of the Synthesis operating system [25] is that significant gains in efficiency can be made by replacing this generic code with specialized code. The specialized code performs correctly only in a restricted environment, but it is chosen so that this restricted environment is the common case.

By way of example, consider a simplified Unix File System interface in which open takes a path name and returns an "open file" object. The operations on that object include read, write, close, and seek. The method code for read and write can be specialized, at open time, to read and write that particular file, because at that time the system knows, among other things, which file is being read, which process is doing the reading, the file type, the file system block size, whether the inode is in memory, and if so, its address, etc. Thus, a lot of the interpretation of file system data structures that would otherwise have to go on at every read can be done once at open time. Performing this interpretation at open time is a good idea if read is more common than open, and in our experience with specializing the Unix file system, loses only if the file is opened for read and then never read.

Extensive use of this kind of specialization in Synthesis achieved improvement in kernel call performance ranging from a factor of 3 to a factor of 56 [25] for a subset of the Unix system call interface. However, the performance improvements due directly to code specialization were not separated from the gains due to other factors, including the design and implementation of a new kernel in assembly language, and the extensive use of other new techniques such as lock-free synchronization and software feedback.

This paper describes work done in the context of the Synthetix project, which is a follow-on from Synthesis. Synthetix extends the Synthesis results in the following respects. First, Synthetix defines a conceptual model of specialization. This model defines the basic elements and phases of the specialization process. Not only is this model useful for deploying specialization in other systems, it has also proved essential in the development of tools to support the specialization process.

Second, Synthetix introduces the idea of incremental and optimistic specialization. Incremental specialization allows specialized modules to be generated and bound as the information on which they depend becomes available. Optimistic specialization allows modules to be generated for system states that are likely to occur, but not certain.
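The open-time binding that this kind of specialization exploits can be illustrated with a small C sketch. This is our illustration, not code from Synthesis or Synthetix: the struct file fields, the read_fn pointer, and open_file are invented for the example. The idea is that open interprets the invariants (file type, data already cached) once and binds the cheapest read routine it can.

```c
/* Illustrative sketch of open-time specialization (names are invented). */
#include <stddef.h>
#include <string.h>

enum ftype { FT_LOCAL, FT_REMOTE };

struct file {
    enum ftype type;           /* interpreted on every generic read        */
    size_t     blocksize;
    const char *cache;         /* stand-in for a buffer cache block        */
    size_t     offset;
    /* read_fn is chosen once, at open time, from the invariants above,
       so later reads skip that interpretation entirely.                   */
    size_t   (*read_fn)(struct file *, char *, size_t);
};

/* Specialized path: correct only for local files whose data is cached. */
static size_t read_local_cached(struct file *f, char *dst, size_t n) {
    memcpy(dst, f->cache + f->offset, n);   /* the only real "work"       */
    f->offset += n;
    return n;
}

/* Generic path: re-interprets the file state on every invocation. */
static size_t read_generic(struct file *f, char *dst, size_t n) {
    if (f->type == FT_LOCAL && f->cache != NULL)
        return read_local_cached(f, dst, n);
    return 0;   /* remote and uncached cases elided in this sketch        */
}

/* "open" inspects the invariants once and binds the cheapest routine. */
static void open_file(struct file *f) {
    f->offset  = 0;
    f->read_fn = (f->type == FT_LOCAL && f->cache != NULL)
                   ? read_local_cached
                   : read_generic;
}
```

After open_file runs, every call through f->read_fn goes straight to the copy loop; the per-call interpretation has been paid once, at open time.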
Finally, we show how optimistic, incremental specialization can be applied to existing operating systems written in conventional programming languages. In this approach, performance improvements are achieved by customising existing code without altering its semantics. In contrast, other extensible operating systems allow arbitrary customizations to be introduced at the risk of altering the system semantics.

The long term goal of Synthetix is to define a methodology for specializing existing system components and for building new specialized components from scratch. To explore and verify this methodology we are manually specializing some small, but representative, operating system components. The experience gained in these experiments is being used to develop tools towards automating the specialization process. Hence, the goal of our experiments is to gain an understanding of the specialization concept rather than simply to optimize the component in question.

This paper illustrates our approach by applying specialization in the context of a commercial Unix operating system and the C programming language. Specifically, it focuses on the specialization of the read system call in HP-UX [12] while retaining standard HP-UX semantics. Since read is representative of many other Unix system calls and since HP-UX is representative of many other Unix systems, we expect the results of our study to generalize well beyond this specific implementation.

The remainder of the paper is organized as follows. Section 2 elaborates on the notion of specialization, and defines incremental and optimistic specialization. Section 3 describes the application of specialization to the HP-UX read system call. Section 4 analyses the performance of our implementation. Section 5 discusses the implications of our results, as well as the key areas for future research. Related work is discussed in section 6. Section 7 concludes the paper.

2 What is Specialization?

Program specialization, also called partial evaluation (PE), is a program transformation process aimed at customizing a program based on parts of its input [13, 30]. In essence, this process consists of performing constant propagation and folding, generalized to arbitrary computations.

In principle, program specialization can be applied to any program that exhibits interpretation, that is, any program whose control flow is determined by the analysis of some data. In fact, this characteristic is common in operating system code. Consider the read example of Section 1. At each invocation, read does extensive data structure analysis to determine facts such as whether the file is local or remote, the device on which it resides, and its block size. Yet, these pieces of information are invariant, and can be determined when the file is opened. Hence, the data structure analysis could be factorized by specializing the read code at open time with respect to the available invariants. Since the specialized read code only needs to consist of operations that rely on varying data, it can be more efficient than the original version.

Operating systems exhibit a wide variety of invariants. As a first approximation, these invariants can be divided into two categories depending on whether they become available before or during runtime. Examples of invariants available before runtime include processor cache size, whether there is an FPU, etc. These invariants can be exploited by existing PE technology at the source level. Other specializations depend on invariants that are not known until runtime, and hence rely on a specialization process that can take place at runtime. In the context of operating systems, it is useful for specialization to take place both at compile time and at runtime.

Given a list of invariants, which may be available either statically or dynamically, a combination of compile-time and run-time PE should be capable of generating the required specialized code. For example, the Synthesis kernel [28] performed the (conceptual) PE step just once, at runtime during open. It is in principle possible to apply the run-time partial evaluator again at every point where new invariants become known (i.e., some or all of the points at which more information becomes available about the bindings that the program contains). We call this repeated application of a partial evaluator incremental specialization [15].

The discussion so far has considered generating specialized code only on the basis of known invariants, i.e., bindings that are known to be constant. In an operating system, there are many things that are likely to be constant for long periods of time, but may occasionally vary. For example, it is likely that files will not be shared concurrently, and that reads to a particular file will be sequential. We call these assumptions quasi-invariants. If specialized code is generated, and used, on the assumption that quasi-invariants hold most of the time, then performance should improve. However, the system must correctly handle the cases where the quasi-invariants do not hold.

Correctness can be preserved by guarding every place where quasi-invariants may become false. For example, suppose that specialized read code is generated based on the quasi-invariant "no concurrent sharing". A guard placed in the open system call could be used to detect other attempts to open the same file concurrently. If the guard is triggered, the read routine must be "unspecialized", either to the completely generic read routine or, more accurately, to another specialized version that still capitalizes on the other invariants and quasi-invariants that remain valid. We call the process of replacing one version of a routine by another (in a different stage of specialization) replugging. We refer to the overall process of specializing based on quasi-invariants as optimistic specialization. Because it may become necessary to replug dynamically, optimistic specialization requires incremental specialization.

If the optimistic assumptions about a program's behavior are correct, the specialized code will function correctly. If one or more of the assumptions become false, the specialized code will break, and it should be replugged. This transformation will be a win if specialized code is executed many times, i.e., if the savings that accrue from the optimistic assumption being right, weighted by the probability that it is right, exceed the additional costs of the replugging step, weighted by the probability that it is necessary (see Section 4 for details).

The discussion so far has described incremental and optimistic specialization as forms of runtime PE. However, in the operating system context, the full cost of code generation must not be incurred at runtime. The cost of runtime code generation can be avoided by generating code templates statically and optimistically at compile time. At kernel call invocation time, the templates are simply filled in and bound appropriately [14].
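The quasi-invariant discipline just described (optimistic fast path, guard, replug) can be sketched in a few lines of C. This is a conceptual illustration only: the struct ofile, its refs field standing in for the "no concurrent sharing" quasi-invariant, and the dup-like guard are all invented for the example.

```c
/* Illustrative sketch of optimistic specialization with a guard.
 * The fast path assumes the quasi-invariant "file is unshared";
 * the guard in a dup-like call replugs the generic path.          */

struct ofile {
    int  refs;                        /* quasi-invariant holds iff refs == 1 */
    long offset;
    long (*read_at)(struct ofile *);  /* currently plugged implementation    */
};

static long ofile_read_generic(struct ofile *f) {
    /* the fully general path would interpret and lock here */
    return f->offset++;
}

static long ofile_read_specialized(struct ofile *f) {
    /* no locking: correct only while the file remains unshared */
    return f->offset++;
}

/* Guard: the call that can invalidate the quasi-invariant replugs
 * the generic routine before completing.                           */
static void ofile_dup(struct ofile *f) {
    f->refs++;
    if (f->read_at == ofile_read_specialized)
        f->read_at = ofile_read_generic;   /* "unspecialize" */
}
```

Calls made while the quasi-invariant holds run the cheap routine; the first call that would break it pays the (rare) replugging cost on behalf of all the cheap calls that preceded it.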
3 Specializing HP-UX read

To explore the real-world applicability of the techniques outlined above, we applied incremental and optimistic specialization to the HP-UX 9.04 read system call. read was chosen as a test case because it is representative of many other Unix system calls and because it is a variation of the BSD file system [24]. The HP-UX implementation of read is also representative of many other Unix implementations. Therefore, we expect our results to be applicable to other Unix-like systems. HP-UX read also poses a serious challenge for our technology because, as a frequently used and well-understood piece of commercial operating system code, it has already been highly optimized. To ensure that our performance results are applicable to current software, we were careful to preserve the current interface and behavior of HP-UX read in our specialized read implementations.

3.1 Overview of the HP-UX read Implementation

To understand the nature of the savings involved in our specialized read implementation it is first necessary to understand the basic operations involved in a conventional Unix read implementation. Figure 1 shows the control flow for read assuming that the read is from a normal file and that its data is in the buffer cache. The basic steps are as follows:

1. System call startup: privilege promotion, switch to kernel stack, and update user structure.

2. Identify the file and file system type: translate the file descriptor number into a file descriptor, then into a vnode number, and finally into an inode number.

3. Lock the inode.

4. Identify the block: translate the file offset value into a logical (file) block number, and then translate the logical block number into a physical (disk) block number.

5. Find the virtual address of the data: find the block in the buffer cache containing the desired physical block and calculate the virtual address of the data from the file offset.

6. Data transfer: copy necessary bytes from the buffer cache block to the user's buffer.

7. Process another block?: compare the total number of bytes copied to the number of bytes requested; go to step 4 if more bytes are needed.

8. Unlock the inode.

9. Update the file offset: lock file table, update file offset, and unlock the file table.

10. System call cleanup: update kernel profile information, switch back to user stack, privilege demotion.

[Figure 1: HP-UX read Flow Graph. The figure shows steps 1 through 10 above, with step 7 looping back to step 4 while more blocks remain.]

The above tasks can be categorized as either interpretation, traversal, locking, or work. Interpretation is the process of examining parameter values and system states and making control flow decisions based on them. Hence, it involves activities such as conditional and case statement execution, and examining parameters and other system state variables to derive a particular value. Traversal can be viewed as preparation for interpretation since it is concerned with discovering system state information that has been stored previously (i.e., in data structures). Traversal is basically a matter of dereferencing and includes function calling and data structure searching. Locking includes all synchronization-related activities. Work is the fundamental task of the call. In the case of read, the only work is to copy the desired data from the kernel buffers to the user's buffer.

Ideally, all of the tasks performed by a particular invocation of a system call should be in the work category. If the necessary information is available early enough, code in all other categories can be factored out prior to the call. Unfortunately, in the case of read, steps 1, 2, 4, 5, 7, and 10 consist mostly of interpretation and traversal, and steps 3, 8, and most of 9 are locking. Only step 6 and a small part of 9 can be categorized as work.

Section 3.2 details the specializations applied to read to reduce the non-work overhead. Section 3.3 outlines the code necessary to guard the optimistic specializations. Section 3.4 describes the mechanism for switching between various specialized and generic versions of a system call implementation.
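The categorization above can be made concrete with a toy version of the generic read loop, annotated with the category each step falls into. This sketch is illustrative only: the inode and fdesc structures are invented stand-ins, and locking is reduced to a flag.

```c
/* Toy generic read path; comments mark the paper's step numbers and
 * their categories (interpretation, traversal, locking, work).       */
#include <string.h>

#define BLOCKSIZE 8                      /* tiny blocks for the example */

struct inode { int locked; const char *blocks[4]; };
struct fdesc { struct inode *ip; long offset; };

static long generic_read(struct fdesc *fd, char *dst, long nbytes) {
    long copied = 0;
    struct inode *ip = fd->ip;                   /* step 2: traversal   */
    ip->locked = 1;                              /* step 3: locking     */
    while (copied < nbytes) {                    /* step 7: interp.     */
        long lbn  = fd->offset / BLOCKSIZE;      /* step 4: interp.     */
        long boff = fd->offset % BLOCKSIZE;
        const char *blk = ip->blocks[lbn];       /* step 5: traversal   */
        long chunk = BLOCKSIZE - boff;
        if (chunk > nbytes - copied)
            chunk = nbytes - copied;
        memcpy(dst + copied, blk + boff, chunk); /* step 6: WORK        */
        copied     += chunk;
        fd->offset += chunk;                     /* step 9: offset      */
    }
    ip->locked = 0;                              /* step 8: locking     */
    return copied;
}
```

Even in this toy, only the memcpy is work; everything else is the per-call interpretation, traversal, and locking that specialization tries to factor out.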
3.2 Invariants and Quasi-Invariants for Specialization

Figure 2 shows the specialized flow graph for our specialized version of read (referred to here as is_read). Steps 2, 3, and 8 have been eliminated, steps 4 and 5 have been eliminated for small sequential reads, and the remaining steps have been optimized via specialization. is_read is specialized according to the invariants and quasi-invariants listed in Table 1. Only fs_constant is a true invariant; the remainder are quasi-invariants.

The fs_constant invariant states that file system constants such as the file type, file system type, and block size do not change once the file has been opened. This invariant is known to hold because of Unix file system semantics. Based on this invariant, is_read can avoid the traversal costs involved in step 2 above. Our is_read implementation is specialized, at open time, for regular files residing on a local file system with a block size of 8 KB. It is important to realize that the is_read code is enabled, at open time, for the specific file being opened and is transparently substituted in place of the standard read implementation. Reading any other kind of file defaults to the standard HP-UX read.

It is also important to note that the is_read path is specialized for the specific process performing the open. That is, we assume that the only process executing the is_read code will be the one that performed the open that generated it. The major advantage of this approach is that a private per-process per-file read call has well-defined access semantics: reads are sequential by default.

Specializations based on the quasi-invariant sequential_access can have huge performance gains. Consider a sequence of small (say 1 byte) reads by the same process to the same file. The first read performs the interpretation, traversal and locking necessary to locate the kernel virtual address of the data it needs to copy. At this stage it can specialize the next read to simply continue copying from the next virtual address, avoiding the need for any of the steps 2, 3, 4, 5, 7, 8, and 9, as shown in Figure 2. This specialization is predicated not only on the sequential_access and no_fp_share quasi-invariants, but also on other quasi-invariants such as the assumption that the next read won't cross a buffer boundary, and the buffer cache replacement code won't have changed the data that resides at that virtual memory address. The next section shows how these assumptions can be guarded.

The no_holes quasi-invariant is also related to the specializations described above. Contiguous sequential reading can be specialized down to contiguous byte-copying only for files that don't contain holes, since hole traversal requires the interpretation of empty block pointers in the inode.

The no_inode_share and no_fp_share quasi-invariants allow exclusive access to the file to be assumed. This assumption allows the specialized read code to avoid locking the inode and file table in steps 3, 8, and 9. They also allow the caching (in data structures associated with the specialized code) of information such as the file pointer. This caching is what allows all of the interpretation, traversal and locking in steps 2, 3, 4, 5, 8 and 9 to be avoided.

In our current implementation, all invariants are validated in open, when specialization happens. A specialized read routine is not generated unless all of the invariants hold.

[Figure 2: is_read Flow Graph. After system call startup, the specialized path asks "Same block?" (step 2a): if so it jumps directly to a short data transfer (step 6a); otherwise it performs steps 4, 5, and 6, the "Do another block?" test, the file offset update (step 9), and system call cleanup (step 10).]

(Quasi-)Invariant   Description                          Savings
fs_constant         Invariant file system parameters.    Avoids step 2.
no_fp_share         No file pointer sharing.             Avoids most of step 9 and allows caching of file offset in file descriptor.
no_holes            No holes in file.                    Avoids checking for empty block pointers in inode structure.
no_inode_share      No inode sharing.                    Avoids steps 3 and 8.
no_user_locks       No user-level locks.                 Avoids having to check for user-level locks.
read_only           No writers.                          Allows optimized end of file check.
sequential_access   Calls to is_read inherit file        For small reads, avoids steps 2, 3, 4, 5, 7, 8, 9.
                    offset from previous is_read calls.

Table 1: Invariants for Specialization
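The open-time validation just described (generate is_read only if every entry in Table 1 holds) might look like the following sketch. The struct openstate fields are invented stand-ins for the vnode and inode state that HP-UX actually examines; only the shape of the check is meant to be faithful.

```c
/* Illustrative open-time invariant check; all names are invented. */
#include <stdbool.h>

struct openstate {
    bool regular_file;    /* fs_constant: regular file              */
    bool local_fs;        /* fs_constant: local file system         */
    int  blocksize;       /* fs_constant: expect 8 KB blocks        */
    bool has_holes;       /* no_holes                               */
    int  inode_refs;      /* no_inode_share: no other opens         */
    bool write_access;    /* read_only: opened for read only        */
    bool user_locks;      /* no_user_locks                          */
};

/* A specialized is_read path is set up only if every invariant and
 * quasi-invariant holds; otherwise the file stays on generic read.  */
static bool can_specialize(const struct openstate *s) {
    return s->regular_file
        && s->local_fs
        && s->blocksize == 8192
        && !s->has_holes
        && s->inode_refs == 1
        && !s->write_access
        && !s->user_locks;
}
```

A single false entry sends the open down the generic path, which matches the paper's rule that specialization is all-or-nothing at open time.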
3.3 Guards

Since specializations based on quasi-invariants are optimistic, they must be adequately guarded. Guards detect the impending invalidation of a quasi-invariant and invoke the replugging routine (section 3.4) to unspecialize the read code. Table 2 lists the quasi-invariants used in our implementation and the HP-UX system calls that contain the associated guards.

Quasi-invariants such as read_only and no_holes can be guarded in open since they can only be violated if the same file is opened for writing. The other quasi-invariants can be invalidated during other system calls which either access the file using a file descriptor from within the same or a child process, or access it from other processes using system calls that name the file using a pathname. For example, no_fp_share will be invalidated if multiple file descriptors are allowed to share the same file pointer. This situation can arise if the file descriptor is duplicated locally using dup, if the entire file descriptor table is duplicated using fork, or if a file descriptor is passed through a Unix domain socket using sendmsg. Similarly, sequential_access will be violated if the process calls lseek or readv.

The guards in system calls that use file descriptors are relatively simple. The file descriptor parameter is used as an index into a per-process table; if a specialized file descriptor is already present then the quasi-invariant will become invalid, triggering the guard and invoking the replugger. For example, the guard in dup only responds when attempting to duplicate a file descriptor used by is_read. Similarly, fork checks all open file descriptors and triggers replugging of any specialized read code.

Guards in calls that take pathnames must detect collisions with specialized code by examining the file's inode. We use a special flag in the inode to detect whether a specialized code path is associated with a particular inode.(1)

The guards for the sequential_access quasi-invariant are somewhat unusual. is_read avoids step 5 (buffer cache lookup) by checking to see whether the read request will be satisfied by the buffer cache block used to satisfy the preceding read request. The kernel's buffer cache block replacement mechanism also guards this specialization to ensure that the preceding buffer cache block is still in memory.

With the exception of lseek, triggering any of the guards discussed above causes the read code to be replugged back to the general purpose implementation. lseek is the only instance of respecialization in our implementation; when triggered, it simply updates the file offset in the specialized read code.

To guarantee that all invariants and quasi-invariants hold, open checks that the vnode meets all the fs_constant and no_holes invariants and that the requested access is only for read. Then the inode is checked for sharing. If all invariants hold during open then the inode and file descriptor are marked as specialized and an is_read path is set up for use by the calling process on that file. Setting up the is_read path amounts to allocating a private per-file-descriptor data structure for use by the is_read code, which is sharable. The inode and file descriptor markings activate all of the guards atomically since the guard code is permanently present.

Quasi-Invariant     HP-UX system calls that may invalidate invariants
no_fp_share         creat, dup, dup2, fork, sendmsg
no_holes            open
no_inode_share      creat, fork, open, truncate
no_user_locks       lockf, fcntl
read_only           open
sequential_access   lseek, readv, is_read itself, and buffer cache block replacement

Table 2: Quasi-Invariants and Their Guards

(1) The in-memory version of HP-UX's inodes is substantially larger than the on-disk version. The specialized flag only needs to be added to the in-memory version.
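The two guard styles above (file-descriptor-indexed calls versus pathname calls that test the inode flag) can be sketched as follows. Everything here is hypothetical scaffolding: the per-process table, the replug() stand-in, and the guarded_* routines are invented for illustration; the real replugger is the algorithm of Section 3.4.

```c
/* Illustrative guard sketches; all names are invented. */
#include <stdbool.h>

#define NFD 16

static bool fd_specialized[NFD];   /* per-process table of is_read fds */
static int  replug_count;          /* counts replugger invocations     */

static void replug(int fd) {       /* stand-in for the real replugger  */
    fd_specialized[fd] = false;
    replug_count++;
}

/* Guard in dup-style calls: fires only for specialized descriptors. */
static int guarded_dup(int oldfd, int newfd) {
    if (fd_specialized[oldfd])     /* no_fp_share is about to break    */
        replug(oldfd);
    fd_specialized[newfd] = false; /* the duplicate is never special   */
    return newfd;
}

/* Guard in fork-style calls: scan every open descriptor. */
static void guarded_fork(void) {
    for (int fd = 0; fd < NFD; fd++)
        if (fd_specialized[fd])
            replug(fd);
}

/* Guard in pathname-style calls (e.g. truncate): test the flag kept
 * in the in-memory inode.                                            */
struct minode { bool specialized; };

static void guarded_truncate(struct minode *ip, int fd) {
    if (ip->specialized) {
        ip->specialized = false;
        replug(fd);
    }
}
```

Because the guard code is permanently present and only tests a flag, a non-triggered guard costs a few instructions, which is what makes the optimistic fast path worthwhile.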
3.4 The Replugging Algorithm

Replugging components of an actively running kernel is a non-trivial problem that requires a paper of its own, and is the topic of ongoing research. The problem is simplified here for two reasons. First, our main objective is to test the feasibility and benefits of specialization. Second, specialization has been applied to the replugging algorithm itself. For kernel calls, the replugging algorithm should be specialized, simple, and efficient.

The first problem to be handled during replugging is synchronization. If a replugger were executing in a single-threaded kernel with no system call blocking in the kernel, then no synchronization would be needed. Our environment is a multiprocessor, where kernel calls may be suspended. Therefore, the replugging algorithm must handle two sources of concurrency: (1) interactions between the replugger and the process whose code is being replugged and (2) interactions among other kernel threads that triggered a guard and invoked the replugging algorithm at the same time. Note that the algorithm presented here assumes that a coherent read from memory is faster than a concurrency lock: if specialized hardware makes locks fast, then the specialized synchronization mechanism presented here can be replaced with locks.

To simplify the replugging algorithm, we make two assumptions that are true in many Unix systems: (A1) kernel calls cannot abort,(2) so we do not have to check for an incomplete kernel call to is_read, and (A2) there is only one thread per process, so multiple kernel calls cannot concurrently access process level data structures.

The second problem that a replugging algorithm must solve is the handling of executing threads inside the code being replugged. We assume (A3) that there can be at most one thread executing inside specialized code. This is the most important case, since in all cases so far we have specialized for a single thread of control. This assumption is consistent with most current Unix environments. To separate the simple case (when no thread is executing inside code to be replugged) from the complicated case (when one thread is inside), we use an "inside-flag". The first instruction of the specialized read code sets the inside-flag to indicate that a thread is inside. The last instruction in the specialized read code clears the inside-flag.

To simplify the synchronization of threads during replugging, the replugging algorithm uses a queue, called the holding tank, to stop the thread that happens to invoke the specialized kernel call while replugging is taking place. Upon completion of replugging, the algorithm activates the thread waiting in the holding tank. The thread then resumes the invocation through the unspecialized code.

For simplicity, we describe the replugging algorithm as if there were only two cases: specialized and non-specialized. The paths take the following steps:

1. Check the file descriptor to see if this file is specialized. If not, branch out of the fast path.

2. Set inside-flag.

3. Branch indirect. This branch leads to either the holding tank or the read path. It is changed by the replugger.

Read Path:

1. Do the read work.

2. Clear inside-flag.

Holding Tank:

1. Clear inside-flag.

2. Sleep on the per-file lock to await replugger completion.

3. Jump to standard read path.

Replugging Algorithm:

1. Acquire per-process lock to block concurrent repluggers. It may be that some guard was triggered concurrently for the same file descriptor, in which case we are done.

2. Acquire per-file lock to block exit from holding tank.

3. Change the per-file indirect pointer to send readers to the holding tank (changes action of the reading thread at step 3 so no new threads can enter the specialized code).

4. Spinwait for the per-file inside-flag to be cleared. Now no threads are executing the specialized code.

5. Perform incremental specialization according to which invariant was invalidated.

6. Set file descriptor appropriately, including indicating that the file is no longer specialized.

7. Release per-file lock to unblock thread in holding tank.

8. Release per-process lock to allow other repluggers to continue.

The way the replugger synchronizes with the reader thread is through the inside-flag in combination with the indirection pointer. If the reader sets the inside-flag before a replugger sets the indirection pointer then the replugger waits for the reader to finish. If the reader takes the indirect call into the holding tank, it will clear the inside-flag, which will tell the replugger that no thread is executing the specialized code. Once the replugging is complete the algorithm unblocks any thread in the holding tank and they resume through the new unspecialized code.

In most cases of unspecialization, the general case, read, is used instead of the specialized is_read. In this case, the file descriptor is marked as unspecialized and the memory is_read occupies is marked for garbage collection at file close time.

(2) Take an unexpected path out of the kernel on failure.
(3) The use of specialization to optimize the device I/O path and make better use of the file system buffer cache is the subject of a separate study currently underway in our group.

4 Performance Results

The experimental environment for the benchmarks was a Hewlett-Packard 9000 series 800 G70 (9000/887) dual-processor server [1] running in single-user mode. This server is configured with 128 MB of RAM. The two PA7100 [16] processors run at 96 MHz and each contains one MB of instruction cache and one MB of data cache.

Section 4.1 presents an experiment to show how incremental specialization can reduce the overhead of the read system call. Sections 4.2 through 4.4 describe the overhead costs of specialization. Section 4.5 describes the basic cost/benefit analysis of when specialization is appropriate.

4.1 Specialization to Reduce read Overhead

The first microbenchmark is designed to illustrate the reduction in overhead costs associated with the read system call. Thus the experiment has been designed to reduce all other costs, to wit:

- all experiments were run with a warm file system buffer cache(3)
                   1 Byte     8 KB    64 KB
    read             3540     7039    39733
    is_read           979     5220    35043
    Savings          2561     1819     4690
    % Improvement     72%      26%      12%
    Speedup         x3.61    x1.35    x1.13
    copyout (4)       170     3400       NA

Table 3: Overhead Reduction: HP-UX read versus is_read costs in CPU cycles

[Figure 3: Overhead Reduction: 1 Byte HP-UX read versus is_read costs in CPU cycles]

[Figure 4: Overhead Reduction: 8 KB HP-UX read versus is_read costs in CPU cycles]

[Figure 5: Overhead Reduction: 64 KB HP-UX read versus is_read costs in Thousands of CPU cycles]
- the same block was repeatedly read to achieve all-hot CPU data caches

- the target of the read is selected to be a page-aligned user buffer to optimize copyout performance

The program consists of a tight loop that opens the file, gets a timestamp, reads N bytes, gets a timestamp, and closes the file. Timestamps are obtained by reading the PA-RISC's interval timer, a processor control register that is incremented every processor cycle [20].

Table 3 numerically compares the performance of HP-UX read with is_read for reads of one byte, 8 KB, and 64 KB, and Figures 3 through 5 graphically represent the same data. The table and figures also include the costs for copyout to provide a lower bound on read performance (5). In all cases, is_read performance is better than HP-UX read. For single byte reads, is_read is more than three times as fast as HP-UX read, reflecting the fact that specialization has removed most of the overhead of the read system call.

Reads that cross block boundaries lose the specialization of simply continuing to copy data from the previous access position in the current buffer cache block (the sequential quasi-invariant), and thus suffer a performance loss. 8 KB reads necessarily cross block boundaries, and so the 8 KB is_read has a smaller performance gain than the 1 Byte is_read.

For larger reads, the performance gain is not so large because the overall time is dominated by data copying costs rather than overhead. However, there are performance benefits from specialization that occur on a per-block basis, and so even 64 KB reads improve by about 12%.

(4) Large reads break down into multiple block-sized calls to copyout.
(5) copyout performs a safe copy from kernel to user space while correctly handling page faults and segmentation violations.

The following sections outline a series of microbenchmarks to measure the performance of the incrementally and optimistically specialized read fast path, as well as the overhead associated with guards and replugging.

4.2 The Cost of the Initial Specialization

The performance improvements in the read fast path come at the expense of overhead in other parts of the system. The most significant impact occurs in the open system call, which is the point at which the is_read path is generated. open has to check 8 invariant and quasi-invariant values, for a total of about 90 instructions and a lock/unlock pair. If specialization can occur, it needs to allocate some kernel memory and fill it in. close needs to check if the file descriptor is or was specialized and, if so, free the kernel memory. A kernel memory alloc takes 119 cycles and free takes 138 cycles.

The impact of this work is that the new specialized open call takes 5582 cycles, compared to 5227 cycles for the standard HP-UX open system call. In both cases, no inode traversal is involved. As expected, the cost of the new open call is higher than the original. However, notice that the increase in cost is small enough that a program that opens a file and reads it once can still benefit from specialization.

4.3 The Cost of Non-triggered Guards

The cost of guards can be broken down into two cases: the cost of executing them when they are not triggered, and the cost of triggering them and performing the necessary replugging. This sub-section is concerned with the first case.

Guards are associated with each of the system calls shown in Table 2. As noted elsewhere, there are two sorts of guards: one checks for specialized file descriptors and is very cheap; the other checks for specialized inodes. Since inodes can be shared, they must be locked to check them. The lock expense is only incurred if the file passes all the other tests first. A lock/unlock pair takes 145 cycles. A guard requires 2 temporary registers, 2 loads, an add, and a compare (11 cycles), and then a function call if it is triggered. It is important to note that these guards do not occur in the data transfer system calls, except for readv, which is not frequently used.

In the current implementation, guards are fixed in place (and always perform checks) but they are triggered only when specialized code exists. Alternatively, guards could be inserted in place when the associated specialized code is generated. Learning which alternative performs better requires further research on the costs and benefits of specialization mechanisms.

4.4 The Cost of Replugging

There are two costs associated with replugging. One is the overhead added to the fast path in is_read: checking whether the file is specialized and calling read if not, and otherwise writing the inside-flag bit twice and making the indirect function call with zero arguments. A timed microbenchmark shows this cost to be 35 cycles.

The second cost of replugging is incurred when the replugging algorithm is invoked. This cost depends on whether there is a thread already present in the code path to be replugged. If so, the elapsed time taken to replug can be dominated by the time taken by the thread to exit the specialized path. The worst case for the read call occurs when the thread present in the specialized path is blocked on I/O. We are working on a solution to this problem which would allow threads to "leave" the specialized code path when initiating I/O and rejoin a replugged path when I/O completes, but this solution is not yet implemented.

In the case where no thread is present in the code path to be replugged, the cost of replugging is determined by the cost of acquiring two locks (one a spinlock), checking one memory location, and storing to another (to get exclusive access to the specialized code). To fall back to the generic read takes 4 stores plus address generation, plus storing the specialized file offset into the system file table, which requires obtaining and releasing the File Table Lock. After incremental specialization, two locks have to be released. An inspection of the generated code shows the cost to be about 535 cycles, assuming no lock contention. The cost of the holding tank is not measured, since that is the rarest subcase and it would be dominated by spinning for a lock in any event.

Adding up the individual component costs, and multiplying them by their frequencies, we can estimate the guarding and replugging overhead attributed to each is_read. If we assume that 100 is_read calls happen for each of the guarded kernel calls (fork, creat, truncate, open, close and replugging), then less than 10 cycles are added as guarding overhead to each invocation of is_read.

4.5 Cost/Benefit Analysis

Specialization reduces the execution costs of the fast path, but it also requires additional mechanisms, such as guards and replugging algorithms, to maintain system correctness. By design, guards are located in low-frequency execution paths, and in the rare case of quasi-invariant invalidation, replugging is performed. We have also added code to open to check if specialization is possible, and to close to garbage collect the specialized code after replugging. An informal performance analysis of these costs and a comparison with the gains is shown in Figure 6.

In Equation 1, Overhead includes the cost of guards, the replugging algorithm, and the increase due to initial invariant validation, specialization, and garbage collecting for all file opens and closes. Each Guard_i (in a different kernel call) is invoked f_i times. Similarly, Replug is invoked f_Replug times. A small part of the cost of synchronization with the replugger is borne by is_read (the setting and resetting of inside-flag), but overall is_read is much faster than read (Section 4). In Equation 2, f_is is the number of times is_read is invoked and f_TotalRead is the total number of invocations to read the file. Specialization wins if the inequality in Equation 2 is true.

    Overhead = sum over guarded syscalls i of (f_i * Guard_i) + Open + Close + f_Replug * Replug    (1)

    Overhead + f_is * is_read < (f_TotalRead - f_is) * read    (2)

Figure 6: Cost/Benefit Analysis: When is it Beneficial to Specialize?

5 Discussion

The experimental results described in Section 4 show the performance of our current is_read implementation. At the time of writing, this implementation was not fully specialized: some invariants were not used and, as a result, the measured is_read path contains more interpretation and traversal code than is absolutely necessary. Therefore, the performance results presented above are conservative. Even so, the results show that optimistic specialization can improve the performance of both small and large reads.

At one end of the spectrum, assuming a warm buffer cache, the performance of small reads is dominated by control flow costs. Through specialization we are able to remove, from the fast path, a large amount of code concerned with interpretation, data structure traversal, and synchronization. Hence, it is not surprising that the cost of small reads is reduced significantly.

At the other end of the spectrum, again assuming a warm buffer cache, the performance of large reads is dominated by data movement costs. In effect, the control-flow overhead, still present in large reads, is amortized over a large amount of byte copying. Specializations to reduce the cost of byte copying will be the subject of a future study.

Specialization removes overhead from the fast path by adding overhead to other parts of the system: specifically, the places at which the specialization, replugging, and guarding of optimistic specializations occur. Our experience has shown that generating specialized implementations is easy. The real difficulty arises in correctly placing guards and making policy decisions about what and when to specialize and replug. Guards are difficult to place because an operating system kernel is a large program, and invariants often correspond to global state components which are manipulated at numerous sites. Therefore, manually tracking the places where invariants are invalidated is a tedious process. However, our experience with the HP-UX file system has brought us an understanding of this process which we are now using to design a program analysis capable of determining a conservative approximation of the places where guards should be placed.

Similarly, the choice of what to specialize, when to specialize, and whether to specialize optimistically are all non-trivial policy decisions. In our current implementation we made these decisions in an ad hoc manner, based on our expert knowledge of the system implementation, semantics, and common usage patterns. This experiment has prompted us to explore tools based on static and dynamic analyses of kernel code, aimed at helping the programmer decide when the performance improvement gained by specialization is likely to exceed the cost of the specialization process, guarding, and replugging.
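Such a tool would, in essence, evaluate Figure 6's inequality with measured costs and observed frequencies. The sketch below plugs the cycle counts from Section 4 into Equations 1 and 2; it is illustrative only. The call frequencies are invented, Open is taken to be the 355-cycle increase of the specialized open (5582 - 5227), and Close is approximated by the 138-cycle kernel memory free; the paper does not spell out these mappings.

```c
#include <assert.h>

/* A worked instance of Figure 6, using cycle counts measured in Section 4.
   The mapping of measurements onto Open/Close, and all call frequencies,
   are assumptions made for this example. */

#define GUARD      11L   /* non-triggered guard, Section 4.3 */
#define OPEN      355L   /* assumed: 5582 - 5227, added cost of open (4.2) */
#define CLOSE     138L   /* assumed: kernel memory free on close (4.2) */
#define REPLUG    535L   /* replug with no thread in the path (4.4) */
#define IS_READ   979L   /* 1-byte is_read (Table 3) */
#define READ     3540L   /* 1-byte HP-UX read (Table 3) */

/* Equation 1, with a single aggregate guard frequency for brevity */
long overhead(long f_guard, long f_replug) {
    return f_guard * GUARD + OPEN + CLOSE + f_replug * REPLUG;
}

/* Equation 2: specialization wins when the inequality holds */
int specialization_wins(long f_is, long f_total, long ovh) {
    return ovh + f_is * IS_READ < (f_total - f_is) * READ;
}
```

Whether the inequality holds for a given file depends on how often the specialized path is actually exercised relative to the guarded calls and replugs, which is exactly the information the proposed static and dynamic analyses would have to estimate.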
5.1 Interface Design and Kernel Structure

From early in the project, our intuition told us that, in the most specialized case, it should be possible to reduce the cost of a read system call that hits in the buffer cache to little more than the basic cost of data movement from the kernel to the application's address space, i.e., the cost of copying the bytes from the buffer cache to the user's buffer.

In practice, however, our specialized read implementation costs considerably more than copying one byte. The cost of our specialized read implementation is 979 cycles, compared to approximately 235 cycles for entering the kernel, fielding the minimum number of parameters, and safely copying a single byte out to the application's address space.

Upon closer examination, we discovered that the remaining 744 cycles were due to a long list of inexpensive actions including stack switching, masking floating point exceptions, recording kernel statistics, supporting ptrace, initializing kernel interrupt handling, etc. The sum of these actions dominates the cost of small byte reads.

These actions are due in part to constraints that were placed upon our design by an over-specification of the UNIX read implementation. For example, the need to always support statistics-gathering facilities such as times requires every read call to record the time it spends in the kernel. For one byte reads, changing the interface to one that returns a single byte in a register instead of copying data to the user's address space, and changing the semantics of the system call to not support statistics and profiling, eliminates the need for many of these actions.

To push the limits of a kernel-based read implementation, we implemented a special one-byte read system call, called readc, which returns a single byte in a register, just like the getc library call. In addition to the optimizations used in our specialized is_read call, readc avoids switching stacks, omits ptrace support, and skips updating profile information. The performance of the resulting readc implementation is 65 cycles. Notice that aggressive use of specialization can lead to a readc system call that performs within a factor of two of a pure user-level getc, which costs 38 cycles in HP-UX's stdio library. This result is encouraging because it shows the feasibility of implementing operating system functionality at kernel level with performance similar to user-level libraries. Aggressive specialization may render unnecessary the popular trend of duplicating operating system functionality at user level [2, 19] for performance reasons.

Another commonly cited reason for moving operating system functionality to user level is to give applications more control over policy decisions and operating system implementations. We believe that these benefits can also be gained without duplicating operating system functionality at user level. Following an open-implementation (OI) philosophy [22], operating system functionality can remain in the kernel, with customization of the implementation supported in a controlled manner via meta-interface calls [23].

A strong lesson from our work and from other work in the OI community [22] is that abstractly specified interfaces, i.e., those that do not constrain implementation choices unnecessarily, are the key to gaining the most benefit from techniques such as specialization.

6 Related Work

Our work on optimistic incremental specialization can be viewed as part of a widespread research trend towards adaptive operating systems. Micro-kernel operating systems [5, 6, 11, 9, 21, 27, 29] were an early example of this trend, and improve adaptiveness by allowing operating system functionality to be implemented in user-level servers that can be customized and configured to produce special-purpose operating systems. While micro-kernel-based architectures improve adaptiveness over monolithic kernels, support for user-level servers incurs a high performance penalty.

To address this performance problem, Chorus [29] allows modules, known as supervisor actors, to be loaded into the kernel address space. A specialized IPC mechanism is used for communication between actors within the kernel address space. Similarly, Flex [8] allows dynamic loading of operating system modules into the Mach kernel, and uses a migrating threads model to reduce IPC overhead.

One problem with allowing applications to load modules into the kernel is loss of protection. The SPIN kernel [4] allows applications to load executable modules, called spindles, dynamically into the kernel. These spindles are written in a type-safe programming language to ensure that they do not adversely affect kernel operations.

Object-oriented operating systems allow customization through the use of inheritance, invocation redirection, and meta-interfaces. Choices [7] provides generalized components, called frameworks, which can be replaced with specialized versions using inheritance and dynamic linking. The Spring kernel uses an extensible RPC framework [18] to redirect object invocations to appropriate handlers based on the type of object. The Substrate Object Model [3] supports extensibility in the AIX kernel by providing additional interfaces for passing usage hints and customizing in-kernel implementations. Similarly, the Apertos operating system [31] supports dynamic reconfiguration by modifying an object's behavior through operations on its meta-interface.

Other systems, such as the Cache kernel [10] and the Exokernel [17], address the performance problem by moving even more functionality out of the operating system kernel and placing it closer to the application. In this "minimal-kernel" approach, extensibility is the norm rather than the exception.

Synthetix differs from the other extensible operating systems described above in a number of ways. First, Synthetix infers the specializations needed even for applications that have never considered the need for specialization. Other extensible systems require applications to know which specializations will be beneficial and then select or provide them.

Second, Synthetix supports optimistic specializations and uses guards to ensure the validity of a specialization and to automatically replug it when it is no longer valid. In contrast, other extensible systems do not support automatic replugging and support damage control only through hardware or software protection boundaries.

Third, the explicit use of invariants and guards in Synthetix also supports the composability of specializations: guards determine whether two specializations are composable. Other extensible operating systems do not provide support to determine whether separate extensions are composable.

Like Synthetix, Scout [26] has focused on the specialization of existing systems code. Scout has concentrated on networking code and has focused on specializations that minimize code and data caching effects. In contrast, we have focused on parametric specialization to reduce the length of various fast paths in the kernel. We believe that many of the techniques used in Scout are also useful in Synthetix, and vice versa.

7 Conclusions

This paper has introduced a model for specialization based on invariants, guards, and replugging. We have shown how this model supports the concepts of incremental and optimistic specialization. These concepts refine previous work on kernel optimization using dynamic code generation in Synthesis [28, 25]. We have demonstrated the feasibility and usefulness of incremental, optimistic specialization by applying it to file system code in a commercial operating system (HP-UX). The experimental results show that significant performance improvements are possible even when the base system is (a) not designed specifically to be amenable to specialization, and (b) already highly optimized. Furthermore, these improvements can be achieved without altering the semantics or restructuring the program.

Our experience with performing incremental and optimistic specialization manually has shown that the approach is worth pursuing. Based on this experience, we are developing a program specializer for C and a collection of tools for assisting the programmer in identifying opportunities for specialization and ensuring consistency.

8 Acknowledgements

Bill Trost of Oregon Graduate Institute and Takaichi Yoshida of Kyushu Institute of Technology performed the initial study of the HP-UX read and write system calls, identifying many quasi-invariants to use for specialization. Bill also implemented some prototype specialized kernel calls that showed promise for performance improvements. Luke Hornof of the University of Rennes contributed to discussions on specialization and tools development. Finally, we would like to thank the SOSP referees for their extensive and thorough comments and suggestions.

References

[1] Thomas B. Alexander, Kenneth G. Robertson, Dean T. Lindsey, Donald L. Rogers, John R. Obermeyer, John R. Keller, Keith Y. Oka, and Marlin M. Jones II. Corporate Business Servers: An Alternative to Mainframes for Business Computing. Hewlett-Packard Journal, 45(3):8-30, June 1994.

[2] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Transactions on Computer Systems, 10(1):53-79, February 1992.

[3] Arindam Banerji and David L. Cohn. An Infrastructure for Application-Specific Customization. In Proceedings of the ACM European SIGOPS Workshop, September 1994.

[4] Brian N. Bershad, Stefan Savage, Przemyslaw Pardyak, Emin Gun Sirer, Marc Fiuczynski, David Becker, Susan Eggers, and Craig Chambers. Extensibility, Safety and Performance in the SPIN Operating System. In Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 1995.

[5] D. L. Black, D. B. Golub, D. P. Julin, R. F. Rashid, R. P. Draves, R. W. Dean, A. Forin, J. Barrera, H. Tokuda, G. Malan, and D. Bohman. Microkernel operating system architecture and Mach. In Proceedings of the Workshop on Micro-Kernels and Other Kernel Architectures, pages 11-30, Seattle, April 1992.

[6] F. J. Burkowski, C. L. A. Clarke, Crispin Cowan, and G. J. Vreugdenhil. Architectural Support for Lightweight Tasking in the Sylvan Multiprocessor System. In Symposium on Experience with Distributed and Multiprocessor Systems (SEDMS II), pages 165-184, Atlanta, Georgia, March 1991.

[7] Roy H. Campbell, Nayeem Islam, and Peter Madany. Choices: Frameworks and Refinement. Computing Systems, 5(3):217-257, 1992.

[8] John B. Carter, Bryan Ford, Mike Hibler, Ravindra Kuramkote, Jeffrey Law, Jay Lepreau, Douglas B. Orr, Leigh Stoller, and Mark Swanson. FLEX: A Tool for Building Efficient and Flexible Systems. In Proceedings of the Fourth Workshop on Workstation Operating Systems, pages 198-202, Napa, CA, October 1993.

[9] David R. Cheriton. The V Distributed System. Communications of the ACM, 31(3):314-333, March 1988.

[10] David R. Cheriton and Kenneth J. Duda. A Caching Model of Operating System Kernel Functionality. In Symposium on Operating Systems Design and Implementation (OSDI), pages 179-193, November 1994.

[11] David R. Cheriton, M. A. Malcolm, L. S. Melen, and G. R. Sager. Thoth, A Portable Real-Time Operating System. Communications of the ACM, 22(2):105-115, February 1979.

[12] Frederick W. Clegg, Gary Shiu-Fan Ho, Steven R. Kusmer, and John R. Sontag. The HP-UX Operating System on HP Precision Architecture Computers. Hewlett-Packard Journal, 37(12):4-22, December 1986.

[13] C. Consel and O. Danvy. Tutorial notes on partial evaluation. In ACM Symposium on Principles of Programming Languages, pages 493-501, 1993.

[14] C. Consel and F. Noel. A general approach to run-time specialization and its application to C. Report 946, Inria/Irisa, Rennes, France, July 1995.

[15] C. Consel, C. Pu, and J. Walpole. Incremental specialization: The key to high performance, modularity and portability in operating systems. In Proceedings of the ACM Symposium on Partial Evaluation and Semantics-Based Program Manipulation, Copenhagen, June 1993.

[16] Eric DeLano, Will Walker, and Mark Forsyth. A High Speed Superscalar PA-RISC Processor. In COMPCON 92, pages 116-121, San Francisco, CA, February 24-28, 1992.

[17] Dawson R. Engler, M. Frans Kaashoek, and James O'Toole Jr. Exokernel: An Operating System Architecture for Application-Level Resource Management. In Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 1995.

[18] Graham Hamilton, Michael L. Powell, and James G. Mitchell. Subcontract: A flexible base for distributed programming. In Proceedings of the Fourteenth ACM Symposium on Operating System Principles, pages 69-79, Asheville, NC, December 1993.

[19] Kieran Harty and David R. Cheriton. Application-controlled physical memory using external page-cache management. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), pages 187-197, Boston, MA, October 1992.

[20] Hewlett-Packard. PA-RISC 1.1 Architecture and Instruction Set Reference Manual, second edition, September 1992.

[21] Dan Hildebrand. An Architectural Overview of QNX. In Proceedings of the USENIX Workshop on Micro-kernels and Other Kernel Architectures, pages 113-123, Seattle, WA, April 1992.

[22] Gregor Kiczales. Towards a new model of abstraction in software engineering. In Proc. of the IMSA'92 Workshop on Reflection and Meta-level Architectures, 1992. See http://www.xerox.com/PARC/spl/eca/oi.html for updates.

[23] Gregor Kiczales, Jim des Rivieres, and Daniel G. Bobrow. The Art of the Metaobject Protocol. MIT Press, 1991.

[24] Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman. 4.3BSD UNIX Operating System. Addison-Wesley Publishing Company, Reading, MA, 1989.

[25] H. Massalin and C. Pu. Threads and input/output in the Synthesis kernel. In Proceedings of the Twelfth Symposium on Operating Systems Principles, pages 191-201, Arizona, December 1989.

[26] David Mosberger, Larry L. Peterson, and Sean O'Malley. Protocol Latency: MIPS and Reality. Report TR 95-02, Department of Computer Science, University of Arizona, Tucson, Arizona, April 1995.

[27] S. J. Mullender, G. van Rossum, A. S. Tanenbaum, R. van Renesse, and H. van Staveren. Amoeba: A Distributed Operating System for the 1990s. IEEE Computer, 23(5), May 1990.

[28] C. Pu, H. Massalin, and J. Ioannidis. The Synthesis kernel. Computing Systems, 1(1):11-32, Winter 1988.

[29] M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrman, C. Kaiser, S. Langlois, P. Leonard, and W. Neuhauser. Overview of the Chorus distributed operating system. In Proceedings of the Workshop on Micro-Kernels and Other Kernel Architectures, pages 39-69, Seattle, April 1992.

[30] P. Sestoft and A. V. Zamulin. Annotated bibliography on partial evaluation and mixed computation. In D. Bjørner, A. P. Ershov, and N. D. Jones, editors, Partial Evaluation and Mixed Computation. North-Holland, 1988.

[31] Yasuhiko Yokote. The Apertos Reflective Operating System: The Concept and Its Implementation. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pages 414-434, Vancouver, BC, October 1992.