VINO: the 1994 Fall Harvest

VINO: The 1994 Fall Harvest

The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters

Citation Endo, Yasuhiro, James Gertzman, Margo Seltzer, Christopher Small, Keith A. Smith, and Diane Tang. 1994. VINO: The 1994 Fall Harvest. Harvard Computer Science Group Technical Report TR-34-94.

Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:26506454

Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA

VINO The Fall Harvest

Yasuhiro Endo

James Gwertzman

Margo Seltzer

Christopher Small

Keith A Smith

and

Diane Tang

December

Center for Research in Computing Technology

Harvard University

Cambridge Massachusetts

VINO The Fall Harvest

Yasuhiro Endo

James Gwertzman

Margo Seltzer

Christopher Small

Keith A Smith

Diane Tang

An Introduction to the Architecture of the VINO Kernel

Yasuhiro Endo Margo Seltzer Christopher Smal l and Keith Smith

The Case for Geographical PushCaching

James Gwertzman Margo Seltzer

Structuring the Kernel as a Collection of Reusable Comp onents

Christopher Smal l

Lies Damned Lies and File System Benchmarks

Diane Tang Margo Seltzer

The Case for InKernel Tracing

Yasuhiro Endo Christopher Smal l

Your Op erating System is a Database

Keith A Smith Margo Seltzer

An Introduction to the Architecture of the VINO Kernel

Margo Seltzer Yasuhiro Endo Christopher Small and Keith A Smith

Harvard University

fmargoyazchriskeithgdasharvardedu

Abstract Inappropriate p olicy is only one reason that applica

tions reimplement kernel functionality Kernel function

Current op erating systems are designed to provide least

ality is often unavailable to applications For example

commondenominator service to a variety of applications

the kernel uses ecient synchronization primitives based

They exp ort few internal kernel facilities and those which

on fast test and set instructions which are not available

are exp orted have irregular interfaces As a result re

to applications ATT Mo dern le systems use logging

source intensive applications such as database manage

to provide improved p erformance and fast recovery but

ment systems and multimedia applications are often

these logging mechanisms are not available for application

p o orly served by the op erating system These applica

use CHANG CHUT KAZAR VXFS

tions often go to great lengths to bypass normal kernel

Todays applications are unable to realize the p otential

mechanisms to achieve acceptable p erformance

of to days hardware OUST Database management

We describ e a new kernel architecture the VINO ker

systems are the classic example of comp etition b etween

nel which addresses the limitations of conventional op er

applications and the op erating systems STON How

ating systems The VINO design is driven by three prin

ever they are only one example realtime systems high

ciples

sp eed networking applications distributed applications

and embedded systems all face similar problems We call

Application Directed Policy the op erating system

these applications resource intensive as they place heavy

provides a collection of mechanisms but applications

demands on the allo cation of resources by virtue of the

dictate the p olicies applied to those mechanisms

size of the resources eg images and video or the tim

ing requirements of the resources eg audiovideo syn

Kernel as Toolbox applications can reuse the kernels

chronization and quality of services guarantees Todays

primitives

systems address these problems with piecemeal solutions

The VINO kernel fo cuses on three key ideas

Universal Resource Access all resources are accessed

through a single common interface

Applications direct policy the kernel controls al loca

tion of resources but leaves the management of those

VINOs p ower and exibility make it an ideal platform

resources to applications For example the kernel is

for research in op erating systems and resource intensive

resp onsible for determining the allo cation of physical

applications

page frames to pro cesses but each pro cess then de

termines which virtual memory pages will b e mapp ed

into physical memory

Introduction

Kernel as a set of reusable tools rather than hide

Conventional op erating systems provide a xed interface

kernel mechanisms from applications they are ex

with a predened set of p olicies implementing that inter

p orted for application reuse Many lesystems use a

face The p olicies provided are the least common denom

databasestyle logging of metadata op erations to im

inator of those required by applications When the p olicy

prove p erformance and simplify recovery many ap

provided by the op erating system is inappropriate for a

plications do as well Normally an application needs

particular application there are two alternatives leave

to reimplement the logging facility in user space In

the application to suer with an inappropriate p olicy or

VINO this facility is available for application use

reimplement the kernel mechanism in user space The

Al l resources share a common extensible interface rst approach leads to degraded p erformance the second

to simplify reuse of kernel services we supp ort a sim leads to redundant implementations and comp etition for

ple hierarchical type mo del for resources any facility resources b etween the op erating system and the applica

that works with a general resource will work with a tion

These resourcesp ecic implementations are mo dules more sp ecic one A lo cking subsystem written to

that are dynamically installed in the kernel We install work with the generic resource type will work with

this co de in the kernel b ecause the cost of frequent cross any resource b e it a le VM page or serial p ort

domain calls is to o high esp ecially on p erformancecritical

The VINO architecture consists of an inner kernel and

paths such as when p olicy decisions are made BERS

a set of application resources The inner kernel can not b e

Unlike other extensible systems we have not under

mo died by application co de but a pro cess can override

taken the task of dening a new typesafe language

the b ehavior of its application resources The inner ker

BERS ENGL LISK It is outside the scop e of

nel controls al location and security decisions it mediates

our pro ject to sp ecify implementand supp ort a new lan

requests for shared resources and ensures that a pro cess

guage and widespread acceptance of new languages in

do es not illegally gain access to resources

the community irresp ective of their elegance and p ower

is very low Extensions to VINO are written in C or C

Techniques to ensure the safety of ob ject co de are well

Resource Types

known Each mo dule is assigned a range of memory for its

co de and data segments Instructions are inserted into the

Each VINO resource is describ ed by its resource type

co de to p erform a baseandb ounds check on each memory

The resource type denition includes operations function

reference This type of check detects faults in co de an

members message handlers and properties data mem

alternative technique sandboxing WAHBE masks and

b ers slots A resource type includes a default implemen

prevents faults with an overhead lower than that of base

tation for each op eration a piece of co de that is run when

andb ounds checking

the op eration is invoked

Our plan is to a trusted compiler that generates co de

The standard set of resource types include resources

with either b ounds checking or sandb oxing to ensure co de

such as les directories threads transactions physical

safety Co de generated by our compiler will b e marked

memory pages virtual memory pages and queues

with a ngerprint RABIN a type of digital signature

Resource types are arranged in a hierarchy A subtype

A ngerprint is computationally infeasible to forge it en

inherits the interface of its sup ertype and can reuse or

sures with a very high degree of certainty that all co de

override its sup ertypes implementation The subtype can

installed in the kernel comes from our trusted compiler

add new op erations and prop erties extending its sup er

Our compiler based on compiler backend work under

types interface It can not remove prop erties or op era

way at Harvard ensures that co de grafted onto the op

tions from the inherited interface For example a Quick

erating system do es not read or write outside its b ounds

TimeFile resource type is dened in terms of the VideoFile

includes no instructions that mask interrupts and do es

resource type which is dened in terms of the File re

not mo dify itself

source type It has the same basic interface but includes

Even with these assurances userinstalled co de may not

dierent implementations for the enco de and deco de op

terminate in a timely fashion The VINO kernel supp orts

erations

limited multithreading and grafted co de can b e desched

A new resource type is added to VINO by compiling

uled by timing out The grafted co de may b e illb ehaved

it into the kernel If the new resource type has the same

and never return to the application but only the appli

interface as its sup ertype or the new interface do es not

cation itself suers no other pro cess is prevented from

need to b e called from existing kernel co de the new type

making progress

can b e added to the kernel dynamically as new device

We must also guard against grafted co de obtaining a

drivers can b e linked into Unix onthey If the new

critical system lo ck and not releasing it in a reasonable

interface needs to b e called from existing kernel co de the

amount of time To handle this we attach a timeout

kernel will need to b e relinked

to critical lo cks and kill a pro cess that do es not release

the lo ck b efore the timeout Each piece of grafted co de

Grafting

runs in the context of a lightweight transaction that keeps

track of its allo cated resources If the pro cess is ab orted

Application control of p olicy is accomplished by overrid

the corresp onding transaction is ab orted and the system

ing the default implementation for an op eration on a re

is returned to a consistent state

source In VINO this is called grafting a new implementa

Unlike the external servers of Mach ACET grafting

tion into the kernel For example the PageTable resource

allows small incremental changes in kernel functionality

allo cated to a pro cess uses an LRU algorithm for its evic

If the page eviction strategy of the system is inappropri

tion strategy If an application wants to use a dierent

ate it can b e replaced without writing a new external

algorithm it grafts its own implementation for the Evict

pager MCNAM

op eration onto the PageTable resource for the pro cess

Unix is a trademark of XOp en

Resource Managers and Names buered display needs to b e scheduled thirtytwo times

a second and have physical memory large enough to hold

A name service maps a name to a resource manager

two copies of the displayed image A query pro cessor can

storageid pair The resource manager can then b e asked

tune its join algorithm to the amount of physical memory

to map the storageid to a le resource A le resource

available for its use if it can assume that the memory

implements the exp ected read write and seek interface

once allo cated will not b e taken away Such applica

Because the name service is decoupled from the storage

tions can make hard resource requests where no less than

system we can put les next to each other in the names

the minimum requested resources will b e allo cated and

pace that are stored in dierent places For example in

once allo cated they will not b e preempted If a new hard

homechris you nd the entries time vinoarchtex

request will exceed the physical resources of the system

and todo The rst is b e handled by a time resource

VINO will not grant the request A hard request can b e

manager when read it resp onds with the current time

thought of as applicationsp ecied entrance criteria if the

The second is a regular le stored in a lo cal lesys

resource can not b e allo cated the application can choose

tem The third when read sends a query to the calendar

to not pro ceed In order to ensure fairness of allo cation

database and return the contents of the readers todo

an application must b e privileged in order to make hard

list

requests Applications with less stringent requirements

This facility is similar to one oered by Plan

make soft requests sp ecifying a preferred minimum re

PRES although b ecause VINO separates name man

source allo cation If the sum of the soft requests exceeds

agement from storage management it gains the exibility

the system resources VINO will arbitrate b etween the

of allowing services to b e lo cated in the namespace where

requesters sharing the resources available

it makes most sense to the user We also stray from the

idea of using the lesystem namespace as the single uni

fying abstraction not all resources can b e easily mo deled

Kernel Tools

as les or need to b e present in the le namespace

Op erating systems are built around synchronization

A resource manager is an instance of the resource type

transactions recovery and resource sharing The co de

manager A manager provides op erations to create and

that implements this functionality is rarely exp orted to

delete entries and control access to its stored data It also

user applications The VINO design is based on the idea

implements the management of the underlying storage

that kernel to ols should b e exp orted to and used by ap

read and write op erations for its les Subtypes of man

plications

ager include one implementing an FFSstyle MCKU

le system a journaling le system an NFS le system

and a memorybased le system Dening another sub

Synchronization Primitives

type of manager eg one that handles FTP requests is

straightforward

The op erating systems synchronization primitives are

Lo cal disk storage is controlled by a volume manager

typically much simpler and more ecient than those

which owns the physical disk other storage managers re

provided to userlevel applications For example the

quest cylinders and tracks from it By using a volume

semaphores oered to applications by System V incur a

manager we are able to dynamically partition the amount

large number of system calls and context switches while

of space allo cated to dierent managers

simple spinlo cks are virtually free SELT VINO pro

Stackable or layered le systems are implemented by

vides a kernel lo ck manager accessible for application use

building on top of an existing resource manager If an

In its simplest form the lo ck manager provides spin

encrypted le system is needed a new manager is created

lo ck synchronization on memory lo cations requiring ker

with read and write op erations that encrypt and decrypt

nel intervention only in the case of a contested lo ck This

data and then delegate storage to an already existing

interface is available b oth to the kernel and to applica

resource manager

tions

As the resources b eing lo cked b ecome more complex

so do es the lo cking paradigm The VINO lo ck manager

supp orts generalpurpose hierarchical locking GRAY

Fairness

For example the le system typically requires lo cking on

One of the primary jobs of an op erating system is to ar blo ck le directory and le system levels In most ker

bitrate and abstract resource access Some devices such nels this hierarchy is enforced by convention In VINO

as physical memory are shared and preemptable others it is enforced by design

such as disk space or a serial p ort are not We call the levels at which lo cking may b e needed the

Some applications require service guarantees eg an containment hierarchy When a lo ck is requested from

application displaying realtime video using a double the lo ck manager the manager examines the resources

series of unordered writes encapsulated in a transaction containment hierarchy to determine if the lo ck may b e

The transaction interface supp orts the standard granted Applications using the lo ck manager can dene

transactionbegin transactioncommit and transaction alternate containment hierarchies and make them avail

abort op erations It accepts references to appropriate log able to the lo ck manager eg a DBMS might create a

and lo ck manager instances to use for each transaction logical containment hierarchy of database relation tuple

At transaction b egin a new transaction resource is cre and eld Additionally applications can create new in

ated This resource references the appropriate log and stances of a lo ck manager that enforces alternate lo cking

lo ck managers and is referenced by each protected up proto cols eg alternate deadlo ck handling blo cking vs

date Most kernel transactions are protected using a sim nonblo cking or new lo cking mo des

ple shadowresource scheme with a log residing in main Finally by integrating the kernel and user level lo cking

memory either volatile or nonvolatile dep ending on the systems concurrency can b e increased For example a

resources b eing protected The mixing and matching of DBMS running on a conventional Unix le system may

logging and lo cking comp onents enables VINO to supp ort implement its own lo ck manager and issue multiple IO

arbitrarily complex transaction proto cols requests to the same le Unfortunately the Unix le sys

Because the implementation of the transaction man tem exclusively lo cks the le during each IO op eration so

ager can b e incrementally mo died dierent transaction that no concurrency is achieved even though the DBMS is

semantics eg as outlined in BILI can b e imple already ensuring the integrity of the op eration In VINO

mented as needed by applications since the same lo ck manager handles b oth DBMS requests

and IO requests lo cks held by the DBMS are strong

enough to p erform IO and no additional lo cking is re

Memory Management

quired by the le system

The VINO memory management system is based on the

ideas of the Mach VM architecture although its imple

Log Management

mentation diers considerably

A MemoryResource is a collection of pages As in Mach

VINO provides a simple log management facility that is

it is backed by a le mapp ed into memory It includes

used by the le system and the transaction system and

op erations to read pages from and write pages to that

accessible to applications as well

backing store When a page fault takes place VINO de

A log resides on one or more physical devices It can b e

termines which MemoryResource if any is assigned to created on a single device or extended onto a second de

the virtual memory page containing the faulting address

vice not necessarily of the same type as the rst For ex

A request is made of the MemoryResource with the ad

ample a DBMS might request a log that spans b oth mag

dress at which to write the requested page

netic disk and archive media eg tap e or optical disk

Unlike a Mach pager the MemoryResource is entirely

The kernel requests a volatile inmemory log to supp ort

in the kernel when a page fault o ccurs which causes a transactions on ephemeral data such as pro cess structures

trap into the op erating system it is not necessary to and buer cache metadata The le system might request

go back across the protection b oundary from the kernel

a log that spans nonvolatile RAM NVRAM and disk

to the user level Also the Mach architecture requires

le system log records would b e written rst to NVRAM

that a new external pager b e written for each kind of

and later written to disk in large ecient transfers

b ehavior needed VINO allows each MemoryResource to The key interface to the log facility is the readwrite in

use as much or as little of the standard implementation terface which provides the essential information for write

as is appropriate the application need only override or

log function that returns a ahead logging ie a write

augment the op erations it wants to change

log sequence number and a read log function that re

The AddressMapResource is patterned after the Mach

turns records in log sequence number It also supp orts

ob ject of the same Each address space has an asso ciated

a synch wal op eration for writeahead log synchronization

AddressMapResource which contains a mapping b etween and a checkpoint op eration for log reclamation and archiv

physical pages and virtual memory pages When VINO

ing As logs can expand and contract the partitioning b e

determines that a mapping needs to b e removed either

tween log resources and other data resources is not static

b ecause of a virtual memory fault or b ecause the number

but can change as system demands uctuate

of physical pages assigned to the address space is b eing de

creased it invokes the Cho oseVictim op eration dened

Transaction Management

on AddressMapResource By default Cho oseVictim se

lects the least recently used page although an application The VINO kernel uses transactions to maintain consis

can replace the implementation of Cho oseVictim with the tency during up dates to multiple related resources eg

algorithm of its choice a directory and its contents For example the carefully

Note that as in Caos work CAO VINO retains ordered writes of FFS can b e reimplemented as a simpler

control over the number of mappings allo cated to an ad It is our goal that applications that do not take advan

dress space but not the mappings themselves The former tage of VINOs extensibility should run at roughly the

b ehavior is not under the control of an application in or same sp eed as on a xBSD system

der to ensure fairness of allo cation the management of

the mappings is delegated to each application

Conclusion

The VINO architecture is a simple and regular and meets

Related Work

our goals of application direction of kernel p olicy reusable

kernel to ols and a common interface to all resources

Many systems have addressed the need for exibility

It is not necessary to dene a new language in order

Mach ACET allowed the addition of external servers

to safely extend kernel b ehavior conventional languages

factoring the kernel into replaceable servers Chorus

can b e used when combined with a trusted compiler and

ROZI worked to overcome the p erformance problems

software protection techniques

of external servers by allowing them to b e developed out

We b elieve that by concentrating on the key ideas of

side the kernel and then moved into the kernel as a build

extensibility and reusability we will b e able to accomplish

option

our goals with a minimal level of distraction

Newer systems such as Aegis ENGL and SPIN

BERS address the granularity problems of the original

microkernel architecture by allowing small incremental

References

changes to b e made by loading user co de into the server

They have also addressed the issue of safety through the

ATT ATT System V Interface Denition Third

use of compilation techniques

Edition Volumes

Ob jectoriented to olkits are comp osed of a set of

reusable comp onents eg NextStep NEXT that can

ACET Acetta M Baron R Bolosky W Golub

b e combined and sp ecialized as needed

D Rashid R Tevanian A and Young M

Work has b een done to address p olicy control on a

Mach A New Kernel Foundation for UNIX Devel

topicbytopic basis Scheduler Activations ANDE are

opment Pro ceedings of the Summer Usenix Con

a metho d for sharing scheduling p olicy b etween kernel

ference July

and user Caos work on applicationcontrolled le caching

ANDE Anderson T Bershad B Lazowska E

CAO addresses buer cache management The Berke

Levy H Scheduler Activations Eective Kernel

ley Fast Filesystem MCKU allows le layout to b e

Supp ort for the UserLevel Management of Paral

controlled by the setting of the rotdelay maxcontig and

lelism Pro ceedings of the Thirteenth ACM Sym

maxbpg parameters

p osium on Op erating System Principles Monterey

System V Release provides for multiple classes

CA Octob er

of scheduling algorithms corresp onding to timesharing

scheduling realtime scheduling and kernel pro cess

BERS Bershad B Anderson T Lazowska E

scheduling It is not p ossible to add new p olicies with

Levy H Lightweight Remote Pro cedure Call

out completely reconguring and relinking the op erating

Proceedings of the Twelfth ACM Symposium on Op

system and then only if the desired scheduling algorithm

erating System Principles

ts SVRs limited mo del of how a scheduler should b e

have OLEA

BERS Bershad D Chambers C Eggers S Maeda

C McNamee D Pardyak P Savage S Gun

Sirer E SPIN An Extensible Microkernel for

Status

Applicationsp ecic Op erating System Services

Technical Rep ort Department of Com

We are targeting the x and HPPA architectures as our

puter Science and Engineering University of Wash

initial platforms The architectural outline is complete

ington Seattle

and we have b egun prototyping the resource types graft

ing technology and compilation to ols at the user level

BILI Biliris S Dar S Gehani N Jagadish H

The inner kernel b o ot co de and device supp ort is based

V and Ramamritham K ASSET A System for

on BSD

Supp orting Extended Transactions Proceedings of

As part of our development we plan to implement a

SIGMOD Minneap olis MN May

POSIX compatibility IEEE library on top of VINO

CAO Cao P Felten E and Li K Application We are also lo oking into supp orting BSD binaries directly

Controlled File Caching Policies Proceedings of the but have not committed to it

Winter Usenix Conference pp June OLEA OLeary K Woo d M Advanced System Ad

ministration UNIX Press Englewoo d Clis NJ

Chapter

CHANG Chang A Mergen M Rader R Rob erts

OUST Ousterhout J Why Arent Op erating Sys

J Porter S Evolution of storage facilities in AIX

tems Getting Faster as Fast as Hardware Pro

Version for RISC System pro cessors IBM

ceedings of the Summer Usenix Technical Con

Journal of Research and Development January

ference Anaheim CA June

PRES Presotto D Pike R Trickey H and

CHUT Chutani S Anderson O Kazar M Lev

Thompson K Plan A Distributed System

erett B Mason W Sideb otham R The Episo de

Proceedings of the Spring EurOpen Conference

File System Pro ceedings of the Winter

May

Usenix Conference San Francisco CA January

RABIN Rabin M Fingerprinting by Random Poly

nomials Harvard University Center for Research in

ENGL Engler D M F Kaasho ek and J OToole

Computing Technology TR

The Op erating System Kernel as a Secure Pro

ROZI Rozier M Abbrossimov V Armand F

grammable Machine Proceedings of the Sixth

Boule I Giend M Guillemont M Herrmann

SIGOPS European Workshop September

F Leonard P Langlois S Neuhauser W The

GRAY Gray J Lorie R Putzolu F and Traiger

Chorus Distributed Op erating System Computing

I Granularity of Lo cks and Degrees of Consistency

Systems v n

in a Large Shared Database in Modeling in Data

SELT Seltzer M Olson M LIBTP Portable

Base Management Systems Elsevier North Holland

Mo dular Transactions for UNIX Proceedings

New York pp

Winter Usenix Conference San Francisco CA pp

January

IEEE IEEE Portable Op erating System Interface

POSIX Part System Application Program

STON Stonebraker M Op erating System Supp ort

Interface API C Language IEEE Standard

for Database Management Communications of the

b September

ACM July

KAZAR Kazar M Leverett B Anderson O

VXFS Unix System Lab oratories The vxfs File System

Vasilis A Bottos B Chutani S Everhart C

Type from Advanced System Administration for

Mason A Tu S Zayas E DECorum File Sys

UNIX SVR

tem Architectural Overview Pro ceedings of the

WAHBE Wahbe R Lucco S Anderson T and

Sum mer Usenix Anaheim CA June

Graham S Ecient SoftwareBased Fault Isola

tion Proceedings of the th SOSP Asheville NC

LISK Liskov B Day M and Shrira M Dis

December

tributed Ob ject Management in Thor in Dis

tributed Object Management Morgan Kaufmann

San Mateo California

MCNAM McNamee D and Armstrong K Ex

tending the Mach External Pager Interface to Ac

commo date UserLevel Page Replacement Policies

Proceedings of the Usenix Mach Workshop

Burlington VT

MCKU McKusick M Joy W Leer S Fabry R

A Fast File System for UNIX Transactions on

Computer Systems v n pp August

NEXT NextStep Users Manual Next Computer

The Case for Geographical PushCaching

James Gwertzman Margo Seltzer

Harvard University

fgwertzma margogdasharvardedu

when the server decides where to cache a le it can make Abstract

this decision using its knowledge of network top ology and

Most existing widearea caching schemes are client initi

the les access history for even greater network band

ated Decisions on when and where to cache information

width savings

are made without the b enet of the servers global knowl

edge of the situation We b elieve that the server should

play a role in making these caching decisions and we pro

Motivation

p ose geographical pushcaching as a way of bringing the

server back into the lo op The World Wide Web is an

Preliminary analysis of Web access logs show that a few

excellent example of a widearea system that will b en

les on each server are resp onsible for most trac from

et from geographical pushcaching and we present an

that server Servers can therefore save a great deal of

architecture that allows a Web server to autonomously

bandwidth by only worrying ab out those few les This

replicate HTML pages

fact makes server caching feasible since replicating and

distributing les puts a slight load on the server Analy

sis also shows that the access pattern for each le is not

geographically uniform File requests are often clustered

Introduction

geographically which implies that judicious selection of

cache sites can provide excellent bandwidth savings

The WorldWide Web op erates for the most part as

a cacheless distributed system When two neighboring

clients retrieve a do cument from the same server the do c

ument is sent twice This is inecient esp ecially consid

Architecture

ering the ease with which Web browsers allow users to

There are two comp onents to our prop osed Web system transfer large multimedia do cuments

a mo died HTTP server and a replication service The To combat this problem some Web browsers have b e

mo died server is resp onsible for tracking geographical gun to add lo cal client caches These prevent the same

access information for its les and for accepting and of client from transferring the same do cument twice Some

fering cached replicas of other les This can b e done by networks are also b eginning to add Web proxies that

mo difying a proxy server such as the CERN proxy server prevent two clients on the same campus network from

to accept les for replication using a mo died POST transferring the same do cument twice

request The problem with b oth these schemes is that they are

The replication service keeps track of mo died HTTP myopic A client cache do es not help a neighboring com

servers that are willing to serve replicated les the puter and a campus proxy do es not help a neighboring

amount of available free space on each server and each campus Furthermore these caches are usually limited in

servers average load The replication service works with size Disk might b e cheap but as the size of multimedia

the mo died server to decide where a given le should b e les increases it will b e imp ossible to cache everything As

cached a result these caches will only b e able to store the most

We must minimize the amount of state that each Web p opular items even though there is still some demand for

server stores for its les or else we will face scalability other less p opular items

problems We therefore track geographical access infor The solution is for clients to share each others cache

mation in a coarse manner We are currently using states space The degree to which a le is replicated should b e

and countries since these can easily b e obtained from net prop ortional to that les global p opularity and clients

work addresses should retrieve les from the the nearest cache to mini

When the demand for a le exceeds a replication thresh mize network trac The server can satisfy b oth goals

old the server replicates it We are using tracedriven by deciding when and where to cache les Furthermore

Figure Before le replication takes place several clients accessing a World Wide Web le on the east coast

simulation to determine reasonable values for the replica server The primary servers load will drop as clients b egin

tion threshold and we exp ect it to b e dynamic to access the le from the new server Should the primary

The replication service decides where to replicate the servers load climb high enough that it must replicate the

le given its access history The goal is to pick a server to le again the primary server will choose a dierent server

cache the le that will minimize the amount of bandwidth to cache it on since the access patterns will have changed

used in the future We predict this by using the les Likewise if the new servers load climbs high enough such

history that it must replicate the le it will b e cached in yet

Figure and gure illustrate the replication pro cess another place b ecause the access patterns for the new

in action Several clients from across the United States server will b e very dierent than those for the old server

are accessing a le on an east coast server The east coast There are two issues that must still b e addressed for

server replicates the le such that network bandwidth is this scheme le consistency and resource discovery A

minimized and the le ends up on a west coast server server may determine that its copy of a le is out of date

The replication service maintains a list of all HTTP by using the getifmodiedsince HTTP request This is

servers willing to replicate les it must choose one of them an ecient way to b oth check consistency and to request

to cache the le We are considering several algorithms to the new le in the event it has b een mo died but it is

determine the optimal cache lo cation If the service knows to o exp ensive to use every time a le is requested

the Internets top ology it can solve the problem by nding Since weakconsistency should b e acceptable for the

a go o d solution to the corresp onding graph partitioning Web we are using a scheme developed for the Alex

problem or maxcut minow problem le system With the exception of dynamic pages these

If only coarse grained information is available ab out the will b e addressed separately we exp ect the Web to ob ey

Internet such as average latency b etween servers avail the same principle as FTP the older a le is the less

able from traceroute a b etter solution is to iterate over likely it is to b e mo died Therefore the older the le

a representative sample of the available servers calculat that an HTTP server is caching the less frequently the

ing bandwidth savings for each The service would then HTTP server must p oll to check if its copy is still upto

replicate the le on the server that would have reduced date This is very ecient compared to checking for every

network bandwidth the most request and the client will b e able to force a p oll if it is

Once the primary server gives a le to another server essential to use the latest le

for caching the primary server forgets ab out the other As for resource lo cation there are several groups work

Figure After le replication has taken place the le has b een replicated onto a west coast server so as to minimize

network bandwidth

cache dynamic pages such as those generated by cgibin ing on this problem Until this problem is solved we are

scripts that are used to create Web pages on the y using a technique prop osed by Blaze which we call

the technique Clients call the primary server

to ask for the server nearest you This is not elegant

but it works b ecause latency is currently more critical

than bandwidth The exp ense of querying a distant server

Related Work

once is amortized over the many lo cal requests that are

thereby made p ossible Eventually there will need to b e

There is little work on caching in largescale distributed

a cleaner solution so that all dep endence on the original

systems outside of distributed le systems since only in

server can b e removed Otherwise we will face scalability

the past few years has the attention of the distributed

and reliability problems

systems community turned toward globally distributed

systems such as the WorldWide Web and FTP Several

groups are working on similar problems but none that we

know of are working on serverinitiated caching The Har

Other Applications

vest system in particular incorp orates an object caching

subsystem that provides a hierarchically organized means Throughout this pap er we have referred to HTTP servers

for eciently retrieving Internet ob jects such as FTP and and les This was for the purp oses of clarity as well

HTML les as to provide a fo cus for our research We exp ect our re

Blaze has addressed caching in a largescale system sults to b e applicable to any widearea distributed system

His research fo cused on distributed le systems but can however not just the World Wide Web One application

b e applied to FTP or the Web Finally the Alex system for geographical pushcaching that we have in mind is to

was designed to provide a means of caching FTP les replicate not only data les but also services themselves

Of these three systems Blazes design comes closest to our A go o d example would b e Archie whose load prob

own since it supp orts replication when demand b ecomes lems are notorious If Archie were to b e written in a

to o high and b ecause it lets clients use any nearby cache machineindependent networkservice scripting language

It do es not however provide the server with control over eg Tcl its co de could b e replicated and cached just

where replicas are placed like a Web le This might also b e the answer to how to

John K Ousterhout Tcl An embeddable com Conclusion

mand language In USENIX Conference Proceed

We do not b elieve that geographical pushcaching should

ings pages Washington DC January

replace clientinitiated caching These two techniques ad

USENIX

dress the same problem on two dierent timescales and

Nicolas Pio ch Le weblouvre WorldWide Web http

therefore are complimentary Clientinitiated caching re

mistralenstfr pio chlouvrelouvreshtml

sp onds quickly to lo cal changes in a les p opularity but

can not alleviate a global rise in demand Likewise server

initiated caching can not cop e very well with sudden lo

calized jumps in p opularity but is b est suited to handling

longterm le request trends

We close with this reminder of why serverinitiated

caching is necessary taken from the Web home page of

the WebLouvre

Note Starting end of Octob er we are cur

rently exp eriencing severe network problems on

our Kb school Internet connection Please

b e understanding I am still lo oking for a site

willing to mirror the WebLouvre exhibit Mb

in all preferably in the USA

References

T BernersLee R Cailliau JF Gro and B Poller

mann Worldwide web The information uni

verse Electronic Networking Research Applications

and Policy

Matthew A Blaze Caching in largescale distributed

le systems Technical Rep ort TR Princeton

University January

C Mic Bowman Peter B Danzig Darren R Hardy

Udi Manber and Mich ael F Schwartz Harvest A

scalable customizable discovery and access system

Technical Rep ort CUCS University of Col

orado Boulder

Vincent Cate Alex A global lesystem In USENIX

File Systems Workshop Proceedings pages Ann

Arb or MI May USENIX

Alan Emtage and Peter Deutsch Archie an electronic

directory service for the internet In Proceedings of the

USENIX Winter Conference USENIX January

Ari Luotonen and Kevin Altis Worldwide web

proxies In Computer Networks and ISDN systems

First International Conference on the WorldWide

Web Elsevier Science BV available from

httpwwwcernch PapersWWW luotonenps

Mosaicxncsauiucedu Using

proxy gateways WorldWide Web available from

httpwwwncsauiucedu SDGSoftware Mosaic

Do cs proxygatewayshtml

Structuring the Kernel as a Collection of Reusable Comp onents

Christopher Small

Harvard University

chrisdasharvardedu

Abstract Related Work

Conventional op erating systems provide highlevel black

Others have examined the need for customizing op erating

b ox services to application programs If the services do

systems Kiczales et al argue that the blackbox mo del

not t the needs of an application the application is out of

for op erating systems hides not only crucial implemen

luck Applications whose needs might b e met by the op er

tation details but crucial p olicy issues as well KICZ

ating system need to reimplement facilities from scratch

these p olicy decisions are not appropriate for all applica

Instead of providing blackbox services an op erating

tions

system should b e decomp osed into a set of reusable incre

Anderson argues that the co de in the op erating systems

mental ly extensible comp onents These comp onents are

should b e stripp ed to a minimum ANDE The func

used by the kernel to implement its services and can b e

tionality of the op erating system would b e moved into

reused by applications If only part of a service is needed

the application the kernel would only b e resp onsible for

by an application eg the buer cache from the lesys

arbitrating resource requests Applications would have

tem it is available as a separate mo dule This approach

control over p olicy decisions b ecause the decisions would

is taken by the VINO kernel under development at Har

b e made in application co de services such as le systems

vard University

would b e developed as application libraries and could

b e reused or circumvented by applications The Aegis

exokernel ENGL follows this approach In contrast

VINO leaves services in the kernel but allows applications

to reuse them inplace

Introduction

The SPIN system BERS is an extensible microker

nel that allows applications to add co de to the kernel on

Conventional op erating systems provide services such as

they as spind les The spindle mechanism allows services

le storage name management and caching As a side

to b e constructed that are tailored for a particular appli

eect of implementing these services the kernel typically

cation SPIN fo cuses on adaptability rather than reuse

implements but do es not exp ort fast synchronization

although application co de can b e placed in the kernel for

simple transaction management and logging The in

a p erformance gain the SPIN architecture is essentially

terface provided by the kernel allows applications to use

that of a conventional microkernel

the exp orted services asis if the service is not appropri

ate the application must implement a replacement from

scratch

At the application level ob jectoriented development

metho dologies are all the rage Toolkits for developing

Decomp osition

applications are available for user interface development

database access do cument management and general data

An op erating system is comp osed of a set of services structure manipulation NEXT These to olkits are

Some are normally exp orted others hidden For example comp osed of reusable extensible and co op erative mo d

the Fast Filesystem MCKU is comp osed of physical ules

storage a buer cache a name service metadata synchro The op erating system should construct its services as

nization and management and a recovery to ol In VINO such a to olkit In order for a system structured this way

where a larger service can b e decomp osed into p otentially to b e useful it needs to b e appropriately decomp osed and

useful subservices it is for example VINO oers each of easily extended This pap er describ es the service decom

the comp onents of its lesystem as a separate service to p osition and extension mechanism found in the VINO ker

application programs nel under development at Harvard University

File Manager Single level lo cks are not always sucient a lo ck man

ager needs to provide generalpurpose hierarchical locking

The le manager provides the standard Posixstyle le

GRAY For example the le system typically allows

op erations to its clients read write seek app end and

lo cking at the blo ck le directory and le system levels

truncate A le system normally builds this abstraction

a relational database management system lo ck hierarchy

on top of a physical disk using the disk VINO interposes

consists of eld tuple relation and database When a

a volume manager b etween the le manager and the raw

lo ck is requested the lo ck manager must verify that no

device The volume manager arbitrates requests for disk

conicting lo ck is held on any other element in the hier

space to subsystems that require p ersistent storage The

archy

use of the volume manager allows the disk to b e dynami

In most kernels the le system lo cking hierarchy is im

cally partitioned b etween its clients

plicit buried in the co de a general lo ck manager must b e

Alternatively a lesystem can b e created and directed

able to work with any usersp ecied hierarchy VINO ac

to use a dierent manager for its storage eg a vir

complishes this by allowing an application to dene a con

tual memory based volume for building a memorybased

tainment hierarchy for the resources b eing lo cked When

lesystem or a dierent le manager delegating p ersis

a lo ck request is made the lo ck manager examines the

tent storage to another lesystem The latter technique

currently allo cated lo cks and the clientspecied contain

can b e used to build a layered lesystem a compressed

ment hierarchy to determine if the new request can b e

lesystem would override the default read and write op

granted

erations and store the compressed data on a le system

Clients need to b e able to sp ecify how the lo ck manager

that writes its data to a disk

b ehaves in the face of lo ck contention eg deadlo ck de

tection and resolution blo cking or nonblo cking requests

Cache Manager

and lo ck types read vs write lo cks and lo ck compati

bility multiple concurrent readers vs single writer

There are several caches in the typical system The buer

Note that by integrating the kernel and user level lo ck

cache is a cache of recentlyused disk blo cks physical

ing systems concurrency can b e increased For example a

memory holds recently used virtual memory pages A

DBMS running on a conventional UNIX le system may

cache consists of a function that maps references to cache

implement its own lo ck manager to synchronize database

entries a backing store interface and a replacement p ol

access and issue multiple IO requests to the same le

icy By allowing clients to dene the implementation of

Unfortunately the UNIX le system exclusively lo cks the

each of these interfaces the standard cache manager can

entire le during each IO op eration no concurrency is

b e used by VINO for b oth for the buer cache and the

achieved even though the DBMS is already ensuring the

VM cache or by applications managing their own data

integrity of the op eration With shared lo ck management

caches For example a database management system can

b ecause the same lo ck manager handles b oth DBMS re

manage its client cache by replacing the backing store in

quests and IO requests lo cks held by the DBMS are suf

terface with remote requests to the database server

cient to p erform IO no additional lo cking is required

by the le system

Name Manager

A lesystem normally includes a subsystem that maps a

Log Manager and Recovery Manager

name to le reference eg an ino de number NFS le

Several new le systems use databasestyle logging for

handle or vno de p ointer The same co de can b e used

improved p erformance and fast recovery CHANG

for other applications that use a hierarchical namespace

CHUT KAZAR VXFS but this facility is not ex

such as a database naming subsystem the Domain Name

p orted to applications Obviously database management

Service or Xs resource database

systems use logging but many other applications need re

covery systems as well For example FrameMaker vi

Lo cking

news readers such as rn and email frontend to ols all at

tempt to retain and recover their state in the face of run

The op erating systems synchronization primitives are

time failure Recovery co de is notoriously complex and is

typically much simpler and more ecient than those pro

often the subsystem resp onsible for the largest number of

vided to userlevel applications For example obtaining

system failures SULL Supp orting multiple recovery

a semaphore in System V ATT incurs the cost of a

systems can only reduce total system robustness

system call This overhead is not usually necessary if a

In VINO a log resides on one or more physical devices

lo ck is not contested a userlevel testandset instruction

can b e used to obtain the lo ck cheaply If it is already

UNIX is a trademark of XOp en

held a system call would then b e made to enqueue the

FrameMaker is a registered trademark of Frame Technology

Corp oration lo ck request SELT

It can b e created on a single device or extended onto a b e overridden by an application by grafting a new im

second device not necessarily of the same type as the plementation into the kernel The grafting pro cess uses

rst A DBMS would request a log that spans b oth sandboxing WAHBE or a similar software fault isola

magnetic disk and archive media eg tap e or optical tion technique to ensure that user co de do es not com

disk The kernel would request a volatile inmemory promise the safety of the kernel Co de is written in a

log to supp ort transactions on ephemeral data such as conventional programming language unlike other exten

pro cess structures and buer cache metadata The le sible systems eg SPIN BERS Aegis ENGL and

system would request a log that spans nonvolatile RAM Thor LISK we have not undertaken the task of den

and disk le system log records would b e written rst ing a new typesafe language It is outside the scop e of our

to nonvolatile RAM and later written to disk in large pro ject to sp ecify implementand supp ort a new language

ecient transfers and widespread acceptance of new languages in the com

The key interface to the log facility is the readwrite munity irresp ective of their elegance and p ower is very

interface that supp orts writeahead logging a writelog low

function that returns a unique identier a log sequence Even with these assurances userinstalled co de may not

number and a readlog function that returns records in terminate in a timely fashion The VINO kernel is multi

log sequence order It also supp orts a synchWAL op era threaded and grafted co de that runs to o long times out

tion to synchronize the log with the data b eing logged and The grafted co de may b e illb ehaved and never return to

a checkpoint op eration for log reclamation and archiving the application but only the application itself suers no

other pro cess is prevented from making progress

We must also guard against grafted co de obtaining a

Transaction Manager

critical system lo ck and not releasing it in a reasonable

amount of time To handle this we attach a timeout

The kernel uses transactions to maintain consistency dur

to critical lo cks and kill a pro cess that do es not release

ing up dates to multiple related resources eg a directory

the lo ck b efore the timeout Each piece of grafted co de

and its contents For example when the Fast Filesystem

runs in the context of a lightweight transaction that keeps

up dates metadata it carefully orders disk writes to en

track of its allo cated resources If the pro cess terminates

sure the recoverability of the lesystem In the context of

the corresp onding transaction is ab orted and the system

a transaction the order of these writes would b e unimp or

is returned to a consistent state

tant the transaction would commit or ab ort atomically

Unlike the external servers of Mach ACET grafting

The VINO transaction manager supp orts the standard

allows small incremental changes in kernel functionality

transactionbegin transactioncommit and transaction

If the page eviction strategy of the system is inappropri

abort op erations It accepts references to appropriate log

ate it can b e replaced without writing a new external

and lo ck manager instances to use for each transaction

pager MCNAM

At transaction b egin a new transaction resource is cre

ated This resource references the appropriate log and

lo ck managers and is referenced by each protected up

Conclusions

date Most kernel transactions are protected using a sim

ple shadowresource scheme with a log residing in main

Conventional op erating systems provide services as black

memory either volatile or nonvolatile dep ending on the

b oxes where the services do not t the needs of an ap

resources b eing protected The mixing and matching of

plication the application is out of luck Instead of oer

logging and lo cking comp onents enables VINO to supp ort

ing monolithic services the kernel should b e structured

arbitrarily complex transaction proto cols

to provide a collection of smaller reusable incrementally

The transaction manager includes facilities for con

extensible to ols for application reuse

structing extended transactions BILI allowing appli

cations to take advantage of alternative mo dels such as

nested and splitjoin transactions

References

ACET Acetta M Baron R Bolosky W Golub

Extensibility and Reuse

D Rashid R Tevanian A and Young M

Exp orting a service to applications is only half the battle Mach A New Kernel Foundation for UNIX Devel

we also need a mechanism for allowing the service to b e opment Pro ceedings of the Summer Usenix Con

sp ecialized or extended

ference July

VINO implements each of the managers describ ed

ab ove as a resource type A resource type consists of a ANDE Anderson T The Case for Application

group of op erations and prop erties The op erations can Sp ecic Op erating Systems Pro ceedings of the

Third Workshop on Workstation Op erating Sys MCKU McKusick M Joy W Leer S Fabry R

tems A Fast File System for UNIX Transactions on

Computer Systems v n pp August

ATT ATT System V Interface Denition Third

Edition Volumes

MCNAM McNamee D and Armstrong K Ex

BERS Bershad D Chambers C Eggers S Maeda

tending the Mach External Pager Interface to Ac

C McNamee D Pardyak P Savage S Gun

commo date UserLevel Page Replacement Policies

Sirer E SPIN An Extensible Microkernel for

Proceedings of the Usenix Mach Workshop

Applicationsp ecic Op erating System Services

Burlington VT

Technical Rep ort Department of Com

puter Science and Engineering University of Wash

NEXT NextStep Users Manual Next Computer

ington Seattle

BILI Biliris S Dar S Gehani N Jagadish H

SELT Seltzer M Olson M LIBTP Portable

V and Ramamritham K ASSET A System for

Mo dular Transactions for UNIX Proceedings

Supp orting Extended Transactions Proceedings of

Winter Usenix Conference San Francisco CA pp

SIGMOD Minneap olis MN May

January

CHANG Chang A Mergen M Rader R Rob erts

SULL Sullivan M and R Chillarege Software De

J Porter S Evolution of storage facilities in AIX

fects and Their Impact on System Availability A

Version for RISC System pro cessors IBM

Study of Field Failures in Op erating Systems Di

Journal of Research and Development January

gest st International Symposium on Fault Tolerant

Computing June

CHUT Chutani S Anderson O Kazar M Lev

VXFS Unix System Lab oratories The vxfs File System

erett B Mason W Sideb otham R The Episo de

Type from Advanced System Administration for

File System Pro ceedings of the Winter

UNIX SVR

Usenix Conference San Francisco CA January

WAHBE Wahbe R Lucco S Anderson T and

Graham S Ecient SoftwareBased Fault Isola

ENGL Engler D M F Kaasho ek and J OToole

tion Proceedings of the th SOSP Asheville NC

The Op erating System Kernel as a Secure Pro

December

grammable Machine Proceedings of the Sixth

SIGOPS European Workshop September

GRAY Gray J Lorie R Putzolu F and Traiger

I Granularity of Lo cks and Degrees of Consistency

in a Large Shared Database in Modeling in Data

Base Management Systems Elsevier North Holland

New York pp

KAZAR Kazar M Leverett B Anderson O

Vasilis A Bottos B Chutani S Everhart C

Mason A Tu S Zayas E DECorum File Sys

tem Architectural Overview Pro ceedings of the

Sum mer Usenix Anaheim CA June

KICZ Kiczales G Lamping J Maeda C Kepp el

D McNamee D The Need for Customizable Op

erating Systems Pro ceedings of the Fourth Work

shop on Workstation Op erating Systems Napa CA

August

LISK Liskov B Day M and Shrira M Dis

tributed Ob ject Management in Thor in Dis

tributed Object Management Morgan Kaufmann

San Mateo California

Lies Damned Lies and File System Benchmarks

Diane Tang Margo Seltzer

Harvard University Division of Applied Sciences

fdtang margogdasharvardedu

Current Benchmarks are a Dis Abstract

grace

File system design implementation and p erformance is a

hot topic in op erating systems research but nearly all the

Benchmarks commonly used to measure le system p er

research in the area revolves around p erformance numbers

formance to day suer from several problems lack of scal

derived from inadequate b enchmarks File system b ench

ability use of a single number as a nal result measure

marks suer from lack of scalability sensitivity to op erat

ment of IO p erformance rather than le system p erfor

ing system b ehavior other than the le system and fun

mance and myopia an emphasis on incidental implemen

damental misconceptions ab out what is b eing measured

tation eects that rarely determine typical user p erfor

If le system research is to move forward the commu

mance

nity needs robust scalable and informative le system

We have examined most of the b enchmarks used in re

b enchmarks This pap er presents some of the aws in

cent research pap ers and will discuss Bonnie the Andrew

to days le system b enchmarks prop oses a set of guide

Benchmark IOStone and LADDIS formerly known as

lines for the development of go o d le system b enchmarks

NFSSTONE and NHFSSTONE We use the weaknesses

and discusses approaches to the creation of a compliant

of these b enchmarks to determine criteria for le system

b enchmark

b enchmarks

Bonnie consists of six microb enchmarks designed to

measure b ottlenecks in the le system Bonnie mea

sures the disk readwrite throughput and random seek

time The main shortcoming of Bonnie is that it is not

really a le system b enchmark but rather a disk b ench

Introduction

mark and IO p erformance cannot b e equated with le

system p erformance For example Bonnie gives no indi

Assuming that the number of publications in an area is

cation as to how fast a le system can p erform a path

an indication of research interest le systems and dis

name lo okup it only tells the user how fast the system

tributed shared memory are among the hottest topics

can transfer data Because of these limitations Bonnie

in op erating systems research The SOSP confer

do es not indicate how well a real application will p erform

ence b oasted seven le systems pap ers out of eighteen

on the system Bonnie yields the directive Thou shalt

SOSP b oasted three of twentyone the Summer

measure the le system if thou art reporting le system

Usenix twelve out of twentyseven and the Usenix

performance

ten out of twentyseven

The Andrew Benchmark originally developed at CMU

These pap ers fo cus primarily on three issues distribu

to compare AFS to other le systems uses existing Unix

tion improved p erformance and scalability In order to

utilities to create a directory hierarchy copy les to that

argue any of these p oints researchers must demonstrate

hierarchy examine the les and then compile them

that the le system under investigation functions cor

At the time Andrew was developed it might have stressed

rectly provides adequate or exceptional p erformance

many le systems However Andrew has not scaled with

and satises the novel claims made Performance in par

time Andrew uses a xedsize data set which is to o

ticular is a key challenge for le systems as pro cessor

small On most systems to day the entire data set will

sp eeds climb exp onentially while IO sp eed grows at a

t in the buer cache which means that after the initial

linear rate Unfortunately the technology for describ

create and copy all data requests can b e satised from the

ing le system p erformance is woefully inadequate

cache Furthermore Andrews running time is dominated

This pap er critiques the most oftcited le system

by the compile phase which means that Andrew is almost

b enchmarks and prop oses a set of criteria for the estab

lishment of successful le system b enchmarks Unix is a trademark of XOp en

entirely user CPU b ound rather than either IO b ound The Call for a New File System

or system CPU b ound As a result it is unclear what An

Benchmark Metric

drew measures to day Andrew yields the directive Thou

shalt make le system benchmarks scalable

In order to design a go o d le system b enchmark we need

IOStone developed in at UCDavis is designed to

to dene a measure of go o dness Chen stated several

measure le system p erformance on a workload based on

go o dness criteria for IO b enchmarks Namely an IO

Unix le system traces and IBM mainframe traces

b enchmark should b e

IOStone has three phases create a le system

hierarchy read and write the hierarchy and delete the

Prescriptive it should p oint system designers to p os

hierarchy Only the readwrite phase is measured to

sible areas of improvement

pro duce one nal result in IOStones p er second IOStone

IO b ound

has many shortcomings Its le system hierarchy mo del

is awed IOstone claims to emulate the workload on a

Scalable with advancing technology

typical UNIX workstation by creating a mo del at le

system hierarchy but real le system hierarchies are rarely

Comparable b etween dierent systems

at Furthermore in an attempt to remove cache eects

General applicable to a wide variety of workloads

it reads large spacer les b efore the readwrite phase of

the b enchmark Unfortunately the spacer les are xed

Tightly sp ecied no lo opholes and clarity in what

size MB indep endent of the cache size and therefore

needs to b e rep orted

inadequate to ush large caches The data set is also small

enough to t in almost any buer cache which means

These measures are exactly those we derive for le sys

that like Andrew IOStone is not particularly IO b ound

tem b enchmarks In fact with the exception of IO

The nal shortcoming of IOStone is that it only pro duces

b oundedness these criteria should probably b e applied

a single result in IOStones p er second This result can

to most b enchmarking metho dologies

provide comparative p erformance information but it do es

In the case of le system b enchmarking we want to

not help the user determine what asp ects of the system

understand the b ehavior of each comp onent eg disks

need to b e improved or how to improve them IOstone

caches le system co de If we can isolate the p erformance

yields the directive Thou shalt make benchmark results

of each comp onent of the le system then we can identify

descriptive

areas for improvement Furthermore if we can character

The last b enchmark we examine here is LADDIS which

ize an application in terms of these comp onents we can

is still under development LADDIS is based on NHFS

determine how well a particular le system will satisfy a

STONE which is based on NFSSTONE and is designed

particular applications needs For example supp ose the

to measure the p erformance of NFS servers

task at hand is to optimize p erformance of a database

LADDIS has the p otential to b e a wonderful b enchmark

system that stores every record in a separate le This

for NFS servers It is scalable in the number of clients and

database will undoubtedly p erform p o orly on le systems

in the load p er client its results must b e presented graph

with p o or lo okup p erformance

ically showing how the p erformance of a server varies

It is our hop e that the metho dology that enables us to

with load and it measures the p erformance of the server

characterize a le system by its comp onent p erformance

However it is limited to NFS and it do es not give a clear

can b e extended to characterize an op erating system by

indication of how to improve system p erformance since

its comp onent p erformance as well Such a b enchmarking

the only parameter it varies is load on the client LAD

system would allow us to accurately compare subsystems

DIS yields the directive Thou shalt make benchmarks

residing in dierent op erating systems since we can accu

prescriptive

rately attribute dierences to the correct comp onents

What we can see is that many of the existing b ench

marks have severe limitations they do not scale and they

do not measure the le system As a result they are not

A Better Benchmark

particularly useful and they do not assist researchers in

The selfscaling IO b enchmark developed by Chen ful

understanding le system p erformance

lls the goals stated in the previous section This b ench

mark has ve parameters the size of the overall data set

the number of pro cesses running concurrently the average

size of an IO request to the nearest blo ck p ercentage

of op erations that are reads and p ercentage of op erations

that are sequential as opp osed to random The b ench

mark has two phases It rst nds the fo cal vector which

is the set of ve values one for each parameter that are Using the combination of these two dierent parts of the

as far as p ossible from any drastic p erformance changes in b enchmark we fulll all the metrics stated in the previous

throughput as a function of that parameter Intuitively section The adaptation of Chens ideas gives us a pre

the fo cal p oints are representative of typical values ap scriptive b enchmark that measures sp ecic asp ects of the

plicable over a wide range of workloads Once the fo cal le system while the tracebased b enchmark yields cross

vector has b een identied the b enchmark generates ve platform comparisons and applicability to a wide variety

graphs plotting throughput as a function of each param of workloads The tracebased approach is insucient by

eter with the remaining parameters at their fo cal p oint itself due to the diculty in separating out the eects of

value Using these graphs Chen introduces the idea of the separate comp onents in the le system and thus is

predictive p erformance He claims that since the fo cal not prescriptive Both metho ds can b e made scalable and

vector is generally applicable the shap e of a graph should tightly sp ecied

b e applicable even when the parameters are not at their

fo cal values If a workload can b e characterized in terms

Conclusion

of these ve parameters then the workload p erformance

is predicted by scaling b etween the ve graphs

Existing le system b enchmarks are inherently awed in

This b enchmark is scalable tightly sp ecied repro

that they are not very go o d at measuring the p erformance

ducible descriptive and prescriptive for the IO sys

of a le system We need a new b enchmark that not only

tem which do es overlap with the le system Its only

gives accurate p erformance numbers for a le system but

shortcoming for our purp oses is that it is IO system sp e

is also helpful scalable and able to b e widely used

cic rather than le system sp ecic It do es not provide

feedback on any le system comp onent other than disk

p erformance and buer cache size or what parts of the

References

le system need to b e improved

We are searching for a le system b enchmark that ful

T Bray Bonnie source co de NetNews p osting

lls the metrics stated ab ove We are considering two

p ossible approaches One approach is to extend Chens

ideas to le systems ie to nd parameters that are in

P M Chen D A Patterson A New Approach

dicative of le system p erformance rather than IO sys

to IO Benchmarks Adaptive Evaluation Pre

tem p erformance The main diculty with this approach

dicted Performance UCBComputer Science Dept

is in nding a set of parameters that are as plausibly in

University of California at Berkeley March

dep endent from one another as those Chen uses to mo del

the IO subsystem Without parameter indep endence we

lose the ability to predict the p erformance for a sp ecic

J H Howard M L Kazar S G Menees D A

workload and therefore the ability to make crossplatform

Nichols M Satyanarayanan R N Sideb otham M

comparisons as well

J West Scale and Performance in a Distributed

A second approach is to augment Chens idea with

File System ACM Transactions on Computer Sys

a tracebased b enchmark If we can gather le system

tems February

traces for sp ecic applications and successfully parame

terize them in terms of le system op erations we can

I Hu Measuring File Access Patterns in UNIX

construct a le system b enchmark that meets the crite

Performance Evaluation Review

ria prop osed ab ove For example assume that we can

ACM SIGMETRICS

gather le system traces of a target application set eg

a compiler word pro cessor and database system We

M K Molloy Anatomy of the NHFSSTONES

pro cess the traces and derive a parameterized workload

Benchmark Performance Evaluation Review

expressed in terms of the mix of le system op erations

their dep endence up on one another and their interarrival

times This parameterized workload is then used to drive

B Nelson B Lyon M Wittle B Keith LADDIS

the b enchmark The traces are more informative than

A MultiVendor and VendorNeutral NFS Bench

simply running the applications b ecause they provide the

mark UniForum Conference January

ability to obtain timing information for sp ecic calls This

J K Ousterhout J DaCosta et al A Trace

b enchmark can b e scaled by increasing the number of pro

Driven Analysis of the UNIX BSD File System

cesses but can also b e scaled to dierent pro cessors and

Operating Systems Review December

disks by altering the average interarrival time appropri

Pro ceedings of the th Symp osium on Op

ately

erating Systems Principles

A Park J C Becker IOStone A Synthetic File

System Benchmark Computer Architecture News

June

D A Patterson G Gibson R H Katz A Case for

Redundant Arrays of Inexp ensive Disks RAID

International Conference on Management of Data

SIGMOD June

Shein M Callahan P Woo dbuy NFSStone A

Network File Server Performance Benchmark Pro

ceedings of the USENIX Summer Technical Confer

ence

A J Smith Sequentiality and Prefetching in

Database Systems ACM Transactions on Database

Systems

A J Smith Analysis of Long Term File Reference

Patterns for Application to File Migration Algo

rithms IEEE Transactions on Software Engineer

ing SE No

The Case for InKernel Tracing

Yasuhiro Endo

Christopher Small

Harvard University

fyazchrisgdasharvardedu

Abstract mers can quickly and easily evaluate the suitability of dif

ferent p olicies using the tracing and simulation to ols that

Op erating systems are often criticized for providing xed

VINO provides

lowestcommondenominator p olicies that are inappropri

ate for many classes of applications VINO a new

Traces

op erating system under development at Harvard Univer

sity is like other extensible op erating systems in that

The kernel is a set of mo dules with input p orts and output

it allows application programs to direct inkernel p olicies

p orts connected together Mo dules such as the le system

through a mechanism called grafting However to take

buer cache and device drivers have their p orts connected

advantage of this service and select or implement eective

together and requests are transp orted through these con

kernel p olicies application developers must p erform the

nections This architecture is similar to the use of streams

nontrivial task of evaluating each new p olicy This typi

in UNIX In UNIX streams facilitate the adaption of

cally requires accurate simulation andor tracing There

network mo dules while in VINO our architecture sup

has b een little research on how to aid programmers in

p orts the adaptation of nearly any mo dule in the system

this task Many of the prop osed solutions rely on sp e

For example the input p orts of the physically ad

cially instrumented op erating system kernels andor sim

dressed buer cache mo dule are connected to the output

ulation mo dules constructed to analyze a single decision

p ort of the le system mo dules and the output p orts are

In this pap er we prop ose an inkernel tracing and simu

connected to the disk driver mo dules The buer cache

lation mechanism designed to simplify the evaluation of

mo dule accepts requests for a blo ck on a particular disk

application sp ecied p olicies

and either satises the request internally using the cached

disk blo cks or generates requests to the appropriate disk

driver mo dules

Introduction

VINO provides users with simple metho ds to alter the

ow of requests b etween mo dules With this facility users

Extensible op erating systems oer an attractive alterna

can easily generate a log of requests made by a particu

tive to applications whose functionality or p erformance

lar mo dule by redirecting the ow of requests to b oth its

are thwarted by the rigid kernel p olicies of conventional

original destination and to a le These logs can then

op erating systems However extensible systems also bur

b e used to repro duce a stream of requests that can b e fed

den the application with the need for increased decision

into real or simulated kernel mo dules In the remainder of

making It is p ossible for applications to exhibit worse b e

this pap er we refer to the captured input request stream

havior if application sp ecied p olicies are not carefully se

as a trace and the captured output request stream as a

lected or implemented VINO simplies the analysis pro

log

cess by providing inkernel tracing and simulation to ols

This tracing facility allows application programmers a

It do es so by exploiting the VINO claim that the op erat

quick and simple way to evaluate p olicies For exam

ing system kernel is a collection of reusable to ols

ple we can evaluate dierent buer cache management

We structure the kernel to allow users to capture or in

p olicies by replaying the same trace through the buer

tro duce request streams b etween two kernel mo dules and

cache mo dule augmented with dierent p olicy algorithms

reuse existing kernel co de to run simulations Simulation

and counting the number of requests in the buer cache

mo dules are instances of regular kernel mo dules except

log ie counting the number of requests in the output

that they do not aect any state seen by the rest of the

stream Because the tracing facility allows the replay of

kernel For example the buer cache simulation mo dule

traces we can recreate identical workloads for dierent

shares much of the co de with the real buer cache mo dule

UNIX is a trademark of XOp en including the grafted p olicy co de Application program

test runs the timing problems asso ciated with the second and third

metho ds

The second metho d is useful when the target applica

Simulation

tions b ehaviors are dicult to repro duce Many inter

active programs fall into this category For this class of

Some mo dules inside the VINO kernel can b e instantiated

applications traces collected from the input p ort of the

as simulation mo dules freeing programmers from the tasks

buer cache mo dule can b e used to drive the exp eriment

of building separate simulators Simulation mo dules are

The user must rst run the application interactively to

identical to real mo dules except that they do not mo dify

generate the trace then while the programmer logs the re

the global state Therefore simulations can run without

quests sent from the output p ort the trace can b e played

aecting the rest of the system Since the simulators and

back into the input p ort of the buer cache mo dule as

real mo dules share much of the co de we do not increase

many times as needed to evaluate dierent p olicies Re

co de size substantially

sults obtained using this metho d are not as accurate as

Mo dules that supp ort simulation consist of two logical

those obtained using the rst metho d In order to prop

set of states the rst is writable by b oth the real and

erly repro duce the eects of events that are not directly

simulation instances of the mo dule and is duplicated for

triggered by the arrival of the requests such as the ush

each instance of such mo dules and the other is writable

ing of dirty cache blo cks the trace must b e played back

only by the real instance of the mo dule b ecause the states

with the exact timing However the timing of the request

are shared systemwide In case of the buer cache mo d

arrivals are often dep endent on the b ehavior of the buer

ule information such as which buer cache page contains

cache A miss in the buer cache will delay the arrival

a particular blo ck of a physical disk falls into the rst

of the next request Therefore if the p olicy b eing eval

category while the cached data itself falls into the second

uated and the p olicy used when the trace was collected

The simulation mo dules run without aecting the rest

are drastically dierent in the eciency the result may

of the system nor are they aected by other activities in

b e inaccurate

the system This allows the simulators to b e run under

The third metho d utilizes the simulation mo dule as well

many dierent situations The user can evaluate appli

as the tracing facility to provide a convenient way for pro

cation sp ecied p olicies by playing back a trace into the

grammers to p erform dynamic evaluation of p olicies An

simulation mo dule and logging the output to a le Un

application may p erio dically collect traces from the input

like using the real mo dules to evaluate p olicies requests

p ort of the buer cache mo dule and use the trace and the

that simulation mo dules generate are not transmitted to

simulator to decide which p olicy is most appropriate for

the rest of the system The simulations consume less sys

the tasks that the application is currently p erforming Be

tem resources This makes it feasible for application pro

cause a prerecorded trace is used to drive the simulator

grammers to develop applications that dynamically alter

this metho d is also sub ject to the problems with accuracy

the p olicies that they sp ecify Such programs may collect

of metho d two In addition imp erfections in the simula

traces as they execute and p erio dically apply the infor

tor implementation may introduce additional errors

mation gained from tracing to select p olicies that may

p erform b etter

Disk Layout Scheme Evaluation

Disks op erate most eciently when accessed sequentially

Sample Uses

Therefore an ideal disk layout scheme should maximize

the sequential access of the disk and minimize the number

Buer Cache Management

and the distance of seeks that is needed to satisfy a given

Using VINOs tracing and simulation facilities applica set of requests We can use the tracing facility to examine

tion programmers can evaluate dierent buer cache re the eectiveness of dierent disk layout scheme

placement p olicies using three dierent metho ds One of the diculties in evaluating disk layout scheme

The rst metho d is to run the target application in a arises from the fact that the disk layout scheme has long

controlled environment and collect logs from the output term eect Unlike the buer cache management p olicy

p ort of the buer cache mo dule Given that the user which can b e evaluated by running a short program it is

is successful in executing the application in a controlled imp ossible to p erform a meaningful evaluation of a disk

environment the results obtained using this metho d are layout scheme without aging the le system using the lay

the most accurate of the three since we are only using out p olicy b eing examined The aging pro cess involves

the tracing facility to record what is happ ening in the sub jecting the algorithm to thousands of if not millions

system the requests that buer cache management mo d of le creation deletion expansion and contraction One

ule receives are generated by a real program and handled can create programs that articially age the le system

by real kernel mo dules This metho d is not sub ject to but these programs usually fails to capture what really

happ ens in the system under normal use Richie D M A Stream InputOutput System

To overcome these problems the user can log all the ATT Bel l Laboratories Technical Journal Oc

relevant requests that the le system receives over a long tover

p erio d of time The trace is then used to age the le

Small C Structuring the Kernel as a Collection

system The user must make sure that the le system

of Reusable Comp onents Submitted to HOTOS

is in a known and easily repro ducible state eg empty

b efore the aging pro cess b egins

It is p ossible to analyze the aged le system statically to

Stonebraker M Op erating System Supp ort for

determine how much of the data is laid out sequentially

Database Management Communications of the

but the user should collect traces that reect common le

ACM July

reference patterns to p erform the evaluation The use of

these traces helps reect the fact that the eectiveness

of the layout p olicy dep ends heavily on which les are

frequently accessed and how those les are accessed The

log of disk requests is collected from the output p ort of

the buer cache and the user can use this log to analyze

the eectiveness of the layout p olicy in maximizing the

sequential access and minimizing the seek

Conclusion

We have identied that in order for application program

mers to take advantage of the kernel extensibility there

must b e mechanisms to allow programmers to quickly and

easily evaluate dierent p olicies Inkernel tracing and

simulation is a simple and general solution to this prob

lem and can b e implemented without a signicant increase

in the co de size by reusing the co de that is already in the

kernel We do not claim that this new facility will com

pletely do away with the need for sp ecialized tracing and

simulation to ols nor do we advocate turning the op erat

ing system into a simulator but rather we present this

as one example of how we can take advantage of existing

kernel co de to create a useful to ol

References

Bershad B C Chambers S Eggers C Maeda D

McNamee P Pardyak S Savage E Sirer SPIN

An Extensible Microkernel for Applicationsp ecic

Op erating System Services University of Wash

ington Technical Rep ort February

Engler D M F Kaasho ek and J OToole

The Op erating System Kernel as a Secure Pro

grammable Machine Proceedings of the Sixth

SIGOPS European Workshop September

Krueger K D Loftesness A Vahdat T Ander

son Tools for the Development of Application

Sp ecic Virtual Memory Management In Proceed

ings of OOPSLA volume pages

Your Op erating System is a Database

Keith A Smith and Margo Seltzer

Harvard University

fkeithmargogcsharvardedu

ating systems for over a decade Stone yet little seems Abstract

to have changed

The fundamental resp onsibilities of an op erating system

Although database systems have b een using logging

are the arbitration of access to shared hardware resources

to provide atomic up dates and fast recovery for decades

such as the pro cessor main memory and IO devices

Gray it is only in the last ve years that we have

and the provision of a clean layer of abstraction atop

seen the le system community accept logging as an ap

p otentially complicated devices Database management

proach for high p erformance and fast recovery Ouster

systems provide the same functionality with resp ect to

Kazar Chutani Similarly database management

datathey arbitrate access to shared data and they pro

systems have long addressed the problems of distributed

vide simple abstractions to facilitate the manipulation of

access to shared data Bern Op erating systems have

this data Both op erating systems and database systems

faced the same problem in providing supp ort for dis

address issues of synchronization concurrency control

tributed le systems and distributed shared memory Un

buer management distribution and recovery Despite

fortunately much of the op erating system research in

this overlap of form and function databases and op er

these elds has ignored solutions used by database sys

ating systems have historically b een implemented and re

tems leading to wasted eort rediscovering the same solu

searched completely separately and often comp ete rather

tions and the same dead ends While op erating systems

than co op erate for resources This approach leads to re

have b egun adopting some database techniques they have

dundancy in implementation and worse p erformance than

done little to address the needs of database systems

is p ossible if a more integrated design and research ap

We b elieve that the time for op erating systems re

proach is taken

searchers and database researchers to work together has

Op erating systems consist of many comp onents that

long since passed If either eld is to move forward b oth

p erform essentially the same task as a database but each

must acknowledge the commonality b etween the two and

of these pieces has its own idiosyncratic interface and im

learn from the past We go so far as to argue for a tighter

plementation The design and structure of op erating sys

integration where database and op erating system func

tems can b e simplied by using database structuring con

tionality overlap they should endeavor to use a common

cepts to implement a uniform interface for resource man

co de base More fundamentally as researchers we must

agement In this pap er we present the VINO Universal

examine the conventional architectures of each and dis

Resource Interface a general interface for structuring op

tinguish the gems from the aws

erating systems and provide examples of its use

The rest of this pap er is organized as follows The next

section provides a brief overview of database functionality

Section describ es a uniform interface for resource man

agement in op erating systems In section we show how

Introduction

a le system might b e implemented using this structur

ing technique Section provides a brief survey of related

In simplest terms b oth op erating systems and database

work and we summarize in section

systems arbitrate access to shared resources In the case

of an op erating system these resources are typically hard

ware comp onents disks network interfaces serial lines

main memory or pro cessors In the case of database sys

What Do Databases Do

tems the resources are typically data les records or

A database management system DBMS is a rep ository ob jects Both types of systems p erform similar tasks in

of data A database used by a bank might include infor managing these resources eg buer management syn

mation ab out all of the accounts at that bank the names chronization disk allo cation recovery yet they rarely

and addresses of the account holders and the amount share co de to do so The database community has b een

of money in the accounts Clients of a database make complaining ab out the lack of database supp ort in op er

requests to the database when they wish to read or mo d existing ob jects can b e deleted from it The op erating sys

ify data mo dications include adding new data deleting tem p erforms lo okup op erations to retrieve ob jects from

data or editing existing data A database may have the set and the op erating system may mo dify existing

many clients simultaneously issuing requests In some ob jects in the set This mo del of resource management

cases these clients may b e distributed across a network is the same as that used by database systems to manage

In most cases mo dications to the state stored by a data ob jects

data base take place in the context of transactions Trans A page table for example is simply a collection of phys

actions provide four imp ortant prop erties atomicity con ical memory pages indexed by their virtual addresses in

sistency isolation and durability Atomicity means that a pro cess address space When new pages are brought

that the mo dications in each transaction are applied as a into memory new entries are added to the page table

single unit Either they all are applied to the database or Similarly when pages are evicted from memory entries

none of them are The consistency prop erty insures that are deleted from the page table Whenever the pro cess

data in the database is always in a consistent state trans references memory the virtual address of the reference is

actions are required to take the database from one con used to p erform an asso ciative lo okup in the page table

sistent state to another Isolation requires that result of returning a physical memory page where the desired vir

concurrent transactions b e indistinguishable from the re tual address is lo cated In most architectures this lo okup

sult of applying the same transactions in some sequential is exp edited by dedicated hardware such as TLBs

order The last prop erty durability insures that once a On a unipro cessor the database notion of shared ver

transaction has b een committed its results are preserved sus exclusive access seems to have no analog with resp ect

across system failures Gray to page tables With only one pro cessor only one page

The prop erties of isolation and consistency require that can b e accessed at a time and therefore all page accesses

some form of lo cking or concurrency control b e applied are exclusive In a parallel or distributed architecture

to the data residing in a database Traditionally this however multiple pro cessors may b e using the same page

capability has b een provided through the use of lo cks table and attempting to access the same pages concur

with read lo cks supp orting multiple simultaneous accesses rently In such a system memory consistency b ecomes an

and write lo cks prohibiting concurrent access Gray imp ortant issue and the pro cessors must b e co ordinated

To sp eed access to data most databases index the with resp ect to which ones read or write a given memory

stored data items by one or more keys Given the key the page at what p oint This problem is exactly analogous

database can quickly retrieve the corresp onding data ob to the problem faced by a DBMS in handling concurrent

ject by p erforming an asso ciative lo okup The exact data requests to read andor write the same data ob jects

structures and lo okup mechanisms used by a database are Although many dierent op erating system facilities re

transparent to the user residing b ehind the abstraction quire this database mo del of resource management each

of records and keys Frequently the selection of indexing one is typically implemented indep endently resulting in

structures is tuned to the application and type of data multiple implementations and interfaces for arbitrating

b eing accessed Balanced trees Btrees and hash tables access to resources This lack of a common resource man

are two of the most common indexing mechanisms agement interface limits the opp ortunities for co de reuse

Another optimization p erformed by nearly all database and where only a subset of the database resource man

management systems is the buering of frequentlyused agement interface is implemented limits op erating system

data in memory to exploit spatial and temp oral lo cality functionality

in the stream of data references This is similar to the VINO Small a new op erating system under de

buering implemented in a traditional le system buer velopment at Harvard University exploits this common

cache except that databases typically implement more so ground b etween op erating systems and database systems

phisticated page replacement algorithms than the simple In VINO all resources are managed through a Univer

LRU algorithm used by most le system caches Chou sal Resource Interface URI This interface is explicitly

mo deled after the resource management techniques used

by database management systems The VINO URI is im

plemented by a set of kernel mo dules called resource man

A Universal Resource Interface

agers each of which manages access to a collection of ob

An op erating system controls many resources that are

jects via a set of interface routines This interface which

managed in the same manner as the data resources man

is standardized by the URI has functions to add or delete

aged by a DBMS Some of these resources are visible to

ob jects to retrieve ob jects via an asso ciative lo okup on

the users of the op erating system eg directories les

one or more keys and to return mo died ob jects to the

and some are not eg page tables scheduling queues

resource manager

Each of these resource types can b e viewed as a set of

The VINO URI also includes commands that can b e

similar ob jects New ob jects can b e added to the set and

ing these blo cks Note that only one le manager need used by the transaction service when b eginning commit

b e implemented individual les are treated as separate ting and rolling back transactions For example rather

instances of this manager reusing its co de with the ap than using careful write ordering Ganger or a sepa

propriate lesp ecic data Reading a le is implemented rate logging facility Chutani Kazar to maintain the

by retrieving the blo cks at the requested osets with read integrity of le system metadata op erations VINO uses

p ermission Writing a le is implemented by mo difying a general purp ose transaction mechanism The degree of

the le blo cks at the appropriate osets durability conferred by the transaction mechanism may b e

When a le is extended new blo cks must b e allo cated to tailored to the requirements of the resource manager in a

it These blo cks are added to the les le manager The manner similar to that used by QuickSilver Schmuck

new blo cks are allo cated from a storage manager a re

source manager representing a disk partition When a le

needs to allo cate new blo cks it requests the blo cks from

Implementing a File System

the storage manager where the le resides The blo cks

A traditional le system exp orts an interface similar to

are not returned to the storage manager until the le is

that of the universal resource interface Thus the le

truncated or deleted preventing blo cks from b eing simul

system makes an excellent introductory example for un

taneously allo cated to more than one le

derstanding the VINO URI

The URI mo del also allows us to sp ecify a proto col for

From the users p ersp ective all les are ob jects man

deciding when to grant shared or exclusive access to a

aged by a resource manager called the le system The

le ie when to allow read andor write system calls to

creat and unlink system calls corresp ond to adding and

overlap The UNIX Fast File System McKusick do es

deleting resp ectively a le ob ject Reading a le is anal

not p ermit shared access to les no more than one read

ogous to retrieving a le or a part of it from the le

or write request to a le is serviced at a time Clearly a

system for read access A write op eration acquires write

one writer multiple reader p olicy would p ermit greater

access to the le and up dates the le ob ject appropriately

le throughput This more complicated form of lo cking

Filenames serve as keys for referencing les A le re

is rarely implemented for le systems however since the

name op eration p erforms a transaction in which the old

gains in concurrency are not enough to warrant the eort

le is deleted and a new le with the same contents but

of implementing it The generalpurp ose resource man

a dierent name is created Since this happ ens in the

agement provided by the VINO URI allows co de to b e

context of a transaction if the system fails during the re

easily shared b etween resource managers Thus it is only

name op eration the le will either retain its old name

necessary to implement one solution to the readers and

or have the new name but it will b e imp ossible to wind

writers problem which can then b e used for managing

up in a state where the le has b oth names or neither

any resource where this mo del of shared access is appro

While to days le systems also provide this functionality

priate

they do so using fairly complicated sp ecialpurp ose co de

The division of the le system into separate resource

Because transactions are a fundamental part of the

managers also allows a variety of lo cking and concurrency

VINO mo del of resource management the transaction fa

management schemes The name manager can b e used

cility is available to user pro cesses for insuring the in

to prevent conicting read and write op erations on the

tegrity le data This allows a uniform recovery mech

granularity of entire les This would prevent an applica

anism to b e used by applications concerned ab out data

tion from reading any part of a le that was concurrently

integrity such as word pro cessors and source co de control

b eing written to If decisions ab out shared access are left

systems

to the le manager concurrency can b e handled on the

Although the le system has the outward app earance of

granularity of individual disk blo cks

a single resource manager it is actually implemented as

several distinct but co op erating resource managers The

le system namespace is implemented by the name man

Related Work

ager a resource manager that manages a collection of

les File names are used to index the les File creation VINO b orrows from b oth the database and op erating sys

and deletion corresp ond to requesting an addition to or tem communities for its design The Plan op erating sys

deletion from the name manager The open system call is tem Pres has a common interface for accessing most

implemented by requesting a lo okup from the name man services This interface is based on the le system inter

ager and returning the corresp onding le ob ject with read face found in other op erating systems and is not explicitly

andor write p ermission as requested in the open call designed to provide resource management

Each le is implemented by a le manager a resource QuickSilver Schmuck is a microkernel op erating sys

manager that manages the set of disk blo cks where the tem that uses transactions as a fundamental primitive

le resides The le oset is used as a key for retriev however they do not extend the use of this transaction

mechanism to applications Research and Development Vol No

There are a number of le systems Chutani January

Chang that have incorp orated logging to insure the

Chou Chou HongTai DeWitt David An Eval

integrity of their metadata Unfortunately these le sys

uation of Buer Management Strategies for

tems do not exp ort this transaction mechanism to user

Relational Database Systems Proceedings

applications or to other parts of the op erating system

of the Eleventh International Conference

kernel which are therefore forced to implement their own

on Very Large Database August pp

recovery mechanisms

Inversion Olson is a le system built on top of the

POSTGRES Stone database management system In

Chutani Chutani S Anderson O Kazar M Lev

version provides a transactionbased recovery mechanism

erett B Mason W Sideb otham R

for le data as well as for le system metadata It also

The Episo de File System Proceedings of

allows users to dene new les types and functions for op

the Winter Usenix Conference San

erating on them Users of Inversion can use the underlying

Francisco CA January

database to issue queries on the le systems contents and

Ganger Ganger G Patt Y Metadata Up date

metadata

Performance in File Systems Proceedings

VINO is not unique in separating naming from the le

of the First Usenix Symposium on Oper

storage service The Amo eba op erating system Mullen

ating System Design and Implementation

also divides traditional le system functionality along

Monterey CA November pp

these lines providing a directory service that maps names

to capabilities for le ob jects The le ob jects are man

Gray Gray J Lorie R Putzolu F and

aged by a separate service called the bul let service

Traiger I Granularity of lo cks and de

grees of consistency in a large shared

data base Modeling in Data Base Man

Conclusion

agement Systems Elsevier North Holland

New York

Traditional databases and op erating systems use a vari

Gray Gray J Notes on Database Op erating

ety of similar techniques for solving resource management

SystemsAn Advanced Course Springer

problems Over time databases have evolved a uniform

Verlag Lecture Notes in Computer Science

mo del for manipulating the range of resources they need

Volumne

to manage Op erating systems in contrast still use a

variety of ad ho c mechanisms and interfaces for resource

Gray Gray J Reuter A Transaction Pro

management We feel that the access metho d mo del for

cessing Concepts and Techniques Morgan

resource management developed for use in databases is

Kaufmann Publishers San Francisco CA

also appropriate for op erating system resource manage

ment

Kazar Kazar M Leverett B Anderson O

The VINO op erating system is designed to explore

Vasilis A Bottos B Chutani S Ev

this common ground b etween op erating systems and

erhart C Mason A Tu S Zayas

databases The VINO Universal Resource Interface pro

E DECorum File System Architectural

vides a uniform interface for resource management mo d

Overview Proceedings of the Sum

eled after the access metho ds used by databases

mer Usenix Anaheim CA June pp

References

McKusick Marshall Kirk McKusick William Joy

Sam Leer and R S Fabry A Fast File

Bern Philip A Bernstein Nathan Go o d

System for UNIX ACM Transactions on

man Concurrency Control in Distributed

Computer Systems Vol No August

Database Systems ACM Computing Sur

veys Vol No June pp

Mullen Sap e J Mullender Guido van Rossum An

drew S Tannenbaum Robb ert van Re

Chang Chang A Mergen M Rader R nesse and Hans van Staveren Amo eba

Rob erts J Porter S Evolution of stor A Distributed Op erating System for the

age facilities in AIX Version for RISC s IEEE Computer May pp

System pro cessors IBM Journal of

Olson Michael A Olson The Design and Imple

mentation of the Inversion File System

Proceedings of the Winter Usenix

Conference San Diego CA January

Ouster Ousterhout J Douglis F Beating the

IO Bottleneck A Case for Logstructured

File Systems Operating Systems Review

January

Pres Presotto D Pike R Trickey H and

Thompson K Plan a Distributed Sys

tem Proceedings of the Spring Eu

rOpen Conference Many

Schmuck Frank Schmuck Jim Wyllie Exp eriences

with Transactions in QuickSilver Proceed

ings of the th Symposium on Operating

System Principles Pacic Grove CA Oc

tob er

Small Chris Small Margo Seltzer VINO An

Integrated Platform for Op erating System

and Database Research Harvard Com

puter Science Technical Rep ort TR

Stone Michael Stonebraker Op erating System

Supp ort for Database Management Com

munications of the ACM Vol No

July pp

Stone Michael Stonebraker The Design of the

POSTGRES Storage System Proceed

ings th International Conference on Very

Large Data Bases Brighton England

September pp