

Citation: Small, Christopher, and Margo Seltzer. 1994. VINO: An Integrated Platform for Operating System and Database Research. Harvard Computer Science Group Technical Report TR-30-94.

Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:26506446


VINO: An Integrated Platform for Operating System and Database Research

Christopher Small and Margo Seltzer

Harvard University

Cambridge MA

{chris,margo}@das.harvard.edu

Harvard Computer Science Laboratory Technical Report TR-30-94

Abstract

In 1981, Stonebraker wrote:

    Operating system services in many existing systems are either too slow or inappropriate. Current DBMSs usually provide their own and make little or no use of those offered by the operating system. [STON]

The operating system model has changed little since that time, and we believe that at its core it is the wrong model for DBMS and other resource-intensive applications. The standard model is inflexible, uncooperative, and irregular in its treatment of resources.

We describe the design of a new system, the VINO kernel, which addresses the limitations of standard operating systems. It focuses on three key ideas:

• Applications direct policy.

• Kernel mechanisms are reusable by applications.

• All resources share a common extensible interface.

VINO's power and flexibility make it an ideal platform for the design and implementation of traditional and modern database management systems.

Introduction

In general, operating systems are designed for the least common denominator application. This has proven to be a poor match for resource-intensive applications such as database management systems. The result is that DBMS avoid using operating system services and reimplement them from scratch.

Stonebraker points out several ways in which the specific requirements of relational databases are poorly served by traditional operating system architectures, in areas including buffer pool management, file storage, and process scheduling. For example, the disk buffer management strategy commonly used by operating systems is LRU; this is suboptimal for most disk access patterns [CHOU].

DBMS have usually dealt with the architectural mismatch by evasion: they overcome operating system limitations by reimplementing and avoiding the use of operating system services wherever possible.

For example, rather than use the supplied filesystem, a DBMS will write directly to the raw disk [ORA, SYB]. The layout of the data on the partition and the caching of that data then fall under the control of the DBMS. However, because the memory of the database process is still managed by the operating system, in-memory copies of disk buffers are double buffered: they are stored by the DBMS in memory or on the database partition, and also by the operating system in the backing store for process memory. In addition, standard operating system tools (e.g., tools that display the contents of a file) cannot be used. From the operating system perspective this is akin to the management of BLOBs in relational databases: a BLOB is stored by the DBMS, but is not a first-class entity that can be manipulated using the standard DBMS interface. Furthermore, the allocation between the filesystem and the DBMS is completely static; the device containing the filesystem must be taken offline and rebuilt in order to change the size of a partition.

Transaction support is rarely provided by general-purpose operating systems; its availability at the user level is the exception rather than the rule (e.g., Quicksilver [HASK]). Instead, operating systems implement one or more special purpose mechanisms to support recovery. For example, most systems support some form of file system recovery, but this service is rarely exported to the user level, meaning that a DBMS must
implement its own recovery system. Recovery code is notoriously complex and is often the subsystem responsible for the largest number of system failures [SULL]. Supporting multiple recovery paradigms is likely to reduce total system robustness.

Object databases are presented with greater barriers to development and performance. Object stores that implement object faulting through the use of virtual memory management hardware support (e.g., [LAMB]) are faced with the performance impact of a protection fault on first access and, if the first fault was on a read access, a second on write access. An object database management system that is implemented using a client-server architecture can end up triple-paging: on the server, data is present in both the database and the server's swap file, and on the client it is present in the client cache or the client swap file.

At times a service provided by the operating system is frustratingly close to what is needed by a DBMS. For example, Skarra examined the feasibility of using standard Posix locking services in a DBMS [SKAR]. She found that there were several places where commercially available implementations did not correctly detect deadlock, that starvation of write lock requests can occur, that lock semantics are not consistent across different implementations, and that performance is inadequate for DBMS use.

The fundamental problem is that the services provided by the OS are not the services desired or needed by the DBMS.

We discuss related work and then the motivation for VINO. We next describe the system architecture, and then the specific VINO subsystems that are of use to a DBMS implementor. We conclude with sketches of VINO-based implementations of relational, object-relational, and object-oriented DBMS.

Related Work

Both DBMS and operating systems have confronted the problems that VINO addresses. Some DBMS implement a unified resource interface and extensibility; some operating systems also offer limited application control of policy.

Database Perspective

Relational database management systems separate the abstract interface for a tuple from its representation, normally as an entry in a B-tree or similar indexing structure. This data independence allows high-level tools such as query processors to work independent of the implementation (the internal level, in ANSI/SPARC terminology [DION]).

In a relational system, queries are expressed in terms of tuples and keys without regard for the underlying implementation or indexing structures. The query processor may take advantage of known strengths or weaknesses of various indexing methods, but it accesses the underlying relations through a simple tuple-oriented interface, using calls to fetch tuples by key, scan sequentially through a relation, or modify specific tuples. It is this abstract interface that we use as the foundation for VINO's resource interface. The system can take advantage of information about the underlying representation, but can function correctly at the abstract level.

The Illustra DBMS [ILLU] is a successor to the POSTGRES system developed at the University of California, Berkeley [STON]. It can be extended by adding user-written code to implement new data types. This code can be run in either the client process or the server process; if it is installed in the server, there is no guarantee that it will not corrupt the server or violate system security.

The Thor system developed at MIT [LISK] also allows the server to be extended by adding client code. To ensure that client code does not compromise the server, extensions must be written in a system-specific type-safe language, Theta [MYERS].

Operating System Perspective

Extensibility

• Newer operating systems have adopted the microkernel architecture [ACET], which allows the system to be extended by adding user-level servers to which the kernel can delegate responsibility. System services such as filesystem management can be delegated to external servers. A new server can be added that has the same interface as an existing one, but with an entirely different implementation. A database management system can provide a new filesystem server that can be manipulated using existing system tools, but is implemented in a manner more appropriate for the DBMS's use.

The Camelot distributed transaction processing system [EPPI] takes advantage of this architecture and provides a set of Mach processes to support nested transaction management, locking, recoverable storage allocation, and system configuration. Most of the mechanisms required to support transaction semantics are implemented at user level. The resulting system can be used by any application, not just clients of a Camelot data manager.

• Mach VM. The Mach virtual memory system can delegate both storage and replacement policy decisions to user-level servers. In the Mach virtual memory management system [RASH], memory objects are associated with a pager. The pager
is responsible for implementing the backing store for virtual memory.

The initial pager design did not allow pagers to control eviction policy. McNamee and Armstrong extended the interface to allow user-level control of replacement policies [MCNAM]. These extensions allow an application to specify the policy to use when selecting a page for eviction.

• Vnode file system. The VFS model [KLEI] abstracts the implementation details of a file system, allowing the rest of a Unix(1) kernel to interact with it through a fixed interface of approximately forty operations. VFS simplifies the task of adding a new filesystem type to a Unix system. The new filesystem is fully described by the implementation of its interface functions. These functions are then compiled into the kernel and added to a dispatch table.

(1) Unix is a trademark of X/Open.

Unified Resource Interface

• Plan 9, an operating system developed at Bell Laboratories [PRES], offers a unified resource interface through the filesystem namespace and the ability to extend the system by attaching a server process to the namespace.

Application Control of Policy

Two policy decisions are critical to the performance of an I/O-intensive application: what data is cached in internal system buffers, and when it is read and written.

• Cao et al. have investigated separating resource allocation decisions from resource replacement decisions in the context of a filesystem buffer cache [CAO]. Allocation decisions (how many buffers are available for use by each process) are made by the kernel, based on its desire to share limited resources fairly. Replacement decisions (which buffer to evict) are delegated to the application program that owns the buffers.

• Transparent Informed Prefetching [PATT] is based on the idea that application-level access pattern hints to the operating system can be more useful and take less effort than having the operating system attempt to infer future access patterns from past behavior. With better hints, the operating system can have a larger pool of outstanding requests over which to optimize I/O access, and can better overlap disk access with processing time.

• Asynchronous I/O. A non-blocking read can be used to prime the buffer cache with data before it is needed; a non-blocking write allows an application to queue a write request and continue to make progress while the write takes place [SOLA].

There are disadvantages to both techniques. When using non-blocking reads for read-ahead, the act of bringing in a block that will be needed in the future may push out a block needed in the present. A non-blocking write may fail due to lack of disk space or system failure, or a sequence of writes may not take place in the order requested by the application, which in the event of a crash can make recovery impossible.

Other Extensible Operating Systems

There are several other extensible operating system projects underway, including the SPIN project at the University of Washington [BERS] and the Aegis project at MIT [ENGL]. These operating systems allow applications to control general policy decisions.

Both SPIN and Aegis provide a minimal kernel and focus on providing facilities for extending the kernel through the use of a system-specific extension language. Both are investigating run-time code generation to optimize kernel extensions.

The SPIN project concentrates on the idea of directly extending a Mach-style microkernel model by allowing servers to be installed in the operating system kernel. The developers of Aegis have stripped the kernel to a bare minimum, with only five system calls. Standard resources are built from these atomic components.

VINO Motivation

VINO is a new kernel architecture designed around the notion of a permeable barrier between the operating system and applications. Applications are empowered to direct resource management policy decisions.

VINO has three goals: to allow applications to specify the kernel policies that manage resources, to make kernel primitives accessible at user level, and to provide a universal system interface designed around the acquisition of hardware and software resources. These goals combine to provide the flexibility and extensibility needed for effective support of a DBMS.

Application-Directed Kernel Policy

The problems cited in Stonebraker and discussed in the introduction result from applications having no input into the policies used to manage their resources. Suboptimal buffer management results when the kernel uses LRU globally; inappropriate process scheduling results when a single criterion is applied for all applications; poor I/O
performance results when the operating system assumes incorrect access patterns for data. The motivation for Application-Directed Kernel Policy is to enable applications to specify their own resource management policies, leaving the kernel with the task of arbitrating between competing policies.

Chou and DeWitt [CHOU] characterized disk access patterns for the different phases of operation of a relational database. Some blocks are reused; others are not. For example, when performing a nested loop join, the outer relation is scanned sequentially, so no caching is required, while the inner relation is scanned repeatedly and so should be cached.

A standard assumption made by operating systems is that file access is either sequential or random. When a block is read from a file that is being sequentially accessed, an operating system may read ahead the following block or blocks. No prefetching is done on a file that is being accessed randomly.

The operating system makes the simplifying assumption that there are only two models of file access behavior. DBMS access patterns do not consistently match either one, and are not well served by the simplification.

Typically a DBMS knows in advance which blocks will be needed soon, and could inform the operating system of its future requirements. There have been numerous attempts to empower applications with the ability to control prefetching and page replacement [CAO, MCNAM, PATT, HART, APPEL]. All of these approaches provide potential solutions, but the solutions apply only to buffer management and require the implementation of special purpose mechanisms. In VINO, application-specific tailoring is the norm, not the exception. Every resource policy decision can be modified at the discretion of the application.

Reusable Kernel Tools

The requirements of an operating system are in many ways the same as those of a database management system: both arbitrate access to shared resources. In the case of an operating system, those resources are things like physical memory, devices, and processor cycles. In a DBMS, those resources are data. Most conventional operating systems implement services such as synchronization primitives, log management, and buffer management. These services are buried inside the kernel, unavailable for application use. VINO is designed around the idea that any service used by the kernel should be accessible to user-level applications.

For example, a filesystem stores both data (file contents) and metadata. When the metadata is modified, the modifications must be written out in such a way that the metadata can be recovered in case of a crash. Some filesystems carefully order writes (e.g., the Berkeley Fast Filesystem [MCKU]); others use a write-ahead log for the metadata [CHUT, CHANG]. Still other file systems use a log-structured representation for the entire file system [ROSE]. While all these systems provide transaction semantics to filesystem operations, none export this functionality to user level.

In VINO, general-purpose log management code is used to facilitate the management of filesystem metadata and is also available for use by applications. A DBMS does not have to implement its own log manager, and DBMS recovery is initiated as part of the standard system recovery process.

Universal Resource Access Interface

In order to make VINO's reusable kernel tools as widely applicable as possible, VINO provides a universal resource access interface to all system resources. The interface provides a simple, regular, abstract interface to system resources, independent of the representation of the resource.

Relational systems use this model to represent data as tuples, whether the data is stored in a B-tree, hash table, or other structure. Operating systems use this model as well: for example, the Network File System, the Berkeley Fast Filesystem, and the Berkeley Log-structured Filesystem [SELT] provide the same interface to application programs, but the underlying representations are very different. Network sockets and disk files are both accessed through the use of file descriptors, although they are supported by entirely different implementations.

Some systems (e.g., Plan 9 [PRES]) choose a single compromise interface for all resources. The disadvantage of choosing one interface for all resources is that each must implement the complete interface and none can extend it. VINO provides a more general hierarchical resource model that allows behavior to be shared or factored out as appropriate through the use of resource subtyping.

VINO Architecture

The VINO operating system architecture is composed of resources (objects, instances) and their types (classes). The resource type hierarchy is a static single inheritance hierarchy, specified at system build time. To allow flexibility, individual resources can override or augment their implementation at runtime.

Each resource type defines an interface consisting of a set of operations (functions, member functions, message handlers) and properties (data members, slots). The resource type provides a default implementation for each operation: a piece of code that is run when the operation is invoked on a resource of that type.

The resource type hierarchy is completely specified at the time VINO is compiled; all VINO resources are described by the standard types. The VINO distribution includes resource types for system resources, e.g., files, directories, memory, processes, threads, transactions, virtual memory pages, physical memory pages, synchronization primitives, and queue management.

Addition of New Types

We restrict the flexibility of the resource schema by requiring that it be specified at system build time. This provides performance and robustness superior to a system with a dynamic schema, because it allows static type checking and operation dispatch. It is, however, a functional limitation, and we must weigh its benefits against the liabilities introduced.

A new resource type is introduced by subtyping an existing type. For example, the VideoFile resource type may be subtyped by QuickTimeFile. The new type could have the same interface as a VideoFile, or it may augment the interface by adding new operations and properties.

A resource type is added to VINO by compiling any new or modified kernel code and relinking the kernel. If the new type does not extend the interface of its supertype, it can be linked into the kernel without changing or recompiling any VINO source code. In this case, we can dynamically link the new resource type directly into the running kernel, just as device drivers are dynamically linked into current systems.

In a research environment the requirement of relinking the VINO kernel to add a new resource type is at worst an inconvenience. In a production environment, we feel that new resource types will be added infrequently.

For example, the standard interface would support a device with a straightforward serial interface, such as audio or video. Support of synchronized audio and video streams would entail the addition of substantial new functionality, a new interface, and rebuilding the VINO kernel.

Overriding Implementation

Application control of policy is accomplished by overriding the default implementation of a resource. This is done by grafting a new implementation of an operation into VINO. The graft applies to a single resource owned by the grafting application.

The graft is loaded dynamically into the kernel and will be used in the place of the default implementation provided by the resource type. Alternatively, a graft can be wrapped around the default implementation, adding a bit of pre- or post-processing to the standard behavior.

VINO places extension code into the kernel, rather than leaving it in user space as is done in Mach. Research has shown that the cost of crossing protection boundaries between user and system (or user-system-user, as in the case of a Mach external server) is very high [MOGUL, CHEN, GUIL]. By placing extension code in the kernel, we decrease the number of protection boundary crossings, which improves performance. Additionally, graft implementations frequently use information gathered from the kernel; by placing grafts in the kernel, the bandwidth of the channel between the graft and the kernel is increased.

Graft Safety

When adding user-written code to a running system, the safety of the code comes into question (not to mention the sanity of the system architects). How reasonable is it to allow random bits of code to be added to a running operating system?

The safety of a graft can be divided into two parts: its behavior relative to the application, and its behavior relative to the rest of the system.

In the first case, keep in mind that an application can only graft its own resources; a graft cannot directly affect other applications. If a graft does not implement the standard semantics for an operation, only the application that installed the graft will suffer.

In the second case, we are concerned with whether the graft is benign or malicious with respect to the system. Code can attack the system in different ways; we are concerned with code that attempts to deny resources to other applications (e.g., through not giving up the processor in a timely fashion), that overwrites the kernel's private data structures, calls a privileged kernel function, or in some other way violates system security.

We ensure safety of grafts through controlling the graft compilation environment. Grafts are written in C or C++ and compiled by our graft compiler, which is based on work underway at Harvard. The graft compiler inserts range checks on all memory accesses, including function calls. The checks ensure that a graft does not stray outside its bounds or modify itself to do so.

A second scheme that ensures safety, sandboxing, was proposed by Wahbe et al. [WAHBE]. Sandboxing assigns a range of memory to the code and data of each graft. Instructions are inserted into the code which force all memory references to fall within the assigned segments. Sandboxing masks faults instead of detecting them, but it is more efficient than range checking.

Code generated by the graft compiler is marked with an encrypted fingerprint that is effectively impossible to forge [RABIN]. A fingerprint is a type of digital signature; if the fingerprint for a file is valid, its contents have not been modified.

In order to ensure that a graft does not keep a critical system lock for an extended period of time, each graft invocation is performed in the context of a transaction. When a lock is acquired for a critical resource, the lock is assigned an expiration time. If the graft does not
release the lock before it expires, the graft's transaction is aborted.

The idea of a graft is not new; it is found in both operating systems (e.g., dynamic loading of device drivers in Solaris) and database systems (e.g., Illustra/POSTGRES and Thor). Our use differs in that grafts are used not only to augment the system, but to adapt VINO to meet the specific needs of an application. Grafts can be used to make incremental changes to VINO.

For example, a buffer management graft can be used to control the caching behavior for an application by modifying the buffer cache resource assigned to the application. The rest of the buffer management code and the filesystem code can be left unchanged. Under Mach, an application would need to develop a complete new server.

Dynamic Augmentation

In some rare situations an application may need to create a resource that has operations not defined on its resource type. For example, an application may encrypt its data files; the graft code that reads the file will need to know the key. Because there is no interface on the File resource type to specify an encryption key, such a resource would need an augmented interface with an additional operation.

We provide an escape mechanism that allows new operations to be added to a resource on-the-fly. These operations will not be invoked by the kernel, and applications have complete control over how they are used. A set of operations is provided by the root resource type that can be used to add, remove, and invoke dynamic augmentation operations.

Resource Allocation

One of the primary jobs of an operating system is to arbitrate and abstract resource access. Some devices, such as physical memory, are shared and preemptible; others, such as disk space or a serial port, are not.

Some applications require service guarantees; e.g., an application displaying real-time video using a double-buffered display needs to be scheduled thirty-two times a second and have physical memory large enough to hold two copies of the displayed image. A query processor can tune its join algorithm to the amount of physical memory available for its use if it can assume that the memory, once allocated, will not be taken away.

Such applications can make hard resource requests, where no less than the minimum requested resources will be allocated, and once allocated they will not be preempted. If a new hard request will exceed the physical resources of the system, VINO will not grant the request. A hard request can be thought of as application-specified entrance criteria; if the resource cannot be allocated, the application can choose not to proceed.

In order to ensure fairness of allocation, an application must be privileged in order to make hard requests. Applications with less stringent requirements can make soft requests, specifying a preferred minimum resource allocation. If the sum of the soft requests exceeds the system resources, VINO will arbitrate between the requesters, sharing the resources available.

VINO Services

VINO consists of a framework for implementing services and a collection of services that implement standard operating system behavior. Many of these implement behavior that is shared by operating systems and databases. They can be used as-is or augmented to implement the underlying system support for a database management system.

Synchronization Primitives

The operating system's synchronization primitives are typically much simpler and more efficient than those provided to user-level applications. For example, System V style user-level semaphores incur a large number of system calls and context switches, while simple locks are virtually free [SELT]. VINO provides a kernel lock manager accessible to resources.

In its simplest form, the lock manager provides spin lock synchronization on memory locations, requiring kernel intervention only in the case of a contested lock. This interface is available both to the kernel and to applications.

As the resources being locked become more complex, so does the locking paradigm. The VINO lock manager supports general-purpose hierarchical locking [GRAY]. For example, the file system typically requires locking on block, file, directory, and file system levels. In most conventional kernels, this hierarchy is enforced by convention. In VINO it is enforced by design.

We call the various levels at which locking may be needed the containment hierarchy. When a lock is requested from the lock manager, the manager consults the resource's containment hierarchy to determine if the lock may be granted. Applications using the lock manager can define alternate containment hierarchies and make them available to the lock manager (e.g., a DBMS might create a logical containment hierarchy of database, relation, tuple, and field). Additionally, applications can create new instances of a lock manager that enforce alternate locking protocols (e.g., alternate deadlock handling, blocking vs. non-blocking, or new locking modes).

Finally, by integrating the kernel and user level locking systems, concurrency can be increased. For example, a DBMS running on a conventional Unix file system may
implement its own lo ck manager and issue multiple IO arbitrarily complex transaction proto cols

requests to the same le Unfortunately the Unix le sys Because the implementation of the transaction manager

tem exclusively lo cks the le during each IO op eration so can b e incrementally mo died dierent transaction

that no concurrency is achieved even though the DBMS is matics eg as outlined in BILI can b e implemented

already ensuring the integrity of the op eration In VINO as needed by applications

since the same lo ck manager handles b oth DBMS requests

and IO requests lo cks held by the DBMS are strong

Name Service

enough to perform I/O, and no additional locking is required by the file system.

Log Management

VINO provides a simple log management facility that is used by the file system and the transaction system, and is accessible to applications as well.

A log resides on one or more physical devices. It can be created on a single device or extended onto a second device, not necessarily of the same type as the first. For example, a DBMS might request a log that spans both magnetic disk and other media (e.g., tape or optical disk). The kernel requests a volatile in-memory log to support transactions on ephemeral data, such as process structures and buffer cache metadata. The file system requests a log that spans non-volatile RAM (NVRAM) and disk; file system log records are written first to NVRAM and are later written to disk in large, efficient transfers.

The key interface to the log facility is the read/write interface, which provides the essential information for write-ahead logging: reads are by log sequence number, and writes return log sequence numbers. It also supports a WAL operation for write-ahead log synchronization and a checkpoint operation for log reclamation and archival. As logs can expand and contract, the partitioning between log resources and other data resources is not static, but can change as system demands fluctuate.

Transaction Manager

The VINO kernel uses transactions to maintain consistency during updates to multiple related resources (e.g., a directory and its contents). For example, the carefully ordered writes of FFS can be reimplemented as a simpler series of unordered writes encapsulated in a transaction.

The transaction interface supports the standard transaction-begin, transaction-commit, and transaction-abort interfaces. It accepts references to appropriate log and lock manager instances to use for each transaction. At transaction begin, a new transaction resource is created. This resource references the appropriate log and lock managers and is referenced by each protected update. Most kernel transactions are protected using a simple shadow-resource scheme with a log residing in main memory, either volatile or non-volatile, depending on the resources being protected. Logging and locking components can be mixed and matched per transaction.

Persistent objects are denoted by a (manager, storageid) pair. A manager is a kernel resource that implements a file-like interface for each of the resources under its control. A manager can be thought of as a file system, and a storageid as a fileid relative to that file system. Equivalently, a manager can be thought of as a database, and the storageid is the key of the corresponding record or object.

The name service maps a name in a global hierarchical namespace to a (manager, storageid) pair. Any resource that can be accessed via standard file-like operations (e.g., read and write) can be registered with the name service and made available through it.

The namespace is separate from the storage managers: adjacent names in a directory can belong to different managers. For example, in the directory /home/chris one might find three entries, among them time and todaytodo. The first, time, would be read-only and managed by the time-manager; it responds to read requests by returning the current time of day. The second would be stored by a file manager on a disk. The third, todaytodo, would be a reference to a database, and when read would present the result of a query that determined what was on today's to-do list.

File Service

The VINO file service is a manager that implements a fairly conventional FFS-style file system. It uses the VINO log manager to store its metadata, simplifying recovery in the event of a crash. Its persistent storage is provided by the extent manager, which arbitrates requests made for disk space from the different services that offer persistent storage.

Memory Management

The VINO memory management system is based on the ideas of the Mach VM architecture (described elsewhere in this report), although its implementation differs considerably.

A MemoryResource is a collection of pages. As in Mach, it is backed by a file mapped into memory. It includes operations to read pages from and write pages to a backing store.

When a page fault takes place, VINO determines which MemoryResource, if any, is assigned to the virtual memory page containing the faulting address. A request is made of the MemoryResource, with the address at which to write the requested page.

Unlike a Mach pager, the MemoryResource is entirely in the kernel. When a page fault occurs, which causes a trap into the operating system, it is not necessary to go back across the protection boundary from the kernel to the user level.

Also, the Mach architecture requires that a new external pager be written for each kind of behavior needed. VINO allows each MemoryResource to use as much or as little of the standard implementation as is appropriate; the application need only override or augment the operations it wants to change.

The AddressMapResource is patterned after the Mach object of the same name. Each address space has an associated AddressMapResource, which contains a mapping between physical pages and virtual memory pages. When VINO determines that a mapping needs to be removed, either because of a virtual memory fault or because the number of physical pages assigned to the address space is being decreased, it invokes the ChooseVictim operation defined on AddressMapResource. By default, ChooseVictim selects the least recently used page, although an application can replace the implementation of ChooseVictim with the algorithm of its choice.

Note that, as in Cao's work (also described elsewhere in this report), VINO retains control over the number of mappings allocated to an address space, but not over the mappings themselves. The former is not under the control of an application, in order to ensure fairness of allocation; the management of the mappings is delegated to each application.

VINO As a Platform For Database Development

Our motivation for developing VINO is to support resource-intensive applications, such as DBMS, better than conventional operating systems. To that end we include high-level sketches of how the implementation of relational, object-relational, and object-oriented DBMS would change when built on VINO.

Relational Database

A relational database storage system consists of several components. We sketch how each would be implemented on VINO.

• File Management: data can be stored using the VINO file system, or a new storage manager can be developed that allocates and manages disk space. If the latter course is taken because the file system allocation or management strategy is inappropriate for the DBMS, the FileSystemManager resource type can be subtyped to change it appropriately, or a new direct subtype of Manager can be written. Because the database storage is implemented using the Manager resource type, the database can be manipulated using standard utilities.

• Buffer Management: each process has associated with it one or more MemoryResources. A MemoryResource can be assigned to map a portion of the DBMS File resource into the address space of the DBMS. If the database is larger than the virtual address space of the process, the MemoryResource type can be subtyped (or its implementation overridden) to bank-switch pages of the underlying database into the process address space. The DBMS can control the decision of which disk pages to keep in physical memory by implementing a DBMS-specific ChooseVictim operation for the MemoryResource.

• Log Management: the VINO Log Manager can be used to maintain the database log. The physical storage of the log can be managed by the VINO file system, the DBMS, or a third service.

• Lock Management: the general-purpose hierarchical locking primitives provided by the VINO Lock Manager can be used directly by the DBMS.

• Transaction Management: transaction semantics can be supplied by the provided transaction manager, or extended to support extended transaction models.

• Recovery Subsystem: if the VINO transaction and log managers are used to implement the DBMS, no separate DBMS recovery subsystem is needed. After a crash, the VINO recovery system will traverse the database log and clean up after any transactions that were in progress at the time of the crash.

Object-Relational Database

The storage and consistency requirements of an object-relational DBMS (e.g., Illustra/POSTGRES) are similar to those of a relational DBMS. The primary difference between the two is support for data types not normally supported by relational systems.

Illustra/POSTGRES can be extended to support new data types through the use of DataBlades (see footnote 2). A DataBlade is loaded into the client or server and implements the operations necessary to manipulate the data type.

Footnote 2: This is a reference to the razor-and-blades strategy of shaving equipment manufacturers: much more money is made from selling replacement blades than from the initial razor sale.

The DataBlade model maps very cleanly onto the VINO Resource Type model. A relation would be modeled as a collection of Tuple resources; each Tuple would be a collection of Field resources. If a client application wished to extend the system, it would install a new Field resource type, either in the client or (through negotiation with the) server. The new type would implement the abstract behavior shared by all fields (i.e., get/set value, compare value with another Field), and would respond to requests to build and maintain type-specific indexes as needed.

Object Page Server

An object page server [DEWI] serves pages of objects to client applications. Several current database systems (e.g., ObjectStore [LAMB] and QuickStore [WHITE]) are based on this model.

Object references are represented by virtual memory pointers. If an object is not in the client cache, its address is unmapped (invalid). When a client dereferences an unmapped pointer, a virtual memory fault takes place. The fault causes a trap to the kernel, which passes it on to an application-level signal handler. The signal handler determines that the fault is to an object not in the client's cache, and makes a request of a separate server process. If the page is not in the server's cache, the server reads the page from disk and then passes it to the client handler. The handler installs the page in the client's memory and restarts the client at the point of the fault.

The type of access made to the object is implicit, not explicit: the instruction that caused the fault was either a read or a write instruction. In order to determine whether a read or write lock is needed, the handler must find the faulting instruction and inspect it. If a read lock is obtained, the page must be marked read-only. If the application then writes to the object, a second fault takes place, the client database code knows that the page was modified, and the read lock can be upgraded to a write lock.

Each fault, which entails multiple user-to-kernel domain crossings, is expensive. Under VINO many of these domain crossings are unnecessary. The MemoryResource for the OODBMS virtual memory can be grafted to handle a fault directly; no dispatch to user level is needed. If the fault requires that the server be contacted, the graft can do this directly from the kernel. Modification of virtual memory is done without performing a system call.

It can take less time to transfer a page of data over the network than to read it from a local disk [NELS]. If the server has a page in its in-memory cache, a client need not write an unmodified page to disk when it is evicted. Each client cache would store only modified pages, and the server would be responsible for storing a single shared copy of the unmodified pages.

Conclusions

The operating system and database communities have been separately addressing the problems of inflexibility, uncooperativity, and irregular treatment of resources. Rather than ignore each other, the best way to accomplish the shared goals is to work together. A simple, regular architectural model is needed to support resource-intensive applications.

When analyzing a hardware design, an engineer describes a badly aligned interface as an impedance mismatch. We have the same problem in software: the operating system interface is inappropriate for DBMS, and for any other system that isn't the least common denominator.

We believe that the VINO architecture is a good step towards this needed cooperation, and is the right architecture for further research into DBMS and operating system structure.

References

[ACET] Accetta, M., R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young, "Mach: A New Kernel Foundation for UNIX Development," Proceedings of the Summer Usenix Conference, July.

[APPEL] Appel, A., K. Li, "Virtual Memory Primitives for User Programs," Proceedings of ASPLOS IV.

[BERS] Bershad, B., C. Chambers, S. Eggers, C. Maeda, D. McNamee, P. Pardyak, S. Savage, E. Sirer, "SPIN: An Extensible Microkernel for Application-specific Operating System Services," University of Washington Technical Report, February.

[BILI] Biliris, A., Dar, S., Gehani, N., Jagadish, H. V., and Ramamritham, K., "ASSET: A System for Supporting Extended Transactions," Proceedings of SIGMOD, Minneapolis, MN, May.

[CAO] Cao, P., Felten, E., and Li, K., "Application-Controlled File Caching Policies," Proceedings of the Winter Usenix Conference, June.

[CHANG] Chang, A., Mergen, M., Rader, R., Roberts, Porter, S., "Evolution of storage facilities in AIX Version 3 for RISC System/6000 processors," IBM Journal of Research and Development, January.

[CHEN] Chen, B., and Bershad, B., "The Impact of Operating System Structure on Memory System Performance," Proceedings of the SOSP, Asheville, NC, December.

[CHOU] Chou, H., and D. DeWitt, "An Evaluation of Buffer Management Strategies for Relational Database Systems," Proceedings of VLDB, Stockholm, Sweden.

[CHUT] Chutani, S., O. T. Anderson, M. L. Kazar, B. W. Leverett, W. A. Mason, and R. N. Sidebotham, "The Episode File System," Proceedings of the Winter Usenix Conference, January.

[DEWI] DeWitt, D., D. Maier, P. Futtersack, and F. Velez, "A Study of Three Alternative Workstation-Server Architectures for ODBMS," Proceedings of the VLDB.

[DION] Dionysios, C. T., and A. Klug, eds., "The ANSI/X3/SPARC DBMS Framework: Report on the Study Group on Data Base Management Systems," Information Systems.

[ENGL] Engler, D., M. F. Kaashoek, and J. O'Toole, "The Operating System Kernel as a Secure Programmable Machine," Proceedings of the Sixth SIGOPS European Workshop, September.

[EPPI] Eppinger, J., L. Mummert, and A. Spector, eds., Camelot And Avalon: A Distributed Transaction Facility, Morgan Kaufmann, San Mateo, CA.

[GRAY] Gray, J., Lorie, R., Putzolu, F., and Traiger, I., "Granularity of locks and degrees of consistency in a large shared data base," Modeling in Data Base Management Systems, Elsevier North Holland, New York.

[GUIL] Guillemont, M., Lipkis, J., Orr, D., Rozier, M., "A Second-Generation Micro-Kernel Based UNIX: Lessons in Performance and Compatibility," Proceedings of the Winter Usenix Technical Conference, January.

[HASK] Haskin, R., Y. Malachi, W. Sawdon, and G. Chan, "Recovery Management in Quicksilver," ACM Transactions on Computer Systems, February.

[HART] Harty, K., Cheriton, D., "Application-Controlled Physical Memory using External Page-Cache Management," Proceedings of ASPLOS V, Boston, MA, October.

[ILLU] Illustra User's Guide, Illustra Server Release, Illustra Information Technologies, Inc., Oakland, CA, June.

[KLEI] Kleiman, S. R., "Vnodes: An Architecture for Multiple File System Types in Sun UNIX," Proceedings of the Summer Usenix Conference, June.

[LAMB] Lamb, C., G. Landis, J. Orenstein, and D. Weinreb, "The ObjectStore Database System," Communications of the ACM, October.

[LISK] Liskov, B., M. Day, and L. Shrira, "Distributed Object Management in Thor," in Distributed Object Management, Morgan Kaufmann, San Mateo, California.

[MCKU] McKusick, M., Joy, W., Leffler, S., Fabry, R., "A Fast File System for UNIX," ACM Transactions on Computer Systems, August.

[MCNAM] McNamee, D., and K. Armstrong, "Extending The Mach External Pager Interface To Accommodate User-Level Page Replacement Policies," Proceedings of the Usenix Mach Workshop, Burlington, VT.

[MOGUL] Mogul, J., and Borg, A., "The Effect of Context Switches on Cache Performance," Proceedings of ASPLOS IV.

[MYERS] Myers, A., "Resolving the Integrity/Performance Conflict," Fourth Workshop on Workstation Operating Systems, October.

[NELS] Nelson, M., Welch, B., and Ousterhout, J., "Caching in the Sprite Network File System," ACM Transactions on Computer Systems, February.

[ORA] Administrator's Guide, Oracle Corporation, April.

[PATT] Patterson, R., G. Gibson, M. Satyanarayanan, "A Status Report on Transparent Informed Prefetching," Operating Systems Review, April.

[PRES] Presotto, D., R. Pike, H. Trickey, and K. Thompson, "Plan 9, a Distributed System," Proceedings of the Spring EurOpen Conference, May.

[RABIN] Rabin, M., "Fingerprinting by Random Polynomials," Harvard University Center for Research in Computing Technology, Technical Report.

[RASH] Rashid, R., A. Tevanian, M. Young, D. Golub, R. Baron, D. Black, W. Bolosky, and J. Chew, "Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures," IEEE Transactions on Computers, August.

[ROSE] Rosenblum, M., Ousterhout, J. K., "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, February.

[SELT] Seltzer, M., Olson, M., "LIBTP: Portable, Modular Transactions for UNIX," Proceedings of the Winter Usenix Conference, San Francisco, CA, January.

[SELT] Seltzer, M., K. Bostic, M. K. McKusick, and C. Staelin, "An Implementation of a Log-structured Filesystem for UNIX," Proceedings of the Winter Usenix Conference, January.

[SKAR] Skarra, A., "Using OS Locking Services to Implement a DBMS: An Experience Report," Proceedings of the Summer USENIX Conference, Boston, MA, June.

[SOLA] Solaris User's Guide.

[STON] Stonebraker, M., "Operating System Support for Database Management," Communications of the ACM, July.

[STON] Stonebraker, M., and L. Rowe, The POSTGRES Papers, UCB/ERL Memorandum, Electronics Research Laboratory, University of California, Berkeley, June.

[SULL] Sullivan, M., and R. Chillarege, "Software Defects and Their Impact on System Availability: A Study of Field Failures in Operating Systems," Digest, International Symposium on Fault Tolerant Computing, June.

[SYB] Sybase Administration Guide, Sybase Corporation, May.

[WAHBE] Wahbe, R., Lucco, S., Anderson, T., and Graham, S., "Efficient Software-Based Fault Isolation," Proceedings of the SOSP, Asheville, NC, December.

[WHITE] White, S., DeWitt, D., "QuickStore: A High Performance Mapped Object Store," Proceedings of SIGMOD, Minneapolis, MN, May.
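The read/write, WAL, and checkpoint operations described for the log facility can be sketched roughly as follows. This is an illustrative model only: the class name SimpleLog, its method names, and the in-memory dictionary are assumptions made for the example and are not taken from VINO. The rule that a checkpoint never reclaims records that have not yet been forced is likewise an assumption.

```python
class SimpleLog:
    """Toy in-memory log; names and layout are hypothetical, not VINO's."""

    def __init__(self):
        self._records = {}      # LSN -> record payload
        self._next_lsn = 0      # LSN handed out by the next write
        self._stable_lsn = -1   # highest LSN forced to "stable storage"
        self._first_lsn = 0     # oldest LSN not yet reclaimed

    def write(self, payload: bytes) -> int:
        """Append a record and return its log sequence number (LSN)."""
        lsn = self._next_lsn
        self._records[lsn] = payload
        self._next_lsn += 1
        return lsn

    def read(self, lsn: int) -> bytes:
        """Read a record by LSN; reclaimed or unknown LSNs raise KeyError."""
        if lsn not in self._records:
            raise KeyError(f"LSN {lsn} is not in the log")
        return self._records[lsn]

    def wal(self, lsn: int) -> None:
        """Write-ahead synchronization: force every record up to `lsn`."""
        if lsn >= self._next_lsn:
            raise ValueError("cannot force a record that was never written")
        self._stable_lsn = max(self._stable_lsn, lsn)

    def is_stable(self, lsn: int) -> bool:
        return lsn <= self._stable_lsn

    def checkpoint(self, lsn: int) -> int:
        """Reclaim records older than `lsn`; return how many were dropped."""
        # Assumption: records that were never forced are not reclaimed.
        upto = min(lsn, self._stable_lsn + 1)
        dropped = 0
        for old in range(self._first_lsn, upto):
            del self._records[old]
            dropped += 1
        self._first_lsn = max(self._first_lsn, upto)
        return dropped


log = SimpleLog()
first = log.write(b"update inode 7")
second = log.write(b"update directory block 3")
log.wal(second)                      # force both records before the data pages
reclaimed = log.checkpoint(second)   # the record at `first` is no longer needed
```

A caller doing write-ahead logging would keep the LSN returned by each write and call wal with that LSN before letting the corresponding data page reach disk.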
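The Buffer Management item in the list above hinges on replacing ChooseVictim. A minimal sketch of that idea follows, assuming a default least-recently-used policy that a DBMS overrides with a most-recently-used policy. MRU is used here only as an example of an application-chosen algorithm and is not prescribed by the text. The Python class and method names are hypothetical stand-ins for VINO's resource types. The frame limit is passed in by the caller, mirroring the point that the system, not the application, decides how many mappings an address space receives.

```python
from collections import OrderedDict


class AddressMap:
    """Toy address map with a replaceable victim-selection policy."""

    def __init__(self, frame_limit):
        self.frame_limit = frame_limit    # fixed from outside, not by the policy
        self.faults = 0
        self._resident = OrderedDict()    # pages ordered oldest use first

    def choose_victim(self):
        """Default policy: the least recently used page."""
        return next(iter(self._resident))

    def touch(self, page):
        """Access `page`; return the evicted page, or None if none was evicted."""
        if page in self._resident:
            self._resident.move_to_end(page)   # now the most recently used
            return None
        self.faults += 1
        victim = None
        if len(self._resident) >= self.frame_limit:
            victim = self.choose_victim()
            del self._resident[victim]
        self._resident[page] = True
        return victim

    def resident_pages(self):
        return list(self._resident)


class MruAddressMap(AddressMap):
    """Hypothetical DBMS override: evict the most recently used page."""

    def choose_victim(self):
        return next(reversed(self._resident))


def replay(address_map, accesses):
    for page in accesses:
        address_map.touch(page)
    return address_map


scan = [1, 2, 3, 4] * 2    # cyclic scan of 4 pages through 3 frames
lru = replay(AddressMap(frame_limit=3), scan)
mru = replay(MruAddressMap(frame_limit=3), scan)
```

On this cyclic scan the default policy faults on all eight accesses, while the override faults five times. Only choose_victim differs between the two classes; the rest of the standard implementation is reused unchanged.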
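The Field extension mechanism described for the object-relational case might look roughly like this. The sketch assumes an abstract Field base type whose shared behavior is get/set value and comparison with another Field, plus a client-defined PointField subtype. All names are hypothetical. Ordering points by squared distance from the origin is an arbitrary choice for the example, and index maintenance is reduced to producing a sort key.

```python
from abc import ABC, abstractmethod


class Field(ABC):
    """Abstract behavior shared by all fields (names are hypothetical)."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value

    def set(self, value):
        self._value = value

    @abstractmethod
    def index_key(self):
        """Key used when building a type-specific index."""

    def compare(self, other):
        """Return -1, 0, or 1, ordering by index key."""
        a, b = self.index_key(), other.index_key()
        return (a > b) - (a < b)


class IntField(Field):
    def index_key(self):
        return self._value


class PointField(Field):
    """Client-installed type: a 2-D point ordered by squared distance from the origin."""

    def index_key(self):
        x, y = self._value
        return x * x + y * y


def build_index(relation, column):
    """Return row positions ordered by the chosen column's index key."""
    return sorted(range(len(relation)),
                  key=lambda i: relation[i][column].index_key())


# A relation as a collection of tuples, each a collection of fields.
relation = [
    {"id": IntField(1), "loc": PointField((3, 4))},
    {"id": IntField(2), "loc": PointField((1, 1))},
    {"id": IntField(3), "loc": PointField((0, 2))},
]
order = build_index(relation, "loc")
farther = relation[0]["loc"].compare(relation[1]["loc"])
```

The generic build_index routine never inspects the point representation. It relies only on the behavior every Field provides, which is what lets a new type be installed without changing the storage code.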
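The object page server fault sequence described above (fetch on first touch, read lock with a read-only mapping, upgrade on the first write) can be imitated at user level. The sketch below is a simulation under stated assumptions, not kernel or graft code: PageServer, ClientCache, the lock names, and the fault counter are hypothetical, and explicit read and write calls stand in for hardware faults.

```python
class PageServer:
    """Toy server: keeps a cache in front of a dictionary acting as the disk."""

    def __init__(self, disk):
        self.disk = dict(disk)    # page id -> bytes
        self.cache = {}
        self.disk_reads = 0

    def fetch(self, page_id):
        if page_id not in self.cache:
            self.cache[page_id] = self.disk[page_id]
            self.disk_reads += 1
        return self.cache[page_id]


class ClientCache:
    """Toy client: a 'fault' is taken on first touch and on first write."""

    def __init__(self, server):
        self.server = server
        self.pages = {}    # page id -> bytearray
        self.locks = {}    # page id -> "read" or "write"
        self.faults = 0

    def _fault(self, page_id, is_write):
        self.faults += 1
        if page_id not in self.pages:
            # Page is not resident: ask the server for it.
            self.pages[page_id] = bytearray(self.server.fetch(page_id))
        # A read lock implies a read-only mapping; a write upgrades it.
        self.locks[page_id] = "write" if is_write else "read"

    def read(self, page_id, offset):
        if page_id not in self.pages:
            self._fault(page_id, is_write=False)
        return self.pages[page_id][offset]

    def write(self, page_id, offset, value):
        if self.locks.get(page_id) != "write":
            self._fault(page_id, is_write=True)
        self.pages[page_id][offset] = value

    def modified_pages(self):
        return sorted(p for p, lock in self.locks.items() if lock == "write")


server = PageServer({7: b"abc", 8: b"xyz"})
client = ClientCache(server)
client.read(7, 0)              # first touch: fault, fetch, read lock
client.read(7, 1)              # already mapped: no fault
client.write(7, 0, ord("A"))   # write to a read-only page: fault, lock upgrade
client.write(7, 1, ord("B"))   # write lock already held: no fault
client.read(8, 2)              # first touch of another page: fault
```

In this trace three faults are taken, and only page 7 ends up holding a write lock. The second fault on page 7 is the upgrade step: the page is already resident, so the server is not contacted again.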