SEPTEMBER 1986 VOL. 9 NO. 3

a quarterly bulletin of the IEEE Computer Society technical committee on Database Engineering

CONTENTS

Chairman's Message (S. Jajodia) 1
Changes to the Editorial Staff of Database Engineering (W. Kim) 2
Letter from the Editor (H. Boral) 3
An Operating System for a Database Machine (C. Nyberg) 5
The JASMIN Kernel as a Data Manager Base (W. K. Wilkinson, M-Y. Lai) 9
Supporting a Database System on Symbolics Lisp Machines (H-T. Chou, J. Garza, N. Ballou) 17
The Camelot Project (A. Spector, J. Bloch, D. Daniels, R. Draves, D. Duchamp, J. Eppinger, S. Menees, D. Thompson) 23
Getting the Operating System Out of the Way (J. Moss) 35
Operating System Support for Data Management (M. Stonebraker, A. Kumar) 43
Calls for Papers 52

SPECIAL ISSUE ON OPERATING SYSTEMS SUPPORT FOR DATA MANAGEMENT

Editor-in-Chief, Database Engineering
Dr. Won Kim
MCC
3500 West Balcones Center Drive
Austin, TX 78759
(512) 338-3439

Associate Editors, Database Engineering
Dr. Haran Boral
MCC
3500 West Balcones Center Drive
Austin, TX 78759
(512) 338-3469

Prof. Michael Carey
Computer Sciences Department
University of Wisconsin
Madison, WI 53706
(608) 262-2252

Dr. C. Mohan
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120-6099
(408) 927-1733

Dr. Sunil Sarin
Computer Corporation of America
4 Cambridge Center
Cambridge, MA 02142
(617) 492-8860

Prof. Yannis Vassiliou
Graduate School of Business Administration
New York University
90 Trinity Place
New York, NY
(212) 598-7536

Chairperson, TC
Dr. Sushil Jajodia
Naval Research Lab.
Washington, D.C. 20375-5000
(202) 767-3596

Vice Chairperson, TC
Prof. Arthur Keller
Dept. of Computer Sciences
University of Texas
Austin, TX 78712-1188
(512) 471-7316

Treasurer, TC
Prof. Leszek Lilien
Dept. of Electrical Engineering and Computer Science
University of Illinois
Chicago, IL 60680
(312) 996-0827

Secretary, TC
Dr. Richard L. Shuey
2338 Rosendale Rd.
Schenectady, NY 12309
(518) 374-5684

Database Engineering Bulletin is a quarterly publication of the IEEE Computer Society Technical Committee on Database Engineering. Its scope of interest includes: data structures and models, access strategies, access control techniques, database architecture, database machines, intelligent front ends, mass storage for very large databases, distributed database systems and techniques, database software design and implementation, database utilities, database security and related areas.

Membership in the Database Engineering Technical Committee is open to individuals who demonstrate willingness to actively participate in the various activities of the TC. A member of the IEEE Computer Society may join the TC as a full member. A non-member of the Computer Society may join as a participating member, with approval from at least one officer of the TC. Both full members and participating members of the TC are entitled to receive the quarterly bulletin of the TC free of charge, until further notice.

Contribution to the Bulletin is hereby solicited. News items, letters, technical papers, book reviews, meeting previews, summaries, case studies, etc., should be sent to the Editor. All letters to the Editor will be considered for publication unless accompanied by a request to the contrary. Technical papers are unrefereed.

Opinions expressed in contributions are those of the individual author rather than the official position of the TC on Database Engineering, the IEEE Computer Society, or organizations with which the author may be affiliated.

Chairman's Message

Dear TC DBE Members and Correspondents:

It is a great honor and a challenge for me to serve as the chairperson of the Technical Committee on Database Engineering. It is a great honor to follow Professor Gio Wiederhold, who contributed significantly towards the growth and prestige of our TC during the past few years. On behalf of our TC, I wish to express our sincere appreciation for his valuable services. The chairmanship of the TC is a challenge, since we must take rapid action if we are to keep the Bulletin going in the presence of a severe financial crisis facing the IEEE Computer Society.

Since these problems require our immediate attention, I have taken several steps on which I need to report to you. Following Gio's advice, I have appointed Professor Arthur M. Keller vice chairperson, Professor Leszek Lilien treasurer, and Dr. Richard L. Shuey secretary to help me tackle these problems. I am also pleased to announce the appointment of Dr. Won Kim as the new editor-in-chief of the Bulletin. We are grateful to the outgoing editor, Dr. David Reiner, who has served in this position for several years and wished to step down.

Because of the recent cuts in the Computer Society contribution to the Technical Committees, it has become necessary for us to find new ways to generate funds to meet the publication cost of the Bulletin. As Gio pointed out in his letter in the last issue of the Bulletin, it costs the Computer Society about $18,000 to publish the Bulletin, requiring close to 40% of the Computer Society Technical Activities Board's reduced budget, whereas a fair and balanced formula, based on a fixed amount per TC plus a fixed amount per TC member who is also a Computer Society member, would net us about $1,900. In light of this imbalance, we are considering several different options to set up a viable funding mechanism, but we have agreed to take only two immediate actions.

Beginning with this issue, we are instituting a voluntary page charge for the manuscripts accepted for publication in the Bulletin. The author's company or institution will be requested to pay a page charge of $50 per printed page, which it may pay in full or in part. I wish to reemphasize that payment of these charges will be strictly voluntary and will not in any way be a prerequisite for publication. On the other hand, if there are funds in an author's grant or organization that have been set aside for this purpose, their use for this is one concrete way to show support for the Bulletin.

Moreover, starting with the next issue, we shall charge $100 per page for any conference announcements that are included in the Bulletin. In the past, several conference organizers have volunteered to pay, but there was no mechanism in place to receive the money.

We are considering other options too, including a subscription fee for the recipients of the Bulletin. Please feel free to write to me (arpanet: jajodia@nrl-css; uucp: decvax!nrl-css!jajodia) about how you feel about our efforts to deal with the issues facing us.

In addition to all the individuals mentioned above, I want to acknowledge the invaluable help of Professor C. V. Ramamoorthy and Professor Ben Wah in working out the ideas expressed in this letter. Also, we are fortunate to have Mr. John Musa as the Vice President for Technical Activities. I appreciate very much his support of our efforts.

There have been many complaints in the past about the need to update the TC membership list. I am pleased to tell you that the Computer Society has recently hired someone for this purpose, and new updates are proceeding rapidly. If you wish to have your name added to this list, you can do so by filling out a TC Application Form. You can obtain this form by writing to either me or the Computer Society.

Finally, I would like to mention the Third Data Engineering Conference, which will be held February 2-6, 1987 in Los Angeles. I, along with several other Program Committee members, attended the August Program Committee meeting in Chicago. There were over 200 submissions, too many of high quality, forcing the Program Committee to make tough decisions. The end result, I think, will be an excellent conference, and I do hope that all of you will attend.

Sushil Jajodia
September 15, 1986

Changes to the Editorial Staff of Database Engineering

It is a pleasure to return as Editor-in-Chief of Database Engineering. I would like to thank the outgoing Chief Editor, Dave Reiner, for the great job he has done during the past two years. I also would like to thank Fred Lochovsky for his services as an Associate Editor during the past three years. I have invited Mike Carey and Sunil Sarin to serve as new Associate Editors. Haran Boral, C. Mohan, and Yannis Vassiliou will stay on as Associate Editors. I look forward to working with these outstanding colleagues on the editorial staff, as well as the new officers of the TC.

Haran Boral has put together the present issue of Database Engineering. The next issue, on the European ESPRIT project, is being put together by Yannis Vassiliou. These two issues will complete the 1986 edition of Database Engineering. The uncharacteristic delay in the publication of these two issues is the result of uncertainty that existed during 1986 about the financial situation of the Database Engineering TC. It was not clear until late 1986 whether the TC would be able to continue publishing the Database Engineering bulletin. Sushil Jajodia, the new chairperson of the TC, deserves much of the credit for making it possible for the TC to publish these two issues.

I am presently organizing an issue for March 1987 on integrated software engineering systems and/or database requirements for such systems. Mike Carey will edit the June issue on extensible database systems. Sunil Sarin will follow with an issue on federated database systems in September. C. Mohan is tentatively scheduled to do the December issue on bridging database theory and practice.

Won Kim

Editor-in-Chief
January, 1987

Letter from the Editor

This issue of Database Engineering is titled "Operating Systems Support for Data Management". Although not widely discussed in the literature, the impact of the operating system on DBMS performance is substantial, and usually negative, as originally discussed by Gray in his Notes on Data Base Operating Systems and Stonebraker in his July '81 CACM paper, and as demonstrated by Hagmann and Ferrari in their March '86 TODS paper.

For this issue I asked several people to contribute papers describing existing implementations/experience as well as current work. We are fortunate to have six papers: three describing running implementations, two describing ongoing research projects, and the last examining the relationship between DBMS needs and the services provided by current research operating systems.

The first paper, "An Operating System for a Database Machine" by Chris Nyberg of Britton Lee, describes the IDM kernel. The paper discusses the various decisions undertaken in order to achieve high performance. It is an excellent example of specialization. It is also the only paper in the issue describing a commercial system.

Kevin Wilkinson and Ming-Yee Lai of Bellcore contributed "The JASMIN Kernel as a Data Manager Base". JASMIN is an operating system kernel operational at Bellcore. After describing the JASMIN DBMS and the kernel, Kevin and Ming evaluate the kernel from both an implementation point of view (i.e., were all the needed services there; could they be used easily) and a performance point of view (which of the features hindered overall DBMS performance).

"Supporting a Database System on Symbolics Lisp Machines" by Hong-Tai Chou, Jorge F. Garza, and Nat Ballou of MCC describes the issues encountered in implementing a DBMS in Lisp on the Symbolics machine. The main issues addressed in the paper are the storage of Lisp objects on secondary storage and the transformation of Lisp objects between disk and in-memory representations.

"The Camelot Project" by Alfred Spector and his students at CMU describes a distributed transaction system to support transactions against a wide variety of object types, including databases. The project emphasizes generality as well as performance. After a description of the Camelot functions, its implementation is outlined with emphasis on its relationship to the operating system kernel Camelot runs on.

Eliot Moss of U. Mass. authored the next paper, "Getting the Operating System Out of the Way". Eliot proposes that DBMSs be built in a new programming language on top of a minimal kernel. The paper outlines some of the desirable language features needed. It also discusses the pros and cons of the approach.

The last paper, "Operating System Support for Data Management" by Michael Stonebraker and Akhil Kumar of UC Berkeley, examines many of the features of "next generation" operating systems from the "database guys'" point of view. The paper argues that features such as transaction management and location-independent files are harmful, whereas remote procedure calls and lightweight processes are essential.

I hope the DBE readers find this issue as enlightening as I have, and join me in thanking the authors for their excellent contributions.

Haran Boral
January 1987

An Operating System for a Database Machine

Chris Nyberg
Britton Lee Inc.
Berkeley, CA

Introduction

The Intelligent Database Machine (IDM) is a backend database machine manufactured by Britton Lee Inc. [Ubell 85, Epstein 80]. The IDM has been observed to have "surprisingly good" performance for multi-user benchmarks [Boral 86, Boral 84]. This paper describes the characteristics of the IDM that allow it to achieve its high level of multi-user performance.

Resident software on the IDM consists of a kernel supporting multiple processes. Each IDM process implements a connection with a host process, executing transactions (multi-statement queries or updates) on the host process' behalf. The special purpose nature of the system allows the kernel and process code to be highly integrated. The kernel does not need to support heterogeneous or potentially hostile processes. This integration results in efficient and coordinated allocation of resources by both the kernel and processes. Resources such as the cpu and process memory are more effectively allocated than in a general purpose operating system. The integration also allows the kernel to provide better database support than a general operating system. Often this support consists of simply letting the processes manage and share their own resources.

The kernel provides a variety of services to the IDM processes. The focus of this paper is on the parts of the kernel that take advantage of the special purpose nature of the machine and on the integration of kernel and process code to provide effective database management support. Features of the kernel such as queue-based process management, host communication support and tape i/o will not be discussed.

Section 1 explains how the kernel allows processes to access shared data and describes the allocation method for process memory pages. In section 2 the virtues of the nonpreemptive scheduling policy are given. Section 3 describes how disk management responsibilities are split between the kernel and the processes. Support provided by the kernel for concurrency control is presented in section 4.

1. Memory Management

The IDM is based on a 16-bit microprocessor with four virtual 64K address spaces: kernel program and data, and process program and data. A memory management unit maps the four 64K address spaces to main memory (1 to 6 megabytes in size). Each virtual address space is divided into 32 2K pages. Pages for process data space are divided into per-process data (private data, stack and heap) and data shared by all processes. The kernel data space contains data private to the kernel and the data shared by all processes. The program images for both kernel and process are permanently resident in main memory, as are the kernel data and shared process data.

The kernel allocates main memory for the per-process part of each process' data image. A simple, efficient memory management scheme is used that does not involve paging or swapping virtual images to disk. Moving process data to and from disk was considered too expensive. In addition, paging is incompatible with the nonpreemptive scheduling policy of the IDM (see next section) since a page fault is a preemption.

IDM processes spend most of their time waiting for a query from the host. During this time a process' data image can be taken away and later restored. The kernel uses these facts as the basis of its memory management algorithm.

To begin a transaction, a process must have data memory allocated to it. The process then retains its data memory until the end of the transaction. When the process completes its transaction and waits for another query from the host, it is eligible to have its data memory taken away by the kernel. This is done only if there is another process in need of the memory. Memory-less processes are always waiting for input from the host and are not inside a transaction. There is nothing special about the state of their data images that cannot be easily recreated by the kernel.

A process without memory is simply a process table entry that contains the current database of the process, the network address of its corresponding host process, and input and output queues for communication with the host process. Such a process will be granted data memory when it receives a query from its host process, if memory is available. If not, the process must wait on a queue of processes needing data memory. As memory becomes available, it is allocated to the queued processes on a first-come-first-served basis.
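The grant-and-queue discipline above can be sketched in a few lines of C. This is a hypothetical illustration, not the IDM's actual code: the pool size, structure fields and function names are all assumptions; the point is only that a freed data image passes directly to the oldest waiter.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of FCFS data-memory allocation. All names and
 * sizes are assumptions, not taken from the IDM kernel itself. */

enum pstate { IDLE, WAITING_MEM, HAS_MEM };

struct proc {
    int id;
    enum pstate state;
    struct proc *next;          /* link in the memory-wait queue */
};

static int free_images;                 /* data-memory images available */
static struct proc *wait_head, *wait_tail;

void meminit(int images) { free_images = images; wait_head = wait_tail = NULL; }

/* Called when a query arrives: grant memory if available, else enqueue. */
int request_memory(struct proc *p)
{
    if (free_images > 0) {
        free_images--;
        p->state = HAS_MEM;
        return 1;               /* transaction may begin */
    }
    p->state = WAITING_MEM;
    p->next = NULL;
    if (wait_tail) wait_tail->next = p; else wait_head = p;
    wait_tail = p;
    return 0;
}

/* Called at end of transaction, when the image may be reclaimed:
 * hand the freed image straight to the oldest waiter. */
void release_memory(struct proc *p)
{
    p->state = IDLE;
    if (wait_head) {
        struct proc *q = wait_head;
        wait_head = q->next;
        if (!wait_head) wait_tail = NULL;
        q->state = HAS_MEM;
    } else {
        free_images++;
    }
}
```

Note that a waiter never holds locks (it has not begun a transaction), which is exactly why this scheme cannot cause the lock-induced thrashing discussed below.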

The limitation of this algorithm is that the number of simultaneous processes is limited by the amount of available main memory. By storing virtual data pages on disk the kernel could support more "simultaneous" processes. However, if processes holding locks were swapped to disk, the system could very easily thrash. The processes not holding locks are those not currently executing a transaction. Rather than swapping virtual data for those processes to disk and back, the IDM kernel simply scuttles it and restores it later.

The limitation on simultaneous processes is of practical concern only with small memory configurations. With the largest main memory configuration the kernel can allocate memory for approximately 200 processes. This number of simultaneous transactions is clearly sufficient to find bottlenecks elsewhere in the system.

2. No Preemptive Scheduling

Once the kernel allows a process to run it does not preempt the process. The kernel implements a "soft" quantum by setting a global variable in process data space after a process has held the cpu a fixed amount of time. Processes periodically check this variable at points where it is convenient to relinquish the cpu. If the variable is set, a process will call the kernel to deschedule.
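The "soft" quantum amounts to a flag that only the process itself acts on, and only at safe points. The following sketch is an assumption-laden illustration (the variable and function names are invented); the counter stands in for the deschedule system call.

```c
#include <assert.h>

/* Illustrative sketch of the soft quantum: the kernel never preempts,
 * it only sets a flag in shared process data space, and the process
 * yields at its own convenient points. Names are hypothetical. */

static volatile int quantum_expired;   /* set from the kernel's clock interrupt */
static int deschedule_calls;           /* stands in for calls to deschedule() */

/* Kernel side: fires after the process has held the cpu long enough. */
void clock_interrupt(void) { quantum_expired = 1; }

/* Process side: called only where it is convenient to give up the cpu,
 * i.e. never inside a critical section. */
void maybe_yield(void)
{
    if (quantum_expired) {
        quantum_expired = 0;
        deschedule_calls++;            /* a real process would trap to the kernel */
    }
}
```

Because the check sits outside critical sections by construction, a process is never descheduled while it holds a shared resource, which is the property the convoy discussion below depends on.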

Without preemption, processes can perform critical section code that would normally need to be executed in a privileged address space. Although kernel and process code are differentiated by separate address spaces, the "operating system" of the IDM extends from the kernel and into process code. Indeed, many sections of IDM code could be executed either in kernel or process space. Code that involves interrupts (host, disk and tape i/o) or requires numerous memory mapping changes, such as lock management (see section 4), runs in the kernel. Other critical section code can be executed in process space where it can be invoked without suffering the overhead of a system call.

It is possible to allow preemption if processes obtain locks for their critical sections. However, this uncoordinated allocation of resources leads to inefficient use of the cpu. The classic example of this is the convoy phenomenon observed in System R [Blasgen 79]. Once a process is preempted while holding a highly utilized short term lock, a queue of processes waiting for the lock can quickly form and persist for a long time. There are steps that can be taken to lessen the degradation of preemptive scheduling on critical sections (such as using finer granularity locks), but it requires careful monitoring of a preemptive system to identify these degradations. It is simpler not to allow preemptive scheduling, thereby allowing processes to easily coordinate use of the cpu with the execution of critical sections.

In addition to critical section execution, nonpreemptive scheduling also allows processes to make better use of noncritical shared resources. For instance, a process will not give up the cpu in the middle of referencing a disk cache page. If the process was preempted during this time and the disk page was no longer in the cache when the process was rescheduled, the process would have to arrange for the page to be read into the cache again. Without preemption, IDM processes can deschedule when they have completed a short term use of a shared resource.

3. Disk Management

Disk management responsibilities are split between the processes and the kernel. At the low level the kernel schedules disk i/o at the request of processes and supports the optional mirroring of disk data on separate disk drives. Above this, the processes control a cache of disk pages and implement the file system (or access methods).

Disk pages in the IDM are 2048 bytes in size. A pool of up to 950 main memory pages is used to cache disk pages, known simply as "buffers". For each buffer there is a buffer structure in the shared portion of process data space. Buffer structures contain the disk address and status of the buffer and allow the buffers to be hashed by disk address, grouped by relation, and ordered by replacement priority. Processes can directly manipulate the buffer structures and rely on the nonpreemptive scheduling policy to execute critical sections.

When a buffer is referenced, its replacement priority is updated by the referencing process. The buffer replacement policy implemented is a variation on LRU (Least Recently Used) known as "Group LRU" [Nyberg 84]. The buffers are divided into three groups of decreasing priority: system relation pages, user relation index pages, and user relation data pages. Within each group an LRU policy is used. Buffers can also be made never replaceable and, conversely, immediately replaceable.
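The victim choice under Group LRU can be sketched as: scan the groups from lowest priority upward and take the least recently used unpinned buffer in the first group that has one. This is a guess at the policy's shape from the description above, not the IDM's implementation; the array scan, field names and the `pinned` flag (standing in for "never replaceable") are illustrative.

```c
#include <assert.h>

/* Hypothetical sketch of "Group LRU" [Nyberg 84]: three buffer groups of
 * decreasing priority, plain LRU within a group. Constants and fields
 * are assumptions for illustration only. */

enum group { SYS_REL = 0, USER_INDEX = 1, USER_DATA = 2 };
#define NBUF 6

struct buffer {
    int page;                /* disk address of the cached page */
    enum group grp;
    unsigned last_use;       /* logical clock of last reference */
    int pinned;              /* never-replaceable if set */
};

static struct buffer buf[NBUF];
static unsigned clock_now;

void reference(struct buffer *b) { b->last_use = ++clock_now; }

/* Scan from the lowest-priority group (user data pages) upward and
 * return the LRU unpinned buffer of the first nonempty group. */
struct buffer *pick_victim(void)
{
    for (int g = USER_DATA; g >= SYS_REL; g--) {
        struct buffer *victim = 0;
        for (int i = 0; i < NBUF; i++) {
            struct buffer *b = &buf[i];
            if (b->grp != (enum group)g || b->pinned) continue;
            if (!victim || b->last_use < victim->last_use) victim = b;
        }
        if (victim) return victim;
    }
    return 0;                /* everything pinned */
}
```

The effect is that a user data page is always sacrificed before an index page, and an index page before a system relation page, regardless of recency across groups.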

A status field in the buffer structure allows buffers to be marked as scheduled for i/o, dirty, or log pinned. A process that finds a buffer scheduled for i/o usually calls the kernel to deschedule until the i/o has completed. A status of log pinned is used in conjunction with dirty to indicate that the changed page cannot be written to disk before the transaction log is. Log pinning is part of the IDM's write-ahead log [Gray 78] implementation.
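The interplay of the three status bits and the write-ahead-log rule can be shown with a small predicate. The flag values and helper names below are assumptions, not the IDM's actual encoding; the logic is simply "a dirty page stays in memory while it is log pinned, and a log flush clears the pin".

```c
#include <assert.h>

/* Illustrative buffer status flags and the write-ahead-log rule [Gray 78].
 * Bit assignments and function names are hypothetical. */

#define BUF_IO_SCHEDULED 0x1
#define BUF_DIRTY        0x2
#define BUF_LOG_PINNED   0x4

/* May this buffer be written to disk right now? */
int writable(unsigned status)
{
    if (!(status & BUF_DIRTY)) return 0;         /* nothing to write */
    if (status & BUF_LOG_PINNED) return 0;       /* log records must go first */
    if (status & BUF_IO_SCHEDULED) return 0;     /* i/o already queued */
    return 1;
}

/* A flush of the transaction log clears the log-pin on a buffer,
 * as the kernel does for all buffers at once. */
unsigned log_flushed(unsigned status) { return status & ~BUF_LOG_PINNED; }
```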

The kernel provides system calls that allow processes to initiate disk i/o on buffers. Processes can either have the disk i/o done asynchronously or be descheduled until the i/o is complete. The asynchronous i/o is used to prefetch pages or, better yet, to read or write sequential sectors.

The kernel supports a system call to flush out all dirty buffers for a particular relation. If the relation is the transaction relation (i.e. the transaction log), the kernel may have to order the writes of the buffers due to dependencies among the pages. This is similar to the selected force out described by Stonebraker [Stonebraker 81]. A flush of the transaction log will also cause the kernel to clear the log pinned status in all buffers, allowing processes to write these buffers to disk at will.

Five of the virtual pages in process data space are used to directly address buffer contents. Processes can specify which buffers are mapped into the pages by using a specific system call. No data movement is necessary for a process to read tuples from the disk page cache.

In a general purpose operating system, data (presumably a page in size) would have to be copied from the operating system's disk cache to the process' memory. Besides the overhead of the copying, the operating system must allocate a page of main memory to hold the process' copy of the disk page. If virtual memory is used, the disk data may get paged to disk, thereby undermining the operating system's caching of the disk page. For these reasons, implementing the file system in process space is more efficient.

4. Concurrency Control

Concurrency control in a database system usually refers to synchronizing data accesses using data locks so that each transaction gets an atomic view of the data. In this section that definition will be extended to include other types of inter-process synchronization supported by the IDM kernel. The additional functionality includes locks for critical sections that involve disk i/o and checkpoint support.

Data locks are supported in two forms: relation and page locks. A table for these locks resides in main memory and can be up to 100K bytes in size. Since numerous memory mapping changes may be needed to address the lock table, the setting and clearing of data locks is done in the kernel. A process will be denied a lock if a conflicting lock is already held. When this happens the requesting process becomes blocked by the process holding the conflicting lock. The kernel checks for deadlock at this time. If deadlock is detected the kernel picks a deadlock victim. The process with the least accumulated cpu time is chosen as the victim and is directed by the kernel to abort its transaction. When a process completes a transaction, it calls the kernel to clear all locks it holds in the lock table and to wake up all processes blocked by it.

Some critical sections of the IDM process code involve disk i/o. Most of these operations involve the file system: allocating disk pages; creating, destroying, dumping or loading a database. Processes in the IDM are descheduled when waiting for disk i/o. To preserve each of these file system critical sections there are special locks. The processes set and clear these locks themselves in shared memory. The kernel allows processes to wait for one of these locks if the lock is already allocated, or to wake up other processes waiting for such a lock when the lock is freed.

Checkpoints [Gray 78] on each database are periodically performed by a dedicated process. The checkpoint procedure consists of writing all dirty buffer pages for the database to disk and writing a checkpoint record to the database's log. To make this operation atomic, the kernel allows the checkpoint process to determine what other processes are active updaters in the checkpointed database and to temporarily suspend these processes during the checkpoint.
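The suspend-flush-record-resume sequence can be sketched as below. Everything here is an illustrative stand-in (the struct, the counters, the helper name); the real kernel operates on buffer pools and logs, not integers, but the ordering is the point: no updater runs between the flush and the checkpoint record.

```c
#include <assert.h>

/* Toy sketch of the atomic checkpoint sequence performed by the
 * dedicated checkpoint process. All types and fields are hypothetical. */

#define MAXUPD 4

struct db {
    int suspended[MAXUPD];   /* 1 while that active updater is held off */
    int nupdaters;           /* active updaters in this database */
    int dirty_pages;         /* dirty buffers belonging to this database */
    int log_records;         /* records in this database's log */
};

void checkpoint(struct db *d)
{
    for (int i = 0; i < d->nupdaters; i++)   /* suspend active updaters */
        d->suspended[i] = 1;
    d->dirty_pages = 0;                      /* flush dirty buffers to disk */
    d->log_records++;                        /* write the checkpoint record */
    for (int i = 0; i < d->nupdaters; i++)   /* let the updaters continue */
        d->suspended[i] = 0;
}
```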

Conclusions

The IDM takes advantage of its specialized nature to allocate main memory for processes in a manner that does not result in thrashing. Processes share memory and control their own descheduling, allowing them to efficiently allocate and share resources. The kernel and processes are integrated to provide efficient disk management tailored to the needs of the database management system. Data locks and other forms of concurrency control are supported by the kernel. The end result of these techniques is a database system that provides remarkably constant throughput in the face of an increasing workload.

Acknowledgements

While the implementation of the IDM kernel was done principally by the author, the design of the memory management algorithm and the integration of kernel and process code came from database management architects, notably Bob Epstein, Mike Ubell, Paula Hawthorn and Charles Koester. Many people reviewed this paper and provided useful suggestions: Bob Taylor, Charles Koester, Scott Humphrey, Mike Ubell and Paula Hawthorn.

References

[Blasgen 79] Blasgen, M. et al., "The Convoy Phenomenon," Operating Systems Review, Vol. 13, No. 2, Apr. 1979, pp. 20-25.

[Boral 84] Boral, H. and D. J. DeWitt, "A Methodology for Database System Performance Evaluation," Proceedings of SIGMOD '84, 1984, pp. 176-185.

[Boral 86] Boral, H., "Design Considerations for 1990 Database Machines," COMPCON '86 Proceedings, Mar. 1986, pp. 370-373.

[Epstein 80] Epstein, R. and P. Hawthorn, "Design Decisions for the Intelligent Database Machine," Proceedings of the 1980 National Computer Conference, 1980, pp. 237-241.

[Gray 78] Gray, J., "Notes on Data Base Operating Systems," Report RJ3120, IBM Research Lab., San Jose, Calif., Oct. 1978.

[Nyberg 84] Nyberg, C., "Disk Scheduling and Cache Replacement for a Database Machine," Master's Project Report, University of California, Berkeley, Aug. 1984.

[Stonebraker 81] Stonebraker, M., "Operating System Support for Database Management," Communications of the ACM, Vol. 24, No. 7, July 1981, pp. 412-418.

[Ubell 85] Ubell, M., "The Intelligent Database Machine (IDM)," Query Processing in Database Systems, ed. by Won Kim et al., Springer-Verlag, 1985, pp. 237-247.

The JASMIN Kernel as a Data Manager Base

W. Kevin Wilkinson and Ming-Yee Lai

Bell Communications Research
435 South Street
Morristown, New Jersey 07960

ABSTRACT

The JASMIN kernel is an experimental, capability-based operating system kernel that provides a core set of facilities on which to build distributed applications. The JASMIN database system was designed as a research project in distributed processing and database management, but also serves as a test of the functionality provided by the JASMIN kernel. This paper provides an overview of the JASMIN database system and the kernel and reports on our experience in using the kernel. We discuss the merits and demerits of the various kernel facilities in building the database management system. We claim that a general-purpose, minimal kernel is an excellent base for a dedicated application such as a database management system.

1. Introduction

The JASMIN kernel is a capability-based operating system that provides a minimal set of facilities for distributed applications [LEE84]. It offers three types of services: tasking, memory management and inter-process communication. It includes no database-specific features, and even device drivers are not included. A primary goal of the JASMIN database system is to support high throughput transaction processing for a wide range of applications. It is designed as a set of cooperating software modules. These modules communicate using the kernel facilities, which hide processor boundaries. Thus, modules can be easily moved to multiple processors and modules may be replicated to achieve higher performance.

The next two sections provide an overview of the JASMIN database system and the JASMIN kernel. Following that, we discuss implementing the database system on the kernel and describe the utility of various kernel features and what we found lacking. In a final section, we discuss the impact of the kernel design on system performance.

2. The JASMIN Database System

The JASMIN database system ([FIS84], [LAI84]) was designed to provide high-performance transaction processing over a wide range of applications. The approach taken was to functionally decompose the database manager into modules which communicate using a processor-transparent IPC mechanism. The modules can then be replicated and distributed across multiple processors for improved performance. Our decomposition consists of three modules representing increasing levels of abstraction: the Intelligent Store (IS) provides a page interface and transaction management, the Record Manager (RM) provides a record interface and access paths, and the Data Manager (DM) provides a relational model interface. Each module relies on the facilities provided by the lower level. In addition, each module's interface is user-accessible. Applications needing only page access can access an IS directly and avoid the overhead of higher-level services. Finally, the system is easily reconfigured since the module interfaces are processor transparent.

2.1 Intelligent Store (IS)

The Intelligent Store ([ROO82]) offers a page interface to applications. It models the disk as a collection of Logical Volumes, where each LV has some number of pages. An LV is simply a number of adjacent disk cylinders. However, an LV differs from a disk extent in that LV page numbers are logical, rather than physical. Thus, page i may be several cylinders away from page i+1.* Further, the page size may vary across LVs, allowing an LV containing an index to have larger pages, for example. The IS is transaction oriented and provides an optimistic, versioning concurrency control mechanism. Each transaction is guaranteed the view of the database (LV collection) as of the transaction's start time, as if a snapshot had been taken. Conflicting updates to the same page are resolved by giving each transaction its own copy of the page. Note, this is how logically adjacent pages become physically distant. A transaction commits so long as its view of the database is still current at its commit time, i.e. during the lifetime of the transaction, no other transaction committed an update to a page accessed by the transaction. Note that, as in other optimistic schemes, transactions never block. However, our scheme has the additional advantage that read-only transactions always succeed, since a read-only transaction always sees a consistent view of the database provided by the versioning scheme of the IS.

The IS provides a priority caching scheme for its buffers. This allows a database designer to specify higher priority for a particular LV (e.g., an LV containing index pages vs. one containing data pages). Also, the IS maintains a set of version counters that are used by applications to detect cache updates. For example, an RM will request the current value of a version counter after reading meta-data† into a local cache. Transactions which update meta-data will increment the corresponding version counter. The RM can detect an out-of-date cache by periodically checking the value of the version counter. Currently, we are working on an implementation to support distributed transactions.
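The version-counter protocol might look like the following sketch (names and representation are assumed for illustration; the real IS interface is not shown in this paper):

```c
/* Illustrative sketch of the IS version counters used to detect stale
   caches: the IS bumps a counter whenever a transaction commits a
   meta-data update; an RM compares the counter against the value it saw
   when it loaded its local cache. */

typedef struct {
    int cached_version;   /* counter value when the cache was loaded */
    /* ... cached meta-data would live here ... */
} MetaCache;

/* IS side: a committed meta-data update bumps the counter. */
void is_bump_counter(int *counter) { (*counter)++; }

/* RM side: periodic check; returns 1 if the cache must be reloaded. */
int rm_cache_stale(const MetaCache *c, int current_counter)
{
    return current_counter != c->cached_version;
}
```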

2.2 Record Manager (RM)

The Record Manager [LIN82] maps records into IS pages. Records consist of any number of field-value pairs. Variable-length fields, missing values, and repeating fields are all supported. Record types are defined by the user. Each record type has a primary (clustering) index and may have multiple secondary indices. The user must also specify the LV to contain the index pages and the LV for the data pages. Thereafter, the RM handles the details of mapping records into LV pages.

An RM request provides associative access to one record type at a time. Currently, B-trees are the access method used. Prefix and exact matching are supported, as well as the standard comparison operations. The RM is transaction-oriented, and a transaction may access multiple record types during its lifetime. The RM depends on the concurrency control facilities of the IS for database consistency.
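For string-valued keys, the exact and prefix matching modes reduce to simple comparisons; this sketch (our illustration, assuming keys stored as C strings) shows both:

```c
#include <string.h>

/* Illustrative key-matching predicates for an RM-style associative
   request, assuming string keys (representation assumed, not JASMIN's). */

int match_exact(const char *key, const char *probe)
{
    return strcmp(key, probe) == 0;
}

/* True when `key` begins with `probe`, as in a prefix scan over the
   leaves of a B-tree. */
int match_prefix(const char *key, const char *probe)
{
    return strncmp(key, probe, strlen(probe)) == 0;
}
```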

2.3 Data Manager (DM)

The Data Manager (DM) provides a full relational interface to the user. It has a QUEL interface [STO76] and maps relations into RM records. In addition to the functions of query optimization

* Some control over placement is possible using "cells", which are not described here.
† By meta-data, we mean the stored schema that describes the database structure.

and planning, the DM uses the RM to process one or more single-relation queries. Complex queries are processed in a pipeline fashion, with intermediate results from separate single-relation RM queries being fed, concurrently, to a network of DM tasks that communicate via shared memory. This avoids the need to create temporary relations.

3. JASMIN Kernel

The JASMIN kernel is a sparse operating system with little of the overhead (and few of the features) found in conventional operating systems. The motivation was to give the designer of a distributed application much control over how resources are allocated and scheduled. It basically provides three types of services: tasking, memory management, and inter-process communication. These are discussed in more detail below.

3.1 Tasking

A JASMIN task represents a single thread of control. It consists of a stack, a priority, private* data space, and capabilities (described below). Tasks are created and destroyed dynamically. Task scheduling is by priority and is non-preemptive within a priority class. A task releases the processor only if it waits for a message or a system resource, or if a higher-priority task becomes runnable.
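The scheduling rule (priority-based, but non-preemptive within a class) can be sketched as follows; the task representation is assumed for illustration:

```c
/* Illustrative sketch of JASMIN-style task selection: the current task
   keeps the processor unless it blocks or a strictly higher-priority
   task becomes runnable; equal priority never preempts. */

typedef struct {
    int priority;   /* larger value = higher priority */
    int runnable;
} Task;

/* Returns the index of the task to run next, or -1 if none is runnable. */
int pick_next(const Task *t, int ntasks, int current)
{
    int best = -1;
    for (int i = 0; i < ntasks; i++)
        if (t[i].runnable && (best < 0 || t[i].priority > t[best].priority))
            best = i;
    /* Non-preemptive within a class: keep the current task unless a
       strictly higher-priority task is runnable. */
    if (current >= 0 && t[current].runnable && best >= 0 &&
        t[best].priority <= t[current].priority)
        return current;
    return best;
}
```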

A JASMIN task executes in the address space of a module. A module consists of one or more associated tasks that share text (code) space and data space. A module facilitates concurrent activity within a shared data area, for example, the pipelined tasks within the DM. Modules are indivisible, i.e., all tasks in a module run on the same processor. However, several modules may share a processor. A task cannot tell if a task in a different module is on the same processor. So, programs cannot contain processor dependencies. By simply changing the assignment of modules to processors, the system is easily reconfigured with no change to any programs.

Note that there is no kernel facility to create or destroy a module. Such a facility would imply the existence of a disk server from which to load the module. And since the kernel itself contains no disk service, module creation must be a higher-level service built on top of the kernel. A system is booted by downloading a large core image containing one or more modules and the kernel.

3.2 Memory Management

The JASMIN kernel provides a separate address space per task but does not do virtual memory management. Thus, the stack and private space of a task are hardware-protected from other tasks both within and without the same module. Note that since the kernel provides no disk service, it would be hard-pressed to do paging.

As mentioned above, a module consists of text space and data space shared by all tasks in the module. In addition, a module may have a pre-defined section of private data space that is allocated to each task. For example, variables containing the task identifier and task creation time might be in this static private space, since these values are unique for each task. At task creation time, a task inherits the text space and shared data space of its parent. During execution, a task may create and destroy data segments dynamically. Such segments are normally private, except that any child task created thereafter is also given access to the segment. However, there is no facility for true dynamic shared segments (e.g., between siblings).

* Like the stack, private data space is hardware-protected and is inaccessible to other tasks.

3.3 Communication

The JASMIN kernel provides a single message mechanism to be used for both synchronization and communication. These messages are small and fixed-length (16 bytes) and are sent along one-way communication channels, called paths. Our paths are similar to links in Roscoe [SOL79] and DEMOS [BAS77]. Paths are actually communication capabilities managed by the kernel. A task may allocate a path and pass it to another task in a message. This grants the receiving task the capability to send a message to the creator. The rights to duplicate or re-use a path are set when the path is created and may be further restricted whenever the path is duplicated.

Additionally, a path is created with a class and tag. There are seven message classes, which represent separate receive queues. They can be used for selective receive and to prioritize messages. A task can elect to wait for a message on any subset of the classes. When a message is delivered, the class and tag (of the path on which the message was sent) are returned to the receiving task. The tag is used to disambiguate messages for paths that use the same class.
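Selective receive over the seven classes can be sketched with a bitmask; the queue layout below is our own illustration, not the kernel's actual representation:

```c
/* Illustrative sketch of class-based selective receive: a task waits on
   any subset of the seven message classes, expressed here as a bitmask.
   Per-class queue lengths stand in for the kernel's receive queues. */

enum { NCLASSES = 7 };

/* Returns the lowest-numbered selected class with a pending message,
   or -1, in which case the caller would block. */
int select_queue(const int queue_len[NCLASSES], unsigned mask)
{
    for (int c = 0; c < NCLASSES; c++)
        if ((mask & (1u << c)) && queue_len[c] > 0)
            return c;
    return -1;
}
```

A heavily loaded front-end could simply drop the class carrying new-transaction requests from its mask, the load-shedding use of selective receive discussed later in the paper.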

The initial rendezvous between tasks is usually accomplished via the Name Manager task. A new task is typically created with a path to the Name Manager. Any task wishing to advertise services will register a path and a name with the Name Manager. Tasks requesting service can then get a copy of the server path from the Name Manager.
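The rendezvous amounts to a name-to-path table inside the Name Manager. A minimal sketch, with a fixed-size table and illustrative names (not the actual JASMIN interface):

```c
#include <string.h>

/* Illustrative Name Manager: servers register a (name, path) pair;
   clients look the name up to obtain a copy of the server's path. */

enum { NM_SLOTS = 16, NM_NAMELEN = 32 };

typedef struct { char name[NM_NAMELEN]; int path; int used; } NmEntry;

int nm_register(NmEntry *tab, const char *name, int path)
{
    for (int i = 0; i < NM_SLOTS; i++)
        if (!tab[i].used) {
            strncpy(tab[i].name, name, NM_NAMELEN - 1);
            tab[i].name[NM_NAMELEN - 1] = '\0';
            tab[i].path = path;
            tab[i].used = 1;
            return 0;
        }
    return -1;                       /* table full */
}

int nm_lookup(const NmEntry *tab, const char *name)
{
    for (int i = 0; i < NM_SLOTS; i++)
        if (tab[i].used && strcmp(tab[i].name, name) == 0)
            return tab[i].path;      /* copy of the server's path */
    return -1;                       /* unknown service */
}
```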

Large data transfers between tasks are done by attaching a buffer to a path. A separate kernel facility (iomove) is used to transfer data between the path buffer and a local buffer.

3.4 Services

The JASMIN kernel provides a few additional functions as aids to application developers. Among these are the local time-of-day clock, a CPU-usage clock, and primitives to print a string on the system console and to read a string from the console keyboard. There is also a routine to print status information about all tasks in the system.

As mentioned earlier, device drivers are not included in the kernel. Instead, drivers exist as services which advertise in the Name Manager. For example, requests to read and write disk pages are sent as messages to the disk server. The message includes a path with an attached buffer. The disk server then uses the client's buffer for the disk I/O operation. Note that some device drivers require low-level kernel functions unavailable to normal tasks. For example, the kernel must translate device interrupts into messages to the appropriate servers. So, a device driver must tell the kernel its interrupt level and give it a path on which to send an interrupt message. Also, some device drivers need to directly access path buffers, bypassing iomove (e.g., for DMA operations). Thus, there is a routine that returns the physical address for a path buffer.
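The interrupt-to-message plumbing can be sketched as a per-level registration table; the names and sizes below are assumptions for illustration:

```c
/* Illustrative sketch: the kernel turns device interrupts into messages
   by remembering, per interrupt level, the path the driver registered. */

enum { NLEVELS = 8 };

static int irq_path[NLEVELS];   /* 0 = no driver registered */

int drv_register_irq(int level, int path)
{
    if (level < 0 || level >= NLEVELS) return -1;
    irq_path[level] = path;
    return 0;
}

/* Called at interrupt time: returns the path to send the interrupt
   message on, or -1 if no driver claimed this level. */
int irq_to_path(int level)
{
    if (level < 0 || level >= NLEVELS || irq_path[level] == 0) return -1;
    return irq_path[level];
}
```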

4. Implementation Experience

In this section, we discuss our experience in using the JASMIN kernel to build the JASMIN database system. We adopt the categorization used in the previous section and, so, discuss the tasking, memory management, and communication facilities of the kernel separately.

4.1 Tasking

The tasking facilities of the JASMIN kernel were quite satisfactory for our database application. The ability to multiprogram by using cooperating tasks in a module was exploited by all three modules.

The IS and DM modules used a fixed number of tasks, while the number of RM tasks was load-dependent. In the DM and IS, a front-end task would receive all requests and then forward each request to a free worker task for processing. By contrast, in the RM the front-end task only received start-transaction requests. For each new transaction, it would create a separate task to process all subsequent requests for the transaction. Thus, the kernel provided the flexibility for both approaches*. Another good aspect of the kernel tasking facilities was that tasks (within a module) differed only in their stack and private space. This provides an opportunity for the kernel to optimize intra-module task switch time, since the memory management registers for text and shared data do not change. If only one module runs per processor, task switch time can be significantly reduced. However, the kernel definition does not describe this optimization (although some implementations have used it).

This brings up a minor complaint. The module notion is not formalized in the kernel definition. Formalizing a module might involve adding intra-module facilities (e.g., dynamic shared memory) and changing the semantics of existing facilities. These are things that could be added, of course. The important point is that a module is more than a collection of tasks, as implied by the current definition. A module involves shared facilities, cooperating tasks, and some overall module control in the form of a coordinator or scheduler. Future designs should recognize this and incorporate appropriate features.

The kernel provided non-preemptive scheduling, which we feel is the correct choice for database managers. Besides the obvious advantage of preventing interference over the disk arm, it allowed us to write critical sections with no special synchronization mechanism. For example, the IS uses a single task to process all commit requests. This ensures no concurrent activity and facilitates making the commit atomic. However, in spite of the non-preemptive scheduling, it would have been useful to be able to set a limit on processor usage per task to facilitate debugging (detection of infinite loops, etc.). There is a facility to send a message after some specified delay. But this is not sufficient, since it requires that the receiving task have some control over the sending task (e.g., to interrupt or destroy it). There is no way to do this in JASMIN. This is another argument for adding module features to the kernel interface.

4.2 Memory Management

The kernel provides a separate address space per task but does not do virtual memory management. This is a good compromise for an experimental database management system. Separate address spaces are good software engineering, especially when several independent programmers are involved in a project. Good performance argues for no virtual memory management, since unexpected paging can cause problems for query scheduling and execution. The notion of private space proved very useful. For example, the RM uses private space to store metadata and private buffers for transactions. The IS, which does not have a dedicated task per transaction, cannot take advantage of this scheme, since multiple requests from a transaction could be processed by different workers. Thus, the IS needs a more complex buffering scheme which is shared among the workers. And a transaction's data is not hardware-protected as it is in the RM.

The ability to share data among tasks in a module proved very useful in managing the shared data buffers for the database. This was necessary in the IS, since several worker tasks share a common data buffer. Buffer management is under the full control of the commit and worker tasks.

* The modules were written by separate programmers; thus the different task structures.

As mentioned earlier, a feature missing from the kernel is a primitive to create dynamic shared space. Such a facility could permit more efficient use of memory. Currently, users must preallocate all shared tables at link time so that they exist in static shared space. Thus, there is always enough space for the worst-case load. But this reduces the amount of space available for the buffer cache, etc.

4.3 Communication

The communication facilities of the JASMIN kernel are the most controversial. Some of the best aspects are the logical message addressing through the use of capabilities and the ability to perform a selective receive by specifying a subset of message classes. Logical addressing facilitates easy reconfiguration of the system by eliminating any processor dependencies in the programs. A client need not know the physical location of a server to obtain a connection. The ability to selectively receive a message from any one of multiple message queues was very useful. For example, the front-end task in an RM module may receive messages from either RM clients or RM worker tasks. Under normal load, it is willing to accept a message from either source. But, under conditions of heavy load, it must ignore requests from clients to start new transactions. The selective receive makes this simple. Some other systems decouple the selection from the receive (such as the select primitive in BSD 4.2) and, so, require an additional system call. Another nice feature of the JASMIN communication facilities is that it uses a send-receive protocol as opposed to a call/return protocol* (which is used in the V-Kernel [CHE83]). This makes it easier to implement asynchronous processing, as is required for two-phase commit.

One major drawback of the JASMIN IPC design is that communication capabilities are many-to-one, i.e., there can be multiple senders on a path but only one task can receive on the path. A consequence of this is that multi-cast messages are impossible to send. But, more importantly, when calling a server, it adds the overhead of an extra message to process a request. Recall that the IS and DM task structures have a front-end task and multiple worker tasks. Requests must first be directed to the front-end, which forwards them to a free worker.

A better solution would be to have the worker tasks share a request queue and just remove the next request themselves. But since JASMIN has no shared receive queues (capabilities are owned by only one task), the worker tasks cannot do this. As an aside, note that the RM task structure does not have this problem. The RM has a dedicated task per transaction, so requests go directly to the worker. On the other hand, it has the serious limitation that the RM can only process one request per transaction at a time.

Another nuisance of the IPC design is that only one capability path can be passed per message. Normally, this is sufficient. But there are situations where two paths are needed. Any request that needs two separate buffers requires an additional message exchange to send the capability for the second buffer. For example, the RM needs to read both the metadata and cache version numbers in one request, and these are stored in different buffers. Actually, JASMIN does provide a way to package a collection of paths into one capability, called a path-list. To our knowledge, this facility is unique among distributed kernels. However, it is not a good solution for this situation due to the overhead of packing and unpacking the path-list. This situation occurs frequently enough that it merits special attention.

* Actually, the mcall and iomcall primitives provide a call/return facility.

A final drawback is that there is no low-cost synchronization facility (such as semaphores or locks). Thus, intra-module synchronization must be done with messages, which incurs more overhead than necessary.

5. Performance Implications

Probably the largest impact on performance was the fact that all communication was done via capabilities, i.e., secure communications channels. A seemingly innocent activity such as passing a capability in a message could result in a flurry of activity by the kernel, especially if the sending task, receiving task, and the capability owner are all on separate machines. More specifically, the overhead is not so much due to capability maintenance but to the fact that capabilities require a connection-based communications protocol. Passing, duplicating, and destroying paths requires creating a channel to the path owner to maintain reference counts, sequence numbers, etc. And this (at times, significant) overhead is hidden from the kernel user, since one cannot tell who owns the path. A clever implementation can piggy-back the bookkeeping information onto other messages. But this only works in certain situations. Connections work well when the amount of communication exceeds the overhead of setting up and destroying the channel. In our application, much of the communication between modules is one-shot, so the overhead of maintaining the connections was often wasted.

And, it is not clear that much was gained by the use of capabilities. They are elegant and in some cases simplified the programming. But in a dedicated application with "trusted" code, such as a database management system, the protection they offer seems not worth the cost.

Another performance drain was the decision to implement device drivers as servers. Although this simplified the implementation somewhat and kept the kernel simple, it was at great cost. For example, consider the TTY driver, which receives an interrupt per character. Since processing the interrupt involves at least two messages, this adds an overhead of several milliseconds to the interrupt processing time. Similar results hold for the disk driver. In fact, for clients of the disk server, we found the disk latency to be approximately 3/4 of a disk revolution for reading consecutive disk sectors.

Originally, it was hoped and expected that message passing would be fast enough that this overhead would not be a problem. Unfortunately, that was not the case. In our prototype system, performance was not the highest priority, and future implementations will do better. It remains an open issue, though, whether connection-based protocols can ever be competitive with connection-less protocols for this type of application. Similarly, it was hoped that task creation time would be fast. Thus, the RM designer was willing to accept the overhead of creating a new task for each transaction in order to simplify the programming chore. In retrospect, this was not a good choice in our environment, since transactions tend to be short.

It is worth commenting on the separate facilities for data transfer (iomove) and communication (sendmsg) in JASMIN. An alternative design would have been a single communication mechanism that supported variable-length messages. Clearly, the iomove facility simplifies the kernel, since it never has to buffer large amounts of data and all messages are fixed-length. It is claimed to be a performance winner, as well. With iomove, a task never gets data until it is ready. This saves memory on the receiver side and facilitates DMA between sender and receiver if the hardware is capable. It is not entirely clear if it really is a performance advantage, however. Iomove essentially requires two synchronization points to transfer data: one message from sender to receiver to indicate data is available, and a second from receiver to sender indicating the transfer is complete. Using variable-length messages, there is just one synchronization point. Thus, the tradeoff is between one extra message and memory space (and a more complicated kernel).

In summary, the kernel provided an excellent development environment. It facilitates fast prototyping, and the tasking and memory management facilities were just what was needed. The communication facilities are simple and elegant but are perhaps too general for high-performance applications.

6. Acknowledgements

Thanks to Haran Boral and Premkumar Uppaluru for reading a draft of this paper and making many helpful suggestions to improve the content and readability. Premkumar and Hikyu Lee ported the JASMIN kernel to our current hardware at Bellcore.

7. References

[BAS77] Baskett, F.H., Howard, J.H., Montague, J.T., Task Communication in DEMOS. Proceedings of the 6th ACM Symposium on Operating System Principles, November 1977, pp. 23-31.

[CHE83] Cheriton, D.R., Zwaenepoel, W., The Distributed V Kernel and Its Performance for Diskless Workstations. Report No. STAN-CS-83-973, Stanford University, July 1983.

[FIS84] Fishman, D.H., Lai, M.Y., Wilkinson, W.K., An Overview of the JASMIN Database Machine. Proceedings of the ACM SIGMOD Conference, Boston, MA, June 1984, pp. 234-239.

[LAI84] Lai, M.Y., Wilkinson, W.K., Distributed Transaction Management in JASMIN. VLDB 84, August 1984.

[LEE84] Lee, H., Premkumar, U., The Architecture and Implementation of the Distributed JASMIN Kernel. Bell Communications Research Technical Memo TM-ARH-000324, October 1984, Morristown, N.J.

[LIN82] Linderman, J.P., Issues in the Design of a Distributed Record Management System. Bell System Technical Journal 61, 9 (Nov. 1982), Part 2, 2555-2566.

[ROO82] Roome, W.D., A Content-Addressable Intelligent Store. Bell System Technical Journal 61, 9 (Nov. 1982), Part 2, 2567-2596.

[SOL79] Solomon, M.H., Finkel, R.A., The Roscoe Distributed Operating System. Proceedings of the 7th Symposium on Operating Systems Principles, December 1979, pp. 108-114.

[STO76] Stonebraker, M., Wong, E., Kreps, P., and Held, G., The Design and Implementation of INGRES. ACM TODS 1, 3 (Sept. 1976), 189-222.

Supporting a Database System on Symbolics Lisp Machines

Hong-Tai Chou, Jorge F. Garza, Nat Ballou

Microelectronics and Computer Technology Corporation
3500 West Balcones Center Drive
Austin, Texas 78759

ABSTRACT

We examine a number of important issues related to the implementation of a prototype object-oriented database system on a Symbolics 3600 Lisp machine. Our main focus is on interfacing with the storage system, memory management, and process control of the Symbolics operating system. We discuss various implementation problems that we have encountered and present our solutions.

1. Introduction

ORION is a prototype object-oriented database system under development at MCC to support the data management needs of object-oriented applications from the CAD/CAM, AI, and OIS domains [BANE87]. The intended applications for ORION imposed two types of requirements: advanced functionality and high performance. The ORION architecture was designed to satisfy these requirements. ORION provides a number of advanced features that conventional commercial database systems do not, including version control and change notification [CHOU86], storage and presentation of unstructured multimedia data [WOEL86], and dynamic changes to the database schema [BANE86]. For high performance, ORION supports efficient access paths, and novel and efficient techniques for query processing, buffer management, and concurrency control.

ORION is being implemented in Common Lisp [STEE84] on a Symbolics 3600 machine [SYMB85a]. Supporting an object-oriented database system in a Lisp environment presented a unique challenge. The general problem of supporting a database system in a Lisp environment is twofold. First, using Lisp as an implementation language presents some technical difficulties. Lisp was not originally designed for system programming. Low-level manipulation of primitive data, such as pointers and bytes, requires careful coding. Second, storing Lisp objects efficiently is a challenge to a database system, mainly because Lisp objects are so dynamic in terms of data type and length. We shall focus our discussion on the latter, since the former can be solved more easily by coding techniques. In particular, we will examine implementation issues related to the storage system, memory management, and process control of the Symbolics operating system.

2. Storing Lisp Objects

The Symbolics machine provides three levels of storage interface to users (Figure 1): the LMFS (Lisp Machine File System), the FEP (Front-End Processor) file system, and the user-disk system [SYMB85c]. Internally, the Symbolics machine also implements a system-disk component to support paging of its virtual memory.

[Figure 1 not reproduced: a diagram showing LMFS files stored in FEP file partitions (e.g., LMFS.FILE), other FEP files including the virtual memory paging file, and the disks beneath, accessed in user and system modes.]

Figure 1. Storage Organization of Symbolics

Lisp Machine File System (LMFS)

LMFS provides a file stream interface through which a Lisp object can be stored in the form of printed text, called its printed representation. It is built on top of the FEP file system. Normally, all LMFS files are stored in a big FEP file, "LMFS.FILE", which is the default LMFS file partition (file system space). Alternative or additional file partitions, each implemented as a FEP file, can be specified in the FEP file "FSPT.FSPT". An object stored in a file stream is retrieved as a string of characters and reconstructed into an in-memory Lisp object.

[Figure 2 not reproduced: the Lisp printer maps the in-memory list to its printed representation in a file stream; the Lisp reader maps it back into new memory cells.]

Figure 2. Representation of a Lisp Object (x (y z))

As shown in Figure 2, a typical in-memory representation of a nested list (x (y z)) is a collection of memory cells connected through pointers. The list can be written to a file stream by the Lisp printer, and later be read in and reconstructed (in a new collection of memory cells) by the Lisp reader. For database applications, there are several functional and performance problems with file streams:

(1) Some objects do not have a printed representation from which the objects can be reconstructed in memory. For example, the printed representation of compiled code cannot be read in.

(2) Storing objects in a file stream is expensive, as each in-memory object has to be translated into a text string before it can be stored.

(3) Certain Lisp objects have a lengthy printed representation. For example, storing a bit vector as a sequence of ones and zeros in text form is obviously inefficient.

(4) Although random accesses are possible with a file stream, storing an object at an arbitrary location beyond the end of the stream is not allowed. (In other words, a file stream cannot contain "holes" in which no data has been written.) This makes updating an object difficult, because we cannot reserve space in anticipation of future growth of the object.

(5) Although logically contiguous, the actual data of a file stream may be scattered on disk. Since Lisp programmers have no control over physical placement of objects, the ability to cluster objects on disk is severely limited. Further, buffering of disk blocks for a file stream is transparent to the programmers. Thus, the ability to prefetch objects and to exploit any intelligent buffering algorithm is lost if file streams are used.

Although binary LMFS files are also provided to help overcome some of these problems, the byte-at-a-time interface (through which one byte is transferred at a time) makes them too inefficient for dealing with large amounts of data. Besides, this binary file facility is not very useful, since only integers can be stored.

Symbolics FEP File and Disk Systems

The FEP file system is a low-level storage system that deals primarily with blocks of data. It provides an interface that allows un-buffered reading and writing of disk blocks (i.e., without intermediate buffering). Such an I/O operation can be either synchronous (blocking) or asynchronous (non-blocking). Besides providing file space for LMFS, FEP files are used internally by the operating system for storing system data, such as paging files and boot commands. Underneath the FEP file system is the Symbolics Disk System, which supports two modes of disk transfer: user mode for normal file storage and system mode for virtual memory support. Both the FEP file and user-disk interfaces require a system resource, the disk array, for buffering disk blocks. A disk array is an array of fixnums plus some disk-related header information, and resides in virtual memory. (A fixnum is an integer which has an efficient and typically fixed-size representation in a Common Lisp implementation.)

Figure 3 shows a typical data flow during retrieval of a Lisp object from an LMFS file. First, LMFS translates the current file position (within a file stream) into a block number (in a FEP file), and calls the FEP file system to retrieve the data block into a disk array. The FEP file system, in turn, looks up the physical disk address of the data block and issues a read request to the disk driver. When the disk-read operation completes, the FEP file system returns control to LMFS, which then extracts the relevant data (in text form) from the disk array and transforms it into an in-memory representation of the Lisp object. Note that the disk array may have to be paged in through the system disk interface before the data block can be retrieved.

Building a Database Storage Subsystem

Figure 3. Data Flow in the Symbolics Storage Hierarchy

It is desirable to integrate the storage subsystem of a database system with that of the Symbolics machine to minimize memory management overhead, such as paging. However, this requires modification to the Symbolics operating system. To avoid such a major undertaking, we have chosen the FEP file system as our primary "disk" interface. However, creating a FEP file is an expensive operation. Further, performance of the FEP file system may be severely degraded if a disk is fragmented by a large number of small FEP files. Thus, to reduce FEP system overhead and disk fragmentation, we allocate (semi-permanently) a large FEP file as a virtual disk. Similar to LMFS, we implement our database storage subsystem on top of the FEP file system, and take control of page allocation within the virtual disk.

There are two Symbolics utility programs that dump and load Lisp objects to and from a FEP file in a binary format [SYMB85b]. However, the inability to store a Lisp object at a prescribed position (other than zero) in a file, among other deficiencies, makes them unsuitable for direct use by a database system. Therefore, using these two utilities as the basis, we implemented our own dumper and loader. The dumper encodes a Lisp object into one or more 16-bit packets. The type of the object, and its value or other information about the object, are recorded in the first packet. Additional packets may follow to provide space for storing the object value if it cannot fit in the first packet. In general, objects are more compact in the disk format. A small integer, for example, can be encoded in a single packet. With these two utilities, we can place an object anywhere on a disk page according to our clustering scheme. More importantly, it allows us to update part of an object in place without extensive storage reorganization.
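As an illustration of the packet idea, a small integer can be packed into a single 16-bit packet. The field widths below are our assumption for the sketch, not ORION's actual format:

```c
#include <stdint.h>

/* Illustrative 16-bit packet: high 4 bits hold a type tag, low 12 bits
   hold the value when it fits; larger objects would spill into
   additional packets (format assumed, not ORION's). */

enum { VAL_BITS = 12 };
enum { T_SMALLINT = 1 };

uint16_t encode_smallint(int v)
{
    return (uint16_t)((T_SMALLINT << VAL_BITS) | (v & 0x0FFF));
}

int decode_type(uint16_t pkt)  { return pkt >> VAL_BITS; }
int decode_value(uint16_t pkt) { return pkt & 0x0FFF; }
```

The type tag in the first packet is what lets the loader reconstruct an object of the right type without consulting any external schema.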

3. Managing Objects in Memory

The Symbolics main memory, as presented to the users, is organized as a hierarchy [SYMB85c]. At the top level, the virtual memory is divided into a number of areas in which related objects are stored. Each area can have its own paging and garbage collection algorithms, thus enabling knowledgeable users to fine-tune management of main memory. An area is further divided into a number of regions, each of which contains data of the same representation type. On the basis of its representation type, a Lisp object occupies a number of 36-bit memory words. Each memory word consists of three fields: cdr-code, data type, and pointer or data value. (Cdr-code is a two-bit field for optimizing memory representation of lists.)

Virtual Memory Management

The Symbolics machine allows a user to have certain control over virtual memory management, including garbage collection and resource allocation. A resource is basically a data structure definition plus a collection of instances, which can be allocated to a user upon request. Of particular interest to us is the disk-array resource, which is required as block buffers for the FEP file system. The Symbolics resource manager keeps track of disk-arrays to make them re-usable. Data contained in a disk array, however, is lost once the disk array is returned to the resource manager. Thus, we have implemented our own page buffer manager, which keeps track of the buffers allocated and the identities of the pages in the buffers. Initially, page buffers are allocated from the disk-array resource. An allocated buffer is not released even when the page in it is no longer in use, so that the page may be re-used by others. Further, a buffer may be re-assigned to another page by the replacement algorithm. In any event, the buffers are under the control of the page buffer manager until the database system is shut down, at which time they are returned to the disk-array resource.
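The key property of the buffer manager described above, that a released buffer retains its page's identity so a later request for the same page is a hit, can be sketched as follows. The structure names and the (deliberately crude) replacement choice are illustrative assumptions:

```c
/* Toy model of a page buffer manager: buffers remember which page they
   hold even after being unfixed, so the page can be re-used; a free
   buffer may later be reassigned to another page. */
#include <assert.h>
#include <stdint.h>

#define NBUF 4
#define NO_PAGE (-1)

struct buf { int32_t page; int in_use; };

static struct buf pool[NBUF];

void buffers_init(void)
{
    for (int i = 0; i < NBUF; i++) { pool[i].page = NO_PAGE; pool[i].in_use = 0; }
}

/* Return the index of a buffer holding `page`, re-using a retained copy
   when possible; otherwise reassign the first unfixed buffer.  Returns -1
   if every buffer is fixed. */
int fix_page(int32_t page)
{
    int victim = -1;
    for (int i = 0; i < NBUF; i++) {
        if (pool[i].page == page) { pool[i].in_use = 1; return i; } /* hit */
        if (!pool[i].in_use && victim < 0) victim = i;
    }
    if (victim < 0) return -1;
    pool[victim].page = page;     /* reassigned by replacement */
    pool[victim].in_use = 1;
    return victim;
}

void unfix_page(int idx)          /* buffer kept; page identity retained */
{
    pool[idx].in_use = 0;
}
```

A real replacement algorithm would prefer buffers whose retained pages are least likely to be re-used (e.g., LRU); the first-free scan here is only to keep the sketch short.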

Physical Memory Management

The Symbolics operating system also allows users to participate to some extent in physical (real) memory management, in particular, the paging algorithm. An object or a section of the virtual memory space can be swapped in or out under user control. In addition, an object or a section of the virtual memory space can be wired down (locked) in physical memory. Wiring memory is necessary, for example, to pin down a disk array during an I/O operation. A user can also use this feature to "force" important or critical data structures to stay in physical memory to avoid performance degradation due to paging. We use this feature to wire certain important page buffers in physical memory.

4. Process Management and Synchronization

On the Symbolics machine, stack groups (primitive coroutines) are used to support multiprogramming. A stack group holds the state information, including a control stack and a binding stack, for its associated computation. A control stack is a stack of function calls. It keeps track of the current running function, its caller, and so on, and the return address of each function on the stack. A binding stack (environment stack) contains all the values saved by lambda-binding of special variables [STEE84]. At any time, the Symbolics machine performs a computation that is associated with a stack group. Control over the Symbolics machine can be passed (switched) from the current stack group to another stack group through resumption.

Processes on Symbolics are implemented with stack groups. A process can be either active (ready to run) or stopped (waiting for an event, such as the completion of an I/O operation). The active processes are managed by a special stack group, called the scheduler, which repeatedly cycles through the list of active processes and determines which process should run next. The current process, i.e., the process that is running, continues its computation until it decides to wait, or a system interrupt occurs. In either case, the scheduler is resumed to select another process to run.
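The scheduler's cycle through the active processes amounts to a round-robin scan of the process table. A toy model, with an invented fixed-size table and two-state processes:

```c
/* Round-robin selection over a process table: starting just past the
   current process, return the next ACTIVE entry, or -1 if all are
   stopped (the scheduler would idle until an event arrives). */
#include <assert.h>

#define NPROC 4
enum pstate { ACTIVE, STOPPED };

static enum pstate table[NPROC];

int next_process(int current)
{
    for (int i = 1; i <= NPROC; i++) {
        int cand = (current + i) % NPROC;
        if (table[cand] == ACTIVE) return cand;
    }
    return -1;
}
```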

All Symbolics processes share the same virtual address space. With the exception of process state information, which is kept in each individual stack group, all the Lisp data and global variables are shared among processes. The ability to share memory address space is important for efficient implementation of common data structures, such as the page buffer pool. Direct sharing of in-memory data structures, however, presents some synchronization problems. Mutual exclusion is necessary to protect the consistency of critical system data structures (tables). The Symbolics machine provides a special type of resource called locks, which are similar to semaphores. Concurrent database processes follow a simple locking protocol to synchronize with one another. When a database process needs to access a shared data structure, it must first acquire the lock on it. As soon as the process is done with the data structure, it must release the lock so that another process waiting for the same lock can proceed.
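The acquire-access-release protocol above can be sketched with POSIX mutexes standing in for Symbolics locks (an analogy only; the Symbolics primitive is not a pthread):

```c
/* Two threads of control updating a shared table entry under the simple
   locking protocol: acquire the lock before access, release it promptly
   after.  Without the lock, the increments would race. */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_count = 0;            /* a shared system-table entry */

static void *bump(void *arg)
{
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&table_lock);    /* acquire before access  */
        shared_count++;
        pthread_mutex_unlock(&table_lock);  /* release when done      */
    }
    return arg;
}

int run_two_processes(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_count;
}
```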

5. Conclusions

Building a database system on a Lisp machine is a unique experience. Although the basic architecture and implementation techniques for a database system are mostly unaffected by the Lisp environment, storing Lisp objects on secondary storage is a problem that requires special considerations. The need for transforming a Lisp object between the in-memory representation and the disk representation introduces an additional cost factor which must be weighed carefully to avoid potential performance problems. We plan to conduct a performance study to thoroughly evaluate the ORION system with a view to validating our design.

Acknowledgments

We thank Won Kim, our project leader, for his contribution and helpful comments on drafts of this paper. Other members of the project, Jay Banerjee and Darrell Woelk, have also contributed to the work reported here.

References

[BANE86] Banerjee, J., H.J. Kim, W. Kim, and H.F. Korth, "Schema Evolution in Object-Oriented Persistent Databases", in Proc. 6th Advanced Database Symposium, Tokyo, Japan, August 1986.

[BANE87] Banerjee, J., N. Ballou, H.T. Chou, J. Garza, W. Kim, and D. Woelk, "Database Support for Object-Oriented Applications", to appear in ACM Trans. on Office Information Systems, April 1987.

[CHOU86] Chou, H.T., and W. Kim, "A Unifying Framework for Version Control in a CAD Environment", in Proc. Int'l Conf. on Very Large Data Bases, August 1986, Kyoto, Japan.

[STEE84] Steele, Guy L. Jr., Scott E. Fahlman, Richard P. Gabriel, David A. Moon, and Daniel L. Weinreb, "Common Lisp", Digital Press, 1984.

[SYMB85a] Symbolics Inc., "User's Guide to Symbolics Computers", Symbolics Manual #996015, March 1985.

[SYMB85b] Symbolics Inc., "Program Development Utilities", Symbolics Manual #996045, February 1985.

[SYMB85c] Symbolics Inc., "Internals, Processes, and Storage Management", Symbolics Manual #996085, March 1985.

[WOEL86] Woelk, D., W. Kim, and W. Luther, "An Object-Oriented Approach to Multimedia Databases", in Proc. ACM SIGMOD Conf. on the Management of Data, May 1986, Washington, D.C.

The Camelot Project¹

Alfred Z. Spector, Joshua J. Bloch, Dean S. Daniels, Richard P. Draves, Dan Duchamp, Jeffrey L. Eppinger, Sherri G. Menees, Dean S. Thompson

Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA

Abstract

Camelot provides flexible and high performance transaction management, disk management, and recovery mechanisms that are useful for implementing a wide class of abstract data types, including large databases. To ensure that Camelot is accessible outside of the Carnegie Mellon environment, Camelot runs on the Unix-compatible Mach operating system and uses the standard Arpanet IP communication protocol. Camelot is being coded on RT PC's, is frequently tested on MicroVaxes, and will also run on various shared-memory multiprocessors. This paper describes Camelot's functions and internal structure.

1. Introduction

Distributed transactions are an important technique for simplifying the construction of reliable and available distributed applications. The failure atomicity, permanence, and serializability properties provided by transactions lessen the attention a programmer must pay to concurrency and failures [Gray 80, Spector and Schwarz 83]. Overall, transactions make it easier to maintain the consistency of distributed objects.

Many commercial transaction processing applications already use distributed transactions, for example, on Tandem's TMF [Helland 85]. We believe there are many more algorithms and applications that will benefit from transactions as soon as there is a widespread, general-purpose, and high performance transaction facility to support them. For example, there are a plethora of unimplemented distributed replication techniques that depend upon transactions to maintain invariants on the underlying replicas.

A few projects have developed systems that support distributed transaction processing on abstract objects. Argus, Clouds, and TABS [Liskov and Scheifler 83, Allchin and McKendry 83, Spector et al. 85, Spector 85] are a few examples. These systems permit users to define new objects and to use

¹ This work was supported by the Defense Advanced Research Projects Agency, ARPA Order No. 4976, monitored by the Air Force Avionics Laboratory under Contract F33615-84-K-1520, and the IBM Corporation.

The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of any of the sponsoring agencies or the US government.

them together within transactions. While the interfaces, functions, and implementation techniques of Argus, Clouds, and TABS are quite different, the projects' goals have been the same: to provide a common transactional basis for many abstractions with the ultimate goal of simplifying the construction of reliable distributed applications.

Building on the experience of these and other projects, we have designed and are now implementing an improved distributed transaction facility, called Camelot (Carnegie Mellon Low Overhead Transaction Facility). Camelot provides flexible and efficient support for distributed transactions on a wide variety of user-defined objects such as databases, files, message queues, and I/O objects. Clients of the Camelot system encapsulate objects within data server processes, which execute operations in response to remote procedure calls. Other attributes of Camelot include the following:

• Compatibility with standard operating systems. Camelot runs on Mach, a Berkeley 4.3 Unix™-compatible operating system [Accetta et al. 86]. Mach's Unix-compatibility makes Camelot easier to use and ensures that good program development tools are available. Mach's support for shared memory, message passing, and multiprocessors makes Camelot more efficient and flexible.

• Compatibility with Arpanet protocols. Camelot uses datagrams and Mach messages, both of which are built on the standard Arpanet IP network layer [Postel 82]. This will facilitate large distributed processing experiments.

• Machine-independent implementation. Camelot is intended to run on all the uniprocessors and multiprocessors that Mach will support. We develop Camelot on IBM RT PC's, but we frequently test it on DEC MicroVaxes and anticipate running it on multiprocessors such as the Encore and Sequent machines.

• Powerful functions. Camelot supports functions that are sufficient for many different abstract types. For example, Camelot supports both blocking and non-blocking commit protocols, nested transactions as in Argus, and a scheme for supporting recoverable objects that are accessed in virtual memory. (Section 2 describes Camelot's functions in more detail.)

• Efficient implementation. Camelot is designed to reduce the overhead of executing transactions. For example, shared memory reduces the use of message passing; multiple threads of control increase parallelism; and a common log reduces the number of synchronous stable storage writes. (Section 3 describes Camelot's implementation in more detail.)

• Careful software engineering and documentation. Camelot is being coded in C in conformance with careful coding standards [Thompson 86]. This increases Camelot's portability and maintainability and reduces the likelihood of bugs. The internal and external system interfaces are specified in the Camelot Interface Specification [Spector et al. 86], which is then processed to generate Camelot code. A user manual based on the specification will be written.

To reduce further the amount of effort required to construct reliable distributed systems, a companion project is developing a set of language facilities, called Avalon, which provide linguistic support for reliable applications [Herlihy and Wing 86]. Avalon encompasses extensions to C++, Common Lisp, and Ada, and automatically generates necessary calls on Camelot. Figure 1.1 shows the relationship of Camelot to Avalon and Mach.

[Figure 1.1 depicts a stack of system layers, top to bottom:]

Various Applications / Various Servers Encapsulating Objects
Avalon Language Facilities
Camelot Distributed Transaction Facility
Camelot Mods to Mach / Inter-node Communication
ARPANET IP Layer
Mach, Unix-compatible Operating System

Figure 1.1: Relationship of Camelot to Other System Layers

Mach executes on uniprocessor and multiprocessor hardware. Inter-node communication is logically layered on top of Mach. Camelot provides support for transaction processing, including certain additions to the communication layer. Avalon provides linguistic support for accessing Camelot and Mach. Users define servers encapsulating objects and applications that use those objects. Examples of servers are mail repositories, distributed file system components, and database managers.

One goal of the Camelot Project is certainly the development of Camelot; that is, a system of sufficient quality, performance, and generality to support not only our own, but others' development of reliable distributed applications. In building Camelot, we hope to demonstrate conclusively that general purpose transaction facilities are efficient enough to be useful in many domains. However, we are also developing new algorithms and techniques that may be useful outside of Camelot. These include an enhanced non-blocking commit protocol, a replicated logging service, and a facility for testing distributed applications. We also expect to learn much from evaluating Camelot's performance, particularly with respect to the performance speed-up on multiprocessors.

2. Camelot Functions

The most basic building blocks for reliable distributed applications are provided by Mach, its communication facilities, and the Matchmaker RPC stub generator [Accetta et al. 86, Cooper 86, Jones et al. 85]. These building blocks include processes, threads of control within processes, shared memory between processes, and message passing.

Camelot provides functions for system configuration, recovery, disk management, transaction management, deadlock detection, and reliability/performance evaluation². Most of these functions are specified in the Camelot Interface Specification and are part of Camelot Release 1. Certain more advanced functions will be added to Camelot for Release 2.

2.1. Configuration Management

Camelot supports the dynamic allocation and deallocation of both new data servers and the recoverable storage in which data servers store long-lived objects. Camelot maintains configuration data so that it can restart the appropriate data servers after a crash and reattach them to their recoverable storage. These configuration data are stored in recoverable storage and updated transactionally.

2.2. Disk Management

Camelot provides data servers with up to 2⁴⁸ bytes of recoverable storage. With the cooperation of Mach, Camelot permits data servers to map that storage into their address space, though data servers must call Camelot to remap their address space when they overflow 32-bit addresses. To simplify the allocation of contiguous regions of disk space, Camelot assumes that all allocation and deallocation requests for space are coarse (e.g., in megabytes). Data servers are responsible for doing their own microscopic storage management.
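Coarse, megabyte-granularity allocation of contiguous regions can be sketched with a simple extent bitmap. The extent size, table size, and function names here are illustrative, not Camelot's actual scheme:

```c
/* Coarse region allocator: recoverable storage is handed out only in
   whole-megabyte extents, so the free-space map stays tiny and regions
   are always contiguous on "disk". */
#include <assert.h>

#define MB (1024L * 1024L)
#define NEXTENTS 64                 /* a 64 MB toy disk */

static unsigned char used[NEXTENTS];

/* Allocate `n_mb` contiguous megabyte extents; returns the byte offset
   of the region, or -1 if no contiguous run is free. */
long alloc_region(int n_mb)
{
    for (int start = 0; start + n_mb <= NEXTENTS; start++) {
        int ok = 1;
        for (int i = 0; i < n_mb; i++)
            if (used[start + i]) { ok = 0; break; }
        if (ok) {
            for (int i = 0; i < n_mb; i++) used[start + i] = 1;
            return (long)start * MB;
        }
    }
    return -1;
}

void free_region(long off, int n_mb)
{
    for (int i = 0; i < n_mb; i++) used[off / MB + i] = 0;
}
```

Anything finer than an extent (the "microscopic" management the text mentions) stays the data server's problem, inside the regions it was granted.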

So that operations on data in recoverable storage can be undone or redone after failures, Camelot provides data servers with logging services for recording modifications to objects. Camelot automatically coordinates paging of recoverable storage to maintain the write-ahead log invariant [Eppinger and Spector 85].
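The write-ahead log invariant mentioned above can be stated as a one-line check: a dirty page may be written to its home location only after every log record describing it has reached stable storage. A sketch, with invented LSN bookkeeping:

```c
/* Write-ahead log invariant: page_lsn <= flushed_lsn before page-out.
   If the check fails, the pager must first force the log through the
   page's LSN (what Camelot's disk manager coordinates with Mach). */
#include <assert.h>
#include <stdint.h>

static uint64_t flushed_lsn = 0;        /* highest LSN forced to the log */

void log_force(uint64_t lsn)            /* force the log through `lsn` */
{
    if (lsn > flushed_lsn) flushed_lsn = lsn;
}

/* Returns 1 if a page whose latest update has LSN `page_lsn` may be
   written back to its home location, 0 otherwise. */
int may_write_page(uint64_t page_lsn)
{
    return page_lsn <= flushed_lsn;
}
```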

2.3. Recovery Management

Camelot's recovery functions include transaction abort, and server, node, and media-failure recovery. To support these functions, Camelot Release 1 provides two forms of write-ahead value logging: one form in which only new values are written to the log, and a second form in which both old values and new values are written. New value logging requires less log space, but results in increased paging for long running transactions. This is because pages cannot be written back to their home location until a transaction commits. Camelot assumes that the invoker of a top-level transaction knows the approximate length of his transaction and specifies the type of logging accordingly.
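The trade-off between the two value-logging forms can be made concrete: an old/new record supports both undo and redo, while a new-value-only record supports redo alone, which is why its dirty pages must stay pinned until commit. A sketch with an invented record layout:

```c
/* Two value-log record shapes: old/new records can undo or redo a
   modification; new-value-only records can only redo it. Scalar values
   stand in for the arbitrary byte ranges a real log record would carry. */
#include <assert.h>
#include <stdint.h>

struct log_rec {
    int has_old;                  /* 0: new-value-only, 1: old/new */
    int32_t old_val, new_val;
};

int32_t redo(const struct log_rec *r) { return r->new_val; }

/* Undo is possible only when the old value was logged; returns 0 and
   leaves *out untouched for a new-value-only record. */
int undo(const struct log_rec *r, int32_t *out)
{
    if (!r->has_old) return 0;
    *out = r->old_val;
    return 1;
}
```

An uncommitted transaction's page can be flushed under old/new logging (abort will undo it from the log), but under new-value-only logging there is nothing to undo with, hence the extra paging for long transactions.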

The two logging protocols used in Camelot are based on the old value/new value recovery technique used in TABS [Spector 85] and described by Schwarz [Schwarz 84]. However, they have been extended to support aborts of nested transactions, new value recovery, and the logging of arbitrary regions of memory.

² Synchronization mechanisms for preserving serializability are distributed among data servers; Camelot supports servers that perform either locking or hybrid atomicity [Herlihy 85]. This synchronization is commonly implemented with the assistance of Avalon's runtime support.

Camelot writes log data to locally duplexed storage or to storage that is replicated on a collection of dedicated network log servers [Daniels et al. 86]. In some environments, the use of a shared network logging facility could have survivability, operational, performance, and cost advantages. Survivability is likely to be better for a replicated logging facility because it can tolerate the destruction of one or more entire processing nodes. Operational advantages accrue because it is easier to manage high volumes of log data at a small number of logging nodes, rather than at all transaction processing nodes. Performance might be better because shared facilities can have faster hardware than could be afforded for each processing node. Finally, providing a shared network logging facility would be less costly than dedicating duplexed disks to each processing node, particularly in workstation environments.

Release 2 of Camelot will support an operation (or transition) logging technique in which type implementors can log non-idempotent undo and redo operations. This type of logging increases the feasible concurrency for some types and reduces the amount of log space that they require.

2.4. Transaction Management

Camelot provides facilities for beginning new top-level and nested transactions and for committing and aborting them. Two options exist for commit: Blocking commit may result in data that remains locked until a coordinator is restarted or a network is repaired. Non-blocking commit, though more expensive in the normal case, reduces the likelihood that a node's data will remain locked until another node or network partition is repaired. In addition to these standard transaction management functions, Camelot provides an inquiry facility for determining the status of a transaction. Data servers and Avalon need this to support lock inheritance.

2.5. Support for Data Servers

The Camelot library packages all system interfaces and provides a simple locking mechanism. It also contains routines that perform the generic processing required of all data servers. This processing includes participating in two-phase commit, handling undo and redo requests generated after failures, responding to abort exceptions, and the like. The functions of this library are subsumed by Avalon's more ambitious linguistic support.

2.6. Deadlock Detection

Clients of Camelot Release 1 must depend on time-out to detect deadlocks. Release 2 will incorporate a deadlock detector and export interfaces for servers to report their local knowledge of wait-for graphs. We anticipate that implementing deadlock detection for arbitrary abstract types in a large network environment like the Arpanet will be difficult.

2.7. Reliability and Performance Evaluation

Camelot Release 2 will contain a facility for capturing performance data, generating and distributing workloads, and inserting (simulated) faults. These capabilities will help us analyze, tune, and validate Camelot and benefit Camelot's clients as they analyze their distributed algorithms. The information returned by the facility could also be used to provide feedback for applications that dynamically tune themselves. We believe that, when properly designed, a reliability and performance evaluation facility will prove as essential for building large distributed applications as source-level debuggers are essential for traditional programming.

The reliability and performance evaluation facility has three parts. The first captures performance data and permits clients to gauge critical performance metrics, such as the number of messages, page faults, deadlocks, and transactions/second. Certain information is application-independent, but other useful information depends on the nature of the application. Therefore, the performance evaluation facility will be extensible and capture application-specific data from higher level components. Once information is obtained from various nodes on the system, the facility combines and presents it to system implementors or feeds it back to applications for use in dynamic tuning.

The second part of the performance and reliability evaluation facility permits the distribution of applications (or workloads) on the system. When many nodes are involved in a workload, this task can be very difficult unless it is possible to specify the nodes and workloads from a single node. We have built a prototype facility of this type for TABS, and we will extend it for use on Camelot.

The third part permits simulated faults to be inserted according to a pre-specified distribution. This is crucial for understanding the behavior of a system in the presence of faults. For example, the low-level communication software may be instructed to lose or reorder datagrams with a pre-specified probability. Or, a pair of nodes could greatly raise network utilization to probe the effects of contention.

2.8. Miscellaneous Functions

Camelot provides both a logical clock [Lamport 78] and a synchronized real-time clock. These clocks are useful, for example, to support hybrid atomicity [Herlihy 85] and replication using optimistic timestamps [Bloch 86]. Camelot also extends the Mach naming service to support multiple servers with the same name. This is useful to support replicated objects.
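A logical clock of the kind cited here [Lamport 78] fits in a few lines: each local event ticks the clock, and a message receipt first advances the clock past the sender's timestamp. This sketch is a generic Lamport clock, not Camelot's actual clock interface:

```c
/* Minimal Lamport logical clock: timestamps respect the happened-before
   relation, so an event that receives a message is ordered after the
   event that sent it. */
#include <assert.h>
#include <stdint.h>

struct lclock { uint64_t t; };

uint64_t tick(struct lclock *c)                 /* local or send event */
{
    return ++c->t;
}

uint64_t recv(struct lclock *c, uint64_t msg_t) /* message-receipt event */
{
    if (msg_t > c->t) c->t = msg_t;             /* jump past the sender */
    return ++c->t;
}
```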

3. Camelot Implementation

The major functions of Camelot and their logical relationships are illustrated in Figure 3-1. Disk management and recovery management are at the base of Camelot's functions. Both activities are local to a particular node, except that recovery may require communication with the network logging service. Deadlock detection and transaction management are distributed activities that assume underlying disk management and node recovery facilities. Communication protocols and reliability and performance evaluation are implemented within many levels of the system. The library support for data servers rests on top of these functions.

[Figure 3-1 depicts a stack of logical components, top to bottom:]

Data Server Library Support
Deadlock Detection
Transaction Management
Recovery Management
Disk Management

[Communication and reliability & performance evaluation appear as columns spanning all levels.]

Figure 3-1: Logical Components of Camelot

This figure describes the logical structure of Camelot. Camelot is logically hierarchical, except that communication and reliability and performance evaluation functions span multiple levels.

All of Camelot except the library routines is implemented by a collection of Mach processes, which run on every node. Each of these processes is responsible for supporting a particular collection of functions. Processes use threads of control internally to permit parallelism. Calls to Camelot (e.g., to begin or commit a transaction) must be directed to a particular Camelot process. Some frequently called functions such as log writes are invoked by writing to memory queues that are shared between a data server and a Camelot process. Other functions are invoked using messages that are generated by Matchmaker.

Figure 3-2 shows the seven processes in Release 1 of Camelot³: master control, disk manager, communication manager, recovery manager, transaction manager, node server, and node configuration application.

• Master Control. This process restarts Camelot after a node failure.

• Disk Manager. The disk manager allocates and deallocates recoverable storage; accepts and writes log records locally; and enforces the write-ahead log invariant. For log records that are to be written to the distributed logging service, the disk manager works with dedicated servers on the network. Additionally, the disk manager writes pages to/from the disk when Mach needs to service page faults on recoverable storage or to clean primary memory. Finally, it performs checkpoints to limit the amount of work during recovery and works closely with the recovery manager when failures are being processed.

³ Camelot Release 2 will use additional processes to support deadlock detection and reliability and performance evaluation.

• Communication Manager. The communication manager forwards inter-node Mach messages, and provides the logical and physical clock services. In addition, it knows the format of messages and keeps a list of all the nodes that are involved in a particular transaction. This information is provided to the transaction manager for use during commit or abort processing. Finally, the communication manager provides a name service that creates communication channels to named servers. (The transaction manager and distributed logging service use IP datagrams, thereby bypassing the communication manager.)

[Figure 3-2 depicts, from top to bottom: applications (including the node configuration application) and data servers; the Camelot system components (recovery manager, transaction manager, disk manager, communication manager, and master control); and the Mach kernel.]

Figure 3-2: Processes in Camelot Release 1

This figure shows the Mach kernel and the processes that are needed to execute distributed transactions. The node server is both a part of Camelot and a Camelot data server, because it is the repository of essential configuration data. Other data servers and applications use the facilities of Camelot and Mach. The node configuration application permits users to exercise control over a node's configuration.

• Recovery Manager. The recovery manager is responsible for transaction abort, server recovery, node recovery, and media-failure recovery. Server and node recovery respectively require one and two backward passes over the log.

• Transaction Manager. The transaction manager coordinates the initiation, commit, and abort of local and distributed transactions. It fully supports nested transactions.

• Node Server. The node server is the repository of configuration data necessary for restarting the node. It stores its data in recoverable storage and is recovered before other servers.

• Node Configuration Application. The node configuration application permits Camelot's human users to update data in the node server and to crash and restart servers.

The organization of Camelot is similar to that of TABS and R* [Spector 85, Lindsay et al. 84].

Structurally, Camelot differs from TABS in the use of threads, shared memory interfaces, and the combination of logging and disk management in the same process. Many low-level algorithms and protocols have also been changed to improve performance and provide added functions. Camelot differs from R* in its greater use of message passing and support for common recovery facilities for servers. Of course, the functions of the two systems are quite different; R*'s transactions are intended primarily to support a particular relational database system.

4. Discussion

As of December 1986, Camelot 1 was still being coded, though enough (about 20,000 lines of C) was functioning to commit and abort local transactions. Though many pieces were still missing (e.g., support for stable storage and distribution), Avalon developers could begin their implementation work. Before we begin adding to the basic set of Camelot 1 functions, we will encourage others to port abstractions to Camelot, so that we can get feedback on its functionality and performance.

Performance is a very important system goal. Experience with TABS and very preliminary performance numbers make us believe that we will be able to execute roughly 20 non-paging write transactions/second on an RT PC or MicroVax workstation. Perhaps it is worthwhile to summarize why the Camelot/Mach combination should have performance that even database implementors will like:

• Mach's support for multiple threads of control per process permits efficient server organizations and the use of multiprocessors. Shared memory between processes permits efficient inter-process synchronization.

• Disk I/O should be efficient, because Camelot allocates recoverable storage contiguously on disk, and because Mach permits it to be mapped into a server's memory. Also, servers that know disk I/O patterns, such as database managers, can influence the page replacement algorithms by providing hints for prefetching or prewriting.

• Recovery adds little overhead to normal processing because Camelot uses write-ahead logging with a common log. Though Camelot Release 1 has only value-logging, operation-logging will be provided in Release 2.

• Camelot has an efficient, datagram-based, two-phase commit protocol in addition to its non-blocking commit protocol. Even without delaying commits to reduce log forces ("group commit"), transactions require only one log force per node per transaction. Camelot requires just three datagrams per node per transaction in its star-shaped commit protocol, because final acknowledgments are piggy-backed on future communication. Camelot also has the usual optimizations for read-only transactions.

• Camelot does not implement the synchronization needed to preserve serializability. This synchronization is left to servers (and/or Avalon), which can apply semantic knowledge to provide higher concurrency or to reduce locking overhead.

Today, we would guess that Camelot's initial bottlenecks will be low-level disk code and the remaining message passing. For example, though the frequent calls by servers to Camelot are asynchronous and via shared memory, all operations on servers are invoked via messages using the RPC stub generator. To further reduce message passing overhead, we might have to substitute a form of protected procedure call. This should not change Camelot very much since all inter-process communication is already expressed with procedure call syntax.

In the course of our implementation and the subsequent performance evaluation, we expect to learn much about large reliable distributed systems. Once Camelot is functioning, we plan to perform extensive experimentation on multiprocessors and distributed systems with a large number of nodes. In particular, we will measure the actual availability and performance of various replication techniques.

Our overall goal remains to demonstrate that transaction facilities can be sufficiently general and efficient to support a wide range of distributed programs. We are getting closer to achieving this goal, but much work remains.

Acknowledgments

Thanks to our colleagues Maurice Herlihy and Jeannette Wing for their advice, particularly as it relates to support for Avalon.

References

[Accetta et al. 86] Mike Accetta, Robert Baron, William Bolosky, David Golub, Richard Rashid, Avadis Tevanian, Michael Young. Mach: A New Kernel Foundation for UNIX Development. In Proceedings of Summer Usenix. July, 1986.

[Allchin and McKendry 83] James E. Allchin, Martin S. McKendry. Facilities for Supporting Atomicity in Operating Systems. Technical Report GIT-CS-83/1, Georgia Institute of Technology, January, 1983.

[Bloch 86] Joshua J. Bloch. A Practical, Efficient Approach to Replication of Abstract Data Objects. November, 1986. Carnegie Mellon Thesis Proposal.

[Cooper 86] Eric C. Cooper. C Threads. June, 1986. Carnegie Mellon Internal Memo.

[Daniels et al. 86] Dean S. Daniels, Alfred Z. Spector, Dean Thompson. Distributed Logging for Transaction Processing. Technical Report CMU-CS-86-106, Carnegie-Mellon University, June, 1986.

[Eppinger and Spector 85] Jeffrey L. Eppinger, Alfred Z. Spector. Virtual Memory Management for Recoverable Objects in the TABS Prototype. Technical Report CMU-CS-85-163, Carnegie-Mellon University, December, 1985.

[Gray 80] James N. Gray. A Transaction Model. Technical Report RJ2895, IBM Research Laboratory, San Jose, California, August, 1980.

[Helland 85] Pat Helland. Transaction Monitoring Facility. Database Engineering 8(2):9-18, June, 1985.

[Herlihy 85] Maurice P. Herlihy. Availability vs. atomicity: concurrency control for replicated data. Technical Report CMU-CS-85-108, Carnegie-Mellon University, February, 1985.

[Herlihy and Wing 86] M. P. Herlihy, J. M. Wing. Avalon: Language Support for Reliable Distributed Systems. Technical Report CMU-CS-86-167, Carnegie Mellon University, November, 1986.

[Jones et al. 85] Michael B. Jones, Richard F. Rashid, Mary R. Thompson. Matchmaker: An Interface Specification Language for Distributed Processing. In Proceedings of the Twelfth Annual Symposium on Principles of Programming Languages, pages 225-235. ACM, January, 1985.

[Lamport 78] Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM 21(7):558-565, July, 1978.

[Lindsay et al. 84] Bruce G. Lindsay, Laura M. Haas, C. Mohan, Paul F. Wilms, Robert A. Yost. Computation and Communication in R*: A Distributed Database Manager. ACM Transactions on Computer Systems 2(1):24-38, February, 1984.

[Liskov and Scheifler 83] Barbara H. Liskov, Robert W. Scheifler. Guardians and Actions: Linguistic Support for Robust, Distributed Programs. ACM Transactions on Programming Languages and Systems 5(3):381-404, July, 1983.

[Postel 82] Jonathan B. Postel. Internetwork Protocol Approaches. In Paul E. Green, Jr. (editor), Computer Network Architectures and Protocols, chapter 18, pages 511-526. Plenum Press, 1982.

[Schwarz 84] Peter M. Schwarz. Transactions on Typed Objects. PhD thesis, Carnegie-Mellon University, December, 1984. Available as Technical Report CMU-CS-84-166, Carnegie-Mellon University.

[Spector 85] Alfred Z. Spector. The TABS Project. Database Engineering 8(2):19-25, June, 1985.

[Spector and Schwarz 83] Alfred Z. Spector, Peter M. Schwarz. Transactions: A Construct for Reliable Distributed Computing. Operating Systems Review 17(2):18-35, April, 1983. Also available as Technical Report CMU-CS-82-143, Carnegie-Mellon University, January 1983.

[Spector et al. 86] Alfred Z. Spector, Dan Duchamp, Jeffrey L. Eppinger, Sherri G. Menees, Dean S. Thompson. The Camelot Interface Specification. September, 1986. Camelot Working Memo 2.

[Spector et al. 85] Alfred Z. Spector, Dean S. Daniels, Daniel J. Duchamp, Jeffrey L. Eppinger, Randy Pausch. Distributed Transactions for Reliable Systems. In Proceedings of the Tenth Symposium on Operating System Principles, pages 127-146. ACM, December, 1985. Also available in Concurrency Control and Reliability in Distributed Systems, Van Nostrand Reinhold Company, New York, and as Technical Report CMU-CS-85-117, Carnegie-Mellon University, September 1985.

[Thompson 86] Dean Thompson. Coding Standards for Camelot. June, 1986. Camelot Working Memo 1.

Getting the Operating System Out of the Way

J. Eliot B. Moss

Department of Computer and Information Science, University of Massachusetts, Amherst, Massachusetts 01003

Abstract. In this note I outline what fundamental support database systems need from their operating environment. I argue that general purpose operating systems do not provide the necessary support, or at least not in a very good way, and that they provide a number of things that mainly get in the way of a database manager. In response, I recommend a somewhat radical approach: building the database system in a programming language on top of a minimal kernel consisting of the language run time system. This technique in essence does away with the operating system, replacing it with a programming language. While not feasible for use in systems that must support general purpose computing, this should be an effective way to build efficient, dedicated database management systems.

1 What Databases Need.

First, I will outline the basic needs of database managers, which I categorize as support for input/output, multiprogramming, and memory management. The primary need is configuration and device independence, with interfaces for the devices being simple while allowing very efficient use.

Input/Output. Database systems obviously require support for input/output. They need I/O to be efficient, and they need to be in a position to take advantage of whatever performance enhancing features a device offers. However, it is clearly useful for the database manager to be shielded from a number of device details. While the “topology” of disks (number of cylinders, tracks per cylinder, sectors per track, etc.) may be useful in some circumstances, access to fixed size blocks by sequential block number is probably adequate most of the time. The assumption made is that block numbers are related to physical position: reading sequential blocks should be fast, moving from one block to another will be faster if the blocks’ numbers are close, etc. Efficient reading of a number of blocks at a time should be supported (full track read/write), and hints concerning good clustering parameters should be available (these would be derived from disk topology).

A database system needs asynchronous (nonblocking) I/O, directly to and from the memory it manages for its buffers. Low level error handling should be available to mask most I/O errors, through retry, disk sector mapping, etc., reflecting the error to the database system only if it is unlikely that retry will help matters.
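A minimal sketch, in C, of the nonblocking interface just described. All names (io_request, io_submit_read, io_poll) are hypothetical, and the “device” is a simulated in-memory disk whose requests complete at the next poll; a real kernel would start the transfer at submit time and mark completion from an interrupt handler.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical nonblocking block-I/O interface: the database manager
   supplies its own buffer and a sequential block number; the request
   completes asynchronously and is checked (or signalled) later. */

#define BLOCK_SIZE  512
#define DISK_BLOCKS 64

enum io_status { IO_PENDING, IO_DONE, IO_ERROR };

struct io_request {
    unsigned block;          /* sequential block number            */
    void *buffer;            /* buffer owned by the data manager   */
    enum io_status status;   /* set to IO_DONE on completion       */
};

/* Simulated disk; a real implementation would talk to a driver. */
static char disk[DISK_BLOCKS][BLOCK_SIZE];

/* Submit a read; returns immediately with status IO_PENDING. */
void io_submit_read(struct io_request *req)
{
    req->status = IO_PENDING;
    /* In this sketch the "device" completes at the next poll. */
}

/* Check for completion; in a real system an interrupt would drive this. */
void io_poll(struct io_request *req)
{
    if (req->status == IO_PENDING) {
        if (req->block < DISK_BLOCKS) {
            memcpy(req->buffer, disk[req->block], BLOCK_SIZE);
            req->status = IO_DONE;
        } else {
            /* reflected to the data manager only when retry cannot help */
            req->status = IO_ERROR;
        }
    }
}
```

The key property is that submission never blocks, so a lightweight process can issue the request and yield until completion.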

While it seems reasonable, though not trivial, to provide the necessary support for mass storage such as disks and tapes, device independent communications and terminal I/O are of equal importance, and more difficult to achieve. In network communications, database managers seem to need two levels of functionality: essentially packet level I/O, useful for commit protocols and the like, and virtual circuits (one way at least) for bulk data transfer. Other paradigms might include remote procedure call. While the best high level model might be an issue for debate, the database manager still needs efficient, but simple to use (i.e., messy details are hidden), communications support. As with mass storage, error masking should be provided. The database manager should also be shielded as much as possible from the vagaries of addressing, packet formats, protocol standards, etc.

Communication with interactive terminals pretty much falls into the same category, as far as communications goes, but raises the additional problem of dealing with visual formatting so that terminal/workstation features are used to best advantage without lots of special case code in the database manager. A number of packages already exist that assist in this endeavor, but there is no easy way around having to support terminals of vastly different capabilities. However, minor differences among similar devices, as well as low level details, should be masked. Database systems have an advantage over more general systems in that they typically do rather stylized interactions (forms, etc.), which, while varying in detail from terminal to terminal, share most of the top level functionality. See the next section for further remarks concerning I/O.

Multiprogramming. Some support for multiprogramming is clearly necessary, since the sort of databases being discussed are shared. The underlying system must provide means for creating, scheduling, destroying, etc., threads of control. However, these threads are under the management of the database system. As with I/O, what is needed is a clean collection of primitives from which the database manager can fashion whatever it requires.

The appropriate kind of multiprogramming support is lightweight processes sharing a single address space. A number of processes may be managing I/O, some performing actual data manipulation and computation, and others doing housekeeping functions. Processes should be cheap. For example, it should be reasonable to fork one for each incoming query and to let it die when the response has been sent. It is unreasonable to require every logged-on terminal to have an associated process, since processes require a stack, etc., which is too much space to devote to every user in a large system. Process switching should be very efficient, too, so that lightweight processes can be used as “interrupt handlers” — that is, they can block waiting for I/O completion.

Input/output and other operating system features should be directly incorporated into the database manager. By this I mean that these features are only a procedure call away, not an operating system supervisor call or context switch. If lightweight processes can be used, we can revert to logically synchronous I/O: the lightweight process will block until completion, and another lightweight process will be chosen to run in the meantime.

The database manager needs to have control over the scheduling algorithm and parameters, such as time slices, priorities, and so on. It may be sufficient to offer a priority based scheduler that does round robin scheduling of processes at equal priority. One also needs to be able to perform critical sections. This can be made independent of database concurrency control to some extent, but it is necessary to prevent preemption during critical sections. The support code should not assume that it knows about all the necessary critical sections. Protecting critical sections must be very efficient (i.e., add little overhead to locking routines that are perhaps less than 100 instructions long).
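Because both the lightweight processes and the scheduler belong to the database system, critical sections can be made almost free. A hedged sketch, assuming a uniprocessor and a database-controlled scheduler (all names invented): mutual exclusion becomes a preemption-disable count, a few instructions per enter/exit rather than a system call.

```c
/* Sketch of cheap critical sections for lightweight processes: since
   all processes share one address space and the scheduler is part of
   the database system, mutual exclusion on a uniprocessor can be a
   preemption-disable count rather than a kernel call. */

static int preempt_disable_count = 0;

void enter_critical(void) { preempt_disable_count++; }

void exit_critical(void)  { preempt_disable_count--; }

/* The (database-controlled) scheduler consults this before switching. */
int preemption_allowed(void) { return preempt_disable_count == 0; }
```

Nesting comes for free from the count, so a locking routine a few dozen instructions long pays only two increments of overhead.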

Memory Management. This breaks down into two areas: management of memory containing code, and management of data space. Management of data space should generally not be demand paged in the sense used with timesharing systems. Most global data should not be eligible for paging, and that which is should probably have priorities associated with it, according to the priorities of processes that have used or expect to use the data. That is, it should be possible to give strong hints about the importance of various pieces of data as a process executes. This will ensure that data is paged in with maximum efficiency, and not paged out when it is likely to be needed again. It would be up to the system designer to ensure that this scheme would not starve processes. However, high performance systems clearly depend on having enough real memory available to minimize paging.
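The hint scheme above might look like the following sketch: the database attaches a priority to each resident page, and replacement simply evicts the lowest-priority unpinned page. The structure, names, and linear scan are illustrative assumptions, not a real virtual memory interface.

```c
/* Sketch of hint-driven page residency: the database supplies a
   priority for each resident page; replacement evicts the
   lowest-priority page that is not pinned. */

#define NPAGES 8

struct page {
    int in_use;
    int pinned;     /* absolutely not evictable (e.g., I/O in progress) */
    int priority;   /* hint from the database: higher = keep longer     */
};

static struct page pages[NPAGES];

/* The database hints at the importance of a resident page. */
void page_hint(int slot, int priority) { pages[slot].priority = priority; }

/* Choose a replacement victim, or -1 if everything is pinned. */
int choose_victim(void)
{
    int victim = -1;
    for (int i = 0; i < NPAGES; i++) {
        if (!pages[i].in_use || pages[i].pinned)
            continue;
        if (victim < 0 || pages[i].priority < pages[victim].priority)
            victim = i;
    }
    return victim;
}
```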

Relatively automatic management of code would be helpful. It could be based on hinting, when starting to execute some major piece of code, what the piece of code is and the importance of the activity that will be performed by executing it. When the process is done, it should also note that. Major pieces of code will probably not be eligible for paging, and would be managed by the just described method, which is somewhere between overlays and demand paging in its properties.

It is not clear to what extent code and data paging, as well as I/O buffer space management, should be tied together or separated. If the priority schemes are coherent, it may be best to tie all the memory management together, to gain best utilization of real memory. The point is to provide some minimal functionality, but to allow the database system substantial control over policy, through parameters and hints. It might even be reasonable to let it decide the page replacement policy, by having the page fault handler call a database system supplied subroutine for choosing replacement victims. That is harder to use in some ways, but fully general.
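The “fully general” variant might be sketched as a callback: the fault handler asks a database-supplied routine for the victim frame. Everything here (install_victim_chooser, handle_page_fault, the round-robin example policy) is a hypothetical illustration, not a real kernel API.

```c
/* Sketch of the fully general option: the kernel's page-fault path
   calls a database-supplied routine to choose the replacement victim. */

typedef int (*victim_fn)(void);

static victim_fn choose_replacement = 0;  /* installed by the database */
static int last_evicted = -1;

void install_victim_chooser(victim_fn f) { choose_replacement = f; }

/* Kernel-side fault path (much simplified): pick a frame to reuse. */
int handle_page_fault(int faulting_page)
{
    int frame = choose_replacement ? choose_replacement() : 0;
    last_evicted = frame;
    (void)faulting_page;   /* the page would be mapped into 'frame' here */
    return frame;
}

/* Example policy the database might supply: round-robin over 4 frames. */
static int rr_next = 0;
int round_robin_chooser(void)
{
    int f = rr_next;
    rr_next = (rr_next + 1) % 4;
    return f;
}
```

The database can thus implement any policy it likes, at the cost of being invoked on every fault.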

2 What Databases Do Not Need.

In this section I deliver my diatribe against general purpose operating systems that attempt to support database systems, too. The main points of the argument are that in trying to support databases, a general purpose operating system provides features that are too complex, never quite right, rather system specific, and expensive to use. Hence, many database system developers spend much of their time trying to figure out how to game the operating system just right, or how to get it out of their way.

Multiprogramming. In this area operating systems typically provide large, expensive processes, primarily because they are trying to protect the system, as well as other processes, from processes that go haywire, or try to access protected data, etc. If we are prepared to dedicate the computer to the database application, then the protection issue goes away. The database system will almost certainly be of high privilege in a timesharing environment and will itself control user access to the database in ways more sophisticated than the operating system can, so there is not much point in protecting the database from itself. There is an increased risk that bugs in the database system itself will be a problem, but high performance database systems use shared memory already, so I do not think this can be considered a major problem. Getting rid of the protection (and hence context switching) boundaries greatly simplifies things, as well as substantially improving performance. For example, getting at locks, buffers, internal tables, etc., can be done directly rather than having to call across an interface. We can do things in microseconds rather than milliseconds. Using a single address space with no per-process protection makes lightweight processes more practical as well. In sum, to support databases, the operating system should not provide elaborate protection and security features — get the fences out of the way for the database system.

Memory management. Straight demand paged virtual memory does not help a database system very much: it needs control over when data is written to disk, so as to implement recovery protocols. Further, page replacement algorithms suitable for general use are not really appropriate, because the database system knows a lot about how data and code are used. Much better performance can be had by letting the database system have substantial control in this area.

Input/output. Database systems may not be able to take advantage of elaborate access methods that come with an operating system. The problem is they probably will not quite match what the database system wants to do. Besides, the database system will provide more variety and sophistication in this area, so it is probably not worth the effort to provide halfhearted measures.

Concurrency control. If locking is an operating system call away, the database system will not use it. The point here is that good locking code is perhaps ten to one thousand instructions long; I have heard that on typical superminicomputers 100 microseconds is an average lock acquisition time (for an available lock). An operating system probably cannot even dispatch the supervisor call that fast, and its data structure probably will not be quite right either. This relates back to multiprogramming: the operating system should provide the basic mechanisms, simple and cheap, and let the database system worry about policy.

Transaction features. While there has been considerable interest of late in the use of transactions to help build operating systems, it would probably be a mistake to build a single transaction mechanism to serve them both. Transactions might indeed help structure a reliable, distributed operating system, just as they are a powerful and useful concept for databases, but the objects being manipulated are probably rather different. As with concurrency control, the operating system will almost certainly not provide exactly the right thing. Worse, having provided a mechanism, the operating system may then preempt the ability to do something different, leaving the database system designer up the creek.

In sum, operating systems designers should not second-guess database system designers, nor preempt their ability to build more sophisticated capabilities.

3 The Language Based Approach.

Being something of a language designer and implementer by trade, I naturally favor designing and then using an appropriate language for building database systems of the kind I have been describing. In fact, one way to achieve some measure of portability is to design the required primitives into the language. To some extent all I have done is turn an operating system kernel implementation problem into a language run time system implementation problem. However, the language approach has the additional power of being able to shove various things onto a compiler and sometimes take care of them at compile time. A typical example of this is checking various properties of arguments in “system calls” (which turn into procedure calls or invocations of builtin primitives).

The sort of language I am describing should not be taken as being at as high a level as a database query language. Rather, we are talking about a systems implementation language. To a certain extent, Ada, Modula, and Mesa have the right flavor. They all provide control over the bits, have lightweight processes, and offer some inter-process synchronization and communication primitives. Since Mesa has been used successfully to build the operating system Cedar, it might even be a reasonable candidate.

Let us consider more specifically some attributes desirable in a database implementation language. Foremost, I would say, is a good facility for data abstraction and code modularization. Data abstraction is one of the best ways to hide hardware dependencies. For example, a data abstraction for page allocation, deallocation, and access can take care of mapping down to disk cylinders and tracks, optimizing placement of related data (when given hints from invokers of the abstraction), etc. It could also help ensure that data are written back to the right place, by identifying pages, and, in conjunction with a memory buffer abstraction, associating database pages with main memory buffers.
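Such a page abstraction might be sketched as follows: callers deal only in sequential page numbers, while the abstraction maps them onto a disk geometry and can answer clustering questions. The geometry constants and function names are made up for illustration.

```c
/* Sketch of a page abstraction that hides disk topology: callers use
   sequential page numbers; the abstraction maps them to (cylinder,
   track, sector) so that consecutive pages land close together. */

#define SECTORS_PER_TRACK 17
#define TRACKS_PER_CYL     8

struct disk_addr { int cylinder, track, sector; };

struct disk_addr page_to_disk(int page)
{
    struct disk_addr a;
    a.sector   = page % SECTORS_PER_TRACK;
    a.track    = (page / SECTORS_PER_TRACK) % TRACKS_PER_CYL;
    a.cylinder = page / (SECTORS_PER_TRACK * TRACKS_PER_CYL);
    return a;
}

/* Clustering hint: pages on the same cylinder need no seek between them. */
int same_cylinder(int p1, int p2)
{
    return page_to_disk(p1).cylinder == page_to_disk(p2).cylinder;
}
```

Placement optimization and clustering hints then live entirely inside this one module.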

However, it would appear that one probably needs ways to escape from type checking and similar rules, too. In particular, memory management will require the ability to manipulate addresses, and when data are read from disk, they must be coerced (e.g., via the type cast construct of the C language) into types known to the program. This is an issue that has been raised, and dealt with in various ways, in programming language circles, but may be less familiar to database cognoscenti. A more ambitious approach would be to extend the language to include persistence, i.e., directly understand and manipulate objects on disk automatically. This is an exciting area of current research, but beyond the scope of the present discussion. While pointer manipulation and memory management require the use of unsafe features, the unsafe code can be quite localized, reducing the likelihood of error. The language should encourage such localization; C, for example, is weak in this respect.
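The coercion step can indeed be confined to one small routine, as this sketch shows; the record layout and names are illustrative. The cast is the only unsafe line, so localizing it is straightforward.

```c
#include <stddef.h>

/* Sketch of localized coercion: bytes read from disk arrive untyped and
   must be cast into a type the program knows.  Keeping the cast inside
   one small routine confines the unsafe code to a single place. */

struct emp_record {
    int  id;
    int  age;
    char name[24];
};

/* The single place where raw buffer bytes become a typed record. */
const struct emp_record *view_as_emp(const void *raw_page, size_t offset)
{
    return (const struct emp_record *)((const char *)raw_page + offset);
}
```

Everything else in the system then handles `struct emp_record *` and never touches raw bytes.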

Concerning processes and synchronization, I think the features of Mesa and Modula are better suited than those of Ada. In particular, using locks seems more natural than using rendezvous when expressing mutual exclusion in access to internal tables. In any case, interrupts should turn into synchronization signals (e.g., a V on a semaphore, etc.) to suspended processes. Thus a process might write a buffer out to disk by executing the code, including the device driver, itself, blocking for the completion signal, and then continuing. Components of the system which act as schedulers should be able to access waiting processes as entities and manipulate them easily (e.g., to move them from queue to queue). It is helpful if a process can lock a data structure, file itself in the data structure, and then atomically: block, unlock the data structure, and signal a scheduler. The scheduler can then lock the data structure and get at the suspended process. Similar arguments suggest that individual messages (or streams) be data objects, with no strong, fixed relationship to processes, allowing communications to be handled in a flexible, priority-driven fashion.
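This atomic block-unlock-signal step is precisely what a condition-variable wait provides. As one concrete illustration (using the POSIX threads API, which postdates this paper but implements exactly this pattern), pthread_cond_wait releases the mutex and blocks the caller in a single atomic step; the “queue” here is reduced to a counter to keep the sketch short.

```c
#include <pthread.h>

/* Illustration of lock / file-self / atomically-block-and-unlock:
   pthread_cond_wait releases q_lock and suspends the caller in one
   atomic step, so a scheduler can then take q_lock and find it. */

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
static int waiting   = 0;   /* the "queue": count of filed processes */
static int work_done = 0;

/* A lightweight process files itself and blocks until awakened. */
void *worker(void *arg)
{
    pthread_mutex_lock(&q_lock);
    waiting++;                                 /* file self in the structure */
    while (work_done == 0)
        pthread_cond_wait(&q_cond, &q_lock);   /* unlock + block, atomically */
    waiting--;
    pthread_mutex_unlock(&q_lock);
    return arg;
}

/* The scheduler locks the structure and wakes the suspended process. */
void wake_all(void)
{
    pthread_mutex_lock(&q_lock);
    work_done = 1;
    pthread_cond_broadcast(&q_cond);
    pthread_mutex_unlock(&q_lock);
}
```

The atomicity matters: without it, the wakeup could arrive between unlocking and blocking, and the process would sleep forever.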

Control over paging of code and data is not an issue of the language per se. Rather, one aspect is the ability to lock and release pages in memory, etc., which can be done via a page data abstraction. The other aspect is describing pieces of code/data, grouping them, etc. This is more an issue in the construction of tools, such as the compiler and linker, and their connection back to the language elements (procedures, variables, etc.). For example, one might group code into segments. It would then be necessary to determine the segment containing a given routine, then the location and size of the segment, in order to lock the routine and its associates into main memory all at once. A complementary approach is to deliver page faults back to the database system. This requires being able to specify some pages that must absolutely never be paged out, etc.

In sum, we can identify from our previous discussion some features desired of a database implementation language. But the features mentioned probably do not form a complete or exactly correct list.

4 Advantages of the Language Approach.

The advantages of the language approach are several. First, it should offer greater portability across machines. We get the usual hiding of hardware differences associated with a high level language. But in addition, rather than adapting to different operating systems, we adapt the machine to us. Thus we push the differences down to low level abstractions, and avoid unnecessary variations (e.g., among different operating systems on the same hardware).

We also gain the many advantages of modern programming languages in support of system development and evolution. In particular, data abstraction and type checking can be offered. At present one must choose between an efficient, unsupportive language (e.g., assembly or C) or a less efficient, less convenient, more supportive language (Ada might be an example). In designing a new language we can likely offer both.

Even better, we gain leverage by providing the right features in the language, making it easier to express what the database implementer wants to do. This will ease implementation and maintenance, as well as reduce errors arising from a poor match of language to programming task.

Finally, and most importantly, the language approach offers improved performance by getting the operating system out of the way. We avoid unnecessary process context switches, unnecessary protection context switches, and unnecessary argument checks. We also shed entire mechanisms (protection, access control) that add overhead and must be redone anyway. A more subtle effect comes from replacing general algorithms with more appropriate ones: general purpose policies (e.g., for paging or device I/O scheduling) can be overly general or even counterproductive compared to database needs. Carried to its extreme, this argument would suggest eliminating unnecessary features from the hardware architecture, and thus increasing speed, reducing cost, or both.

In sum, the main advantages of the language approach include: portability (just implement the run time system on your target architecture — not trivial, but there should be a pretty clear specification of what is required); efficiency (no operating system call overhead, and no protection domains); support for modern software engineering practices (many languages used in implementing databases do not have adequate safeguards, or features for building large systems); and leverage (the appropriate set of features makes programming easier).

5 Objections to My Proposal.

The worst objection is that the approach described is applicable mainly to dedicated database systems, as opposed to general purpose time sharing systems, which might also wish to support database access for their users. If the correct set of features can be worked out, it may be possible to graft a database system of the kind I envision onto a general purpose operating system, provided the operating system offers the necessary hooks. However, there will almost certainly be a significant performance penalty compared with the dedicated system.

A second objection is portability: to port the database system, one must in essence build a new operating system for the target hardware. However, this may not be too bad. First, we are not talking about a full blown timesharing system, with utilities, and so forth. We are talking about an operating system kernel only, and one providing a standard interface. It is conceivable that considerable parts could be written in a reasonable language which is fairly portable and available, such as C, Modula, or Ada. This would cut the porting time down substantially. Having the correct specification and design really will help.

Finally, it is clear that I am describing what must for the time being be considered research or advanced development, by no means a proven concept. If your measure is “Will it sell in the marketplace?”, then I have some doubts, at least until I have experience building one of these systems. I am engaged in research to define a language, implement it, and build a distributed object oriented database with it. However, the language will itself be object oriented, and, because the emphasis is a little more on functionality and capability, for the time being we will probably interpret it rather than compile it, and we will not strain for performance at the moment. (Get it right, then make it fast.)

This note owes a lot to discussions I have had with a number of people, and presentations I have seen over the years, as well as bits written down here and there. It expresses opinion and religion more than experimentally demonstrated fact, but I think that a number of people agree with substantial parts of my criticisms and suggestions concerning operating systems. The language part is more radical. Anyway, the words are mine, but the ideas can frequently be attributed to others. However, I will not mention individuals, since it is hard to say to what extent they might agree with what I have said.

OPERATING SYSTEM SUPPORT FOR DATA MANAGEMENT

Michael Stonebraker and Akhil Kumar

EECS Department, University of California, Berkeley, CA 94720

Abstract. In this paper we discuss four concepts that have been proposed over the last few years as operating system services. These are: 1) transaction management, 2) network file systems and location independence, 3) remote procedure calls, and 4) lightweight processes.

From the point of view of implementors of a data manager, we indicate our reaction to these services. We conclude that the first service will probably go unused while the second is probably downright harmful. On the other hand, the absence of the last two services in most conventional operating systems requires a data base system implementor to write substantial extra code.

1. INTRODUCTION

An operating system is nominally designed to provide services to client applications, including large scale data managers. In [STON81] one of us commented on the utility of current constructs as services for a data manager. Since that time, there have been numerous proposals for new operating system services, and the purpose of this paper is to comment on several of these from the point of view of perceived utility to the implementor of a data base system.

We focus our attention on four particular services. First, many researchers propose next-generation operating systems with support for transactions. Older proposals include [MITC82, BROW81] while more recent ones can be found in [CHAN86, MUEL83, PU86, SPEC83]. Comments on the viability of operating system transaction managers can be found in [STON84, STON85, TRAI82]. In Section 2 of this paper we present our current thinking on this topic. Then in Section 3 we turn to a discussion of the network file systems that exist in many current operating systems (for example Sun UNIX and Apollo DOMAIN). Current data managers which run in this environment (e.g. INGRES and ORACLE) usually do not use this service, and we indicate the reasons for this state of affairs. Section 3 then continues to the more general notion of location transparency, such as provided by LOCUS [WALK83] and Clouds [DASG85]. We indicate the reasons why a data manager would be obliged to turn this service off to obtain reasonable performance. Section 4 then discusses remote procedure call mechanisms and the use which a data manager would find for this construct. Although one can code around their absence in current systems, we conclude they would be a most beneficial addition. Lastly, we turn in Section 5 to the need and intended use for lightweight processes. We indicate why they are crucial to data managers and what one would want them to do. The tax for their non-inclusion in most current systems is quite high, and we would encourage more operating system designers to think deeply about providing this service.

This research was sponsored by the National Science Foundation under Grant DMC 8504633 and by the Navy Electronics Systems Command under contract N00039-84-C-0039.

2. OPERATING SYSTEM TRANSACTION MANAGEMENT

A transaction is a collection of data manipulation commands that must be atomic and serializable [GRAY78]. Current data managers implement the transaction concept by means of sophisticated user space crash recovery and concurrency control software. This software is considered tedious to write, hard to debug, and its reliability is absolutely crucial. All data base implementors that we have met would appreciate somebody else being responsible for this function. In particular, abdicating these functions to the operating system seems the only realistic possibility.

There are two possible ways this abdication could take place, namely total abdication and partial abdication.

In total abdication, the data manager would accept a “begin transaction” command from a user and pass it directly to the operating system. “Abort transaction” and “commit transaction” instructions would be similarly redirected. Hence, the data manager would be uninvolved in (and unconcerned with) the details of transaction support.

In partial abdication the data manager would still be required to define the “events” which make up the contents of a current data base log as well as provide the required “undo” and “redo” routines. Any special case locking of indexes, system catalogs and other special data structures would also require direct data manager implementation. Partial abdication would move the basic control flow surrounding log processing into the operating system, as well as the innards of the lock manager.
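A sketch of what a partial-abdication interface might look like: the data manager registers named event types with undo/redo routines, and the OS recovery code invokes them when rolling the log. All names are invented, and the toy “update” event stands in for a real log record.

```c
/* Sketch of partial abdication: the OS owns the log and lock manager,
   but the data manager registers the event-specific undo/redo routines
   invoked during recovery.  This interface is hypothetical. */

#define MAX_EVENTS 16

typedef void (*log_fn)(void *event_data);

struct event_type {
    const char *name;
    log_fn undo;   /* supplied by the data manager */
    log_fn redo;   /* supplied by the data manager */
};

static struct event_type event_types[MAX_EVENTS];
static int n_event_types = 0;

/* The data manager defines the "events" that make up the log. */
int register_event_type(const char *name, log_fn undo, log_fn redo)
{
    event_types[n_event_types].name = name;
    event_types[n_event_types].undo = undo;
    event_types[n_event_types].redo = redo;
    return n_event_types++;        /* event-type id placed in log records */
}

/* OS-side recovery skeleton: roll a record back via the registered hook. */
void os_undo(int type_id, void *event_data)
{
    event_types[type_id].undo(event_data);
}

/* Toy event a data manager might register: an in-place cell update. */
static int cell;
struct update_event { int old_value, new_value; };
static void update_undo(void *e) { cell = ((struct update_event *)e)->old_value; }
static void update_redo(void *e) { cell = ((struct update_event *)e)->new_value; }
```

The sketch makes the paper's complaint concrete: the hard part, defining events and making them recoverable, stays with the data manager even after abdicating the control flow.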

In our opinion, partial abdication would simplify data base transaction code only marginally, since the easiest part of the code is all that is removed. Defining events and implementing them recoverably is still left to the data manager. This is where a majority of the time and effort in current transaction systems is spent. Hence we are MUCH more excited by the prospect of total abdication. A recent IBM Research [CHAN86] proposal for moving transactions into the OS is the most promising one we have seen because it suggests hardware supported locking on small-granularity 128 byte objects and a write-ahead log on the same small granules. Hence, we are investigating the viability of this proposal by performing an extensive simulation study.

We have constructed simulation models to compare the performance of an OS transaction manager with that of its database counterpart. The models must capture the relative strengths and weaknesses of each approach. The trade-offs involved in the two alternatives have to be highlighted in the models by means of parameters that correspond to performance factors in each situation. For example, our models have to reflect the fact that the OS can set locks at a lower cost than the data manager. This factor is likely to make the OS solution relatively more attractive. On the other hand, the OS transaction manager lacks semantic information about the database. For instance, it cannot distinguish between data objects and indexes while the data manager can. This ability to distinguish between different types of objects enables the data manager to apply short-term locks on hot spots like indexes and system catalogs and, therefore, allow greater concurrent activity. It also means that the data manager need only log updates to data objects, since the index may be reconstructed from the data at recovery time. Therefore, the data manager can implement crash recovery mechanisms in a smarter way and at a lower cost. This increased level of semantics tends to make the data manager solution preferable.

Simulation models have previously been used in this area to study the performance of concurrency control algorithms in a conventional data manager (e.g., [RIES79], [CARE84], [AGRA85]). We have expanded these models in several ways. First, we are modelling both concurrency control and crash recovery. Second, we have a more elaborate data base model for treating data and index objects separately. Third, our model includes buffer management, which has been omitted in the previous studies, where it is assumed that every data object access involves a disk I/O. As a side result from the study we will be able to examine the effect, if any, of buffer size on each alternative. Lastly, our primary motivation is to study where transactions are supported, and not to investigate a range of different user-space algorithms for transaction management.

Our initial studies indicate that an operating system transaction manager provides about 30 percent lower transaction throughput than its data base counterpart under a variety of simulation parameters. This performance loss is primarily the result of a log that is vastly larger than the one used by the data manager, the presence of more deadlocks, and the larger number of locks set by the operating system transaction manager. Moreover, recovery time is about twice as long for the OS transaction manager, largely because of the necessity of reading a much larger log.

Hence, our initial experiments suggest that total abdication results in an unacceptable loss of performance. Unless there is some way for the operating system to speed up transaction management, perhaps by providing special hardware appropriate to the task, we are skeptical that the operating system transaction manager can provide the performance required by current data managers. A full paper on this study is in preparation [KUMA86].

3. NETWORK FILE SYSTEMS AND LOCATION TRANSPARENCY

Popular Network File Systems (NFS) allow a user to access a remote file in the same way as he accesses a local file. This level of transparency would seem to be a good idea; however, it goes unused by data managers in the environments in which they run. According to [LAZO86], the cost of a remote read is about 2-3 times the cost of a local read. Although these measurements were particular to one hardware and operating system environment, they seem to reflect the performance of most current implementations. Consequently, if a data manager is accessing the EMP relation to find all employees who live in Boston and are over 50, or alternatively live in New York and are over 55, then it can read the entire EMP file over an NFS to discover the employees who actually qualify. Alternately, it can run the data manager on the remote node, where it will read the entire file as a collection of local reads. Then, only the qualifying records are actually moved between sites. Obviously, if there are few qualifying employees, the latter tactic will be much more efficient. Hence, the common wisdom is ``move the query to the data and not the data to the query.'' Both ORACLE and INGRES operate in this more efficient mode in environments supporting NFS. Hence, a data manager is willing to go to the trouble of executing a remote process and setting up the inter-machine communication to obtain a perceived substantial performance gain.
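The trade-off between shipping data and shipping the query can be made concrete with a back-of-the-envelope cost model. The 2.5x remote-read penalty below is an assumed value inside the 2-3x range reported in [LAZO86]; the page counts are invented for illustration.

```python
# Illustrative cost model for the EMP example: read the whole file over an
# NFS-style file system vs. scan it remotely and ship back only the matches.

LOCAL_READ = 1.0
REMOTE_READ = 2.5            # an assumed point in the 2-3x penalty range

def data_shipping(total_pages):
    """Read the entire EMP file across the network, filter locally."""
    return total_pages * REMOTE_READ

def query_shipping(total_pages, qualifying_pages):
    """Run the scan at the remote site as local reads; ship back matches."""
    return total_pages * LOCAL_READ + qualifying_pages * REMOTE_READ

# A hypothetical 10,000-page EMP file with 20 pages of qualifying employees:
assert query_shipping(10_000, 20) < data_shipping(10_000)
```

With few qualifying records, query shipping wins by nearly the full remote-read penalty, which is why moving the query to the data is the common wisdom.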

Beyond NFS is the possible inclusion of location transparency in a file system. In this context, a user need never worry about where a file is; he simply accesses it as if it were local to his site. This generalizes NFS by not requiring a user to ever know where a file actually resides. Moreover, the operating system could decide to move the file if disk traffic or accessing patterns warranted a switch.

Unfortunately, a data manager would insist on turning this service off. If the data manager accesses files as if they were local, then it would simply run the normal one-site query optimizer built into existing data managers. Such a one-site optimizer can decide among various query processing options for complex queries, including merge-sort and iterative substitution. However, optimizers that work in a distributed environment are much more complex. They attempt to minimize a different cost function that includes the cost of network communication, and they utilize additional strategies including semi-joins. Moreover, they worry extensively about the location of intermediate results produced during the query execution process. Often it is a good idea to move two objects to a third site and perform a join at that site.
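The semi-join strategy mentioned above can be sketched in a few lines. The relations and attribute names are hypothetical; the point is only that one site needs to receive just the join column, not a whole relation.

```python
# A minimal semi-join sketch: ship only the join column of R to S's site,
# filter S there, and ship back just the tuples that can possibly join.
# This is one of the extra strategies available to a distributed optimizer.

def semi_join(r_rows, s_rows, r_key, s_key):
    # Step 1: project R on its join attribute (a small message to S's site).
    r_keys = {row[r_key] for row in r_rows}
    # Step 2: at S's site, keep only the tuples that match some R key.
    return [row for row in s_rows if row[s_key] in r_keys]

emp  = [{"dept": "toy"}, {"dept": "shoe"}]
dept = [{"name": "toy", "floor": 1},
        {"name": "shoe", "floor": 2},
        {"name": "candy", "floor": 3}]

reduced = semi_join(emp, dept, "dept", "name")
# Only the two matching DEPT tuples need to cross the network for the join.
```

Whether this beats shipping DEPT whole depends on the cost function the text describes, which is precisely why a distributed optimizer must know where objects reside.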

If a data manager cannot find out the location of objects, then all this optimization cannot be used. Moreover, if the data manager can find out the current location but the location might later be changed by the underlying operating system, then the optimization can be performed but may be disastrously sub-optimal if locations subsequently change. Our intuitive sense is that a system offering location transparency may well be several orders of magnitude slower than a conventional distributed data base system. In this case, a data manager would never utilize a file system offering location transparency.

4. REMOTE PROCEDURE CALL

As noted in the previous section, the application program often must run at a different site than the data base engine. In addition, a distributed data base system is architected by having a local manager which runs at each site containing data relevant to a user command and a global manager which coordinates the local managers. Hence, to solve a distributed query involving data at N sites, there will be N local managers and one global manager who must communicate. Lastly, there is considerable recent interest in extendible data base systems, which can be tailored to individual needs by including user-written procedures. See [CARE86, STON86] for two different approaches to such a system. In an extendible environment, one must call procedures written by others, and a robust debugging environment seems essential. Hence, one would like to link the user-written procedure into the data manager for high performance and also to execute a call to a protected address space for program isolation during debugging. This last feature requires communication with a remote process running on the same machine.

In all these cases, the run-time system must linearize the parameters of a procedure call, pass them through the available interprocess communication system, and then repackage the data structure at the other end. All this effort would go away if a remote procedure call (RPC) mechanism were available. Hence, we are enthusiastic about RPC facilities.
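The linearize-ship-repackage cycle that an RPC mechanism would hide can be sketched as follows. JSON stands in for whatever wire format a real RPC system would use, and the procedure name and arguments are invented for illustration.

```python
# Sketch of the marshalling work the run-time system must do by hand when
# no RPC mechanism is available: flatten the call, move the bytes across
# an IPC channel, and rebuild the data structure at the far end.

import json

def marshal(proc_name, args):
    """Linearize a procedure call into bytes that can cross an IPC channel."""
    return json.dumps({"proc": proc_name, "args": args}).encode()

def unmarshal(wire_bytes):
    """Repackage the call at the receiving end."""
    msg = json.loads(wire_bytes.decode())
    return msg["proc"], msg["args"]

wire = marshal("lookup_employee", {"city": "Boston", "min_age": 50})
proc, args = unmarshal(wire)
assert (proc, args) == ("lookup_employee", {"city": "Boston", "min_age": 50})
```

An RPC facility would generate exactly this boilerplate from the procedure's signature, which is why its absence puts the burden on every data manager separately.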

5. LIGHTWEIGHT PROCESSES

A conventional operating system binds each terminal to an application program. Hence, if there are N active terminals, then N application processes will exist. Moreover, the obvious way to architect a data manager is to fork a data base process for each data base user. This will lead to N data base processes and a total of 2N active processes. The N data base processes can all share a single copy of the code segment and have private data segments. Moreover, if the operating system provides shared data segments, then the lock table and buffer pool can be placed there.

An alternative approach is to construct a data base server process. Here, there would be only one data base process, to which all N application processes would send messages requesting data base services. A server process entails writing a complete multi-tasking, special-purpose operating system inside the data manager. This duplication of operating system services seems intellectually very undesirable.

As noted in [STON81], a data manager must overcome significant problems when using either architecture. In this section we indicate one additional problem that occurs in a process-per-user architecture. When run in commercial environments on large machines, it is not unusual for there to be several hundred terminals connected to a system. Five hundred is not a gigantic number, and the largest terminal collection we know of exceeds 40,000. Assuming 500 terminals, each supporting a data base user, and assuming a process-per-user model of computation, the OS must support 1000 active processes. Current operating systems tend to generate excessive overhead when presented with large numbers of processes to manage. For example, the common tactic of having the scheduler do a sequential scan of the process table to decide who to run next is not workable on large systems.
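The scheduling problem above is easy to see in miniature. The sketch below contrasts the sequential-scan tactic the text criticizes with a priority queue; the process-table layout and priorities are invented for illustration.

```python
# Why a sequential scan of the process table does not scale: each dispatch
# costs O(N) in the number of processes, while a ready queue makes the
# same decision in O(log N).

import heapq

process_table = [{"pid": p, "state": "ready", "prio": p % 5}
                 for p in range(1000)]

def pick_next_by_scan(table):
    # The criticized tactic: examine every table entry on every dispatch.
    best = None
    for entry in table:
        if entry["state"] == "ready" and (best is None
                                          or entry["prio"] < best["prio"]):
            best = entry
    return best

# The alternative: keep runnable processes in a heap keyed by priority.
ready_queue = [(e["prio"], e["pid"]) for e in process_table]
heapq.heapify(ready_queue)

def pick_next_from_queue(queue):
    return heapq.heappop(queue)   # O(log N) regardless of table size
```

With 1000 processes the scan touches 1000 entries per dispatch; at the 40,000-terminal scale the text mentions, that per-dispatch cost becomes prohibitive.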

In addition, memory management with a large number of processes tends to be a problem because the memory per terminal is often not very large. For example, a 32 mbyte system with 500 terminals yields only about 65 kbytes per terminal. Consequently, great care must be taken to avoid thrashing. Lastly, excessive memory consumption is likely because the implementor of a process-per-user data manager will likely allocate static areas for such things as parse trees and query plans. Consequently, the data segment of a process-per-user data manager tends to be of substantial size. In a server implementation, all such allocation would likely be done dynamically. Hence, in a 500 user environment, the server data segment would be much smaller than 500 times the size of a process-per-user data segment. Consequently, the total amount of main memory consumed by the data manager is higher in a process-per-user environment, and the probability of thrashing is increased.

A solution to the resource consumption and bad performance of a process-per-user model is for an operating system to support lightweight processes. This model of computation would allocate one address space for a collection of tasks. Multiple threads of control would be active in the single address space and would thereby be able to share open files and data structures. One would like an operating system to schedule the address space and then, as a second decision, schedule the particular thread in the address space. The data manager would like easy-to-use capabilities to add and delete threads from an address space. Such operations should be much cheaper than creating or destroying a process.
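The lightweight-process model can be illustrated with modern threads as a stand-in: many threads of control share one address space, so structures like the buffer pool exist once rather than once per process. This is a sketch of the model, not of any 1986 system's API.

```python
# Sketch of threads sharing one address space: the buffer pool is a single
# shared structure, visible to every thread without shared-segment setup
# or message passing, guarded by a short-term latch.

import threading

buffer_pool = {}                     # one copy, shared by all threads
pool_latch = threading.Lock()

def serve(page_id):
    with pool_latch:
        # Every thread reads and updates the same buffer pool directly.
        buffer_pool[page_id] = f"contents of page {page_id}"

threads = [threading.Thread(target=serve, args=(i,)) for i in range(8)]
for t in threads:
    t.start()                        # far cheaper than forking 8 processes
for t in threads:
    t.join()
```

Creating and joining these threads is the cheap add/delete operation the text asks for; a process-per-user design would instead pay for 8 address spaces and duplicate or explicitly share the pool.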

More generally, the concept of binding an address space to every terminal is also questionable, since in a data base environment many of the users will be executing the same application program. Hence, lightweight processes make sense for application software also. To utilize this construct, the hard binding of a terminal to a process must disappear. The general idea then would be to create N1 application servers and N2 data base servers, each supporting multiple threads of control. All the ``plumbing'' to connect this environment together should be done within the operating system. Moreover, one should be able to add and drop servers easily and have their load be automatically redirected to other servers. CICS in an IBM environment and Pathway in a Tandem environment are systems with this flavor. Hopefully, some of these capabilities will migrate into future general purpose operating systems.

REFERENCES

[AGRA85] Agrawal, R., et al., "Models for Studying Concurrency Control Performance: Alternatives and Implications," Proc. 1985 ACM-SIGMOD Conference on Management of Data, 1985.

[BROW81] Brown, M., et al., "The Cedar Database Management System," Proc. 1981 ACM-SIGMOD Conference on Management of Data, Ann Arbor, Mich., June 1981.

[CARE84] Carey, M. and Stonebraker, M., "The Performance of Concurrency Control Algorithms for Database Management Systems," Proc. 1984 VLDB Conference, Singapore, Sept. 1984.

[CARE86] Carey, M., "The Preliminary Design of Exodus," Proc. 12th Very Large Data Base Conference, Kyoto, Japan, Sept. 1986.

[CHAN86] Chang, A., private communication.

[DASG85] Dasgupta, P., et al., "The Clouds Project: Design and Implementation of a Fault Tolerant Distributed Operating System," Technical Report GIT-ICS-85/29, Georgia Tech, Oct. 1985.

[GRAY78] Gray, J., "Notes on Data Base Operating Systems," in Operating Systems: An Advanced Course, Springer-Verlag, 1978, pp. 393-481.

[KUMA86] Kumar, A., "Performance of Alternate Approaches to Transaction Processing," (in preparation).

[LAZO86] Lazowska, E., et al., "File Access Performance of Diskless Workstations," ACM TOCS, August 1986.

[MITC82] Mitchell, J. and Dion, J., "A Comparison of Two Network-Based File Servers," CACM, April 1982.

[MUEL83] Mueller, E., et al., "A Nested Transaction Mechanism for LOCUS," Proc. 9th Symposium on Operating System Principles, October 1983.

[PU86] Pu, C. and Noe, J., "Design of Nested Transactions in Eden," Technical Report 85-12-03, Dept. of Computer Science, Univ. of Washington, Seattle, Wash., Feb. 1986.

[RIES79] Ries, D., "The Effects of Concurrency Control on Data Base Management System Performance," Electronics Research Laboratory, Univ. of California, Memo ERL-M79/20, April 1979.

[SPEC83] Spector, A. and Schwartz, P., "Transactions: A Construct for Reliable Distributed Computing," Operating Systems Review, Vol. 17, No. 2, April 1983.

[STON81] Stonebraker, M., "Operating System Support for Data Managers," CACM, April 1981.

[STON84] Stonebraker, M., "Virtual Memory Transaction Management," Operating Systems Review, April 1984.

[STON85] Stonebraker, M., et al., "Problems in Supporting Data Base Transactions in an Operating System Transaction Manager," Operating Systems Review, January 1985.

[STON86] Stonebraker, M., "Inclusion of New Types in a Data Base System," Proc. 1986 IEEE Data Engineering Conference, Los Angeles, Calif., Feb. 1986.

[TRAI82] Traiger, I., "Virtual Memory Management for Data Base Systems," Operating Systems Review, Vol. 16, No. 4, October 1982.

[WALK83] Walker, B., et al., "The LOCUS Distributed Operating System," Proc. 9th Symposium on Operating System Principles, Oct. 1983.

CALL FOR PARTICIPATION

IFIP WG 8.4 Workshop on

Office Knowledge: Representation, Management and Utilization

17-19 August 1987, University of Toronto, Toronto, Canada

WORKSHOP CHAIRMAN: Prof. Dr. Alex A. Verrijn-Stuart, University of Leiden
PROGRAM CHAIRMAN: Dr. Winfried Lamersdorf, IBM European Networking Center
ORGANIZING CHAIRMAN: Prof. Fred H. Lochovsky, University of Toronto

This workshop is intended as a forum and focus for research in the representation, management and utilization of knowledge in the office. This research area draws from and extends techniques in the areas of artificial intelligence, database management systems, programming languages, and communication systems. The workshop program will consist of one day of invited presentations from key researchers in the area plus one and one half days of contributed presentations. Extended abstracts, in English, of 4-8 double-spaced pages (1,000-2,000 words) are invited. Each submission will be screened for relevance and potential to stimulate discussion. There will be no formal workshop proceedings. However, accepted submissions will appear as submitted in a special issue of the WG8.4 newsletter and will be made available to workshop participants.

How to submit

Four copies of double-spaced extended abstracts in English of 1,000-2,000 words (4-8 pages) should be submitted by 15 April 1987 to the Program Chairman:

Dr. Winfried Lamersdorf
IBM European Networking Center
Tiergartenstrasse 15
Postfach 10 3068
D-6900 Heidelberg
West Germany

Important Dates

Extended abstracts due: 15 April 1987

Notification of acceptance for presentation: 1 June 1987

Workshop: 17-19 August 1987

CALL FOR PAPERS

13th International Conference on Very Large Data Bases
Brighton, England, U.K.
1-4 September 1987

THE CONFERENCE

VLDB Conferences are a forum and focus for identifying and encouraging research, development, and the novel applications of database management systems and techniques. The Thirteenth VLDB Conference will bring together researchers and practitioners to exchange ideas and advance the subject. Papers of up to 5000 words in length and of high quality are invited on any aspect of the subject, but particularly on the topics listed below. All submitted papers will be read and carefully evaluated by the Programme Committee.

TOPICS

Major topics of interest include, but are not limited to: Data Models; Design Methods and Tools; Distributed Databases; Query Optimization; Concurrency Control; Database Machines; Performance Issues; Security; Knowledge Base Representation; Multi-media Databases; Implementation Techniques; Object Oriented Models; The Role of Logic. Other topics may also be specified.

HOW TO SUBMIT YOUR PAPERS

Five copies of double-spaced manuscript in English, up to 5000 words, should be submitted by 6 February 1987 to either of the following:

Mr William Kent, Hewlett-Packard, Computer Research Centre, 1501 Page Mill Road, Palo Alto, CA 94304, USA

Professor P M Stocker, Computer Centre, University of East Anglia, Norwich NR4 7TJ, England

IMPORTANT DATES

PAPERS DUE: 6 FEBRUARY 1987

NOTIFICATION OF ACCEPTANCE: 27 APRIL 1987

CAMERA READY COPIES DUE: 29 MAY 1987

General Conference Chairman: Professor Peter J H King, Birkbeck College, University of London, England
Programme Co-ordinator: Professor G Bracchi, Politecnico di Milano, Italy
Programme Committee Co-Chairmen: Professor P M Stocker, University of East Anglia, England; Professor Stuart E Madnick, Massachusetts Institute of Technology, U.S.A.
Organising Committee Chairman: Dr Keith G Jeffery, Science & Engineering Research Council, England
North America Organising Co-ordinator: Mr William Kent, Hewlett-Packard, U.S.A.

Preliminary Call for Papers

6th International Conference on Entity-Relationship Approach

November 9-11, 1987, New York

Conference Chairmen: Martin Modell, Merrill Lynch; Stefano Spaccapietra, Université de Bourgogne
Program Committee Chairman: Salvatore T. March, University of Minnesota
Steering Committee Chairman: Peter P. Chen, LSU and MIT

The Entity-Relationship Approach is the basis for many Database Design and System Development Methodologies. ER Conferences bring together practitioners and researchers to share new developments and issues related to the use of the ER Approach. The conference consists of presented papers addressing the theory and practice of the ER Approach as well as invited papers, tutorials, panel sessions, and demonstrations.

Major Themes:

Database Development and Management: database design; data models and modelling; database management systems; database dynamics; languages for data description and manipulation; database constraints.

Application Systems: CAD/CAM and engineering databases; multi-media databases; knowledge-based systems; user interfaces; object-oriented systems; business systems.

Managing Organizational Information Resources: data planning; data dictionaries; information architectures; information centers; translating data plans into application systems; end user computing.

Submission of Papers: Papers are solicited on the above topics or on any other topic relevant to the ER Approach. Five copies of completed papers must be received by the program committee chairman by April 15, 1987. Papers must be written in English and may not exceed 25 double-spaced pages. Selected papers will be published in Data and Knowledge Engineering (North-Holland).

Important Dates: Papers due: April 15, 1987. Notification of acceptance: June 20, 1987. Camera ready copies due: August 15, 1987.

All requests and submissions related to the program should be sent to the program committee chairman:

Professor Salvatore T. March
Department of Management Sciences
University of Minnesota
271 19th Avenue South
Minneapolis, MN 55455 USA
Tel: (612) 624-2017
bitnet: march@umnacvx
csnet: march@umn-cs
