Explicit Network Scheduling

Richard John Black

Churchill College

University of Cambridge

A dissertation submitted for the degree of

Doctor of Philosophy

December

Abstract

This dissertation considers various problems associated with the scheduling and network I/O organisation found in conventional operating systems for effective support of multimedia applications which require Quality of Service.

A solution for these problems is proposed in a microkernel structure. The pivotal features of the proposed design are that the processing of device interrupts is performed by user-space processes which are scheduled by the system like any other; that events are used for both inter- and intra-process synchronisation; and the use of a specially developed high-performance I/O buffer management system.

An evaluation of an experimental implementation is included. In addition to solving the scheduling and networking problems addressed, the prototype is shown to outperform the Wanda system, a locally developed microkernel, on the same platform.

This dissertation concludes that it is possible to construct an operating system where the kernel provides only the fundamental job of fine-grain sharing of the CPU between processes, and hence synchronisation between those processes. This enables processes to perform task-specific optimisations; as a result, system performance is enhanced both with respect to throughput and the meeting of soft real-time guarantees.

To my parents

John and Marcella

Preface

Except where otherwise stated in the text, this dissertation is the result of my own work and is not the outcome of work done in collaboration.

This dissertation is not substantially the same as any I have submitted for a degree or diploma or any other qualification at any other university.

No part of this dissertation has already been, or is being currently, submitted for any such degree, diploma or other qualification.

This dissertation does not exceed sixty thousand words, including tables, footnotes and bibliography.

© Copyright Richard Black. All rights reserved.

Trademarks

Alpha AXP, DECstation, TURBOchannel, Ultrix and VAX are trademarks of Digital Equipment Corporation.

Archimedes is a trademark of Acorn Ltd.

ARM is a trademark of Advanced RISC Machines.

Ethernet is a trademark of the Xerox Corporation.

MIPS is a trademark of MIPS Technologies Inc.

TAXI is a trademark of Advanced Micro Devices Inc.

Unix is a trademark of AT&T.

Windows NT is a trademark of Microsoft Corporation.

Xilinx is a trademark of Xilinx Inc.

Other trademarks which may be used are also acknowledged.

Acknowledgements

I would like to thank my supervisor, Derek McAuley, for his encouragement and advice during my time at the Laboratory. I would also like to thank Roger Needham, Head of the Laboratory, for his support and for encouraging me to spend one summer getting a new perspective on research by working as an intern at DEC Systems Research Center in Palo Alto.

I am grateful for the help and friendship of the members of the Systems Research Group, who have always proved ready to engage in useful discussions. Joe Dixon, Mark Hayter, Ian Leslie and Eoin Hyden deserve a special mention.

I am indebted to Paul Barham, Mark Hayter, Eoin Hyden, Ian Leslie, Derek McAuley and Cosmos Nicolaou, who read and commented on various drafts of this dissertation. I would also like to thank Paul Barham for artistic advice and Robin Fairbairns for typographical guidance. The world-famous Trojan Room coffee machine deserves a mention too, for its stimulating input over the years.

I would like to take this opportunity to congratulate Martyn Johnson and his staff on the exceptionally high quality of the systems administration at the Computer Laboratory, and in particular to thank him for the cheerful, and indeed encouraging, way he has reacted to my all-too-frequent requests to do peculiar things to his system.

This work was supported by an XNI studentship from the Department of Education for Northern Ireland.

Contents

List of Figures

List of Tables

Glossary of Terms

1 Introduction
    Context
    Outline

2 Background
    Networking Technology
        Asynchronous Transfer Mode
        Fairisle
    Operating System Research
        Wanda
        The x-kernel
        Pegasus
        Nemo
    Related Scheduling Work
        Sumo
        Meta-Level Scheduler
    Summary

3 Process Scheduling
    Priority
        Priority in the Internet
        Priority in Wanda
        Priority between Applications
        Priority within Applications
        Priority in Devices
        Periodicity
        Earliest Deadline First
        Summary
    Sharing the CPU
        Inter-process scheduling in Nemo
        Inter-process scheduling in Nemesis
        Inter-process scheduling in Fawn
    Inter-Process Communication
    Virtual Processor Interface
        Activations
        Events
        Time
        Interrupts
    Intra-Process Scheduling
        Event counts and sequencers
        Concurrency primitives using events
        SRC threads
        Posix threads
        Wanda threads
        Priority
    Summary

4 Inter-Process Communication
    Language
    Shared Libraries
    IPC Model
        Trust
        Migrating model
        Switching model
    IPC Operation
        Architecture
        Calling conventions
        IPC Low Level
        IPC Stubs
        Binding
    Service Management
        Trading
        Name Spaces
        Restriction of Name Space
    Bootstrapping
        Binder
        Trading
    Other communication
    Summary

5 Input/Output
    Previous Schemes
        Unix
        Wanda
        Fbufs
        Application Data Unit Support
    Requirements
        Considering Device Hardware
        Protocol
        Application Software
        Scheduling
        Streaming Memory
    The Adopted Solution
        Operation
        Usage
        Longer channels
        Complex channels
        Out of band control
    Summary

6 Experimental Work
    Experimental Platform
        System Configuration
        Measurement Details
    Inter-process scheduling
        Interrupt Latency
        Jubilee Startup Costs
        Same machine RPC
    Intra-process scheduling
        Comparative performance
        Effects of Sharing
    Fairisle Throughput
    Fairisle Host Interface Performance
        Transmit
        Receive
    Summary

7 Further Work
    Operating System Development
        Event value overflow
        Virtual Address System
        Resource Recovery and Higher Level Issues
        Heuristic Event Hints
        Protocol Support
        Exponential jubilees
    Specialist Platforms
        Shared memory multiprocessors
        Non-shared memory multiprocessors
        Micro-threaded processors
    Inter-process Communication
    Desk Area Network
    Other Observations
    Summary

8 Conclusion

Bibliography

List of Figures

Fairisle switch overview
Schematic of an 8 by 8 fabric
Port controller overview
Allocation of CPU by the Nemesis scheduler
Example of sending an event update
Server stub dispatcher pseudo-code
Unix Mbuf memory arrangement
Wanda IOBuf memory arrangement
Fbuf memory arrangement
Headers for various protocols
Trailers for various protocols
Rbuf memory arrangement
Control Areas for an I/O channel between two processes
A longer Rbuf channel
Control path for Fileserver Writes
Interrupt progress for three scenarios
CPU activity during a Jubilee
Latency from jubilee start to first activation
Same machine Null RPC times
Context switch times for various schedulers
Comparison of static and shared schedulers
CPU Usage vs Fairisle Throughput

List of Tables

Summary of VPI context area usage
Calling conventions for MIDDL
TMM and RMM properties
Comparison of Buffering Properties
Interrupt Latency
Approximate null RPC latency breakdown

Glossary

AAL     ATM Adaptation Layer
ADU     Application Data Unit
ANSA    Advanced Networked Systems Architecture
ARM     Advanced RISC Machine
ATDM    Asynchronous Time Division Multiplexing
ATM     Asynchronous Transfer Mode
AXP     DEC Alpha project logo
BISDN   Broadband Integrated Services Digital Network
BSD     Berkeley Software Distribution
CFR     Cambridge Fast Ring
CPU     Central Processing Unit
DAN     Desk Area Network
DEC     Digital Equipment Corporation
DMA     Direct Memory Access
DRAM    Dynamic Random Access Memory
ECC     Error Correcting Code
EDF     Earliest Deadline First
FDL     Fairisle Data Link
FIFO    First In First Out
FIQ     Fast Interrupt Request (on ARM processors)
FPC     Fairisle Port Controller
FPGA    Field Programmable Gate Array
FRC     Free Running Counter
I/O     Input and Output
IP      Internet Protocol
IPC     Inter-process communication
JPEG    Joint Photographic Experts Group (an image compression standard, also used for video)
LED     Light Emitting Diode
MIDDL   Mothy's interface definition language
MLS     Meta-Level Scheduler
MMU     Memory Management Unit
MRU     Most Recently Used
MSDR    Multi-Service Data Representation
MSNA    Multi-Service Network Architecture
MSNL    Multi-Service Network Layer
MSRPC   Multi-Service Remote Procedure Call
MSSAR   Multi-Service Segmentation and Reassembly
NFS     Network File System
NTSC    Nemo Trusted Supervisor Code; or the National Television Standards Committee (the American TV standard)
OSF     Open Software Foundation
PAL     Phase Alternate Line (the European TV standard)
PCB     Printed Circuit Board
PDU     Protocol Data Unit
PTM     Packet Transfer Mode
QOS     Quality of Service
RISC    Reduced Instruction Set Computer
RMM     Receive Master Mode
ROM     Read-Only Memory
RPC     Remote Procedure Call
SAS     Single Address Space
SRAM    Static Random Access Memory
SRC     Digital Equipment Corporation Systems Research Center
STM     Synchronous Transfer Mode
SVR4    Unix System V Release 4
TAXI    Transparent Asynchronous Transmitter/Receiver Interface
TC      Timer Counter
TCP     Transmission Control Protocol
TLB     Translation Lookaside Buffer
TMM     Transmit Master Mode
UDP     User Datagram Protocol
USC     Universal Stub Compiler
VCI     Virtual Circuit Identifier (also Virtual Channel Identifier)
VPI     Virtual Processor Interface
XTP     Xpress Transport Protocol
XUNET   Experimental University Network

Chapter 1

Introduction

Over the past decade there have been many advances in the field of computer networking, ranging from improvements in fibre optics to the wide-scale interconnection of heterogeneous equipment. In particular, developments in high-speed packet switching technology have created the likelihood of a universal paradigm for all types of user communication: the multi-service network.

During this time the traditional exponential growth of computing performance has continued unabated, and many changes have been made in the area of operating systems to capitalise on this growth in workstation power. In the recent past there has not, however, been an equivalent rise in the performance of network systems, which the market has demanded from workstations.

Recent networking technology promises not only enormously improved performance but also Quality of Service guarantees; the ability to realise true multi-service communication to the workstation has become essential, and the deployment of this high-performance network technology has vastly reduced the ratio of workstation power to network interface bandwidth.

From the author's experience of the Fairisle high-speed network, using both traditional and microkernel operating systems, substantial deficiencies have been uncovered which threaten the latent utility of the available hardware.

This dissertation presents scheduling and I/O structures for Fawn, a general purpose multi-user operating system designed to maximise the service to applications from such hardware.

Context

Previous research has been done in Cambridge in the field of Continuous Media applications. Much of this has been practical, with the implementation of the Pandora's Box [Hopper], a continuous media peripheral for a workstation. Such facilities are typically used in a distributed environment [Nicolaou]. Storage and synchronisation services have been implemented for the system, which used an early ATM network known as the Cambridge Fast Ring [Temple].

The Fairisle project [Leslie] was an investigation of switch-based ATM networks. This project designed and constructed a testbed switched ATM infrastructure in order to investigate the management and behaviour of real traffic streams.

Use of the Pandora system revealed the problems of operating with first-generation multimedia equipment, which supports the capture and display of continuous media streams but not their direct manipulation by applications. The observation that bus-based workstations, where data traversed the bus many times, were not ideal led to a prototype second-generation multimedia workstation replacing the bus with an ATM-based interconnect: the Desk Area Network [Hayter; Hayter and Barham].

Operating system support for continuous media streams has also been under investigation; the creation of the Pegasus project [Mullender] was intended to develop such a system. At the time of writing, the Pegasus project has begun an implementation of some of the low levels of such a system.

Outline

Background ideas to the work are introduced in the next chapter, together with a discussion of the research environment in which this investigation took place. Consideration of previous or related work occurs in context in subsequent chapters, where the structure of the relevant parts of the Fawn design is examined.

Chapter 3 studies the quality of service issues of scheduling in a multimedia, multi-service networking environment. The methods used by Fawn for inter-process scheduling, inter-process communication, the virtual processor interface and intra-process synchronisation are presented.

Subsequently, chapter 4 develops the higher levels of the communication model, and considers trust, binding and naming.

In chapter 5 the schema for bulk data transportation in Fawn is presented, after surveying the merits of previous designs.

An experimental implementation, including performance measurements, is examined in chapter 6.

Suggestions for further work and extensions to the prototype system are made in chapter 7. These fall broadly into Desk Area Network and multi-processor concerns.

Finally, chapter 8 summarises the main arguments of the dissertation and makes some concluding remarks.

Chapter 2

Background

This chapter discusses the areas of research which form the background to the work described in the rest of this dissertation. Some of the previous research is considered at the point at which its relation to the equivalent systems proposed in this work is more directly assessable.

Networking Technology

Asynchronous Transfer Mode

Asynchronous Transfer Mode (ATM), which was originally called Asynchronous Time Division Multiplexing (ATDM), has been in use for over twenty years [Fraser] as a technique for integrating isochronous and bursty traffic in the same data network. This technique uses small fixed-sized data transfer units, known as cells or mini-packets, each of which includes a header which is used throughout the network to indicate the lightweight virtual circuit over which the data is to travel. During this time Bell Labs has produced four ATM networks: Spider, Datakit, Incon and XUNET [Fraser].

At Cambridge, the Computer Laboratory developed ATDM networks based on slotted rings, beginning with the Cambridge Ring [Hopper]. More recently the Fairisle project has developed a general topology switch-based ATM network.

In Synchronous Transfer Mode (STM), information from multiple circuits is multiplexed together in a deterministic way into different time slots within a larger frame. The time slot is allocated for the duration of the circuit at the time of its establishment; for example, many 64 Kbit/sec telephony circuits may be multiplexed onto a single higher-rate trunk. The granularity of information from each circuit is typically very small (e.g. eight bits). Since the position of each octet within the frame is fixed for each link in the network, only a fixed size buffer is needed at each switch. The advantage of such a network is that the data stream arrives at the recipient with no jitter (other than the marginal effect of the clock drift of the underlying transmission medium), and that the switching requirements are deterministic. The disadvantage is that it is inefficient with bursty data traffic sources and cannot take advantage of fine-grain statistical multiplexing.

In Packet Transfer Mode (PTM), information is packaged in large units, usually measured in tens of thousands of bits, known as packets. Multiplexing is performed asynchronously: when a network node has a packet to send it acquires the entire bandwidth of the channel for the duration of the packet. This involves some form of media access protocol. Contention between nodes introduces variable delays in accessing the channel, leading to jitter. The advantage of PTM is that the bandwidth can be very effectively shared between bursty sources, each using the full channel when required. At switching nodes, queues build up for output ports for which packets are arriving from multiple input ports. This leads to large buffering requirements and adds additional jitter on the data streams.

In Asynchronous Transfer Mode the allocation of time to each circuit is also not predetermined. Instead, a dynamic algorithm is used to decide which of the circuits should be permitted to make use of the bandwidth next. To reduce jitter, and to ensure fine-grained sharing of the available bandwidth, the allocations are not as large as with PTM (a cell is a few tens of octets) and are of a fixed size to ease their handling in hardware. Since the allocation is dynamic, ATM networks can accommodate bursty traffic sources much better than STM, and can take advantage of statistical multiplexing. The disadvantage is that the buffering requirement at the switching points is non-deterministic, and there is some jitter in the arriving data stream. The header (or tag) containing the Virtual Circuit Identifier (VCI) is kept as small as possible and only has local significance: it is replaced, or remapped, at each stage. It is the remapping tables in the switches along the route that represent the state for the circuit. The circuits carried by ATM networks are usually lightweight [Leslie], which means that they do not have any hop-by-hop error control and may be disbanded by intermediate nodes at any time.
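To make the per-hop remapping concrete, the following sketch shows the lookup a switch performs on each arriving cell: the incoming tag indexes a table which yields the outgoing port and the replacement tag. This is an illustration only; the table layout, names and sizes are not Fairisle's.

    #include <stdint.h>

    #define VCI_TABLE_SIZE 1024      /* illustrative table size */

    /* One entry of a per-input-port remapping table: this is the only
     * state the switch holds for a lightweight virtual circuit. */
    typedef struct vci_entry {
        uint8_t  valid;              /* circuit currently established? */
        uint8_t  out_port;           /* fabric output port for the cell */
        uint16_t out_vci;            /* replacement tag for the header */
    } vci_entry_t;

    /* Remap one cell header; returns 0 if no circuit state exists
     * (an intermediate node may have disbanded the circuit). */
    static int remap_cell(const vci_entry_t table[VCI_TABLE_SIZE],
                          uint16_t *hdr_vci, uint8_t *dest_port)
    {
        const vci_entry_t *e = &table[*hdr_vci % VCI_TABLE_SIZE];
        if (!e->valid)
            return 0;                /* drop: unknown tag */
        *dest_port = e->out_port;    /* routeing decision */
        *hdr_vci   = e->out_vci;     /* the tag has only local significance */
        return 1;
    }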

ATM is designed to be a compromise between PTM and STM which includes sufficient control over jitter, and responsiveness to bursty sources, that it can be used for interoperation between all types of equipment.

A detailed discussion of multiplexing techniques, and of the advantages and disadvantages of ATM networks, may be found in [McAuley].

The author's experience of ATM includes the implementation of an ATM protocol stack within Unix [Blackc].

Fairisle

The Fairisle project, on which the author was employed for a period of a year before beginning research, has built a switch-based ATM network for the local area [Leslie, Blackd]. Since this network was to be used to investigate the management of quality of service in a local area multi-service network, one of the key features of the design was flexibility and programmability.

Field programmable gate arrays (FPGAs) are used extensively throughout the design, which allows the behaviour of the hardware to be changed, and many decisions are made in software. The firmware also supports a great deal of telemetry.

The overall design of the switch is based around a space division fabric with a number of port controllers (also known as line cards) grouped around it. This is shown in the figure below. The switch is input-buffered.

[Figure: Fairisle switch overview. Transmission lines feed port controllers, which are grouped around the switch fabric.]

The switch fabrics are available in sizes of 4 by 4, 8 by 8 and 16 by 16. These are made up of self-routeing, round-robin priority, 4 by 4 crossbar elements based on the design presented in [Newman] and implemented on a Xilinx FPGA. The 16 by 16 fabric uses eight of these elements arranged in a delta network using two interconnected PCBs. A single PCB gives an 8 by 8 fabric, with no use made of the potential additional paths; the arrangement for an 8 by 8 fabric is shown in the figure below. For a 16 by 16, the unconnected paths are interconnected with an identical board.

[Figure: Schematic of an 8 by 8 fabric. Two stages of 4 by 4 crossbar elements connect the input ports to the output ports; the links carry an 8-bit data path with a 1-bit acknowledge.]

Each port controller card contains one input and one output port, which together make up one line. On early versions of the port controller the transmission system was a daughter-board, but on later versions this was included on the main PCB.

[Figure: Port controller overview. An ARM CPU and DRAM sit on the memory bus; the I/O bus carries the buffer SRAM, DMA and Xilinx, with FIFOs leading to the transmission output and the backplane.]

The Port Controller itself consists of a standard ARM-based micro-computer, including DRAM and ROM, together with the Fairisle network section. The network section is comprised of buffer SRAM and a Xilinx FPGA. The Xilinx chip is responsible for reading the arriving cells into the buffer SRAM and transmitting cells into the switch fabric itself. Software is in control of these operations, dictating the cell scheduling and retry policies. After early experimentation, the management of the free queue, the header remapping, and a small number of automatic retries in the event of contention were committed to the firmware.

The ARM processor runs the Wanda microkernel (described below), together with a hand-crafted assembler interrupt routine which is responsible for handling the cell interface. The ARM processor provides a special low-latency interrupt class known as FIQ, which has access to a private bank of registers. Within the Wanda system this FIQ interrupt is never disabled, and it interacts with Wanda using conventional interrupts; in this way the Fairisle hardware is presented as a virtual device. The full details of the operation of the Fairisle interface may be found in [Haytera, Hayterb].

Since the port controller includes the standard I/O bus for the ARM chip set, it is possible to interoperate with other networks by plugging in the appropriate adaptor (e.g. Ethernet or CFR).

The Fairisle port controller was used as the experimental platform for the work described in this dissertation, and further discussions of its features occur throughout the text.

Operating System Research

Wanda

The Wanda microkernel, loosely derived from Amoeba [Tanenbaum], is a locally developed operating system designed as an aid to networking, multimedia and operating system research. It has been implemented on both uni- and multi-processors, including the VAX Firefly, the ARM and MIPS processors. It was ported to the ARM range of processors and the Fairisle hardware by the author (Mike Roe of the Computer Laboratory assisted in the original port to the Archimedes; an earlier model of the processor was used on earlier versions of the hardware).

Wanda provides a multi-threaded environment using kernel threads. The kernel implementation is multi-threaded and includes the virtual address system, the IPC system, device drivers, and the Multi-Service Network Architecture protocol family [McAuley]. The scheduler uses a static priority, run-to-completion policy. Synchronisation uses counting semaphores, which have separate implementations at user and kernel levels to aid performance. A user-space emulation library for the Posix threads primitives exists (implemented by Robin Fairbairns of the Computer Laboratory).

The system provides logically separate virtual address spaces for each process, but its focus on embedded network or operating system services precludes virtual memory. On many platforms there is no MMU hardware, and Wanda has a special Single Address Space (SAS) mode for these platforms. It was partially experience with the Wanda SAS system which led to the consideration of a single virtual address space for the Pegasus project (see the Pegasus section below).

There is usually no multiplexing of user threads over kernel threads, since the user-level synchronisation primitives reduce the cost of kernel-level scheduling. An exception is in the ANSA Testbench [ANSA], which is used as one of the available RPC systems: ANSA threads are multiplexed over tasks, with the tasks being implemented using Wanda threads.

The principal distinguishing feature of this system is that it uses the MSNA suite of protocols as its only built-in means of communication, both intra- as well as inter-machine. The original design is presented in [Dixon]; subsequently the networking code was rewritten by the author, in a similar methodology to the original, as a requirement to support the Fairisle network. The I/O scheme of the Wanda microkernel is considered in detail in chapter 5. Other protocol families, such as the internet protocol, are provided through user-space server processes, as is common on many microkernels.

The x-kernel

The x-kernel [Hutchinson] is a kernel that provides an explicit architecture for constructing and composing network protocols. A protocol is viewed as a specification of a communication abstraction through which a collection of participants exchange a set of messages. Three primitive objects are provided: protocols, sessions and messages.

Protocol objects represent the static, passive parts of conventional network protocols, i.e. the object code. Their relationship is statically configured. Session objects are also passive, but they are dynamically created; a session object contains the data structures that represent the local state of some network connection. Messages are active objects which move through the protocol and session objects in the system.

The system also provides buffer management routines, a map manager for sets of bindings between identifiers, and alarm clocks. Messages flow from one protocol to another by making invocation operations on the protocols. The protocols may use the mapping function to track the related session information, using fields in the headers, etc.

The kernel contains a pool of threads which are used to shepherd arriving packets up from the device driver through the various protocols. For the sake of uniformity, user processes are required to present a protocol interface which consumes or generates packets as required. The kernel threads shepherding a message make an upcall into the user process to deliver data. To prevent a process capturing kernel resources, the number of threads which can be active concurrently in a user process as a result of an upcall is limited.

The main feature of this system is the ease with which protocols can be manipulated, experimented with and tested. The measured performance overhead is very low compared to monolithic implementations. The x-kernel does, however, presume that all protocols are implemented in the kernel; recently, support for device drivers external to the x-kernel has been added to allow the x-kernel to operate as a single Mach server.

Pegasus

The Pegasus project [Mullender, Mullender] is a joint project funded by the European Community Esprit programme, involving the University of Cambridge Computer Laboratory and the University of Twente Faculty of Computer Science.

The aim of the project is to develop an architecture for a distributed multimedia system, and thereby to construct an operating system which supports multimedia applications. The architecture calls for special-purpose multimedia processing servers connected to Unix platforms. The Unix platforms are used for the out-of-band control of multimedia applications, and for non-multimedia applications such as compilers.

The operating system support is comprised of the Nemesis microkernel, which supports multiple protection domains within a single virtual address space. System services are viewed as objects. When object and invoker are in the same process then invocation is via procedure call; otherwise remote procedure call is used, taking advantage of local machine transports where possible.

The benefits of a single address space include ease of sharing between processes, and performance enhancements on machines with virtually addressed caches. The costs are principally that processes must be relinked when loaded, and that sharable code must use some method other than static addresses to access data.
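By way of illustration (this is a sketch of the general technique, not the Nemesis mechanism; all names are invented), shared code in a single address space can reach its per-process state through a pointer supplied by the caller, or held in a per-domain context register, rather than through a link-time static address:

    #include <stdint.h>

    /* Per-process state for a shared library: one instance per
     * protection domain, allocated when the domain is created. */
    typedef struct lib_state {
        uint64_t calls_made;      /* example mutable datum */
        void    *heap;            /* per-process heap for the library */
    } lib_state_t;

    /* A static variable would be wrong here: with a single address
     * space, every protection domain would share the one copy.
     *
     *     static uint64_t calls_made;    -- unusable in shared code
     */

    /* Instead the state is passed in (or found via a well-known
     * per-domain register), so one copy of the code serves all. */
    uint64_t lib_count_call(lib_state_t *st)
    {
        return ++st->calls_made;
    }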

The architecture also calls for split-level scheduling, to allow the applications to have control over their own scheduling requirements without compromising the kernel's control of application resource usage.

The Nemesis system is currently under development at Cambridge. Where it is referred to in this work, the reader must understand that this refers to the situation at the time of writing (mid-1994); substantial changes in both design and implementation are likely in the future.

Nemo

The Nemo system [Hyden] was an early prototype of a system in the Pegasus Architecture, and has been used as a starting platform for it. This work principally considered the inter-process scheduling required to promote Quality of Service; [Hyden] considers real-time problems in detail.

A real-time system is one in which the correct operation of the system is dependent not only on the logical correctness of some computation, but also on the time at which it is completed. A hard real-time system is one in which temporal deadlines must never be missed. A soft real-time system is one in which a small probabilistic chance of missing deadlines is acceptable; a soft real-time system may operate successfully under much greater load than a hard real-time one.

Hyden contemplates the feasibility of using hard real-time scheduling technology for solving the problems of such a soft real-time system which is subject to overload. He concludes that in such a system, where the demand may not be known in advance, a negotiated share of the CPU over a regular time period is more appropriate.

In the Nemo system the processes are guaranteed some number of milliseconds of processor time over some longer, per-process period. Unallocated or unused time is given to processes desiring more, on an environment-specific basis. Scheduling decisions are made by a privileged system process known as the kernel. The Nemo Trusted Supervisor Code (NTSC), which represents the code to which the term kernel is more usually applied, is responsible for dispatching all system events to the kernel for consideration, and for executing various processes on demand.

The key feature of the Nemo system is that the applications are always aware of the amount of CPU they have been allocated, and are informed via an upcall when they are given the CPU. Use of layered processing is shown to improve the results of computations if more time is available (e.g. when decoding a JPEG image).

The Nemo system does not include device drivers, consideration of inter-process communication, or intra-domain synchronisation. Nevertheless it is a seminal work in this field, and forms the basis for the lowest levels of the Fawn system presented in this dissertation.

Related Scheduling Work

Sumo

The Sumo project at Lancaster University [Coulson] is addressing multimedia support by modifying the Chorus microkernel. This work has the advantage that Chorus already supports Unix applications through the provision of a Unix subsystem. The features which are being revised include the introduction of stream-based (rather than message) communication primitives, support for using quality of service to exert control over communications parameters, and changes to the existing coarse-grained relative priority scheduling.

Like the Pegasus architecture, this work adopts split-level scheduling primitives, multiplexing user process threads over kernel-provided threads. To avoid priority inversion, the maximum number of kernel threads active within a process is limited to the underlying number of CPUs on the machine. This is also achieved using non-blocking system calls, and upcalls or software interrupts to indicate various events.

The Sumo project uses earliest deadline first scheduling, although no absolute guarantee is made that the deadlines will be met; the QoS parameters are treated as forming a soft real-time system. The real-time deadline scheduling coexists with the Chorus system using the standard scheduling classes mechanism. Each multimedia process presents the earliest deadline of its user-level threads to the kernel-level scheduler.

The kernel-level scheduler always runs the process with the earliest deadline; the user-level schedulers are expected to run the thread with the earliest deadline. The system relies on the user-level processes not to abuse this mechanism to take more of the CPU than the quality of service they have negotiated. The deadline of each virtual processor is read on every rescheduling operation by the kernel to compute the next kernel thread to schedule; this has a cost which is linear in the number of processes in the system.

Device drivers are implemented entirely within the kernel, and no multiplexing over kernel threads occurs. The bulk data I/O mechanism relies on using the virtual memory system to remap buffers from the device driver to the receiving process. Applications are frequently finished with the buffers before the device driver needs to use them again; in the event that they are not, the system relies on Chorus copy-on-write. A scheme to compromise between the driver's requirement always to have buffers available and minimising the need to copy is being researched.

Meta-Level Scheduler

A variant on the idea of the split-level scheduler is proposed in [Oikawa]. In that system a layer known as a Meta-Level Scheduler (MLS) is inserted between the kernel functionality and the processes. This layer consists of a page of memory for each process, which is used by that process to store its intra-process scheduling requirements; a process does not have write access to any other process's area.

The Meta-Level contains the thread control data structures for events and timeouts, along with information about priorities and policies for scheduling. When a kernel event occurs the kernel calls into the MLS, and the MLS makes the scheduling decision. The MLS may then instruct the kernel which process to run, changing protection domain if necessary. Various different policies may be implemented within the MLS, and the system is considered to be easy to experiment with, as only the MLS code need be altered.

In essence, this design abstracts the kernel scheduling and virtual processor interface from the kernel and user processes, and incorporates it into an abstract object which is constructed in such a way that methods may be invoked from the user processes and the kernel concurrently.

The MLS is a meta-level in the sense that it is designed to encourage modularity of thought when considering the scheduling decisions of the system; clearly the implementation will include a kernel / user-level split.

Summary

It is now accepted that multimedia systems are a soft real-time problem. Various scheduling mechanisms are being used to try and provide quality of service support within such a system. Some of these (e.g. Sumo) are adapting technology derived for the hard real-time environment. Most systems are using split-level scheduling to separate the CPU resource usage of the various applications from the internal scheduling requirements of each application.

High-performance ATM networks are becoming available for the transport of voluminous multimedia data. Recent hardware and operating system research has been based on improving the throughput from the network interface to the application, rather than the traditional CPU-to-memory concerns of systems engineers.

Chapter 3

Process Scheduling

Priority is normally considered to be the solution to the problem of providing quality of service support to multimedia applications. This chapter will consider some of the problems with priority-based scheduling, and then move on to consider the solution in the Fawn system, derived from the Nemo system.

Inter-process communication and the virtual processor interface of the Fawn kernel are then described. Finally, the intra-process synchronisation mechanisms are described.

Priority

First, note that a simple priority scheme is rarely sufficient. Simple priority schemes suffer from priority inversion, leading to the priority inheritance protocol and the priority ceiling protocol [Sha], which requires static analysis: the determination of the complete call graph at compile time. In fact, Nakamura argues [Nakamura] that even more complex schemes are required, again requiring static analysis.

Static analysis of code, however, is an operation which can only be applied to code with a static call graph. This is most untypical of all but the most esoteric of executives. In a system using dynamic binding of services (closures [Saltzer]) for library routines, and especially RPC (even if restricted to the same machine), it simply cannot be applied. Furthermore, in a general purpose operating system it is impossible to reconcile the claims of different competing user processes. Even in a single-user, single-purpose system the strict use of priority is not always helpful.

Priority in the Internet

In [Jacobson], Jacobson reports on the effect of routers prioritising the processing of routeing updates over forwarding duties on the internet. The observation is that the probability of packet loss during this periodic processing is much higher. It is important that a very high packet arrival rate does not delay the processing of routeing updates indefinitely; to avoid this, routers give the processing of routeing updates priority over data packet processing. The resultant gap in forwarding has been measured [Jacobson] and is substantial, even for a single router. The effect on a multimedia stream is made worse by the synchronisation of routeing updates, which causes multiple routers to cease processing at the same time. This effect is particularly severe on audio streams, as regular drop-outs are very perceptible to the human ear.

Priority in Wanda

Using the Wanda system, Dixon found that jitter over a local area connection could be significantly reduced if the Wanda threads running at each end had a higher priority than any other thread in the system [Dixon]. The improvement gained must be weighed against the cost: crucial kernel housekeeping threads were no longer guaranteed to run, and the number of streams to which this technique can be applied is clearly limited.

Additionally, the fact that interrupts caused by arriving data can indefinitely starve the system of any thread-based activity leads to an implementation where the entire signalling stack is executed as an upcall from the interrupt service routine, rather than in a kernel thread. This requires that the signalling stack be entirely non-blocking, adding significantly to the protocol's implementation complexity. Even for the extremely lightweight ATM signalling in Wanda, this in-band processing adds jitter to the streams passing through the machine; for the elaborate signalling protocols required for BISDN (e.g. Q.2931, Q.SAAL) the jitter introduced would be enormous, assuming that these protocols could be implemented in this manner at all.

Whilst upcalls are a very efficient way of dispatching arriving data, the time for which such a call captures the CPU should be strictly curbed to ensure that effective control of CPU scheduling is not lost. [Dixon] also notes that pervasive use of upcalls can have detrimental effects in certain environments.

Priority between Applications

In [Nieh] a comprehensive study is made of the effects of various scheduling schemes and parameters on a mixture of interactive, multimedia and batch applications. This work was performed on Solaris, an SVR4 Unix. They found that the performance was unacceptable without a great deal of tweaking of the priorities of the various processes, the resulting configuration being highly fragile with respect to the job mix of the machine. Further, the use of the real-time priority-based scheduler did not lead to any acceptable solution. This is a direct quotation from [Nieh]:

    Note that the existence of the strict-priority real-time scheduling class in standard SVR4 in no way allows a user to effectively deal with these types of problems. In addition, it opens the very real possibility of runaway applications that consume all CPU resources and effectively prevent a user or system administrator from regaining control without rebooting the system.

Their solution was to develop a new time-sharing class which allocates the CPU more fairly, and attempts to ensure that all runnable processes do make steady progress.

Priority within Applications

In [Sreenan], Wanda is used as a platform for implementing a local area synchronisation service. A number of problems with priority scheduling in Wanda arose during this work. It was found necessary to split the clock interrupts into two separate pieces, reminiscent of the hardclock versus softclock distinction in Unix [Leffler]. The high priority interrupt, which can interrupt other device drivers¹, ensures that no time skew occurs; the lower priority interrupt is used to examine the scheduler data structures to determine if any threads need to be awoken. This split solved the problem of high network load causing missed timer interrupts, but implementation of this dynamic priority change in timer handling necessitated an interrupt context switch. [Dixon] reported the cost of timer interrupts in the common case²; the cost of split timer interrupts is not reported, but a likely estimate is double that, a noticeable fraction of the machine for his frequent ticker. Despite using the machine in single-application mode, it took considerable experimentation to configure acceptable priorities.

¹ One of the devices on this platform has a feature that, if the device handler is interrupted at an unfortunate moment, the device has a small probability of crashing the machine. Theoretically, such short sections of the device driver should also change their dynamic interrupt priority to avoid this problem. Although this is a peculiar device, similar requirements do occasionally present themselves in the form of cache ECC errors and the like.

² On machines clocked at a few tens of MHz.
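The shape of such a split is sketched below, in illustrative C in the style of the Unix hardclock/softclock division; this is not the Wanda implementation, and the timeout-queue scan is elided:

    #include <stdint.h>

    static volatile uint64_t ticks;             /* time base: never skewed */
    static volatile int      softclock_pending;

    /* High-priority handler: may interrupt other device drivers, so it
     * must do the bare minimum, and can never be held off long enough
     * for a tick to be missed. */
    void hardclock(void)
    {
        ticks++;
        softclock_pending = 1;                  /* request deferred work */
    }

    /* Low-priority handler: runs once the interrupt priority drops;
     * only here are the scheduler data structures examined. */
    void softclock(void)
    {
        while (softclock_pending) {
            softclock_pending = 0;
            /* scan timeout queue; wake threads due at or before ticks */
        }
    }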

Another example of an application where priority caused problems is described in [Jardetzky]. This application used a dedicated Wanda machine as a networked continuous media fileserver. Two problems were encountered in the design of this system. Firstly, there was the problem of selecting the priority for control threads, i.e. the threads handling out-of-band control via RPC. These threads do not require immediate execution, since that would introduce jitter on the data streams; they do, however, require timely execution, and must not be delayed indefinitely by the data streams which they control.

Secondly, a similar problem occurs with the disk write threads, which wish to run infrequently, writing clusters of buffered data to the disk. These should be of lower priority than the disk read and network threads, whose scheduling is important for observed application jitter, but they must get some processing time, otherwise the buffers in the system will become exhausted. This particular concern is examined thoroughly in [Bosch]. In that work, concerned with conventional file I/O, the interruption of disk write requests by disk read requests is recommended. In addition, writes are delayed altogether, for as long as a configured number of seconds or until the buffer cache becomes more than a threshold fraction dirty. This, together with a very large cache, is used to reduce fileserver latency. That work makes the assumption that the read requests are sufficiently bursty that the disk eventually becomes idle and writes can occur.

For the Pandora system, a set of guiding principles was developed to help the system behave reasonably under overload, and to ensure that the user was best placed to notice and take corrective action. These were (the list is abridged from [Jones]):



- Outgoing data streams have priority over incoming ones (except at a fileserver).

- Commands have priority over audio, which in turn has priority over video.

- Newer data streams have priority over older ones.

- Where a stream is being split (multicast), bottlenecks should not affect upstream nodes.

Priority in Devices

Problems also occur in using priority for handling devices In Unix network de

vice drivers op erate at a high interrupt priority known as splimp with proto col

pro cessing b eing p erformed at the lower interrupt priority splnet This leads to

p erformance loss due to the context switches dispatching overhead and register

le saves and loads and reconguration of device interrupt masks One discus

sion of this p erformance cost can b e found in Blackc A related cost is due to

the eect of livelock where excessive data arriving can hinder the progress of re

quests already in the system Livelo ck in Unix has b een rep orted in Burrows

and Mogul This particular livelo ck problem is now so severe with high per

formance network devices that some hardware Ro deheer includes sp ecial

supp ort whichallows the device driver to disable high priorityinterrupts from the

adaptor so that low priority proto col pro cessing can b e p erformed Burrows

In eect the scheduling of the system has b een moved from the scheduler into

the device driver co de with the resultant p otential for pathological interactions

between multiple such device drivers

Another problem with giving device interrupts priority over processes is that the interrupt dispatching overhead can consume substantial amounts of the CPU. An extreme example of this is the Fairisle port controller reported in [Blackd], where an ATM device generates an interrupt for every cell received. The overhead of the interrupt dispatching in the system is such that no process execution occurs once the system is approximately half loaded, the extra CPU resources being lost. Tackling this problem is a specific aim of this dissertation.

It is interesting to note that mainframe computers avoid the problems associated with devices by having a large number of dedicated channel processors. In the case of overloading, a particular channel processor may suffer from these problems, but the rest of the processors in the system will continue to make progress. This amounts to a limit, albeit a physically imposed one, on the amount of the total CPU power of the machine which can be consumed by a device. That is, the device does not have priority over the computations of the system as a whole.

Periodicity

As a result of these problems with priority, many hard real-time control systems use the Rate Monotonic algorithm presented in [Liu]. This provides a method for scheduling a set of tasks based on a static priority calculated from their periods. It relies on the following assumptions (quoted directly from [Liu]):

(A1) The requests for all tasks for which hard deadlines exist are periodic, with constant interval between requests.

(A2) Deadlines consist of runability constraints only; i.e. each task must be completed before the next request for it occurs.

(A3) The tasks are independent, in that requests for a certain task do not depend on the initiation or the completion of requests for other tasks.

(A4) Run-time for each task is constant for that task and does not vary with time. Run-time here refers to the time which is taken by a processor to execute the task without interruption.

(A5) Any non-periodic tasks in the system are special: they are initialization or failure-recovery routines; they displace periodic tasks while they themselves are being run, and do not themselves have hard, critical deadlines.

Of these assumptions, thread independence, constant period and constant requirements are considered inappropriate for a network-based, fluctuating service. Further consideration of the inappropriate application of hard real-time solutions to soft real-time systems can be found in [Hyden].

Earliest Deadline First

The EDF algorithm, also presented in [Liu], is a dynamic priority scheduling algorithm which also relies on the assumptions of the previous section. The deadline of a task is considered to be the time at which the next request for that task will occur. It is shown that this scheme will permit a feasible schedule whenever the CPU utilisation is less than or equal to 100%.
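For reference, the classical results from [Liu] can be stated as follows, for n periodic tasks where task i has worst-case run-time C_i and period T_i (my restatement of the standard results, not a quotation from this dissertation):

    % Total CPU utilisation of the task set
    U = \sum_{i=1}^{n} \frac{C_i}{T_i}

    % EDF: a feasible schedule exists precisely when
    U \le 1

    % Rate Monotonic: a sufficient (but not necessary) condition
    U \le n \left( 2^{1/n} - 1 \right)
      \rightarrow \ln 2 \approx 0.693 \quad \text{as } n \to \infty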

This also suffers from the inapplicability of the assumption of independence. In an environment where processes make use of services provided by others, a task is not associated with a process; simply knowing what the current deadline is can be very difficult. An equivalent of the priority inheritance protocol would be required: every interaction with every other process could involve recalculation of the deadlines of the processes. Furthermore, the servers must be trusted to perform work only for the task with the earliest deadline, and to pass on the deadline correctly to any other services.

Even if such a scheme were conceivable for a single machine, it fails completely in a distributed environment where remote services must be invoked, where a deadline carries no authority, and where unpredictable network delays exist. In general, a device driver may have to exert a substantial amount of effort when a packet arrives before it is even aware which process is the recipient, and therefore which deadline is appropriate.

Summary

Priority-based mechanisms are ideal for certain applications, specifically those for which an explicit precedence of well-defined tasks exists. Such an absolute ordering does not frequently occur in general purpose systems, however; forcing such a system into the priority mould leads to instability, unmaintainability and loss of flexibility.

Sharing the CPU

As noted earlier, Fawn uses a split-level scheduling policy. This section considers the lowest, kernel-level scheduling of processes. First it will be constructive to examine this process scheduling within the Nemo and early (mid-1994) Nemesis systems, considering some of the infelicities.

Inter-process scheduling in Nemo

In the architecture of [Hyden], system processes which have guaranteed access to the CPU do so by negotiating a certain amount of CPU time, called a slice, in another, larger timer period called an interval (Hyden does not use this terminology; I introduce it here for clarity only). For example, a process displaying NTSC video may require a contract for a slice of a few milliseconds of CPU every frame-time interval of roughly 33 ms.



The scheduler keeps three different run queues: Q_c is for processes with contracted time; Q_w is for processes which have consumed all their contracted time but wish for more CPU if it is available; and Q_f is for best-effort background processes.

The kernel keeps an alarm timer for every process in order to allocate that process's contracted amount. Whenever this alarm occurs, the appropriate process has its contracted amount set to its negotiated value and is made runnable.

The design calls for the use of an EDF algorithm for scheduling processes on Q_c. Whilst Q_c is empty, the kernel runs processes from Q_w, and then from Q_f, according to an environment-specific algorithm which was left for further study.

The principle of this design is that processes acquire a fair share (as deemed by the resource manager) of the CPU, with regular access to it. Remaining time is allocated in a way which attempts to maximise the possibility of it being used effectively.

This design suffers from the use of EDF scheduling, which was discussed in the section on Earliest Deadline First above. Also, the number of clock interrupts and context switches involved can be quite large, because allocation of each slice to each process potentially requires a separate clock interrupt, a context switch, and a further context switch later to finish the slice of the interrupted process.

A related problem comes from the transfer of resources between clients and servers. When a client requires some service from another process, it may arrange for some of its CPU allocation to be transferred to the server in order to perform the work. If the client and server have different intervals, it is not clear exactly how the server's scheduling parameters should be altered.

Inter-process scheduling in Nemesis

The mid-1994 version of the Nemesis system adopted a particular prototype of Nemo which used differing scheduling algorithms to those described above.

For Q_c, the Nemesis scheduler context switches to a process as soon as it is allocated a slice. For Q_w and Q_f it allocates in a first-come first-served fashion, i.e. they are run to completion unless some contracted time is subsequently allocated.

There are a number of problems with this design. The primary problem is related to the beating of processes with different intervals. Consider the scenario of the figure below, in which process A is allocated a small slice over a short interval of a few ticks (a fixed fraction of the CPU at fine granularity), while processes B and C are each allocated a slice over a much longer, unimportant interval.

[Figure: Allocation of CPU by the Nemesis scheduler. Three timelines over twenty ticks show the contracts of processes A, B and C, the resulting allocation without carry, and the allocation with carry.]

For Nemesis, the allocation of the CPU will be as shown in the "without carry" case. When B is allocated its slice shortly after A has started running, the scheduler will defer A in favour of it. Likewise, A will regain the CPU at the beginning of its next interval; shortly after that, C gains the CPU from A. The result is that over two intervals A has had only two very small periods of access to the CPU. If there were more processes in the system (D, E, ...) this starvation could be much worse.

Obviously, the long-term starvation can be addressed by adding the slice to the contracted time of a process at the beginning of each of its intervals, rather than assigning it; this is shown in the figure by the "with carry" case. Over a long time scale A will now be guaranteed its share. It is still true, however, that processing by A can in the worst case be delayed by the sum of the slices of all of the other processes in the system: potentially a very large amount in comparison to A's interval.
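Stated symbolically (my notation, not Hyden's): if process i has slice s_i and remaining entitlement r_i, the two policies differ only in the operation applied at the start of each of i's intervals:

    \text{without carry:}\quad r_i \leftarrow s_i
    \qquad
    \text{with carry:}\quad r_i \leftarrow r_i + s_i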

For a hard real-time system these problems would be resolved by simply applying the Rate Monotonic algorithm. We have already seen in the section on periodicity, though, that in a general purpose system the rate monotonic algorithm is inappropriate.

Note also that the long-term starvation is exactly the case that will occur in an EDF scheduled system (under consideration for possible addition to Nemesis in late 1994) when the applications lie about their deadlines in order to try and access the CPU as soon as possible. EDF was also shown to be ill-suited in the section on Earliest Deadline First above.

Inter-process scheduling in Fawn

The Fawn system is based on the Nemo design, and also uses a single virtual address space for all processes, with multiple protection domains.

In the Nemo system, as described above, allocation of CPU resource is made to processes at a frequency of their own choice, with no correlation between the points at which this crediting is made.

For the Fawn system, EDF scheduling was rejected; it was decided instead to perform the reallocation of CPU resources to processes simultaneously, at a single system-defined frequency. To avoid overloading the word epoch, the period between these allocations is known as a Jubilee. A jubilee has a period derived from the minimum significant time period for the platform; this will typically be a few tens of milliseconds, depending on the performance and the exact nature of the expected applications.

The use of the jubilee system ensures that processes get a fair share of the CPU with regular access to it; in effect, all processes in the system have the same interval. Access to the CPU at a finer granularity than the jubilee is probabilistic rather than guaranteed. The use of jubilees also reduces the number of interrupts and context switches in comparison with the Nemo system.

Like the Nemo system, the kernel has a number of scheduling levels which represent a strict priority order. The first level is for guaranteed CPU, the others for successively more speculative computations. Each process in the system is allocated an amount of CPU at zero or more levels by the Quality of Service manager.

Allocations are measured in ticks, which is the base unit of time on the system. The guaranteed (i.e. first level) allocations of all processes must sum to less than the total number of ticks per jubilee. Since the kernel overheads are accrued against the currently selected process, allocations should not be made too small.

The kernel maintains a record, per process, of the number of ticks consumed at each level, and also the total number of ticks consumed in the system by processes in general at each level. These measures provide feedback to the QoS manager, which may consequently alter the allocations. (The QoS manager is, of course, expected to provide itself with some guaranteed time.)

Associated with each scheduling level is a queue which contains all the processes for which that level represents their highest current entitlement. A process only resides on a single queue.

The kernel's scheduling algorithm is very simple: it always selects the first process from the highest scheduling level. An alarm is initialised to interrupt the system after the number of ticks currently remaining to that process in that level have elapsed; in this way the CPU used by each process can be strictly controlled. If a process blocks, it is moved off the level's run queue and its remaining time is decremented by the time for which it did run; if it is subsequently unblocked, it will be returned to that queue. If a process is still running when the alarm occurs, then it is moved to the next lower queue.

At the beginning of each jubilee all processes are made runnable, their allocations at each level are reinitialised, and they are restored to the highest queue for which they have an allocation. Processes are thus guaranteed to have been given a certain share of the CPU over the interval of a jubilee, without the complexity of EDF scheduling. (The QoS manager is, of course, expected to provide itself with some guaranteed time.)

One of the advantages of the jubilee scheme is that it becomes trivial to transfer resources from a client to a server as a result of some contract; this merely requires the subtraction from the client's allocation and addition to the server's.
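As a concrete illustration, the following C sketch shows the core of the scheduling algorithm just described: pick the first process on the highest non-empty level, arm the alarm with its remaining ticks, and demote it when the alarm fires. All of the identifiers (NLEVELS, set_alarm, enqueue and so on) are hypothetical; the dissertation does not give the kernel source.

    #define NLEVELS 4                       /* guaranteed + speculative levels */

    struct proc {
        int          remaining[NLEVELS];    /* ticks left at each level this jubilee */
        struct proc *next;                  /* run queue link */
    };

    extern struct proc *runq[NLEVELS];      /* one FIFO queue per scheduling level */
    extern struct proc *idle_proc;
    extern void set_alarm(int ticks);
    extern void enqueue(struct proc **q, struct proc *p);
    extern void dequeue(struct proc **q, struct proc *p);

    /* Select the first process from the highest non-empty level and arm
       an alarm for the ticks it has remaining at that level. */
    struct proc *schedule(void)
    {
        for (int lvl = 0; lvl < NLEVELS; lvl++) {
            struct proc *p = runq[lvl];
            if (p != NULL) {
                set_alarm(p->remaining[lvl]);
                return p;
            }
        }
        return idle_proc;
    }

    /* The alarm has fired: the process has exhausted its allocation at
       this level, so it drops to the next, more speculative, queue.
       (At each jubilee boundary the allocations are reinitialised and
       every process returns to its highest allocated level.) */
    void alarm_fired(struct proc *p, int lvl)
    {
        dequeue(&runq[lvl], p);
        p->remaining[lvl] = 0;
        if (lvl + 1 < NLEVELS)
            enqueue(&runq[lvl + 1], p);
    }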

Inter-Process Communication

In an operating system one of the requirements is to provide a mechanism for the various processes to communicate. There were a number of important considerations for the design of the inter-process communication mechanism:

• For efficiency reasons, processes in the system will transfer the large part of their data using memory addresses in the single address space to which they both have access.

• The mechanism must be sufficiently flexible to permit processes to construct arbitrary communication protocols in such memory.

• The mechanism must not force synchronous behaviour on processes which would find an asynchronous mechanism more convenient.

• It must be possible to communicate in a non-blocking manner. This is important for device drivers or servers which are QoS conscious. In Unix and Wanda, device drivers have a special set of library routines which provide non-blocking access to the facilities of the system.

• Invocation of the mechanism must convey sufficient semantics about any potential communication that it need be the only point at which explicit memory synchronisation need be performed in a loosely-coupled multiprocessor system. This requirement is to enable portable use of partially ordered memory systems, such as an Alpha AXP multiprocessor or the Desk Area Network.

• Termination of a communication channel by the process at one end should not necessitate synchronous recovery on the part of the process at the other end.

• A thread scheduler within a process can map communications activities to scheduling requirements efficiently; this necessitates that the communications primitives be designed in conjunction with the concurrency primitives.

These requirements dictate a solution which is asynchronous, non-blocking, and can indicate an arbitrary number of communications which have occurred to the receiver. It must never lose communication notifications, nor require large structures to be stored.

For these reasons the design adopted a scheme based on events. An event is a monotonically increasing integer whose value may be read and modified atomically by the sender, and which can be forwarded to the recipient by a kernel system call. The recipient has a copy of the event, whose value will be updated some time later by the kernel. The relationship between these events and event counts and sequencers [Reed] is discussed below.

This propagation is called an event channel. For each process the kernel has a protected table of the destinations of the event channels originating at that process. A management process is responsible for initialising these tables and thereby creating communication channels.
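A minimal sketch of the kernel data structures this implies is given below. The field and function names are hypothetical; only the shape (a per-process, kernel-protected table mapping each outgoing channel to a process and event index) is taken from the text.

    typedef unsigned long event_val;

    struct chan_dest {                 /* destination of one event channel */
        int      dst_proc;             /* receiving process */
        unsigned dst_index;            /* event number in its table */
    };

    struct proc_events {
        event_val        *table;       /* event values, visible to the process */
        struct chan_dest *out;         /* kernel-protected destination table */
        unsigned          nchans;
    };

    extern struct proc_events *events_of(int proc);
    extern void notify_update(int proc, unsigned index);  /* delivery, see below */

    /* Kernel half of the send system call: propagate event n of the
       sender to the recipient recorded in the destination table. */
    int sys_send_event(struct proc_events *sender, unsigned n)
    {
        if (n >= sender->nchans)
            return -1;                              /* no such channel */
        struct chan_dest d = sender->out[n];
        struct proc_events *r = events_of(d.dst_proc);
        r->table[d.dst_index] = sender->table[n];   /* update the copy */
        notify_update(d.dst_proc, d.dst_index);     /* mark for delivery */
        return 0;
    }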

Virtual Processor Interface

The virtual processor interface (VPI) is similar in some ways to that in the Nemo system. These features are:

• The kernel system calls are non-blocking.

• The kernel upcalls the process when it is given a time slice, rather than resuming it. The process is then able to make its internal scheduling decisions. This is called an activation.

• The process may disable activations when operating within its own critical regions. When disabled, the kernel will resume a process instead of activating it. This avoids the need to write re-entrant activation handlers, and greatly eases thread scheduling. This is the corresponding operation in the virtual processor to disabling interrupts in the physical processor.

It has also been extended as follows:

• A mechanism for the kernel to indicate which event channels have been active is provided.

• There is support for virtualising hardware interrupts for device drivers.

• System time is unified with event handling.

Activations

The concept of activations is derived from some of the ideas of [Anderson]. When a process is given the CPU after some time without it, the process is upcalled (or activated) by a special means, rather than simply being resumed at the point where it lost the CPU. This allows the process to take scheduling actions based on the real time at which it obtains CPU resource.

Most of the time a process will wish to be activated when it is given the CPU. Sometimes, however, it will be operating within a critical section where such an activation would be difficult to cope with (entailing re-entrant handlers). The process must be able to control whether it is activated or merely resumed. This control over activations is implemented using two event counts, known as disable and enable, and three context save pointers: save, restore and init.

When a process is to be given the CPU, the kernel compares the values of disable and enable. If the value of disable is the greater, then the kernel will resume the process by loading the context from restore; otherwise it will activate the process by increasing the value of disable (preventing re-entrant activations) and loading the context from init. When a process loses the CPU involuntarily, if the value of disable is the greater then the context will be stored in restore; otherwise it will be stored in save. The purpose of init is discussed below.
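The decision can be sketched in C as follows; the structure layout and function names are hypothetical, but the comparison and the side effect on disable follow the description above.

    struct context;                       /* saved processor state */

    struct vpi {
        unsigned long   disable, enable;  /* activation control events */
        struct context *save, *restore, *init;
    };

    extern void load_context(struct context *c);
    extern void store_context(struct context *c);

    /* Giving the CPU to a process. */
    void dispatch(struct vpi *v)
    {
        if (v->disable > v->enable) {
            load_context(v->restore);     /* resume: activations disabled */
        } else {
            v->disable++;                 /* prevent re-entrant activation */
            load_context(v->init);        /* activate: run the scheduler */
        }
    }

    /* The process loses the CPU involuntarily. */
    void preempt(struct vpi *v)
    {
        if (v->disable > v->enable)
            store_context(v->restore);
        else
            store_context(v->save);
    }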

The context save pointers point into a special area of memory which is mapped both kernel and user read/write, and which is non-pageable. The thread scheduler within a process will typically change the save pointer to point to the context of the currently running thread. The activation handler will ensure that the processor state pointed at by save is preserved before re-enabling activations. A system call exists for restoring the processor state from the save pointer and increasing the value of enable atomically.

The user process can disable activations at any time by incrementing the event disable. (This may necessitate a kernel trap on platforms with no support for appropriate atomic primitives.) Such a critical region can be ended by incrementing the event enable. These indivisible regions can therefore be nested if required. The kernel also provides system calls to give up the CPU voluntarily, both with and without increasing the value of enable, if the process decides that it has no more work to do.

If a process is concerned about starvation of activations (e.g. because it frequently has them disabled when it happens to lose the CPU) then it can detect this by reading the value of the jubilee or incoming events counters after increasing enable. These are discussed in the next two sections.

When a process is to be activated rather than resumed, it is necessary for security reasons for the kernel to clear the processor registers before performing the upcall. This ensures that no privileged information may leak from the kernel, nor private information leak from other processes. When the process is activated it must make intra-process scheduling decisions (see below). To make these decisions it must load some scheduling state into the processor registers.

In Fawn these two operations are combined. In the virtual processor interface, each process must supply an init context, which is stored in the same format as the saved or resumed context. When the process is to be activated, the kernel loads this context, which includes the program counter, stack etc. Thus the process begins immediate execution of the intra-process scheduler with the full context necessary. This means that the cost of clearing the registers can be eliminated.

                          Losing the CPU    Gaining the CPU
    disable > enable      restore           restore
    disable = enable (a)  save              init (b)

    (a) Disable less than enable is not likely to be useful.
    (b) This represents an activation, which will automatically increase
        the value of disable.

    Table: Summary of VPI context area usage

As a further optimisation, it can be noted that the kernel must have already examined the disable and enable meta-events when saving a process's context. Thus at this time it can calculate whether the process will be subsequently resumed or activated, and store a single pointer in the process' kernel data structure. When that process is next given an allocation of CPU time, the kernel need merely load the register file; there is no case analysis required.

A summary of the use of init, restore and save is given in the table above.

Events

The VPI contains a pointer to the process's event table, which is written by the kernel when delivering an event, and also two meta-events called pending and acknowledged, which are used in the delivery of normal events to a process. Pending is incremented by the kernel when it updates an event value, and acknowledged is incremented by the process when it has taken due consideration of the arrival of that event. These meta-events are used to prevent a process losing the ability to act on an event arriving just as it voluntarily relinquishes the CPU. When a process gives up the CPU, the kernel compares the value of the two meta-events, and if pending is the greater then the process is considered for immediate re-activation.

The VPI also includes an implementation-specific mechanism for indicating to the process's scheduler which of its events have been updated. Two mechanisms have been considered.

Circular Buffer

The two meta-events are used as a producer / consumer pair on a small circular buffer. The kernel writes into the buffer the index of the event that it is modifying. If the buffer becomes full, the kernel writes a special code which indicates overflow, in which case the process must check all the incoming events.
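One plausible realisation of the kernel side is sketched below; the buffer size and the overflow code are invented for illustration, and pending / acknowledged are the meta-events described above.

    #define FIFO_SLOTS    32
    #define FIFO_OVERFLOW 0xFFFFFFFFu     /* "check everything" code */

    struct notify_fifo {
        unsigned long pending;            /* producer count (kernel) */
        unsigned long acknowledged;       /* consumer count (process) */
        unsigned      slot[FIFO_SLOTS];   /* indices of updated events */
    };

    /* Kernel side: record that the event at `index` was updated. */
    void notify_update_fifo(struct notify_fifo *f, unsigned index)
    {
        if (f->pending - f->acknowledged >= FIFO_SLOTS) {
            /* Full: overwrite the most recent entry with the overflow
               code; the process must now scan its whole event table. */
            f->slot[(f->pending - 1) % FIFO_SLOTS] = FIFO_OVERFLOW;
            return;
        }
        f->slot[f->pending % FIFO_SLOTS] = index;
        f->pending++;        /* doubles as the pending meta-event increment */
    }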

Bitmask

A bitmask is used to indicate which events have been updated by the kernel. The kernel sets the bit when updating the event, and the process clears it before considering the event. The process must scan the bitmask when activated (incrementing disable if necessary, as described above) to determine which events must be checked, to determine if waiting threads should be unblocked.

A variant on this scheme is to have one bit in the bitmask for a set of events which share the same cache line. A set bit indicates that at least one of the events in the set has been changed, and all should be examined by the intra-process scheduler. This would reduce the size of the bitmask without changing the number of cache lines of the (potentially large) event table which must be loaded into the cache during the scan.

Discussion

Of these two schemes the circular buffer is preferred, on two counts. First of all, the bitmask scheme introduces a requirement for an atomic operation which was not otherwise needed, whereas the circular buffer scheme relies on the atomic update already present for events. Second, the expected CPU and cache performance of the circular buffer will be superior, as even a buffer of a few cache lines will rarely overflow. Hybrid schemes are also possible, where the bitmask could be used both to reduce the cost when the circular buffer does overflow, and to avoid the same event being placed in the circular buffer more than once.

As an example, the figure below shows the value of event number n in process A being propagated by a system call to process B, where it is event number m. The mapping for event channels from A has the pair (B, m) for entry n, so the kernel copies the value from A's event table to the appropriate entry in B's event table, and places B's index (in this case m) into B's circular buffer.

[Figure: Example of sending an event update — process A calls sys_send_event(n); the kernel event dispatch table for A maps entry n to (B, m), the value is copied into B's event table, and m is placed in B's circular buffer.]

Growing the event tables

When a process needs to extend the size of its event table, it must also potentially extend the size of the kernel's event channel table. Neither of these operations is problematic if the table may be extended in place; relocation is more difficult.

A process may extend its own event table by allocating a new area of memory and copying the contents of the old table into it. When this is complete, the change can be registered in the VPI by performing an RPC (see later) to the management process responsible for managing event channels. During this time the process could have been the recipient of new events, which may have been placed in either


the new or the old areas. The process can use the event notification mechanism to determine which events may potentially need to be re-copied. The fact that the events are monotonically increasing allows it to determine easily which is the correct value.

If the process determines that the kernel's event channel table requires extension, then it must make an RPC call to the management process. The management process has authority to manipulate the kernel's table, and so can copy the table to a new area. It sets a flag so that if, during this action, it performs any event channel management for that process, it can make the changes to both the new and the old copies. This RPC may be amalgamated with the one for the process's own table.

During both of these operations all processes may continue as normal; the explicit notification of updates permits the change to proceed asynchronously.

Time

One remaining important part of the virtual processor interface is the aspect of time. A process needs to have some idea of the current time, since it needs to schedule many of its activities related to time in the real world. This is particularly important in a multimedia environment. A process may also need to know, when activated, whether it is getting an additional slice of CPU over and above its guaranteed amount or not.

In the Nemo architecture an accurate hardware clock value is available read-only to all processes. Unfortunately such a hardware register rarely exists, and typical implementations use the kernel's clock value, giving a time value accurate to the time of the last interrupt. Each process is expected to keep track of the number of times that its own interval has elapsed, by counting the number of occasions that a particular bit is set in a status register when it is activated, or by regularly reading the clock register. This mechanism was considered inappropriate for the Fawn system for multiple reasons:

• The process could occasionally be in a critical section when it is given a new time allocation. This could lead to the indication being missed, equivalent to missing a timer interrupt on a physical processor. As discussed earlier, this possibility is (from experience) carefully avoided in current systems.

• There is no homogeneous way of incorporating timeouts or alarms for a process with the event mechanism.

• There is no active notification of the fact that time is passing; a process must include regular checks to detect it.

• The available time would be a wall clock time, and would be subject to adjustments to keep it synchronised with other systems (e.g. by using NTP [Mills]), rather than the time which is being used for local scheduling. Since adjustments may introduce skew, the locally based clock, which is likely to be accurate with respect to instruction execution, is more useful to many applications.

In the Fawn system, scheduling time is indicated to each process by using event zero to denote the number of elapsed jubilees. The event is updated by the kernel on the first occasion within a jubilee that the process is given the CPU. Within the process, code can read the current time in jubilees by reading the value of event zero. The system makes no guarantees about how the CPU is allocated with smaller granularity than a jubilee, but a method is provided for reading a platform-dependent high-granularity time of day clock. The implementation of this may vary from a single instruction access to a processor cycle counter register, to a reasonably costly system call. Of course, the result may be wrong as soon as the value is obtained, due to context switches. Providing scheduling time as an event means that time is now easily integrated with the event mechanism, in a way suitable for thread primitives (see below).

As of the time of writing, the Nemesis system does not have a well defined concept of time; the current implementation has adopted Fawn's method of describing time, without having adopted the jubilee concept. Instead, the value which is updated on each activation represents the total number of system tick interrupts; Nemesis assumes that scheduling is based on a high frequency regular clock.

Interrupts

Finally, processes which manage devices may need to be informed about underlying device interrupts. These interrupts, which may be enabled or disabled via privileged system calls, are delivered to the process via an event. The initiation of this interrupt event channel is handled by the Basic Resource Allocator (see later).

The system permits, but does not require, the multiplexing of hardware interrupt conditions onto events. This is because platforms frequently have devices which raise different hardware interrupts on different conditions, but which have a single driver. The kernel contains a table which maps from hardware interrupt bits to an event number in a particular process. Exactly one entry in the table exists for each enabled interrupt.

When an interrupt occurs, the corresponding entry in the table is found, and all the interrupts listed in that entry are disabled (this list is usually a bitmask of the system's interrupt mask register). This operation requires the absolute minimum of code to make these hardware interrupt conditions cease: usually a write to the system's interrupt mask register. The listed event is sent to the process, and the kernel uses a heuristic, based on the length of time that the current process has been running, to determine if the process receiving the interrupt should be given the CPU. This heuristic reduces interrupt dispatching latency, at the potential cost of slightly more context switches. Thus a process which has a sufficient allocation of CPU, and which always handles interrupts quickly, is likely to get a very low latency.
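The dispatch path just described might look as follows in C; all of the names here are hypothetical, and the heuristic is reduced to a single predicate.

    struct irq_entry {
        unsigned mask_bits;     /* interrupt mask bits to disable */
        int      proc;          /* process managing the device */
        unsigned event;         /* event number to send */
    };

    extern struct irq_entry   irq_table[];
    extern volatile unsigned *int_mask_reg;   /* system interrupt mask register */
    extern void send_event_to(int proc, unsigned event);
    extern int  current_ran_long_enough(void);
    extern void consider_switch_to(int proc);

    void interrupt(unsigned irq)
    {
        struct irq_entry *e = &irq_table[irq];
        *int_mask_reg &= ~e->mask_bits;    /* minimum code: conditions cease */
        send_event_to(e->proc, e->event);  /* notify the driver process */
        if (current_ran_long_enough())     /* latency-reducing heuristic */
            consider_switch_to(e->proc);
    }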

The index into the kernel interrupt table may also be used by the process managing the associated device to request the enabling or disabling of the individual hardware interrupts in the system interrupt mask register. The kernel checks that the process making the request is the one listed in the process field of the interrupt table, providing security.

Intra-Process Scheduling

The intra-process scheduler is the code which sits above the virtual processor interface. The code is not privileged, and can differ from process to process. It may be very simple, in the case of a single threaded process, or more complex. Although four different thread schedulers have been written, the interface they provide to the rest of the process is the same.

Since there is exactly one intra-process scheduler for a process, it is not necessary to pass a closure (see later) to this scheduler between functions in a process or within the shared libraries. Instead, a special machine-dependent method exists to access the equivalent closure for the current thread and process. This method is typically to indirect on some especially reserved register. This allows library implementations to build on the functions in the virtual processor interface, and to utilise any or all of the synchronisation schemes presented below. The functions which are provided by the intra-process scheduler are known as the process methods.

The interface that was adopted within processes was to extend the use of events already present for inter-process communication, and provide primitives similar to [Reed], namely event counts and sequencers. The events which are used in this way are entirely local to the domain, and so are called local events. This distinguishes them from outbound events (those which can be propagated to another process using a system call) and inbound events (those which change asynchronously as a result of some other process issuing such a system call).

Event counts and sequencers

There are three operations available on an event count e, and two on a sequencer s. These are:

read(e)         This returns the current value of the event count e. More strictly, this returns some value of the event count between the start of this operation and its termination.

await(e, v)     This operation blocks the calling thread until the event count e reaches or exceeds the value v.

advance(e, n)   This operation increments the value of event count e by the amount n. This may cause other threads to become runnable.

read(s)         This returns the current value of the sequencer s. More strictly, this returns some value of the sequencer between the start of this operation and its termination.

ticket(s)       This returns the current member of a monotonically increasing sequence, and guarantees that any subsequent calls to either ticket or read will return a higher value.

In fact there is little difference between the underlying semantics of sequencers and event counts; the difference is that the ticket operation does not need to consider awaking threads, whereas the advance operation does (therefore it is wrong for a thread to await on a sequencer). The initial value for sequencers and event counts is zero; this may be altered immediately after creation using the above primitives.
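In C the interface might be declared as below. These declarations are illustrative, not taken from Fawn (in particular, the text overloads the name read for both types, which C does not permit, so distinct names are used here).

    typedef struct eventcount eventcount;   /* value plus a queue of waiters */
    typedef struct sequencer  sequencer;    /* value only */

    unsigned long read_ec(eventcount *e);          /* some value during the call */
    void await(eventcount *e, unsigned long v);    /* block until e >= v */
    void advance(eventcount *e, unsigned long n);  /* e += n; may wake threads */

    unsigned long read_seq(sequencer *s);
    unsigned long ticket(sequencer *s);            /* atomically take next value */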

These event primitives may be used for many sorts of synchronisation and communication, both intra-process and inter-process; this is shown below.

The representation of time as a monotonically increasing event count makes it very easy for intra-process schedulers, already operating in terms of events, to provide time based control of events. Two additional operations are supported. These are await_until(e, v, t), which waits until event e has value v or until time (event zero) has value t, and sleep_until(t).

By convention, an advance on an outbound event will cause the new value to be propagated by issuing the requisite system call. Only read and await should be used on incoming events, as their value may be overwritten at any time.

Concurrency primitives using events

In contrast to many other systems, where implementing one style of concurrency primitives over another set can be expensive [Birrell, Fairbairns], it is very efficient to implement many schemes over event counts.

SRC threads

In SRC threads [Birrell], concurrency is controlled by two primitives. These are mutexes and condition variables. A mutex is held for a short period of time to protect the examination of some state. A thread may then decide it needs to block while waiting for something to happen. It does this by blocking on a condition variable, which atomically releases the mutex. When some other thread has changed the condition, then it signals the waiting thread to wake up. A thread must always check the condition again inside the mutex, as some other thread could have made the condition false again in the mean time.

Mutex

An implementation of a mutex over event counts and sequencers has two fields, m.s and m.e. These are used as a sequencer and an event count respectively. A mutex initially has the sequencer with value zero and the event count with value one. Then the implementation becomes:

    lock(m):
        await(m.e, ticket(m.s))

    release(m):
        advance(m.e, 1)

Condition Variables

The operation of condition variables is almost as simple. I will first give the implementations and then argue their correctness. A condition variable contains the same fields as a mutex, namely an event count c.e and a sequencer c.s.

    wait(c, m):
        t = ticket(c.s)
        release(m)
        await(c.e, t)
        lock(m)

    signal(c):
        advance(c.e, 1)

    broadcast(c):
        t = read(c.s)
        f = read(c.e)
        if (t > f)
            advance(c.e, t - f)

Note that in wait, since the sequencer value is obtained inside the mutex, threads will always be awoken in the order in which they were blocked. This means there is no out of order race when there is a simultaneous wait and signal: no thread could be restarted in preference to the one already blocked. Also, the wake-up waiting race is handled simply, because if a blocking thread is interrupted before the await but after the release, then when it is restarted it will still await for the same value, and hence not block.

Finally, there needs to be some special discussion of broadcast. There may be some cause for concern due to the unlocked reading of both c.s and c.e. The important point to remember is that, by the semantics of condition variables, threads may be woken up unnecessarily. This means it is only necessary to show that at least the correct set of threads gets woken up. Any thread which has already got a ticket value in wait of less than the current value will be woken, because the event counter will be incremented to at least that value. If any other thread signals in parallel, between the read of c.s and the read of c.e, then c.e will be advanced and the broadcast will perform one less increment; but note that the signal will itself have incremented c.e once, and hence unblocked one thread. Also note that, as far as can be observed, the signal could have happened before the broadcast, since no action of the broadcast is observable at that point. Note that once c.e is read, the required set of threads is guaranteed to be started, and even if they (or other threads) perform other signals or broadcasts, the only action is for the unnecessary wake-ups to occur in the future (since c.e may exceed c.s).

Posix threads

Posix mutexes are similar to SRC mutexes, except that Posix mutexes may be locked recursively. This makes them a little more complicated to implement. First, there must be a way provided in the process methods to find out a unique identifier for the current thread. Secondly, each mutex must have an owner field, which is only updated under the mutex itself. One observation helps: if some particular thread holds a mutex, then the owner value will be set to that thread and will be stable. If, however, a particular thread does not hold that mutex, then the owner value, though potentially changing, cannot be set to the identifier for that thread. This leads to the following implementation:

    plock(m):
        if (m.owner == me)
            m.refs++
        else
            lock(m)
            m.owner = me

    prelease(m):
        assert(m.owner == me)
        if (m.refs != 0)
            m.refs--
        else
            m.owner = NULL
            release(m)

(Here m.refs counts the recursive locks beyond the first, and is initially zero.)

Note that since release (actually the underlying advance) must ensure memory consistency at the point that mutual exclusion is terminated, the m.owner field is guaranteed to be observable as set to NULL before any other CPU can attempt to grab the mutex. Thus this implementation works on a multiprocessor with a partially ordered memory system.

Wanda threads

Wanda uses Dijkstra counting semaphores for its concurrency primitives. As well as the P and V operations, Wanda supports timed waiting, both relative (TimeP) and absolute (UntilP). These return an indication of whether they were V'ed or whether they timed out. They can also be implemented efficiently over event counts, again using a pair of local events, one as a sequencer sem.s and one as an event counter sem.e, viz:

    V(sem):
        advance(sem.e, 1)

    P(sem):
        await(sem.e, ticket(sem.s))

    UntilP(sem, time):
        t = ticket(sem.s)
        await_until(sem.e, t, time)
        if (t > read(sem.e))
            advance(sem.e, 1)
            return TimedOut
        return Ved

    TimeP(sem, delta):
        t = ticket(sem.s)
        await_until(sem.e, t, read(time) + delta)
        if (t > read(sem.e))
            advance(sem.e, 1)
            return TimedOut
        return Ved

(Here time denotes the scheduling time event count, i.e. event zero.)

These work in the obvious way: when the thread returns from blocking it checks to see if it has been V'ed, i.e. if the event counter has reached the required sequencer value. This may have occurred while it was returning from the await_until, due to the time being reached; under Wanda's semantics this is not considered to be a timeout. If it really did time out, then the value of the event count must be corrected by incrementing it.

Wanda also has a function called Broadcast, which unlocks all the threads waiting on a semaphore. Unfortunately the semantics of this are not well defined: what is "all" in the context of a simultaneous P operation? Thus no code ever written for Wanda uses Broadcast, and so it is not necessary to provide an emulation for it. (There is in fact a single use in the Internet library, but it is adjacent to a comment which notes that it shouldn't be being used, because of the problem noted.)

One further problem with providing the Wanda interface to concurrency is that, in typical Wanda code, semaphores are not explicitly destroyed; they are allocated space on the stack, initialised, used, and then implicitly destroyed by returning from the scope in which they were declared. Such programming methodology is correct in Wanda because native Wanda semaphores, whilst idle, do not utilise any resources. When a Wanda semaphore is synthesised over local events, then each semaphore which is discarded consumes two such events permanently. A deallocation function would have been a null operation on current Wanda platforms, but would have made the porting of Wanda code much easier.

Priority

The three synchronisation schemata described all provide a sequential service to synchronising threads. A criticism might be, despite the efforts of the earlier discussion, that there is no support for priority.

Note, however, that any of the above could be recoded such that a waiting thread listed itself and its priority, and then waited on an individual private local event. The waker could consider the set of threads waiting, and advance the chosen thread's event.
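A sketch of this recoding is given below, using the event count operations declared earlier; the list handling (pick_highest_priority and so on), the mutex type, and all other names are hypothetical.

    typedef struct mutex mutex;
    extern void release(mutex *m);

    struct waiter {
        int            priority;
        eventcount     wake;       /* individual private local event */
        unsigned long  seq;        /* value this waiter awaits */
        struct waiter *next;
    };

    extern struct waiter *pick_highest_priority(struct waiter **list);

    /* Called with the protecting mutex held; releases it before blocking. */
    void prio_sleep(struct waiter **list, struct waiter *self, mutex *m)
    {
        self->seq  = read_ec(&self->wake) + 1;
        self->next = *list;               /* list self and priority */
        *list      = self;
        release(m);
        await(&self->wake, self->seq);    /* wait on the private event */
    }

    /* Called with the protecting mutex held. */
    void prio_wake(struct waiter **list)
    {
        struct waiter *best = pick_highest_priority(list);  /* unlinks it */
        if (best != NULL)
            advance(&best->wake, 1);  /* advance the chosen thread's event */
    }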

In general this is how thread packages are implemented in order to avoid race conditions, where the underlying scheduling control primitives are to block oneself or wake a particular thread. Examples include user space Wanda semaphores over the Wanda scheduler system calls, SRC threads over the Taos system calls [Birrell], Posix threads implemented using Wanda threads [Fairbairns], and SRC threads over Windows NT primitives [Birrell]. Again, it is interesting to consider how this complexity is not required when implementing over event counts and sequencers.


Summary

This chapter has considered the problems with using priority to control the allocation of CPU resource within an operating system. Instead, a mechanism based on a guaranteed share of the CPU over a fine time scale has been proposed. Inter-process communication using events has been recommended as an efficient solution to many problems of asynchrony and non-shared memory multiprocessors.

A virtual processor interface has been described which allows a process to make its own internal scheduling decisions based on information about the availability of the CPU, the events occurring as a result of inter-process communication, and, where appropriate, hardware interrupts. Events have also been shown to be effective for intra-process synchronisation, and efficient implementations of traditional concurrency paradigms have been demonstrated.

Chapter

Inter-Process Communication

Fawn is a microkernel and uses a single virtual address space. As a result, the sharing of code and the invocation of services in different processes are both highly relevant. This chapter considers these issues.

The system makes considerable use of closures [Saltzer] to access state, rather than addresses built into the code. This allows for sharing of code. Inter-process communication uses shared memory for data, and events for signalling.

Interfaces of services within the system, either in a shared library or in another process, are specified with machine-generated interface references which can be passed between processes. A binding function exists for initiating inter-process communication with a service, and a trading [ANSA] function exists for naming interface references with human-readable names, types and properties.

Language

On setting out to design a new operating system, the question of which language to use is always a consideration. In addition to the usual factors, there were a number of others particular to this work:

• concurrency: The language must not make assumptions about how concurrency control operates or, worse, that it doesn't exist. Nor must it embed concurrency primitives in output code which were not present in the source. It must be possible to implement concurrency control as deemed fit by the Fawn system, and not as the designers of the language guessed.

• exceptions: The Pegasus project had determined to standardise on the use of exceptions for failure conditions, so exceptions were a requirement for interoperation.

• closures: Since the operating system uses a single address space, the language must permit the use of shared code, with closures used for accessing state and performing operations.

• comparability: Finally, it must be possible to compare this work with other systems such as Unix or Wanda without language issues masking the performance impact of the design features.

Modula-3 [Cardelli] was rejected due to lack of availability. The number of processor architectures supported is extremely limited; it did not support the platform intended for practical work, and showed no signs of being more available in the future (the most recent port took three years, despite the authors of the language working for the company producing the new hardware). (Recently, much greater platform availability has arrived, but this change was too late for this work.)

The C++ language [Stroustrup] was rejected due to problems with the stability of the g++ compiler, and the tendency for the language to hide the performance and memory costs of the underlying operations.

The C language [ISO] was adopted, together with a portable exception mechanism using setjmp. (The exception mechanism was implemented in conjunction with David Evers of the Computer Laboratory.)

Without compiler support, the possibility of exceptions contributes a certain amount to the costs of service invocation. Their advantage is that they localise the blame for erroneous code: an exception will halt the thread if ignored. As a result, within Fawn, error codes are usually returned for code within a self-contained module, and exceptions used across module boundaries.


Shared Libraries

If a service is sufficiently trusted by the client, then it is possible that it can be implemented as a shared library. A shared library is defined as one consisting of pure code, i.e. it contains exclusively read-only, fully-resolved text segment, or fully-resolved references to other shared libraries.

All state manipulated by the library, even if opaque to the client, is passed into functions using closures [Saltzer], except the closure for process and thread specific functions, as described above.

Some library routines may operate entirely without state, and have well known names, such as strlen. Other examples of such functions are ones which create closures for subsequent use. For these stateless functions, location in a shared library with early binding (i.e. direct linking with the textual name) is used in Fawn, rather than the later binding of [Roscoe a].

A closure is a pair of pointers. One of these pointers points to a table of methods; the other points to the state record on which these methods operate. The state record may contain any private or public state, including other closures.

[Roscoe a] gives a full discussion of the use of closures and shared library modules within the Nemesis operating system. Within the Fawn system a slightly different scheme was used. This was due to this work preceding the implementation of the necessary support tools in the Nemesis system, and also due to various bugs in the binary tools for the chosen platform (the ARM processor).

When a shared library is built, all the separate object files are linked together at the appropriate virtual address, and then a tool called makeabs is run over it, which converts the public symbols (also known as external symbols) to absolutes. It can then be used as input for either the image builder mkimage (a boot image must contain the shared libraries used by the processes in the image) or the stub generator makestubs. The latter creates a static library archive with one stub object file per public symbol in the shared library. The code for each stub consists of a far jump, so that the library may be placed anywhere in the address map without necessarily being in reach from client code with a processor branch instruction. Base addresses for shared libraries are allocated by a human administrator; code which is under development, and likely to change frequently, is placed in static rather than shared libraries.

This shared library mechanism may be permitted to include code which requires external routines, provided that those routines are in another shared library, and that there is a partial order of shared libraries: any library with no dependencies on other libraries is a bottom element; libraries with dependencies are higher than the libraries on which they depend. An example is the marshalling code and stubs libraries for various interfaces, which relies on the basic C library. (Some of the code for the basic C library was ported from the Wanda library, which was written by various authors, including Joe Dixon, Sai Lai Lo and Tim Wilson of the Computer Laboratory.) Programs should be linked against shared libraries in order from lowest to highest (in the partial order sense) to avoid indirecting on stubs more than once; this is the opposite requirement to Unix. (This is because stubs for a lower level library which have been included in a higher library will be public symbols, and so will themselves have stubs in the stub archive for the higher library.)

As a result of this construction, it is possible for a program to have a routine which has a name clash with something in a shared library which it uses, and still be able to use the shared library. Other uses of that function within the shared library are not affected.

IPC Model

The mechanism for local (i.e. same machine) RPC has been the subject of a great deal of research and experimentation over the last few years. For the Topaz microkernel, [Bershad] reports that almost all RPC is local, so in a highly decomposed microkernel system its performance can be crucial to the overall performance.

Trust

In a conventional monolithic kernel such as Unix, the various different parts make assumptions of trust about others, and user processes make assumptions of trust about kernel services. Within the kernel, servers make only rudimentary checks on arguments, and trust clients with shared memory. Similarly, clients trust the server to return the CPU. Likewise, non-kernel clients trust kernel services with their memory, and that they will return the CPU.



Some of the co de for the basic C library was p orted from the Wanda library which was

written byvarious authors including Jo e Dixon Sai Lai Lo and Tim Wilson of the computer

lab oratory



In the partial order sense



This is b ecause stubs for a lower level library whichhave b een included in a higher library

will be public symbols and so will themselves have stubs in the stub archive for the higher

library

One of the advantages of a highly decomposed system is that the modularity improves the isolation of faults, and debugging. Frequently, however, it is reasonable to consider re-introducing some of the trust found in a monolithic system for crucial or stable code. This is similar in concept to compiling a module without debugging code. For example, if a client is willing to trust a server (which itself only needs read-only access to non-secret state to implement the operation) to return the CPU, then the server may be implemented as a shared library. There are potentially many different types of trust between parts of the system, and it is the responsibility of the trading and binding services to ensure that the same interface is presented irrespective of the exact implementation.

In most systems designed to achieve high performance, local RPC is implemented using pairwise shared memory which is allocated at bind time. This may be a single block of memory mapped read/write to both processes, or two areas of memory, each mapped writable by only one process. The size of the memory is determined at bind time. Of course, although this may be the defined semantics, the implementation may differ: all such buffers may be writable by all processes, and some platforms may lack protection hardware.

Migrating model

In [Bershad] a migrating model for local RPC is described. In this model, also used by Spring [Hamilton] and in some versions of Mach [Ford], the thread of control which makes the call in a client is transferred, via the kernel, and upcalled into the server process. The client loses its thread until the server returns control. If the server terminates, or fails to return control after some long period, then the kernel creates a new thread to handle the fault condition in the client process. This scheme works best with kernel threads. In some systems the designers note that the remaining cost can be significantly reduced by avoiding the kernel scheduler on this path, charging the resource usage of the server to the client on whose benefit it is working, thus reducing the overhead still further. (For Spring this path was reduced to a small number of instructions in the common case.)



For Spring this path was reduced to instructions in the common case

This mo del do es have impressive p erformance statistics esp ecially given kernel

threads but has a numb er of problems First the client thread maybeblocked

for a substantial p erio d of time Second the server may not use the up call solely



in the kernel and in user space however this cost do es not include the delayed eect

of register window traps estimated at cycles eachinSun

for performing work for that client; the implicit billing may be highly inaccurate. As was noted earlier, the use of upcalls can lead to loss of control over, and accountability of, scheduling. Third, there may be difficulties in the event of a failure somewhere in a long chain of inter-process calls; the various calls may not be timed out in the optimal order, preventing intermediate servers from performing intelligent recovery on behalf of their client. After careful consideration of this resource recovery problem, the designers of the ARX operating system [McAuley] prohibited nested RPC, by adding the constraint that no thread which had already made a migrating RPC call could make a further such call (equivalently, replacing n kernel context switches with n user ones). This problem of resource recovery is also found in equivalent capability based systems, where services exist independently of processes and are invoked using enter capabilities. A full discussion of the complex ramifications of this model may be found in [Johnson].

Additionally, and particularly on a multiprocessor system, the cache contents on the migrating processor may be adversely affected. Finally, the model of a blocking RPC is inappropriate for the implementation of various real time activities and device drivers, which require non-blocking means of communication without thread overhead, for hardware or performance reasons.

Switching model

The switching model is found in more conventional systems such as Unix and Wanda. In this model, the client signals to a blocked server thread that it should wake up and perform an operation. The client thread may or may not block, under its own control. In a variant called scheduler activations [Anderson], the thread blocks, but control of the CPU is returned to the process' internal scheduler. The client may then, at some later date, block on, or poll for, the indication from the server that the operation is complete.

This model is recommended in [Bershad] for multiprocessor machines. That work also notes that this model is superior when threads are provided at the user level (either over a non-threaded or a kernel-threaded base), due to the lower synchronisation costs. Although this model has a higher cost for local RPCs when measured simplistically, it does give the client much greater flexibility, and can in fact improve performance by batching up multiple requests to a server, so that when it runs it will have better cache and TLB performance. Various examples of similar performance improvements on a multiprocessor machine may be found in [Owicki].

The switching model is used in Fawn. A later section discusses a potential performance-enhancing heuristic, where the kernel scheduler keeps a hint of the last process which has been the recipient of an event; when reallocating the CPU it will favour that process, if it has CPU allocation remaining in that jubilee.

IPC Operation

Architecture

The architecture of the IPC system follows that of [ANSA] and [Evers b], and is connection oriented: clients must explicitly bind to servers before operations on interfaces can be invoked. Interface definitions are written in the MIDDL language [Roscoe b]; the mapping onto the C language is known as MiddlC.

Stubs are generated automatically by a stub compiler whose back end is written in Python, making for a quick prototyping language. (The front end of the stub compiler was written by David Evers and Timothy Roscoe of the Computer Laboratory.) The marshalling is procedural, and the stubs are generated per-transport, so they can take advantage of the local data and call mechanisms. The data representation is based on that used in MSDR [Crosby].

Calling conventions

The calling conventions for MiddlC interfaces are summarised in the table below. The C language only supports returning a single value directly, so for operations with more than one result, pointers to the variables to receive the results are passed as additional arguments.

Only the creator of some information can allocate the correct amount of space for it; thus the client must allocate space for arguments, and the server for results. What is more important is the consideration of when that memory gets released. For arguments, the client may release or alter the memory any time after the server returns; thus the server must copy any part of the arguments which it wishes to keep in internal state. On the other hand, any memory used by results returned from the server becomes the responsibility of the client, so the server may need to allocate memory and copy its internal state into it in order to return it to the client. In both these cases there is potential for an unnecessary copy of data. Also, in the latter case, there is the inherent problem of the client knowing the correct heap closure to free the memory on.

    Memory      Size     Allocated          Freed
    argument    Small    (by value)         -
                Large    Client             Client
    result      Small    (client's stack)   -
                Large    Server             Client

    Table: Calling conventions for MiddlC
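As an illustration (the interface and all names here are invented, not taken from MIDDL), an operation with two results might map onto C as follows:

    typedef struct FileSystem FileSystem;    /* interface closure type */

    /* IDL-style:  open : (name) -> (handle, size)
       The first result is returned directly; the second is written
       through a pointer argument supplied by the client (a small
       result, so it can live on the client's stack). */
    long FileSystem_open(FileSystem *self,
                         const char *name,   /* argument: client allocated */
                         long       *size);  /* extra result via pointer */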

This is particularly inefficient in the RPC case, because the server stubs must allocate memory to hold the unmarshalled arguments, which will then be copied by the server. Likewise the server will allocate memory to put the result into, which will immediately be freed by the server stub.

These problems would be eliminated if sharing semantics were adopted between client and server, with memory being reclaimed by garbage collecting methods, as is used in the Network Objects distributed system [Birrell]. In such cases sharing of the memory is reasonable, since if one trusts the server or client to be in the same process, then it is reasonable to trust it to share the information correctly (in the remote case it is reasonable to trust the stubs). This problem is similar to the problems with mbufs discussed earlier.

As a corollary of these complications, it is necessary in MiddlC that all exceptions are shallow, which means that none of the arguments to an exception must need memory to be allocated.

IPC Low Level

An IPC channel is implemented using a pair of memory areas. One, for arguments, is mapped read-only into the server; the other, for results, is mapped read-only into the client. To signal that a set of arguments or results has been placed in the appropriate buffer, there is also a pair of event channels between the two processes. When a client makes a call on a server, the arguments are put in the argument area and the outgoing event is incremented. The thread then typically waits for the incoming event to be incremented (with some per-binding timeout). When it is awoken, it reads the results from the other buffer.

ANSA and MIDDL have the concept of an announcement, as distinct from an operation. An announcement is a call which has no results, and which can be considered to have at-most-once semantics. For announcements, the server uses the returning event to indicate when it has finished reading the arguments. The caller need not block, unless it tries to reuse the same IPC channel before the server has read the arguments. Since announcements and operations are supported on the same interface, the low level IPC primitive to prepare the argument buffer also waits for the acknowledgement of the previous call. If the previous call was an operation, then this is satisfied trivially.

The IPC low level provides an encapsulation of the IPC channel (its buffers and event counts) with three operations. These are prepare_tx, for preparing the marshalling state record and ensuring that the server has finished with the previous call's arguments; prepare_rx, which waits for an indication that new data has arrived, and initialises the unmarshalling state record; and send, which increments the outgoing event. Note that at this low level there is no difference between the active (i.e. client) and passive (i.e. server) sides of an IPC channel; they are entirely symmetric. A send is used by the higher levels to send arguments, to send results, or to acknowledge an announcement.
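The following sketch shows how a client stub might drive these three operations for a single call; the state record fields are those described in the next section, and the marshalling helpers and failure handling are hypothetical placeholders.

    struct ipc_channel;                   /* buffers plus event counts */
    struct ipc_state {
        char *ptr;                        /* current marshalling pointer */
        long  space;                      /* remaining space in the buffer */
    };

    extern void prepare_tx(struct ipc_channel *c, struct ipc_state *s);
    extern int  prepare_rx(struct ipc_channel *c, struct ipc_state *s,
                           long timeout);
    extern void send(struct ipc_channel *c);
    extern void marshal_long(struct ipc_state *s, long v);
    extern long unmarshal_long(struct ipc_state *s);
    extern void raise_rpc_failure(void);

    long example_call(struct ipc_channel *c, long arg, long timeout)
    {
        struct ipc_state s;
        prepare_tx(c, &s);                /* previous arguments consumed? */
        marshal_long(&s, 0);              /* operation number */
        marshal_long(&s, arg);
        send(c);                          /* outgoing event: args ready */
        if (prepare_rx(c, &s, timeout) < 0)   /* wait for the results */
            raise_rpc_failure();
        return unmarshal_long(&s);
    }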

IPC Stubs

The machine generated stubs operate on public fields in the IPC channel state record. These are the current marshalling pointer, the remaining space, and a closure for a heap to be used for memory for out-of-line structures. The marshalling fields are initialised as described above. Marshalling of simple types is done by macros which operate on the IPC connection state record. Marshalling for standard complex types is provided in the standard IPC shared library; the stub compiler generates procedural marshalling code for user defined complex types.
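Such a macro might look like the following sketch, operating on the marshalling pointer and remaining-space fields named above; the macro and helper names are hypothetical.

    extern void raise_marshal_overflow(void);

    #define MARSHAL_SIMPLE(s, type, value)              \
        do {                                            \
            if ((s)->space < (long)sizeof(type))        \
                raise_marshal_overflow();               \
            *(type *)(s)->ptr = (value);                \
            (s)->ptr   += sizeof(type);                 \
            (s)->space -= sizeof(type);                 \
        } while (0)

    /* e.g.  MARSHAL_SIMPLE(&s, long, arg); */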

Client stubs

The client stubs provide a stub closure for their associated interface. The state record of this closure contains the IPC channel information, which is used to forward any function requests to the server. The methods record contains function pointers to the individual stubs for each of the operations or announcements in the interface. A client invoking methods on an interface need not know whether the implementation is in the same process or not (except that, in the remote case, the stubs may raise RPC engineering failure exceptions). Since the stub operations operate on closures, the code is pure, and can be shared between all clients of interfaces of that type.

Server Stubs

The server stubs consist of an IPC connection closure and the closure of the actual implementation. A thread, created by the server at bind time, waits inside the stubs for the event which indicates that a new argument has been marshalled. When the thread awakes, it dispatches on the first integer in the buffer to determine which operation number in the interface is being called. Individual functions exist for each operation, which unmarshall the arguments, call the server's implementation via its closure, and then marshall and send the results.

The structure of the outer code is a loop with exception handlers outside it. A further loop surrounds this. Such an arrangement means that only one instance of the code for handling a particular exception exists per interface. Also, the cost of the TRY is not executed in the common case. This is illustrated in the figure below. Stubs whose server implementations may raise exceptions, and which have allocated memory from the IPC's heap, use a TRY FINALLY block to ensure that the storage is reclaimed.

Since the stubs invoke all operations via closures, they are shared between all implementations of a particular type of interface.

    loop
        try
            loop
                wait for ipc
                dispatch to function specific stub
            endloop
        catch exception ...
        catch exception ...
        endtry
    endloop

    Figure: Server stub dispatcher pseudocode

Binding

There is a special process in the system known as the Binder. For convenience, this process is combined with the Basic Resource Allocator. When a client wishes to make an IPC connection to a server, it calls the binder, specifying the interface

reference of the interface that it wishes to connect to, and the indices of a pair of events it has allocated. The interface reference contains a process identifier, which the binder uses to call back on an IPC connection registered by that server process. As well as passing the interface reference, the binder passes in the sizes and addresses of the memory areas allocated. The call back code at the server has the option to vet the client, or the sizes of the memory buffers chosen, before returning the event numbers that it has allocated. If the callback returns successfully, then the binder will update the kernel event tables to connect the pairs between the endpoints, and return to the client. The IPC connection is then established.

Service Management

Within each process is a module known as the object table (this name comes from [Evers b]), which mediates between that process and the binder. It is this with which all services in a process register, and which all clients use to request connection to such services.

When a server wishes to export some interface, it registers the interface with the object table, together with a closure for a callback function. The object table returns an interface reference, which the server may then publish. The call back is activated when a request from a client occurs. The call back function has the last word in vetting the client, and may create any per-binding state required within the service. It returns the actual closure for the service to the object table, which is subsequently used by the server side stubs. All the other work of setting up the IPC channel is done by the object table.

The object table keeps internally a note of all the services which exist within its process. When a client requests an IPC connection to a particular service, then the object table checks to see if that service is in fact local. If so, then it will return the actual closure directly to the client. If the service is remote, then the object table makes the request via the binder, and instantiates a set of client stubs of the appropriate type (a closure for the interface for creating stubs of the correct type is handed in by the client requesting to connect to the server of that type).

Trading

What needs to be named in an operating system? In what respects are the interface naming system and the file naming system integrated? How much control does any particular process have over its own view of the name space?

Considering the Unix name space for a moment, we can see that the file system name space is used not just to name files, but also to name interfaces. Interfaces named within the Unix name space are character and block special devices (these are a particular class of interfaces which have operations that are similar to the operations that can be used on a file, but they are interfaces nonetheless) and named pipes, which explicitly name interfaces behind an IPC channel. Furthermore, in more recent versions of Unix, files such as /proc again name things which in some ways are closer to interfaces than files.

Again considering Unix, even for files themselves it is considered normal to have files of different types (data type, directory type and symbolic link type) named in the same context, with a stat operation being used to distinguish them.

Further, it is normal for NFS fileservers to prohibit the creation of character and block special files by clients of exported file systems. (Named pipes are usually permitted, since they do not affect security, but don't work remotely.) This is for security reasons. In other words, there may need to be a restriction, in general, on the sorts of things which can be published in any given context.

Finally, Unix is not clear on differentiating the type of an interface from an interface reference (i.e. a particular instance of a service of that type). This is one of the reasons for the somewhat esoteric naming used for Unix sockets. Another example is the instantiation muddle in the Unix packet filter code, where the special device open function modifies the minor device number of the inode of the file handle which has been opened (from the minor device number of the special file) in order to ensure that it is informed about subsequent opens by other processes; this is required because the packet filter code must keep per-file-descriptor state.

Following the ANSA model, the system described here provides a traded name space for both services and files. Any number of traders (or name servers) exist, in which appropriately authorised clients may store interface references for objects, along with various ancillary information such as the type. Traders may also implement directories and symbolic links. A trader service may choose to satisfy requests any way it wishes, and may in fact itself consult various other traders; in the ANSA model this is known as federation.

Each mount point which a trader is prepared to export to clients desiring a naming context is represented by an IPC endpoint. The context within the name space rooted at that point is used as an argument to operations. This keeps the number of IPC endpoints exported by a trader to a manageable amount. Also, with the exception of a symbolic link (which needs to be looked up in the client to allow crossing of mount points), this permits a trader to resolve an arbitrary number of components in a path name in a single operation.

Name Spaces

The system provides for processes to configure their name spaces as they wish. Various traders may be mounted at various points in the name space of the process, as the process sees fit. This is similar to Plan 9 from Bell Labs (Pike). Merging of file names provided by separate traders within a single directory, as provided in Plan 9, may be implemented using a proxy trader. (The Plan 9 implementation is rather restricted by a lack of the intermediate interface reference concept. This leads to an inability to export a configured name space from one process to another; their system relies on a per-user configured name space generated from the profile file. This is likely to be addressed in a forthcoming system named Brazil (Trickey).)




Restriction of Name Space

In some systems it can be the case that the owner of a process may wish, at the time of its creation, to restrict the name space which is visible to that process. In Unix this is known as the chroot operation. In Fawn there are a number of ways of performing this operation.

Any process being created must inherit some form of name space from its parent. The parent could populate this name space with proxies which ensure that lookups do not move above some particular point. Alternatively, some traders could, on request, create a mount point for this purpose.

All of these rely on the interface references for objects in the system being unguessable. This is because the system does not enforce any special mechanisms on the passing of interfaces between processes. This is in contrast to systems such as Spring (Hamilton), where the passing of doors between processes requires special consideration on the part of the operating system kernel. Even if an unprivileged (chrooted) process had a proxy binding service interposed between itself and the real binder, this would not provide a solution: consider the case of a server to which the process does have access. The server may create a new interface reference for use by that process, and there is no computationally efficient way for the proxy binder to decide whether that interface reference should be blacklisted or not.

The Pegasus project has now adopted unguessable interface references in order to permit name space restriction.

Bootstrapping

A number of special operations occur when the system boots to allow normal operation to begin. A boot image is constructed from a kernel and a number of processes; a special tool known as mkimage constructs an image which is capable of running. By convention, the first two processes in the system are the console daemon and the binder respectively. Mkimage initialises the kernel's event tables so that every domain in the system shares an event pair with both of these special processes.

Binder

On system startup, the Basic Resource Allocator (or Binder) uses a very basic simplex channel, over the pair of events set up by mkimage, to transmit the addresses and sizes of the memory areas it has allocated for IPC to each process. Once this information has been sent, the same event pair is used for controlling a normal IPC channel between the process and the Binder, as described above. This is initially used by servers to initialise the callback channel needed by the object table.

Trading

Two of the operations provided by the binder's interface are to set and to get the interface reference of the bootstrap trader. Any process requesting the interface reference is blocked until it has been set.

Other communication

This chapter has discussed the IPC mechanism provided; the next chapter discusses high performance I/O support. The use of events as inter-process synchronisation primitives, however, enables many other types of communication. Examples include circular buffers, which may be used for a stdio stream, and versioning.

Versioning is when a process makes a tree data structure publicly available and performs updates to that tree by providing altered records from any change upwards as far as the root of the tree. (A similar data structure was used to support some atomic operations in the Cambridge File Server (Needham).) Such a server can then send an event to inform the clients to change to the alternate root pointer. Clients acknowledge the change by using an event in the other direction, freeing the server to make subsequent updates. Such a mechanism may be most suitable for name servers.
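A minimal C sketch of this root-swap technique follows, with stub functions ev_send and ev_await standing in for the system's event primitives; it is illustrative only, not the actual server code.

    #include <stdio.h>

    typedef struct Node { int key; struct Node *left, *right; } Node;

    static Node * volatile current_root;  /* shared; read by clients */

    /* Stubs standing in for the event primitives. */
    static void ev_send(int chan)  { (void)chan; /* signal peer    */ }
    static void ev_await(int chan) { (void)chan; /* block on event */ }

    /* Server: build an altered path from the changed record up to a
     * new root, then publish it with a single pointer swap. */
    static void publish(Node *new_root, int notify, int ack) {
        current_root = new_root;   /* single atomic pointer update  */
        ev_send(notify);           /* clients switch to the new root */
        ev_await(ack);             /* old tree reusable only after
                                      all clients acknowledge       */
    }

    int main(void) {
        static Node a = { 1, NULL, NULL };
        publish(&a, 0, 1);
        printf("root key = %d\n", current_root->key);
        return 0;
    }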

In any communication scheme devised, the event channel is used to indicate the synchronisation point between the users. This localisation of the memory system coherence and synchronisation ensures efficient access to the shared memory. This is in stark contrast to systems such as Wanda where, to obtain consistency on a multiprocessor, code must be compiled with the -fvolatile option, which forces the compiler to consider all memory accesses as referencing volatile variables.

Summary

This chapter has considered the impact that a programming language can have on IPC and RPC systems. The C language, together with an exception mechanism, was adopted. The MIDDL interface definition language and MiddlC calling conventions were adopted for compatibility with the Nemesis system.

Closures were adopted for efficient operation in a single address space, and a shared library mechanism was described which uses automatically generated stubs for shared functions. The shared libraries may make use of other such libraries in a partial-order manner.

A model for an IPC system in the context of the previous chapter was developed. Consideration was given to both the migrating and the switching models. The migrating model was considered inappropriate because it led to problems in accounting for CPU, poor cache locality, and an unacceptable interface for device drivers or other processes which require non-blocking communication primitives.

The transport uses pairwise shared memory and event channels.

The binding and naming schemes were described, based on interface references and trading concepts from the standard literature. Finally, some issues to do with bootstrapping and the efficiency of calling conventions were discussed.

Chapter

Input/Output

The inter-process communication mechanism described in the previous section fits the needs of RPC quite well, but is not appropriate for stream bulk transfer of data. Before presenting the scheme adopted in Fawn, some other methods will be presented.

Previous Schemes

Unix

In Unix, I/O is handled differently depending on whether the data is disk based or network based. Inherent in the system is that the paging code must deal in blocks that are the same size as file system blocks.

For disk based activity, data is stored in blocks whose sizes are powers of two. The size of the disk sectors is not larger than the size at which allocation and requests are done on the block special device, which is in turn not larger than the page size or the size of buffers in the buffer pool. (In BSD-derived Unix, specifically Ultrix, each of these sizes is a fixed constant of the kernel configuration.)

For network activity, memory buffers are allocated using mbufs (Leffler). Mbufs are a two-level hierarchy of linked lists: one list is used to indicate a sequence of packets on some queue, the other to indicate the sequence of mbufs which store the data for one particular packet. An mbuf usually contains a controlling header and a small number of bytes of data, but for larger amounts it can also be extended in one of two forms, in which case a normal mbuf is used to administer the larger area of memory. The first form is a cluster mbuf, in which the data area is a page allocated from a pool managed by the virtual memory system. The second form, known as a loaned mbuf, is a kludge added for NFS support, where the data part of the mbuf actually points into some private memory, usually the disk buffer cache. In that case the header mbuf contains a function pointer which must be called when the mbuf is freed.

[Figure: Unix mbuf memory arrangement — a normal mbuf with inline data, and an extended mbuf whose data area is a cluster page or loaned memory with associated vm/fs information.]

Mbuf control fields include a length and an offset into the data indicating where the logical start is. For normal mbufs the data may be modified, but for cluster or loaned mbufs the data is read-only. Both the length and the offset of the data are specified in bytes, and there are no alignment constraints. When data is being transmitted, the semantics are defined such that the receiver of an mbuf may alter it as it sees fit (subject to the above constraint) and frees or reuses the mbuf when transmission has taken place.
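For concreteness, a simplified C sketch of the classic mbuf layout follows; the field names and the inline data size are merely indicative of BSD-style mbufs, not Ultrix's exact definitions.

    #include <stddef.h>

    /* Illustrative, simplified BSD-style mbuf. */
    struct mbuf {
        struct mbuf *m_next;      /* next mbuf in this packet's chain */
        struct mbuf *m_nextpkt;   /* next packet on the queue         */
        char        *m_data;      /* logical start (base + offset)    */
        int          m_len;       /* bytes of data in this mbuf       */
        short        m_flags;     /* e.g. cluster/loaned storage      */
        void       (*m_free)(struct mbuf *);  /* loaned data: called
                                                 when the mbuf is freed */
        char         m_dat[112];  /* small inline data area (size is
                                     illustrative)                    */
    };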

In Unix, all data is copied between user space memory and kernel buffers. On system calls, the user process gives the kernel a variable-length array of pointer/length pairs which make up the request. The kernel copies these into or out of disk blocks or mbufs as appropriate. When copying into a chain of mbufs, the socket level implementation makes a heuristic decision about whether to allocate normal or cluster mbufs: if more than a threshold number of bytes remain to be copied, then a cluster mbuf will be allocated (this threshold is specific to Ultrix; other systems may use different heuristics). This copying of the data into kernel memory is quite expensive. It is not the only copy operation, however.

Since the device driver may alter the mbufs, any transport protocol which may need to resend the data must copy the mbuf chain before giving it to the device driver. For normal mbufs this means allocating a new mbuf and copying the contents. For cluster mbufs it additionally entails entering the virtual memory system to increase the reference count on the page frame underlying the cluster. For loaned mbufs there is no way to increase the reference count, as only a freeing method is stored in the normal mbuf header; instead, a block of memory is allocated by the virtual memory system from the NFS pool and the entire data area is copied into it. This gives a loaned mbuf with the memory loaned by the virtual memory system rather than the file system; further copies of this new mbuf will still require further copies of the actual data. (In OSF there is an additional external structure, allocated in an area of memory which is neither part of the mbuf header nor the cluster/loaned data, which contains the free method for the external memory and a doubly-linked list of all the headers which refer to this data area. This reduces these costs somewhat, with the tradeoff of the additional allocation and management of the external structure.)

Protocols add headers to the chain by placing the header in an mbuf which is inserted at the head of the list.

Furthermore, on many modern platforms, many devices (excepting the most expensive and higher bandwidth ones) are not able to interface well with wide memory busses (DEC; DEC), and so data is copied again in the device driver into a format which can be accessed by these devices. This means that data is potentially copied in software three times between user space and being acceptable to device hardware.

On the receiving side, device drivers must typically copy the data into mbufs before passing it up to higher level software to decide whether it should be accepted or not. Further, this higher level software runs at a lower software interrupt priority, and thus the mbufs are placed on a queue by the device drivers; under periods of overload this queue can overflow, leading to wasted effort. Additionally, the device driver has no way of knowing what proportion of each packet is headers and what proportion is user payload, leading to some additional copying. Finally, the data will be copied from the kernel buffers into the user space memory.

Various research work has considered the feasibility of using page remapping between user space and kernel buffers for data which happens to be page aligned.

This can be complicated due to the asynchronous nature of transmission and the synchronous nature of acknowledgement for data-moving operations on common transport protocols such as TCP. Copy-on-write schemes and/or the pinning of pages in the virtual address space can be used if the complexity overhead is low enough. Unfortunately, the cost of this page flipping is frequently greater than the cost of the copy. This is especially the case on a machine with a virtually addressed cache, or on a multiprocessor. Dixon notes that coherence of TLBs in a multiprocessor is usually not addressed at all (unlike the concern for memory coherence), necessitating an inter-processor interrupt whenever page remapping takes place.

One final problem with Unix is that for protocols with many small packets (e.g. audio) a system call is required for the process to collect each such packet. This can be a major performance penalty for such protocols.

Wanda

Buffer management in Wanda picks up many attributes of the Unix system and some ideas from the Topaz system (Thacker; Schroeder). Communication primitives in Wanda are discussed more fully in Dixon.

Wanda has a set of kernel-managed buffers called IOBufs. Each IOBuf is of fixed size (various sized buffers are available); an IOBuf is some multiple of the page size of the machine (Wanda is vaxocentric in the page size it assumes). Applications may acquire and free IOBufs. The send operation takes as an argument a single IOBuf, and the recv operation returns a single IOBuf. The semantics of IOBufs define that the IOBuf is mapped into the address space of the owner and unmapped when the IOBuf is sent or freed; the implementation, however, is that all of the buffers are globally accessible all of the time.

IOBufs are identified by their number, which is an index into a read-only table of pointers. Each pointer points to a read-only IOBuf control structure which records the actual address of the data area, the owner and, if the IOBuf is being manipulated within the kernel IPC system, the offset and length of the data. Within the data area the first two words are special, containing the user process's view of the offset and length. When an IOBuf number is passed from user to kernel, the kernel checks that that process owns the buffer and that the offset and length fields are valid. When an IOBuf is passed from kernel to user, the user's fields are updated.

[Figure: Wanda IOBuf memory arrangement — a read-only table of pointers to IOBuf control structures, each referring to a data area.]

Although IOBufs are of fixed size, support for headers is enabled by leaving an area unallocated at the front of the buffer when it is requested. Thus headers may be added to the front of a contiguous area of data by reducing the initial offset. On Wanda the amount of this initial offset is enough for substantial headers (the fat cell packet format on the Ethernet uses particularly large headers).

Like Unix, Wanda also suffers from the problem that any transport protocol which may need to send the data more than once must copy the IOBuf before it is sent. This is slightly mitigated by the fact that the data is guaranteed to be aligned (in the worst case on Unix it may not be). Also, RPC is the normal paradigm for programming on Wanda, rather than the use of transport protocols, and RPC systems prefer to be in control of their own retry policies, depending on the semantics of the operation.

On reception, Wanda's performance relies on two aspects. One is a generalisation of a result noted in Bershad: that RPC programmers tend to ensure that interfaces are such that arguments and results fit in a single Ethernet packet. Since Wanda machines are usually used for specific services, the sizes of the buffers present in the system may be configured appropriately; for example, the Pandora video file server (Jardetzky) configured the kernel to have many IOBufs of the same size as video and audio segments.

The other aspect is that for cell-based networks the use of the MSSAR protocol (McAuley) allows the size of the packet to be determined from the header of the first cell; Wanda allocates an IOBuf of sufficient size. Recently, to support AAL5 (Forum), where this indication is not available, the Wanda MSNA implementation has been enhanced to allow an application to hint to a device driver the expected size of arriving packets.

Like Unix, Wanda can experience livelock problems if data is arriving faster than the application can consume the IOBufs. Also, there is no resource control on the number of IOBufs that can accumulate for a process: an occasionally observed fault condition on Wanda machines is where data is arriving for a process which is not reading from that socket (due to a bug or lack of priority), which eventually consumes all of the available buffers. This is a similar problem to priority inversion.

Fbufs

Fbufs (Druschel) are another example of operating system research on buffer management. An Fbuf is a page of memory allocated from a reserved area of the virtual address space of the machine which is common across all processes. The system notes the difference between Protocol Data Units (PDUs) and Application Data Units (ADUs); thus a transfer from one domain to another is in the form of an aggregation of Fbufs, each of which can have configurable length and offset. The figure below shows an example aggregation. This aggregation is most similar to a Unix mbuf chain consisting solely of cluster mbufs. Like IOBufs, when an Fbuf is sent it is unmapped (or reduced to read-only access) from the sending domain and added to the receiver.

The designers suggest a number of optimisations to improve performance. First, they note that Fbufs are likely to have locality in the paths they travel between various domains in the system. If this path is converted to a ring (i.e. Fbufs are returned to the originator), then some free list management has been avoided; also, the buffer need not be cleared, since security is not compromised by returning data to the domain that created it.

[Figure: Fbuf memory arrangement — an aggregate object referring, with offsets and lengths, to a sequence of Fbuf data pages. Based on a diagram in Druschel.]

Another potential optimisation is to declare Fbufs to be volatile: the process which initially created the data retains write access to the Fbufs, and so they may conceivably change whilst being read by a receiving process. In many cases this is not a problem; for example, since a process must trust the device driver to correctly give it the data which arrives for it, trusting the same device not to corrupt the data after sending the Fbufs to it is reasonable.

Fbufs provide high performance access to buffer memory for network I/O, as exemplified in Druschel. However, there are a number of disadvantages. First, there is no resource control on the number of Fbufs a process may collect as a result of I/O activity. The Fbuf area is pageable, and hence an access may result in a page fault, which may be unacceptable for device drivers, especially if it is the disk device driver which takes the fault. Furthermore, there is no support either for applications to specify the way in which they want received data to be partitioned between Fbufs, or for an Fbuf aggregate to be given to multiple receivers (e.g. for multicast packets).

Application Data Unit Support

There are a number of previous schemes for the support of ADUs within device drivers. The general idea is that the device driver will arrange for the data to arrive in a manner that will reduce the cost of its presentation to the receiving user process.

IP Trailers

The IP trailers scheme (Leffler), found in some versions of BSD Unix, used a set of special Ethernet type codes to indicate that the headers of the IP packet had been placed at the end instead of at the start. A range of Ethernet type codes was used which indicated the number of 512-byte blocks of user data in the packet. The purpose of this mechanism was to allow the receiving machine to use page flipping instead of copying. Unfortunately, it was heavily vaxocentric, assuming that all machines had 512-byte pages and that user processes' requests were page aligned. For many machines it was inappropriate, since the cost of reorganising the virtual memory map far exceeded the cost of copying.

The principal reason for moving the headers to the end was that they were variable sized: the TCP header is 12 bytes longer than the UDP header. However, it would have been much simpler merely to pad the IP header of UDP packets with 12 bytes of dummy IP options (appropriate padding options already exist). This would have converted the variable sized headers into fixed size ones, allowing the receivers to configure their receive buffers appropriately, and would have avoided the implementation costs in the many receivers for whom the trailers protocol was of no benefit in any case. This padding scheme is to be used in the network interface for the Desk Area Network (Pratt b), where an internal message may need to be converted into an external IP datagram; the IP and UDP/TCP headers are padded to fill an ATM cell payload so that the user data remains cell-aligned.

Optimistic Blast

The Optimistic Blast protocol (Carter) tried to avoid data copying by assuming that the transmission medium is likely to be idle. The data packets are sent back to back on the network; when the first packet arrives, the receiver assumes that the blast protocol is being used and changes the descriptor rings for the Ethernet chip to put the following packets in memory in such a way as to avoid copying. When contention arises, the data must be copied. This scheme has the problem that it only works for fixed sized packets, that the device driver must know which protocol is in use (in the implemented case only one protocol was supported, though in theory it could be used for any higher level protocol), and that the first packet of a blast must still be copied.

Another problem with this implementation was that the device driver attempted to redirect the Ethernet descriptors concurrently with data arriving. Regularly interleaved accesses by the driver and the hardware would cause the hardware to lock up; reinitialisation was costly and the entire blast would be lost.

A final concern is related to security. If processes are permitted to perform non-blocking I/O, then it is possible, when a packet interrupts a burst and is accidentally placed temporarily in the wrong area of memory, that the process which owns that area of memory may be able to access data which it should not be able to. For example, this would allow a rogue process on one machine to deliberately attempt to cause burst transfers when a message from a fileserver was expected, in order to gain unauthorised access to data for other processes on that machine.

Transparent Blast

An improved transparent blast protocol was proposed in O'Malley. In this scheme a control packet is sent in advance of the blast. This contains enough information to allow the receiver to reprogram the interface chip in order to put the blast data in the correct place. The Ethernet chip is disabled while the reprogramming takes place, which leads to a short deaf period. Data packets are either padded to ensure that the headers are all the same length, or else the length of the headers is indicated in the control packet. This implementation suffers from a different problem, which is that the bus latency caused by accessing the descriptors (used to break the received packet up into the header and data components) can cause the Ethernet chip to overrun.

The Cambridge Ring

The Media Access Protocol on the Cambridge Ring (Hopper) allows a receiver to give a Not Selected indication to a transmitter. This allows the construction of a very simple interface which can receive from only a single source at a time. The indication prevents the data being lost: the transmitter will retry that mini-packet. This corresponds to explicit support for blast protocols in the hardware. The Cambridge Fast Ring (Temple) supported a similar selection, and a channel mode where bandwidth was reserved for the duration of the burst.

Discussion

In general, attempts to date to make an efficient implementation of Application Data Units have failed. This has been mostly due either to network hardware protocols that made rapid decisions of recipient difficult, or to operating systems which added substantial complexity in the path from device to application. The combination of ATM networking, where the VCI can be used to demultiplex at the lowest level, and a microkernel, where applications may communicate directly with device drivers, offers the possibility of reversing this failure. This is addressed in the Fawn buffer system described below.

Requirements

Apart from the attempt to support ADUs, the requirements for an I/O buffering system in Fawn were slightly different from those of all the above systems. In the Pegasus philosophy, applications negotiate for real rather than virtual resources. This means that an application has a certain amount of buffer memory which will not be paged. If the system is short of memory, then it will require the application to free a certain amount; the application is free to choose the optimal memory to retain from its perspective (if an application does not free memory when required within a reasonable length of time, then the resource manager may simply kill it). Like the Fbuf system, there was no need for highly dynamic reallocation of buffers between different I/O data paths. Also, it would be preferable if multi-recipient data need not be copied.

Considering Device Hardware

As mentioned earlier, network hardware is frequently of one of two types. An interface is either high bandwidth (e.g. ATM), in which case DMA is usually supported and the device is reasonably intelligent, using the VCI (or similar) in the header of arriving data to access the correct buffer control information: such interfaces are self-selecting. Otherwise the interface is usually of low bandwidth, requiring software copying of data (e.g. Ethernet): such interfaces are non-self-selecting. Of course there are exceptions (e.g. Greaves), but it is reasonable to optimise for the common cases.

Examples of self-selecting interfaces include the Aurora TURBOchannel interface (Druschel) and the Jetstream/Afterburner combination (Edwards). In Jetstream, arriving packets enter a special buffer memory according to the arriving VCI. The device driver then reads the headers and instructs a special DMA engine to copy the data to its final location. Knowledgeable applications may make special use of the buffer pools in the special memory.

It has been recent practice in operating systems to support a protocol-independent scheme for determining the process for which packets arriving at an interface are destined. This is known as packet filtering (Mogul), and the technology is now highly advanced (McCanne; Yuhara). For non-self-selecting interfaces, packet filtering can determine which I/O path the data will travel along as easily as it can determine which process will be the receiver. This property is assumed in the Fawn buffer mechanism derived below.

On older hardware, many devices which used DMA required a single non-interrupted access to a contiguous buffer. On more recent platforms such as the TURBOchannel (DEC), the bus architecture requires that a device burst for no more than some maximum period before relinquishing the bus. This is to prevent the cache and write buffer being starved of memory bandwidth and halting the CPU; devices are expected to have enough internal buffering to weather such gaps. Also, the high bandwidth that is available from DMA on typical workstations depends on accessing the DRAMs using page mode; such accesses mean that the DMA cycle must be reinitiated on crossing a DRAM page boundary. Furthermore, most workstations are designed for running Unix with its non-contiguous mbuf chains. The result of this is that most high performance DMA hardware is capable of at least limited scatter-gather operation.

Protocol Software

Most commonly used protocols wish to operate on a data stream in three ways. These are:

 to add a header (e.g. Ethernet, IP, TCP, UDP, XTP);

 to add a trailer (e.g. XTP, AAL5);

 to break up a request into smaller sizes.

Headers

Headers are usually used to ensure that data gets to the right place, or to signify that it came from a particular place. We can consider how such operations affect high performance stream I/O, particularly in respect of security. In the Internet, much of the little security which does exist relies on secure port numbers. These are port numbers which are only available to the highest authority on that machine, and receivers may assume that any such packet bears the full authority of the administrators of the source machine rather than a particular user. It is similarly important that machines accurately report their own addresses. For this reason the transmission of arbitrary packets must be prohibited: transmission must include the correct headers as authorised by the system. This has been the traditional reason for having such networking protocols in the kernel or, in a microkernel, implemented in a single networking daemon. However, this is not a foregone conclusion.

It is possible instead to have protocol implementations within the user process and still retain the required security; the device driver must then perform the security control. There is a broad spectrum of possible ways of engineering such a solution. At one extreme, the device drivers actually include code (via a trusted library) which understands the protocol and checks the headers, which is close to implementing the protocol itself in each device driver.

Alternatively, the device driver could include a mechanism, similar in concept to packet filtering code, which determines if the packet is valid for transmission rather than reception. This process can be highly optimised. Most of the headers of such protocols are well defined, fixed length fields which can easily be checked under a mask and compared within the device driver. The mask and compare values would be initialised by the privileged centralised server when the I/O channel was initiated.
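The following self-contained C sketch illustrates the mask-and-compare idea; the TxTemplate type and its field sizes are assumptions for illustration, not the system's actual data structures. The header is copied before checking, for the buffer-volatility reasons discussed in the next paragraph.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    typedef struct {
        uint32_t mask[16];    /* which bits are security-relevant  */
        uint32_t value[16];   /* required contents of those bits   */
        int      nwords;      /* header length in 32-bit words     */
    } TxTemplate;

    /* Snapshot the header first so the process cannot change it
     * between the check and the DMA (the buffer is volatile). */
    static int tx_header_ok(const TxTemplate *t, const void *hdr) {
        uint32_t h[16];
        memcpy(h, hdr, (size_t)t->nwords * 4);
        for (int i = 0; i < t->nwords; i++)
            if ((h[i] & t->mask[i]) != t->value[i])
                return 0;     /* reject: unauthorised header       */
        return 1;
    }

    int main(void) {
        TxTemplate t = { { 0xffff0000u }, { 0x12340000u }, 1 };
        uint32_t good = 0x1234abcdu, bad = 0x4321abcdu;
        printf("%d %d\n", tx_header_ok(&t, &good),
                          tx_header_ok(&t, &bad));
        return 0;
    }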

For any implementation, the volatility of the buffer memory must be taken into consideration: the driver must protect against the process corrupting the headers after they have been checked. This may entail copying the security-related fields of the header before checking them. Another solution may rely on caching the secure part of the header in the device driver's private memory and updating the per-packet fields. Some common headers are shown in the figure below; the IP and TCP headers can potentially include a variable number of additional 32-bit words, specified using their hlen fields (IP options are extremely rare and difficult to use from any operating system).

[Figure: Headers for various protocols — field layouts, in 32-bit words, of the Ethernet header (destination address, source address, type/length), the IP header (version, hlen, service, length, identification, flags, offset, time to live, protocol, header checksum, source, destination), the UDP header (source port, destination port, UDP length, checksum), the TCP header (source port, destination port, sequence number, acknowledgement number, hlen, code, window, checksum, urgent pointer) and the XTP header (route, ttl, command, key, sync, sequence, data sequence, sort, data length, header checksum).]

For many other fields, such as checksums, the user process is the only one to suffer if they are not initialised correctly. Considering the protocols shown in the figure, we see that for UDP and TCP only the port values need to be secured. For IP, all but the length and checksum fields must be secured, and for Ethernet all the fields must be secured. For XTP, the fields which must be secured are route, time to live, command and key; the sort field may need to be checked, although its exact semantics are not clearly defined.

One final possible concern would be with respect to flow control or congestion avoidance: conceivably a user process could have private code which disobeyed the standards on TCP congestion control. There are various answers to this. First, a malevolent user process could simply use UDP, which has no congestion control, instead if it wished. Second, since the operating system is designed with quality of service support, the system could easily limit the rate at which a process is permitted to transmit. Third, the application may in fact be able to make better use of the resources in the network, due to application-specific knowledge or by using advanced experimental code. Recent research (Brackmo) shows that improving one's own responsiveness to network behaviour, in order to improve one's own performance, may in fact have a beneficial effect on others using the network.

Trailers

Unlike headers, trailers do not usually contain any security information; they usually contain checksum information. Trailers for common protocols may be found in the figure below. For AAL5 the pad is of variable length, up to a maximum of 47 bytes.

[Figure: Trailers for various protocols — the AAL5 trailer (pad, pad length, checksum), the Ethernet trailer (checksum) and the XTP trailer (checksum).]

Trailers are most easily dealt with by requiring the user process to provide enough space, or the correct padding, for the packet on both receive and transmit. If there is not enough, the packet will simply be discarded: a loss to the user process. Providing this space is not difficult for a process once it is known how much is necessary; this value can be computed by a shared library or discovered using an IPC call.

Fragmentation

Like trailers, there is no security consideration for local or remote systems as a result of incorrect length transfers on an I/O channel. If a transmit request is too large, the data can simply be discarded. If a receive buffer is too small, the data may be discarded or truncated. Truncation may be used deliberately in the case of network monitoring applications, which are frequently only interested in the headers of the passing packets.

Application Software

Many applications have application-specific basic data units which may be too large for individual network packets. For example, NFS blocks over Ethernet are usually fragmented at the IP level. Ideally, a system should permit the application to specify receive buffers in such a way that the actual data of interest to the application ends up in contiguous virtual addresses.

On the other hand, for some applications the application's basic data unit (i.e. the unit over which the application considers loss of any sub-part to be loss of the total) may be very small. This may be found in multimedia streams such as audio over ATM and compressed tiled video (Pratt c). For such streams the application should not have to suffer very large numbers of interactions with the device driver; it should be able to handle the data stream only when an aggregate of many small data units is available.

Scheduling

Within Fawn, the use of jubilees means that I/O channels must have a substantial amount of buffering to cope with the delays until the receiver of the data is next scheduled. Due to the heuristic used for dispatching interrupts (described earlier), the delay can be up to twice the length of the jubilee minus twice the guaranteed CPU allocation for the process in each jubilee. Of course, delays within the process itself, due to thread contention or remote invocation, also need to be considered.
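As a worked example with purely illustrative figures (assumed here, not quoted from measurements): with a 10 ms jubilee and a process guaranteed 1 ms of CPU per jubilee, the worst-case delay is 2 × 10 ms − 2 × 1 ms = 18 ms; an application handling a 10 Mbit/s stream would then require roughly 0.018 s × 10 Mbit/s ≈ 180 Kbits, or about 22 Kbytes, of buffering.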

A device driver process may have specific scheduling requirements in order to meet its quality of service contracts. In particular, it is likely to require a non-blocking access method to I/O channels.

Streaming Memory

In Hayter, the concept of a stream cache was defined. A stream cache is a special area of the cache on a system which is used directly for I/O, without the data actually being represented in the underlying memory. It is particularly suited to the processing of multimedia streams. The MMU on the system prevents applications accessing data in the stream which is stale (i.e. has been overwritten in the cache by newer data). If the application accesses data which has yet to arrive, then either the processor is halted, just like a conventional cache miss, or, if the accessed data is sufficiently far into the future, the MMU traps the access and allows the operating system to schedule some other process.

The Adopted Solution

The design for I/O buffering adopted in this system, called Rbufs, will now be presented, together with a discussion of the operation of I/O channels within the system.

Operation

The Rbuf design separates the three issues of I/O buffering, namely:

 the actual data;

 the offset/length aggregation mechanisms;

 the memory allocation and freeing concerns.

An I/O channel comprises a data area (for the actual data) and some control areas (for the aggregation information). The memory allocation is managed independently of the I/O channel by the owner of the memory.

Data Area

The Rbuf data area consists of a small number of large contiguous regions of the virtual address space. These areas are allocated by the system and are always backed by physical memory. Revocation of this memory is subject to out-of-band discussion with the memory system. To as large an extent as possible, the memory allocator will keep these contiguous regions of virtual addresses backed by contiguous regions of physical addresses; this is clearly a platform-dependent factor.

The system provides a fast mechanism for converting Rbuf data area virtual addresses into physical addresses, for use by drivers that perform DMA. On many platforms a page table mapping, indexed by virtual page number, exists for use by the TLB miss handler; on such platforms this table can be made accessible to device driver processes with read-only status.

Protection of the data area is determined by the use of the I/O channel. It must be at least writable in the process generating the data and at least readable in the process receiving the data. Other processes may also have access to the data area, especially when an I/O channel spanning multiple processes (see the discussion of longer channels below) is in use.

One of the processes is logically the owner, in the sense that it allocates the addresses within the data area which are to be used.

The Rbuf data area is considered volatile and is always updateable by the process generating the data; this was justified in the discussion of headers above.

Data Aggregation

A collection of regions in the data area may be grouped together (e.g. to form a packet) using a data structure known as an I/O record, or iorec. An iorec is closest in form to the Unix concept of an iovec. It consists of a header followed by a sequence of base and length pairs. The header indicates the number of such pairs which follow it, and is padded to make it the same size as a pair.

This padding could be used on some channels, where appropriate, to carry additional information: for example, the exact time at which the packet arrived, or partial checksum information if this is computed by the hardware. Sreenan points out that it is sometimes more important to know exactly when something happened than actually to get to process it immediately.
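The iorec layout just described can be captured in a short C sketch; the type and field names here are illustrative rather than the system's actual declarations.

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    typedef struct {
        void     *base;     /* start of a region in the data area */
        uintptr_t len;      /* length of the region in bytes      */
    } iorec_pair;

    typedef struct {
        uintptr_t npairs;   /* number of pairs that follow        */
        uintptr_t pad;      /* pads header to pair size; could
                               carry e.g. an arrival timestamp    */
        iorec_pair rec[];   /* the base/length pairs              */
    } iorec;

    int main(void) {
        char data[256];                   /* stand-in data area   */
        iorec *r = malloc(sizeof *r + 2 * sizeof(iorec_pair));
        r->npairs = 2; r->pad = 0;
        r->rec[0] = (iorec_pair){ data,       64 };  /* two regions */
        r->rec[1] = (iorec_pair){ data + 128, 32 };  /* of a packet */
        printf("pairs=%zu first len=%zu\n",
               (size_t)r->npairs, (size_t)r->rec[0].len);
        free(r);
        return 0;
    }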

Control Areas

A control area is a circular buffer used in a producer/consumer arrangement. A pair of event channels is provided between the processes to control access to this circular buffer. One of these event channels (going from writer to reader) indicates the head position, and the other (going from reader to writer) indicates the tail.

A circular buffer is given memory protection so that it is writable by the writing process and read-only to the reading process. A control area is used to transfer iorec information in a simplex direction in an I/O channel; two of these control areas are thus required to form an I/O channel, and their sizes are chosen at the time that the I/O channel is established.

The figure below shows a control area with two iorecs in it. The first iorec describes two regions within the Rbuf data area, whereas the second describes a single contiguous region.

[Figure: Rbuf memory arrangement — a control area, with head and tail positions, holding two iorecs which refer to regions of the data area.]

Usage

The figure below shows two processes A and B using control areas to send iorecs between them. Each control area, as described above, basically provides a FIFO queue of iorecs between the two ends of an I/O channel. Equivalently, an I/O channel is composed of two simplex control-area FIFOs forming a duplex management channel. The control areas are used indistinguishably no matter how the I/O channel is being used.

[Figure: Control areas for an I/O channel between A and B — one control area carries iorecs from A to B, another carries iorecs from B to A, and both refer to a shared data area.]

A typical I/O channel is in fact a simplex data channel operating in one of two modes. The purpose of these two modes is to allow for the support of ADUs in various contexts. Note that there is no requirement for either end of the I/O channel to process the data in a FIFO manner; that is merely how the buffering between the two ends is implemented.

In Transmit Master Mode (TMM), the originator of the data chooses the addresses in the Rbuf data area, places the data into the Rbufs, and places the records into the control area. It then updates the head event for that control buffer, indicating to the receiver that there is at least one record present. As soon as the downstream side has read these records from the control buffer, it updates the other (tail) event, freeing the control buffer space for more records. When the downstream side is finished with the data, it places the records into the control area for the queue in the other direction and signals its head event on that control buffer. The originator likewise signals when it has read the returned acknowledgement from the control buffer. The originator is then free to reuse the data indicated in the returning control buffer.
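One TMM transmit cycle, as just described, might look like the following C sketch; fifo_put and fifo_get are assumed wrappers over the control areas (queueing iorecs and signalling the corresponding events), not the actual Rbuf library calls.

    #include <stdio.h>

    typedef struct { unsigned base, len; } region;  /* iorec entry */

    /* Stubs for the assumed control-area wrappers: fifo_put queues
     * iorecs and signals the head event; fifo_get reads returned
     * iorecs and signals the tail event. */
    static void fifo_put(int ctl, const region *r, int n) {
        (void)ctl; (void)r; (void)n;
    }
    static int fifo_get(int ctl, region *r, int max) {
        (void)ctl; (void)max; r[0].base = 0; r[0].len = 64; return 1;
    }

    static void tmm_cycle(int to_peer, int from_peer) {
        region pkt[2] = { { 0, 64 }, { 128, 32 } };
        /* 1. data already written at self-chosen data-area offsets */
        fifo_put(to_peer, pkt, 2);     /* 2. hand iorec downstream  */
        region done[8];
        int n = fifo_get(from_peer, done, 8);  /* 3. acknowledgement:
                                                  regions reusable  */
        printf("%d region(s) free for reuse\n", n);
    }

    int main(void) { tmm_cycle(0, 1); return 0; }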

In Receive Master Mode (RMM), the operation of the control areas is indistinguishable from TMM; the difference is that the Rbuf data area is mapped with the permissions reversed, and the data is placed in the allocated areas by the downstream side. It is the receiver of the data which chooses the addresses in the Rbuf data area, and it passes iorecs indicating where it wishes the data to be placed to the downstream side. The downstream side uses the other control area to indicate when it has filled these areas with data.

The master end, which is choosing the addresses, is responsible for managing the data area and keeping track of which parts of it are free and which are busy. This can be done in whatever way is deemed appropriate. For some applications, where FIFO processing occurs at both ends, it may be sufficient to partition the data area into iorecs at the initiation of an I/O channel, performing no subsequent allocation management.

The table below presents a summary of the differences between TMM and RMM for the arrangement shown in the figure above; without loss of generality, A is the master: it chooses the addresses within the data area.

                         TMM   RMM
  Chooses the addresses   A     A
  Manages data area       A     A
  Write access to data    A     B
  Read access to data     B     A

Table: TMM and RMM properties

Since the event counts for both control areas are available to a user of an I/O channel, it is possible to operate in a non-blocking manner. By reading the event counts associated with the circular buffers, instead of blocking on them, a process can ensure both that there is an Rbuf ready for collection and also that there will be space to dispose of it in the other buffer. This functions reliably because event counts never lose events. Routines for both blocking and non-blocking access are standard parts of the Rbuf library.
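The following C sketch illustrates non-blocking progress checking with event counts; the fifo type and the ev_read stub are assumptions for illustration, not the Rbuf library's real interface.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        int      head_ev, tail_ev; /* event channels for this fifo */
        uint64_t consumed;         /* records taken by this side   */
        uint64_t produced;         /* records inserted by this side */
        uint64_t slots;            /* capacity of the circular buffer */
    } fifo;

    /* Stub for the assumed event-count read primitive; real event
     * counts only ever increase, so no events are lost. */
    static uint64_t ev_read(int chan) { (void)chan; return 1; }

    /* True if a record is ready on `in` AND there is space to
     * dispose of it on `out`, without blocking. */
    static int can_make_progress(const fifo *in, const fifo *out) {
        int ready = ev_read(in->head_ev) > in->consumed;
        int space = ev_read(out->tail_ev) + out->slots
                    > out->produced;
        return ready && space;
    }

    int main(void) {
        fifo in  = { 0, 1, 0, 0, 8 };
        fifo out = { 2, 3, 0, 0, 8 };
        printf("progress possible: %d\n",
               can_make_progress(&in, &out));
        return 0;
    }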

Longer channels

Sometimes an I/O channel is needed which spans more than two processes. An example may be a file serving application, where data arrives from a network device driver process, passes to the fileserver process, and then passes to the disk driver process.

When such an I/O channel is set up, it is possible to share certain areas of Rbuf data memory which are already allocated to that process for another I/O channel. A process may wish to have some private Rbufs for each direction of the connection (i.e. ones which are not accessible to processes in the other direction) for passing privileged information. In the fileserver example, the fileserver may have Rbufs which are used for inode information and which are not accessible by the network device driver.

The management of the channel may either be at one end or in the middle. In the example of the fileserver, it is likely to be in TMM for communicating with the disk driver and RMM for communicating with the network driver. The important point is that the data need not be copied in a longer chain, provided trust holds.

The figure below shows the I/O channels for a fileserver; for simplicity it shows only the control paths for writes. The iorecs used in the channel between the fileserver and the disk driver will contain references to both the network buffer data area and the private inode data area; only the network data buffer area is used for receiving packets. The fileserver, operating in RMM, will endeavour to arrange the iorecs so that the disk blocks arriving (probably fragmented across multiple packets) will end up contiguous in the single address space, and hence in a suitable manner for writing to disk.

Complex channels

In some cases the flow of data may not be along a simple I/O channel. This is the case for multicast traffic which is being received by multiple processes on the same machine. For such cases the Rbuf memory is mapped readable by all the recipients, using TMM I/O channels to each recipient. The device driver places the records in the control areas of all the processes which should receive the packet, and reference-counts the Rbuf areas so that the memory is not reused until all of the receivers have indicated (via their control areas) that they are finished with it.

Apart from the lack of copying, both processes benefit from the buffering memory provided by the other, compared with a scheme using copying.

[Figure: A longer Rbuf channel — the control path for fileserver writes. Packets arrive via the network device driver, which has write-only access to the network data memory and no MMU access to the inode data; the fileserver (RMM towards the network, TMM towards the disk) passes iorecs referencing both the network data area and its private inode area to the disk device driver.]

A problem potentially arises if one of the receivers of such multicast data is slower at processing it than the other and falls behind. Ideally, it would not be able to have an adverse effect on the other receiver. This can be achieved by limiting the amount of memory in use by each I/O channel: when the limit is reached, the iorecs are not placed in that channel and the reference count used is one less. The buffers are hence selectively dropped from channels where the receiver is unable to keep up. An appropriate margin may be configured based on the fanout of the connection.

One approximate but very efficient way of implementing this margin is to limit the size of the circular control buffer: iorecs are then dropped automatically when they cannot be inserted in the buffer in a non-blocking manner. Even if a more accurate implementation of the margin is required, the Rbuf scheme ensures that the cost is only paid for the I/O channels where it is required, rather than in general.

Out of band control

For I/O channels in other operating systems there exist mechanisms for out-of-band modifications or information requests. In Unix this is done via the ioctl mechanism (setsockopt and getsockopt are cleaner interfaces to the intra-kernel identical ioctl). In Wanda this is done by marshalling into an IOBuf and using WandaIPCControl. In Rbufs, an I/O channel is normally established with a parallel IPC channel which is used for out-of-band RPCs to the other end of the I/O channel. IPC was discussed in the previous chapter.

Summary

                                Mbufs   IOBufs   Fbufs   Rbufs
  Page faults possible           No      No       Yes     No
  Alignment OK                   No      Yes      Yes     Yes
  Copy to user process           Yes     No       No      No
  Copy to clever device          Yes     No       No      No
  Copy for multicast             Yes     Yes      Yes     No
  Copy for retransmission        Yes     Yes      No      No
  Support for ADUs               No      No       No      Yes
  Limit on resource usage        Yes*    No       No      Yes
  Must be cleared for security   No†     Yes      No‡     No

Table: Comparison of buffering properties

* This limit is actually a result of socket buffering.
† Because of the copy to user process memory. However, some networking code rounds up the sizes of certain buffers without clearing the padding bytes thus included; this can cause an information leak of up to three bytes.
‡ Buffers must be cleared when the memory is first allocated. This allocation is not for every buffer usage in Fbufs, but is still more frequent in Fbufs than in Rbufs.

This chapter has considered the demerits of various buffering schemes. Many of these schemes are what they are for historical reasons; for example, mbufs were designed when maximum memory efficiency was pre-eminent. The requirements for a buffering scheme in a high performance microkernel were presented in detail.

The Rbuf I/O channelling and buffering mechanisms have been described. The principal feature is the separation of the three issues of data transmission, structure aggregation control, and memory allocation. This separation allows for great flexibility in the management of communications within the machine. Rbufs are designed to minimise copying, in order to support high bandwidth applications, and also to permit the explicit scheduling of network device drivers. The table above shows a comparison of the various schemes considered.

Chapter

Experimental Work

There are three primary goals of the experimental programme described in this chapter: first, to evaluate the effectiveness and performance of the inter-process scheduling, intra-process scheduling and process communication mechanisms described in the earlier chapters; second, to quantify the operational benefit of using this scheme to tackle device driver scheduling in general, by confronting the particular problem of the Fairisle Port Controller identified earlier; and third, to consider the suitability of Rbufs for I/O channel communication in an event-based microkernel system.

Experimental Platform

The experimental platform chosen was the ARM processor (ARM; ARM) and the Fairisle Port Controller (Hayter a; Hayter b). This was due to the author's detailed knowledge of this hardware, gained during the porting of the Wanda microkernel to this platform and the writing of the code for the Fairisle ATM switch.

The ARM processor on this platform has a small internal shared instruction/data write-through, read-allocate, set-associative cache with multi-word cache lines. There is a write buffer, of two addresses and eight data words, between the processor core and the bus. The bus on the FPC runs at a fixed rate, and a cache line can be loaded in a small number of bus cycles, giving a modest peak memory-to-cache bandwidth. The ARM does not have separate instruction and data paths to the cache, so all data accesses stall the pipeline. Also, there are no branch-delay or load-delay slots, so these too stall the pipeline.

The Fairisle network interface is optimised for controlling data passing through the port controller, rather than for data sourced or sunk at that device. Accesses to the cell buffer memory take three cycles per word (assuming no contention), or can use cache-line-sized bursts with a correspondingly limited peak bandwidth. This is particularly relevant to the experiments described later in this chapter.

The port controller provides DRAM and two programmable counters which decrement on every tick of a fixed clock. These counters are only accessible a few bits at a time, via costly I/O accesses, and require a special latch command to be written before they can be read.

It can be seen that this is not a fast machine by current standards; in particular, the cache is very small. It will be seen below that the slow speed, and particularly the small cache, of this machine show the Fawn design under somewhat unfavourable conditions; on a more modern machine the Fawn system would perform even better.

System Configuration



The system is configured with a jubilee size equal to the maximum size of one of the timers on this hardware. Clearly this was fairly arbitrarily chosen, rather than giving any particular special effect for these experiments; other jubilee sizes are possible by simply modifying the system configuration. A discussion of the costs associated with each jubilee is found below.

The configuration of the kernel normally includes the console daemon, the Basic Resource Allocator and the bootstrap trader, plus whatever test programs are relevant. The number of ticks of CPU allocated to each process is test-specific.

The other timer is reprogrammed on each occasion that a process is given the CPU. This will interrupt after a number of ticks which represents the remaining allocation of the selected process. Unfortunately, due to a clock synchronisation problem, the previous interrupt must be cleared no less than four times; this programming operation takes a few microseconds on average. When a process loses or gives up the CPU, reading the timer to measure how long the process ran takes one of two roughly constant times, depending on whether the timer expired or not.

Measurement Details

The timing measurements presented in this chapter were made by reading the timer which is used for generating the jubilee interrupt. The overhead of reading the time has been measured, and is included in all the times reported.

Measurements are written to a table in memory, with no attempt being made to process the data as it is being logged (by generating reference counts or otherwise). Since the ARM's cache is read-allocate only, this logging has no effect on cache behaviour. The data is post-processed before being output on the console.

Inter-process scheduling

Interrupt Latency

One of the potential concerns with moving device drivers out of the kernel and into user-space processes is that the latency for interrupts may be increased by an unacceptable amount. The purpose of this experiment is to measure interrupt latency in this system.

As well as the obvious dierence in latency between the idle and busy cases

there is also an exp ected dierence dep ending on the length of time b etween the

pro cess relinquishing the CPU and the interrupt o ccurring This is b ecause of

the heuristic describ ed in section When an interrupt o ccurs for a pro cess

other than the currently active one the kernel compares the length of time that

the current pro cess has b een running against a congured threshold If that time

is less than the threshold then the kernel considers it to o inecient to take the

CPU away this is to prevent a large amount oftimebeingwasted in the system

switching to o frequently between pro cesses If the pro cess has been running for

more than the threshold then the kernel susp ends it and initiates a reschedule

The figure below shows the expected sequence of events for the idle case, and for the busy cases when the threshold is exceeded and when it is not. In the experimental system this threshold is µs, i.e. ticks.

Figure: Interrupt progress for three scenarios, showing thread, process and kernel activity for the idle case, the busy case with a long programmed delay, and the busy case with a short programmed delay (where the thread must wait for the next jubilee).

Since both timers available in the system are already in use by the low level scheduler, another source of timed interrupts was required for this experiment. The Fairisle Xilinx chip contains some telemetry hardware amenable to this task: it provides a free-running counter (FRC) and a down counter (TC). The latter raises an interrupt when it reaches zero, which remains asserted until it is reinitialised by software. The clock used for these counters is actually the Fairisle cell synchronisation pulse, which occurs once every µs (a whole number of periods of the fabric clock).

In this experiment the FRC is first read, then the TC is programmed with a value, and the thread blocks on the interrupt event. When the interrupt occurs the event is triggered; when the thread awakes it reads the FRC again. The difference in the FRC values, minus the programmed TC sleeping time, is considered to be the interrupt latency.
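A sketch of taking one such sample (the accessor names and the event primitive are assumptions) is:

    extern unsigned int frc_read(void);           /* assumed FRC accessor */
    extern void tc_program(unsigned int delay);   /* assumed TC accessor  */
    extern void event_wait(int ev);               /* block on an event    */
    extern int tc_interrupt_event;                /* event bound to the TC */

    /* One latency sample: all values are in cell-synchronisation
       ticks.                                                       */
    static unsigned int latency_sample(unsigned int delay)
    {
        unsigned int before = frc_read();
        tc_program(delay);              /* interrupt fires after 'delay' */
        event_wait(tc_interrupt_event); /* thread blocks until woken     */
        return (frc_read() - before) - delay;
    }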

The system was measured both when idle and when under overload conditions. The system is overloaded by creating a daemon process which loops, being allocated a full jubilee period of time in the highest non-guaranteed scheduling level (see the earlier discussion of scheduling levels).

Table: Interrupt Latency

                       Programmed Delay
    System             Small            Large
    Idle               µs               µs
    Busy               Jubilee          µs

The results of this experiment are shown in the table above. It can be seen that when the system is idle the interrupt dispatch latency is small; in particular it is not excessively worse than might be expected on a traditional system (e.g. Unix). When the system is busy, however, we see the expected effect of the heuristic described above.

If the interrupt went off before the threshold was up for the competing process, then the measurement process would have to wait until the next reschedule before gaining the CPU. Since in this configuration the competing process is looping, this will not be until the start of the next jubilee.

What can be observed from this experiment is that even when the system is in overload, the interrupt dispatching latency is only about µs worse than when idle in the typical case, and the worst case is bounded by the length of the Jubilee.

In comparison, for a more powerful DECstation (bus clock frequency MHz, twin Kbyte caches), Shand measures the interrupt latency to a kernel device driver under Ultrix to be usually between µs and µs. In the worst case they report delays of over ms on a machine performing no other IO, and higher instance-specific delays for machines with networking hardware.

(It could be argued that the system ought to force a reschedule because the current process is operating in a lower scheduling level than the process which has received the interrupt; this is disabled to allow this test. See the earlier discussion of scheduling levels.)

Fawn is structured so that it is always able to handle an interrupt, but it chooses, using the heuristic, to do so at an appropriate time. The decision is made as soon as the interrupt occurs; obviously the heuristic could, if necessary, be easily altered for particular circumstances.

Jubilee Startup Costs

One of the costs of the system proposed is the time taken to reinitialise the various data structures at the beginning of each Jubilee. The purpose of this experiment is to quantify the effects of this mechanism.

A test process was written which intercepted its own application startup vector and read the value of the jubilee timer. This was run on a system with the usual daemon processes and a variable number of other dummy processes. The system daemons were configured without any guaranteed time; the test process and dummy processes were given guaranteed time. As a result the test process would be the first process to run in each jubilee. A diagram of the expected order of execution during each jubilee is shown in the figure below.

Figure: CPU activity during a Jubilee, showing the kernel, the test process, the dummy processes and the system daemons; the jubilee latency is the interval from the start of the jubilee to the first activation of the test process.

The additional dummy processes consist of a simple program whose initial thread exits in a way that means that the process does not exit; thus the process is still activated each jubilee but immediately surrenders the CPU.

The system was run with the two normal daemons (the system trader having been eliminated) and some number of these dummy programs, and the Jubilee initialisation latency was recorded. The latency was found to vary considerably with the exact locations of various processes in the memory map. A machine with a second Arm processor variant became available late during experimentation; the figure below shows the results for both the Arm 610 and the Arm 710 running identical system images.

Figure: Latency from jubilee start to first activation (microseconds) against the number of processes, for the Arm 610 and the Arm 710.

The difference between these two lines is entirely due to cache artifacts: the 710 has a larger cache, with larger cache lines, but its associativity is lower. For the one processor the cost appears to be µs per process with an intercept of µs; for the other the cost is about µs per process with an intercept of µs. With the cache disabled the cost is µs per process with an intercept of µs. Interestingly, the cold cache case (derived by flushing the cache as soon as a new jubilee is detected) has a per-process cost of µs with an intercept of µs, indicating that the cache is highly effective for the code but may be thrashing on the data structures.

Various other configurations of processes were tried, many of which showed distinct knees related to caching behaviour.

Clearly the fact that all the processes' data segments (and hence VPI information) are page aligned on this system is causing substantial cache conflict. Although this unfortunately makes the direct jubilee cost difficult to measure, the salient point is that the jubilee cost appears to be mostly dwarfed by the indirect cache costs of the fine grain sharing provided.

A few researchers have observed performance problems due to cache conflicts in client/server environments, although this is not a standard benchmark and therefore not greatly studied. For example, Chen presents studies of X clients and server for the DECstation running Ultrix. This problem has tended to be regarded as soluble using shared libraries (which Ultrix lacks) and greater cache associativity. This study, having both shared libraries and a highly associative cache, shows that further considerations may be necessary.

One possibility is to relax the alignment requirements. Another, more interesting, approach of providing Quality of Service support in the cache implementation is proposed in Pratta. Further consideration of this issue is beyond the scope of this dissertation.

Same machine RPC

This experiment attempts to measure the cost of invoking a service in another process on the same machine. A testing interface includes an operation called ping which provides a null-RPC service. The invocation uses the IPC channels described earlier.

In order to exclude the jubilee effects mentioned above, the test performs pings per jubilee and then blocks until the next, ensuring that no calls are outstanding from one jubilee to the next. The figure below shows the distribution of the round trip times: a mean of µs, with sample interval between µs and µs. The long tail is a result of cache misses; the points above the gap between µs and µs represent the first invocations after a jubilee, which suffer poorer cache performance. An approximate breakdown of the time taken was obtained by embedding LED changing code throughout the system and watching with a logic analyser; this is presented in the table below.
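A sketch of the measurement loop (the batch size, timer accessor, stub and blocking primitive are assumptions) is:

    #define PINGS_PER_JUBILEE 16   /* assumed; configured per test  */

    extern unsigned int timer_read(void);        /* assumed accessor */
    extern void ping(void);                      /* null-RPC stub    */
    extern void block_until_next_jubilee(void);  /* assumed primitive */

    /* Time a batch of null RPCs, then block so that no call is
       ever outstanding across a jubilee boundary.                  */
    static void measure_batch(unsigned int *results)
    {
        int i;
        for (i = 0; i < PINGS_PER_JUBILEE; i++) {
            unsigned int t = timer_read();
            ping();                    /* round trip to the server  */
            results[i] = timer_read() - t;
        }
        block_until_next_jubilee();
    }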

Of the time in the stubs, a TRY/FINALLY clause in the server stubs takes µs. This clause is superfluous in this context (the procedure has no arguments which require memory allocation), but the stub compiler is not currently capable of optimising it out.

Figure: Same machine null RPC times (number of occurrences against round trip time in microseconds).

Table: Approximate null RPC latency breakdown

    System Component                   Percentage
    Kernel scheduler
    Kernel event propagation
    User process schedulers
    User process in stubs

As a related experiment, the time taken to advance an event which must be propagated to another process was also measured. The samples for this fell within an interval of ticks, averaging µs. This is an important measurement since it is crucial to general system performance (e.g. when using Rbufs there will be many such events sent between interprocess scheduling decisions).

Intraprocess scheduling

The design of intraprocess scheduling and synchronisation primitives was discussed earlier. During the development of the prototype implementation, no fewer than four distinct instances of intraprocess schedulers meeting this specification were developed. The potential for many more application-specific schedulers is a benefit of the Fawn design.

The simplest scheduler, known as triv, is one which supports a single user thread. It runs with the disable event count higher than enable at all times, so it is always resumed and not activated. When the single thread blocks, either on an event or for a time or both, the process loops checking the conditions and yielding the CPU to the kernel. The process will be resumed, and check the condition, if any events (including a time change) arrive.
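A sketch of the resulting structure (the event and yield primitives here are assumptions) is:

    typedef unsigned int eventval;

    extern eventval event_read(int ev);      /* assumed event accessor */
    extern void kernel_yield(void);          /* give the CPU back      */

    /* Block the single thread until event 'ev' reaches 'target':
       there are no other threads to run, so just poll and yield;
       the kernel resumes the process whenever any event (or the
       time) changes.                                               */
    static void triv_await(int ev, eventval target)
    {
        while ((int)(event_read(ev) - target) < 0)
            kernel_yield();
    }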

The first scheduler to be implemented, known as cor, was one which deals with multiple threads but is strictly coroutine based. It does not support time (in fact it was written before the kernel supported time at all), except by busy-waiting around a call to yield. For each event it has a linked list of threads waiting on that event, ordered by the value that they are waiting for. When an event is changed, locally by advance or remotely (in which case it will be indicated to the domain as described earlier), this list is examined and any runnable threads are then moved to the run queue. This scheduler was used for the console daemon while the rest of the system, including the other schedulers, was being debugged.

Another scheduler, known as cort, is the same as the one just described except that it supports time fully. This entails each blocked thread being queued on up to two different lists. These lists are now doubly linked to ease removal, since threads may now be awoken due to activity on the other list and need removing from the middle of a list.

Finally, due to the observation of the costs associated with a thread context switch in the doubly-linked double list implementation, a scheduler known as corn was implemented which keeps a fixed sized array of threads and, when idle, scans the array and re-evaluates the condition on which each thread is waiting. This is clearly of linear cost in the number of threads, but has a small constant.
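A sketch of the corn idle scan (the structure layout and names are assumptions) is:

    #define MAX_THREADS 16                /* assumed fixed array size */

    struct thread {
        int      in_use;
        int      ev;                      /* event waited on          */
        unsigned waitval;                 /* value waited for         */
    };

    extern unsigned event_read(int ev);   /* assumed event accessor   */
    extern void run(struct thread *t);    /* assumed dispatch         */

    static struct thread threads[MAX_THREADS];

    /* When idle, re-evaluate every thread's wait condition: linear
       in the number of threads but with a very small constant.     */
    static void corn_scan(void)
    {
        int i;
        for (i = 0; i < MAX_THREADS; i++)
            if (threads[i].in_use &&
                (int)(event_read(threads[i].ev) - threads[i].waitval) >= 0)
                run(&threads[i]);
    }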

Comparative performance

This section presents the result of measuring the time for a same-process thread context switch using each of the above schedulers. Each process has just the two threads involved in the experiment. The context switch was performed times, each of which was measured by reading the jubilee timer; measurements which cross jubilees are excluded from the results. This is presented in the figure below. The time, in ticks, is measured for a switch from one thread to the other, the switch back, and the read of the time counter, which is known to take ticks. Removing the time to read the counter (which is a non-trivial fraction for these measurements), a single context switch for these three schedulers takes , and microseconds respectively. This compares with approximately µs for Wanda, with all checks disabled, on the same platform.

Figure: Context switch times for the corn, cort and cor schedulers (number of occurrences against time in microseconds).

For the platform used in this investigation, the crossover point for the corn scheduler compared to the cort scheduler turns out to be threads. Thus for any process which expects to have a small number of threads, the corn scheduler provides a performance benefit. This is an example of the sort of benefit that is gained by allowing application-specific scheduling, rather than a system-wide, generic, and therefore complex, scheduler.

Effects of Sharing

The purpose of this experiment was to quantify the effect of sharing scheduling code, by placing it in the shared library, against each process having a private copy of the code of its intraprocess scheduler.

Three identical processes are given a small amount of guaranteed time in the system, and the others are not; this ensures that they will be run in succession at the beginning of each jubilee. Each process contains a thread which notes the elapsed time from the start of the jubilee at which it is woken, does a small amount of housekeeping, and then sleeps until the next jubilee. This is repeated times. The experiment is run with each process having its intraprocess scheduler in the static library, and compared with the same experiment where the scheduler is in the shared library. The distributions of elapsed time from the jubilee may be seen in the figure below.

It can be seen that the difference for the first process is marginal, a few µs at best. However for successive processes the difference is quite substantial. This is due to the increased locality of reference outweighing the cost of having the code in the shared library. Note that systems which do not use a single virtual address space may be unable to make use of this benefit, especially if the hardware provides a virtually addressed cache.

For each pair the spread of the distributions is similar, indicating that the offset is solely due to caching effects. For the same experiment with caching disabled on the processor, the numbers for the three processes are deterministically , and µs respectively, the location of the scheduling code being immaterial.

Clearly the exact benefit of sharing is dependent on the typical runtime of processes within the system, and particularly their own cache behaviour.

Although the experimental system does not perform memory protection, the inclusion of MMU domain switching code does not alter the effect shown here. The addition of protection code adds the cost of changing the MMU domain status, and in the worst case a TLB flush, but does not affect the cache behaviour.

Figure: Comparison of static and shared schedulers (number of occurrences against thread wake-up time in microseconds since the start of the Jubilee).

Fairisle Throughput

In this experiment the aim is to compare the throughput performance of the Fawn system against the Wanda system when performing cell forwarding duties as a port in a Fairisle switch.

The Fairisle device driver is written as a normal process in the system which happens to have access to the Fairisle hardware (see the earlier description). It performs an RPC to the Basic Resource Allocator to initialise a hardware interrupt event for the Fairisle device hardware interrupts.

The scheduler chosen for this device driver is a coroutine scheduler in which the threads yield at optimum points. This permits various locks to be held for long periods, reducing the number of times they are obtained and released, while the threads are known to execute only non-blocking operations. This reduces the concurrency overhead.

To make comparison with the Wanda performance as realistic as possible, the device driver is as close as possible to the Wanda form whose performance is reported in Blackd; in particular the cell scheduling policy is the same (the FIFO queue for forwarded cells has priority over locally generated cells, which in turn have priority over cells to be freed), with the following differences:

- The special purpose bank of FIQ registers available to the low-level Wanda interrupt code is not available to this user process.

- The code is restricted to using the C calling convention.

- When a cell arrives, a queue pointer is read from a table indexed by the incoming VCI. This may indicate that this cell is for local consumption (this includes cells with unknown VCIs, and free cells); in this case an upcall takes place via the VCI-specific closure using the C calling convention (a sketch of this dispatch follows the list). On Wanda such cells are put on a queue indistinguishable from any other and are processed later at a lower interrupt priority.

- The free cell buffer pool is managed using locks and condition variables. In Wanda, concurrency is controlled by changing the dynamic interrupt priority level.
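A sketch of the dispatch described in the third point (the table size, structure layout and names are assumptions) is:

    #define NUM_VCIS 1024                 /* assumed table size      */

    struct cell;                          /* received cell           */
    struct cell_queue;                    /* forwarding queue        */

    /* A VCI-specific closure: an upcall function plus its state.   */
    struct vci_closure {
        void (*fn)(void *state, struct cell *c);  /* C convention    */
        void  *state;
    };

    struct vci_entry {
        struct cell_queue  *queue;        /* NULL => local handling  */
        struct vci_closure  local;        /* upcall for local cells  */
    };

    static struct vci_entry vci_table[NUM_VCIS];

    extern void enqueue(struct cell_queue *q, struct cell *c);

    /* Dispatch one arriving cell: forwarded cells are queued; local
       ones (including unknown VCIs and free cells) are upcalled at
       once, rather than queued for later as on Wanda.              */
    static void cell_arrived(unsigned vci, struct cell *c)
    {
        struct vci_entry *e = &vci_table[vci];
        if (e->queue != (struct cell_queue *)0)
            enqueue(e->queue, c);
        else
            e->local.fn(e->local.state, c);
    }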

Throughput performance is measured in user payload bits only: two TAXI command symbols and the cell bytes give bytes of user payload, a fixed scaling factor, so for an Mbit/sec line this corresponds to Mbit/sec.

The amount of CPU time per jubilee for this process is configured in the image build file. This was varied over a large number of values for the contracted time value, and the process was always allocated zero additional (i.e. non-guaranteed) time. During the time that the Fairisle device driver process is not running, the queues in the Xilinx buffer memory build up; during the time that the process is running it must catch up.

There are two hard limits on the maximum traffic that can be processed by the device driver. These are shown in the figure below, where x is the bandwidth of the arriving stream and y is the amount of CPU not dedicated to the Fairisle device driver. First, the Xilinx cell buffer memory must not overflow during the period that the process is descheduled. In this environment there is a slightly tighter bound than the general one given earlier, because the device driver is always run as the first process of any jubilee: the limit is that the total arrivals over the descheduled period must fit within the cell buffers.
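The two limits can be sketched symbolically as follows (the symbols are assumptions: T for the jubilee length, B for the capacity of the cell buffers expressed in payload bits, R for the usable line rate, and y taken as a fraction; the second limit is the one discussed after the figure):

    \[ x \cdot y \cdot T \;\le\; B \]
    \[ x \;\le\; (1 - y) \cdot R \]

During the fraction y of each jubilee for which the driver is descheduled, the arriving data must fit in the buffers; and while the driver runs it cannot process cells faster than the line rate.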

Figure: CPU usage against Fairisle throughput (percent CPU unused against bandwidth in Mbit/sec), for the experimental system and Wanda, with the two limits marked.

Second, since both the output fifo on the switch and the transmit decoupling queue on the Xilinx chip are very small, the software is essentially limited to a maximum of the outgoing line rate over the duration that it is running. In other words, cells cannot be processed faster than the line rate when the device driver has the CPU, so the overall throughput is directly limited by the percentage of the CPU times the line rate.

Because Wanda always gives highest priority to the interrupts from the Fairisle hardware, it spends a great deal of time in interrupt dispatching code. This leads to the entire CPU being dedicated to the task of cell forwarding for throughput values as low as Mbit/sec, although it is able to sustain the maximum line rate. Fawn is massively superior: although the optimum line is not quite reached anywhere, the CPU required for a particular throughput is much lower, allowing the system to be configured to leave a reasonable amount of CPU to other processes (such as routeing updates) which would be starved in the Wanda system.

In addition to the tradeoff between CPU allocation and raw throughput performance, the jitter experienced by the traffic will be directly affected by the duration of the periods when the Fairisle process is descheduled. For this reason a service system may wish to allocate more CPU to the Fairisle process than is strictly necessary for the expected throughput alone. This is a policy decision which is likely to vary depending on the other tasks required of the node and its position in the network. The choice of the length of the jubilee is also directly relevant; this cost was discussed earlier.

Fairisle Host Interface Performance

The experiments described in this section test both the effectiveness of the Rbuf mechanism and also the appropriateness of constructing the Fairisle device driver as a process, from the point of view of data receiving or transmitting. As described earlier, the theoretical transfer rate is limited by the bus bandwidth rather than the line rate, so the best comparison is with the performance achieved for the identical MSSAR/FDL (Blackb, Blacka) protocol on Wanda using identical hardware.

Two of the simple test programs on Wanda are a data sink and a data source. The former receives IOBufs from a socket and frees them as fast as possible without referencing the data; the latter allocates an IOBuf of the default maximum size of bytes and sends it without initialising the data. Obviously on such tests the interaction with the cache is important. For the receiver there is no per-byte cache penalty, since the Arm cache only allocates lines on a read miss and the data in the IOBuf is never read. For the transmitter, since Wanda uses an MRU policy for allocating IOBufs, the same IOBuf will be used over and over again, so the data is likely to be in the cache (note that the buffer size is small compared to the cache, and that Wanda does not clear buffers on allocation). During these experiments there was only one process on the Wanda machine.

For the same tests on Fawn, the Fairisle device driver process, which is where all the work is done, was configured with a guaranteed time of ticks per jubilee; the application was configured with ticks. Apart from ticks for the console driver, no other activity of the system was given guaranteed time.

The test programs were configured to use a size identical to the Wanda one. Further, the same Rbuf was used for every packet, to rule out any caching differences from the Wanda code. The Rbuf control areas used had a size of bytes, which represents a maximum pipeline of packets, enough to last from one jubilee to the next.

Transmit

Using the Rbuf mechanism and the configuration described above, transmission is possible at Mbit/sec. This compares favourably with Wanda, which can achieve Mbit/sec.

Receive

The Rbuf receiver can sink data in excess of Mbit/sec. Again this is considerably higher than the value attained by the Wanda system, which can sink data at approximately Mbit/sec.

Of particular interest is the comparison between the Rbuf and Wanda code when receive overload occurs. On Wanda the system continues to take interrupts from the Fairisle hardware; all IOBufs are filled up with data queued for the overloaded connection; no other network activity can take place; and the machine runs no processes, either other ones or the sink. Eventually the Fairisle cell buffer overflows: a fatal system error.

In the Rbuf scheme, if the receiving process is configured to block occasionally for a period of a few Jubilees, such that the pipeline goes dry, the system remains stable, with the cells being efficiently discarded in the device driver without being copied to main memory and with minimum penalty to the system.

The Fawn/Rbuf combination is obviously better both in performance and in stability under load, both important considerations for a multimedia system.

Summary

This chapter has presented the experimental evaluation of the design proposed in this dissertation. This was considered in three main sections.

First, the cost of using a microkernel with all operations, including device drivers, scheduled as normal processes was considered. It was shown that the interrupt latency was comparable with a conventional system in most cases, and bounded above in the worst case by the configurable length of a Jubilee. The cost of each Jubilee was assessed and found to be mostly attributable to the effect of the fine grain sharing of the CPU on the cache. Null RPC time was also measured and, although this experimental platform does not include virtual address protection, was encouraging.

Second, the effects of using events for intraprocess scheduling were investigated. For local thread context switches the performance was found to be much superior to that of Wanda. Additionally, it was shown that extra benefits were available due to the potential for application-specific schedulers.

Third, the performance of the Fairisle device driver, both in the cell forwarding case and for received and transmitted data, was measured. For cell forwarding, the explicit scheduling of the device driver by the system was shown to reduce the CPU requirements. For local IO the device driver was shown to outperform Wanda and, distinctively, to remain stable in the presence of overload.

Chapter

Further Work

The Fawn system described in this dissertation is of necessity only a prototype. The measurements made in the previous chapter suggest that this is a powerful way to structure such an operating system. As further work there are many issues involved in turning the prototype into a more robust and interesting system. There are also a number of further research topics which have been raised as a result of this work.

Operating System Development

A large amount of the design presented has already been adopted by the programming team working on the Operating System work package of the Pegasus project (Eversa). It is hoped that within that context the following can be resolved.

Event value overflow

The event system as described relies on monotonically increasing values in order to operate. On a 64-bit architecture such as the Alpha this can be easily implemented directly in a single machine word. For 32-bit architectures such as the Arm used for the practical work, the possibility of overflow in a long-lived system is non-trivial. This can be solved by the usual mechanisms, e.g. that used for transport sequence number rollover. (Values of events are always used in pairs, so they can be compared by looking at the sign following a subtraction; this gives one bit less than a full word of significance.)
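A minimal sketch of such a wrap-safe comparison (the names are assumptions) is:

    typedef unsigned int eventval;        /* 32-bit on the Arm       */

    /* Wrap-safe comparison via the sign of a subtraction: valid
       provided the two values are within half the number space of
       each other, i.e. one bit less than the word size.            */
    static int event_reached(eventval current, eventval target)
    {
        return (int)(current - target) >= 0;
    }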

Virtual Address System

The system as presented does not provide any form of virtual address management or protection. Instead, the requirements and expectations of such a system have been taken into consideration where appropriate. In addition to the features described earlier, the observations above on cache conflicts should be taken into consideration.

Resource Recovery and Higher Level Issues

This experimental implementation lacks some of the support for recovering resources allocated to processes. Since this system does not include a file system or shell, there is no dynamic demand for this activity. The system does provide for full recovery of resources which are allocated while establishing a communication channel (either Rbuf or IPC) which subsequently fails. The way to present recovery of resources to the application is strongly dependent on higher level models of longevity and renegotiation procedures, which are beyond the scope of this work.

An earlier section calls for a higher level process to administer the share of the CPU over a longer term interval. This process, known as the QoS manager, is a topic for further investigation; this system deliberately excludes a prototype for such a process, since a static system is easier to analyse.

One system for renegotiation is to include deliberate inflation in the value of CPU resources (Larmouth). In this system the allocations of processes are regularly downgraded by small amounts. Processes would need to renegotiate regularly, based on observed loss of resource, to maintain their allocation. This method has a tendency to defragment the spare cycles in the system, returning them to the allocating authority.

Heuristic Event Hints

There are various possible heuristics for interprocess scheduling within a particular level which could be investigated. One example would be to give the CPU to the process which was most recently the recipient of an event; another would be the process which has the largest number of outstanding event notifications. Evaluation of the potential benefits and costs of such heuristics is futile unless carried out in the context of a much more general implementation, and so has been left for further study.

Protocol Support

The experimental evaluation of the Rbuf architecture presented earlier was limited to considering the very basic operations. A fuller implementation, including support for many heterogeneous protocols and hardware interfaces, is bound to include much scope for researching exact system structure for optimal performance and jitter.

Exponential jubilees

When a much more complete system is available for experimentation, it is possible that one of the observations may be that non-guaranteed processes could benefit by being scheduled at a coarser granularity. If much of the CPU resource is committed at the guaranteed scheduling level, then to share the remaining CPU fairly amongst processes whose allocations are solely at lower levels, the QoS manager may have to allocate very small slices. For some processes, such as compilers, the fine grain sharing may provide no benefit and merely consume resource due to context switching.

One possible alteration which could be researched is the use of exponential jubilees for lower scheduling levels. In such a scheme the reinitialisation of scheduling parameters for levels below the highest (guaranteed) level would occur at exactly half of the frequency of the immediately higher level. This would reduce the costs of fair sharing amongst non-guaranteed processes without destroying the fundamental principles of the jubilee concept, such as the deterministic fairness over a bounded period and the ease of transfer of resources between processes. Clearly such a subtle alteration could only be evaluated in the presence of genuine real-life loading.

Specialist Platforms

Apart from porting to a more general workstation with frame buffer and disk systems, there is much to be investigated with respect to the operation of this system on shared and non-shared memory multiprocessor hardware.

Shared memory multiprocessors

One of the key design aims of using events for communication between processes was that they indicated explicitly when synchronisation was being performed, and that on a multiprocessor this could be used for causing synchronisation at the memory system level. One of the areas that will need investigation on such a platform is the virtual processor interface between the kernel and the intraprocess scheduler. Negotiation of the maximum amount of true parallelism within a process is also a matter for further research.

Non-shared memory multiprocessors

Two particular instances of such a machine are the Fairisle Switch, considered as a whole, and the Desk Area Network. One of the particularly interesting possibilities is when a single process is running on an individual node, for example the X server on the frame buffer node. The similarity between the virtual processor interface described earlier and the operation of a physical processor suggests that it may be possible to operate such a node without any kernel whatsoever. Clearly the intraprocess scheduler would contain some specialist code; however, the feasibility of such an implementation is much higher than for systems such as Wanda or Unix. In those systems integrating applications into the kernel has been rarely attempted, and when performed requires substantial changes to the application code and/or adaptation layers (such as MSRPC in Wanda and NFS in Unix).

Microthreaded processors

The Anaconda processor (Moore) is a microthreaded dataflow/controlflow hybrid. It uses interleaved execution of blocks of code, known as microthreads, from different threads. A separate unit, known as a matching store, performs executability analysis based on availability of previous results. Multiple loads and stores may be posted from a microthread, and the next microthread of the same thread will be scheduled when these have completed.

In this system the fundamental unit of concurrency control is a spinlock comprised of two microthreads. The first issues a load from the lock and a store of the busy indication to the lock. The second checks the value of the load that took place in the previous microthread, and determines if it was successful in gaining access to the spinlock, looping back if it was not.

In such a system, atomic update of an event count can also be performed in two microthreads, by protecting each event with a spinlock. The first microthread loads the value of the event count as well as initiating the two transfers of the spinlock. If the second determines that the lock was acquired, then the new value can be written to the event count and the lock can be freed; otherwise a branch back must occur, in the same manner as for a spinlock.
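Rendered as conventional C rather than microthreads (a sketch of the equivalent logic; the assumed atomic_exchange stands in for the paired load and store of the first microthread):

    typedef unsigned int eventval;

    struct event {
        volatile int      lock;          /* 0 = free, 1 = busy       */
        volatile eventval count;
    };

    extern int atomic_exchange(volatile int *p, int v);  /* assumed  */

    /* Advance an event count under its spinlock: the exchange
       mirrors the first microthread (load old value, store busy);
       the test and loop mirror the second.                         */
    static void event_advance(struct event *e, eventval inc)
    {
        while (atomic_exchange(&e->lock, 1) != 0)
            ;                            /* branch back: lock busy   */
        e->count += inc;
        e->lock = 0;                     /* free the lock            */
    }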

Whilst this architecture provides many esoteric methods of communicating, once events have been implemented, code which uses generic concurrency control primitives may be ported easily, building on the mechanisms discussed earlier.

Interprocess Communication

The stub compiler used in the prototype generates poor quality stubs. This includes redundant exception handling code, failure to optimise repeated tests, and separate identity-function marshalling for fields of fixed length structures. The interaction between the clients and servers and the stubs has also been shown to be quite inefficient. One of the reasons for this is the interface definition language used: although it is very strictly typed (sometimes irritatingly so), it provides little expression for the semantics of operations, i.e. what a particular operation actually does with the arguments and results, leading to large descriptions in comments. This is of course an issue for research in its own right. It is also very difficult to use it to interface with other interface definition languages or RPC schemes. A particular problem with MIDDL which was experienced during the implementation of Fawn was the fact that MIDDL types are all designed for use in an external context; every type is specified in a particular number of bits. As a result, use of types whose size is dependent on the natural sizes of various constructs on a particular platform can be difficult. One possible alternative is a generic stub compiler such as USC (OMalley), which can interconvert easily in a heterogeneous environment.

Desk Area Network

Within the Desk Area Network operating system project, one of the concerns has been to consider operating system scheduling of the ATM interconnect resources. One possible way of achieving this would be to run Fawn on all the nodes, where the fabric clock was used to generate jubilees and control process scheduling. The kernel scheduling algorithm could possibly be altered to run processes guaranteed time at particular well-defined offsets from the jubilee, and possibly include heuristics about when to run the DAN device driver code.

A further consideration is related to the use of the stream cache described earlier, where a special area of the virtual address range is used for direct access to a multimedia stream which is routed to the processor cache rather than to main memory. This concept is taken further and generalised in Pratt's cache system (Pratta), where cache allocation may be process specific. In such an environment the Rbuf data area may in fact represent an area of cache rather than an area of memory. The Rbuf control areas could be used to indicate when a process is finished with a particular part of the stream in the cache, where to put newly arriving material within the cache memory, and also to indicate in a homogeneous way the arrival (deduced from control information) of particular network packets within the cache for processing. This is a particularly interesting further research topic.

Other Observations

Whilst not related to the subject under consideration, a number of language issues have been observed during the experimental phase of this work, which are recorded here for the benefit of future researchers.

As well as the problem noted earlier, where availability of garbage collection could greatly increase the performance of many IPC calls, some surprising insights were gained from examination of code generated by the C compiler (gcc).

Code generation was close to optimal for this simple RISC processor, i.e. it was difficult in general for a human to find optimisations which could be made to the resulting code, except in one respect. The exception was pointer aliasing, where the semantics of C mean that the compiler cannot tell when an update at the target of one pointer variable may alter the value stored at another. In some pointer-intense data structures, the additional knowledge that the human has about which pointers can possibly be aliases for others could make a surprising difference. As an example of this, the kernel level scheduler was considered in detail, and instruction costs were calculated which determine that knowledge of pointer aliasing could reduce processor time appreciably.
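A small, hypothetical illustration of the problem (not code from the scheduler):

    struct node { struct node *next; int key; };

    /* Without aliasing knowledge the compiler must assume that the
       store through 'a' may modify '*b', and so must reload b->key
       afterwards; a human who knows a and b never alias could keep
       b->key in a register across the store.                       */
    static int demo(struct node *a, struct node *b)
    {
        int k = b->key;
        a->key = 0;        /* might alias *b, as far as C knows     */
        return k + b->key; /* forces a reload of b->key             */
    }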

Together these indicate that the semantics of the C language may now be the principal barrier to better code, rather than compiler technology. It is interesting to note that it is these same semantics which make formal verification of C so difficult.

The use of closures to pass necessary state into procedures entails some cost over a static linking method, due to the extra arguments, but it was observed that in many cases this cost is compensated for by lower cost access to that state. This is because on RISC processors it is more efficient to generate a memory access to an address which is a small delta from an address in a register than to access a full address which is bound into the code at link time, which necessitates multiple instructions to build up, as it is too large for a single immediate value. Of course, for stateless functions, static linking, even when it requires stubs, is still superior.

Summary

This chapter has identified some further potential work in this field. Some of the issues identified are to be addressed within the Pegasus project. Of particular interest is examining this system for use as the operating system of the Desk Area Network.




Chapter

Conclusion

This dissertation has considered various problems associated with the scheduling and network IO organisation found in conventional operating systems, with regard to effective support for multimedia applications which require Quality of Service within a general purpose operating system. A solution for these problems has been proposed in a microkernel structure; the proposed design is known as Fawn.

The background to this work was described first. Asynchronous Transfer Mode technology was discussed, in particular the Fairisle switch, with which the author is particularly acquainted. The particular features of ATM switching which are relevant to this work are its high bandwidth, which is necessary for multimedia data; the provision of fine grain multiplexing, to reduce jitter even in the presence of bursty sources; and that multiplexing is performed at the lowest possible level. Previous operating system and scheduling research was presented, showing the increasing adoption of split-level scheduling and support for Quality of Service. Multimedia systems were typically used in a distributed environment and were considered to be soft real-time. Particular attention was drawn to the Nemo system, which was the inspiration for much of the work presented in this dissertation.

The dissertation then considered the use of priority for scheduling. By considering many typical examples from the literature and the author's experience, it was shown that the use of priority was entirely inappropriate for a multimedia system. In particular, the Rate Monotonic and Earliest Deadline First algorithms were shown to have high costs, to have delusive assumptions for a general purpose system, and to be inappropriate for a distributed system.

The Fawn design presents a system where processes are guaranteed a share of the CPU over a system wide period called a Jubilee. Access to the CPU at a finer granularity is probabilistic rather than certain. This is exactly the sort of guarantee required by soft real-time applications. This jubilee mechanism makes scheduling decisions extremely simple and facilitates contracts between clients and servers.

The virtual processor interface of the Fawn system was described, concentrating on those areas where it differs from or extends the Nemo system. A mechanism for interprocess communication based on event channels was devised. It was shown to be also highly suitable for the virtualisation of device interrupts and the delivery of information about system time to the application, an important concern of any multimedia application.

Intraprocess scheduling was then considered. For homogeneity with the interprocess event mechanism, event counts and sequences were chosen as the intraprocess synchronisation primitives. Implementation of common synchronisation schemes using these primitives was shown to be simple and efficient.

A later chapter described some mechanisms used for same machine RPC in other systems. The migrating model was considered inappropriate because it led to problems in accounting for CPU, poor cache locality, and an unacceptable interface for device drivers or other processes which require non-blocking communication primitives. The mechanisms used for same-machine RPC in Fawn, based on the switching model, were discussed; these comprise pairwise shared memory and the use of event channels for synchronisation. The interface definition language used was adopted from the Pegasus project.

That chapter also considered the naming and binding schemes used in Fawn; additionally, shared libraries, exceptions, bootstrapping and other general infrastructure issues were discussed.

Bulk stream IO was then examined. Three previous schemes were considered in detail, and the Rbufs scheme was proposed for use in Fawn. The principal feature is the separation of the three issues of data transmission, structure aggregation control, and memory allocation. This separation allows for great flexibility in the management of communications within the machine. Rbufs are designed to minimise copying, in order to support high bandwidth applications, and also the explicit scheduling of network device drivers.

Various situations were considered in the context of typical workstation hardware. Rbufs were shown to be highly effective, particularly in the support of application data units and multicast, both areas where previous work has been deficient.

The evaluation described a platform on which an instance of the Fawn design has been constructed, the Fairisle port controller, and various features of the design were evaluated experimentally. Interrupt latency to the device driver thread was shown to be approximately the same as would be expected for the entry of the device driver handler on a conventional Unix system; in the worst case it was bounded by the length of the jubilee.

The costs of adopting the jubilee mechanism for CPU allocation were measured and shown to be small, and primarily due to the caching effects of the fine grain sharing that it provides. The performance of same machine RPC was also measured.

Three different implementations of the intraprocess scheduler were then considered. It was shown that applications could increase their performance by choosing a scheduler which met their particular requirements. One of the benefits of a single address space system, the sharing of code, was also considered; it was shown that placing the intraprocess scheduler within the shared library reduced scheduling latency.

The evaluation then moved on to consider the impact of the Fawn system on IO. The performance of the system when used as a cell forwarding engine in a Fairisle switch was measured. The Fawn system was shown to require much less CPU to forward at almost any rate than the dedicated Wanda device driver, due to explicit network scheduling: the abolition of the unnecessary interrupt dispatching costs found in other systems. The suitability of Rbufs was considered by measuring the IO throughput of the system in comparison with the same protocols in the Wanda microkernel, and shown to sustain higher rates. In particular, the Rbuf/Fawn combination was observed to be stable under periods of overload.

Finally, the previous chapter considered many of the future areas of research in which this work may prove beneficial, and other issues which have come to light as a result. At least one measure of the success of this work is the extent to which it has been adopted by the Pegasus project: all of the scheduling work has been adopted except the use of jubilees, and the IPC transport mechanisms and the Rbuf design in its entirety have already been included in the current design for the Nemesis system.

The pivotal features of the Fawn design are that the processing of device interrupts is performed by user-space processes which are scheduled by the system like any other; that events are used for both inter- and intra-process synchronisation; and Rbufs, an especially developed high performance IO buffer management system.

It is the thesis of this dissertation that the fundamental job of an operating system kernel is to implement fine grain sharing of the CPU between processes, and hence synchronisation between those processes. System performance, both with respect to throughput and soft real-time dependability, has been shown to be enhanced. This is due to the empowering of processes to perform task specific optimisations.

Bibliography

Anderson  T.E. Anderson, B.N. Bershad, E.D. Lazowska and H.M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. Technical Report, Department of Computer Science, University of Washington, April; revised October.

ANSA  Architecture Projects Management Ltd. ANSA Reference Manual. March.

ARM  Advanced RISC Machines. ARM Macrocell Datasheet. November.

ARM  Advanced RISC Machines. ARM Datasheet.

Barham  P. Barham, M. Hayter, D. McAuley and I. Pratt. Devices on the Desk Area Network. IEEE Journal on Selected Areas in Communication, January. To appear in the special issue on ATM LANs.

Bershad  B.N. Bershad, T.E. Anderson, E.D. Lazowska and H.M. Levy. Lightweight Remote Procedure Call. Technical Report, Department of Computer Science, University of Washington, April.

Bershad  B.N. Bershad, T.E. Anderson, E.D. Lazowska and H.M. Levy. User-Level Interprocess Communication for Shared Memory Multiprocessors. ACM Transactions on Computer Systems, May.

Birrell  A.D. Birrell and J.V. Guttag. Synchronization Primitives for a Multiprocessor: A Formal Specification. Technical Report, Digital Equipment Corporation Systems Research Center.

Birrell  A. Birrell. Taming the Windows NT Thread Primitives. A presentation before the Systems Research Group of the University of Cambridge Computer Laboratory, October.

Birrell  A. Birrell, G. Nelson, S. Owicki and E. Wobber. Network Objects. Technical Report, Digital Equipment Corporation Systems Research Center, February.

Blacka  R. Black. FDL Cell Formats and Meta-Signalling. In ATM Document Collection (The Blue Book). University of Cambridge Computer Laboratory, March.

Blackb  R. Black. Segmentation and Reassembly. In ATM Document Collection (The Blue Book). University of Cambridge Computer Laboratory, March.

Blackc  R. Black and S. Crosby. Experience and Results from the Implementation of an ATM Socket Family. In USENIX Winter Conference, January.

Blackd  R.J. Black, I.M. Leslie and D.R. McAuley. Experiences of Building an ATM Switch for the Local Area. In Computer Communication Review. ACM SIGCOMM, September.

Bosch  P. Bosch. A Cache Odyssey. Master's thesis, University of Twente, Faculty of Computer Science, June. Also available as a Pegasus report.

Brackmo  L. Brackmo, S. O'Malley and L. Peterson. TCP Vegas: New Techniques for Congestion Detection and Avoidance. In Computer Communication Review. ACM SIGCOMM, September.

Burrows  M. Burrows. Efficient Data Sharing. Technical Report, University of Cambridge Computer Laboratory, December. PhD Dissertation.

Burrows  M. Burrows. The software of the OTTO ATM network interface. Personal communication, October.

Cardelli  L. Cardelli, J. Donahue, L. Glassman, M. Jordan, B. Kalsow and G. Nelson. Modula-3 Report (revised). Technical Report, Digital Equipment Corporation Systems Research Center, November.

Carter  J. Carter and W. Zwaenepoel. Optimistic Bulk Data Transfer Protocols. Sigmetrics and Performance, May.

Chen  J. Chen. Memory Behaviour of an X Window System. In USENIX Winter Conference, January.

Coulson  G. Coulson, G. Blair, P. Robin and D. Shepherd. Extending the Chorus Microkernel to Support Continuous Media Applications. In Proceedings of the International Workshop on Network and Operating Systems Support for Digital Audio and Video, November.

Crosby  S. Crosby, R. Hayton and T. Roscoe. MSRPC User Manual. In ATM Document Collection (The Blue Book). University of Cambridge Computer Laboratory, March.

DEC  Digital Equipment Corporation, Workstation Systems Engineering. MAXine System Module Functional Specification. February.

DEC  Digital Equipment Corporation, Workstation Systems Engineering. Flamingo Macrocoders Manual. November.

DEC  Digital Equipment Corporation, TURBOchannel Industry Group. TURBOchannel Specifications.

Dixon  M.J. Dixon. System Support for Multi-Service Traffic. Technical Report, University of Cambridge Computer Laboratory, September. PhD Dissertation.

Druschel  P. Druschel and L. Peterson. Fbufs: A High-Bandwidth Cross-Domain Transfer Facility. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, December.

Druschel  P. Druschel, L. Peterson and B. Davie. Experiences with a High-Speed Network Adaptor: A Software Perspective. In Computer Communication Review. ACM SIGCOMM, September.

Edwards  A. Edwards, G. Watson, J. Lumley, D. Banks, C. Calamvokis and C. Dalton. User-space Protocols Deliver High Performance to Applications on a Low-cost Gb/s LAN. In Computer Communication Review. ACM SIGCOMM, September.

Eversa  D. Evers. Nemesis Structure and Interfaces. Pegasus Project Internal Memorandum, August.

Eversb  D.M. Evers. Distributed Computing with Objects. Technical Report, University of Cambridge Computer Laboratory, March. PhD Dissertation.

Fairbairns  R. Fairbairns. Experience of implementing POSIX threads on Wanda. Personal communication, March.

Ford  B. Ford and J. Lepreau. Evolving Mach 3.0 to a Migrating Thread Model. In USENIX Winter Conference, January.

Forum  The ATM Forum. ATM User-Network Interface Specification. Prentice Hall.

Fraser  A. Fraser, C. Kalmanek, A. Kaplan, E. Marshall and R. Restrick. Xunet: A Nationwide Testbed in High-Speed Networking. In IEEE INFOCOM, May.

Fraser  S. Fraser. Early Experiments with Asynchronous Time Division Networks. IEEE Network, January.

Greaves  D. Greaves. The Olivetti Research Yes V Option Module. In ATM Document Collection (The Blue Book). University of Cambridge Computer Laboratory, March.

Hamilton  G. Hamilton and P. Kougiouris. The Spring Nucleus: A Microkernel for Objects. Technical Report SMLI TR, Sun Microsystems Laboratories, April. Also in USENIX.

Hayter  M. Hayter and D. McAuley. The Desk Area Network. ACM Operating Systems Review, May. Also available as a University of Cambridge Computer Laboratory Technical Report.

Hayter  M. Hayter. A Workstation Architecture to Support Multimedia. Technical Report, University of Cambridge Computer Laboratory, September. PhD Dissertation.

Haytera  M. Hayter and R. Black. Fairisle Port Controller Design and Ideas. In ATM Document Collection (The Blue Book). University of Cambridge Computer Laboratory, March.

Hayterb  M. Hayter and R. Black. FPC Xilinx Design and Notes. In ATM Document Collection (The Blue Book). University of Cambridge Computer Laboratory, March.

Hopper  A. Hopper. Local Area Computer Communication Networks. Technical Report, University of Cambridge Computer Laboratory. PhD Dissertation.

Hopper  A. Hopper. Pandora: An Experimental System for Multimedia Applications. ACM Operating Systems Review, April.

Hutchinson  N. Hutchinson and L. Peterson. The x-Kernel: An Architecture for Implementing Network Protocols. IEEE Transactions on Software Engineering, January.

Hyden  E. Hyden. Operating System Support for Quality of Service. Technical Report, University of Cambridge Computer Laboratory, February. PhD Dissertation.

ISO  International Organisation for Standardization. Programming Languages: C. Draft International Standard ISO/IEC DIS.

Jacobson  V. Jacobson. The Synchronisation of Periodic Routing Messages. In Computer Communication Review. ACM SIGCOMM, September.

Jardetzky  P.W. Jardetzky. Network File Server Design for Continuous Media. Technical Report, University of Cambridge Computer Laboratory, October. PhD Dissertation.

Johnson  M. Johnson. Exception Handling in Domain Based Systems. Technical Report, University of Cambridge Computer Laboratory, September. PhD Dissertation.

Jones  A. Jones and A. Hopper. Handling Audio and Video Streams in a Distributed Environment. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, December. Also available as an Olivetti Research Ltd Technical Report.

Larmouth  J. Larmouth. Scheduling for a Share of the Machine. Software Practice and Experience, January. Also available as a University of Cambridge Computer Laboratory Technical Report, October.

Leffler  S. Leffler and M. Karels. Trailer Encapsulations. Internet Request for Comment, April.

Leffler  S. Leffler, M. McKusick, M. Karels and J. Quarterman. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley.

Leslie  I. Leslie. Extending the Local Area Network. Technical Report, University of Cambridge Computer Laboratory. PhD Dissertation.

Leslie  I. Leslie and D. McAuley. Fairisle: An ATM Network for the Local Area. In Computer Communication Review. ACM SIGCOMM, September.

Liu  C.L. Liu and J. Layland. Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. Journal of the Association for Computing Machinery, January.

McAuley  D.R. McAuley. Protocol Design for High Speed Networks. Technical Report, University of Cambridge Computer Laboratory, January. PhD Dissertation.

McAuley  D. McAuley. The design of the ARX operating system. Personal communication, September.

McCanne  S. McCanne and V. Jacobson. The BSD Packet Filter: A New Architecture for User-level Packet Capture. In USENIX Winter Conference, January.

Mills  D. Mills. Internet Time Synchronisation: The Network Time Protocol. Internet Request for Comment, October.

Mogul  J. Mogul. Livelock in Ultrix. Internal Technical Report, Digital Equipment Corporation Western Research Laboratory.

Mogul  J. Mogul. Efficient Use of Workstations for Passive Monitoring of Local Area Networks. In Computer Communication Review. ACM SIGCOMM, September.

Moore  S. Moore. Multithreaded Processor Design. PhD thesis, University of Cambridge Computer Laboratory, October.

Mullender S Mullender I Leslie and D McAuley Pegasus Project

Description Technical Rep ort Pegasus Esprit Pro ject

September Also available as University of Cambridge

Computer Lab oratory Technical Rep ort number pp

Mullender S Mullender I Leslie and D McAuley OperatingSystem

Support for Distributed Multimedia In USENIX Summer

Conference July p

Nakamura A Nakamura An investigation of realtime synchronisation

PhD thesis University of Cambridge Computer Lab oratory

p

Needham R Needham and A Herb ert The Cambridge distributed com

puting system International computer science series Addison

Wesley London p

Newman P Newman Fast Packet Switching for Integrated Services

Technical Rep ort University of Cambridge Computer Lab

oratory PhD Dissertation p

[Nicolaou] C. Nicolaou. A Distributed Architecture for Multimedia Communication Systems. Technical Report, University of Cambridge Computer Laboratory, December. PhD Dissertation.

[Nieh] J. Nieh, J. Hanko, J. Northcutt and G. Wall. SVR4 UNIX Scheduler Unacceptable for Multimedia Applications. In Proceedings of the 4th International Workshop on Network and Operating Systems Support for Digital Audio and Video, November 1993.

[Oikawa] S. Oikawa and H. Tokuda. User-Level Real-Time Threads: An Approach Towards High Performance Multimedia Threads. In Proceedings of the 4th International Workshop on Network and Operating Systems Support for Digital Audio and Video, November 1993.

[O'Malley] S. O'Malley, M. Abbott, N. Hutchinson and L. Peterson. A Transparent Blast Facility. Internetworking: Research and Experience, December.

[O'Malley] S. O'Malley, T. Proebsting and A. Montz. USC: A Universal Stub Compiler. In Computer Communication Review, ACM SIGCOMM, September.

[Owicki] S. Owicki. Experience with the Firefly Multiprocessor Workstation. Technical Report, Digital Equipment Corporation Systems Research Center, September.

[Pike] R. Pike, D. Presotto, K. Thompson, H. Trickey, T. Duff and G. Holzmann. Plan 9: The Early Papers. Computing Science Technical Report, AT&T Laboratories, July.

[Pratt a] I. Pratt. Hardware Support for Operating System Support for Continuous Media. University of Cambridge Computer Laboratory, PhD Research Proposal, July.

[Pratt b] I. Pratt. Internet Connectivity on the Desk Area Network. Personal communication, April.

[Pratt c] I. Pratt and P. Barham. The ATM Camera V (AVA). In ATM Document Collection (The Blue Book), University of Cambridge Computer Laboratory, March.

[Reed] D. Reed and R. Kanodia. Synchronization with eventcounts and sequencers. Technical Report, MIT Laboratory for Computer Science.

[Rodeheffer] T. Rodeheffer. The hardware of the OTTO ATM network interface. Personal communication, January.

[Roscoe a] T. Roscoe. Linkage in the Nemesis Single Address Space Operating System. ACM Operating Systems Review, October 1994.

[Roscoe b] T. Roscoe. The MIDDL Manual, 3rd edition. Available by anonymous ftp from ftp.cl.cam.ac.uk in pegasus/Middl.ps.gz, January; now superseded by the 4th edition.

[Saltzer] J. H. Saltzer. Naming and Binding of Objects, chapter 3.A, in Lecture Notes in Computer Science. Springer-Verlag.

[Schroeder] M. Schroeder and M. Burrows. Performance of Firefly RPC. Technical Report, Digital Equipment Corporation Systems Research Center, April.

[Sha] L. Sha, R. Rajkumar and J. P. Lehoczky. Priority Inheritance Protocols. Technical Report CMU-CS-87-181, Carnegie Mellon Computer Science Department, December.

[Shand] M. Shand. Measuring System Performance with Reprogrammable Hardware. Technical Report, Digital Equipment Corporation Paris Research Laboratory, August.

[Sreenan] C. J. Sreenan. Synchronisation services for digital continuous media. Technical Report, University of Cambridge Computer Laboratory, March. PhD Dissertation.

[Stroustrup] B. Stroustrup. The C++ Programming Language. Addison-Wesley, second edition, 1991.

[Sun] Re: Sun's Sparc technology business discloses next-generation processor. USENET News article in comp.arch from M. Tremblay (tremblay@…Eng.Sun.COM), Message-ID <…@engnews.Eng.Sun.COM>, September.

[Tanenbaum] A. Tanenbaum and S. Mullender. An Overview of the Amoeba Distributed Operating System. ACM Operating Systems Review, July 1981.

[Temple] S. Temple. The Design of a Ring Communication Network. Technical Report, University of Cambridge Computer Laboratory, January. PhD Dissertation.

[Thacker] C. Thacker, L. Stewart and E. Satterthwaite. Firefly: A Multiprocessor Workstation. Technical Report, Digital Equipment Corporation Systems Research Center, December.

[Trickey] H. Trickey. Internals of Plan 9 Naming. Personal communication, September.

[Yuhara] M. Yuhara, C. Maeda, B. Bershad and J. Moss. The Mach Packet Filter: Efficient Packet Demultiplexing for Multiple Endpoints and Large Messages. In USENIX Winter Conference, January 1994.