A Persistent System in Real Use - Experiences of the First 13 Years -

Jochen Liedtke

German National Research Center for Computer Science (GMD), SET, 53757 Sankt Augustin, Germany
jochen.liedtke@gmd.de

Abstract

Eumel and its advanced successor L3 are operating systems built by GMD which have been used, for 13 years and 4 years respectively, as production systems in business and education. More than 2000 Eumel systems and 500 L3 systems have been shipped since 1979 and 1988. Both systems rely heavily on the paradigm of persistence (including fault-surviving persistence). Both data and processes, in principle all objects, are persistent; files are implemented by means of persistent objects (not vice versa), etc.

In addition to the principles and mechanisms of Eumel/L3, general and specific experiences are described; these relate to the design, implementation and maintenance of the systems over the last 13 years. For general purpose timesharing systems the idea is powerful and elegant, and it can be efficiently implemented, but making a system really usable is hard work.

1 Historical background: Eumel and L3

When the first 8-bit microcomputers became available in Germany, GMD started an effort to introduce them, as small workstations, into schools and universities. Since there was no really usable operating system for these machines (only CP/M and similar control programs), GMD and the University of Bielefeld decided to develop a new system from scratch.

Design and implementation of the Eumel (pronounced 'oimel') operating system started in 1979. The initial hardware base was a computer with a Zilog Z80 processor and 64 KB memory; later, more memory and a hard disk were added. In the following years the system was ported to many different machines based on Zilog Z80 and Z8000, Motorola 68000 and Intel 8086 processors.

Unfortunately 8-bit and the early 16-bit processors did not provide hardware to support a state of the art operating system; in particular they lacked a memory management unit. We therefore designed a virtual machine with a powerful instruction set and virtual 32-bit addresses. Although the instructions and the MMU had to be implemented in software, the system's overall efficiency was high enough that commercial sites had up to 5 terminals per system.

Due to its non-standard architecture Eumel was initially a one-language system based on ELAN [Hom 79]. Later, some other compilers became available (CDL, Pascal, Basic, Dynamo) but these were not widely used.

In due course processors came up with the necessary hardware support for data security and virtual addressing; the requirements of the virtual machine could then be met directly by hardware. In 1987 we started the L3 development, principally aimed at achieving higher performance and obtaining an open system. The L3 architecture is an advancement and a generalization of the Eumel principles, but built completely from scratch. Since L3 is upwards compatible with Eumel, it inherited all of the existing tools and applications.

Both L3 and its predecessor Eumel are pure µ-kernel based systems relying heavily on the idea of persistent processes. They are strongly influenced by Multics and have similarities to later systems, such as Accent and Mach.

The first systems were used mainly for programming and text processing. In the following years commercial systems to support lawyers, and other specialised applications for small and medium sized companies, were built on top of Eumel. By the mid 80's more than 2000 systems had been installed.

Delivery of L3 began in 1989 and, to date, about 500 L3 systems have been shipped. Most of these are used for commercial applications, others as Eumel successors in schools. L3 is now in daily use in a variety of industrial and commercial contexts.

2 Related work

The Eumel/L3 model of virtual memory was strongly influenced by the Multics [Ben 72] ideas. Lorie [Lor 77] introduced shadow paging for checkpointing. A slightly more general variant of this method is used in Eumel/L3 for both copying and checkpointing.

In 1979, most related work was yet to start (or not yet widely known). Accent [Ras 81] and its successor Mach [Acc 86] used copy on write techniques. These and various other systems (e.g. Amoeba [Mul 84], Chorus [Gui 82], V [Che 84] and its predecessor Thoth [Che 79]) are based on the message passing paradigm, but not on the persistence paradigm.

The programming languages Elle [Alb 80] and PS-Algol [Atk 82] already handled persistent and transient data uniformly. To date, partial data persistence (without dealing with faults) is part of the Comandos [Cah 93] project. Data and process persistence including faults is also supported in Monads [Ros 87] and by the KeyKOS [Har 85] nano-kernel, first released in 1983.

The Eumel/L3 concepts and experiences have also influenced the BirliX [Här 92] operating system design at GMD.

3 Principles

3.1 Everything is persistent

We did not find any conceptual reason for anything not to be persistent. Obviously all things should live as long as they are needed; and equally obviously turning off the power should neither be connected with the lifetime of a file nor with the lifetime of a local variable on a program stack. There are files that are only needed for a few seconds, and programs that run for weeks.

Some data (e.g. files) have to be persistent, so we decided that all data should be persistent. The basic mechanisms used to achieve this are virtual address spaces and mapping. Benefits of this decision are that:

- Only one general mechanism has to be designed, implemented, verified and tuned for files, programs and databases.

- New persistent data types can be built efficiently using this mechanism.

- A single, general recovery mechanism can be constructed.

- Neither syntactical nor run time overhead is incurred when accessing persistent data.

Since data is persistent, the active instances operating on the data should be persistent too; so we decided that processes would also be persistent. In fact this only requires that stack pointers, program counters and other internal registers be regarded as data. Consequences of this decision are that:

- Program and system checkpoints are easy to implement.

- All running programs automatically become daemons.

- For more complex functions (e.g. protection and synchronisation) active algorithms can be used in place of conventional passive data structures.

Thus tasks - consisting of an address space, one or more threads, and a set of data objects mapped into the address space - are principally persistent.

Unfortunately not all tasks can be made persistent. When beginning the L3 design we decided that all device drivers had to be user level tasks. This idea is natural, since there is nothing exceptional about a device driver, and it proved to be flexible and powerful. But, for hardware related reasons, most device drivers must not be swapped out, or even blocked for long enough to write back their status to the persistent store. So tasks can be declared to be resident, which also means non-persistent.

3.2 Processes are first class objects

Most operating systems introduce passive data as first class objects: they give unique identifiers or even path names to them, permit low level access control etc. However, when processes are also persistent, this turns out to be upside down. An active process is more general, flexible and powerful than a passive data object (although the models are Turing-equivalent), especially when implementing objects with inherent activity: e.g. traffic lights, floppy disk drives (turning off the motor if no request for 3 seconds) or a terminal handler which cleans the screen and requests new authorization (like xautolog) when no keystroke occurs for some minutes (or the terminal camera no longer "sees" the user). A simple example for this type of object is a clock which shows the actual time on the screen and can be switched to different local times concurrently:

  PROC configurable screen clock :
    delay := 0 ;
    do
      do
        show (time + delay) ;
        timeout := 60s - time mod 60s ;
        receive (client, msg, timeout)
      until msg received od ;
      delay := msg.time shift
    od
  END PROC

This example also shows that Atkinson's original scale of persistence [Atk 83] (transient, local, own/global, ... outliving the program) must be widened by a second dimension, the lifetime of the process. Since the process itself is persistent, the variable 'delay' lives forever, even though it is local to the procedure.

Tasks maintain the data objects and exclusively control access to them, and, in consequence, there are no data objects outside a task. (If a data object is not owned by at least one task, it no longer exists.) Therefore, directories, file servers and databases are most naturally implemented in Eumel and L3 as persistent tasks containing persistent data and persistent threads.

Tasks and threads are first class objects: they

- have unique identifiers, which are even unique over time (to date, 64 bits are used per task or thread id),

- are active, autonomous and (in most cases) persistent,

- communicate via secure channels.

Thus global naming and binding problems at the µ-kernel level are reduced, applying only to the field of inter-process communication. The integrity of message transfer, the uniqueness of task and thread identifiers and one distinguished identifier are sufficient to solve these problems. The last item can be used as a root for higher level naming schemes.

Since all data objects are completely controlled by tasks, local identifiers are sufficient, unique only per task and not in time; this simplifies the kernel. Furthermore higher levels gain flexibility, because they are less restricted by kernel concepts. Some naming and binding mechanisms that have been implemented on top of the kernel are described in section 5.

4 Details

This section gives a short overview of some details of the support for persistence in Eumel and L3. More details are described in [EUM 79, EUM 79a, Bey 89, Lie 91, Lie 92].

4.1 Tasks, threads and communication

The L3 kernel is an abstract machine implementing the data type task. A task consists of:

- at least one thread
  A thread (like a Mach thread) is a running program; up to 16384 are allowed per machine. All threads, except resident (unpaged) driver or kernel threads, are persistent objects.

- up to 16383 dataspaces
  A dataspace is a virtual memory object of size up to 1 GB. Dataspaces are also persistent objects and are subject to demand paging. Copying and sending are done lazily; physical copy operations are delayed (copy on write) as in Accent, Mach and BirliX.

- one address space
  Dataspaces are mapped dynamically into the address space for reading or modification. For hardware driver tasks, the address space is logically extended by the IO ports assigned to the task. (Recall that all device drivers are located outside the kernel and run at user level.)

As in Mach, paging is done by the default or external pager tasks, hence all interactions between tasks, and with the outer world, are based on inter-process communication (ipc).
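To make this structure concrete, the following C sketch models the kernel's view of a task as just described. It is purely illustrative: all type, field and constant names are inventions for this explanation, not L3 kernel code.

  #include <stdint.h>

  /* Illustrative model of the task abstraction; not L3 kernel code. */

  typedef uint64_t l3_id_t;        /* 64-bit ids, unique even over time */

  typedef struct thread {
      l3_id_t id;
      void   *stack_pointer;       /* registers are regarded as data,   */
      void   *program_counter;     /* so a thread can be checkpointed   */
      int     persistent;          /* like any other object; resident   */
  } thread_t;                      /* driver threads set this to 0      */

  typedef struct dataspace {
      l3_id_t  id;
      uint32_t size_bytes;         /* up to 1 GB, demand paged          */
      int      lazily_copied;      /* physical copies deferred (COW)    */
  } dataspace_t;

  typedef struct task {
      l3_id_t       id;
      thread_t    **threads;       /* at least one; up to 16384/machine */
      int           n_threads;
      dataspace_t **dataspaces;    /* up to 16383, mapped on demand     */
      int           n_dataspaces;
  } task_t;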

The Eumel/L3 ipc model is quite straightforward. Active components, i.e. threads, communicate via messages which consist of strings and/or dataspaces. Each message is sent directly from the sending to the receiving thread. There are neither communication channels nor links, only global thread and task identifiers (ids). (How does one acquire the id of a new partner for communication? This is achieved using one or more name servers. Usually the creating task implants the id of at least one name server into a newly created task. Using these or the root id the new task can communicate with some name server to get further task/thread ids.) L3 ipc therefore involves neither concepts of opening and closing nor additional kernel objects such as links or ports. This simple model differs in two important respects from most other message oriented systems:

- absence of explicit binding (opening a channel)
  One would expect additional costs, because the kernel must check the communication's validity each time a message is sent, but, on the other hand, there is no need for bookkeeping of open channels. We consider that our approach is more elegant, and, in fact, communication in L3 is fast (see below).

- no message buffering
  Due to the absence of channel objects we have synchronous ipc only; sender and receiver must have a rendezvous. But practice has shown that higher level communication objects, such as pipes, ports and queues, can be implemented flexibly and efficiently on top of the kernel by means of threads.

As described in [Lie 93], L3's ipc implementation performs well: short message transfer is 22 times faster than in Mach and 5 times faster than in QNX [Hil 92]. This even allows hardware interrupts to be integrated as ipc.

Local ipc is handled by the kernel, remote ipc by user level tasks, more precisely by chiefs (see section 4.2). Tasks and threads have unique identifiers, unique even in time, so a server usually concludes from the id of the message sender whether the requested action is permitted for this client or not. The integrity of messages (no modification, no loss, no sender id faking), in conjunction with the autonomy of tasks, is the basis for higher level protection.

4.2 Clans & Chiefs

Within both the L3 system and other systems based on direct message transfer, e.g. BirliX, protection is essentially a matter of message control. Using access control lists (acls) this can be done at the server level, but maintenance of large distributed acls becomes hard when access rights change rapidly. So Kowalski and Härtig [Kow 90] have proposed that object (passive entity) protection be complemented by subject (active entity) restrictions. In this approach the kernel is able to restrict the outgoing messages of a task (the subject) by means of a list of permitted receivers.

The clan concept [Lie 92], which is unique to L3, is an algorithmic generalization of this idea:

[Figure: nested clans - tasks drawn as circles, clans as ovals, each headed by a chief; messages crossing a clan's borderline are redirected to its chief.]

A clan (denoted as an oval) is a set of tasks (denoted as circles) headed by a chief task. Inside the clan all messages are transferred freely and the kernel guarantees message integrity. But whenever a message tries to cross a clan's borderline, regardless of whether it is outgoing or incoming, it is redirected to the clan's chief. This chief may inspect the message (including the sender and receiver ids as well as the contents) and decide whether or not it should be passed to the destination to which it was addressed. As demonstrated in the figure above, these rules apply to nested clans as well. Obviously subject restrictions and local reference monitors can be implemented outside the kernel by means of clans. Since chiefs are tasks at user level, the clan concept allows more sophisticated and user definable checks as well as active control.
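The redirection rule can be sketched algorithmically. The following self-contained C fragment is a minimal model, assuming hypothetical clan_t/task_t structures; it only decides where the kernel delivers a single message, and chiefs apply the same rule again when forwarding.

  /* Hedged sketch of the clans & chiefs redirection rule; all
   * names are hypothetical, not L3's kernel interface. */

  typedef struct clan clan_t;
  typedef struct task task_t;

  struct clan {
      clan_t *parent;   /* enclosing clan (clans may be nested)  */
      task_t *chief;    /* every clan is headed by a chief task  */
  };

  struct task {
      clan_t *clan;     /* the clan this task belongs to */
  };

  /* Decide where a message from 'src' addressed to 'dst' goes next. */
  task_t *deliver_to(task_t *src, task_t *dst)
  {
      if (src->clan == dst->clan)
          return dst;                 /* intra-clan: direct transfer */

      /* If dst lies inside a clan nested within src's clan, the
       * message is incoming for that clan: its chief receives it. */
      for (clan_t *c = dst->clan; c != NULL; c = c->parent)
          if (c->parent == src->clan)
              return c->chief;

      /* Otherwise the message leaves src's clan: redirect outgoing
       * messages to src's own chief. */
      return src->clan->chief;
  }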

Typical clan structures are:

Clan per machine: In this simple model there is only one clan per machine, covering all tasks. Local communication is handled directly by the kernel without incorporating a chief, whereas cross machine communication involves the chiefs of the sending and the receiving machines. Hence, the clan concept is used for implementing remote ipc by user level tasks.

Clan per system version: Sometimes chiefs are used for adapting different versions. The servers of the old or new versions are encapsulated by a clan so that its chief can translate the messages.

Clan per user: Surrounding the tasks of each user or user group by a clan is a typical method when building security systems. Then the chiefs are used to control and enforce the requested security policy.

Clan per task: In the extreme case there are single tasks, each controlled by a specific chief. For example, these one-task clans are used for debugging and supervising suspicious programs.

In the case of intra-clan communication (no chief involved), the additional costs of the clan concept are negligible (below 1% of minimal ipc time). Inter-clan communication however multiplies the ipc operations by the number of chiefs involved. This can be tolerated, since (i) L3 ipc is very fast (see above) and (ii) crossing clan boundaries occurs seldom enough in practice. Note that many security policies can be implemented simply by checking the client id in the server and do not need clans.

4.3 Dataspaces and pagers

Data objects can be mapped into address spaces and then arbitrarily addressed; therefore they are called dataspaces. They are managed by either the default or their individually associated external pagers. The default pager implementing persistent data objects was one of the key components of Eumel and L3 from the very beginning (1979), whereas Mach's external pager concept was integrated later as a generalization. External pagers are user level tasks. They may use devices (different from the default swapping device) for paging as well as virtual objects which are managed by other pagers, i.e. pagers may be stacked.

Dataspaces can be created, deleted, mapped, unmapped, copied and sent as parts of messages to other tasks. Copying and sending are done lazily. Since copy on write is supported by the kernel's memory management and the default pager, even file copies are cheap. Lazy copying and lazy sending use techniques similar to shadow paging [Lor 77] but are totally symmetric: when modifying the original of a file the corresponding page must be physically copied (or mapped to a new backing store block) in the same way as when modifying the copy.

4.4 Fixed points

Since processes are as persistent as data objects, checkpoint/recover mechanisms covering both are needed: the most important one in Eumel and L3 is the fixed point, also denoted by the slang word fixpoint. A fixpoint is a checkpoint covering the complete system (of one node). It includes a consistent copy of all tasks, threads, address spaces and dataspaces (i.e. all processes, files, databases, terminal screens etc.) taken at the same point of time. Due to copy on write techniques, this atomic action is short enough not to disturb the users. Writing the dirty pages to the backing store is done afterwards, concurrently to normal user and system activity.

To implement the fixpoint, all dataspaces of all tasks and all thread and task control blocks are integrated into the 'dataspace of all dataspaces' (dsds). So lazy copying can be applied to the system as a whole. Using a slightly modified ELAN notation, the fixpoint algorithm is:

  begin atomic
    new fixpoint dsds := actual dsds
  end atomic ;
  for i from 0 upto max frame number do
    if frame[i].dirty ∧ frame[i].belongs to fixpoint
      then write to backing store (frame[i])
    fi
  od ;
  begin atomic
    delete old fixpoint ;
    make new fixpoint valid on backing store
  end atomic .

The first atomic action, copying the dataspace of all dataspaces, requires a physical copy of the system root pointer (32 bytes) and flushing of all page frames which are mapped with write permission at this moment. On a 25 MHz Intel 386-based machine with 12 MB main memory this costs between 2 and 5 milliseconds. The following cleanup phase runs concurrently to the normal user and system activities at a relatively low priority. On average, one third of main memory contains dirty frames; depending to a great degree on the actual workload, between 7 and 150 seconds are needed for the complete write back. The final atomic action writes one sector to the backing store; this takes about 15 milliseconds.

Since all active and passive objects of one machine (including the terminal screens) are frozen consistently at the same time, fixpointing and recovery are transparent for most applications. Clients and servers on the same machine are always in a consistent state.
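For illustration, the three phases can also be rendered in C. This is a sketch only; every function and field name below is an assumption standing in for kernel internals, not L3 code.

  #include <stdbool.h>

  typedef struct {
      bool dirty;
      bool belongs_to_fixpoint;  /* snapshot membership, set atomically */
  } frame_t;

  extern frame_t frame[];
  extern int     max_frame_number;

  extern void begin_atomic(void);            /* stop all activity briefly  */
  extern void end_atomic(void);
  extern void snapshot_dsds(void);           /* copy 32-byte root pointer, */
                                             /* write-protect mapped frames */
  extern void write_to_backing_store(frame_t *);
  extern void commit_fixpoint_on_disk(void); /* one sector write (~15 ms)  */

  void fixpoint(void)
  {
      /* Phase 1: atomic copy-on-write snapshot (2-5 ms on a 386). */
      begin_atomic();
      snapshot_dsds();
      end_atomic();

      /* Phase 2: write dirty snapshot frames back, concurrently with
       * normal user and system activity, at low priority. */
      for (int i = 0; i <= max_frame_number; i++)
          if (frame[i].dirty && frame[i].belongs_to_fixpoint)
              write_to_backing_store(&frame[i]);

      /* Phase 3: atomically make the new fixpoint the valid one. */
      begin_atomic();
      commit_fixpoint_on_disk();
      end_atomic();
  }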

However, this mechanism is not sufficient for a distributed system, since a "worldwide" fixpoint is not reasonable, especially not in a heterogeneous distributed system. Therefore explicit (and sometimes application specific) recovery policies are necessary whenever relevant partners are not included in the fixpoint. Since the µ-kernel itself is not distributed, these policies must be implemented on top of it. Usually, general policies are based on the clan mechanism.

Fixpoints occur periodically or user initiated, but their latency limits their frequency. A practical lower bound is one fixpoint every 3 minutes. A common method for implementing more fine-granular checkpoints is to use an external pager which supports checkpointing each dataspace separately. By this, a remote file server can also save each modified file independently of the various fixpoints in the distributed system. But note that even such servers consist of persistent tasks and threads which are fixpointed by the default pager.

A regular shutdown consists of enforcing a fixpoint and stopping all other threads. There is no relevant difference to an irregular breakdown. In both cases restarting the system is done by activating the last valid fixpoint.

4.5 Fixpointing the kernel

Recovering a task or restarting a thread needs not only user level data (stacks, dataspaces) but also thread state, priority, mapping information, saved processor registers etc. All this kernel data is held in thread and task control blocks which are virtual objects. They are located in a kernel dataspace which is subject to paging, and the ordinary fixpoint mechanism checkpoints them like all other objects. Thus the control blocks are as persistent as user data, and recovery is simple as far as only values and virtual (kernel and user) addresses are involved.

For illustration let us assume that a message transfer running between two persistent threads is interrupted by a page fault (or time slice exhaustion or an external interrupt) and that this situation is frozen by a fixpoint. When recovering the system from this fixpoint, inspection of the thread states (part of the persistent thread control blocks) leads to restarting the message transfer. Recall that the data and the control block addresses are virtual, so that changes of real memory allocation are transparent to ipc. While fixpoint/restart is transparent to ipc between persistent threads (even to timeout handling), communication with a transient thread (e.g. a resident device driver) is aborted on restart.

In fact, holding the control blocks as virtual objects does not solve all problems related to thread restarts: although kernel stacks are part of the control blocks and therefore persistent, and stack addresses are virtual addresses, simply loading the stack pointer does not work, since the stacks may contain real addresses which necessarily come up when handling page faults. Furthermore, changing the kernel code would invalidate return addresses.

Therefore the restart algorithm inspects the kernel stack bottom of each thread and decides how to restart it:

Waiting for ipc: The thread remains waiting. Its state is not changed, but the return addresses on the kernel stack are updated, since the kernel code may have changed.

Running ipc: The thread state is not changed, but its kernel stack is reset to the outermost level and its instruction pointer set to a special ipc restart routine.

Otherwise: The thread state is set to 'busy' and the thread restarted at user level. (This is possible, since all system calls behave atomically, i.e. have no effect until completion.)
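A sketch of this decision in C follows; the enum values and helper functions are invented names for the actions described above, not the actual restart code.

  /* Hedged sketch of the thread restart decision; all names invented. */

  typedef enum { WAITING_FOR_IPC, RUNNING_IPC, OTHERWISE } stack_bottom_t;

  typedef struct thread thread_t;

  extern stack_bottom_t inspect_kernel_stack_bottom(thread_t *);
  extern void patch_return_addresses(thread_t *);  /* kernel code may differ */
  extern void reset_kernel_stack(thread_t *);      /* back to outermost level */
  extern void set_ip_to_ipc_restart(thread_t *);
  extern void mark_busy_and_restart_at_user_level(thread_t *);

  void restart_thread(thread_t *t)
  {
      switch (inspect_kernel_stack_bottom(t)) {
      case WAITING_FOR_IPC:
          /* State unchanged; only stale kernel return addresses fixed. */
          patch_return_addresses(t);
          break;
      case RUNNING_IPC:
          /* Redo the transfer from a well-defined entry point. */
          reset_kernel_stack(t);
          set_ip_to_ipc_restart(t);
          break;
      case OTHERWISE:
          /* Safe because system calls are atomic: no effect until done. */
          mark_busy_and_restart_at_user_level(t);
          break;
      }
  }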


5 Applications

5.1 Programs and procedures

One consequence of task persistence is that there is no real difference between running a program and calling a procedure within a program. Conventionally, running a program means:

  allocate space ;
  copy program code from load file into memory ;
  relocate code ;
  call (start address) .

The first three actions are not required in a persistent task. If the compiler takes advantage of this feature (as Eumel's ELAN compiler does), it simply generates executable code at the appropriate virtual addresses. Then running a program is

  call (start address) .

By means of this mechanism, and by using lazy copying for its table initialization, the ELAN compiler achieves noteworthy speed, especially when translating small programs. For example, to translate and execute 'put ("hello world")' costs 48 ms total elapsed time on a 25 MHz 386 based machine and 10.6 ms on a 50 MHz 486. This speed permits the translation of each job control and editor command as a separate single program.

If programs with conflicting addresses should subsequently run in the same address space, and the executable code of each program is contained in a file, a program is executed by:

  if required region of address space already used
    then unmap corresponding dataspaces
  fi ;
  map file to required region of address space ;
  call (start address) .

5.2 Dataspaces and persistence

Dataspaces can be mapped into an address space and may contain data of all types, i.e. Atkinson's type completeness requirement [Atk 83] is fulfilled. Variables located in mapped dataspaces are accessed by the same mechanism (by virtual address) and with the same efficiency as normal program variables. In fact, programs are contained in dataspaces as well.

Note that persistence here does not require open/close operations. Due to task persistence a dataspace may remain mapped arbitrarily long. Distributed applications use external pagers and remote procedure calls (RPC) for synchronization and/or remote object access.
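As an illustration of this access path, the following C sketch walks a linked structure that lives in a mapped dataspace; map_dataspace is a hypothetical stand-in for the kernel interface, and there is no open/read/close anywhere.

  #include <string.h>

  /* Sketch only: 'map_dataspace' and 'dataspace_id' are hypothetical
   * stand-ins for the kernel interface, not L3 system calls. */

  typedef unsigned long dataspace_id;

  extern void *map_dataspace(dataspace_id ds, void *preferred_addr);

  struct phone_entry {              /* arbitrary data structure: type  */
      char name[32];                /* completeness means records and  */
      struct phone_entry *next;     /* pointers live in dataspaces too */
      long number;
  };

  long lookup(dataspace_id ds, const char *name)
  {
      /* Map once, then use plain virtual addresses. Because the task
       * is persistent, the mapping itself may stay established for
       * weeks. */
      struct phone_entry *e = map_dataspace(ds, NULL);

      for (; e != NULL; e = e->next)
          if (strcmp(e->name, name) == 0)
              return e->number;
      return -1;
  }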

5.3 Files, directories and protection

By default, directories and file servers are implemented as tasks. These use simple arrays to hold the catalogue (file names and attributes) and dataspaces as files:

  initialize catalogue as empty ;
  do {forever}
    wait for order ;
    if from authorised partner ∧ access permitted
      then if is file access order ∧ file exists
             then send or map dataspace
           elif is create order ∧ ¬ file exists
             then allocate catalogue entry ;
                  create dataspace
           elif ...
           fi
      else reject
    fi
  od .

Here 'access permitted' can be an arbitrary algorithm usable to support various security policies, e.g. 'free access', simple 'password protection', 'time bound access', 'authorized user only access' etc. The mechanism permits optional as well as mandatory access control.

6 Experiences

6.1 Stabilities and instabilities

We strongly recommend that system developers use their own system as a workbench (not only for testing) as early as possible. Only if they take the risk of their own software instabilities and inefficiencies will the system soon become stable and fast enough for end users. In particular, relying on the system will lead to a realistic balance between stability and optimization.

When we began to use Eumel as a workbench, we were particularly afraid of collapsing processes. Recall that the shutdown of a task makes all of its data, i.e. all of its local files, inaccessible. Of course, we could have held all files in a file server task, so that they would not be seriously affected by a collapsing worktask. However, although the probability of shutdown in a file server task may be lower than in a task used for editing, compiling and testing new programs, the file server itself is also a task, potentially suffering from the same threats. In fact, the very first Eumel versions we (and only we) used led to disasters like:

- subscript overflows in directory handlers resulting in broken chains and unusable directories,

- stack overflows within the stack overflow exception handler resulting in absolutely dead tasks,

- crashes of the virtual machine when executing instructions under special circumstances.

However, most of these crashes turned out to be soft, i.e. recoverable. We cured them by simply resetting the system to the last fixpoint. On average the previous 5 minutes were lost, but serious harm had been avoided and we could correct our software.

The fixpoint mechanism made the system more robust, not only in its early stages but also when it became widely used for production. The majority of faults survived by means of fixpoints, thereafter, were the originally intended hardware related ones, e.g. power failure, controller hangups etc., and spurious kernel faults.


The most prominent use of the fixpoint facility occurred when we gave a demonstration for the German Minister of Research and Technology. Shortly before the minister entered the room, a cleaner hurried in to remove the last speck of dust. Of course, her first action was to unplug our power supply cable and plug in her vacuum cleaner. It provided a perfect demonstration of fault tolerance in a persistent system by fixpoint.

Besides such positive effects of stability and robustness through fixpointing, we also had bad experiences. Unfortunately the fixpoint stabilises a corrupted system in the same way. In practice, this only happened in conjunction with kernel level errors. The overwhelming majority of severe kernel errors led to an immediate crash or starvation and did not hurt the fixpointed system, but some errors in the areas of memory management, paging and concurrent block garbage collection led to serious and permanent low level defects. In most cases synchronization errors resulted in a multiply used block on disk: usually one instance stored a page block table in a disk block, while a different one used the same block for a text file.

In the first years of Eumel these errors soon became very rare, but three unpleasant effects became noticeable:

- Severe kernel errors tended to be extremely rare and to occur only at one or two installations (far away, of course), and not correlated with specific actions.

- About one third of these problems turned out to be spurious hardware malfunctions, which in most cases were hard to detect and to prove.

- Improving the kernel algorithms and incorporating new hardware (memory bank switching, specific CPU features etc.) was required by the users, but often led to new periods of higher instability.

By way of illustration, here is one example incident:

There was a machine tool manufacturing plant using L3, which had on average (but not periodically) one system breakdown every two weeks. The crashes often occurred early in the morning, when nobody was working and there were no special system activities. Analyzing the crashes did not give us usable hints. At that time about 200 other L3 systems were in use, and none of them suffered from kernel instability. We inspected the special application software running on the critical system, but we did not find anything extraordinary. For two months we racked our brains performing "dry" analysis of potentially critical situations. Indeed, we found two errors which certainly had never occurred, but not the critical one. Then the installation was upgraded from 8 MB to 12 MB main memory, and the problem immediately disappeared.

But: when simplifying the kernel algorithms in the following year, we discovered three errors, each of them being very, very improbable but able to explain the spurious crashes. Since the machine tool manufacturer did not want to return to his potentially unstable hardware to allow further tests, we will never know for certain whether the crashes were induced by those errors. Intuitively I would give the odds 6 : 1 : 3 for 'fixed L3 bug' : 'as yet unknown L3 bug' : 'hardware malfunction'.

Concluding the good and bad experiences of 13 years as far as stability in a persistent system is concerned:

- A persistent system can be made very stable. With Eumel and L3 we have reached a state where crashes (recoverable and unrecoverable) due to software are less frequent than those due to hardware.

- The overwhelming majority of crashes are recoverable.

- Even when the system is as stable as possible, if you wait long enough there will still suddenly be someone suffering from a really damaged system.

6.2 Performance

Frequently, people believe that persistence is expensive, especially when taken together with fault tolerance. Due to our experiences with Eumel and L3, we are convinced that it is relatively cheap, if the system is carefully and thoroughly designed, from hardware up to application level.

Unfortunately we cannot prove that persistence is inexpensive, since there is no nonpersistent Eumel or L3, but we are able to give two arguments that support our claim:

- Apart from fixpointing, there is no feature that is required purely in order to provide persistence. Virtual address spaces and paging are common features of timesharing systems.

- There are commercial 16-user systems running L3 on a single 486 monoprocessor.

Fixpointing, however, is not for free. It costs a certain amount of backing store, additional swap outs for storing the fixpoint, and some free main memory for copy on write actions taking place immediately after starting the fixpoint and before its frames are written back to disk.

Experience has shown that for most applications these additional costs become negligible when 30% more main memory (reducing normal paging circumvents the disk traffic bottleneck during fixpoint) and a disk reserve of 3 times the main memory size are available.

Although measurements of a few special and singular mechanisms are not adequate when seriously discussing a system's performance, some are presented here simply to give the reader a rough feeling. A 50 MHz 486 no-name PC with 16 MB main memory was used for these measurements:

  null cross address space RPC (user -> user)                         10 µs
  copying 300 files to another task (including directory updates)    2.8 ms/file
  create dataspace; map it; write one byte; delete dataspace         2.0 ms
  fixpoint covering 1000 dirty page frames (no concurrent activity)  6.8 ms/page
  translate and execute 'put ("hello world")'                        10.6 ms

6.3 How programmers adopt the ideas

We have used Eumel to teach word processing to university secretaries who had no previous computing experience whatsoever. We soon learned that we must not talk about persistence: they took persistence for granted. It was a natural concept for them (paper is persistent), but our complicated explanations of persistence irritated them. When giving the next course to secretaries we therefore reduced our explanation of persistence to: "For obscure technical reasons an unexpected power failure will destroy the work you have done in the preceding few minutes."

Pupils or students learning programming using ELAN on Eumel/L3 at first look at files as something which can be edited. They regard persistence as a natural but secondary effect. Later on, they write modules which are naturally based on the persistence of variables, e.g.:

  text var actual password := "" ; {static variable}

  proc set password (text old, new) :
    if old = actual password
      then actual password := new
    fi
  end proc set password ;

  bool proc password ok (text pw) :
    (pw = actual password)
  end proc password ok ;

Although this type of application is produced as a matter of course, students start to think explicitly about persistence when they first write programs for long running computations. For a beginner, the persistence of a 'really executing' activity seems to be less natural than that of a 'waiting' activity.

Things are different when experienced software engineers first use Eumel/L3. They know about long running jobs from their experience of mainframes, but they tend to hold as much data as possible in files. They soon start to map files/dataspaces holding arbitrary data structures, but only construct applications based on a multitude of persistent tasks after they have gained considerable experience. Often they structure their system on tasks to separate responsibilities. For example, the lawyer supporting system has about 10 tasks per user. Sometimes this leads to one task per developer, in the same way that three compiler writers working together tend to construct a three pass compiler.

An unintended way of using persistence is practiced by some experienced users: having deleted the wrong file (despite the system having asked for confirmation), pressing reset resets the system to the last fixpoint and the deleted file reappears. Usually, colleagues who are using the same machine concurrently prohibit them from doing this.

7 Conclusions

Persistence as a basic and universal principle covering data and processes has proved to be a good foundation for Eumel and L3. Our experiences of the last 13 years have shown that programmers like persistence, and both system level and application programming become easier. We have learned that realizing persistence at the µ-kernel level permits the construction of an efficient and flexible system with a reasonable amount of work for design and implementation. However, a system relying heavily on persistence needs significant effort to make it stable enough for real use.

Acknowledgements

The development of Eumel and L3 would have been impossible without the contributions and programming efforts of Uwe Beyer, Dietmar Heinrichs and Peter Heyderhoff. I would also like to thank Peter Dickman for proofreading this paper and helpful comments. This paper was written using LaTeX on L3.

References

[Acc 86] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, M. Young. Mach: A New Kernel Foundation for UNIX Development. Proceedings Usenix Summer '86 Conference, Atlanta, Georgia, June 1986, pp. 93-113.

[Alb 80] A. Albano, M. E. Occhiuto, R. Orsini. A uniform management of temporary and persistent complex data in high level languages. Nota Scientifica S-80-15, September 1980.

[Atk 82] M. P. Atkinson, K. J. Chisholm, W. P. Cockshott. PS-Algol: An Algol with a Persistent Heap. Sigplan Notices 17(7), July 1982, pp. 24-31.

[Atk 83] M. P. Atkinson, P. J. Bailey, K. J. Chisholm, W. P. Cockshott, R. Morrison. An Approach to Persistent Programming. Computer Journal 26(4), November 1983, pp. 360-365.

[Atk 87] M. P. Atkinson, O. P. Buneman. Types and Persistence in Database Programming Languages. Computing Surveys 19(2), June 1987, pp. 105-190.

[Ben 72] A. Bensoussan, C. T. Clingen, R. C. Daley. The Multics Virtual Memory: Concept and Design. CACM 15(5), May 1972, pp. 308-318.

[Bey 89] U. Beyer, D. Heinrichs, J. Liedtke. Dataspaces in L3. Proceedings MIMI '88, Barcelona, July 1988.

[Cah 93] V. Cahill, R. Balter, N. Harris, X. Rousset de Pina (Eds.). The Comandos Distributed Application Platform. Springer-Verlag, 1993 (to appear).

[Che 79] D. Cheriton, D. A. Malcolm, L. S. Melen, G. R. Sager. Thoth, a Portable Real-Time Operating System. CACM 22(2), February 1979, pp. 105-115.

[Che 84] D. Cheriton. The V Kernel: A Software Base for Distributed Systems. IEEE Software, April 1984, pp. 19-42.

[EUM 79] EUMEL Benutzerhandbuch (German), University of Bielefeld, 1979. EUMEL User Manual (English), GMD, 1985.

[EUM 79a] EUMEL Referenzhandbuch (German), University of Bielefeld, 1979. EUMEL Reference Manual (English), GMD, 1985.

[Gui 82] M. Guillemont. The Chorus Distributed Operating System: Design and Implementation. Proceedings ACM International Symposium on Local Computer Networks, Firenze, April 1982.

[Här 92] H. Härtig, W. E. Kühnhauser, W. Reck. Operating Systems on Top of Persistent Object Systems - The BirliX Approach. Proceedings 25th Hawaii International Conference on System Sciences, IEEE Press, 1992, Vol. 1, pp. 790-799.

[Har 85] N. Hardy. The KeyKOS Architecture. Operating Systems Review, September 1985.

[Hil 92] D. Hildebrand. An Architectural Overview of QNX. Proceedings Micro-kernel and Other Kernel Architectures Usenix Workshop, Seattle, April 1992, pp. 113-126.

[Hom 79] G. Hommel, J. Jäckel, S. Jähnichen, K. Kleine, C. H. A. Koster. ELAN Sprachbeschreibung. Wiesbaden, 1979.

[Kow 90] O. Kowalski, H. Härtig. Protection in the BirliX Operating System. Proceedings 10th International Conference on Distributed Computing Systems, 1990.

[Lie 91] J. Liedtke, U. Bartling, U. Beyer, D. Heinrichs, R. Ruland, G. Szalay. Two Years of Experience with a µ-Kernel Based OS. Operating Systems Review 25(2), 1991.

[Lie 92] J. Liedtke. Clans & Chiefs. Proceedings 12. GI/ITG-Fachtagung Architektur von Rechensystemen, Kiel, 1992; A. Jammel (Ed.), Springer-Verlag.

[Lie 92a] J. Liedtke. Fast Thread Management and Communication Without Continuations. Proceedings Micro-kernel and Other Kernel Architectures Usenix Workshop, Seattle, 1992.

[Lie 93] J. Liedtke. Improving IPC by Kernel Design. Proceedings 14th ACM Symposium on Operating System Principles, Asheville, North Carolina, December 1993.

[Lor 77] R. A. Lorie. Physical Integrity in a Large Segmented Database. ACM Transactions on Database Systems 2(1), March 1977, pp. 91-104.

[Mul 84] S. J. Mullender et al. The Amoeba Distributed Operating System: Selected Papers 1984-1987. CWI Tract No. 41, Amsterdam, 1987.

[Ras 81] R. Rashid, G. Robertson. Accent: A Communication Oriented Network Operating System Kernel. Proceedings 8th Symposium on Operating System Principles, December 1981.

[Ros 87] J. Rosenberg, J. L. Keedy. Object Management and Addressing in the MONADS Architecture. Proceedings 2nd International Workshop on Persistent Object Systems, Appin, 1987; available as PPRR-44, Universities of Glasgow and St. Andrews.
