A Multiprocess for 16 Bit Microcomputers or

Porting UNIX1 to a NS32016 with Enhancements

Lucy Chubb 1987

Submitted in partial fulfilment for the requirements of the degree of Master of Science, School of Electrical Engineering and Computer Science, University of New South Wales.

1. Unix is a trademark of AT&T Bell Laboratories.

CERTIFICATE OF ORIGINALITY

I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the award of any other degree or diploma of a university or other institute of higher learning, except where due acknowledgement is made in the text.

Signed .

Date .....

ACKNOWLEDGEMENTS

I would like to extend my sincere thanks to my supervisors Dr. David Mee and Dr. David Clements for their assistance during my thesis and their valuable comments on the style and contents of my thesis. I would also like to thank my husband Peter for patiently listening to repeated explanations of my project and for reading and commenting on my thesis.

ABSTRACT

This thesis describes how the Unix operating system version 7 was ported to a microcomputer based on the NS32016 CPU and associated slave processors. The port had several aims: to get a working kernel running on the NS32016, to implement paged virtual memory, to change the representation of processes so that the address space of each running process corresponds to a file in a directory created for that purpose, and to investigate any further developments which such changes might make possible.

Preparation for the port required making sure that the appropriate hardware and software tools were available and becoming familiar with the source code of Unix. After deciding what modifications were required, the port proceeded by designing, coding and testing the changes. The kernel is now essentially complete.

Finally, a number of extensions to the operating system based on the changes made to the representation of processes and the implementation of paged virtual memory are proposed and discussed. These extensions allow a simpler implementation of text sharing and the ability for processes to map files into their virtual space.

CONTENTS

1. Introduction ...... 5
1.1 Project Goals ...... 5
1.2 Porting an Operating System ...... 6
1.2.1 What is Portability? ...... 6
1.2.2 Features Contributing to Portability ...... 6
1.2.3 Unix as a Candidate for Porting ...... 7
1.2.4 Preparation for Porting ...... 7
1.2.5 The Process of Porting ...... 9
1.2.6 Designing Changes ...... 10
1.3 A Brief look at Unix ...... 11
1.4 Ways and Means ...... 16
1.5 What Comes Next ...... 17

2. Software and Hardware Tools ...... 18
2.1 Software ...... 18
2.1.1 ACK - The Amsterdam Compiler Kit ...... 18
2.1.2 NLD - The L Format Downloader ...... 19
2.1.3 ARCH - the archive and library maintainer ...... 19
2.1.4 The DB16000 monitor ...... 20
2.1.5 Make ...... 20
2.1.6 Cxref - the C crossreference generator ...... 21
2.2 Hardware Used ...... 21
2.2.1 The DB16000 Development Board ...... 21
2.2.2 The NS32016 Central Processing Unit ...... 22
2.2.3 The NS16202 Interrupt Control Unit ...... 26
2.2.4 The NS16082 Memory Management Unit ...... 27
2.2.5 Address Translation ...... 28
2.2.6 The WD1002 Winchester/Floppy Controller ...... 29
2.2.7 The Disk Drives ...... 31
2.2.8 The Source Machine - VAX11/780 ...... 32

3. A Unix Port ...... 34
3.1 Traps and Interrupts ...... 35
3.2 The Unix Input/Output System ...... 42

3.2.1 Writing a New Device Driver ...... 42
3.2.2 Block Input and Output ...... 43
3.2.3 Character Input and Output ...... 49
3.2.4 The Console Device Driver ...... 55
3.3 Scheduling, Synchronization and Timing ...... 57
3.4 The UNIX File System ...... 61
3.4.1 The Initial File System ...... 62
3.5 Memory Management ...... 65
3.5.1 Virtual Memory ...... 65
3.5.2 Swapping Under Unix ...... 66
3.5.3 Demand Paging ...... 67
3.5.4 An Implementation of Virtual Memory with Demand Paging ...... 68
3.5.5 Transferring Data Between Virtual Spaces ...... 83
3.5.6 Management of Real Memory ...... 85
3.6 Process Management ...... 89
3.6.1 Processes under Unix Version 8 ...... 90
3.6.2 The Representation of Processes ...... 90
3.6.3 Process Duplication ...... 92
3.6.4 Process Execution ...... 96
3.6.5 Process Termination ...... 98
3.6.6 Creating a Core Image ...... 99
3.6.7 Process Switching ...... 100
3.7 System Calls ...... 104
3.7.1 System Calls Deserving Special Mention ...... 105
3.8 System Initialization ...... 107

4. Proposed Extensions ...... 114
4.1 Representation of Virtual Space ...... 114
4.2 Shared Text ...... 116
4.3 Setting Up Segments ...... 118
4.4 Extending Segments ...... 122
4.5 Removing Segments ...... 123
4.6 Page Faults ...... 124
4.7 Freeing Pages ...... 125

5. Current Status and Reflections ...... 127

6. Conclusion...... 130

7. References ...... 131

8. Appendix A - Notes on Software and Hardware ...... 134
8.1 The Memory Map of the DB16000 ...... 134
8.2 NS32016 Addressing Modes ...... 134
8.3 Methods of Handling Interrupt Priorities Using the NS16202 ...... 136
8.4 A Summary of Commands for the WD1002 ...... 137

9. Appendix B - A Summary of Changes made to Unix ...... 139

1. Introduction

Operating systems are complex pieces of software. Considerable amounts of time and money are devoted to writing a new one. When the availability of a new type of hardware or the desire for a new type of virtual machine results in the need for a new operating system, the expense of providing one can be reduced by moving an existing operating system to the new machine or by making an existing one provide the new features. The process of moving an operating system to another machine is called porting and usually involves modifying the operating system because of differences between the machine it runs on and the destination machine.

Even when a machine has an operating system, a different one may be required because it is easier to use, better known or provides functions not provided by the original. On a new machine the use of an existing operating system allows many programs written for use under that operating system to become available. An early port of the Thoth operating system was done to provide the same abstract machine for applications programs even though the underlying hardware had changed [Cheriton,77]. This saves much time and effort which might otherwise have been required in rewriting those programs to run on the new machine.

Another major benefit of porting an existing operating system is the ease with which users can adapt to working on a different machine. There is no major relearning required because the logical machine presented to the user is the same or very similar.

1.1 Project Goals

The main goal was to obtain a working Unix kernel [Ritchie,74] on a microprocessor using the NS32016 chip set [NS,85]. It was desired to make use of the memory management unit and implement paged memory, removing the disadvantage of limited process sizes under swapped memory systems on processors with small memory sizes. In conjunction with the implementation of paged memory it was decided to create a new directory containing files corresponding to the address space of each running process [Killian,85].

1.2 Porting an Operating System

1.2.1 What is Portability?

"Portability is a measure of ease with which a program can be transferred from one environment to another: if the effort required to move the program is much less than that required to implement it initially, and the effort is small in an absolute sense, them the program is highly portable" [Tanenbaum,78). An attempt to quantify portability was made [Powel,79) with the following equation:

relative portability from A to B = (1 - C/D) * 100

where C = effort to transport from A to B
      D = effort to implement on B from scratch

The relative portability from A to B is 100% if A and B are identical. This equation suffers from a difficulty faced by all software development: estimating the effort required to write or modify software. Thus it is easy to work out how portable a piece of software is in retrospect, but harder to work out whether it will be worth porting. In practice, the experience of others in porting a particular operating system may give some indication of how easy it will be to port.
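As a purely illustrative example, if a port were estimated to take 2 person-months of effort where a fresh implementation would take 20, the relative portability would be (1 - 2/20) * 100 = 90%.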

1.2.2 Features Contributing to Portability

There are two things which are likely to restrict the portability of a program. They are the language of implementation and the logic of the program, which may be machine dependent [Minchew,82]. Problems with languages may arise from lack of standardization or from the inability to run the language in a given environment due to hardware constraints. The latter is a particular problem for assembly languages.

A strategy critical to portability is modular design, particularly in isolating machine dependencies in a small area so they may be easily found, reducing the amount of code to be changed when the hardware changes.

It was said of the operating system Thoth [Cargill,80] that "the system's portability is achieved by defining a hierarchy of computational abstractions which can be realized on each machine. High level abstractions such as files, devices and real time are defined in terms of lower level abstractions such as address, word, byte and interrupts, which in turn are mapped onto hardware." Similarly the Perseus operating system [Zwaenepoel,84] was designed as a two level hierarchical kernel. The lower level interfaced directly to the hardware. The upper level provided a virtual machine which was not dependent on the architecture.

It is always necessary to have some portions of the operating system written in assembly language. It is desirable to reduce that portion to a minimum. Code written in a high level language tends to have fewer bugs, be more readable and easier to modify correctly.

1.2.3 Unix as a Candidate for Porting

UNIX has been shown by past experience to be highly portable ([Schriebman,84], [Hsu,84], [Jalics,83], [Miller,78]). It is (reasonably) modular and is written in a high level language. The C language does not enforce good programming practices to the extent that some other languages do, but it has proved to be a powerful and flexible language suitable for writing operating systems.

The amount of effort required to port Unix is relatively small. Kernighan in "The Unix System and Software Reusability" [Kernighan,84] says Unix takes about 1 year of effort to port, while another effort took 16 man months [Jalics,83] with a team having a good knowledge of operating systems but little knowledge of Unix in particular.

It was intended, when the project was proposed, that Unix be the operating system ported. The intended users of the system are familiar with both Unix and the C language. A large number of desirable application programs and utilities would be made available with little or no modification required.

1.2.4 Preparation for Porting

There are a number of issues which need to be considered before starting a port of an operating system. The following issues are suggested by Bodenstab et al [Bodenstab,84]. If the operating system runs on more than one processor, which one will be used as a starting point? Is there a compiler, assembler, loader and any other required piece of software available for the target system? Is there a way of loading object code onto the target system? Can an initial file system be created? What debugging tools are available?

Hardware

The hardware to be used as the destination of the port needs to be determined. The extent to which the hardware is able to provide the support required by the operating system should play a significant part in any decision. The required features of the operating system will influence this to some extent. For instance, the requirement for virtual memory will make memory management hardware and a hard disk very desirable. Features such as word size, byte ordering within a word, interrupt and memory management hardware affect the ease of porting. The more similar the two architectures are to each other the easier the task of porting becomes.

If the hardware is new in design and has not been extensively used, debugging may be required. Using the NS32016 chip set when it was new, Hsu et al [Hsu,84] found the task of porting more troublesome because of the need to isolate hardware bugs from software bugs.

The architecture of the host and destination processors needs to be well understood. Of particular interest are the differences between them because almost every difference will require the rewriting of code.

Tools

The most obvious tools required are the compiler, assembler and linker. If any of these are not available they will have to be written or ported. The facility to load from the host machine is also required in the form of a loader on the host machine and a monitor on the destination machine. Loading could be done via a serial port, floppy disk or cassette tape.

Unix requires a file system to exist before it is able to execute. A stand alone disk driver is required to create the initial file system unless another system is available under which the file system can be created.

When testing is commenced some debugging tools will be required, both hardware, such as breakpoint registers, and software, such as a post mortem debugger.

Another item often considered useful, or even essential, is a running Unix system. The port described by Jalics [Jalics,83] started without access to a Unix system. After examining the resources needed to do a port they decided that they could not proceed without a running Unix system and purchased a PDP-11/34 with Unix.

When all the appropriate tools have been gathered some time needs to be allowed for becoming familiar with them.

Software

If only one operating system source is available there is no choice as to which one to use. On the other hand if several are available a choice will have to be made. It is important to bear in mind that porting is easier when the source and destination machines are similar.

The source has to be well understood in preparation for deciding which parts of it need to be modified or rewritten. The ease of this will depend on the language (recognizing that high level languages are usually easier to read than assembler), the quality of commenting, and the clarity of the code. Even UNIX, which has been deemed highly portable, has had some problems in this area. "Lack of comments in parts that obviously have to be modified or rewritten is a serious impediment to portability" [Jalics,83].

Other considerations

The following suggestions apply particularly to ports being done by a team, but may also be useful for individuals considering doing a port. It may be useful to establish guidelines about when it is acceptable to modify code [Hsu,84]. An overall (committed to paper) vision of what the system will be like when completed is useful. Some things to consider are the uses it will be put to, whether it will be single or multi user, what sort of peripherals it will have and whether virtual memory is required. It might be desirable to include new features in the operating system or leave some features out. When new features are included it is possible that some existing features become obsolete. A timeframe within which the port is to be completed should be established. The experience of anyone else who has ported the same operating system will provide useful guidelines.

1.2.5 The Process of Porting

Several approaches to getting an operating system running on a new machine are possible. Three different approaches are suggested by Jalics [Jalics,83].

1. Write a hardware emulator for the new machine and run the operating system on that.

2. Write an operating system for a mythical machine which is at a higher level, hence easier to program, than normal hardware. An emulator is written to run the operating system on the real hardware. A benefit of this is that only the emulator is machine dependent.

3. Write (or otherwise obtain) an operating system in which the machine dependent parts of the code are isolated. The machine dependent portions are rewritten and the rest recompiled. It was this approach which Jalics considered to offer the best potential. The more usual approach with the last option is to rewrite the assembly portions first, making modifications to the high level code and recompiling it later. A different "top down" approach was taken by Miller [Miller,78].

Miller divided Unix into three levels: the top level being system calls, the second being support for primitive operations and resource allocation, most of which is machine independent, and finally the machine dependent routines controlling peripherals, interrupts, process switching and register manipulation. His port was based on these divisions.

The first stage was to test the C compiler. The C runtime library was rewritten to use the target machine operating system, OS/32, to simulate system calls.

The next stage was to run the machine independent portion of the kernel as a privileged program with OS/32 providing pseudo devices and interrupts. This allowed the logic of Unix to be tested and minor changes to be made. The different size of integers between the source and target machines was mentioned as a reason for changes.

Finally the lowest level routines were written. This approach meant that there was no need to construct an artificial environment for testing drivers and other low level routines as the Unix kernel itself provided a suitable environment.

In this thesis the third approach was chosen. The second method is only applicable when writing a new operating system. The use of an emulator, as in the first option, always has a performance penalty. Having chosen the third option the "top down" approach was immediately eliminated because the target machine had no operating system, only a monitor.

1.2.6 Designing Changes

Initially I thought the right approach was to understand exactly what the routine being changed did, then write the new routine to do the same things. This can be done, but there is another possible approach. Seek to understand what the rest of the operating system requires from the routine, including side effects of the original routine, and write another routine to fulfill those requirements. This means that some routines can be rewritten without being understood in detail. Others have to be understood thoroughly to work out what the operating system required of them. The new routine does not necessarily have to work in the same way as the original.

1.3 A Brief look at Unix

Unix can be partitioned into subsystems on the basis of function. In addition to being divided into subsystems it can be divided into three sections relating to portability. Decisions on how to port each subsystem are aided by this further division.

Changes which affect one routine in a subsystem are more likely to affect other routines in the same subsystem than those in other subsystems. The following subsystems are briefly described. Further details about each subsystem are given in chapter 3 where relevant.

1. Initialization.

The operating system initializes registers and memory, which contain all the data structures used during its life. Hardware is initialized to provide a stack, interrupts and memory management. Each different type of machine will have different hardware initialization requirements. Most data structures need only be zeroed during initialization, but others need more complex procedures such as the construction of lists. The final part of the initialization is to create a user process which can be thought of as a dynamic data structure. It executes a file containing a program to allow users to log on and start the shell.

2. Traps and Interrupts.

Traps and interrupts are occurrences which cause the CPU to suspend execution of the current process and execute another routine which is part of the operating system to handle the event indicated by the trap or interrupt. Traps are generated in response to some condition within the CPU while interrupts are generated outside the CPU by devices requesting servicing by the CPU.

A "processor priority" is maintained which prevents an interrupt of lower priority being processed until the processor priority falls. This is used to exclude all but one process from vital sections of code. The ability to have software initiated interrupts is - 12 -

provided and is used by the clock to do some of its periodic processing at a lower priority than the clock interrupt.

3. Scheduling, Synchronization and Timing.

Two resources are scheduled between processes: residency within memory and use of CPU time. Unix allows more processes to be active (executing) than will fit into its main memory. As a process must be resident in main memory to use the CPU, residency must be shared between processes. Removing a process from memory and placing it on a secondary storage device such as a disk or restoring it from such a device into main memory is called swapping. After a process has been swapped out for a sufficient length of time it will be swapped in, even if another process has to be swapped out to make room for it.

A process is allowed use of the CPU either until it gives it up voluntarily or until a certain length of time has elapsed. A process gives up the CPU when it has to wait for a resource to become available or an event such as the termination of an input or output operation. The CPU is rescheduled according to a scheduling priority maintained for each process which enables the most deserving process to be identified.

A process may suspend its execution by calling the routine sleep and specifying what it wants to sleep on. It resumes execution when another process calls wakeup on the thing the first process said it wished to sleep on. It is possible for a process to suspend execution for a specified number of seconds using timeouts.
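As an illustration of the sleep/wakeup convention, the sketch below shows how a kernel routine might wait for a shared device to become free. The routine names beyond sleep and wakeup, the channel variable and the priority constant are placeholders for this discussion rather than extracts from the ported kernel; the point is that a channel is simply an address both sides agree on, and that the condition is re-tested because a wakeup resumes every process sleeping on that channel.

/* Sketch of the sleep/wakeup convention (illustrative, not the ported source).
 * A "channel" is just an address both sides agree on; PRIBIO is a placeholder
 * priority constant.  The condition is re-tested after waking because wakeup()
 * resumes every process sleeping on the channel.
 */
#define PRIBIO 20

void sleep(void *chan, int pri);   /* suspend until wakeup(chan) is called */
void wakeup(void *chan);           /* resume all processes sleeping on chan */

static int device_busy;            /* shared flag, guarded by processor priority */

void acquire_device(void)
{
    while (device_busy)
        sleep(&device_busy, PRIBIO);   /* give up the CPU until released */
    device_busy = 1;
}

void release_device(void)
{
    device_busy = 0;
    wakeup(&device_busy);              /* let any waiting process retry */
}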

The time is maintained within the system by accumulating the number of clock interrupts which occur. The clock routine performs periodic processing such as the reevaluation of process priorities and execution of timeouts.

A signal is generated by some abnormal event. It usually causes the receiving process to terminate and in some cases forces an image of the process at the time the signal was received to be produced. The image is called a "core" file and can be used for debugging. A process can catch a signal or have it ignored rather than allow the default processing. Examples of occurrences causing signals are illegal instructions, the receipt of a delete character from a terminal, dividing by zero and so on.

4. Block and Character Input/Output.

There are two levels within the input/output subsystem. The lower level, the device drivers, isolates differences between device types from the rest of the operating system. The higher level manipulates buffers for the data to be read from or written to a device. There are two major types of devices, block and character, which each have their own buffering methods.

There are two arrays, one for block devices and one for character devices, which contain the entry points for the device driver routines. The arrays are indexed by the device type which is part of the identification number of every device. This allows the correct routines to be selected without having to know anything about the device except its identification and whether it is block or character.
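The idea can be illustrated with a simplified sketch of the two switch tables. The field names and the encoding of the device number below are illustrative assumptions, not the declarations used in the kernel.

/* Simplified sketch of the two device switch tables.  The major device number
 * (the "device type") indexes the array, so a caller needs to know only the
 * device identification and whether it is block or character.
 */
struct buf;                                   /* buffer header, declared elsewhere */

struct bdevsw {
    int  (*d_open)(int dev, int flag);
    int  (*d_close)(int dev, int flag);
    void (*d_strategy)(struct buf *bp);       /* queue a block transfer */
};

struct cdevsw {
    int  (*d_open)(int dev, int flag);
    int  (*d_close)(int dev, int flag);
    int  (*d_read)(int dev);
    int  (*d_write)(int dev);
    int  (*d_ioctl)(int dev, int cmd, void *arg);
};

extern struct bdevsw bdevsw[];                /* one entry per block device type */
extern struct cdevsw cdevsw[];                /* one entry per character device type */

#define major(dev)  (((dev) >> 8) & 0xff)     /* assumed encoding of the device number */

void start_io(int dev, struct buf *bp)
{
    (*bdevsw[major(dev)].d_strategy)(bp);     /* dispatch without knowing the driver */
}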

5. The File System.

The file subsystem defines the structure of files, maintains a hierarchical directory structure, and manages allocation of space within the file system. It does all accesses to devices through the input/output subsystem. Each node of the directory contains one of the following objects: a directory file, a regular file, a special file or a removable file system.

To the user a regular file looks like a sequence of bytes. No internal structure is imposed on the contents of the file by the operating system. The file is maintained as a number of blocks scattered over a disk with an index maintained to locate them. Each file has a structure called an inode which contains all the required information about the file.

A directory file consists of a series of directory entries each containing a name and an inode number. Each directory file contains an entry for itself, an entry for its parent directory and entries for each object within the directory. The directory allows an inode to be found once the name of a file is known. It also allows more than one name to refer to the same inode.
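For illustration, a directory entry in a Version 7 style file system can be pictured as a 16 byte record: a 2 byte inode number followed by a fixed 14 character name. The declaration below is a sketch of that layout, not a copy of the header actually used.

/* Sketch of a Version 7 style directory entry.  An inode number of zero
 * marks a free slot; several entries may name the same inode.
 */
#define DIRSIZ 14

struct direct {
    unsigned short d_ino;          /* inode number; 0 means the entry is unused */
    char           d_name[DIRSIZ]; /* file name, padded with nulls */
};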

The file system also supports a file for use in interprocess communications, called a pipe. It is written to by one process and read from by another process.

Each input/output device is associated with a special file. An input or output request on a special file results in the activation of the associated device. The special files reside in a directory called "/dev". To the user special files look much the same as regular files. They have the advantage of being able to use the file protection scheme provided by the file subsystem.

It is possible to graft a directory tree on another device into the main directory tree by using the mount system call. A regular file in the directory is made to refer to the root of the new directory tree.

6. Memory Management.

Memory management provides a virtual address space in which a process runs. The addresses the program is compiled with need not correspond to the real addresses used because the program's addresses, which are virtual addresses, are translated into real addresses. At the same time protection is provided to keep a process from deliberately or accidentally referencing physical memory which does not belong to it.

The memory occupied by a process is divided into three areas called segments, the text, data and stack areas. It is possible to offer different levels of protection to different segments, such as to protect the text segment from being written into. More than one process is allowed to share the same text segment, which is useful for commonly used programs like the shell or an editor. Data and stack segments are able to expand, the data segment growing upwards, towards higher addresses, as it increases in size while the stack segment grows downwards towards lower addresses.

There is an area called "swap space" in secondary storage which is used to put process images when they are unable to fit into main memory. It is allocated in variable sized lumps from a free list of swap space.

Physical memory is allocated and deallocated in 512 byte blocks. A bit map is kept to indicate which blocks are allocated and which are free. In order to speed up allocation of free blocks a short free list is kept making it unnecessary to search the bit map every time an allocation is made.
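The following sketch illustrates the idea of a bit map backed by a short free list; the sizes and names are invented for the example and it is not the allocator used in the ported kernel.

/* Illustration of the scheme described above: a bit map records which
 * 512 byte blocks are in use, and a short list of known-free block numbers
 * is kept so that most allocations avoid scanning the map.
 */
#define NBLOCKS   2048            /* illustrative memory size: 2048 * 512 bytes */
#define FREECACHE 16

static unsigned char bitmap[NBLOCKS / 8];    /* one bit per block, 1 = allocated */
static int freecache[FREECACHE];
static int nfree;                            /* entries currently in freecache */

static void mark(int blk, int used)
{
    if (used)
        bitmap[blk / 8] |= 1 << (blk % 8);
    else
        bitmap[blk / 8] &= ~(1 << (blk % 8));
}

int alloc_block(void)
{
    int blk;

    if (nfree > 0) {                         /* fast path: use the cached list */
        blk = freecache[--nfree];
        mark(blk, 1);
        return blk;
    }
    for (blk = 0; blk < NBLOCKS; blk++)      /* slow path: scan the bit map */
        if (!(bitmap[blk / 8] & (1 << (blk % 8)))) {
            mark(blk, 1);
            return blk;
        }
    return -1;                               /* no free memory */
}

void free_block(int blk)
{
    mark(blk, 0);
    if (nfree < FREECACHE)
        freecache[nfree++] = blk;            /* remember it for quick reuse */
}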

7. Process Management.

Processes under Unix are temporary and can be created or destroyed on request. When an object file is executed its contents replace the code of the process which executed it, thus destroying the original process. If a user wants to keep the original process a copy is made of it and the copy executes the object file leaving the original process intact. The replication of a process is called forking. The original process is known as the parent and the new one as the child. They are able to establish their identity as parent or child by examining the value returned to them by the system call which performs the fork. A parent process is able to suspend its execution until one of its children terminates.
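The pattern can be seen from a small user level example. The sketch below uses the modern C library declarations of fork, exec and wait, so it is an illustration of the idea rather than code from the system described here.

/* Illustration of fork/exec/wait: fork() returns 0 in the child and the
 * child's process id in the parent.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: replace this copy with a new program; the parent survives. */
        execl("/bin/echo", "echo", "hello from the child", (char *)0);
        _exit(1);                    /* only reached if the exec failed */
    } else if (pid > 0) {
        int status;
        wait(&status);               /* parent suspends until the child terminates */
        printf("child %ld finished\n", (long)pid);
    } else {
        perror("fork");
    }
    return 0;
}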

8. System Calls.

System calls allow a user process to request some form of service from the kernel. The user process specifies a number which is used to locate the requested system call in a table. Most available system calls are concerned with file manipulation and process control. The following paragraphs briefly mention some of the capabilities provided.

A number of system calls allow creation, deletion, opening, closing, reading and writing of files. There is more than one system call for some of these functions because of the different types of files. A call is provided to allow seeking to a specified location within a file, a function which cannot be used on a pipe or character special file. The mode, owner or group of a file can be read or changed. A removable file system can be joined into a directory tree or removed.

The flow of processes is controlled by providing system calls to duplicate a process, execute a file, terminate or wait for another process to terminate. A process can be locked into memory to prevent it from being swapped out. The amount of memory allocated to a process can be changed when the data area needs to change size, either by growing or shrinking. Processes have a number of characteristics associated with them such as an identification number, a user number showing who the process belongs to, and a value used in calculating the scheduling priority. All these things can be looked at and some of them changed. Other information about the process's execution is available such as an execution profile and the execution time.

A user process can use signals by scheduling a signal to be sent to it after a specified length of time. It can request that its execution be suspended until a signal is received. It can also specify what it wants done when a signal is received.

A system call is provided for changing the software defined characteristics of a device.

The time maintained by the system can be either examined or set.

9. Utilities.

The utilities provide much of the power which Unix has as an operating system. Because they are not part of the kernel they are not discussed within this document.

Partitioning the Operating System According to Portability

The assembly language code forms one partition on the basis of portability. It is highly machine dependent and almost always needs to be completely rewritten.

Device drivers form the next partition. If a device type is common to both the source and destination machines it is possible the appropriate device driver can be used substantially unchanged. Otherwise, a new driver needs to be written.

The remaining kernel code, all written in C, forms the third partition. It is largely independent of the underlying machine architecture. With the exception of memory management it remains substantially unchanged through most porting efforts, changing only when the function of the operating system is modified.

1.4 Ways and Means

A VAX11/780 was available as the home base for the port. The source code of Unix version 7 was the only version available.

Although hardware and software tools were already available a considerable proportion of the time used on this project was spent in getting them into a fully usable state. The chip set, the microprocessor board, the compiler, loader and runtime library had a number of bugs. There was a 3-4 month wait for a new chip set, after deciding that the original chip set was not sufficiently functional to support Unix. After it arrived it was found to have at least one significant bug. Fortunately a method of working around the bug was obtained from a National Semiconductor bug sheet.

It was decided to seek simplicity in design leaving any possible optimizations until later. An example is the scheduling of disk accesses in the disk driver where the algorithm used is "first come, first served" but could be modified easily to utilize a more efficient algorithm. Lack of time was the major motivation for such an approach.

The first step was to become familiar with the Unix source code and examine the experience of others in porting Unix. After gaining some familiarity with the code it was possible to plan which areas of the code were to be modified and decide what the desired results were. Major changes were desired in the logic of memory management routines, to better utilize the capabilities of the memory management unit, and the structure of processes along the line of processes under version 8 Unix [Killian,85].

The system was built up gradually starting with the initialization routine and adding more routines as required by the initialization sequence. When the initialization was complete and the first user process started most of the kernel code had been added. At that stage a selection of system calls were added. After that only the character device driver and utilities remained to be added.

A number of standalone programs, not part of the Unix kernel, were required during the project. They provided the ability to test hardware, create an initial file system, and create special files for all the devices. The program written to exercise the winchester/floppy controller provided the basis for a standalone disk driver used in creating the initial file system. It proved invaluable for post mortem examinations of the contents of disks during debugging. The initial file (directory) system was created on floppy disk until the kernel was running correctly. It was faster to create the file system on the floppy disk than on the much larger hard disk. It also allowed several copies to be made at once and used in case they were corrupted during testing, which occurred more than once.

1.5 What Comes Next

Chapter two reviews the hardware and software tools used during the port and aims to give enough information to show how each of them works and how it is used. Chapter three goes through each subsystem of Unix and discusses how it was ported. Detail is given about unchanged areas of the operating system only where it was considered a necessary prerequisite for changes to be understood. The fourth chapter explains some proposed extensions which can be made to the operating system enabling files to be mapped into a process's address space. The status of the project at the time this thesis was written is given in chapter five along with some reflections on the project. The conclusion then follows.

2. Software and Hardware Tools

2.1 Software

The following software tools are available under UNIX at the University of New South Wales and were used during this project. Some of these tools will be familiar to Unix users, but a brief mention of their function will be made for the benefit of those who have not met them before.

2.1.1 ACK - The Amsterdam Compiler Kit

The Amsterdam Compiler Kit is described by Tanenbaum in "A Practical Toolkit for Making Portable Compilers" [Tanenbaum,83]. It is a collection of programs designed to simplify the task of producing compilers for algebraic languages such as C, Pascal and Basic. It generates an intermediate language code from the input, converts this to the target assembly language code which is then assembled and linked. Optimization may be done in several places during this process.

Each front end is a separate program, there is one for each high level language provided. The back end, which translates intermediate code into target assembly code, consists of a program which is common to all back ends and a table for each different machine. Because the table is compiled into the back-end a separate back-end exists for each target machine. ACK uses the suffix of the file name it is given to decide which front end to call. It has been set up in a directory with a number of links to the compiler, giving a separate directory entry for each back end. The compiler looks at the name it was executed with to decide which back end is to be used. It consists of the following components: a preprocessor, one front end for each high level language, a peephole optimizer, a global optimizer, a back end to translate the intermediate code into the target machine assembly code, another optimizer and the universal assembler/linker.

An ACK compiler/assembler was the only one available for the NS32016 at the start of this project. As received, ACK did not contain either a global or a target optimizer. The back end supplied for the NS32016 contained bugs but was rewritten by a PhD student, Jayasooriah, at the University of New South Wales. It was still being tested and debugged during the early stages of this project. Having to debug the compiler at the same time as debugging new code makes the process lengthier.

2.1.4 The DB16000 monitor1

The DB16000 board was supplied with two eproms containing the monitor MON16. The commands provided allow manipulation of memory, general registers, special registers and slave CPU registers. There are commands to control the execution of programs together with some limited debugging facilities such as software breakpoints and step commands. The monitor allows the use of L format commands for downloading as produced by "nld".

The monitor provides a number of supervisor call routines for use by user programs running under the monitor. They provide services such as reading and writing to the parallel and serial ports, printing error messages, changing to supervisor mode, trapping to a procedure and manipulating the interrupt table provided within the monitor.

The monitor has proved sufficient for the requirements of this project but has some drawbacks. It does not seem to use the CRC but treats it as another byte to be loaded, although the documentation indicates that the CRC generated with the L format command is used by the monitor to verify that the data was loaded correctly. MON16 is unable to handle the speed at which the host generates L format commands. Only every second line is loaded. The speed can be reduced to a level which MON16 can cope with by inserting about 30 nulls on the end of every line. This significantly extends the time taken to load a large program such as Unix. A program was written to interpret L format commands instead of the monitor and eliminate the need to put nulls on the end of each line. The other feature which is mildly irritating is the inability of the monitor to recognize lower case letters.

2.1.5 Make

Considering the number of files which make up the operating system, remembering to compile all the correct ones when changes are made can cause considerable problems. For this reason the "make" command was used to generate the system from its component files.

Make looks at the specified target file and examines all files it depends upon. If any of the dependent files have been modified since the target was last created, it is recreated according to instructions contained in a make instruction file. An instruction file name can be specified to make but there is a default name of "Makefile" or "makefile".

1. The information about MON16 was taken from the DB16000 Monitor Reference Manual [NS,83].

The NS32016 assembler has an integer size of 2 bytes which is different from the integer size the UNIX code was written for on the VAX. This necessitated several changes to the Unix code.

One of the nice features provided is the ability to use the preprocessor on assembly files. This feature has been widely used to improve the readability of the assembly code portions of the operating system code.

Some of the major problems encountered with the compiler are listed here.

1. The library routines used for the C switch command did not work, but were later rewritten.

2. The compiler does not support external procedure calls with their associated use of module tables and process descriptors.

3. Some complex expressions produce incorrect code. Exactly what feature of the expressions causes this is unknown. This bug has not been found, but correct code can be generated by splitting the complex expression into two or more simpler expressions.

2.1.2 NLD - The L Format Downloader

Nld produces National Semiconductor L format suitable for downloading ACK output files to the DB16000 development board. The load instruction consists of an 'L' to indicate the type of command, an address at which the specified bytes are to be loaded, up to 16 bytes to be loaded into consecutive memory locations and a cyclic redundancy check digit which is the sum of all the data bytes modulo 256. Given the name of an a.out file the nld command produces L format commands on standard output. Nld did not work at the start of this project because of program bugs.
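As an illustration, a routine producing one such load record might look like the sketch below. The description above fixes the content of a record (an 'L', an address, up to 16 data bytes and a check digit equal to the sum of the data bytes modulo 256); the hexadecimal field widths used here are assumptions, so the real nld output should be consulted for the exact layout.

/* Sketch of emitting one L format load record. */
#include <stdio.h>

void emit_l_record(unsigned long addr, const unsigned char *data, int n)
{
    int i, sum = 0;

    if (n > 16)
        n = 16;                        /* a record carries at most 16 bytes */

    printf("L%06lX", addr & 0xffffff); /* 24 bit load address (assumed width) */
    for (i = 0; i < n; i++) {
        printf("%02X", data[i]);
        sum += data[i];
    }
    printf("%02X\n", sum & 0xff);      /* check digit: sum of data bytes mod 256 */
}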

2.1.3 ARCH - the archive and library maintainer

Libraries are maintained in the target assembly code using the "arch" command. An archive maintainer "ar" was already available under Unix, but ACK requires its own, which has a slightly different format.

Four make instruction files were used to maintain the Unix code. The device drivers, assembly code routines and other routines were each kept in a separate library. One make instruction file was used to maintain each of the library files. Another one remade the Unix object code whenever any of the libraries or the header file were remade.

2.1.6 Cxref - the C crossreference generator

Cxref builds a cross reference table from the input files given. It produces a listing of all symbols in those files together with the file name and line number where each is referenced. An asterisk next to the line number indicates the symbol is defined at that location.

In a reasonably large program such as an operating system a cross reference is essential for finding where symbols are defined, their scope and for navigation through the source.

2.2 Hardware Used

A National Semiconductor development board, the DB16000, provided the central component of the hardware to which the operating system was ported. It was augmented with a Western Digital winchester/floppy controller, the WD1002, together with two floppy disks and a 30MByte hard disk. The DB16000 was connected to a VAX11/780 running Unix through a terminal which allowed program downloading from the VAX.

2.2.1 The DB16000 Development Board2

The DB16000 is a single board containing the NS32016 central processing unit, the NS16081 floating point unit, the NS16082 memory management unit and the NS16202 interrupt control unit. The floating point unit, memory management unit, and interrupt control unit are optional. It is designed primarily for the evaluation of the chip set and as a development tool. The board used was provided with 128K of memory and all the slave processors. There are plans to provide off board memory in addition to the 128K at a later stage. The memory map of the DB16000 is shown in Appendix A.

The original board developed an untraceable problem and became unusable. A new board was obtained in January 1986. The new board worked satisfactorily after replacing a faulty timing control unit; however, the chip set contained a number of bugs. It was decided to replace the chip set. The new chip set was received in August 1986. A bug was later discovered in the central processing unit. The means of avoiding the effects of the bug in software was taken from the National Semiconductor "bug sheets" for the central processing unit.

2. The information on the DB16000 board was taken from the DB16000 Monitor Reference Manual [NS,83].

2.2.2 The NS32016 Central Processing Unit3

After looking in general at the features of the DB16000, a closer look at the central processing unit (CPU) is needed. The structure of the CPU will change the way at least some of the system is written. There is a good description of the NS16000 architecture in Hunter and Farquahr's "Introduction to the NS16000 Architecture" [Hunter,84].

The Instruction Set

The instruction set of the NS32016 is symmetrical, meaning that the instructions can be used with any addressing mode or combinations of addressing modes, any general purpose register and any operand length ( byte, word or double word ). Most of the instructions are two operand; however, one, three and four operand instructions also exist. The operand length is specified as part of the opcode. In addition to the instructions which refer to the NS32016 there are others which refer to special slave processors which act as extensions of the CPU.

The Register Set

The NS32016 has 8 general purpose registers and 8 special purpose registers. The general purpose registers RO through R7 are 32 bits wide. All general purpose registers are available to all instructions and may be used as accumulators, data registers or address pointers.

The special purpose registers are used for storing addresses and status information. The MOD and processor status registers are 16 bits while the others are effectively 24 bits. The special purpose registers are as follows:

a. The Program Counter ( PC ).

The program counter points to the first byte of the currently executing instruction. As it is 24 bits wide it can address 16 Mbytes of memory without needing segmented addresses.

3. The information on the NS32016 was taken from the NS32016 data sheets [NS,85].

b. The Stack Pointers ( SP0 and SP1 )

The SP0 register points to the lowest address of the last item on the interrupt stack. It is normally used only by the operating system. The SP1 register points to the lowest address of the last item on the stack. When the top of stack addressing mode is used the stack pointers are automatically incremented or decremented.

c. The Frame Pointer ( FP )

The FP register is used by a procedure to access parameters and local variables on the user stack. When a procedure is entered the FP is set to point to the stack frame of that procedure. The parameters are addressed with positive offsets from the FP, while local variables are addressed with negative offsets.

d. The Static Base ( SB )

The SB register points to the global variables belonging to the currently executing module. All references to global variables are relative to this register. It is possible for different modules to share the same static data by specifying the same value for the SB register.

e. The Interrupt Base ( INTBASE )

The INTBASE register points to the start of the interrupt table.

f. The Module Register ( MOD )

The MOD register holds the address of the module descriptor of the currently executing module. Modules are described under the "Modules" subheading in this section.

g. The Processor Status Register ( PSR )

The PSR is 16 bits long. The low order 8 bits are accessible to all programs, but the high order bits are only accessible to programs executing in supervisor mode.

h. The Configuration Register ( CFG )

The configuration register consists of four bits which declare the presence of the interrupt control unit, the memory management unit, the floating point unit and a custom slave processor. Without the appropriate bit set in the CFG register, instructions specific to those slave processors produce an "undefined instruction" trap. If an interrupt control unit is declared to be present, interrupts are considered to be vectored; otherwise they are considered non-vectored.

Addressing Modes

An instruction for the NS32016 contains an operation to be performed, the type of operand to be used and the location of the operands. The operand can be in a register, in memory or in the instruction itself. There are 9 addressing modes available in the NS32016. The addressing modes are explained in Appendix A.

Modules

The NS32016 architecture supports the concept of modules. A module on the NS32016 consists of 3 components :

a. An area containing object code and constant data.

b. A static data area containing the global data for that module.

c. A link table with one entry for each external procedure or variable referenced from within that module.

The object code is position independent. It uses addressing modes which form effective memory addresses relative to one of several special registers. This means a module can be loaded and run anywhere in memory. The only things which have to be changed are the module table entries.

The module table contains a module descriptor for each module. This consists of 4 double word (32 bit) entries. The entries are:

1. The address of the static base area.

2. The address of the module's link table.

3. The address of the code and constant data area.

4. An unused entry reserved for future use.

A program may consist of a number of modules. When an external procedure is called or an external variable referenced, the link table is used to locate it. The link table entry for an external procedure, called an external procedure descriptor, contains the address of the module table which the called procedure is in and the offset of that procedure within the module. Link table entries for external variables contain the absolute address of the variable.
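A sketch of these structures in C follows; the field names are illustrative rather than taken from National Semiconductor's documentation.

/* Sketch of the module table structures described above.  Each module
 * descriptor is four 32 bit words.
 */
typedef unsigned long word32;

struct mod_descriptor {
    word32 static_base;   /* address of the module's static data area */
    word32 link_base;     /* address of the module's link table */
    word32 program_base;  /* address of the code and constant data area */
    word32 reserved;      /* unused, reserved for future use */
};

/* A link table entry for an external procedure identifies the target module's
 * descriptor and the procedure's offset within that module.
 */
struct ext_proc_descriptor {
    unsigned short module;   /* locates the target module's descriptor */
    unsigned short offset;   /* offset of the procedure within the module */
};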

When a normal procedure call is performed only the return address is saved on the stack, however when an external procedure call is performed the value of the MOD register is also saved.

Unix consists of a single module due to lack of support for modules in the compiler. Because of this, no link table is required, however every program requires a module table even if it consists of only one module.

Traps and Interrupts

The interrupt table is maintained in memory, at the address indicated by the interrupt base register. It contains external procedure descriptors for all the routines which handle traps and interrupts. The first entry in the table is for non-vectored interrupts, the second is for handling non-maskable interrupts and entries 3 to 16 are reserved for traps. The entries following these can be used for vectored interrupts.

The interrupt system may be used with or without an ICU. If no ICU is used the CPU treats all interrupts except the non-maskable interrupts as non-vectored interrupts. In the case of a non-maskable interrupt the default vector is zero. If an ICU is used it will provide the CPU with an index into the interrupt table, to provide the interrupt vector, which is a procedure descriptor. Each different interrupt source into the ICU will cause a different index to be passed. Multiple ICUs can be cascaded together to handle up to 256 different interrupt sources.

2.2.3 The NS16202 Interrupt Control Unit4

The Interrupt Control Unit (ICU) is designed to minimize software and hardware overheads in handling prioritized interrupts. Up to 16 additional ICUs can be cascaded to handle a maximum of 256 interrupts. There are 32 registers which control the operation of the ICU. The registers are 8 bits long. Many of them are grouped into functional pairs because they hold similar data.

On the DB16000 the ICU registers are memory mapped starting at location FFFE00. Each register is aligned on a word boundary. Thus, registers which are paired have an unused byte between them. The configuration used on the DB16000 has a single ICU arranged to allow 16 different hardware interrupt sources, eight of which can be used for software interrupts. Each interrupt source can be assigned a unique priority level. Interrupts from the terminal (USART) and connectors are assigned to interrupt positions using jumpers. Some initial problems were experienced until it was realized that the circuit diagram for the board incorrectly showed what was connected to a number of the jumper pins.

Each interrupt source can be individually masked by setting the corresponding bit in the ICU's Interrupt Mask register (IMSK). Each interrupt source can be programmed to be edge or level triggered by using the edge/level triggering register (ELTG). The triggering polarity register (TPL) can be used to specify whether an interrupt will be produced on a rising or falling edge (if edge triggered) or a high or low level (if level triggered).
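A sketch of how such memory mapped registers might be addressed from C is given below. The spacing of one byte register per word follows the description above, but the register numbers used for IMSK, ELTG and TPL are placeholders; the data sheet gives the real assignments.

/* Sketch of accessing the memory mapped ICU registers on the DB16000.
 * Registers start at FFFE00 and each 8 bit register sits on a word boundary,
 * so register n lives at FFFE00 + 2*n.
 */
#define ICU_BASE   0xFFFE00UL
#define ICU_REG(n) (*(volatile unsigned char *)(ICU_BASE + 2 * (n)))

#define HVCT_REG   0      /* hardware vector register, read on acknowledge */
#define IMSK_REG   4      /* illustrative register numbers for the pairs   */
#define ELTG_REG   6
#define TPL_REG    8

void mask_interrupt(int src)     /* src = 0..15, one bit per interrupt source */
{
    int reg = IMSK_REG + (src < 8 ? 0 : 1);   /* low byte, then high byte of the pair */
    ICU_REG(reg) |= 1 << (src % 8);
}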

There are four ways the ICU can be programmed to handle priorities. Unix requires fixed priority interrupts. This and the other methods of handling interrupts with the ICU are described in Appendix A.

The sequence of events when an interrupt is issued to the ICU is :-

1. The ICU makes an interrupt request to the CPU.

2. The CPU does an interrupt acknowledge and reads a vector byte from address FFFE00, which is the ICU's Hardware Vector Register (HVCT). The vector is 8 bits long. The low order bits contain the number of the interrupt source line activated.

4. The information on the NS16202 was taken from the data sheets for the NS16202 [NS,85].

3. The vector is used to index into the interrupt table to find the external procedure descriptor for the proper interrupt service routine. The location of the vector is INTBASE + vector * 4. INTBASE is a register in the NS32016 which always points to the start of the interrupt table.

4. The service routine eventually exits by a RETI ( return from interrupt ) instruction. The RETI instruction restores the MOD and PSR registers, which are not saved during a normal procedure call.

2.2.4 The NS16082 Memory Management Unit5

The NS16082 memory management unit (MMU) is a slave processor designed to work with the NS32016 CPU. It provides various debugging facilities as well as support for paged virtual memory.

The configuration register in the NS32016 CPU must be set before the MMU can be accessed. There are 6 special instructions provided by the CPU for use with the memory management unit. They allow validation of an address for either reading or writing, moving data between user and supervisor spaces, and manipulation of MMU registers, which are not memory mapped.

MMU registers

There are 10 registers in the MMU, six of which are specifically concerned with debugging. Only the registers concerned with memory management are mentioned here. The debugging facilities were unusable in the original MMU because of bugs in the chip.

The status register (MSR)

The MSR is a 32 bit register which contains information relating to both the memory management and debugging functions provided by the MMU.

5. The information on the NS16082 was taken from the NS16082 data sheets [NS,85].

The Page Table Base Registers

The page table base registers PTB0 and PTB1 are 32 bits long. Bits 0 to 23 give the page table base address. Bits 0 to 9 must be zeroes; in other words the level 1 page table must be aligned on a 1K boundary. If dual memory spaces are being used PTB0 will be used for supervisor mode and PTB1 will be used for user mode, otherwise PTB0 will be used for both user and supervisor modes.

The Error/Invalidate Address Register

The error/invalidate address register (EIA) is 32 bits long. It serves two purposes.

1. It allows examination of the virtual address which caused the current MMU exception.

2. Writing to the EIA allows removal of invalid page table entries from the MMU's translation buffer. If a user modifies a page table entry and that entry is resident in the translation buffer it will have to be replaced by the new entry.

The Translation Buffer

The translation buffer is a 32 entry associative cache. It contains the most recently accessed logical addresses with the corresponding translated physical address. Each entry consists of the high order 15 bits of the virtual address and the high order 15 bits of the physical address. If an address is present in the translation buffer the address translation time is reduced from about 20 cycles to one clock cycle. A hit rate of 98% is claimed [NS,85].

2.2.5 Address Translation

Address translation is done by the MMU using the page tables resident in memory and pointed to by the PTB0 and PTB1 registers. There are two levels of page tables. The level one table is pointed to by the PTB register and indexed by bits 16 to 23 of the virtual address. The level one table contains the base addresses of the level two tables. The level two table is indexed by bits 9 to 15 of the virtual address and gives the physical address of the required page in memory. This address is combined with bits 0 to 8 of the virtual address to give the full physical address required. Each process has its own first level page table and set of second level tables. (See figure 2.1)
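The translation can be modelled in software as in the sketch below. The bit positions follow the description above; how the frame address and the flag bits share a page table entry is an assumption made for the example.

/* Software model of the two level translation: bits 16-23 of the virtual
 * address index the level one table, bits 9-15 index the level two table,
 * and bits 0-8 are the offset within a 512 byte page.
 */
typedef unsigned long vaddr_t;
typedef unsigned long paddr_t;
typedef unsigned long pte_t;

#define PTE_FRAME(e)  ((e) & ~0x1ffUL)      /* assumed: low 9 bits hold flag bits */

paddr_t translate(pte_t *level1, vaddr_t va)
{
    unsigned idx1 = (va >> 16) & 0xff;      /* 256 level one entries */
    unsigned idx2 = (va >> 9)  & 0x7f;      /* 128 level two entries */
    unsigned off  = va & 0x1ff;             /* byte within the 512 byte page */

    pte_t *level2 = (pte_t *)PTE_FRAME(level1[idx1]);
    return PTE_FRAME(level2[idx2]) | off;
}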

The page table entries also contain protection information. There are two privilege modes: user and supervisor. Each page may be specified as no access, read only or read/write in either privilege mode. If any sort of access is allowed in user mode, read/write access is automatically allowed in supervisor mode. If the protection level is violated during address translation, or an invalid page table entry is encountered, the MMU will cause a page fault.

2.2.6 The WD1002 Winchester/Floppy Controller6

The WD1002 can support up to three hard disks and four floppy disks.

The Task File

The task file is a set of registers used by the winchester/floppy controller to perform all its disk functions.

The data register provides a window into the sector buffer for use by the host. On a write command it is used to fill the sector buffer by successive writes to it by the host. On a read command the host empties the sector buffer by successive reads from this register.

The error register is read only, and is valid only when the error bit in the status register is set. The faults indicated by this register are: data address mark not found; track 000 not found by the drive during a restore; command aborted; ID field with the specified cylinder, head and sector number not found; attempted error correction failed; and bad block detected.

The write precompensation register is used only for the winchester drives. It holds the cylinder number at which write precompensation begins.

The sector count register is used during multiple sector read and write operations and for the format command. For each read or write during a multiple sector read or write the sector count is decremented and the sector number is incremented. For the format command the number of sectors to be formatted is loaded.

The desired sector number is loaded into the sector number register prior to a read or write operation.

6. The information on the WD1002 was taken from the data sheets for the WD1002 [WD,83].

The cylinder number occupies two registers which specify the cylinder where the head is to be positioned on a read, write or seek command. The high register is not used for floppy disk accesses.

The sector/drive/head (SDH) register contains the sector size, drive select and head select information.

The status register is a read only register containing status information relating to a command after it terminates. The command register is a write only register at the same address as the status register. The command to be performed by the winchester/floppy controller is loaded into this register after the other task registers have been set.

There are a number of registers for the use of the winchester/floppy controller which are not accessible to the host.

Commands

There are six commands; one to run internal diagnostics and the other five to perform disk operations. They are described briefly in Appendix A. Before issuing a command the task file registers are set up. The command is then written to the command register. If more information is required by the command, such as with the write or format commands, it is written to the sector buffer via the data register. After the command is completed a check for errors can be made. If a read command was performed the data can be retrieved from the sector buffer via the data register.
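As an illustration, a single sector write using the sequence just described might look like the sketch below. The base address, register offsets, command code and status bits are assumptions made for the example (the real values depend on how the WD1002 is mapped into the DB16000 address space), and the sketch polls for completion where the actual driver waits for the controller's interrupt.

#define WD_BASE     0xFFF000L          /* hypothetical base address of the task file        */
#define WD_DATA     (WD_BASE + 0)      /* data register: window into the sector buffer      */
#define WD_SECCNT   (WD_BASE + 2)      /* sector count register                             */
#define WD_SECNUM   (WD_BASE + 3)      /* sector number register                            */
#define WD_CYLLO    (WD_BASE + 4)      /* cylinder number, low byte                         */
#define WD_CYLHI    (WD_BASE + 5)      /* cylinder number, high byte                        */
#define WD_SDH      (WD_BASE + 6)      /* sector size/drive/head register                   */
#define WD_STATUS   (WD_BASE + 7)      /* status register (read) / command register (write) */

#define WDS_BUSY    0x80               /* busy bit in the status register (assumed)         */
#define WDS_ERROR   0x01               /* error bit in the status register (assumed)        */
#define WDC_WRITE   0x30               /* write sector command code (assumed)               */

#define OUTB(a, v)  (*(volatile unsigned char *)(a) = (v))
#define INB(a)      (*(volatile unsigned char *)(a))

int
wd_write_sector(sdh, cyl, sec, buf)
int sdh, cyl, sec;
unsigned char *buf;
{
    int i;

    /* 1. load the task file registers */
    OUTB(WD_SECCNT, 1);
    OUTB(WD_SECNUM, sec);
    OUTB(WD_CYLLO, cyl & 0xff);
    OUTB(WD_CYLHI, (cyl >> 8) & 0xff);
    OUTB(WD_SDH, sdh);

    /* 2. issue the command */
    OUTB(WD_STATUS, WDC_WRITE);

    /* 3. fill the sector buffer through the data register */
    for (i = 0; i < 512; i++)
        OUTB(WD_DATA, buf[i]);

    /* 4. wait for completion and check for errors */
    while (INB(WD_STATUS) & WDS_BUSY)
        ;
    return (INB(WD_STATUS) & WDS_ERROR) ? -1 : 0;
}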

During a write command an error correction code is produced and written onto the disk with the data. During a read command the error correction code is recalculated from the data and compared with the previous correction code. If an error is indicated the instruction is retried.

Features Affecting Software Design

Automatic command retry on errors simplifies the writing of a device driver. It removes the necessity for retrying to be software controlled.

Although there is provision by the controller for multiple sector reads and writes it was not considered worthwhile using that ability at this stage. The structure of the file system is such that two consecutive blocks within a file may be anywhere on a disk. Since almost all disk accesses are done via the file system the ability to read multiple sectors from the same track would not be very applicable. This makes multiple sector reads and writes inappropriate for read-ahead. It may be desirable to provide multisector read/write commands later on to allow a fast disk copy routine to be written for backup and restore operations.

It was desired to have the option of using the floppy disks to hold removable file systems. Given that there is only one hard disk in the system it was thought that extra file storage space would be needed eventually. However there are some problems with implementing it on this system.

There is no way of knowing if a floppy disk has been removed from a drive and replaced by another one, except by examining the contents of the floppy disk. This raises problems in respect of maintaining the integrity of the file system.

Two situations might arise. A floppy disk may be removed, preventing further input/output on the device; the removed disk may be left in an inconsistent state. The second case is where a floppy disk has been removed and replaced with another; the original may be left in an inconsistent state and the new one corrupted by write operations intended for the original disk.

Two solutions are possible. One which seems to be used in most supermicro systems with UNIX or UNIX lookalike operating systems is to exclude floppy disks from the file system. The other, more dangerous, solution is to allow floppy disks to form part of the file system allowing the users to suffer the consequences if they remove a floppy disk without dismounting it first. This is completely unacceptable in a commercial system which should be robust, but may be acceptable in a research system under some circumstances. A minimal precaution in this case is to affix a sign to the floppy drives warning that floppy disks must be dismounted before removal, which will ensure there are no outstanding input/output operations on that device.

2.2.7 The Disk Drives

A paged memory operating system requires a reasonable amount of secondary storage with a high speed of access. While it is possible to page memory from a floppy disk, the rate of program execution suffers considerably because of its slowness. This, combined with the greater capacity of a hard disk, makes a hard disk a necessity.

2.2.8 The Source Machine - VAX11/7807

Only the features which affect the structure of the operating system are of interest as they are the features most likely to require changes as the operating system is ported.

Fortunately the ordering of bytes within words was the same on the VAX and the DB16000. Differences in byte ordering can be a source of problems because of implicit assumptions about it made in programs [Jalics,1983].

The major differences were in the peripheral devices available and the memory management units. Since the VAX and the DB16000 had no peripherals in common, all the device drivers had to be rewritten.

The VAX Memory Management Unit

The VAX MMU imposes some structure on the virtual space represented by its 32 bit addresses. The upper half, addresses with bit 31 set, belongs to the supervisor space. All user spaces share the same supervisor space. The same page table is used to map the supervisor space regardless of which user space is being used. The lower half of the virtual space, belonging to the user, is divided into two regions. The first area starts from address zero and is allowed to expand upwards. The second area starts just below the supervisor space and expands downwards. Each region is mapped using a different page table. A page table is an array of page table entries. Each page table entry contains a 21 bit page frame number, some reserved bits, a modified bit, a four bit protection field, and a bit to indicate whether the entry is valid.
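The fields of a page table entry described above can be pictured as a C bit-field structure. The grouping of the reserved bits and the fact that bit-field ordering is compiler dependent make this an illustration of the layout rather than a usable declaration.

struct vax_pte {
    unsigned pfn      : 21;   /* page frame number                    */
    unsigned reserved : 5;    /* reserved bits                        */
    unsigned modified : 1;    /* set when the page has been written   */
    unsigned prot     : 4;    /* protection field                     */
    unsigned valid    : 1;    /* entry is valid                       */
};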

Four privilege modes are provided. They are kernel, executive, supervisor and user. The kernel mode is the most privileged and user mode the least. When either read or read/write privileges are specified for a location in any of these modes, all modes of higher privilege also have read/write permission for that location.

The method used to translate a virtual address depends on whether it is a user or supervisor address. The following applies to supervisor addresses. The base of the supervisor page table is pointed to by the system base register. Another register, the system length register, holds the number of entries in the system page table. Before a virtual address is translated it is checked using the system length register to verify that a page table entry exists for that address within the page table. Two types of faults may occur during the translation of an address. If a page table entry is invalid a "not valid" fault is generated. If the access mode is illegal or an attempt was made to access beyond the end of the page table an "access control violation" fault is generated.

7. The information on VAX memory management was taken from the data sheets for the VAX hardware [DEC,81].

When translating a virtual address bits 9-29 are used to locate the page table entry within the supervisor page table. A page frame number is obtained from bits 0-20 of the page table entry. The page frame number is combined with bits 0-8 of the virtual address to form the real address. (See figure 2.2)

Translating a user virtual address is similar to translating a supervisor address except that user page tables are maintained within system virtual space necessitating two translations. The user virtual address is used to find the address of the user page table entry within supervisor virtual space. Because it is a supervisor space virtual address it needs to be translated before the page table entry is retrieved. The MMU maintains a cache which reduces the number of translations necessary.

3. A Unix Port

This chapter contains the details of what was done during this port of Unix. Each subsystem is examined proceeding from the low level subsystems to higher level ones. The breakup of Unix into subsystems is possible because of its fairly modular design. Nevertheless, the subsystems are interrelated and it is sometimes necessary to refer to the writeup of different subsystems in order to understand the workings of a particular subsystem.

The sections on initialization, traps and interrupts, input and output, and memory management describe changes which were mainly the result of the change in hardware used from the VAX to the NS32016 and the use of different peripheral devices. The machine dependence of this code resulted in much but not all of it having to be rewritten. With the exception of memory management, the intention was to perform the same function as the original code while using different hardware. In the areas of memory and process management decisions were made to change the way the operating system worked, not only based on the change in hardware but the potential for future experimentation. The section on process management describes proposed changes to be made when the system becomes fully usable.

Only minor changes were made in the areas described in the sections on system calls, file management, scheduling, synchronization and timing. The system calls required minor changes as a result of the runtime library provided with the compiler used in this project. Other changes were of a minor nature resulting from things like assumptions about the size of integers.

A summary of changes made to Unix during the port is given in appendix B.

3.1 Traps and Interrupts

This section is not concerned with the higher level routines which process specific interrupts but rather the general interrupt handling capabilities which need to be provided for the operating system. Most of the code needed to be rewritten for several reasons. Firstly, since the source machine is different from the target machine, all code written in assembler has to be rewritten. Secondly, the interrupt handling hardware is different.

Operating System Requirements

UNIX requires interrupts with a fixed priority together with the ability to inhibit all interrupts of a lower priority while an interrupt is being serviced. It must also be able to inhibit interrupts of a specified type during critical activity which might result in corruption of data structures if allowed. Invalid traps and interrupts need to be recognized and the appropriate processing routines called for both valid and invalid ones.

A "processor priority" needs to be maintained which is associated with the currently active process and prevents inteITIIpts of a priority equal or less than the current processor priority from inteITIIpting the process. The processor priority must be able to be raised or lowered relative to the original priority. The processor priority should not be confused with the process priority which is used for scheduling purposes and is described in section 3.3 "Scheduling, Synchronization and Timing".

Processing Traps and Interrupts

The ICU looks after the prioritization of interrupts. The fixed priority mode is selected for the ICU during its initialization so that it requests servicing for the highest priority interrupt awaiting service, inhibiting all others of lower priority until it has finished being serviced.

Traps occupy the first 16 entries in the interrupt table and have a higher priority than interrupts but do not disable them. The interrupt table entries consist of external procedure descriptors containing a module table address and an offset. After saving the PSR register and adjusting its contents, the CPU obtains a vector from the ICU and uses it to find the appropriate interrupt table entry. The CPU performs an external procedure call using the procedure descriptor from the interrupt table.

Short assembly routines save the registers r0 and r1, provide any parameters required, and call the appropriate high level routines. On return the stack is cleared and the registers r0 and r1 are restored. The registers r0 and r1 are used to return values to a calling routine which means they are not saved during procedure entry by the compiler. The other registers, r2 to r7, are saved on entering a routine and restored during a return from a routine. The values of r0 and r1 may still be required by the interrupted process, so they have to be saved before calling the C routine and restored afterwards. Most traps push a value identifying the trap onto the stack and call a common trap handling routine.

XXXtrap() {
    Save registers r0 and r1 so the user's values are not lost;
    Move the trap number to the top of stack;
    Go to a common trap handling routine - alltraps();
}

where XXX is one of the following:
    NVI - Non Vectored Interrupt
    NMI - Non Maskable Interrupt
    FPU - Floating Point Unit Exception
    ILL - Illegal Instruction (privileged instruction in user mode)
    DVZ - Divide by Zero
    FLG - PSR flag bit set when flag instruction executed
    BPT - Breakpoint Instruction executed
    TRC - Trace Trap
    UND - Undefined Instruction
    RES - Reserved Position in the interrupt table

alltraps() {
    Call the C trap handling routine;
    Remove the trap number from the stack, it is no longer needed;
    Restore the registers r0 and r1;
    Return from trap;
}

All traps are handled by the above assembler routines except the system call and page fault traps. The system call routine, SVCtrap, differs in the parameters which have to be set up for the C routine which processes system calls. It requires all the general registers and the user stack pointer as well as the PSR and return address which are available to the other traps. The user stack pointer is changed during the processing of a system call as the parameters to the call are passed through the user stack.

SVCtrap() {
    Save registers r0 to r7;
    Call getusp() to get the current value of the user stack pointer and put it onto the stack, it is a parameter to syscall;
    Call the C routine syscall();
    Call setusp() to set the user stack pointer to its new value;
    Remove parameters from the system stack;
    Restore the registers r0 to r7;
    Return from trap;
}

The page fault routine, ABTtrap, calls a different routine from the general trap handling routine. It would have been possible to make it call the C trap handling routine in a similar way to the other traps, with the call to the pagefault routine made within the C trap routine. The only benefit of not calling the C trap routine is avoiding one procedure call. It is not particularly important which way it is done.

ABTtrap() {
    Save registers r0 and r1;
    Call the C routine pagefault();
    Restore registers r0 and r1;
    Return from trap;
}

The interrupt handling routines are similar to those which handle traps except that some of them pass a parameter identifying the device which caused the trap to the device driver which is subsequently called. In the system as it is now this is not necessary because only one device of each type exists, but it will enable more devices to be added later. In addition, the routine allints() requests that the CPU be rescheduled if it is appropriate.

XXXXint() {
    Save registers r0 and r1;
    Move a parameter to the top of stack if required;
    Call the appropriate routine to handle this interrupt;
    Take the parameter off the stack;
    Go to the common routine allints() to finish processing;
}

where XXXX is one of the following:
    CLOK - Clock interrupt
    TIME - Timeout processing
    TTYR - Terminal receive interrupt
    TTYX - Terminal transmit interrupt
    WINF - Winchester/Floppy controller
    NULL - Unused interrupt position

allints() {
    If there is a higher priority process than the current one (as indicated by a flag)
        If the last mode was user mode
            Call a routine to reschedule the CPU;
    Restore registers r0 and r1;
    Return from interrupt;
}

After an interrupt is processed a decision is made as to whether a process of higher priority may exist to whom the CPU should be given. That possibility is indicated by a flag. Each process has its own system stack and when an interrupt occurs the interrupt processing routine "borrows" the system stack of the currently running process even though there is only a coincidental relationship between them.

It is possible that as a result of the trap another higher priority process exists, in terms of scheduling priority, and the CPU should be rescheduled. A switch is only allowed after an interrupt is serviced if the system was running in user mode prior to the interrupt. The reason for this restriction might be shown by considering what would happen if the CPU were to be rescheduled when the previous mode was system mode. Consider the following sequence of events.

1. Process A is running in user mode and is interrupted by a low priority interrupt I-1.

2. Servicing of I-1 begins using the system stack belonging to A.

3. Servicing of I-1 is interrupted by the occurrence of a higher priority interrupt I-2.

4. The servicing of I-2 is completed. On returning from processing I-2 the scheduling flag is found to be set. The highest priority process is found to be process B. Process A is suspended and process B is resumed.

What happens at this point may vary slightly between systems, however it will almost certainly be undesirable in any system. When process A is suspended the processing of interrupt I-1, which is associated with process A, is suspended for an undetermined period of time. It will resume when process A is resumed. During the intervening time all interrupts of an equal or lower priority to I-1 will be inhibited. In some cases, such as when the interrupt indicates the presence of new input, this may result in loss of information. Where the interrupt indicates that a device has completed a previous command it may result in under-utilization of the device.

Processor Priority

When a process is running in user mode the processor priority is low, allowing any interrupt to occur. During the processing of an interrupt the processor priority takes on the priority of the interrupt. At any time while running in system mode the processor priority may be raised above the original value and later restored. In the one case where the processor priority has to be lowered below the original value a second interrupt of a lower priority is generated and the higher priority interrupt is terminated.

An interrupt is prevented from receiving service if that interrupt is masked or if there is an interrupt of higher priority being serviced. This suggests two ways of changing the processor priority. To raise the processor priority, interrupts below the required priority can be masked or the ICU can be made to think that an interrupt of the appropriate priority is being serviced. This can be done by setting a bit of the required priority in the ISRV (Interrupt in Service) register of the ICU. The priority is lowered in the first case by restoring the original interrupt mask and in the second case by resetting the bit in the ISRV register.

The processor priority cannot be lowered using masking: any bits set in the ISRV register above the required priority would still inhibit interrupts below that level. The processor priority could, however, be lowered by manipulating the ISRV register.

After consideration it was decided to handle processor priority in a manner similar to that used by the Unix code being ported. When the processor priority is raised above the original, masking is used. When lowering the processor priority below the original a software interrupt is generated at a lower priority. The ICU allows interrupts to be generated under software control. If this had not been the case another method would have to be devised. The processor priority drops as the higher priority interrupt finishes and returns, enabling the software interrupt to be serviced at the lower priority.

The processor priority is maintained using the interrupt mask register of the ICU in which each interrupt can be individually masked. A routine is provided to mask, or unmask, interrupts of equal and lesser priority than a particular level. Because the priority is associated with a particular process it must be saved and restored during process switching. The ICU holds masked interrupts until such time as they are unmasked when the priority falls, or a lower priority process is switched in. For the purpose of priority levels, interrupts are gathered into groups of the same priority level. These groupings in descending priority are clock, character devices, and block devices.

The assembly language routine, plset, sets the processor priority. It is passed a parameter which is the interrupt mask required for that priority level. The routine disables all interrupts, saves the current value of the interrupt mask registers so the value can be returned, assigns the new value and then enables interrupts again.

plset(priority) {
    Disable all interrupts;
    Save the current priority in r0 so it will be returned;
    Set the new priority;
    Enable interrupts;
    Return;
}
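A driver would typically use plset() in the style of the Unix spl routines: save the returned mask, raise the priority around a critical section, and restore the saved value afterwards. The mask value PL_BLOCK below is a made-up constant for illustration; the real masks correspond to the priority groupings listed above.

extern int plset();              /* returns the previous interrupt mask                  */

#define PL_BLOCK  0x00f0         /* hypothetical mask: block device level and below      */

wdexample()
{
    int s;

    s = plset(PL_BLOCK);         /* inhibit block device interrupts                      */
    /* ... manipulate data shared with the block device interrupt routines ... */
    plset(s);                    /* restore the previous processor priority              */
}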

Software Interrupts

This feature is used by the clock to continue some of its processing at a lower priority, because that processing might take longer than one clock tick to complete. If it did, some ticks would be lost, making the system lose time. Completing the processing at a lower priority allows subsequent clock ticks to be processed.

All interrupts awaiting servicing are represented by a bit in the interrupt pending register (IPND), which is cleared when the interrupt is serviced. Software interrupts are provided through a routine which takes an interrupt number as a parameter and writes it to the IPND register of the ICU, thus setting the required bit. It does not disable interrupts as there is always a higher priority interrupt running at the time it is called. The only decision it needs to make is whether the high or low register is written into.

sint(level) {
    If the level is less than 8
        Set the top bit in level;
        Write the level to the low IPND register;
    Otherwise
        Set the top bit in level;
        Write the level to the high IPND register;
    Return;
}

If the top bit written to the IPND register is 1 the specified bit is set, otherwise it is cleared. Since the desire is to produce an interrupt, not to clear one, the top bit is set.

3.2 The Unix Input/Output System

Unix input and output can be divided into two categories, block and character, into which any device can be placed depending on how it is accessed. On block devices one input/output operation transfers a whole block of characters; any device which does not do this is a character device. A typical block device is a disk drive, while character devices include such things as terminals and printers.

Device independent routines provide character and block buffer manipulation and a small number of other functions. One set of device dependent routines, which perform the input or output operations and are known as device drivers, exist for every type of device. Information is passed between dependent and independent routines using the block buffer header for block devices and the data structure associated with each device for character devices.

The device drivers are the lowest level interface between the device and the operating system. They are therefore device dependent. To overcome the difficulty of having separate routines for each device type an array of procedures is kept which is indexed by the device type and the routine required (eg. read, write or other). The device arrays allow the levels of operating system above the device drivers, in particular the file subsystem, to treat all of them in the same way, thus isolating the differences from the rest of the operating system.

Each device is uniquely identifiable by its device number within the two categories of block and character. Device numbers consist of two parts, a major and minor device number. The major device number identifies the type of device and the minor device number identifies which device of that type is being referred to.

The device independent input/output routines and buffer manipulation routines remained fundamentally the same during this port. In contrast, new device dependent routines had to be written. While there are standard functions which the device driver routines have to perform, individual drivers may vary greatly in the way they provide the required services. While existing drivers may provide a model some new design often has to be done.

3.2.1 Writing a New Device Driver

When adding a new device driver to the operating system the following activities will have to be performed.

1. When the interrupt position is assigned to the new device, entries need to be made in the interrupt table so the device's interrupt handling routine is invoked when the device produces an interrupt.

2. An assembly language routine needs to be written for each new entry in the interrupt table. It will find the device number and call the appropriate 'C' interrupt handling routine.

3. The new 'C' interrupt handling routines need to be written. These routines are part of the device driver but are not accessed through the device array.

4. An entry has to be made in the device array for the new device type giving the address of the device driver routines.

5. The procedures accessed through the device array need to be written.

3.2.2 Block Input and Output

The following sections on block buffer handling and block input and output give a brief description of the way in which buffers are used and what the device independent routines do.

Block Buffer Handling

The block buffer header contains information about the buffer. The information includes the address of the 512 byte block in memory. The blocks of memory are in an array called "buffers" and are allocated to buffer headers during system startup. When the buffer contains a block from a block device the header indicates the block and device numbers. Each buffer is normally linked into two lists, one of which is for the device it is currently associated with, and the other is either a list of buffers available for other use, or a list of buffers representing pending input or output operations. There is a special pseudo device, "NODEV", which holds all buffers not associated with any device.

An array of buffer headers is reserved for raw input and output. They vary from other buffer headers by not having a block of memory permanently associated with them. This is to make it possible to read or write directly from the location in memory where the information resides. It avoids copying information from the buffer to another location when input or output has finished. It is used for swapping process images. Raw buffers are not kept in a free list. There are so few of them that it is practical to search the array every time one is required.

Each block device has a structure associated with it which holds the heads of two buffer header lists. One contains all the buffers currently associated with the device. A buffer is associated with a device when it contains a block of information from that device. The other is the input/output queue for the device which holds pending input/output requests for the device.

Not all buffer headers on the free list represent buffers available for immediate use. A buffer header may be marked as "delayed write" signifying that although it has been set up for a write command the physical write has not yet been performed. A delayed write prevents the physical write from being performed until the buffer is required for some other purpose. If the block is required again, before the physical write has taken place, the buffer can simply be removed from the free list avoiding the necessity of performing a physical read.
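A simplified picture of the buffer header is given below. The field names are in the style of the Version 7 "struct buf", but this is a sketch for the reader rather than the exact declaration used in the kernel.

struct buf {
    int b_flags;                   /* busy, done, delayed write, wanted, error, ...    */
    struct buf *b_forw, *b_back;   /* list of buffers associated with a device         */
    struct buf *av_forw, *av_back; /* free list or pending input/output queue          */
    int b_dev;                     /* major and minor device numbers                   */
    int b_blkno;                   /* block number on the device                       */
    char *b_addr;                  /* address of the 512 byte block of memory          */
    int b_bcount;                  /* number of bytes to transfer                      */
    int b_error;                   /* error code returned by the driver                */
};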

Block Input and Output

When a routine is called to read a block it checks whether the block is already associated with a buffer in memory by examining the buffers associated with the requested device. If there is no buffer associated with it the oldest non-busy buffer for that device is found and the block is read into it. If the buffer selected to be used for the read is marked as "delayed write" then it has to be written and released before it can be used.

Under certain circumstances an attempt to anticipate the next block to be required is made. After the read on the requested block has been started a read is scheduled for the block after it in the file in an attempt to speed up processing when a file is being read sequentially. The read ahead block is read asynchronously, that is, the requesting process does not wait for the completion of the read.

On a write request the device's "strategy" routine is called to schedule a write for the buffer header it was passed. It either waits for output to be completed then releases the buffer header or performs the command asynchronously, not waiting for it to complete.

The delayed write releases the buffer it is passed after marking it so it will be written out before being used for another purpose. If the buffer belongs to a magnetic tape the operation is scheduled immediately by calling the "strategy" routine, as writes must be performed in the order requested for sequential media.

When an input/output operation has been completed, or a buffer header is released with no input or output implied, a check is made to see if the buffer is wanted by another process. If the wanted flag is set a wakeup is done on the buffer, notifying any interested processes that the buffer is now available.

WD - The WD1002 Winchester/Floppy Device Driver

The WD1002 controller did not exist on the source machine so a new driver had to be written. The new driver was written following the design of existing disk drivers where possible but differs significantly from existing ones.

The WD1002 controller makes winchester and floppy disks appear almost identical, which enables both types of disk to be driven using the same device driver. The areas of difference are being able to set the track at which write precompensation is to begin for the winchester, and the translation of logical block numbers, because the number of cylinders, heads and sectors per track differs between the two types of disk.

For each type of block device there is an entry in the block device array. The entry consists of the addresses of three device handling routines for that device type and the address of the structure containing information about that device including the head of its pending input/output queue. Only the open, close and strategy routines are called using the block device array.

Open and Close

The open routine verifies that the device requested exists and performs the initialization of device registers and any variables associated with the device. The close routine marks the device as closed, whereupon it may not be used.

The WD1002 does not require any special initialization before being used as it initializes itself on reset or power up. The open routine simply has to check that the device number requested is reasonable. The close routine has nothing to do to the device itself.

Strategy

The strategy routine is passed a pointer to a buffer header which contains all the information needed to perform the input or output request. It places the buffer structure onto a queue to await servicing by the device driver's start routine. If the device is not active at this point the strategy routine calls the start routine to process the next input or output request.

wdstrategy( pointer to a block buffer header ) {
    If the block number does not exist on this device
        Return an error;
    Place the buffer on the list of buffers waiting to be serviced by this device;
    If the device is not active
        Call start so the newly arrived request can be serviced;
}

Wdstrategy() is responsible for how input/output requests are scheduled; the scheduling could be much more sophisticated than this rather simplistic "first in, first served" approach. Wdstrategy() puts buffer headers onto the end of the pending input/output request queue, and wdstart() removes requests from the front of the queue as new operations are started.

A more efficient method of scheduling is to take into account the current direction of head movement when placing the request on the queue to await servicing. This strategy tends to minimize head movement which is the major component of the time taken to perform a disk operation.
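A sketch of such a scheme, in the style of the classic elevator (disksort) insertion, is shown below. It assumes each buffer header carries the cylinder of its request in a field b_cylin, that the dummy element at the head of the queue records the cylinder the heads are currently over, and it sorts a single ascending sweep only. This is not the wdstrategy() actually used, which simply appends to the tail of the queue.

struct buf {                       /* minimal fields for this sketch only              */
    struct buf *av_forw;           /* link in the pending request queue                */
    int b_cylin;                   /* cylinder of the request (assumed field)          */
};

disksort_sketch(head, bp)
struct buf *head, *bp;             /* head = dummy queue element, bp = new request     */
{
    register struct buf *ap;

    for (ap = head; ap->av_forw != 0; ap = ap->av_forw)
        if (bp->b_cylin >= ap->b_cylin && bp->b_cylin <= ap->av_forw->b_cylin)
            break;                 /* insert between ap and its successor              */
    bp->av_forw = ap->av_forw;     /* requests behind the head end up at the tail      */
    ap->av_forw = bp;
}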

Start

The start routine takes the next buffer header from the device input/output queue and sets up the device's registers to start servicing the request.

wdstart() {
    If nothing is waiting to be serviced (an accidental call)
        Return;
    Mark the device as active;
    Set up the controller registers;
    If it is a write command
        Transfer the data to be written into the controller's buffer;
}

In order to set up the controller registers the block number must be translated into sector, head and cylinder numbers. The way in which it is calculated depends on whether it is a winchester or a floppy disk. Minor device numbers 0 to 2 refer to winchesters while numbers greater than 2 are floppy disks.

The calculations are as follows:

The sector/drive/head (SDH) register = ( sector size code ) bitwise or ( drive select bits ) bitwise or ( head select bits ).

Sector number register = ( logical block number) mod ( sectors per track ).

Cylinder number registers ( high and low ) = ( logical block number ) div ( sectors per cylinder ).
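Written out in C, the translation might look like the sketch below. The geometry values (sectors per track, number of heads) depend on whether the minor device selects a winchester or a floppy, and the derivation of the head select bits is not spelled out above, so the standard derivation used here is an assumption.

struct chs {
    int sdh;                       /* value for the sector/drive/head register         */
    int sector;                    /* value for the sector number register             */
    int cyl_lo, cyl_hi;            /* values for the two cylinder number registers     */
};

wd_translate(blkno, spt, heads, size_code, drive_bits, r)
int blkno, spt, heads, size_code, drive_bits;
struct chs *r;
{
    int spc = spt * heads;                 /* sectors per cylinder                     */
    int cyl = blkno / spc;
    int head = (blkno % spc) / spt;        /* head derivation (assumed)                */

    r->sector = blkno % spt;
    r->cyl_lo = cyl & 0377;
    r->cyl_hi = (cyl >> 8) & 0377;
    r->sdh    = size_code | drive_bits | head;
}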

The absence of a DMA controller means that data has to be copied in and out of the disk controller's buffer by the device driver routines.

Intr

This routine is called as a result of an interrupt from a device, usually on completion of the previous operation. It checks the device error register and performs the appropriate error handling which may include retrying a failed command or marking the buffer as having an error during the input/output operation. When anything else associated with the completion of a command has been done a device independent routine is called to notify any processes waiting for the completion of this input/output operation. The last thing to be done is to call the start routine to begin servicing the next request.

wdintr() {
    If the device is not marked as active
        Print an error message saying there was a stray interrupt and return. This is not a critical error;
    If the input/output operation has not finished
        Wait until it has;
    If there was an error in the input/output operation
        Mark the buffer;
    If the command was a read
        Copy from the controller's buffer to the system buffer;
    Call the standard routine for handling buffers on input/output completion;
    Call wdstart() to service the next request;
}

On an error termination the controller indicates six different types of error which may occur during an input or output command. WdintrO does not distinguish between the different types of error, the process is simply notified that an error occurred but not what type. This is consistent with the rest of the operating system which tends to go for simple error reporting. As the WD1002 controller automatically retries failed operations there is no need for the device driver to concern itself with such details.

Read and Write

These routines are used for raw input and output, which reads directly into, or writes directly from, the process's virtual space, avoiding the need for buffering.

Read and write are similar in that they perform raw input or output operations. They call a device independent routine to decide if the block exists on the device, passing, as parameters, all the information about the device needed to make the decision. If the block exists on the device then a device independent routine is called and passed the address of the device's strategy routine, the address of a buffer, the device number and a value informing it whether the operation is a read or a write.

read or write( device number ) {
    Call a device independent routine to validate the request;
    If valid
        Call a device independent routine to service the request;
}

A raw buffer is kept solely for input and output on this device and is passed as a parameter to the device independent routine. It does not reside in the raw buffer array.

The Block Device Array

This is the current block device array containing only one device type.

/* block device array */
struct bdevsw bdevsw[] = {
/* 0 */   wdopen,   wdclose,   wdstrategy,   &wdtab,
};

3.2.3 Character Input and Output

Character Buffer Handling

The character buffer header contains a count of the number of characters on the queue, and pointers to the first and last character buffers on the queue. The character buffer is used to store characters going to and from character devices. It contains a pointer to the next block on the list, an array of characters and two indices showing the first and last characters in the array.

Each terminal device has a data structure associated with it containing information specific to its current state, called a tty structure. The tty structure contains headers for the three queues normally associated with a character device, the address of the start routine which handles that device type, flags, the next character to be output to the device and other information. Each character device has a raw queue, a canonical queue (both input queues), and an output queue.

There is one more character buffer queue header which is used as a head for the free list. It contains the character buffers which are not in use.
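A sketch of these structures, after the style of the Version 7 clist and cblock but not the exact kernel declarations, is given below.

#define CBSIZE 24                  /* characters per buffer (illustrative value)       */

struct cblock {                    /* a character buffer                               */
    struct cblock *c_next;         /* next buffer on the same queue                    */
    char c_first;                  /* index of the first character in use              */
    char c_last;                   /* index of the last character in use               */
    char c_info[CBSIZE];           /* the characters themselves                        */
};

struct clist {                     /* a character buffer queue header                  */
    int c_cc;                      /* number of characters on the queue                */
    struct cblock *c_cf;           /* first buffer on the queue                        */
    struct cblock *c_cl;           /* last buffer on the queue                         */
};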

Device Independent Character Handling Routines

When characters are received from a device they are placed on the raw queue and echoed, if required, by placing them on the output queue at the same time. The routine which transfers characters from the raw queue to the canonical queue processes erase/kill and escape characters. A flag indicates that a CTRL-S was sent to stop input/output. The character CTRL-S indicates that output is to be suspended until receipt of a CTRL-Q character. The routine which puts characters on the output queue expands tabs, adds delays and the like. In raw mode characters are passed straight from the raw queue to an area within the user process.

The routine which places characters onto a character queue also adds a new buffer to the queue when the last one is completely full. The routine which removes characters places a buffer on the free queue after the last character has been removed from it.

It is possible to remove all the characters from the raw, canonical and output queues without processing them. A variation on this is to allow characters on the output queue to drain and then flush the other two queues.

Character device drivers

For each different type of character device there is an entry in the character device array, "cdevsw". All calls to the device driver routines are done through this array which contains one entry per device type, consisting of the addresses of five routines specific to that device and the address of a tty structure for that device. The routines are read, write, open, close, and ioctl. Not all devices will require all of these routines. A printer will not require a read routine, a card reader will not require a write routine and so on. If a call to a routine constitutes an error, as in the case of reading from a printer, the entry in the character device array would be nodev(). If nothing is done by a routine but calling it does not constitute an error then an entry of nulldev() is placed in the character device array.

All routines for a particular device are placed in one file together with routines for handling interrupts from that device.

The Terminal Device Driver

Open and Close.

The function of these routines is the same as for open and close routines described in the block device driver.

usopen(device, write_flag) {
    If the device requested does not exist
        Set an error flag and return;
    Initialize the terminal's data to indicate it is open, set up the address of the start routine used by the device and say which minor device is being used;
    Initialize the device's registers by calling usparam();
    Call the device independent open routine to complete initialization of data;
}

usparam( device number ) {
    Write three zeros to the USART's control register address to ensure it is the control register being accessed;
    Ensure that the mode register is being addressed;
    Initialize the mode register;
    Set up the control register;
}

Since this device has both its mode and control registers residing at the same address, at the time of an open it is not known which register is being addressed. With the 8251A USART, writing three zeros to that address will ensure that the control register is being addressed regardless of the previous state. At that point the mode register can be requested allowing the device to be set up by writing first to the mode register and then the control register.

The value that is written into the mode register is stored in an array with one entry per device of this type so that each device may be initialized differently if necessary. The value will depend on the type of terminal attached to the USART.
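The sequence might look like the following sketch. The register address and the reset, mode and command values written are placeholders chosen for illustration, not the values actually used; only the order of the writes follows the description above.

extern char usmode[];                        /* per-device mode values mentioned above            */

#define US_CTRL ((volatile char *)0xFFF800)  /* hypothetical USART control register address       */

usparam_sketch(dev)
int dev;
{
    *US_CTRL = 0;                            /* three zero writes guarantee the chip is           */
    *US_CTRL = 0;                            /* expecting a command byte, whatever state          */
    *US_CTRL = 0;                            /* it was left in                                    */
    *US_CTRL = 0100;                         /* internal reset: next write goes to the mode register (assumed value) */
    *US_CTRL = usmode[dev];                  /* per-device mode value                             */
    *US_CTRL = 067;                          /* command: enable transmitter and receiver (assumed value) */
}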

usclose(device) {
    Call a routine to get rid of any characters still waiting to be output to the terminal;
    Mark the terminal as being closed;
    Mask the transmit and receive interrupts in the device control register;
}

The device independent routine called by usclose() waits until all characters on the output queue have been sent to the device and then arranges for any characters on the raw and canonical queues to be removed, leaving all three queues empty.

Ioctl

This routine allows manipulation of the device's characteristics as represented in its tty structure. It allows a process to find out the values and to modify them. Most devices allow the ioctl routine to change input and output speeds, the erase and kill characters, and an integer containing status flags.

Read and Write

The read routine calls a device independent read routine to get characters from either the raw or canonical input queue, depending on the input mode, and pass them to the user process. The write routine calls a device independent routine to put characters onto the device's output queue from the user process.

Map

The map routine returns the address of the tty structure belonging to the device whose number was passed to it.

Interrupt Routines

When an interrupt occurs an assembly routine is called to find the minor device number of the interrupting device and then call the 'C' language interrupt handling routine, passing the device number as an argument. Up to three routines may be required: one for transmitting, one for getting the next character to transmit, and one for receiving characters. The following routines, usrint() and usxint(), are called from the interrupt system.

The receiver interrupt handling routine gets a character, just received by the device, and places it onto the raw character queue. If echoing of input is required it also places the character on the output queue.

usrint() {
    Read a character from the USART's data register;
    If the device is closed
        Mask out the receiver interrupt and exit. The character is lost;
    If there was any sort of error
        Perform the appropriate action;
        Reset the error bit in the USART;
    Call a device independent routine to place the character on the raw input queue;
}

The reading of a character from the data register automatically resets the receiver interrupt removing the necessity of explicitly performing that operation.

The transmitter interrupt handling routine finds out if there is a character to transmit, passed to it in the device's tty structure, and passes it to the device. When that is done it calls usstart() to get the next character from the output queue. If there are no characters on the output queue it masks or otherwise inhibits the transmit interrupt produced by the device.

usxint() {
    If the USART is ready to transmit (not called accidentally)
        If there is a character ready to be transmitted
            Write the character to the USART;
            Set a flag in the terminal's data to indicate the character has been sent;
            Call usstart() to get the next character for transmission;
        Otherwise
            Mask the interrupt to prevent any more interrupts being generated before there are characters to send;
}

Usstart() gets another character from the output queue and passes it to usxint() via the tty structure for that device if usxint() has transmitted the previous character. As well as being called by usxint(), usstart() may be invoked by some of the device independent routines, which obtain its address from the tty structure initialized in usopen().

usstart(pointer to the terminal's data) {
    If the device has timed out, is busy or has stopped
        Return;
    If there is at least one more character on the output queue
        If not in raw mode and the character is greater than a certain value (then it is being used to store a timeout value)
            Mark the terminal as timed out;
            Pass the character to the tty structure;
        Otherwise
            The terminal state becomes busy;
            Put the character into the tty structure;
            Enable the transmit interrupt;
    If there are fewer characters on the output queue than the "low water mark" for this device and it is asleep
        Reset the asleep flag;
        Wakeup anything waiting on the output queue;
}

While there are too many characters on the output queue the status is set to "asleep" to allow some of the output to drain. There are two numbers which regulate how many characters a device can have on its output queue. The "high water mark" is the maximum number of characters. If the number of characters on the output queue is greater than this, output is suspended. When the number of characters falls below the "low water mark", output to the output queue can be restarted. Two arrays are maintained for the high and low water mark values which are indexed by the speed of the device. All devices of the same speed have the same high and low water mark values.

The Character Device Array

The character device array is shown here.

/"' character device array */ struct cdevsw cdevsw[] = ( /"' 0 *I usopen, usclose, usread, uswrite, usioctl, usmap, );

3.2.4 The Console Device Driver

A separate device driver is usually provided for the console device. In this case the console is the same device as the terminal, so the console device driver may use the same routines as the terminal driver.

In addition to the usual device driver routines a routine, putchar(), is provided for use by printf() in outputting system messages to the console. Putchar() uses polling to wait until the device is free and preempts any other pending output to the device, making this a routine for emergency use only.

putchar( character ) {
    If it is a NULL character
        Return;
    Disable transmit/receive interrupts;
    Poll the USART's status register to find out when it is ready to accept a character;
    Send the character;
    If the device is open
        Reenable the transmit/receive interrupts;
}

The transmit and receive interrupts are disabled in the USART, not by setting a mask in the ICU.

3.3 Scheduling, Synchronization and Timing

Few major changes were made to these routines as they are written in the "C" language and are fundamentally machine independent. This section describes the original system except where otherwise noted.

Scheduling

Two resources are scheduled between active processes: occupancy of real memory and the use of CPU cycles. The management of both these resources is based around a queue of active processes called the run queue (runq) which consists of a linked list of process (proc) structures.

When there are more processes than real memory can contain the scheduler, which runs continuously, circulates processes between memory and secondary storage to enable each one to get its fair share of CPU time, as processes can only run while in memory. Whenever processes are swapped out the scheduler examines the runq to find the process which has been swapped out the longest, hence the most deserving of being swapped in. If it is unable to swap that process in, it will try to swap another process out. Processes which are suspended, either because they are waiting for a resource to become available or because they are being traced, are likely candidates for being swapped out. If there are no processes to be swapped in the scheduler will go to sleep until there is some work to do.

In order that swapping is not attempted too often, which would reduce the efficiency of the operating system, swapping does not proceed unless the process being swapped out has been in memory for a certain length of time and the process being swapped into the resulting space has been swapped out for a minimum length of time.

Allocation of processor time is done by the routine swtch() which searches the runq for the highest priority runnable process and gives the CPU to it. If it finds no runnable process it calls the idle routine. The priority of each process is recalculated once every second by the clock routine. The priority is based mainly on the amount of CPU time the process has used such that a process which has used a lot of CPU time will have a lower priority than one which has used less CPU time. The CPU is rescheduled when an active process suspends or once a second as a result of the clock setting a flag indicating the CPU should be rescheduled at the soonest opportunity, which will be on the next return to user mode from an interrupt as described in section 3.1 "Traps and Interrupts".

Another scheduling value known as "nice" is maintained. A process inherits its "nice" from its parent but can change the value via a system call. Only the superuser can improve the value of the nice, other users may alter it to be less favorable. The nice gets less favorable very slowly over a long period of time by being incremented once every 4096 seconds of CPU time used. A negative nice affects the scheduling of the CPU making the possessor of such a nice of higher priority. A positive nice plays no part in CPU scheduling. If the scheduler cannot find any suspended processes to swap out it tries to find the lowest priority runnable process which is the one having the largest value of the time spent in memory added to the nice.
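The shape of the once-a-second recalculation described in the last two paragraphs is sketched below: recent CPU usage is decayed, the priority is derived from it, a negative nice improves the priority and a positive nice is ignored. The constants, the field names and the exact arithmetic are chosen only to illustrate the idea and are not those of the kernel.

#define PUSER 100                  /* assumed base priority for user processes         */

struct proc {                      /* only the fields used here, names illustrative    */
    int p_cpu;                     /* accumulated recent CPU usage                     */
    int p_nice;                    /* the "nice" value                                 */
    int p_pri;                     /* scheduling priority: smaller = more favourable   */
};

setpri_sketch(pp)
struct proc *pp;
{
    pp->p_cpu /= 2;                          /* decay the recent CPU usage             */
    pp->p_pri = PUSER + pp->p_cpu / 16;      /* heavy CPU users get a worse priority   */
    if (pp->p_nice < 0)
        pp->p_pri += pp->p_nice;             /* a negative nice improves the priority  */
    if (pp->p_pri > 127)
        pp->p_pri = 127;
}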

In the new system swapping has been eliminated so the routine which schedules occupancy of main memory is no longer needed.

Sleep/Wakeup

A process is able to suspend by calling sleep() and giving the reason for sleeping. When another process does a "wakeup", a search is made for all processes sleeping for the specified reason and they are placed in a runnable state again. Sleep is used by a process when it requires a resource which is being used by another process, when waiting for the completion of a lengthy operation such as an input or output, or waiting for an event signalled by the setting of a flag. It is essential that, for every reason for which sleep() might be called, there is some place where a wakeup will be done on that reason.

The Clock

The system clock routine is driven by a counter in the interrupt control unit which causes an interrupt to be generated with the clock's priority sixty times a second. The clock has several tasks: maintaining the system time, processing timeouts, and recalculating scheduling priorities for all processes. Once a second it causes the CPU to be rescheduled by setting the rescheduling flag which is checked on return from an interrupt as described in section 3.1 on "Traps and Interrupts".

The only change was in the means of producing a software interrupt to commence timeout processing. The new routine to produce a software interrupt under software control, sint(), is described in section 3.1 "Traps and Interrupts".

Timeouts

Timeouts are used by the operating system to execute a routine in a specified amount of time after the request. A record is set up with the function to be called and the length of time in ticks to wait until it is called. Each time a clock interrupt occurs the time is decremented and when it reaches zero the routine is executed.

Whenever it is time to process the timeouts the clock causes a software interrupt at a lower priority than the clock which initiates them. It is possible that timeout processing takes more than one tick so they must be processed at a lower priority enabling clock interrupts to be caught and processed.
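The mechanism can be pictured as below. Each record holds the routine to be called, an argument and the number of ticks remaining; in the sketch the countdown and the call are both done from the lower priority software interrupt, which is a simplification of the division of labour described above. The fixed-size table and the names are illustrative; the kernel's own timeout handling may be organized differently, but to the same effect.

#define NCALL 20                   /* size of the table (illustrative)                 */

struct callo {
    int c_time;                    /* ticks remaining until the call is due            */
    int c_arg;                     /* argument passed to the routine                   */
    int (*c_func)();               /* routine to call, 0 if the slot is free           */
} callouts[NCALL];

timeout(fun, arg, ticks)           /* register a routine to run after "ticks" ticks    */
int (*fun)(), arg, ticks;
{
    register struct callo *cp;

    for (cp = callouts; cp < &callouts[NCALL]; cp++)
        if (cp->c_func == 0) {
            cp->c_func = fun;
            cp->c_arg = arg;
            cp->c_time = ticks;
            return;
        }
}

timepoll()                         /* run from the low priority software interrupt     */
{
    register struct callo *cp;

    for (cp = callouts; cp < &callouts[NCALL]; cp++)
        if (cp->c_func != 0 && --cp->c_time <= 0) {
            (*cp->c_func)(cp->c_arg);
            cp->c_func = 0;
        }
}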

Signals

Signals are the means of notifying processes of events which are of interest to them. Several types of events may cause signals: a hardware trap, a user hitting the "delete" key on a terminal causing a software interrupt, or the use of the "kill" system call.

All traps except the supervisor (system call) or memory management traps send a signal to the user process responsible whenever they occur in user mode. The memory management trap only causes a process to be signalled if a protection level has been violated or the requested page cannot be supplied, as described in section 3.5 "Memory Management".

The "delete" character causes a "kill" signal to be sent to the process currently attached to the terminal from which the character was received.

The "kill" system call may be used to send signals to processes. Kill allows a process number to be specified as the target of the signal or it can be used to signal all processes associated with the same terminal as the process which used the system call.

The default processing performed for a signalled process is to terminate it and for some cases (eg. for most traps) to produce a file containing the process image at the time the error occurred. The file is always called "core".

It is possible to anticipate the occurrence of a signal and specify an alternative action. The "signal" system call allows a user process to specify a user routine which is to be performed on receipt of the signal specified in the system call by that process. - 60-

The routine which executes the user specified routine when a signal is expected, sendsig(), is machine dependent. It alters the return address on the system stack so it will return to a different address in user mode and it sets up a stack frame on the user stack for the return from the routine which is run as a result of the signal. The shape of the stack frame depends on the microprocessor involved. Sendsig() has yet to be rewritten.

Within the process information signals are stored as flags within a four byte field. The original code assumed that an integer was four bytes long making an integer sufficiently large to hold one flag per signal type. The compiler used during this project assumes two byte integers which are too short. The signal routines and associated data structures had to be checked to ensure that a four byte data structure was being used in all the correct places. Several changes were made as a result of this problem.

3.4 The UNIX File System

The structure of the file system did not change during the port; however, there are two reasons for examining it: an initial file system needed to be created, as UNIX assumes the existence of one during startup, and, since processes are now to be represented as files, the creation, deletion and structure of files needs to be understood before the modifications to the process subsystem can be understood.

The description of the Unix file system given here is fairly brief. If further details are desired the two articles "The Unix Time-Sharing System" [Ritchie,74] and "Unix Implementation" [Thompson,78] are worth reading. Each of them contains a section on the file system as well as describing other parts of the operating system.

The software which defines the structure and usage of the file system sits above the input/output subsystem converting user file requests into a series of block input/output requests. To the user a file simply looks like a long string of characters. The system governs the representation of files but does not impose any internal structure on them.

Each file has an inode which can be located using the directory entry. The inode maintains any information required about the file such as the type of file, its owner, the size in bytes, access rights, the times of creation, last access and last modification, and an array of disk block addresses. The disk block array is not used for special files. Access rights are the read, write and execute permissions used to protect the file from unauthorized access and inappropriate usage.
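
To make the preceding description concrete, the following sketch shows what such an on-disk inode might look like in C. The field names, types and ordering are assumptions made for illustration and do not claim to match the port's actual declarations.

    /* Illustrative on-disk inode, following the description above;
       the names and field widths are assumptions, not the port's code. */
    #include <stdio.h>

    #define NADDR 13                /* 10 direct + single, double and triple indirect */

    struct dinode {
        unsigned short di_mode;     /* file type and read/write/execute permissions */
        short  di_nlink;            /* number of directory entries (links) */
        short  di_uid;              /* owner */
        short  di_gid;              /* owner's group */
        long   di_size;             /* size in bytes */
        long   di_addr[NADDR];      /* disk block addresses (unused for special files) */
        long   di_ctime;            /* time of creation */
        long   di_atime;            /* time of last access */
        long   di_mtime;            /* time of last modification */
    };

    int main(void)
    {
        printf("illustrative disk inode occupies %u bytes here\n",
               (unsigned)sizeof(struct dinode));
        return 0;
    }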

A file exists as long as there is at least one link to the inode. The linkage count indicates whether there is still some interest in the file, which does not have to be kept once the count falls to zero. Each directory entry for the file counts as one link, so as long as a file has a directory entry it will be preserved.

When an inode is in use a copy of it is maintained in memory and used to update the disk inode periodically. It is removed when there are no more references to it. Whenever an in memory inode is being modified it is locked to prevent modification by more than one process at once. Each time the file is opened the reference count is increased and each time a close is performed the reference count is decreased. This allows a file to exist even if there is no directory entry for it because it is referenced by a running process. The file is deleted if there are no links to it and the last process referencing it reduces the reference count to zero by closing it.

The directory file is the only file type which has an internal structure maintained by the system. The directory consists of a series of records containing the file name and inode number of each file belonging to that directory. The directory allows files to be referred to using symbolic names, removing the need for the physical location of the file to be known by the users.

The array of disk block addresses in the inode contains 13 entries. The first 10 give the first 10 blocks in the file. The 11th entry gives a block containing the addresses of the next 128 blocks of the file (each block address is 4 bytes long). The next entry contains 2 levels of indexing and provides the next 128 * 128 blocks of the file. The last entry has triple indirect addresses and covers 128 * 128 * 128 blocks. This then gives the maximum size of a file under UNIX.
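
Using the figures just given (512 byte blocks, 4 byte block addresses, 10 direct entries and one single, double and triple indirect entry) the maximum file size can be worked out as follows. This is a small illustrative calculation only; the variable names are mine.

    #include <stdio.h>

    int main(void)
    {
        long blocksize = 512;                  /* bytes per disk block */
        long perblock  = 512 / 4;              /* 128 addresses per indirect block */

        long direct = 10;
        long single = perblock;                        /* 128 blocks */
        long dble   = perblock * perblock;             /* 16,384 blocks */
        long triple = perblock * perblock * perblock;  /* 2,097,152 blocks */

        long blocks = direct + single + dble + triple; /* 2,113,674 blocks */
        printf("maximum file size = %ld bytes\n", blocks * blocksize);
        /* prints 1082201088 bytes, a little over 1 Gbyte */
        return 0;
    }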

Any of the indices may be empty. This gives the file system the ability to tolerate gaps in files, sections which are not physically represented. When a nonexistent block is required it is read as zeros. This feature is important in the representation of processes as will be seen in the process subsystem.

At least one special file exists for every device in the system. A request for reading or writing on one of these special files activates the device to service the request. The special files reside in a directory called "/dev". As part of the creation of the system the directory "/dev" had to be created as well as one special file per device. The name of the special file indicates the type of device. The special files were called "ttyn" for terminals, "fdn" for floppy drives, and "hdn" for the hard disk where the "n" distinguishes between the devices where more than one of that type exists. Other special files may be created as devices are obtained and device drivers written. A special file is created using the system call mknod() which has the name of the file, the access permissions and the device number as parameters.
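
A short user level sketch of creating two such special files with mknod() follows; the particular names, permissions, major and minor device numbers, and the encoding of the device number are assumptions chosen only to illustrate the call.

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <stdio.h>

    int main(void)
    {
        /* device number encoding (major << 8 | minor) is an assumption */
        int tty0 = (0 << 8) | 0;               /* first terminal  */
        int hd0  = (1 << 8) | 0;               /* first hard disk */

        if (mknod("/dev/tty0", S_IFCHR | 0600, tty0) < 0)   /* character special */
            perror("mknod /dev/tty0");
        if (mknod("/dev/hd0", S_IFBLK | 0600, hd0) < 0)     /* block special */
            perror("mknod /dev/hd0");
        return 0;
    }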

3.4.1 The Initial File System

The structure of the file system is determined by the file access routines in the operating system. Block zero on the disk is not used by the file system but may be used in booting procedures or the like. Block one contains the file system's super block. It contains any information necessary about the file system residing on this disk. The inodes reside in contiguous disk blocks starting at block two. The number of inodes is fixed when the file system is created. All the blocks after the inodes are available for the files on the disk.

Two entries are created in the root directory, both referring to the inode for the directory. The entry "." exists in every directory and refers to itself while the entry ".." usually refers to the parent directory. In this case the directory is at the top of the tree so the entry ".." refers to itself.

setsuper()
{
    Work out how many blocks the inodes will occupy and round it up to the
        next whole block;
    Write the number of inodes in the super block;
    Set up the number of the first block after the inodes;
    Set up the first part of the free block list, the rest of the free list
        is set up in main;
        { reserve the first free block for the root directory }
    Set up the free inode list so the system will be forced to read in the
        free inodes the first time a free inode is requested;
    Write the super block;
}

Because the number of inodes is rounded up to exactly fill the last block of inodes the user may get up to seven more inodes than expected, there being eight inodes per block.
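
The rounding just described amounts to the following arithmetic; the numbers in this small example are mine and purely illustrative.

    #include <stdio.h>

    #define INOPB 8    /* inodes per 512 byte block, as stated above */

    int main(void)
    {
        int requested = 100;                           /* inodes asked for    */
        int blocks  = (requested + INOPB - 1) / INOPB; /* 13 blocks of inodes */
        int granted = blocks * INOPB;                  /* 104 inodes, 4 extra */

        printf("%d inodes requested, %d granted in %d blocks\n",
               requested, granted, blocks);
        return 0;
    }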

During normal operation of the operating system, when an inode is requested and the free list is empty a search is made for free inodes which are placed on the list. The construction of a super block free inode list has not been done here but is left for the operating system.

Simple disk and terminal drivers were used to write to the disk, read the number of inodes required and print the request for input. Both drivers are polled to avoid the complexities of initializing and using interrupts. They are not the same as the device drivers used within the operating system but were initially written to test the terminal and disk controller hardware.

The super block contains the block number of the first block after the inodes. A list of free inodes is maintained in the super block as a matter of convenience, it is not essential. Free inodes can be detected by reading them and looking at their status after which they can be placed on the free list. This is done by the routine which allocates free inodes on request. The total number of free inodes is also kept. A flag is kept in the super block allowing the free inode list to be locked during manipulation excluding any other processes from accessing it.

Initially all blocks on the volume after the inodes are placed on a free blocks list. The start of this list is in the super block. The last element in the super block portion of the list refers to a block containing a continuation of the free list. The last block number in it points to the next block of free list entries and so on. The total number of free blocks is kept. Like the free inode list, the free block list can be locked during manipulation using a flag to prevent it becoming corrupted. Unlike inodes, the only way to tell from within the operating system if a disk block is free is membership of the free block list.

Some information about the super block itself is kept such as the time of last update, whether the device it resides on is read/write or read only and a checksum which is normally unused. Space is left to record information concerning the volume such as a volume number and the number of sectors per cylinder, if it is required.

The following is a description of the standalone program written for this project and used to create the initial file system. The design may leave a little to be desired, but rewriting was not considered worthwhile for two reasons: the program works and time is better spent designing more important and permanent parts of the system.

main()
{
    Get the user to enter the number of inodes required;
    Calculate the total number of blocks on the disk;
    Fill block zero with zeros;
    Call setsuper() to create the super block, block one;
    Zero fill all the inodes;
    Create an inode for the root directory;
    Place the unused blocks on a free list;
    Write the directory entry for root;
}

3.5 Memory management

Memory management routines maintain the mapping between virtual addresses and real addresses and provide a strategy for handling the case where there are more active processes than can fit into memory. The way this is done and the data structures which need to be maintained depend on the memory management hardware used.

Two strategies to cope with the problem of insufficient memory to run all active processes are swapping and paging. The source version of Unix used in this project uses swapping although other versions of Unix may use different strategies. During this port of Unix swapping was replaced by demand paging. One benefit of paging is that only the pages actually being used need to be in memory. This is usually a small subset of the whole process. Hardware provides the mechanism for recognizing a reference to a page which is not in memory. It allows a routine to be called to rectify the situation. Various housekeeping tasks are undertaken such as keeping track of memory usage both real and virtual, keeping a count of the total number of page faults a process has generated and the like. The ability to pass data between address spaces is also provided.

3.5.1 Virtual Memory

At the centre of the idea of virtual memory is the distinction between program addresses, called virtual addresses, and the physical memory locations which they represent. The full set of available virtual addresses is known as the address space and the set of physical memory locations is called the memory space [Lister,79]. The mapping between the two need not be sequential and the two spaces need not be the same size. It is possible to have parts of the address space which are not represented within the memory space, and many address spaces mapped onto one memory space. This allows many processes each with their own address space to run in the same memory space, where the combined size of the address spaces are larger than the memory space, and allows areas of memory to be shared between processes.

Both the address space and the memory space are divided into blocks called pages and page frames respectively. Pages are mapped into page frames using an address translation mechanism. When an address is requested the address translation mechanism is used to find the page frame containing the page with that address in it. The offset within the page is added to the address of the page frame to form a new real address which refers to an area within the memory space. The information indicating which pages are in which page frames is held in a table called a page table.
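
The following sketch illustrates the two level translation used by this port (512 byte pages, a 256 entry first level table and 128 entry second level tables covering a 16 Mbyte address space, as described later in this chapter). The entry layout, the use of a single second level table and all of the names are assumptions made for the example; this is not the NS16082's actual format.

    #include <stdio.h>

    #define PAGESHIFT 9              /* 512 byte pages */
    #define VALID     0x1ul          /* assumed "entry valid" bit */

    unsigned long l1[256];           /* first level table: 256 entries */
    unsigned long l2[128];           /* one second level table: 128 entries */

    long translate(unsigned long vaddr)
    {
        unsigned i1  = (vaddr >> 16) & 0xff;   /* top 8 bits: first level index   */
        unsigned i2  = (vaddr >> 9)  & 0x7f;   /* next 7 bits: second level index */
        unsigned off =  vaddr        & 0x1ff;  /* low 9 bits: offset in the page  */

        if (!(l1[i1] & VALID) || !(l2[i2] & VALID))
            return -1;                         /* would raise a page fault */

        /* the page frame number is assumed to sit above the flag bits */
        return (long)(((l2[i2] >> PAGESHIFT) << PAGESHIFT) | off);
    }

    int main(void)
    {
        l1[0] = VALID;                         /* pretend entry 0 points at l2[] */
        l2[3] = (42ul << PAGESHIFT) | VALID;   /* page 3 lives in page frame 42  */
        printf("0x%lx -> 0x%lx\n", 0x67ful, (unsigned long)translate(0x67f));
        return 0;
    }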

There are several problems associated with the use of virtual memory.

a. It is slower than direct addressing because each address needs to be translated before being usable. The use of special hardware can help significantly in this area. The NS16082 MMU has a 32 item cache which avoids the necessity of translating any address found there. If there is an entry, address translation takes one additional cycle; otherwise it may take up to about 20 cycles. It is claimed that full address translation can be avoided about 97% of the time [Hunter,83].

b. Extra memory must be used to maintain the page tables. For instance a full set of page tables for a process in the NS32016 environment requires 129K of storage. Although a full set of page tables is not always required, a minimum of 2K of page tables per process is needed regardless of how small the process is.

Virtual memory with the NS16082 MMU allows protection on a page basis. Each page can be specified read/write, read only or no access. A process can only access addresses within its own address space, so it cannot interfere with other processes either by accident or design. It is still possible, however, for one area of memory to be part of the address spaces of more than one process by putting entries for that memory into the page tables of all the processes which are to be allowed access to it. This also allows the processes to have different permissions on the shared memory.

3.5.2 Swapping Under Unix

When there is insufficient real memory to hold all the processes which are active one or more of them are copied by a procedure called swapping into a specified area on a device called a swap device. Events which may result in having insufficient real memory are the creation of a new process, the expansion of the data or stack segments of an existing process or the swapping in of a process already swapped out. The usual swap device is a hard disk which gives the speed and volume of storage space required.

Processes are swapped into a large area of contiguous disk blocks reserved for that purpose. The routines which allocate and free blocks in the swap area use a list to keep track of allocation. The list, called a map, keeps the size and starting address of each piece of unallocated space, sorted in order by address. The map is maintained in an array.

There are several map arrays maintained within the system corresponding to different swap areas that are used for specific purposes such as forking. Space is allocated using a first fit algorithm in units of 512 bytes. If there was an exact fit during space allocation the list element is eliminated. When space is released, it is combined with an existing free area if possible otherwise a new element is inserted into the list.
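
The sketch below shows one way such a first fit map allocator can be written. It follows the description above, but the names, the fixed map size and the use of zero as a failure return are assumptions of the example, not the system's actual routine.

    #include <stdio.h>

    /* One free region: starting block and length, in 512 byte blocks.
       A size of zero marks the end of the map. */
    struct map { long addr; long size; };

    static struct map swapmap[50];

    long rmalloc(struct map *mp, long size)
    {
        struct map *bp;

        for (bp = mp; bp->size != 0; bp++) {
            if (bp->size >= size) {            /* first region big enough */
                long a = bp->addr;
                bp->addr += size;
                bp->size -= size;
                if (bp->size == 0) {           /* exact fit: remove the entry */
                    do {
                        bp[0] = bp[1];
                        bp++;
                    } while (bp[-1].size != 0);
                }
                return a;
            }
        }
        return 0;                              /* no space found */
    }

    int main(void)
    {
        swapmap[0].addr = 1;                   /* one large free region */
        swapmap[0].size = 1000;
        printf("allocated blocks starting at %ld\n", rmalloc(swapmap, 16));
        printf("allocated blocks starting at %ld\n", rmalloc(swapmap, 32));
        return 0;
    }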

3.5.3 Demand Paging

It is possible within a virtual memory system that not all the pages are mapped into page frames but instead are held on some kind of secondary storage. When it is discovered that an address cannot be translated because the page is on secondary storage, an exception called a "page fault" is generated. This causes the missing page to be brought in from secondary storage to replace a page already in memory, according to a replacement rule which tries to predict which page is least likely to be needed in the future. This type of paging where a page is only mapped into the memory space when it is requested is known as demand paging.

Another possibility is to try to anticipate which pages will be needed beforehand. This is more complex because of the need to evaluate which page will be required next.

"There have been few proposals to date recommending look-ahead or anticipatory page­ loading because (as we have stressed) there is no reliable advance source of allocation information, be it the programmer or the compiler." [Denning,68]

Anticipation may be unreliable by failing to provide a required page. There still needs to be a means of detecting an attempted access to a page which is not present and a means to fetch it. The other possible failing of the anticipatory system is that pages are fetched which are not required. The first problem mentioned may prevent an anticipatory system from working, the second simply makes it less efficient.

Why Change to Demand Paging?

Given the available real memory size, severe restrictions would have to be placed on the number and size of processes. Under demand paging restrictions would still exist but be less severe. Under the present scheme a process must be entirely memory resident to be runnable whereas with paged memory this would not be necessary.

The available memory management hardware lends itself to demand paging. It provides a trap when a page is referenced which is either not in memory or does not have the appropriate permissions. The information provided by the MMU, the reason for the trap and the virtual address which caused it, is sufficient to support demand paging.

There are some problems with demand paging. Since not all pages are in memory, page faults will occur and the missing pages are generally kept on slower secondary storage such as a hard disk. This results in a time delay while the page is read in, before execution can resume. In a multiprocessing system this is not a severe problem. While one process is waiting for a page other processes can be running. Significant loss of efficiency will only occur when most of the processes are waiting on pages.

A phenomenon called thrashing may occur when real memory is overcommitted. It can be loosely defined as a state in which the processor spends so much time paging that it cannot do any useful work. Thrashing can be handled by mechanisms which seek to detect it and reduce the commitment on memory, or avoid its occurrence in the first place.

3.5.4 An Implementation of Virtual Memory with Demand Paging

The following section describes how virtual memory will be implemented in this port of UNIX. It will also describe what decisions had to be made and why.

The memory management unit dictates many of the features of this implementation of virtual memory. It dictates the page size, the translation mechanism, the page table size and structure and the structure of the address space. Two features not regulated by the MMU are sizes of real memory and secondary storage.

Page Tables

In a small memory system the page tables can be a problem because of the amount of memory they occupy: each first level table requires 1 Kbyte and each second level table requires 512 bytes of memory.

There are three possibilities for allocating space for page tables.

1. Allocate all space required for page tables at startup. This may be very wasteful of memory.

2. Allocate space for tables as required. This is the most efficient use of memory. It has the disadvantage that first level tables require 2 contiguous pages of memory and the normal allocation routine cannot supply contiguous blocks. Fragmentation may become a problem. There may be plenty of memory available but all in single block lots.

3. Allocate the first level page tables at startup, and space for second level tables on demand. This is the method selected. It is a good compromise between the first two methods. However, because processes are able to grow, and hence demand more page tables, it is not possible to know exactly how much real memory will be used by the process when it is executed. The page fault handling routine provides new second level page tables when required.

The Working Set

The working set is the set of pages a process currently requires in memory to run effectively. It is the set of most useful pages at any particular time. The size of the working set may vary during the life of a process.

If a process's working set is larger than the number of pages that it is allowed in memory it will have more page faults than if it had a larger number of pages. Because of this a certain minimum number of pages is usually required for the process to be running. The precise number is a tradeoff between unnecessarily allocating memory to that process and the number of page faults. The more memory that is allocated to each process the fewer processes can run at a time. The more page faults a process has the less efficient the whole system becomes because more time is spent reading pages in proportion to the time spent doing useful work.

The "modified" and "referenced" bits associated with each page and kept with the page tables allow the membership of the working set to be detennined [Denning,70]. If, after a certain period of time, a page has not been referenced it can be assumed not to be part of the working set any longer. The circumstances under which a page leaves the working set are discussed in the section "Releasing Pages". A page enters the working set when it is referenced causing it to be read into memory by the page fault routine.

There is usually a maximum number of pages allowed for a process which is referred to as the working set estimator (WSEst). The WSEst should ideally be an accurate representation of working set size but will always be an approximation. One option for determining the WSEst is to allow it to be specified by each process using a system call. This can either be built into the compiler or be left to the user. The default value of the WSEst may be changed as a result of system tuning. The other option is to have the operating system change the size of the WSEst dynamically, with some set maximum and minimum values allowed.

In order to determine the minimum working set size the maximum number of pages a single instruction could possibly need is determined. If the minimum working set size were less than that number of pages it would be possible to have an instruction which could not be executed because the essential number of pages could not be provided. The working set will include any second level page tables required for mapping pages within the working set.

If an instruction crosses a page boundary two pages will be required to enable it to be accessed. If the memory relative addressing mode is used for both addresses, assuming this is a two address instruction not a single address instruction, each address will require two memory accesses. If each access lies on a page boundary then four pages will be required for each address. This makes a total of ten pages for the whole instruction. The probability of such an instruction existing is fairly low. It is conceivable, but highly unlikely, that one instruction may require ten pages to complete its execution.

It would be possible, by the use of great cunning, to set up an instruction of the form mentioned before which also used one second level page table to map each page. This would double the number of pages which the instruction needed to complete execution. The probability of this occurring in a situation other than an attempt to crash the system is remote.

Although the WSEst may fall below this value of 20 pages, no process should be forced to have a WSEst of less than this; this guarantees that a process will always be able to obtain enough pages to complete execution, though not necessarily immediately.

The maximum value of the WSEst will relate to each user's memory limit. When that limit is reached the operating system will not increase the WSEsts of any process being run by that user.

Because of the difficulty of a user or compiler determining an appropriate WSEst the option taken in this project is that the system will dynamically alter the WSEst according to the requirements of the process at any particular time. Restricting a process to fewer pages than the minimum has the potential to prevent the process from continuing and cause immediate thrashing within that process. The VMOS paging algorithm [Fogel,74] requires a minimum WSEst value of 8. This is fairly close to the value just calculated and so supports its reasonableness.

Thrashing

Two possible strategies for handling thrashing are avoidance and detection. "For thrashing to be avoided the level of multiprogramming must be no greater than the level at which the working sets of all processes can be held in main memory" [Lister,79]. The overuse of memory may occur globally or within a single process. On a global level there may be too many processes competing for the same memory. When a single process is forced to have too little memory it may have to keep paging to access its desired working set. At a global level, thrashing may occur when there is too little memory to contain the working sets of all active processes.

To implement the avoidance strategy process duplication and execution cannot proceed if the sum of the process's WSEst and the WSEsts of all other processes exceeds available memory. The avoidance strategy assumes that either processes do not change their WSEsts or do so under carefully controlled conditions. The WSEst will refer to the number of pages and second level page tables held by the process excluding the user structure. The total amount of memory used by a process will be:

maximum amount of memory = working set estimator + number of pages in user structure + number of its pages on the free queues.

If a process is prevented from increasing its WSEst, internal thrashing may result; on the other hand, if the WSEst is allowed to increase in size, global thrashing may occur. There are good reasons for allowing a process to change its WSEst. Reducing the WSEst allows other processes to use more memory and increasing it allows the process to avoid internal thrashing. If this option is chosen some method of detecting global thrashing is needed and some way of resolving it when it is detected.

Thrashing occurs when memory becomes overcommitted but at what point does this happen and is there an easy way to detect it? One approach is to say that when more than a certain percentage of memory is being used then some form of action must be taken. If the WSEsts are not dynamically changed then a strategy might be to go through all processes and ensure they are not using more memory than is specified by their WSEsts. If dynamic modification of the WSEst is allowed then commitment on memory may be reduced by suspending the lowest priority process and relieving it of all pages it holds in memory. Both these approaches will relieve commitment on memory.

Thrashing will be assumed to be occurring when more than a certain proportion of memory is being used. This will be referred to as the "high water mark". When the memory allocation routine detects that an allocation would raise memory usage over the high water mark a search is made for the lowest priority process which has some pages in memory. It is relieved of its memory and placed into a state which will be called "thrash suspension". The high water mark will be measured against the amount of free memory within the system, that is, all memory not occupied by the operating system.

When the memory free routine detects that the usage of memory has dropped below a point referred to as the "low water mark", it will try to resume a suspended process. The WSEst of the suspended process should not raise memory usage over the high water mark when it is restarted. It is useful to keep the value of the WSEst at the point when the process was suspended so this check can be made. The low water mark must be at least a minimum WSEst less than the high water mark, otherwise the operating system will try to restart processes when insufficient memory is available.

There is another case where the level of memory usage needs to be checked. When a process is being forked, the new child process should not be allowed to run unless its WSEst added to the current commitment on memory is less than the high water mark. To make this test there has to be some way of estimating the WSEst of the new process.

A child process is a copy of the parent and starts executing at the point where the parent requested it to be created. It is reasonable to expect that its memory requirements are the same as those of the parent at that point, except where a fork is immediately followed by an exec. When an exec is performed a completely different process structure may be expected so there is no reason to assume the memory usage will be similar to that of the parent. Two possibilities present themselves: the WSEst of the parent in the forking case, and the minimum working set size which can be enforced, which is 20 pages. In practice this may be found to be excessive and a number such as 10 pages more reasonable. Given that a fork is commonly followed by the execution of a file the second option seems more reasonable.

The current process will never be thrash suspended because the implementation becomes slightly more complex when this is allowed.

The Page Fault Routine

The page fault routine is called as a result of a page fault when an address is used which refers to a page not in memory or an operation was tried on a page without the correct permissions. The detection of these conditions is done by the memory management hardware which causes a trap. The page fault routine provides the process with the page containing the address it tried to use. It checks first that the operation is legal on that page, that the process has the right permissions.

When a page fault occurs the page fault routine determines the identity of the missing page. Each page within the system can be located through the process information of the process it belongs to and its virtual address within that process's address space. The process information includes the inode of the file containing the process image. The virtual addresses correspond to the file addresses. No translating is needed to find the page referred to by the virtual address within the file. This simplifies the task of locating the required page after a page fault or adding new pages to the disk image when either the stack or data area is extended. If a page fault has occurred, then that page is not part of the process's page set but may still reside in memory on one of the free page queues.

If a page is in memory the necessity of doing a disk read is removed. If found, it is removed from the buffer free queue and the appropriate page table entry is updated to indicate its location and valid status, otherwise it must be read from secondary storage. Its disk location must be worked out using the per process data and a buffer supplied for the read. While the read is performed the page fault routine sleeps. The page fault, being a trap, does not inhibit interrupts. When finished, the page table is updated and the process which was suspended on paging is marked active and returned to the scheduling queue.

Only the allocated data area and the stack area allocated upon execution are kept as part of the process image on disk. Normally there are large parts of a process's address space which have no corresponding disk image. As the process is allowed to use any of its address space the image must be enlarged if required. The page fault routine does not need to take any specific action to enlarge the image file. That is done more or less by default by the file system. The UNIX file system allows gaps within files which take up no physical space. Any parts of the file which do not exist physically will read as zeros, and if written to a new block will be created within the file. Thus the pagefault routine need not be aware of whether a disk block exists or not.
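
Because the virtual addresses correspond directly to offsets in the process image file, locating the page which holds a faulting address is simple arithmetic, as the following illustrative fragment shows; the names and the example address are mine.

    #include <stdio.h>

    #define PAGESIZE 512L       /* page size used throughout this chapter */

    int main(void)
    {
        long faultaddr = 0x12345L;                 /* example faulting address   */
        long fileblock = faultaddr / PAGESIZE;     /* logical block in the image */
        long fileoff   = fileblock * PAGESIZE;     /* offset of the page's start */

        printf("fault at 0x%lx: read block %ld of the image file (offset 0x%lx)\n",
               faultaddr, fileblock, fileoff);
        return 0;
    }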

pagefault()
{
    If the address was in the supervisor space
        Panic, the operating system is not allowed page faults;
    If it is a breakpoint
        Signal the process because breakpoints are not implemented and go to bad;
    If the result of a permission violation
        Signal the process and go to bad;
    Add 1 to the accumulated number of page faults for this process;
    If there is no level 2 page table {
        Allocate one (allocation updates the table entry);
        Increment the working set estimator;
    }
    Allocate a page updating the level 2 page table entry;
    Increment the working set estimator;
    Find out which inode to read the page from;
    Save the values of the variables in the user structure used by the read;
    Read the page;
    Restore the values saved prior to the read;
bad:
    If there is a signal for this process
        Process it;
    Return;
}

The strategy of pagefault() is to see if the page request is legal and supply the page if possible. No paging is done on the operating system's virtual space so a page fault for a supervisor address indicates an error in the operating system. Prudence dictates its immediate suspension. If allowed to continue it may do considerable damage.

Permission violations occur when a process attempts to write to a read only page or tries to perform any sort of operation on a page with neither read nor write permissions. A signal is sent to the process indicating that a permission violation occurred, the receipt of which will result in its demise.

The read routine uses values stored in the user structure as parameters. When a read is done as a result of a user process request these values are assumed to remain unchanged between physical read operations. Pagefault() is borrowing these data items temporarily and may be doing so in the midst of a series of read operations on behalf of a user process. Hence the need to save and restore those values before and after the read.

It should be noticed that the WSEst is incremented every time either a page or a second level page table is added to the page set by pagefault().

When page free lists, described under the following section on "Releasing Pages", are added to the system pagefault() will have to be changed to search for a requested page on the free lists before reading it from the process image file. The following changes are made to the code immediately following the allocation of a second level page table if one does not exist.

pagefault()
{
    ...
    Find which inode the page is associated with;
    Search the modified and unmodified page free lists;
        { the page can be identified by the inode and block number }
    If the page is found {
        Remove it from the free list;
        Update the level 2 page table entry referring to this page;
    }
    Otherwise {
        Save the values of the user structure variables used by read;
        Read the page;
        Restore the values saved before the read;
    }
    ...
}

Releasing Pages

During the lifetime of a program the set of pages required for its execution changes. When a page is no longer needed it is useful to release it and make the memory it was occupying available for other use.

There are four major policies for releasing pages [Denning,68]. They are random selection, first-in/first-out, least recently used and working set. The aim of any page release policy should be to release pages only when they are no longer of use. If pages are released when still of use to a process the frequency of paging increases, lowering the overall efficiency of the system.

Random selection makes no assessment of the usefulness of a page before removing it. Therefore it will frequently remove useful pages while leaving other useless ones undisturbed.

The first-in/first-out algorithm is based on the idea that programs execute sequentially. This is almost never the case, making this method a poor choice.

The least recently used method removes the pages which have remained unreferenced for the longest length of time.

The above three algorithms assume that a page is only freed when a request is made to allocate a block of memory for another purpose. This may result in memory being held unreasonably. It does not give enough information to answer such questions as "is there enough memory which isn't needed so that another process could be started?". It may work well in a single process environment, but can cause problems in a multiprocess environment. The final model, the working set model, says that all pages which are not currently useful should be released. This happens regardless of whether the memory is in demand.

There are two problems associated with this activity. One is deciding when the pages should be checked to see if they are unused. The other is how to decide whether a page is inactive.

It has been suggested [Fogel,74] that appropriate times to scan the page tables are:

1. For the task which generates a page fault where the page is not in main memory.

2. For a task each time it accumulates a certain amount of CPU time (called a micro time slice) unless it has been invoked previously for some other reason within 4000 task instructions.

3. For all tasks waiting for peripheral input/output unless it has been invoked previously for some other reason within 4000 task instructions. This is only done when the CPU is about to go idle.

When pages are freed they are placed on a modified free queue if they have been changed and on an unmodified free queue otherwise. As the modified free queue is filled up pages are written out onto disk then possibly transferred to the unmodified free queue depending on the implementation. As the unmodified free queue gets full the oldest pages are released and become free memory. Pages are placed at the head of the queue and removed from the tail. It is hoped by this means to reduce the number of disk reads which are required. The assumption is that the pages most recently placed on the free queues are most likely to be requested again soon.

A process gives up the CPU either when its time slice has expired or when it has to sleep, waiting for a resource. The first and second times for scanning the page set described above are satisfied if the page free routine is run just before the CPU is given to another process in the routine swtch(). The third time may be satisfied by making idle call the page free routine before executing the "wait" instruction which causes all CPU activity to stop until the next interrupt is received.

When a process is terminated the pages are freed and not placed on a free queue. It is known that those pages will not be used again so they need not remain available.

I decided not to maintain an explicit free list initially but to use the system buffer handling to partially simulate a free queue. Free lists may be implemented later with no major changes to existing code. If a block is still associated with a buffer in memory it can be retrieved without doing a physical read. When a modified page is written it can be specified as a delayed write, which means it remains in memory and is not written until the buffer is required for some other purpose. This provides some delay between the time a process releases a modified page and when it would have to do a physical read to retrieve it. In the case of unmodified blocks this is a disadvantage. If there is no unmodified free queue it is not possible to reclaim the page. It must be reread.

pfreelist( pointer to a process structure )
{
    For each valid second level page table entry {
        If it hasn't been referenced or modified
            Release the memory it holds;
            Decrement the WSEst;
        If referenced (regardless if modified or not)
            Reset the referenced flag;
        If modified but not referenced
            Schedule the page for a delayed write;
            Free the memory it holds;
            Decrement the WSEst;
    }
    If there were no valid entries in the second level table
        Release it and invalidate the level 1 entry;
}

Pfreelist() assumes that if a page has been referenced since pfreelist() was last called it must still be in the working set. If it has not been referenced it is assumed not to be in the working set, that is, it has not been referenced since the last time pfreelist() was run resetting the referenced flag. It is able to be released immediately except if it has been modified. If a page has been modified the process image file must be updated by having the page written to it before being released.

The following routines show the modifications required to pfreelist() to implement free page lists. This has not yet been done.

pfreelist()
{
    For each valid second level page table entry {
        If it hasn't been referenced or modified
            Call addumod() to place it on the unmodified free page queue;
            Decrement the WSEst;
        If referenced (regardless if modified or not)
            Reset the referenced flag;
        If modified but not referenced
            Call addmod() to place it on the modified free page queue;
            Decrement the WSEst;
    }
    If there were no valid entries in the second level table
        Release it and invalidate the level 1 entry;
}

This version of pfreelist() varies from the previous one in the way pages leaving the working set are treated. Instead of being freed they are placed on a list where they remain until they are pushed off the list by other pages being released or they are reclaimed to enter the working set again.

The free page lists must be long enough to be able to hold a number of pages for each running process. If they are not then the pages for a particular process will have been flushed from the list before the process becomes active again. This would make the list useless. If X is the average number of pages leaving a process's working set after each time slice then the list should have at least X * N elements to function effectively where N is the number of processes allowed.

The following two routines, addumod() and addmod(), maintain the unmodified and modified free page queues respectively. When a list is full, the oldest page is freed. For the modified free queue it is scheduled to be written as well.

addumod( page table entry )
{
    If the unmodified free queue is full
        Release the oldest page from the front of the list;
    Add the new page to the end of the list;
}

addmod( page table entry )
{
    If the modified free queue is full {
        Schedule the oldest block for writing;
        Release the memory held by the page;
        Remove the entry from the modified free queue;
    }
    Add the new page to the queue;
}

Raw buffering is not yet being used which means a page is copied into a buffer when it is scheduled to be written. This means the memory where the page was can be released before the page is physically written. If raw buffering was used the release would occur after the write had been completed.

Switching Between User Processes

User processes are changed by specifying that a different set of page tables is used to interpret virtual addresses. To change processes the page table 1 register in the CPU must be loaded with the page frame number, which is the real address, of the new process's level 1 page table. After that all address translation in user mode is done with reference to the new tables.

The system stack and user structure are changed each time processes are switched. This is done by mapping the new stack and user structure into the location occupied by the previous user structure and stack. Only the user structure of the currently executing user process is mapped into the operating system's address space at any time. This means that the user structure and system stack always occupy the same virtual addresses. The page table entries for the user struct are kept in the proc structure. Mapping a user struct into the operating system's address space is simply a matter of copying these entries into the appropriate places in the operating system's second level tables. The MMU's cache must then be cleared to prevent it from using any old translations and force it to translate using the modified tables. This operation is done in the assembly routine resume().

resume( pointer to proc structure, pointer to register save area )
{
    For the four level two page table entries covering the user structure
    and system stack {
        Move the saved user structure page table entry into the correct
        location in the supervisor page table;
        Remove the entry for that virtual address from the memory management
        unit's cache;
    }
}

The saved page table entries are the first items in the proc structure, making them easy to locate. The locations in the second level page table where they are placed are stored in a four element array initialized during system startup. It holds a pointer to each location. These addresses can be calculated using the virtual address of the user structure, but it was decided not to do this because it is slower. Context switching is usually an area where the utmost speed and efficiency is desired.

Housekeeping

At various times in the life of a process it is necessary that the process image file be updated by writing any pages in memory belonging to that process which have been modified. This may be achieved by calling a routine which first writes all modified pages and then, if requested, forces a physical write of all delayed write buffers for the specified device. This may result in more work than necessary because there may be delayed write buffers for the paging device which have nothing to do with the process in question.

A common use of this routine is immediately prior to duplicating a process during a fork. In this case a physical write is not required. The physical write option allows for the later extension of the operating system to allow processes to be indefinitely suspended and resumed, even over a reboot.

psync( pointer to process information, physical write flag )
{
    If there is no data file associated with this process, as can happen
    during the creation of process 1 on startup,
        Return;
    Search all the valid entries in the second level page tables
        If the page has been modified {
            Save the values of the variables in the user structure used by write;
            Do a delayed write of the page;
            Restore the values saved before the write;
            Reset the modified flag in the page table entry;
        }
    If the physical write flag is set
        Flush the swap device causing all delayed write blocks for that
        device to be written immediately;
}

The write() routine uses values stored in the user structure as parameters. When a read or write is done as a result of a user process request these values are assumed to remain unchanged between physical read or write operations where more than one operation is required to service the user request. Psync() is temporarily borrowing these values and may be doing so in the midst of a user read or write request, hence the need to save and restore those values before and after the write.

The routine ptfree() releases all pages and second level page tables held by a process. It is used during the destruction of a process either on its termination or during an exec() system call. Exec() releases all memory held by a process before copying the executed file directly into virtual space. Ptfree() traverses the page tables releasing all pages held in each second level table before releasing the second level table itself.

ptfree( pointer to a level one page table )
{
    For each valid page table entry in the first level page table {
        For each valid second level page table entry {
            Flush the page table entry from the MMU cache;
            Release that block of memory;
        }
        Clear the cache entry for the second level page table;
        Release the memory held by it;
    }
}

3.5.5 Transferring Data Between Virtual Spaces

It is necessary to move data between user and system virtual space in several circumstances. During the execution of a file the new process image is copied from the executed file into user space. The file read occurs in system space. During system calls parameters are collected from user space. Afterwards the results are returned from system space. The routine copyin() moves a specified number of bytes from user space into system space. The routine copyout() works in the reverse direction, moving data from system space into user space.

copyin( from, to, number of bytes )
{
    While the index is less than the number of bytes to be copied {
        Perform the bug avoidance instruction;
        Move one byte from user to system space;
        Increment the index;
    }
    Return zero;
}

copyout( from, to, number of bytes )
{
    Move the parameters into registers;
    While the index is less than the number of bytes to be copied {
        Perform the bug avoidance instruction;
        Move one byte from system to user space;
        Increment the index;
    }
    Return a value of zero;
}

These two routines may generate page faults for one of two reasons. Either the page of user space is not in memory or the user does not have the required permissions on the page. In the first case the page is provided and the operation retried. The second case may arise when an invalid pointer into user space is provided by the user process. A signal is sent to the user process resulting in its demise.

The routines copyin() and copyout() are written in assembly language to enable them to use the special instructions which move data between user and supervisor virtual memory space. These instructions do not work correctly due to a bug in the CPU. The effects of the bug can be avoided by inserting an extra instruction immediately before the move user to supervisor or move supervisor to user instruction. The inserted instruction, "movb tos,tos", is a null operation. The "to" and "from" areas are treated as arrays to enable scaled indexing to be used.

There was no indication that bugs existed in the chip set when it was purchased. About three months were spent, first isolating the problem and then contacting a representative of National Semiconductor to obtain the solution. The task of finding the problem was complicated by the knowledge that the compiler/assembler, the board, the CPU and the slave processor chips had all had problems before the time this problem was being investigated and therefore were all suspect. Investigations were further hampered by the fact that the "movsu" and "movus" instructions did not fail consistently; sometimes they worked.

After being satisfied that the assembler was not at fault, the CPU and MMU chips were investigated using a logic state analyser. The results obtained by that means while running some specially written test programs led to the conclusion that the instructions which transferred data between user and supervisor spaces were at fault.

A special case of moving data between user and supervisor memory spaces is the use of push and pop routines to put a value on the user stack or remove a value from the stack, returning it to supervisor space. They use copyin() and copyout() to do the moving of data but also modify the stored value of the user stack pointer. The user stack pointer is saved whenever a system call is started and restored when returning to user mode.

pushusw( value to be pushed onto the user stack )
{
    Decrement the stored value of the user stack pointer;
    Copy 2 bytes from the address of the parameter to pushusw() to the
    address given by the stored value of the user stack pointer;
}

popusw()
{
    Copy 2 bytes from the address given by the stored value of the user
    stack pointer to a variable local to popusw();
    Increment the stored value of the user stack pointer;
    Return the value popped from the user stack;
}

The routines shown above push and pop word length values. Two other routines, pushusd() and popusd(), work the same way but with double word length data items. Because these routines use copyin() and copyout() they have no need of the special instructions to move data between user and supervisor space. They are able to be written in C rather than assembly language.

3.5.6 Management of Real Memory

Real memory is allocated and freed as it is acquired as a result of a page fault or released by a call to pfreelist() during a switch or the termination of a process. Memory is allocated and freed in 512 byte blocks. Both routines use page table entries. The free routine gets a block number from a page table entry, frees the block and invalidates the page table entry. The allocate routine allocates a free block and creates a page table entry for it. The page table entry is set up with access permissions being set to read and write for both user and supervisor modes. It is the responsibility of the calling routine to change those permissions if something different is required. At a later stage malloc() could be changed to accept the required permissions as a parameter.

Several modifications were made to the way the allocation and free routines work. It was noticed that the change to demand paging meant that only a single page of real memory was allocated at a time. The allocation and freeing routines were simplified slightly, removing the ability to allocate multiple blocks. The structure of the VAX MMU and the NS16082 is different. If multiple block allocations were allowed the routine would have had to be changed to take into account the structure of the page tables. Under the VAX MMU single level page tables are maintained. It is possible to allocate multiple blocks by placing new page table entries one after the other. With the two level page table of the NS16082 the existence and location of the second level page table would have to be determined before allocating each block. It seems much easier to let the pagefault handling routine worry about the existence of level two page tables.

A bit map of real memory is maintained indicating which page frames are in use and which are free. The sections of code which set and reset the bits were based on an integer size of four bytes and had to be changed to use an integer size of two bytes.
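
The kind of bit manipulation involved is sketched below using two byte integers; the map size and the routine names are assumptions made for illustration, not the actual code.

    #include <stdio.h>

    #define NFRAMES 4096                 /* page frames of real memory (assumed) */
    #define BITS    16                   /* bits per two byte unsigned short     */

    static unsigned short coremap[NFRAMES / BITS];

    void setbit(int frame)               /* mark a page frame in use */
    {
        coremap[frame / BITS] |= (unsigned short)(1 << (frame % BITS));
    }

    void clrbit(int frame)               /* mark a page frame free */
    {
        coremap[frame / BITS] &= (unsigned short)~(1 << (frame % BITS));
    }

    int tstbit(int frame)                /* is the page frame in use? */
    {
        return (coremap[frame / BITS] >> (frame % BITS)) & 1;
    }

    int main(void)
    {
        setbit(100);
        printf("frame 100 in use: %d\n", tstbit(100));
        clrbit(100);
        printf("frame 100 in use: %d\n", tstbit(100));
        return 0;
    }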

The next version of the routines malloc() and mfree() will detect potential thrashing and take some action to relieve the demand on memory.

malloc()
{
    If this allocation takes the usage of memory over the high water mark {
        If a process can be found which is sleeping or stopped and is
        holding some memory
            Free all the memory held by that process;
        Otherwise find the lowest priority runnable process holding some
        memory {
            Mark it as being thrash suspended;
            Free the memory it is holding;
            Increment the count of thrash suspended processes;
        }
    }
}

In choosing which process to suspend, the current process and any process "locked against suspension" will not be considered. Process zero and process one, the initialization process, will be in the "locked against suspension" category.

The easiest way of getting memory is to take it from a process whose execution is suspended, knowing that the process does not need any memory at this time. There is a serious problem with changing the status of a sleeping process as would be done if it were thrash suspended. Not only would the previous status of the process need to be remembered, not a serious problem in itself, but if a wakeup were done while the process was thrash suspended rather than asleep it would be missed. This can be solved either by modifying the routines sleep() and wakeup() or by not placing the target process in a thrash suspended state and hoping that by the time it wakes up memory will not be as heavily in demand. If memory is still in heavy demand then an active process will be forcibly thrash suspended when no sleeping or stopped processes are holding any memory.

The process chosen to be thrash suspended, if no suitable sleeping or suspended processes are found, will be the lowest priority active process: the process with the maximum value of "nice" plus the time since its last thrash suspension or its creation, whichever came later. This is the same basis on which the scheduler used to choose a process to be swapped out before swapping was removed.

mfree()
{
    If some processes are thrash suspended and memory usage is below the
    "low water mark" {
        Find the highest priority suspended process {
            Reset its thrash suspended flag;
            Decrement the count of thrash suspended processes;
        }
    }
}

A field exists within the process information which is updated once a second by the clock to keep a count of the number of seconds a process has been in memory or swapped out. That information is no longer required so the field will be used to measure how long the process has been thrash suspended. The length of time a process has been suspended and its scheduling priority will be used to determine which process will be restarted by mfree().

3.6 Process Management

The life of a process starts when it is created by its parent performing a fork. New processes are created by two system calls which are used together. The first, fork(), creates a copy of the process which called it; the new process is an exact copy of the parent but is able to distinguish itself from the parent by looking at the value returned by the fork() system call. The child returns from fork() with a value of zero while the parent returns with the child's process id. The second call, exec(), overlays the new process with the code to be executed: it throws away the old process image and takes the new image from a file given as a parameter to exec(). Exec() is normally called from the child process. The system call exit() is the normal way to terminate a process.
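
For illustration, a minimal user level program using this pair of system calls might look as follows; the program chosen for execution and the error handling are simply examples.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid = fork();                  /* create a copy of this process */

        if (pid == 0) {                      /* child: fork() returned zero */
            execl("/bin/ls", "ls", (char *)0);   /* overlay the child with a new image */
            perror("execl");                 /* reached only if the exec fails */
            exit(1);
        } else if (pid > 0) {                /* parent: fork() returned the child's id */
            int status;
            wait(&status);                   /* sleep until the child terminates */
            printf("child %ld finished\n", (long)pid);
        } else {
            perror("fork");                  /* fork itself failed */
        }
        return 0;
    }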

If the process creates children it has the option of ignoring them or waiting until they terminate by using the wait() system call. As the CPU is switched between active processes any process will be suspended and restarted a number of times during its lifetime, its state being stored by the routine save() and reloaded by the routine resume(). The process may also be suspended, put into a "sleeping" state, by using the pause() system call or if it has to wait for a resource or an input/output operation. It may also be suspended by its parent tracing it using the ptrace() system call. Finally the process may terminate by calling exit(). Alternatively, if a fatal error occurs the process may be terminated by the operating system calling exit() on its behalf. The life of a process is not quite over when exit() is called. At that point the process enters the "zombie" state where all that exists of the process is some of its per process data, which remains until it is picked up by the parent process, which is either the process which created it or process 1 if the other has since died.

The changes to process management result mainly from three circumstances: the changes made to memory management, the use of a different compiler (whose executable files and process images have a different structure), and the changes in the way processes are represented.

The code handling shared text was removed. An alternative method of handling shared text is presented in chapter 4 "Proposed Extensions".

3.6.1 Processes under Unix Version 8

Unix version 8 [Killian,85] provides a new directory called "/proc". Each file in this directory corresponds with the process image of a running process. This representation appears to have been motivated by the limitations of interactive debugging under previous versions of Unix where the debugger and the object being debugged are separate processes and it is difficult to pass information between them.

The owner of the process image file is the same as the owner of the process. When the file is created the permissions allow reading and writing for the owner only, which is reasonable as it follows normal file creation procedures. The size of the file is the size of the virtual space of the process. A feature which is particularly useful for debugging is the ability to read and write any part of the process's virtual space using the read(), write() and seek() system calls.

The version 7 Unix address structure is maintained in Version 8. The user process occupies addresses in the lower portion of the address space, in this case from 0x00000000 to 0x7FFFFFFF. The system occupies addresses starting from 0x80000000 (See figure 3.1).

This project uses the idea of representing processes as files but it is difficult to determine how much it differs from Version 8 with the information available.

What first attracted me to the file representation of processes is that it makes processes behave more like the idea of what a process is, a true "virtual process", making the representation of virtual spaces more logical and easier to understand. It also makes the implementation of some interesting extensions to the operating system possible without an excessive amount of extra work. Some extensions are described in chapter 4 "Proposed Extensions".

3.6.2 The Representation of Processes

A process consists of a collection of data about the process used by the operating system, called the "per process data", and a process image containing the universe as visible to the process itself.

Per process information - proc and user

All the information about a process required by the operating system is divided into two parts: a part which must remain resident in memory and another which need not. The resident information is kept in the proc structure and the other information is in the user structure, which under a swapping regimen is swapped with the process. Each process has a system stack associated with it which occupies the upper part of the user structure.

There were a number of changes to the process information. As the proc structure was changed, the corresponding structure for information belonging to a "zombie" process was also changed. The field holding flags indicating what signals were pending on the process was defined as an integer although it was assumed to be 4 bytes long: the change to 2 byte integers under ACK required that it be changed to a long integer. Several new fields were added as a result of representing processes as files and changing memory management.

How a Process Sees Itself

Each process sees itself as having a 16MByte linear address space beginning at location 0. It does not share the address space with any other process.

Within the address space the text occupies space starting from location 0. Above this lies the data area at an address specified by the linker. The stack starts at the top of the address space. The data and stack areas are allowed to expand into the empty space which separates them, the data expanding upwards and the stack downwards.

Any part of the user process may be paged to and from memory. The user process is unaware of this.

The Process Image on Disk

The disk process image is maintained as a normal Unix file, whose maximum size is 16MByte. The normal file handling routines are used to access, create and delete these files. This eliminates the system of keeping a swap map of available places within a very large contiguous block of disk space which is used for swapping. The image is created on disk during a fork() or exec() and is maintained until the process dies. Under the previous system a process image is either in memory, swapped out onto the swap device, or partially swapped, in which case one or more segments are swapped leaving other segments in memory. There was no way of swapping part of a segment.

In my system, a process is represented by a single file containing the text, data and stack areas. The address of any object within the process image file is the same as its address within the virtual space which simplifies the location of a virtual address within a file.
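
Because the file offset of an object equals its virtual address, translating an address into a logical block of the image file needs only a division by the block size. The fragment below is a sketch of that calculation; the function name and the assumption of 512 byte blocks are mine, not the thesis's:

#define BSIZE 512L                     /* file system block size assumed here */

/* logical block of the process image file backing a virtual address */
long vtofblk(unsigned long vaddr)
{
    return (long)(vaddr / BSIZE);      /* file offset == virtual address */
}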

The Operating System

The operating system is a process, but is not represented in the same way as a user process. There is no per process data for the operating system, it does not have a single stack but one for each active user process, and it is not represented by a process image file but remains in memory all the time.

The operating system will be permanently resident in real memory just above the start of RAM. The range from 0x0000 to 0x7FFF is occupied by EPROM which contains a monitor. The area from 0x9000 onwards is used by Unix. It is possible to page some areas of the operating system's data and code, but for simplicity this has not been allowed. The operating system is not permitted to reference its entire address space, but only those parts mapped to memory during startup. The operating system's virtual space contains all of real memory, the input/output area, and the interrupt control unit's registers. A page fault on a supervisor mode address is fatal because the operating system is initially given all virtual addresses it will need during its lifetime.

Unlike the Unix used as a source for this port the new version has separate system and user virtual spaces (See figure 3.2). Both spaces start at virtual address zero and are 16MBytes long. It made kernel debugging during the development stages easier because kernel virtual addresses were made to correspond to real memory addresses. The other advantage is that the virtual space available to the user is larger if none of the virtual space has to be given up to the operating system.

3.6.3 Process Duplication

Fork() does a number of checks before trying to replicate the process to ensure the user will not exceed memory and process limits. Originally it checked that sufficient swap space existed by allocating the swap space then deallocating it, but this check was removed when changing to paging. Only the super user can take the last proc structure, so normally there must be at least two proc structures available for a process to fork.

The first thing newproc() does is duplicate the per process information by searching the proc array for a free entry and copying information from the parent's proc structure into it. Newproc() assumes that the calling process has checked for the existence of a free entry. A new copy of the process itself is made through a call to procdup().

After the process has been created it is placed on the run queue. If insufficient memory is available there is the danger of thrashing, so the new process goes into a thrash suspended state rather than immediately becoming active. It will be removed from the suspended state when the danger of thrashing is deemed to be past. The amount of memory available no longer affects the way the fork is done, but insufficient disk space may cause the fork to fail. Procdup() creates a new file in the directory reserved for process images. The existing process image is made up to date by making sure any modified pages have been written out. The process image can then be copied into the newly created file.

The routine procdup() is used to copy the text, data and stack segments. If it fails to duplicate the process due to insufficient real memory then the new process is created by swapping out the parent but leaving its image in core. This results in two copies of the process, one being the parent and the other the child.

Since processes are represented as files, a new file must be created in the "/procs" directory when the process is forked. The contents of the child's virtual space are copied from the parent's process image file.

Because the child process shares the resources of its parent, several reference counts are incremented. Since the child inherits the same open files as the parent, the reference counts of the corresponding inodes need to be updated. The registers are saved so the newly created process will resume at a specific point in newproc() and immediately return to the calling routine with a value of 1. The new process is given a cleared first level table. The process gains second level tables and page frames during execution.

The replication of per process data done by the routine newproc() remains essentially the same except for setting up the few new fields in the process information. The duplication of the process itself is much changed.

procdup( process to be duplicated ) {
    Call psync() to make sure the process disk image is up to date;
    Create a directory entry and inode for the new data file, or truncate the file if it already exists;
    Copy the old file to the new one using copyfile();
    Unlock the inodes used;
    Return a value indicating success;
}

In the case where a process image file already exists for the new process number, it cannot be the image file for a currently active process because newproc() guarantees that process numbers are unique, so it is assumed that the contents of the file can be safely deleted. The most obvious cause of such an occurrence is that a process image file remained after a system crash.

The routine psync() is described in section 3.5 "Memory Management".

The file system's ability to maintain files with gaps in them is useful here, as blocks only need to be allocated for parts of the file which are being used, leaving unused parts of the file without physical storage associated with them. If a block is written to a part of the file which doesn't yet exist a new block will be added to the file. Such locations are read as zeros.

When copying a process image a distinction needs to be made between blocks which do exist and blocks which do not. If a block does not exist there is no point in trying to copy it, creating unwanted blocks in the copied file. A process image will always have at least one large gap between the data area and the stack. Allocating disk blocks for addresses within the gap would use a large number of blocks which would never be required. Although the existing routine is able to detect when a block is non-existent, it returns a buffer filled with nulls; there is no way of finding out whether it is a real block or whether it is non-existent. It is impossible, without checking every byte in the block, to decide whether the block should be copied or not. For this reason, to enable an exact copy of a file to be made, a new procedure called copyfile() was written which makes an exact copy of a file and maintains the gaps in the original file.

copyfile( from inode, to inode ) {
    If the "from" inode does not exist ( null pointer )
        Return;
    For each of the direct blocks in the inode
        If the block exists ( not a null entry ) {
            Read in the block;
            Allocate a new disk block in the "to" inode;
            Write the block just read into the new block;
        }
    For each of the three indirect block indices in the inode
        If the block exists {
            Call the routine indcopy(), giving it the index level and the index block
            number, to make a copy of that portion of the file. It returns the block
            number of the new index block, which is assigned to the new inode;
        }
}

Indcopy() is a recursive routine which copies indirectly indexed file blocks while creating new index blocks.

indcopy( device, block number of index block to be copied, index level ) {
    Read the index block given as a parameter;
    Allocate a block to be the new index block;
    Step through the index block {
        If there is a valid block number {
            If the index level is non-zero, it refers to an index block {
                Call indcopy() recursively to copy that indirect block, passing the
                index level minus 1 as a parameter;
                Assign the block number returned to the new index block;
            }
            Otherwise, since it is a leaf block {
                Allocate a new block;
                Copy the contents of the original into it and write;
            }
        }
    }
    Write out the index block just created;
    Return its block number;
}

The maximum level of indexing is three levels before the file block number is obtained, which limits the depth of recursion of indcopy() to three recursive calls. Since the system stack is non-expanding it is important to know it will not overflow, causing the operating system to crash.

3.6.4 Process Execution

A file is executed as a result of a process performing the "exec" system call. If there are any parameters for the program to be executed they are placed in a buffer as a temporary holding place until the process has been recreated. After the parameters are read into supervisor space the existing user image is disposed of and a new image is created using the contents of the executed file. Finally the parameters are copied onto the user stack so they will be available when the transformed process begins execution.

The exec() system call can be made with or without an environment, which consists of the values of global variables set at the time the program is executed, such as shell variables if the program was executed from the shell. If it is made without an environment, exec() is called, which sets the environment pointer to null and calls exece() to do the processing. Exece() is called directly if an environment is required.

Exece() calls gethead() to read the header information, such as segment sizes, from the executable file. It returns a pointer to the inode of the executable file. The function of getxfile() is to get rid of the old process image and set up a new one, reading the text and data from the file to be executed directly into the virtual space. Finally, after the parameters have been copied onto the user stack, the registers and a number of other values are initialized using setregs().

Only the routines getxfile() and setregs() contain significant changes.

getxfile( inode pointer ) {
    Call ptfree() to release all memory held by the process;
    Release all disk blocks held by the process image file by truncating the image file;
    Copy the text area from the executed file directly into the process virtual space;
    Read the data area from the executed file into virtual space;
    Set the text, data and stack sizes in the user structure;
}

setregs() {
    Set signals to suitable values;
    Set general purpose registers, stored on the stack, to zero;
    Set the return address, module table address and PSR register values on the stack
    ( they are stored on the system stack );
    Close any files which were open;
}

The general purpose registers, program counter, module table register and processor status word which will apply to the process when it begins execution in user mode are stored on the user stack. The values applicable to the previous contents of the process are on the system stack.

Setregs() changes these stored values as well as closing any open files and clearing the alternative routines specified for processing signals. The stored values for the general purpose registers are set to zero. The module table is assumed to be the first thing in the text area and the code is assumed to start straight after the module table. The processor status word is set to specify that the user stack pointer is to be used, that the process is in user mode and that interrupts are allowed.

3.6.5 Process Termination

Among the responsibilities of exit() in terminating a process are releasing all the resources held by the process, notifying the parent process if it is waiting for this process, closing any files the process has left open, and setting up the information which will be returned to the parent process. The process is then placed in a state known as a "zombie" state until the parent picks up the information left by the child, whereupon it is completely removed.

Only changes to the routine exit() are shown here. Most of its function remained unchanged through the port.

exit( process ) {
    Call delnode() to delete the directory entry of the process image file and truncate the file;
    If this is process 1
        Force all delayed write buffers to be written without waiting for completion.
        This must be done before the system stack belonging to this process is released;
    Call ptfree() to release all memory associated with the process image;
    Free the pages associated with the user structure;
}

The user structure pages are freed using the copies of the page table entries held in the proc structure so that the mapping of those pages will remain valid in the page table until the process is switched out. Any activities which may result in the process being suspended, and thus switched out, must occur before this point, otherwise the process cannot be resumed because the user structure will not be able to be mapped into the operating system's virtual space.

If process 1 is being terminated, all delayed write buffers must be written out after the process image file has been deleted and before the system stack is freed. If this is not done after the process image file is deleted, those changes will never be written to disk.

The routine delnode() removes a directory entry and reduces the number of references to the inode corresponding to the directory entry. It works much the same as the unlink() system call except that it makes up the file name using the process number rather than reading it from user space.

delnode( inode pointer, suffix ) {
    Call procname() to construct the process file name;
    Call a routine to find the inode of the directory in which the entry is to be
    deleted, and the entry itself;
    If there is an error {
        Return, as the file cannot be deleted;
    } else {
        Change the inode in the directory entry to zero;
        Write out the directory entry;
    }
    Reduce the number of links to the process image inode;
    Reduce the reference count on the process inode and the parent directory inode;
}

The routine which finds the directory inode also finds the directory entry allowing the process image file inode to be obtained from the directory rather than requiring it to be passed as a parameter.

Procname() is a simple routine which constructs a file name consisting of five digits giving the process number, and a suffix of ".d" indicating the file contains the process image.
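
A sketch of what procname() produces, with an invented buffer interface; process 42, for example, yields the name "00042.d":

#include <stdio.h>

void procname(char *buf, int pid, const char *suffix)
{
    /* five digit, zero padded process number followed by the suffix, e.g. ".d" */
    sprintf(buf, "%05d%s", pid, suffix);
}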

3.6.6 Creating a Core Image

Several types of abnormal termination require the production of a file containing an image of the process at that time. The text, data and stack areas are written from the virtual space into the "core" file which is created in the process's current directory. The routine which performs this task does not need any knowledge of either the changes in process structure or memory management, so it was not changed. The per process information is written to the file after the stack, so as to be available for post mortem debugging. The "dead" process is cleaned up in the usual way by calling exit().

3.6.7 Process Switching

When a process is suspended all values which may be lost when another process uses the CPU must be saved so they can be restored when the process is restarted at some later point. The values which must be saved consist of the hardware registers for the CPU and the ICU mask register. This is done by the routines save() and resume(), the values being kept in the user structure.

Save() and resume() require a number of special instructions not available in the compiler and are accordingly written in assembly language. Save() stores the values of the registers in a specified area so that when a resume() is performed it can restore the register values so that it looks like a return from the original save(). The two returns can be distinguished by the value returned.
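
The calling pattern is the same as that of the standard setjmp() and longjmp(); the runnable user-level analogue below (not kernel code) shows how the two returns are told apart by the value returned, exactly as a return from save() is distinguished from a return via resume():

#include <setjmp.h>
#include <stdio.h>

static jmp_buf save_area;

int main(void)
{
    if (setjmp(save_area) == 0) {
        /* direct return, value zero: the state has just been saved */
        printf("state saved, transferring control\n");
        longjmp(save_area, 1);          /* analogous to resume(): restart from the save */
    } else {
        /* second return, non-zero value: control arrived via longjmp() */
        printf("resumed from the saved state\n");
    }
    return 0;
}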

Three different save areas are used under different circumstances. One is used when duplicating a process and stores the register values for the new process to start with. If a process is being resumed after another process is switched out and it is a new process, the register values are taken from this save area; otherwise they are taken from the save area normally used when switching processes. The third save area is used to save register values for performing long jumps within the kernel.

save( address of save area ) {
    Disable interrupts;
    Put the address of the save area in r0 to use as a pointer;
    Save the return address;
    Save the value of the user stack pointer and make it point to the top of the save area;
    Save the values of the special registers by writing them to the top of stack.
    The frame pointer, MOD, PSR, and ICU mask registers are saved here;
    Use the "save" instruction to store the values of the general purpose registers;
    Restore the value of the user stack pointer;
    Save the value of the kernel stack pointer in the save area;
    Enable interrupts;
    Return zero;
}

The user stack pointer is used as a pointer into the save area. The benefit of this is to enable the "save" instruction to be used in storing the values of the general purpose registers so all of them may be saved using a single instruction. By using the "top of stack" addressing mode the stack pointer can be automatically decremented stepping through the save area as values are stored.

The order in which the values are saved is not very important except that the user stack pointer has to be saved first if it is used as an index into the save area.

Register zero does not need to be saved as it is used to return a value from the save() and resume() routines, making the original value irrelevant.

resume( pointer to a proc structure, address of save area ) {
    Disable interrupts;
    Get the two parameters off the kernel stack and store them in r0 and r1 because
    they would otherwise be lost when the kernel stacks are changed;
    Load the page table 1 register from the proc structure so the new level 1 page
    table will be used in address translation from this point on;
    For each kernel page table entry referring to the user structure {
        Change the page table entry to refer to the user structure of the process
        to be switched in;
        Remove the old entry from the MMU cache;
    }
    Position the user stack pointer at the top of the save area, but don't bother
    saving its value because it will be getting a new value soon anyway;
    Use the "restore" instruction to load values into the general purpose registers;
    Restore the ICU mask register, PSR, MOD, and FP registers by moving from
    "top of stack" to the appropriate register;
    Restore the user stack pointer from the stored value;
    Change to using the kernel stack pointer and assign its value from the save area;
    Place the stored return address on the top of stack;
    Enable interrupts;
    Return 1;
}

Before new values can be loaded into registers the existing user structure, belonging to the process being switched out, has to be replaced by the user structure belonging to the process being resumed. The parameters to resume are saved in registers before this otherwise they would be lost when the user structures, which contain the kernel stack, are exchanged.

As with save(), the user stack pointer is used as an index into the save area.

The following routines are the kernel versions of setjmp() and longjmp(), used for non-local gotos, which are different from the versions supplied in the compiler runtime library. A non-local goto is used during system calls so that a process receiving a non-fatal or expected signal during the system call appears to be returning from the system call with an error status.

setjmp( save area address ) {
    Save the return address from the stack;
    Save the kernel stack pointer and frame pointer;
    Use the kernel stack pointer to save the general purpose registers in the save area;
    Restore the kernel stack pointer;
    Return zero;
}

longjmp( address of save area ) {
    Restore the frame pointer;
    Using the kernel stack pointer, restore the general purpose registers;
    Load the kernel stack pointer value from the save area;
    Load the return address onto the top of the stack;
    Return 1;
}

3.7 System Calls

A system call allows a process to take advantage of functions which can only be provided by the operating system running in supervisor mode. There are two areas in which system calls are likely to change during a port. The interface between the users and the operating system may change, in particular the way parameters are passed and results are returned. Changes in hardware or changes in other subsystems within the operating system may lead to changes in the way system calls have to provide their services. Even with these changes most of the code for system calls remains the same. Where a system call makes use of code belonging to another subsystem, the reader is referred to the section on the appropriate subsystem for an explanation of the changes made.

Major changes affecting the way system calls provide their services arise from changes in the structure of processes and executable object files.

The runtime library provided for the NS16032 in ACK uses a different method of passing parameters from that used by the system being ported. The original method was to use a combination of values passed via registers and parameters resident in the user memory space. Values are returned via registers. The new method passes parameters to the system calls on top of the user stack and returns values there also. All parameters for all system calls are handled in the same way.

Two procedures are provided: popusw(), which passes parameters from the user stack to the system call, and pushusw(), which returns results on the user stack. These two routines are described in section 3.5 "Memory Management".
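
As a rough sketch of how a system call handler would use them (the signatures shown are assumptions, and the handler below is invented; the real routines are described in section 3.5):

extern int popusw(void);               /* assumed: pop one word from the user stack    */
extern void pushusw(int word);         /* assumed: push one result word onto the stack */

/* hypothetical handler for a call taking one argument and returning one value */
void sys_example(void)
{
    int arg = popusw();                /* fetch the single argument left by the caller */
    pushusw(arg + 1);                  /* leave the result where the library expects it */
}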

Because the changes to parameter and result passing were fairly straightforward in most cases, all the system calls were modified even though not all of them have been tested and included in the kernel. The modifications to parameter and result passing only took a few days.

Not all the system calls are required to make the kernel usable, so the system calls which were considered essential were chosen to be modified and tested, leaving the others until after the rest of the kernel had been completed. The system calls chosen fall into the areas of file and process manipulation. The first group of system calls to be tested were the ones associated with process control: fork(), wait(), exec() and exit(). These required more extensive modifications than any other system calls. After them, the system calls relating to file creation, mknod() and creat(), were tested along with the system call which forces all delayed write buffers to be written, sync(), and the change file mode call, chmod().

3.7.1 System Calls Deserving Special Mention

The following system calls required modifications in addition to the changes to parameter and result passing. The system calls described here are all strongly related to section 3.6 "Process Management" where the routines referred to are described.

Exit

A process uses exit() when it wishes to terminate and have its parent notified. One of the main functions of exit() is to release the resources held by the process so they may be reused. The representation of processes as a disk file with some blocks resident in memory requires that both the memory resident blocks and the disk blocks are released. The file is deleted using existing file routines while page tables and pages in memory are released using a new routine, ptfree().

If the process has a process image file it is removed. The routine ptfree() releases memory held for the process image and second level page tables. The first level page table memory stays allocated to the proc structure permanently, because it needs 1KByte of contiguous memory which the memory allocation routine cannot provide.

The last thing to be freed is the memory held by the user structure. This does not actually remove those pages from the system image; they are still in use until a switch is done, but are marked as free. A switch cannot be performed in system mode unless specifically requested by a sleep or similar. It is important that no switch is done until the process has been completely terminated, otherwise the user structure, with the system stack inside it, will be completely lost since the memory has been freed using the stored copy of its page table entries.

Fork

Fork() provides access to the kernel routine which creates a new process which is a copy of the old one. Most of the changes required for fork to work in the new environment are made to procedures called by fork().

The test for sufficient swap space has been removed. It may be wise to test for sufficient disk space before creating the image files, and check for the availability of sufficient real memory, however this is not currently done.

Exec

During exec(), any parameters to the new program are stored in a buffer which was originally obtained by allocating some of the swap area and releasing it afterwards. A special swap area is no longer used, so instead of requesting a buffer for a known block a buffer is requested for any free block on the swap device.

The calculation of how much stack space has to be left for the parameters to be copied into assumed that integers and pointers were the same size. Using the ACK compiler for the NS32016 this was no longer true, so the calculations were changed to use the size of a pointer rather than the size of an integer.
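
A minimal illustration of the corrected calculation; the names are invented, but the point is the one above, that space for the argument pointers is computed from the size of a pointer rather than the size of an integer:

#include <stddef.h>

/* bytes of stack needed for nargs argument pointers (plus a terminating
   null pointer) and nchars bytes of argument strings */
size_t arg_space(int nargs, size_t nchars)
{
    return nchars + (nargs + 1) * sizeof(char *);   /* was ... * sizeof(int) */
}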

Other changes to the process of executing a file occur in the routines gethead() and getxfile(), which are called by exec().

3.8 System Initialization

Very few assumptions can be made about the state of the machine after a reset. Within the CPU the PC and PSR registers are set to zero. In the ICU all bits of the IMSK register are set to one, thus disabling interrupts. The initial state of other ICU registers is given in the NS32202 data sheets [NS,85]. On reset the MMU status register is set so that no address translation occurs and no flow tracing or non-sequential tracing is done. The function of the startup code is to get the machine from this state into a state where the operating system is able to run. This will include initialization of hardware, data structures and dynamic structures such as processes.

The initialization code has two components, one in assembly code and the other written in high level language. It is useful to try to do as little as possible in assembly code but there is a major restriction on code in high level languages in that some special instructions required for initialization are not provided by the compiler. It would be possible to further reduce the amount of assembly code, if desired, by using a compiler which allowed in-line assembly code.

Writing initialization routines proceeds by considering how and when each item of hardware used by the operating system is to be initialized. Some hardware will need to be initialized before the operating system starts running, while devices such as user terminals only need to be initialized at the time they are required. Another approach to the initialization of devices is to initialize them once during system startup and assume they are then usable for the life of the system. The approach taken by Unix is to reinitialize a device every time it is opened. Initialization of devices is described in section 3.2 "The Unix Input/Output System".

The data structures for each subsystem need to be examined to determine how they are to be initialized. For many data structures simply being zeroed is sufficient. Other things such as free lists may need more complex initialization. Much of this sort of initialization is done in "C" routines, is portable, and does not need rewriting. When determining the order in which initialization should proceed, dependencies need to be taken into account. For example, doing a procedure call assumes the existence of a stack. The order will tend to be hardware first, then simple data structures, then more complex data structures.

Hardware

For the CPU the following operations need to be done. The slave processors need to be configured into the system before they are addressable because the CPU will not execute slave processor instructions unless that processor is configured. The stack pointer needs to be set up before any procedure calls are done. The interrupt base register needs to point to the start of the interrupt table. The static base register is not used.

When the page tables have been located, the supervisor mode page table base register is made to point at the first level page table belonging to the operating system. Only the operating system's page tables are initialized at this point; page tables are created for other processes as they are required. The operating system page tables are not expected to change after this time. When the page tables have been set up the memory status register is modified to cause address translation to begin. The initialization specifies that there are to be two virtual spaces, user and supervisor, and that addresses are to be translated in both.

The counters in the ICU which form the system clock are set up. At a later stage when the system is capable of coping with interrupts they are enabled within the CPU. For the interrupt lines the triggering mode also needs to be specified as the ICU allows interrupts to be triggered in different ways.

During initialization a console terminal, which is used to print startup messages, and a disk containing the root file system are required. The USART which connects to the console terminal has to be initialized. No other input/output devices need to be initialized until their use is requested by a user process. The disk controller performs its own initialization after a reset and requires no intervention by the host.

Software - Data Initialization

Most data items are assumed to be zero when first used. For this reason all memory used by the operating system is cleared.

The input/output subsystem requires that character and block buffers are part of a free list initially. All the buffers are added to the free list using the standard buffer free routines. In addition the block buffer headers are assigned a 512 byte block of real memory which remains associated with them for the life of the system.

The file system requires that the file structures which are used to store information about each open file are in a free list. To make the file system usable the super block for the root file system has to be read into memory. This has to be done before the first process is forked, as part of the creation of a new process involves making a file to hold the process image in the "/procs" directory, which is within the root file system.

Before starting the first process the process directory "/procs" is created if it does not exist. If the "/procs" directory exists and contains files they are left as they are. They may be removed by the super user or the owner of the file, or left until a process with the same process identification number is created, in which case the contents of the file are deleted and the file reused. This can only occur as a result of a system crash, as process files are deleted when the process terminates under normal circumstances.

The initialization required to start a new process is quite extensive. As it is almost the same for the first process as for any other, the description given in section 3.6 "Process Management" is adequate.

Initialization Routines

Start() contains all the assembly language initialization code required by the operating system. The routine addpte() is used only during system initialization. It creates first and second level page table entries for the supervisor space. Unlike user space, all virtual addresses, with the exception of those referring to the user structure, are identical to the real addresses. This is not essential, but makes supervisor mode debugging much easier.

start() {
    Configure the MMU, ICU ( and FPU if needed );
    Clear uninitialized memory ( BSS area );
    Locate the address of the user structure and set up the stack in the top end of it;
    Locate the second level page tables and clear them;
    Find how much real memory there is;
    Call addpte() to set up the supervisor mode page tables for:
        - all existing real memory
        - the input/output area
        - the page where the ICU is mapped
    Turn on address translation;
    Set up the ICU registers;
    Enable interrupts at the CPU ( although still masked in the ICU );
    Set up the parameter to main() and call it;
    /*
     * when process 1 returns from main - ( process zero never returns ) ...
     */
    Set up the user stack pointer;
    Set up the stack so the return will start executing code in user mode;
    ( the code starts at address 0x10, the module table is at address 0 )
    Do a "return from trap" so the PSR, MOD, and PC registers are loaded from
    values on the stack;
}

The ICU registers are set up in the manner advised in the NS32202 data sheets. [NS,85]

Process zero will never return from the call to main(), however process one returns to start() after it is created. So that process one may start executing the user initialization code, values are set up on the system stack for the processor status word, the module register, and the program counter, which are then loaded into the appropriate registers by performing a return from trap.

The routine addpte(), which is written in "C", is used to map an area of real memory at a given virtual address in supervisor space. It is called three times during initialization to map all of real memory, the page containing memory mapped devices, and the page which the ICU registers are mapped into.

addpte( real address, number of blocks, virtual address ) {
    Find the first level page table entry for the given virtual address;
    If it is not valid
        Assign a second level page table and update the page table entry;
    Find the second level page table entry for the given virtual address;
    For each block that is to be mapped {
        If it will be the first entry in a new table and this is not the first block
        to be mapped
            Assign a second level table and update the level one page table entry;
        Construct the level two page table entry using the given real address,
        making the permissions read/write in supervisor mode;
        Adjust the real address to be the address of the next block to be mapped;
        Go to the next second level page table entry;
    }
}

Second level page tables for the operating system are allocated from an array in the data area set aside for that purpose. Because the operating system is a fixed size it is possible to work out in advance how many second level page tables will be required to map it. It is not possible to use the normal memory allocation routines to allocate the level 2 page tables as the data structures required by them have not been initialized at the time this routine is required.

The two routines start() and startup() contain all the system initialization which is likely to be machine dependent. Another routine, main(), performs machine independent initialization and calls the C routine startup(). Main() initializes the character buffer, block buffer and file structure free lists. It sets up the process information for process zero and calls a routine to create process one, which executes an initialization program. The initialization program creates a process for each terminal to enable users to log on. Main() remained essentially the same during the port except for being changed to check for the existence of the "/procs" directory.

The user structure is always mapped at the same virtual address within supervisor space. When switching between processes a new user structure is mapped in, replacing the previous one. In order to speed up the mapping in of the new user structure, the addresses of the four page table entries which need to be changed are stored in an array.
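
The following sketch shows the idea; the names (NUPTE, upte_addr, p_upte) are invented, but the array of saved page table entry addresses and the copies of the user structure page table entries kept in the proc structure are both described in the text:

#define NUPTE 4                               /* page table entries covering the user structure */

static unsigned long *upte_addr[NUPTE];       /* addresses of those entries, recorded by startup() */

struct proc {
    unsigned long p_upte[NUPTE];              /* copies of the user structure page table entries */
    /* ... other fields ... */
};

/* called when switching in process p: remap the user structure without a search */
void map_ustruct(struct proc *p)
{
    int i;
    for (i = 0; i < NUPTE; i++) {
        *upte_addr[i] = p->p_upte[i];         /* point the kernel map at the new user structure */
        /* the corresponding MMU translation cache entry must also be purged */
    }
}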

main( address of first free block of memory ) {
    Locate the inode of the "/procs" directory and
    If it does not exist
        Create a directory called "/procs" but
        If the create fails
            Panic;
    Unlock the inode belonging to the "/procs" directory;
}

Two features of the above code fragment deserve comment. The failure to create a "/procs" directory makes it impossible to properly create any processes and is thus fatal. The routine which locates the inode given the file name has the side effect of leaving the inode, if found, in a locked state preventing it being used by any other process. Likewise the routine which creates the directory also leaves the inode in a locked state. Failing to unlock the inode at this stage would result in the initialization process sleeping on the inode when it tried to create process one. Thus asleep, it would never wake up.

The routine startup() is supposed to contain the "C" startup code which is machine dependent, and calls to machine dependent routines which perform initialization.

startup( address of first free block of memory ) {
    Save the addresses of the user structure page table entries in the supervisor
    mode page table to speed up context switching;
}

An explanation of how the addresses of the user structure page table entries are used can be found in section 3.6 "Process Management", where the routine resume() is described.

4. Proposed Extensions

Under the extensions proposed here data and stack will be separate segments but will still share the same image file. Note that the text is included in the data segment unless it is shared text.

It is intended to extend the operating system to allow files to be mapped into a process's virtual space under user control by introducing a number of new system calls. It is also intended to introduce shared text. The existing code which handles limits on the amount of disk space a user can have will also allow the amount of virtual space to be regulated. No thought has been given to restricting the virtual space on a per process basis.

Possible Uses

The following is a brief list of possible uses for the suggested extensions.

1. A file can be made to look like a part of the process virtual space allowing it to be processed like memory and making the read and write operations invisible to the user.

2. Dynamic linking becomes very easy. A process simply has to map the required object file into its virtual space and transfer control to the appropriate location.

3. The shared segment mechanism can be used to allow processes to share part of their virtual spaces.

4.1 Representation of Virtual Space

Currently a single file maps the entire virtual space so that each block of used virtual memory corresponds to a disk block in the process image file. This representation will be extended allowing different parts of the virtual space to be mapped to different files. The mapping will be performed using an array which will be kept as part of the per process information.

Each array element will contain the following information:

1. The base address of the section of virtual memory which is mapped to the file specified in this entry.

2. The inode pointer for the file.

3. The number of blocks mapped in this part of the virtual space. This defaults to the size of the file.

4. The offset within the file, in blocks, at which the mapping starts. This value defaults to zero. The offset combined with the previous field allows part of a file to be mapped in.

5. Flags to allow the direction of extension, permissions and other such attributes to be specified.

struct segment {
    char *         segbase;   /* base virtual address of the segment             */
    struct inode * segino;    /* inode of the file mapped by this segment        */
    long           seglen;    /* number of blocks mapped                         */
    daddr_t        segoffs;   /* offset into the file; daddr_t is a disk address */
    short          segflags;  /* permissions and other flags                     */
};

The array will be referred to as the "segment map". The section of virtual space mapped by one of these array entries will be referred to as a "segment". (See figure 4.1) The entries will be maintained in ascending order by virtual address to make searching the array easier.
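
A sketch of the lookup the ordering is meant to simplify; it assumes the struct segment definition above, byte addresses, seglen measured in 512 byte blocks, and invented function and variable names:

#define BSIZE 512L

/* return the segment map entry containing addr, or a null pointer if none does */
struct segment *findseg(struct segment *map, int nseg, char *addr)
{
    int i;
    for (i = 0; i < nseg; i++) {
        char *base = map[i].segbase;
        if (addr >= base && addr < base + map[i].seglen * BSIZE)
            return &map[i];               /* addr falls inside this segment */
        if (addr < base)
            break;                        /* entries are in ascending order, so stop early */
    }
    return 0;                             /* not mapped: the process would be signalled */
}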

The initial mappings of text, data and stack, which will be performed by the fork() and exec() system calls, cannot be removed or changed by the user. They are removed when the process exits. The process may create, destroy and modify other segments within its virtual space.

If the file corresponding to the segment entry was created by the process, it has a flag indicating it belongs to the process and the directory entry must be deleted when that segment is "unmapped". This will apply to the data/stack segment image.

The idea of having auto-extending segments was considered, which is mainly of interest for the stack segment. With the hardware being used there is some difficulty in deciding, when a page fault occurs which does not fall inside an existing segment, whether it was caused by a reference in the stack segment and is a reasonable value. If a page fault occurs, the stack pointer is restored to the value it had before the instruction started by the instruction rollback feature of the CPU.

The idea of extending the stack segment when the page fault address is "sufficiently close" to the extending end of the stack segment is possible, but its success depends on how good an estimate of a reasonable value of "sufficiently close" can be made. If "sufficiently close" is too large there is the risk of catching uninitialized or invalid pointers and treating them as valid locations on the stack. Alternatively, if large data structures have to be allocated on the stack, "sufficiently close" may have to be quite large. For these reasons it was decided to avoid implementing auto-extending segments, but to note that doing so using the method discussed above does not require a major programming effort. Users will have to ensure the stack is large enough, if the default value is too small, by using the proposed system call exseg() to extend it.

4.2 Shared Text

The problem of how to represent shared text was ignored during the initial changes since it will be easier to implement during the extensions. The problem will be addressed in a slightly more general form, that of shared segments which are not required to be text. In the case of shared text the original file will be used as the text image. If a segment is not mapped to the start of a file the mapping must begin at a block boundary. Either the header of the executed file can be extended to occupy all of the first block and the mapping start in the second block, or the header can be mapped into the virtual address space and subsequently ignored. The alternative to this is to make a copy of the text area from the executed file. Each shared segment file will have an entry in the shared segment array. The entries will give the inode of the file, the number of processes using the shared segment image file, and a pointer to a list of second level page table entries mapping the segment.

struct shareseg {
    struct inode * segino;    /* inode of the shared segment file                   */
    int            shareref;  /* number of processes using the shared segment       */
    long *         ptelist;   /* list of second level page table entries mapping it */
};

When a shared segment is created the shared segment array is searched for an entry for the file to be used. If the inode is the same as the "segino" field in one of the entries, then the reference count is incremented and a new segment entry is created for the process. If no entry exists in the shared segment array a new entry is created.

When releasing a shared segment the segment array reference count is decremented. When the reference count to the shared segment reaches zero the entry is removed.
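
A sketch of that reference counting, using the struct shareseg fields above; the array size, the free-slot convention (a null inode pointer) and the helper names are all assumptions:

#define NSHARE 20                             /* size of the shared segment array (assumed) */

struct shareseg shareseg[NSHARE];

/* attach to an existing shared segment entry for ip, or create a new one */
struct shareseg *share_attach(struct inode *ip)
{
    int i, freeslot = -1;

    for (i = 0; i < NSHARE; i++) {
        if (shareseg[i].segino == ip) {
            shareseg[i].shareref++;           /* already shared: count another user */
            return &shareseg[i];
        }
        if (shareseg[i].segino == 0 && freeslot < 0)
            freeslot = i;                     /* remember the first empty slot */
    }
    if (freeslot < 0)
        return 0;                             /* the shared segment array is full */
    shareseg[freeslot].segino = ip;           /* make a new entry for the file */
    shareseg[freeslot].shareref = 1;
    return &shareseg[freeslot];
}

/* release one reference; the entry is removed when the last user goes */
void share_release(struct shareseg *sp)
{
    if (--sp->shareref == 0)
        sp->segino = 0;
}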

With the current memory management, different processes with the same file mapped into their virtual spaces may each hold their own copy of the same disk block. If the blocks can be changed then there are problems with keeping all the copies of the block up to date. The shared text files will be made read only during their creation by exec().

To eliminate this problem and ensure that only one copy of a disk block is in memory requires more information to be kept about each page and more processing. A solution is to allow processes to share any second level page tables which map the shared segment (See figure 4.2). While this poses no problems with the pages themselves there are some problems with allocation and release of the level 2 page tables. Four possible approaches are examined here.

1. Allocating all the second level page tables required to map the shared segment on the first request by any process and leaving them allocated until the number of processes referencing the shared segment falls to zero. This is obviously wasteful of memory.

2. Each time a second level page table is required, the level one page table of each process using that shared segment is updated, effectively allocating the page to all the processes at one time. When the page is released, either by leaving the process's working set or by the process releasing the shared segment, all other processes using the shared segment need to be checked to find out if the page table can be released. Even if a list of processes using the shared segment were kept in the shared segment array, quite a bit of searching would be required.

3. This approach is similar to the last one. Before a second level page table is allocated, all processes using the shared segment are searched to see if there is a level 2 page table for that segment allocated in any of them. If so, the page table entry is copied; otherwise a new page table is created, so the entry for the level 2 page table is only kept by the processes which are using it and in the page table entry list.

4. A list of level 2 page tables kept as an array of page table entries is associated with the shared segment array. When a new second level page table is required the appropriate entry in the page table entry list is examined. If it is valid the entry is copied into the level 1 page table of the requesting process. If invalid a block is allocated, an entry made in the list which is copied as before. A reference count is kept with each entry in the list which is decremented when a process invalidates its entry for the corresponding page table. When the reference count reaches zero the memory held by the page table is released and the list entry invalidated.

The major disadvantage with methods 2 and 3 involves the amount of searching which has to be done to locate the appropriate page table entries in all processes using the same shared segment. The searching is made more complicated by allowing the shared segments to map to different virtual addresses in each process.

The extra memory required to hold the page table entry list in the fourth option seems a reasonable price to pay for the simplification of allocation and releasing of level 2 page tables.

4.3 Setting Up Segments

There are two ways segment mappings may be created: fork() and exec() create mappings for text, data and stack segments, and a system call, makeseg(), will be provided to create a segment under user control.

Makeseg() will have the following parameters:

1. A pointer to the name of the file to be mapped in. If the file exists the user must have the appropriate permissions on the file. If it does not exist and the user has appropriate permissions on the directory a file will be created.

2. The virtual address of the start of the segment to be mapped. A suitable base address will be chosen if this is not specified. Segments are prevented from overlapping within the virtual space.

3. The number of pages contained in the segment. This will default to the size of the file if the parameter is zero. The size is allowed to be larger than the file if the segment and the file both have write permission allowing the file to be extended.

4. The offset in blocks within the file where the mapping is to begin. This allows a part of a file to be mapped or even different parts of a file to be mapped to different segments which may be useful for the data and stack segments allowing one file instead of two to map them.

5. Flags. Some possible flags under consideration are: permissions for the segment, particularly where they may vary from the permissions of the file; a flag to indicate a shared segment; a flag to indicate a directory entry is to be deleted when the segment is released, in the case of the data/stack file; and a flag preventing the segment from being deleted by the user. The flag preventing deletion will be used for text, data and stack segments created by fork() or exec().

The system call, if successful, returns the virtual address of the base of the segment. This value needs to be kept if the segment is to be removed during the lifetime of the process.
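
As a hypothetical example of the call from a user program (the declaration, the SEG_SHARED flag name and the library path are all invented; only the parameter list follows the description above), mapping an object file for dynamic linking might look like this:

#define SEG_SHARED 01                               /* invented flag value */

extern char *makeseg(const char *file, char *base,  /* assumed declaration */
                     long nblocks, long offset, int flags);

char *maplib(void)
{
    return makeseg("/lib/mathlib.o",   /* file to be mapped in                 */
                   (char *)0,          /* let the system choose a base address */
                   0L,                 /* default: the size of the file        */
                   0L,                 /* mapping starts at block zero         */
                   SEG_SHARED);        /* map it as a shared segment           */
}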

makeseg() {
    If the segment array is full
        Return an error;
    If the flags are not valid
        Return an error;
    Translate the file name into an inode pointer, checking permissions on the file.
    The inodes are left locked;
    If the size is not specified
        Assign the size of the file to the segment size;
    else if the size is larger than the file size and there is no write permission on the file
        Release the inode and return an error;
    If the address is not given {
        Find a suitable spot in the map and record the base address;
        If there is no suitable location
            Release the inode and return an error;
    } else if the segment is shared and not aligned on a 64KByte boundary
        Release the inode and return an error;
    If the base address and size make the segment overlap an existing segment
        Release the inode and return an error;
    Call mksegent() to make a new segment entry;
    Return the segment base address;
}

The work of makeseg() is mainly to validate all the parameters and to find suitable values if the user has not specified them. The routine mksegent() will do the work of creating a segment mapping.

mksegent( same as makeseg() except an inode instead of a file name ) {
    If the segment is to be shared {
        Lock the shared segment array against modification;
        Search the shared segment array to see if a shared segment entry already exists
        and if it does ... {
            Increase the reference count in the shared segment array;
        } else {
            Make a shared segment entry for the file;
        }
        Unlock the shared segment array;
    }
    Create a segment entry in the process's segment map;
}

The making of a shared segment array entry requires the inode pointer to be set up, the reference count set and a block of memory to be allocated to hold the level 2 page table entry list which is used in allocating and freeing level 2 page tables for the shared segment.

getxfile() {
    For each entry in the segment array
        Call rmseg1() to release it;
    If there is shared text
        Call mksegent() to create a segment entry for it;
    Call mksegent() to create the data segment;
    Call mksegent() to create the stack segment;
}

The rest of getxfile() remains the same as described in chapter 3, section 3.6 "Process Management".

4.4 Extending Segments

The system call exseg() will be provided to enable a segment to be explicitly extended upwards or downwards.

Exseg() replaces the brk() system call, which increases the amount of real memory allocated to a process. They differ in that exseg() is primarily concerned with the amount of virtual memory allowed a process, not real memory.

The parameters are:

1. The virtual address at which the segment starts which is used to identify the segment. The call fails if there is no existing segment with that base.

2. The number of blocks by which the segment is to be extended. If this is positive the segment is extended upwards, and if negative the segment is extended downwards. The call fails if this would cause the segment to overlap another segment.

The virtual address of the segment base is returned if the call is successful. If the segment was extended downwards the base address may have changed.
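
A hypothetical use of the call, growing a stack segment downwards by 16 blocks; the declaration and the error convention are assumptions:

extern char *exseg(char *segbase, long nblocks);   /* assumed declaration */

char *grow_stack(char *stackbase)
{
    char *newbase = exseg(stackbase, -16L);        /* negative count extends downwards */
    return newbase;                                /* the (possibly lower) base address */
}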

exseg( segment base, size ) {
    If there is no segment entry with the given base address
        Return an error;
    If it is a shared segment
        Return an error;
    If the segment extended by "size" would overlap another segment
        Return an error;
    If the size is negative {
        Add size to the base address, moving it lower;
        Add the absolute value of size to the existing segment size;
    } else
        Add size to the existing segment size;
    Return the segment base address;
}

It may be desirable to stop users from reducing the segment size of vital segments such as their text, data and stack to zero. Shared segments will not be allowed to extend.

4.5 Removing Segments

The system call rmseg() will remove a segment mapping from the array. It is passed the virtual address of the segment base and removes the corresponding entry from the array after writing out any updated pages mapped to the corresponding file.

rmseg( segment base )
{
    If the segment does not exist
        return an error;
    If the segment was created by a fork() or exec()
        it cannot be removed except by exit(); return an error;
    Call rmseg1() to remove the single segment entry;
    Return zero if successful;
}

Rmseg() is concerned with checking that the segment base address exists and that the user is allowed to remove the segment. It calls rmseg1() to do the work of removing the segment entry.

rmseg1( index into segment mapping array )
{
    If the segment was not created during a fork() or exec() and there is write permission on the file
        Update the file by writing all modified pages in that segment;
    Free any memory associated with the segment by calling ptfree();
    If it is a shared segment
        Reduce the reference count on the inode, which deletes the file when the reference count goes to zero;
        If the file was created by a fork() or exec()
            Unlink the directory entry;
        Decrement the shared segment reference count, and remove the entry and release the page table entry list block if the reference count is now zero;
    else
        Reduce the reference count on the inode;
    Remove the segment map entry;
}

Rmseg1() can only remove files which were created by the process during a fork() or exec() because they are the only ones it knows the names of. It was considered too much work to traverse the directory structure in other cases to find the file name and, in any case, if a number of directory entries exist for the same file, which one should be deleted? The other option is to write a system call which is given the name of the file and removes all segments using it.
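Whichever naming scheme is adopted, the user-level call itself is simple; the C binding and the zero-on-success convention below are assumptions used only for illustration.

    /* Hypothetical user-level binding for the proposed rmseg() call. */
    int rmseg(char *segbase);

    int
    unmap_segment(char *base)
    {
        /* Write back any modified pages and remove the mapping created
         * earlier with makeseg(); segments set up by fork() or exec()
         * cannot be removed this way.
         */
        return rmseg(base);         /* 0 on success, an error otherwise */
    }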

The following fragment of exit() shows the replacement of the call to delnode() shown in the version of exit() described in chapter 3, section 3.6 "Process Management".

exit()
{
    For each valid entry in the segment map
        call rmseg1() to remove the mapping;
    ...
}

4.6 Page Faults

The page fault routine will have to be modified so that whenever a page fault occurs the segment mapping array is searched to validate the page fault address, the inode from which the page needs to be read is found, and the block number which is needed is calculated. If the address does not fall inside an existing segment the process will be signalled.

When a level 2 page table is being allocated for a shared segment, the level 1 page table entry is marked using one of the two spare bits in the page table entry (yes, only two) to let the page freeing routines know that it belongs to a shared segment, so that the shared page table entry list can be adjusted.

pagefault()
{
    Find an entry in the segment map such that
        segment base <= page fault address < segment base + length;
    If no such entry exists
        signal the process;
    If a level 2 page table is required
        If it is in a shared segment {
            Find the entry for it in the shared segment page table list;
            If it is not valid
                Allocate a block of memory for the new page table and put the page table entry in the page table entry list;
            Assign the list entry to the appropriate place in the level 1 page table of the process;
        } else
            Assign a block of memory and set up the level 1 page table entry;

    Calculate the block number to be read, which will be
        page fault address - segment base + file offset;
    Read the block from the appropriate inode;
    Set the permissions in the page table entry from the segment map permission flags;
}
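The block calculation in the pseudo-code can be written out as below. This is a sketch only: the 512-byte block size, the variable names and the assumption that the file offset is given in bytes are mine, not fixed by the design.

    #define BSIZE 512   /* assumed block (and page) size in bytes */

    /* Return the block of the backing file that holds the faulting page,
     * given the fault address, the segment base address and the offset of
     * the segment within the file (all in bytes).
     */
    long
    fault_block(long faultaddr, long segbase, long fileoffset)
    {
        return (faultaddr - segbase + fileoffset) / BSIZE;
    }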

4.7 Freeing Pages

The two routines used for freeing pages and page tables, pfreelist() and ptfree(), need to be modified to free level 2 page tables belonging to shared segments. Instead of just freeing the page table, the level 1 page table entry is checked to see whether the page belongs to a shared segment. If it does, the shared segment level 2 page table entry list is updated before the level 1 page table entry is invalidated.

pfreelist()
{
    If there were no valid entries in the level 2 page table
        If it is a shared segment page {
            Find the shared segment page table list entry and decrement the reference count for that page;
            If the reference count is zero
                free the page;
        } else
            Release the page;
}

ptfree()
{
    For each entry in the level 1 page table, if the entry is valid
        If it is a shared segment entry {
            Decrement the reference count in the shared segment page table entry list;
            If the reference count is zero
                Delete all the pages and the page table;
        } else
            Delete all the pages and the page table as before;
}
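The shared-segment bookkeeping in pfreelist() and ptfree() reduces to a reference count test of the following form; the structure and names are hypothetical and are shown only to make the rule explicit.

    /* One entry in the shared segment's level 2 page table entry list. */
    struct shpt {
        int pt_count;   /* number of processes still using this page table */
    };

    /* Returns 1 if the caller held the last reference, so the pages and
     * the page table itself may be freed; returns 0 if other processes
     * still share it, in which case only the caller's level 1 entry is
     * invalidated.
     */
    int
    drop_shared_pt(struct shpt *p)
    {
        return --p->pt_count <= 0;
    }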

5. Current Status and Reflections

On starting the port I had expected that the required hardware and software tools would be fully usable, but some unexpected hardware and software bugs became evident as I familiarized myself with the tools, and others showed up during the porting of the operating system. Further delays were caused by subsequent breakdowns of the hardware and waiting for replacement parts.

The trouble with hardware extended the timeframe from about 1.5 years to 2.5 years (the last 1.5 years being part time). The hardware was not fully usable until the third quarter of 1986. Until that stage only a limited amount of programming and testing could be done; however, it did allow more time for familiarization with the source code of Unix. The companion books "Unix Operating System Source Code Level 6" and "A Commentary on the Unix Operating System" [Lions,77] provided a useful place to start, even though they refer to an earlier and simpler version of Unix and require a fair amount of work on the part of the reader.

During the initial stages an examination of existing literature on porting operating systems was performed. The experience of others provided some useful guidelines as to how a port should be done, but little in the way of detail.

Not being familiar with the Unix source code initially makes the task of porting longer, but it is not a severe disadvantage given the relatively small size and modularity of Unix compared with other operating systems; however, a good knowledge of Unix does need to be obtained before doing the port. [Jalics,83]

One of the challenges in designing changes for a program the size of Unix is to keep track of the large number of details which have to be considered. The modularity restricts the extent to which a single change affects the rest of the system, localizing the areas which have to be checked for side effects when the changes are made. Designing test data and procedures was a continual challenge in such an environment and involved the use of many print statements, abort instructions and extensive examinations of the contents of memory. It was not possible to use tracing; the ability to stop the operating system was present but not the ability to start it again from where it left off.

During the port I visually inspected the source code looking for places where the change in integer length and the difference between the length of integers and pointers might have an effect. Most instances were discovered during that inspection but a couple of cases escaped detection until testing was performed.

The almost constant problem of overloading on the host machine meant that most useful work had to be done before 9am or late at night. Compiling and linking the source code was a particular problem; however, downloading the object code from the host to the NS32016 board also took an annoyingly long time. Downloading took over 10 minutes on a lightly loaded host; it was not tried when the machine was heavily loaded.

Remaining Work

As it stands, further work is required before the Unix kernel is fully functional. The amount of work remaining is for the most part fairly small. This project was only concerned with the Unix kernel, but a working kernel does not imply a working system. A number of other non-kernel items of software need to be provided.

The Current State of the Kernel

No changes were made to the code of the file subsystem and no new routines were written for it. The only remaining task connected with the file subsystem is to transfer the root file system to the hard disk, a task which is not expected to present any problems.

The disk and terminal device drivers have both been written; the disk driver has been completely tested but the terminal driver requires further testing. At a later stage a device driver should be written for real and virtual memory to enable memory to be examined and modified by a user process.

The system clock is not running at full speed but can be made to do so by changing a constant.

The routine sendsig() which runs a specified user routine on receipt of a signal has not been written.

Modified and unmodified free page queues are not used for pages taken from a process. They are not essential for the running of the system but are part of the design. The changes required to implement the queues are shown in chapter 3, section 3.5 "Memory Management".

All system calls which required major changes have been modified and tested. Of the other system calls, all have been changed to get parameters and return results in the new way, but only a few have been tested. The following system calls have been tested: exit(), fork(), wait(), creat(), exec(), mknod(), sync(), and chmod().

Non Kernel Work Remaining

The file "/etc/init" needs to be set up by porting it from the host machine to allow users to log in and to create a user shell for the terminal. The shell must also be ported together with the utilities. The experience of others indicates that many of the utilities require only minor modifications during a port. [Hsu,84] The program which checks the consistency of the file system after a system crash ( Fsck()) needs to be ported.

The hardware may eventually be extended to provide off-board memory and more terminal ports. If the off-board memory is added at the end of existing memory the kernel should require no changes before being able to use it. The addition of more terminal ports will require some minor changes to the terminal driver, which currently only expects one terminal to exist.

6. Conclusion

Designing and implementing the changes required to port Unix onto a new microprocessor was easier than expected, although this was counterbalanced by the persistent problems with the hardware. The software tools also presented some initial problems, but these were more quickly overcome. I have to agree with Schriebman [Schriebman,84] that "porting an operating system is never a trivial task", but would add that working with unreliable hardware and buggy software compounds the problems. In view of my experience I would strongly advise anyone intending to perform a port to verify that the hardware and software are fully functional before starting.

The positive effects of doing the port include a greater personal familiarity with the "C" programming language, an understanding of some of the techniques and problems associated with virtual memory management, seeing how Unix works, experience with porting, and the opportunity to sharpen debugging techniques in a non-trivial environment.

The port will now provide a starting point for implementing the extensions described in chapter 4, which I consider will be an interesting addition to the features already provided by Unix.

7. References

[Bodenstab,84] Bodenstab D.E., Houghton T.F., Kelleman K.A., Ronkin G., Schan E.P., Unix Operating System Porting Experiences, AT&T Bell Laboratories Technical Journal vol 63 no 8 part 2, Oct 1984, p1769-1790

[Cargill,80] Cargill T.A., Management of the source text of a portable operating system, IEEE Compsac 1980, p764-768

[Cheriton,79] Cheriton David R., Malcolm Michael A., Melen Lawrence S., Sager Gary R., Thoth, a portable real-time operating system, Comm. ACM vol 22 no 2, Feb 1979, p105-115

[DEC,81] VAX Hardware Handbook, 1981

[Denning,68] Denning Peter J., The working set model for program behaviour, Comm. ACM vol 11 no 5, May 1968, p323-333

[Denning,70] Denning Peter J., Virtual Memory, Computing Surveys vol 2 no 3, Sept 1970, p154-216

[Fogel,74] Fogel Marc, The VMOS Paging Algorithm - A Practical Implementation of the Working Set Model, Operating Systems Review vol 8 no 1, Jan 1974, p8-17

[Hsu,84] Hsu Nai-Ting, Skinner Glenn, Zelitzky Jay, How to Port UNIX to a New Microprocessor, Computer Design vol 23 no 6, June 1, 1984, p173-181

[Hunter,84] Hunter Colin B., Farquhar Erin, Introduction to the NS16000 Architecture, IEEE Micro 4(2), April 1984, p26-47

[INTEL,81] Intel Component Data Catalog, Jan 1981

[Jalics,83] Jalics Paul J., Heines Thomas S., Transporting a Portable Operating System: UNIX to an IBM minicomputer, Comm. ACM vol 26 no 12, Dec 1983, p1066-1072

[Kernighan,84] Kernighan Brian W., The Unix System and Software Reusability, IEEE Trans. on Soft. Eng. vol SE-10 no 5, Sept 1984, p513-518

[Killian,84] Killian T. J., Processes as Files, European UNIX Systems User Group, Autumn Meeting Sept 1984

[Lions,77] Lions J., Unix Operating Systems Source Code Level 6, Department of Computer Science, University of New South Wales, 1977

[Lions,77] Lions J., A Commentary on the Unix Operating System, Department of Computer Science, University of New South Wales, 1977

[Lister,79] Lister A. M., Fundamentals of Operating Systems, 2nd Edition, The Macmillan Press Ltd., London and Basingstoke, 1979

[Miller,78] Miller Richard, UNIX - A Portable Operating System?, Operating Systems Review vol 12 no 3, July 1978, p32-37

[Minchew,82] Minchew Charles H., Kuo-Chung Tai, Experience with Porting the Portable "C" Compiler, Proceedings of the ACM '82 Conference, Oct 1982, p52-63

[NS,83] DB16000 Monitor Reference Manual, April 1983 National Semiconductor Corporation

[NS,85] Series 32000 Databook, National Semiconductor Corporation, June 1985

[Powel,79] Powel M. S., Experience of Transporting and using the SOLO Operating System, Software - Practice and Experience vol 9, 1979, p561-569

[Ritchie,74] Ritchie D. M., Thompson K., The UNIX Time-sharing System, Comm. ACM vol 17 no 7, July 1974, p365-375

[Schriebman,84] Schriebman Jeffrey, UNIX Portability, COMPCON Digest of Papers, Spring 1984, Twenty-Eighth IEEE Computer Society International Conference.

[Tanenbaum,78] Tanenbaum Andrew S., Klint Paul, Bohm Wim, Guidelines for Software Portability, Software - Practice and Experience vol 8, 1978, p681-698

[Tanenbaum,83] Tanenbaum Andrew S., Van Staveren Hans, Keizer E. G., Stevenson Johan W., A Practical Tool Kit for Making Portable Compilers, Comm. ACM vol 26 no 9, Sept 1983, p654-660

[Thompson,78] Thompson K., UNIX Implementation, The Bell System Technical Journal vol 57 no 6 part 2, July-Aug 1978, p1931-1946

[WD,83] Preliminary Engineering Specification for WD1002-05 Winchester/Floppy Controller, Doc. no. 61-031050-0005, Western Digital, 1983

[Zwaenepoel,84] Zwaenepoel Willy, Lantz Keith A., Perseus: Retrospective on a portable Operating System, Software - Practice and Experience vol 14, 1984, p31-48

8. Appendix A - Notes on Software and Hardware

8.1 The Memory Map of the DB16000

0x000000 - 0x007FFF   on-board PROM and PROM expansion space

0x008000 - 0x027FFF   on-board RAM (128K)

0x028000 - 0x7FFFFF   off-board memory (not used)

0x800000 - 0x9FFFFF   off-board input/output ports (ISBX connectors)

0xC00000 - 0xDFFFFF   on-board input/output ports. The DB16000 comes with an 8251A USART; its data register is at 0xC00000 and its control word/status register is at 0xC00002.

0xE00000 - 0xFFFFFF   ICU ports

The address decoding circuit generates input/output request signals for addresses above 0x800000.
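For convenience, the fixed addresses in the map above could be collected into a small C header along the following lines; the macro names are invented for illustration, and only the addresses themselves come from the board documentation.

    /* DB16000 physical addresses, taken from the memory map above.
     * The names are hypothetical.
     */
    #define PROM_BASE   0x000000L   /* on-board PROM and expansion space */
    #define RAM_BASE    0x008000L   /* start of the 128K on-board RAM    */
    #define RAM_TOP     0x027FFFL   /* last byte of on-board RAM         */
    #define IO_BASE     0x800000L   /* I/O requests generated above here */
    #define USART_DATA  0xC00000L   /* 8251A USART data register         */
    #define USART_CSR   0xC00002L   /* 8251A control word / status       */
    #define ICU_BASE    0xE00000L   /* NS16202 ICU ports                 */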

8.2 NS32016 Addressing Modes

An example is given for each addressing mode to show the syntax used by the ACK compiler, which is discussed under software tools. The last letter of the instruction gives the size of the operand. A byte is indicated by 'b', a word by 'w', and a double word by 'd'.

1. Register.

The operand is one of the 8 general purpose registers.

movd r1,r0

2. Immediate.

The operand is contained within the instruction.

movd 0,r0

3. Absolute.

The absolute address of the operand in memory is given.

movd 0,@10000

or

movd 0,_data

4. Register Relative.

The address of the operand is calculated by adding a displacement given in the instruction to the value in the specified general purpose register.

movd 0(r0),r1

5. Memory Mode.

Similar to the register relative mode except that the register used is a dedicated register. It may be the program counter (PC), the stack pointer (SP), the static base register (SB) or the frame pointer (FP).

movw 8(sp),12(sp)

6. Memory Relative.

Two displacements are specified in this mode. The first displacement is added to the specified special purpose register, and a double word is read from that location. The operand's address is the sum of that double word and the second displacement. This mode is useful for handling address pointers and manipulating fields in a record.

movd 0(8(fp)),r0

7. External.

This is used to access operands which are external to the currently executing module. The external addressing mode requires two offsets. The first is used to find the linkage table entry to be used. The second is an offset to a subfield of the variable, for example a field within a record. This addressing mode was not used at any time during this project.

8. Top of Stack.

The stack pointer specifies the location of the operand. Depending on the type of operation the stack pointer will be incremented or decremented, thus providing an automatic push and pop facility. For example adding top of stack (tos) to a register will pop the value from the top of stack then add it to the register.

movd r0,tos or movw tos,r1

9. Scaled Index.

The scaled index mode is used for addressing arrays with elements which can be byte, word, double word, floating point or long floating point. The operand address is calculated using one of the general purpose registers and a second addressing mode. The second addressing mode gives the starting address of the array; this is added to the offset within the array, which is calculated by multiplying the index (in the general purpose register) by the size of the array elements.

movd 0(r1)[r0:d],r2

or

movw _array1[r0:w],_array2[r0:w]

In the first example given, the base address of the array is given by 0(r1), that is, the location pointed to by register r1. The index into the array is given by r0. It is multiplied by the size of the array element, which is a double word as indicated by 'd'. The base and the offset are added to give the final address.

8.3 Methods of Handling Interrupt Priorities Using the NS16202

1. Fixed Priority.

Each interrupt source is ranked in priority from 0 (which is the highest) to 15. While an interrupt is being serviced, any interrupt of the same or lower priority will be held until the currently executing interrupt handling routine executes a return from interrupt (RETI) instruction.

2. Auto Rotate Mode.

An interrupt position is assigned lowest priority after an interrupt in that position has been serviced. The highest priority is then assigned to the next lowest priority interrupt position.

3. Special Mask Mode.

This allows the ICU's priority structure to be changed while an interrupt is being serviced.

4. Polled Mode.

All control of interrupt priority is left to the software.

8.4 A Summary of Commands for the WD1002

TEST This command causes the winchester/floppy controller to run internal diagnostics to check itself.

RESTORE This command is used to calibrate the position of the head on a drive by stepping the head outwards until the TR000 line is asserted. The stepping rate is specified as part of the command.

SEEK The seek command positions the head on the specified cylinder. A stepping rate is specified as part of the instruction. Subsequent instructions use the stepping rate specified here.

READ SECTOR This reads a sector from the disk then makes it available to the host. There is an implied seek in the read command which uses the stepping rate specified in the last seek or restore instruction. The read command performs error correction unless a long read is specified. The long read will inhibit error correction and pass the check bytes from the disk unaltered back to the host. The long read can only be used on a winchester disk. A read command can run in either a programmed I/O mode or DMA mode. Multiple sector reads can also be specified.

WRITE SECTOR The write command writes a sector of data provided by the host onto the disk. Like the read command, it contains an implied seek. It can be used for single or multisector writes. It also has a long write mode.

FORMAT TRACK After a format track command is received the winchester/floppy controller expects the interleave table to be written to the sector buffer. When setting up the sector image the number of bytes transferred to the buffer must correspond to the specified sector size.

9. Appendix B - A Summary of Changes Made to Unix

1. Traps and Interrupts.

This section consists mainly of assembly language code which had to be rewritten. The differences between the original interrupt handling hardware and the new hardware made little difference to the code.

The routines rewritten were:

• A short routine for each trap and interrupt to call the appropriate "C" routine.

• A routine to mask interrupts.

• A routine to initiate interrupts from software.

2. The Unix Input/Output System.

The target system did not have any device types in common with the host system.

The following device drivers were written:

• A character device driver for a terminal.

• A block device driver for the WD1002.

• A routine (polled) to put characters to the console device when printing system messages.

3. Scheduling, Synchronization and Timing.

With the replacement of swapping by demand paging, the routine scheduling memory residency was no longer needed and was removed from the code. Some minor changes were made to the signal processing software due to the change in the size of integers from 4 bytes on the host system to 2 bytes on the destination system.

4. The Unix File System.

An initial file system was created by writing a standalone program. Special files were created for each device using system calls once the system was at a stage where user processes could be run.

5. Memory Management.

Different MMUs resulted in rewriting all code which was dependent on the structure of the memory management hardware used. The implementation of demand paging resulted in new routines being written and old ones modified.

The new routines were:

• Pagefault() which services the page fault trap.

• Pfreelist() which releases pages which are no longer in use by a process.

• Psync() which ensures that all modified pages in memory are written to the process image file.

• Ptfree() which releases all pages and page tables held by a process.

• Pushusw() and popusw() which respectively place values on and remove values from the user stack.

Modified routines are:

• Malloc() which allocates a page of free memory.

• Mfree() which frees a page of memory.

Routines which were rewritten because they are in assembly language were:

• Copyin() which copies a block of memory from user space to supervisor space.

• Copyout() which copies from supervisor space to user space.

6. Process Management.

Routines which handle processes required modification because of a number of circumstances: changes to memory management, changes in the layout of executable files resulting from the use of a different compiler, and changes in the way processes are represented.

Modified routines were:

• Procdup() which had to create a process image file when duplicating a process.

• Exit() which had to remove the process image file and free the pages and page tables held by the process when terminating.

The new routines required were:

• Copyfile() and indcopy() which together make a copy of the process image file while not copying the "gaps" in the file.

• Delnode() which deletes the process image file directory entry.

The following routines were rewritten:

• Getxfile() which contains the knowledge of the structure of the executable file.

• Setregs() which sets up the initial values for user registers when a process is executed.

• Save() and resume() which are written in assembly code.

• Setjmp() and longjmp() which are also written in assembly code.

7. System Calls.

All system calls were modified to get parameters and return results on the user stack to interface with the runtime library provided with the compiler being used. The calls exit(), fork() and exec() were modified in other ways, but the modifications are described in the section "Process Management". Exec() also required modification because of the assumption that integers and pointers were the same size.

8. System Initialization.

Two assembly language routines were written, both highly dependent on the machine architecture. They provide the initialization of hardware registers and memory management, and call a new "C" routine to set up the page tables for supervisor space.

Two existing routines were modified:

• Main() was modified as a result of changes to the representation of processes.

• Startup() which is called by main() during initialization was changed to store information used in context switching.