
MIKE: A NETWORK OPERATING SYSTEM FOR

THE DISTRIBUTED DOUBLE-LOOP COMPUTER NETWORK

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the Graduate

School of The Ohio State University

By

Duen-Ping Tsay, B.S.E.E., M.S.

*****

The Ohio State University

1981

Reading Committee:

Dr. Ming T. Liu, Chairman

Dr. Kenneth J. Breeding

Dr. Bruce W. Weide

Approved by

Adviser
Department of Computer and Information Science

I dedicate this dissertation and all of its related efforts to my mother, the late Shu-Ming Hung Tsay.

ACKNOWLEDGMENTS

I thank my adviser, Professor Ming T. Liu, for his constant support and guidance during the development of this research and my graduate education. His faith in and encouragement of all my work are truly appreciated.

I would like to express my gratitude to Dr. Kenneth J. Breeding and Dr. Bruce W. Weide for serving on my reading committee. Their understanding and support throughout this entire process are very much appreciated.

Many persons have also furnished much-appreciated help in completing this research. Dr. Jacob J. Wolf, C. P. Chou, and J. J. Lin contributed in creating an interesting atmosphere in which to work. I am especially grateful to Richard C. Lian for his many insights and for his efforts to clarify the ideas presented here.

I would like to thank the Graduate School of The Ohio State University (Presidential Fellowship) and the National Science Foundation (Grant MCS-77-23496) for their financial support.

Finally, I am grateful to my parents, Dr. Jeh-Sheng Tsay and Mrs. Ai-Chu Chai Tsay, for their never-ending support and enthusiasm. I would also like to thank my wife, Kung-Tai, who has been very understanding throughout my entire graduate education. There are no words that could express my appreciation for her except that this Ph.D. is jointly hers in every respect.

VITA

September 3, 1949 ...... Born - Taipei, Taiwan, China.

1971 ...... B.S.E.E., National Taiwan University, Taipei, Taiwan, China.

1971-1973 ...... Second Lieutenant, Signal Corps, Chinese Army, Quemoy, Fukien, China.

1973-1974 ...... Teaching Assistant, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, China.

1975-1976 ...... Graduate Teaching Assistant, Department of Computer Science, Northwestern University, Evanston, Illinois.

1976 ...... M.S., Northwestern University, Evanston, Illinois.

1976-1977 ...... Graduate Teaching Fellow, Department of Computer Science, University of Utah, Salt Lake City, Utah.

1977-1978 ...... Graduate Teaching Associate, Department of Computer and Information Science, The Ohio State University, Columbus, Ohio.

1978-1980 ...... Graduate Research Associate, Department of Computer and Information Science, The Ohio State University, Columbus, Ohio.

1980-1981 ...... Graduate Fellow, Graduate School, The Ohio State University, Columbus, Ohio.

PUBLICATIONS

"A Study of Free Space Allocation and File Reorganization Problem of VSAM Data Set," Master Thesis, Northwestern University, Computer Science Department, Evanston, Illinois, December 19 75.

"A Study of VSAM's Behavior of Free Space Allocation and Maintenance Cost," Accepted for Presentation and Publication by the ACM 1976 Annual Conference, Houston, Texas, October 1976. (C. H. Chin, coauthor.)

"Design of a Distributed Fault-Tolerant Loop Network," Proceedings o_f 19 79 International Sympos ium o n Fault-Tolerant Comput ing , pp. 17-24, June 1979. (M. T. Liu, J. J. Wolf, and B. W. Weide, coauthors.)

"System Design of the Distributed Double-Loop Computer Network (DDLCN)," Proceedings of First International Conference o n Distributed Comput ing Systems, pp. 95-105 , October 1979. (M . T. Liu, J. J. Wolf, B. W. Weide, R. Pardo, and C. P. Chou, coauthors.)

"Interface Design for the Distributed Double-Loop Computer Network (DDLCN)," Proceedings of 19 79 National Telecommunicat ions Conf erence, pp. 59.3.1-6, November 1979. (M. T. Liu, coauthor.)

"Design of a Reconfigurable Front-End Processor for Computer Networks," Proceed ings o f 19 80 International S ympos ium o n Fault-Tolerant Comp ut ing, pp. 369-371 , October 1980. (M. T. Liu, coauthor.)

"Design of a Robust Front-End for the Distributed Double-Loop Computer Network (DDLCN)," Proceed ings of Distributed Data Acquisition, Computing, a nd Control Sympos ium, pp. 141-155, December 1980. (M. T. Liu, coauthor.)

"Design of the Distributed Double-Loop Computer Network (DDLCN)," Journal of Digital Systems, Vol. 4, No. 4, April 1981. (M. T. Liu, C. P. Chou, and C M. Li, coauthors.)

"MIKE: A Network Operating System for the Distributed Double-Loop Computer Network (DDLCN)," to appear in Proceedings o f CO MP SAC'81, Chicago, Illinois, November 18, 1981. (M . T. Liu, coauthor.)

"Design of a Network Operating System for the Distributed Double-Loop Computer Network (DDLCN)," submitted to the International Symposium on Local Computer Networks , Florence, Italy, April 1982. (M. T. Liu and R. C. Lian, coauthors.)

FIELDS OF STUDY

Major Field: Computer and Information Science

Digital Computer Architecture and Organization. Dr. Ming T. Liu

Computer Programming, including System Programming. Dr. Sandra A. Mamrak

Theory and Processing of Programming Languages. Dr. Jayashree Ramanathan

TABLE OF CONTENTS

ACKNOWLEDGMENTS

VITA

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

Chapter

I. INTRODUCTION

    Problems of Distributed Systems
    Objectives of Dissertation
    Significant Features of Research
    Organization of Dissertation

II. BACKGROUND AND PREVIOUS RESEARCH

    Related Works
    Archons
    CAP
    Hydra
    Octopus
    Roscoe
    StarOS
    MIKE: An Operating System for DDLCN
    Distinct Characteristics
    Functional Capabilities
    System Transparency
    Cooperative Autonomy
    Related Works
    System Utilities
    Reliability and Robustness
    Extensibility and Configurability
    Principles of Kernel Design
    Data Abstraction
    Capability-Based Addressing
    Domain-Based Protection
    Related Works
    Software-Directed Architecture
    Related Works
    Breakdown of Research

III. THE DISTRIBUTED DOUBLE-LOOP COMPUTER NETWORK

    General System Overview
    Reliable Communication Network
    Loop Interface Design
    Loop Operation
    Multi-Destination Protocols
    Distributed Programming Systems
    Distributed Database System
    Other Research on DDLCN

IV. NETWORK OPERATING SYSTEM MODEL

    The Hierarchical Framework of MIKE
    The Object Model
    Introduction
    The Object Model of MIKE
    Type Definitions
    Processes
    Tasks
    Guardians
    Task Classification
    Type Tasks
    Service Tasks
    Operating System Tasks
    Naming
    Naming System
    Naming Service Task
    A Note on Implementation
    Process Interaction Model
    Message Passing Versus Procedure Calling
    Two-Level Process Interaction Model
    Intra-Task Communication
    Inter-Task Communication
    Message-Based Invocation
    Messengers
    Resource Protection
    Small Protection Domains
    Closed Environment
    The Least Privilege Principle
    System Protection Mechanisms
    Domain Isolation
    Protection Domain Switching
    Residual Control
    Intra-Node Residual Control
    Inter-Node Residual Control
    Action Verification
    Error Recovery
    System-Transparent Resource Sharing
    Application Environments
    Remote Resource Access

V. PROTOCOL STRUCTURE

    Introduction
    Principles of Protocol Design
    Interprocess Communication Layer
    Functions of IPC Layer
    Communication Facility of DDLCN
    Loop Access Protocol
    Unreliable Multi-Destination Protocol
    Reliable Multi-Destination Protocol
    Guaranteed Multi-Destination Protocol
    System Support Layer
    Functions of System Support Layer
    Abstraction Sublayer
    Task Templates
    Primitive NOS Utilities
    Interaction Sublayer
    Unreliable Session Protocol
    Reliable Session Protocol
    Guaranteed Session Protocol
    Virtual Machine Layer

VI. LIU ARCHITECTURE

    Introduction
    Minimum-Overhead Messages
    Minimum-Overhead Processes
    Inter-Task Communication
    Message Generation
    Message Mapping
    Multiprocessor Configuration
    Reconfigurable LIU Architecture
    Sliced Computer Module and Bit-Sliced Processing
    System Architecture of LIU
    Software-Directed Architecture
    Notion of Processes
    Dispatching
    Process Synchronization
    Domain Switching
    Stack Architecture
    Capability Mechanisms
    Authority Checks
    Typed Memory

VII. SUMMARY AND DIRECTIONS FOR FUTURE RESEARCH

    Summary of MIKE's Significant Features
    Areas of Future Research

BIBLIOGRAPHY

LIST OF TABLES

1. Examples of Types and Their Type-Specific Operators

LIST OF FIGURES

1. Interaction Between and Contribution of the Principles of Kernel Design

2. Prototype of DDLCN

3. An Overview of a Three-Node DDLCN

4. Type Hierarchy of MIKE

5. MIKE Profile Based on Task Concept

6. A FILE Task

7. Machine-Oriented Names For Entities

8. Inter-Guardian Communication

9. System-Transparent Resource Sharing

10. Protocol Hierarchy

11. Approximate Correspondence Between ISO and MIKE Protocol Hierarchy

12. Full-Broadcast Message Transmission

13. Partial-Broadcast Message Transmission

14. Inter-Guardian Message Exchange

15. Inter-Guardian Communication Using the Virtual Communication Machine

16. Relationship Between Virtual and Physical Machines

17. A Multiprocessor Organization for LIU

18. General System Architecture

19. An Extended Stack Architecture

20. Entity Representations in a Typed Memory

CHAPTER 1

INTRODUCTION

Microelectronics and information processing are inextricably linked. Machines which retrieve, sort, compute, process, store, and transmit information are built out of microelectronic devices. The advances in semiconductor technology have led to the steady improvement in the level of integration, speed of operation, reliability, and production yields, and to the drastic reduction of cost, power consumption, and size for these integrated circuits. This progress will continue into the foreseeable future [BHA79].

The most dramatic outgrowth of the microelectronic revolution was the proliferation of mini- and microcomputers. These small computer systems are less expensive yet offer high performance, and they can provide cost-effective processing power for a variety of applications [TER77].

Furthermore, we can afford to downgrade the importance of processor utilization by capitalizing on the price-performance relationship of these small computers and concentrating on other goals such as system reliability and extensibility. These technological developments have spurred a great deal of interest in many new research areas. Distributed-processing computer networking is one such area.

The motivation to have distributed-processing computer networks is manifold. The never-ending quest for increased processing support at the lowest possible cost and with the smallest possible incremental expansion, combined with the demand for enhanced user convenience, are factors influencing the trend toward such a system. Also, software development and maintenance, and computer operations, are becoming more and more expensive, putting further pressure on system designers to share valuable resources among a community of users in an optimal way.

However, the biggest impact of all is that the entire architecture of modern computer systems has been altered by the versatility and flexibility of these small computers.

No longer is the cost-effective processing of information carried out only in one computer's central processing unit.

Today there is a trend toward distributing more processing capability throughout a network of computers, so that the structure of the computer system matches the data flow and organizational structure of the user's specific environment.

This practice will result in giving local organizational elements more responsive computer support and more effective use of valuable computing resources [BOC79, CHA80, ENS78, STA79, TAN81].

Recently, Enslow [ENS78] has given one of the most widely cited definitions of distributed systems (1). Besides the distribution of the physical components of a system (processing logic, data, the processing itself, and the operating system), a proper definition must cover the concepts under which the distributed components interact. He gave five essential characteristics of a distributed system:

1. A multiplicity of general-purpose resource components, including both physical and logical resources, that can be assigned to specific tasks on a dynamic basis.

2. A physical distribution of these physical and logical components of the system, interacting through a communication network.

3. A high-level operating system that unifies and integrates the control of the distributed components.

4. System transparency, permitting services to be requested by name only.

5. Cooperative autonomy, characterizing the operation and interaction of both physical and logical resources.

(1) In the following discussion, we will use the term "distributed systems" to cover distributed-processing computer networks, distributed data processing systems, and the like.

The potential benefits of distributed systems include:

1. high system performance,

2. high reliability and availability,

3. graceful degradation (fail-soft capability),

4. automatic resource and load sharing,

5. ease of modular, incremental growth,

6. high adaptability to changes in work load, and

7. more natural mapping from applications to processors.

Considerable work has been done on new designs for distributed systems to achieve subsets of these benefits, but very few existing systems have made substantial progress toward meeting all of the criteria. However, several such "totally distributed" systems are currently at different stages of design and implementation [JEN80, LIU81, SOL79].

1.1 Problems of Distributed Systems

In our view a distributed system is a special case of a computer network, i.e., one with a high degree of cohesiveness, transparency, and autonomy. The users of such a system will see a single, powerful, unified computing facility with a great number of resources. They do not even need to be aware of the network structure or its method of operation at all. A distributed system is physically loosely coupled but logically tightly coupled. However, no one processor can dominate the others; all must cooperate in harmony as a community of equals. To integrate the physical and logical resources of a distributed system into a functioning whole, the concept of a high-level network operating system (NOS) must be implemented [ENS78]. That is, a well-defined set of policies, which features system transparency and cooperative autonomy among network resources, must govern the integrated operation of the system. The mechanisms which execute these policies must interact harmoniously, and their control structures must be totally distributed and highly nonhierarchical, i.e., no master/slave relationship can ever exist among these control mechanisms.

The decentralization of these mechanisms has resulted in each control mechanism having to collect the global system state, since this information is totally distributed. However, no one process has more than probabilistic knowledge of the global state, since its totality may be incomplete or inconsistent due to the following situations:

1. The resources may, either intentionally or unintentionally, shield information from outside inspection, and

2. Information becomes inaccurate since there will be a time delay in the collection process.

In addition to this, the bulk of a NOS has to deal with such problems as naming, protection, synchronization, heterogeneity, resource sharing, and interprocess communication at the network level. These issues are consequences of the absence of uniqueness, both in time and in space, in distributed environments.

Besides these issues which have to be dealt with in distributed environments, few hardware-assisted mechanisms are usually provided to alleviate the complexities and meet the needs of the system-wide control executive. Therefore, a huge gap exists between the abstractions called for by the NOS structures and the capabilities directly realized by computer hardware. A NOS which has to cope with such a hostile environment and be implemented on such primitive hardware will be very complicated and unmanageable. This will eventually lead to an unreliable product which is difficult to design, debug, and maintain.

The aforementioned factors contribute to the scarcity of "totally" distributed systems. In reality, many existing applications have to settle for less "totally distributed" systems for economic reasons. Furthermore, these systems use ad hoc extensions and "cut and paste" techniques to implement decentralized control mechanisms which are unnatural, theoretically unsound, too complicated to comprehend, and hence unreliable. In order to develop a reliable and efficient NOS, its system design must use sound principles such as modern operating system design concepts.

In addition, the underlying computer hardware should support these concepts at the architectural interface for NOS services. This integrated hardware and software design approach is required if NOSs are to be implemented efficiently.

A network operating system called MIKE [TSA81, TSA82] is proposed in this dissertation which provides system-transparent operation for network users and maintains cooperative autonomy among local computer systems. MIKE, which stands for Multicomputer Integrator KErnel, is designed for use in distributed systems in general and for use in the Distributed Double-Loop Computer Network (DDLCN) [LIU75, LIU78, LIU79, LIU80, LIU81] in particular. It consists of replicated kernels which reside in Loop Interface Units (LIUs), and a LIU is used to attach each local computer system onto the DDLCN.

MIKE incorporates modern operating system design principles to enhance its robustness and to reduce its complexity. The LIU architecture is designed with the explicit purpose of facilitating the implementation and maintenance of a coherent NOS for the DDLCN. This integrated approach to the design of the NOS will provide viable system-wide control software for the DDLCN.

1.2 Objectives of Dissertation

The design and implementation of "totally" distributed systems presents intellectually challenging software and hardware tasks with great potential benefits, and yet remains within reach, since it is based on existing technology.

However, the issues which have to be dealt with in a distributed environment are so complex that they significantly hinder the development of distributed systems. This complexity can be considerably reduced and made more manageable if the design of NOSs uses advanced operating system principles coupled with extensive architectural support for these concepts.

This research is a major design effort to provide a robust NOS which integrates and unifies the control of autonomous computer systems into a functioning whole. The major design and performance objectives addressed in this dissertation are given below:

1. To establish the framework of MIKE, which is used as a decentralized system-wide control for the DDLCN.

2. To adopt modern operating system design principles to structure MIKE such that its complexity can become more manageable.

3. To develop a software-directed computer architecture such that advanced operating system concepts can be supported at the architectural interface.

The presence of such factors as system transparency, cooperative autonomy, and functional modularity in the DDLCN is essential, since without them, many of the distinct advantages cited above will be compromised, and the system will be just another computer network with a low level of coherence. An integrated approach to the design of the MIKE framework, i.e., the NOS model and its protocol structure, is required if system transparency and cooperative autonomy are to be realized efficiently and effectively. Further, this approach provides a versatile system architecture such that resource sharing and distributed computing can evolve in a modular and incremental fashion.

The consistent adoption of modern operating system principles throughout the design of MIKE is essential since it facilitates the creation of a robust network operating system. Implementations of these concepts will be costly in both time and space, and a heavy overhead is expected if the logical structure of MIKE is imposed onto conventional hardware. The reasons are that these concepts characterize the basic procedure invocation interface and demand reliable message exchange, which are very primitive and frequently used, and any slight degradation is likely to have a significant impact on overall performance. Furthermore, many software systems are unreliable because of inadequate hardware assistance for hardware and software error detection and confinement. Therefore, considerable hardware support is required if the NOS services are to be implemented efficiently. Highly reliable operating systems will result only when the kernel is constructed by integrating hardware, firmware, and software.

The use of these modern operating system design principles and additional hardware/firmware support will contribute to the implementation and usability of MIKE. In particular, they can provide effective and efficient means to achieve such factors as system transparency and local autonomy. The added cost of additional hardware assistance can be justified in terms of decreased software development and maintenance costs, better quality, and more usable products.

The above research objectives represent the basis for a comprehensive study into the design of a NOS for distributed systems. To meet them, the NOS structure has been modularized. Component interaction within the NOS structure has been formalized. A layered protocol has also been developed to support reliable and uninterrupted message communication and to provide common abstraction mechanisms for the NOS services. Extensive hardware/firmware mechanisms have been provided to absorb the burden of supporting an efficient software implementation.

1.3 Significant Features of Research

To accomplish all of the above design goals for MIKE, many new techniques must be applied and some totally new concepts have to be developed. To better illustrate some of the significant contributions of this research to distributed processing, network operating systems, and software-directed architectures, the following discussion will summarize some of the more important capabilities and features of this research work.

Overall, the contribution of this research lies in the conceptual model we have devised, into which the network operating system (in LIUs) and local operating systems (in attached host computer systems) can be fitted consistently. That is, this conceptual model embraces both MIKE and local operating systems coherently, such that the system operation of DDLCN can be described consistently in terms of components and protocols of the model.

MIKE is a network operating system which provides system-transparent operation for the users and maintains cooperative autonomy among individual computer systems. Therefore, it presents the DDLCN to the users as a computer network with a high degree of cohesiveness, transparency, and autonomy.

It adopts modern operating system design principles to structure its organization. These concepts include data abstraction, capability-based addressing, and domain-based protection [DEN76, DEN79, FLY79, LIN76b, RAT80]. The use of these concepts can contribute to and enhance several characteristics whose presence is essential in a distributed environment. Among these characteristics are system transparency, cooperative autonomy, reliability/robustness, and extensibility/configurability. Therefore, these concepts purport to enhance the utility and reliability of MIKE, as well as to decrease its software development and maintenance costs.

The MIKE structure is based consistently on the object model [JON78, LIN76a, LIS75]. The object model is used to cope with the complexity of NOSs and can be used to verify or validate the correctness and other desired properties of the system. The model increases the reliability of MIKE by dictating that system resources be accessed or manipulated only in terms of well-defined functions or operations. The principle of data abstraction contributes to system transparency, functional modularity, and integrity of system resources.

All communication among autonomous processes (either local or remote) is done by message passing [LAU78, STA79, TAN81]. The use of message passing as an underlying semantic concept best reflects the architecture of distributed systems, i.e., no direct memory sharing exists among component computers in the system. Capabilities within request and reply messages are used to name and validate access to network resource objects. Resources can be shared by exchanging appropriate capabilities in messages. The principle of capability-based addressing contributes directly to system transparency and functional modularity.

Furthermore, capabilities provide a dynamically changing protection domain [DEN76, LIN76b, RAT80]. This domain-based protection concept conforms with the least privilege principle [DEN80, GLI79] by allowing minimum access capabilities for processes; consequently, it limits the propagation of errors, both software and hardware. It contributes to system reliability and robustness.
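To make these two principles concrete, the following fragment is a minimal sketch, in C, of how a capability might pair an object name with a rights mask and how a protection domain (the set of capabilities a process holds) mediates every access and every grant made through a message. The structure layout, field names, and rights encoding are illustrative assumptions introduced here for exposition; they are not MIKE's actual representation.

    /* Hypothetical sketch of capability-based access checking; the fields
     * and rights encoding are illustrative, not MIKE's actual format. */
    #include <stdint.h>

    typedef enum { R_READ = 1, R_WRITE = 2, R_EXECUTE = 4 } rights_t;

    typedef struct {
        uint32_t object_id;   /* network-wide name of the resource object */
        uint32_t rights;      /* bit mask of permitted type-specific operations */
    } capability_t;

    /* A protection domain is simply the set of capabilities a process holds. */
    typedef struct {
        capability_t caps[8];
        int          count;
    } domain_t;

    /* Access is granted only if the domain holds a capability for the object
     * that carries the requested right -- the least-privilege check. */
    int check_access(const domain_t *d, uint32_t object_id, rights_t need)
    {
        for (int i = 0; i < d->count; i++)
            if (d->caps[i].object_id == object_id &&
                (d->caps[i].rights & (uint32_t)need) == (uint32_t)need)
                return 1;
        return 0;   /* no capability: the request is refused */
    }

    /* Sharing a resource amounts to copying a (possibly restricted) capability
     * into the receiver's domain, e.g. when it arrives in a reply message. */
    void grant(domain_t *d, capability_t c, rights_t restrict_to)
    {
        c.rights &= (uint32_t)restrict_to;     /* pass only the rights needed */
        if (d->count < 8) d->caps[d->count++] = c;
    }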

A uniform treatment of operating system and user processes is adopted to provide an extensible and configurable environment [RAT80, RAT81, WUL81, ZEI81]. The only distinction between them is largely a matter of privilege. The operating system functions can, when necessary, be replaced by user-supplied specific functions. Furthermore, users are able to configure the NOS, according to their specific environment, to provide the functions that they need without undue penalty from facilities that they do not need. This leads to a very flexible and cleanly structured end product containing no artificial boundaries to complicate design and detract from efficiency. Extensibility and configurability can be achieved easily by using data abstraction, capability-based addressing, and domain-based protection mechanisms.

The protocol hierarchy, described bottom up, consists of three layers: the interprocess communication layer, the system support layer, and the virtual machine layer. The interfaces between layers of protocol are kept very simple. The bottom layer of the protocol hierarchy, capitalizing on the specific network topology of DDLCN, uses multi-destination protocols to provide reliable and uninterrupted message communication for system and user processes. The system support layer provides two distinct kinds of services for the upper layer. It abstracts common mechanisms to facilitate the introduction of new NOS services while maintaining the integrity of MIKE. It also provides, through session-oriented protocols, a virtual communication channel for the virtual machine layer. The virtual machine layer defines a basic set of standard resource management services such as remote resource access, distributed synchronization models, etc. It allows a user to expand these standard services by adding other facilities such as distributed database management and the like. This integrated approach to NOS and protocol design allows system-transparent resource sharing and distributed computing to evolve in a modular fashion.
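The layering just described can be illustrated by the following C sketch of a send path, in which a request made at the virtual machine layer is carried by a session at the system support layer and finally by a multi-destination send at the interprocess communication layer. The function names and the plain call-through interfaces are assumptions made purely for exposition; they do not correspond to MIKE's actual primitives.

    /* Illustrative three-layer send path modeled on the hierarchy above. */
    #include <stddef.h>

    typedef struct { unsigned dest_nodes; const void *body; size_t len; } msg_t;

    /* Layer 1: interprocess communication layer -- multi-destination delivery
     * over the double loop (stubbed here). */
    static int ipc_multidest_send(const msg_t *m) { (void)m; return 0; }

    /* Layer 2: system support layer -- wraps the IPC layer in a session so the
     * caller sees an uninterrupted, ordered channel. */
    static int session_send(int session_id, const msg_t *m)
    {
        (void)session_id;                 /* sequencing, retries, etc. */
        return ipc_multidest_send(m);
    }

    /* Layer 3: virtual machine layer -- a standard service such as remote
     * resource access is expressed as a request on a session. */
    int remote_resource_request(int session_id, unsigned resource_name)
    {
        msg_t m = { .dest_nodes = ~0u,    /* broadcast: owner located by name */
                    .body = &resource_name, .len = sizeof resource_name };
        return session_send(session_id, &m);
    }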

The architecture of the Loop Interface Unit (LIU) in which MIKE is housed is designed with the explicit purpose of facilitating the creation and maintenance of a robust NOS for the DDLCN. It is designed in such a way as to narrow the gap between the abstractions called for by the NOS structure and the capabilities directly realized by conventional computer hardware. LIU comprises a number of processing units which are grouped into a hierarchy of subsystems. Each subsystem corresponds to one layer of the protocol structure. The system organization is specially configured according to the MIKE hierarchical framework to expedite the message flow.

Function migration into the hardware/firmware has been done to raise the level of the hardware/software interface and to efficiently and effectively realize the logical structure of MIKE [DEN79, KAH81, RAT80, RAT81, ZEI81].

Adequate hardware/firmware assistance is provided to confine and detect software and hardware errors. Representations of objects, capabilities, and protection domains are facilitated by special hardware mechanisms for performance and reliability reasons. This integrated software and hardware provides a viable environment where MIKE can be implemented efficiently.

The preceding paragraphs have shown why it is strongly believed that the design of MIKE for the DDLCN described in this dissertation represents a significant contribution to the fields of distributed processing, network operating systems, and software-directed computer architecture. It should be evident that in order to meet the ever-increasing computing needs of the future, distributed systems should provide automatic and implicit resource sharing among constituent hosts instead of manual and explicit sharing. Robust system-wide control software, such as the MIKE design proposed here, is essential to the realization of such a distributed system.

1.4 Organization of Dissertation

This dissertation is arranged so that each chapter addresses a distinct topic of research or area of investigation. Taken together, the chapters form a coherent view of MIKE and its underlying architecture.

The first chapter serves as a foundation from which the rest of the research may be presented. It contains an overview of the need for, and the feasibility of, distributed systems for future computing demands, and indicates that a robust NOS is needed to facilitate the integration and unification of the distributed components. The major design objectives of MIKE and its supporting architecture are presented, and the significant contributions of this research to the fields of distributed computing, network operating systems, and software-directed computer architecture are discussed.

Chapter 2 provides an overview of previous research in the areas which relate to this dissertation, and gives a background from which to work. Several terms and concepts are defined that will be used throughout the dissertation. Finally, a breakdown of the research is given.

Chapter 3 gives a brief overview of the system design of DDLCN, which is currently being implemented under a grant from NSF at The Ohio State University. The DDLCN serves as a vehicle to better illustrate the role and function the proposed NOS plays in a distributed environment.

Chapter 4 outlines the framework of MIKE. It details the logical structuring and grouping of system resource entities, and then describes the process interaction model for MIKE. It also covers general NOS model issues such as naming, resource sharing, protection, and error recovery.

Chapter 5 describes a robust system kernel based on a multi-layered multi-destination protocol to support the NOS services. It presents the design principles of the protocol structure and identifies each layer of the protocol hierarchy along with its functions.

The system architecture of the Loop Interface Unit (LIU) in which all of the system-wide executive resides is described in Chapter 6. It explores the system organization, which is specially configured to closely match the MIKE protocol hierarchy. Chapter 6 also describes the unusual aspects of the LIU which support modern operating system design principles at the architectural interface.

Chapter 7 contains the conclusions of the NOS and its architecture research and provides a good review of the hopes and accomplishments of this integrated approach. Directions for further research that might extend and enhance the good features of this research work are suggested.

CHAPTER 2

BACKGROUND AND PREVIOUS RESEARCH

The term "distributed system" is being used with increasing frequency, but unfortunately there tends to be

little common agreement about what the terra means, and it

therefore is used to represent a number of very different, and occasionally centralized, concepts [ENS78, JEN80,

STA79]. Thus, our initial step is to delineate the region

in which we are interested.

The distributed system being considered here is a collection of autonomous and heterogeneous computers networked together [REA76, WOL79b]. All these computer systems are independent and capable of solving their own problems most of the time. Occasionally, because they have limited resources at their immediate disposal, they would find it advantageous, even absolutely necessary, to use the resources of other computers in the network. These resources might be specialized peripheral devices, processors with increased capabilities, large amounts of primary memory, particular software packages not available at the local site, databases maintained in other computer systems, etc.

The objective of forming such a cluster of interconnected computers is to present a single, integrated computing facility with a great number of resources to the end users, such that they do not need to be aware of the system's actual organization and methods of operation [LIU78, REA76]. Such a particular environment is chosen for this study because it is typical of many found today in industrial, commercial, and university settings.

In order to provide users with such a high degree of cohesiveness, transparency, and autonomy, a separate piece of system-wide control software called a network operating system (NOS) has to be designed to coordinate the operations of local computer systems. This dissertation presents the conceptual design of the Multicomputer Integrator KErnel (MIKE), which is the proposed NOS for the Distributed Double-Loop Computer Network (DDLCN).

2.1 Related Works

To accomplish the design objectives which are outlined in Chapter 1 for MIKE, many techniques and concepts have to be employed. However, due to the diversity of these techniques, concepts, and application environments, it is a difficult job to survey other related works; that is, on the one hand, systems that employ a similar design methodology are nonnetwork operating systems or network operating systems designed for different environments; on the other hand, systems which are designed for the same application environment do not use similar design methodology. However, by providing an overview of previous research, even if not exactly compatible to the DDLCN, we can still be better motivated for the approach we are taking and can appreciate the resulting architecture of MIKE.

Numerous example systems can be found for either category mentioned above. However, we will not attempt to do a comprehensive study of all the related systems, nor will we attempt to give a comprehensive description of any of the systems. Notable among those systems that have been proposed and/or implemented are ADAPT [PEE80], Archons [JEN80], CAP [WIL79], DCS [ROW75], Desperanto [MAM81], HXDP [JEN78], Hydra [WUL81], ICOPS [VAN76], Medusa [OUS80], MICROS [WIT80], MININET [PEE78], MuNet [HAL80], Octopus [WAT80], Roscoe [SOL79], RSEXEC [FOR78], StarOS [JON79b], System 250 [ENG74], TRIX [WAR80], and X-Tree [MCC80]. Some stand out as being especially interesting and can be used as representatives in their particular environments. These are Archons, CAP, Hydra, Roscoe, and StarOS.

It is tempting to compare these systems in an effort to perceive the good and bad points of each design. However, for the reason stated above, such a comparison is of doubtful value since the goals and implementations are so different. These differences in design goals and system implementations necessitate differences in the resulting systems. Moreover, these differences cause major goals of one design to be ignored or considered as minor issues in the other designs. Therefore, rather than attempting to compare these systems now, we will begin with a brief overview of each system selected and will then enumerate distinct characteristics of the MIKE architecture. For each characteristic, we will then contrast the approaches taken by those systems where the issue is applicable to them (1). Hopefully, this will highlight in a vivid way the important aspects of the representative systems in which we are interested, and provide insight into some of the problems which will be addressed later.

(1) For example, centralized operating systems do not concern themselves with issues such as system transparency and local autonomy.

2.1.1 Archons

The Archons project [JEN80] is aimed at producing a second-generation large-scale distributed system (HXDP [JEN78] was the first) with a multiplicity of processors which are physically and logically interconnected to form a single system. It permits experimentation with not only the design but also the hardware/firmware/software implementation of decentralized executive control mechanisms, from the lowest levels to the operating system user interface. Here, a distributed system is defined as a collection of processor/memory pairs having decentralized system-wide control at the executive levels and below. The constituent components (called application subsystems) of Archons have disjoint main memories and communicate via explicit message exchange. The Archons project has initially focused its attention on real-time control. It interconnects as many as 256 application subsystems, which may be heterogeneous, general- or special-purpose, and may each contain any number of processors. ArchOS is an operating system for Archons. It implements decentralized resource management in such a way that the programmer is aware that the machine which executes his programs is a distributed multicomputer, and it thus supports concurrent operation at all levels of abstraction.

2.1.2 CAP

The Cambridge CAP [NEE77a, NEE77b, NEE77c, WIL79] is a 32-bit capability-based computer developed at the University of Cambridge. The "cap" is an abbreviation for "capability" (2). CAP is viewed by its designers as an experiment based on the capability concept [DEN66, FAB74, GRA72, POP74, SAL75], and it remains more of a research tool than a production system. Work started on the CAP project in 1970. The machine began to function in late 1974. Since then, modifications have been made to improve the original design, and the work is still continuing. The CAP project was an exercise in the design of a coordinated hardware, firmware, and software system. The design is based on implicitly loaded capability registers. This is perhaps the most distinctive feature of the CAP computer from the hardware point of view. The instruction set of CAP is not hardwired, but rather microprogrammed. It uses a segmented, partitioned memory to organize information into segments containing either capabilities or data, but never both.

(2) A capability is defined as an unforgeable token used as an identifier for an object such that possession of the token confers access rights for the object [DEN76]. It is analogous to a reference, or a pointer, in programming languages; the major difference is that a capability, in addition to pointing to an object, contains protection information [WUL81].

2.1.3 Hydra

Hydra [WUL74, WUL81] is an operating system kernel built for the Carnegie-Mellon multi-mini-processor computer known as C.mmp. Hydra runs on a hardware configuration consisting of sixteen PDP-11 minicomputers which are connected to 2.6 million bytes of shared memory via a cross-bar switch [WUL81]. Hydra is intended to provide a vehicle for exploring algorithms and program structures that exploit asynchronous parallel processing on the C.mmp hardware. Its memory system provides a paged, partitioned virtual memory. The protection system of Hydra is object-oriented, and its entire capability implementation has been constructed in software. Hydra is the most extensive effort to date to build a capability-based operating system on conventional hardware.

2.1.4 Octopus

The project to develop a tightly-coupled network operating system for Octopus at the Lawrence Livermore Laboratory is still in the design phase [DON76, DON79, DON80, FLE80, WAT80]. Octopus is a high-performance local-area network interconnecting hundreds of micro/mini computers [FLE73]. The primary design goal of the NOS is to provide users with a uniform, coherent view of distributed resources scattered around Octopus. The resulting NOS will be efficiently implementable and usable as the base operating system on a single system of common current architecture, as well as implementable as a "guest" layer on existing operating systems that support appropriate interprocess communication.

2.1.5 Roscoe

Roscoe [BRY80, SOL79] is an operating system implemented at the University of Wisconsin for a network of LSI-11 microcomputers. It allows the network to cooperate to provide a general-purpose computing facility in which resources are shared in a distributed and non-hierarchical fashion. All communication between nodes is by message transfer in a store-and-forward scheme. Routing is performed by fixed routing tables at each node. In Roscoe, the kernel is designed to support both message-passing and communication-link mechanisms. Links (3) and messages are idiomatic to Roscoe. The link mechanism used in Roscoe is inspired by, and heavily influenced by, the link concept of the CRAY-1 Demos operating system [BAS77]. The Roscoe system appears to the users as a single powerful machine in which all processes communicate by way of a uniform communication mechanism, and communicating processes have no need to know if they are on the same processor and no way of finding out.

(3) A link [SOL79] combines the concepts of a communications path and a capability [FAB74]. It permits a one-way logical connection between two processes.
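The link idea described in the footnote above can be sketched in C as a one-way connection that bundles a destination path with a capability, so that possession of the link is both the route and the permission. The field and function names below are illustrative assumptions only; they are not Roscoe's actual kernel structures.

    /* Hypothetical sketch of a link: a communications path plus a capability. */
    typedef struct {
        unsigned dest_node;      /* where the receiving process lives        */
        unsigned dest_process;   /* which process may be sent to             */
        unsigned rights;         /* what the holder may do over this link    */
    } link_t;

    /* Sending over a link needs no address: holding the link is both the
     * route and the permission, so the sender cannot discover (and does not
     * care) whether the receiver is local or remote. */
    extern int kernel_send(link_t link, const void *msg, unsigned len);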

2.1.6 StarOS

StarOS [JON77, JON79a, JON79b] is a message-based, object-oriented operating system developed for Cm*. Cm* is a multi-microprocessor computer consisting of a two-level hierarchy of buses connecting five clusters of ten processor/memory pairs (which are known as computer modules, or Cm's). Each Cm is made up of an LSI-11 microprocessor, a local memory, and a local switch (the Slocal) which is interposed between the two. Within each cluster, there is a communication controller (known as a Kmap) managing interprocessor communication. In addition to performing the basic communication functions, the microprogrammable Kmaps also provide support for some low-level operating system functions. The main objective of StarOS is to support large collections of concurrently executing, cooperating processes called task forces [JON79b]. Because of this, inter-process communication and synchronization are substantially more frequent, and StarOS is designed as a message-based system that supports rapid and asynchronous message communication. All but the most primitive functions in StarOS are implemented as task forces. Only a few low-level functions are implemented by the Kmaps. As an aid for the construction of task forces, StarOS supports the TASK specification language [JON79a]. StarOS itself is an example task force.

Before we outline the distinct characteristics of the MIKE architecture, we will deviate briefly to informally define several terms that will be used throughout the dissertation. A software error [DEN76] is an item of information which is "incorrect" and can be expunged by error recovery algorithms. A hardware error [AVI77, CAR79] is a mechanical defect which may generate, among other things, software errors. Security [LIN76b] means the protection of resources, both data and programs, from accidental or malicious modification, destruction, or disclosure. Protection mechanisms [POP74, SAL75] are system features that are designed to protect against unauthorized or undesirable access to data. A protection domain [DEN76, FAB74, LIN76b] is an environment or context that defines the set of access rights that a process has to objects of the system. Reliability [DEN76, FLY79] of a system is taken to be a measure of the ability to detect early and to localize hardware/software errors. Robustness [DEN76, SVO79] is a measure of the ability of a system to continue to perform usefully when the protective assumptions are removed and it is subject to abusive usage.

2.2 MIKE: An Operating System for DDLCN

It is one of our purposes in this section to help distinguish our system from other current efforts in this general area, especially in the design methodology, the underlying architecture, and the specific application environment.

A number of very different systems are being mislabeled as distributed systems. In many cases, it is not possible to distinguish between systems known as networks and those called distributed systems on the basis of a system's physical attributes; we must also examine how the constituent distributed components interact. In an attempt to provide a common context in which to discuss, compare and evaluate them, we have defined distributed systems to be those systems that contain five essential characteristics as stated by Enslow [ENS78]. These will immediately exclude some systems which are constrained to being what we have defined as a computer network.

The key ingredient of a distributed system is its NOS. NOSs can be designed in two radically different ways [CHA80, TAN81, WAT80]. One way is to superimpose a NOS on top of a collection of local hosts. The other approach is to throw away the existing operating systems and to start all over again with a single homogeneous NOS.

The Distributed Double-Loop Computer Network (DDLCN) uses the first approach, since it is best suited to the applications on hand as stated above. That is, MIKE is designed to reside on top of existing host operating systems. However, it is worth mentioning here that the result of this research is quite general and can be applied to either approach.

MIKE, which stands for Multicomputer Integrator KErnel, is the NOS proposed in this dissertation to unify and integrate the control of the distributed components in the DDLCN. In order to minimize the surgery on the local operating systems, MIKE is designed as a NOS kernel to be superimposed on top of a collection of local hosts, so that each host can run its own local operating system. MIKE consists of a set of replicated kernels, each residing in one of the Loop Interface Units (LIUs) of DDLCN. Each LIU is a separate processor, independent of the local host computer attached. That is, each node of DDLCN is a pair of processors: the LIU and its local host.

2.2.1 Distinct Characteristics

Traditionally, operating systems play a dual role: multiplexing system resources and furnishing a simpler virtual machine to the user. In a local-area distributed environment like the DDLCN, multiplexing is largely a static or local issue. It is not feasible to dynamically adjust the workload across processors, as the time to re-adjust the load is much longer than the time constant of the load. Our interest is in trying to design a NOS that will integrate a network of computers and provide a virtual environment to the users such that they see only a single computing facility with a high degree of cohesiveness, transparency, and autonomy.

The MIKE research effort is concerned with all aspects of the design of operating systems for distributed systems. The primary design goals of MIKE are fivefold:

1. To provide system-transparent operation to the users such that they do not need to know the different naming and other access mechanisms, including its distributed or non-distributed nature, required by the network, each node, and each service.

2. To maintain cooperative autonomy among local computer systems such that local resource management can be retained.

3. To provide reliable and robust system operation in spite of component node failures.

4. To provide an extensible and configurable environment such that resource sharing and distributed computing can evolve in a modular fashion.

5. To provide a software-directed architecture upon which MIKE will be executed such that the advanced design methodology can be supported at the architectural interface.

2.2.2 Functional Capabilities

2.2.2.1 System Transparency

All the users will have a uniform, coherent view of resources distributed in the DDLCN, since all irrelevant aspects of the implementation have been isolated from them. Users need not (although they may) program differently or use different procedures depending on resource location. A user can request a resource or a service without explicit knowledge of whether a needed resource is local or remote, or even whether there is one. By system default, the local resource pool is always searched first, since it is convenient and economical. Access by local user processes to local resources is designed to be as efficient as on existing single-system operating systems. Only requests for services that could not be satisfied by the local host are then automatically passed to MIKE. MIKE will locate the desired resource in the DDLCN if there is one and if the owner honors the request. It will then present the resource to the user as though it were a local one. Therefore, users are communicating with a "single centralized" system.

Users may wish to take cognizance of this distributed multicomputer system for specific reasons of efficiency. They can then override those transparency mechanisms and explicitly control where a process is to be run, or learn the locations of resources to increase performance.
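The default lookup order described above can be summarized by the following C sketch, in which a caller simply names a resource and the system tries the local pool before passing the request to MIKE. All of the function names are hypothetical placeholders used only for exposition.

    /* Minimal sketch of transparent resource access: local pool first,
     * then MIKE; the caller never states where the resource lives. */
    #include <stddef.h>

    typedef struct resource resource_t;

    extern resource_t *local_lookup(const char *name);            /* local host OS    */
    extern resource_t *mike_locate(const char *name);             /* search the DDLCN */
    extern resource_t *mike_present_as_local(resource_t *remote); /* local proxy      */

    resource_t *acquire(const char *name)
    {
        resource_t *r = local_lookup(name);     /* default: local pool first */
        if (r != NULL)
            return r;

        r = mike_locate(name);                  /* remote owner may refuse   */
        if (r == NULL)
            return NULL;                        /* no such resource anywhere */

        return mike_present_as_local(r);        /* looks like a local one    */
    }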

2.2.2.2 Cooperative Autonomy

The nodes of a distributed system should be autonomous. This includes the ability to remove a machine from the network in the extreme, but more generally, it means that the internal algorithms and organization of information can be freely selected at each node, independently of other nodes in the system [CLA80, ENS78, STA79]. Decisions about which information is to be kept, how it is to be organized, how it is to be processed, and for what purposes it may be used and shared are all to be locally decided. Therefore, a node can, at its discretion, refuse to share its resources with remote users due to, for example, an upsurge of local demands. This is believed to be the most widely used form of resource management since it reflects naturally existing organizational structures.

The requirement of autonomy must be balanced against the need for coherence. A distributed system should not be distributed anarchically. Cooperative autonomy is therefore adopted in the DDLCN, where certain conventions and protocols (i.e., MIKE) are maintained system-wide, but only where nodes interact. In particular, there can be no autonomy in regard to addressing and authentication, and this is essential to attain many of the benefits listed in Chapter 1.

2.2.2.3 Related Works

Most distributed systems offer some form of system transparency, in that interprocess communication is carried out in a uniform (to the user) fashion, regardless of the location of destination processes. However, this is not the case for Archons. Archons [JEN80] aims to fully utilize system resources and to yield maximum concurrency. Therefore, users are required to be explicitly aware of the system configuration.

The management decentralization of resource sharing is determined by the number of resource managers involved, and by the relationships among them [CLA80, ENS78, JEN80]. The whole spectrum of this management decentralization has been illuminated by [JEN80]. At one end, the resource management is in the form of autocracy, where a single entity (local resource manager) unilaterally makes and carries out all decisions on every local resource access. (The DDLCN is in this category.) This is the maximally centralized case in that the network resources are partitioned into disjoint subsets, each of which is controlled independently of the others by one (local) resource manager, and no resource is subject to multilateral management. At the other end, the management is in the form of democracy, where all resource managers perform their functions by negotiation and consensus among equals. (Archons [JEN80] is in this category.) This is the maximally decentralized case in that every resource manager executes multilaterally with every other in the handling of system resources, and performance is pursued on a system rather than on a particular host, user, or application basis.

Octopus [DON79] is designed as a tightly-coupled computer network where resource access is always granted as long as the requester has the "capability." Roscoe [BRY80, SOL79] aims at load balancing as its main resource management task. Processors in the network cooperate in balancing their loads. The method by which processors cooperate to perform process migration is through pairing with their nearest neighbors and transferring processes from the more heavily loaded processor to the more lightly loaded one [BRY80, JEN80, SOL79]. Since only two of the system's resource managers cooperate to manage their collective resources, the degree of resource management decentralization for Roscoe is considered moderately low according to the criteria of [JEN80].

2.2.3 System Utilities

2.2.3.1 Reliability and Robustness

In order to provide a viable product and keep users immune from component failures, MIKE should be reliable and robust. Improved system reliability can aid in the early detection and localization of any errant behavior and in the tolerance of software/hardware failures. The resulting damage will be limited to the component involved and will still permit useful system operation to continue. To the DDLCN users, no single processing or data component malfunction should be able to make the system inoperative or act as a major bottleneck, and a certain minimum level of service should always be maintained in such a situation.

2.2.3.2 Extensibility and Configurability

In the context of operating systems, extensibility means that application programs execute in the same environment as the operating system itself [KAH81, STA79]. The distinction between operating system and application system is largely a matter of privilege. This is the basis for easily creating, in a modular fashion, new resources or services from existing ones. It is also the basis for configuring MIKE to provide the needed functions without undue penalty from unneeded ones. Therefore, MIKE can be introduced into the DDLCN piecemeal, and its sophistication can be proportional to the complexity of the local computer's activities.

Distributed systems can be designed to have system transparency and local autonomy and to exhibit different degrees of decentralization in their resource management. Differences among these systems are due to differences among their intended applications, differences among the cost constraints under which trade-offs are made, and differences of opinion among researchers and designers. What really makes the difference, in our view, is the design methodology used in their network operating systems. The following section presents the principles of kernel design which were developed as guides to the system design of MIKE to provide the above-mentioned utilities.

2.2.3.3 Principles of Kernel Design

The design of a NOS must deal with the problems arising from its distributed nature, such as network-wide resource naming, sharing, protection, etc. A NOS is mainly concerned with the management of non-physical resources and their interaction in computer networks. The management task of non-physical network resources is very complex due to the absence of uniformity in distributed environments. To overcome the complexity incurred in the design process, we believe that those problems mentioned above (i.e., naming, protection, sharing, etc.) should not be tackled independently of one another. A comprehensive methodology for modular and robust system development should be chosen such that all these problems are solved, simultaneously and consistently, by the same approach. By following this methodology, a low-level base can then be developed to provide basic system functions in a highly structured and reliable way.

In recent years, modern software design methodology, which purports to reduce the complexity and implementation cost of large software systems as well as to enhance their utility and reliability, has been advocated. The methodology that can be applied to the field of operating systems design includes data abstraction, capability-based addressing, and domain-based protection. By using these advanced concepts consistently throughout the NOS structure, we can produce a viable operating system which naturally reflects its underlying network architecture and be assured that the access to and sharing of network resources will be controlled and protected in an uncircumventable way.

The following subsections will examine, in turn, the nature and utility of each of these modern operating system design principles.

2.2.3.3.1 Data Abstraction

The rationale for data abstraction is that programs are often too complex for human beings to comprehend, and therefore we need to ignore the details of a problem and deal instead with only its essence. Abstractions are based on models. We recognize that abstract models are not only needed to cope with complexity, but ultimately they can be used to verify or validate correctness and other system characteristics [JON78, LIS75].

The concept of data abstraction is used primarily in language design [KAP80, SHA80, SNY79]. It is motivated by the need for better programming methodology and is a successor to the notions of stepwise refinement and structured programming [DAH72]. But the concept is really language independent and can very naturally be implemented in operating systems [GLI79, JON78, KAH81, LIN76a, MAD81, RAT80, RAT81, ZEI81].

The Object Model. Objects are used to denote the abstract resources in operating systems, and they correspond to data types as they appear in some modern "data abstraction" languages, e.g., CLU [SCH78], ALPHARD [WUL78], etc. The object model is the basis for these abstract resources, i.e., objects. As noted in [JON78], the object model captures fundamental properties that pervade all aspects of modern operating systems: naming, binding, protection, sharing, access control, and error recovery. The key concepts of this model (illustrated by the small sketch following the list) include:

1. the encapsulation of data structures as objects,

2. a set of well-defined transformations on, and a uniform message/operation interface to, those objects, and

3. the decomposition of a design via a set of abstractions.
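These ideas can be made concrete with a small, purely illustrative sketch in C (the names below are hypothetical and are not taken from MIKE or from the cited languages): the representation of a stack is hidden behind an opaque handle, so clients can manipulate instances only through the operators exported by its type module.

    /* stack.h -- the external specification of the abstract type       */
    typedef struct Stack Stack;             /* representation is hidden */

    Stack *stack_create(int capacity);
    int    stack_push(Stack *s, int item);  /* 0 on success, -1 if full  */
    int    stack_pop(Stack *s, int *item);  /* 0 on success, -1 if empty */
    void   stack_destroy(Stack *s);

    /* stack.c -- the type module: the only code that sees the fields   */
    #include <stdlib.h>

    struct Stack { int *items; int top; int capacity; };

    Stack *stack_create(int capacity)
    {
        Stack *s = malloc(sizeof *s);
        if (s == NULL)
            return NULL;
        s->items = malloc(capacity * sizeof *s->items);
        if (s->items == NULL) { free(s); return NULL; }
        s->top = 0;
        s->capacity = capacity;
        return s;
    }

    int stack_push(Stack *s, int item)
    {
        if (s->top == s->capacity)
            return -1;                      /* FULL  */
        s->items[s->top++] = item;
        return 0;
    }

    int stack_pop(Stack *s, int *item)
    {
        if (s->top == 0)
            return -1;                      /* EMPTY */
        *item = s->items[--s->top];
        return 0;
    }

    void stack_destroy(Stack *s)
    {
        free(s->items);
        free(s);
    }

A caller holding only the opaque handle can neither depend on nor corrupt the representation; this is exactly the enclosure property discussed next.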

Two important properties of data abstraction are especially relevant in distributed systems. One is its enclosure and hiding of implementation and representation, and the other is its abstraction power. The former property contributes significantly to system transparency since it tends to inhibit dependence on the details and locations of objects. The latter property results from the use of the object model, since it gives us a natural match for our conceptualization of resources in distributed systems. This abstraction power contributes to better system robustness, extensibility/configurability, and functional modularity.

We will explore, in Section 4.2, the object model and its ramifications with respect to network operating systems and use it to abstract the resource entities in the DDLCN.

Levels of Abstraction. The object model can be used as a primary means to decompose and modularize software systems in a hierarchical way [HAB76, LIN76b, LIN81]. The gap between the virtual system the users are interested in and the bits in raw hardware is usually bridged by many abstract concepts. One concept is said to be at a higher level of abstraction than other concepts if the concept organizes instances of the lower level abstract concepts so that they can be manipulated effectively without having to understand the details of how the lower level concepts interact.

In distributed environments, this technique is used to reduce the design complexity of NOSs [BOC79, CHA80, TAN81]. The net result is a system composed of levels or layers of NOS kernels, each built upon its predecessor. The number of layers and their respective functions differ from one system to another. Nonetheless, the purpose of each layer is to offer certain primitive operations and services to the higher layers while shielding those layers from the details of how these network-oriented operations and services are actually implemented [TAN81]. We will explore this hierarchical design technique in Chapter 5, where we present our protocol structure.

2.2.3.3.2 Capability-Based Addressing

The essence of capability-based addressing [DEN76, DON80, FAB74, LIN76b, RAT80, WAT80] is that all objects in the computational universe defined by a NOS have their own unique identifiers for all time and space. All references to objects, whether local or remote, must be made via these unique identifiers. An ideal form for these object identifiers that meets these requirements is called a capability. A process has access to an object only if it possesses a capability for that object, and this capability is the object's system-wide name.

In distributed environments, capability-based addressing not only controls access to objects, but also enables a process to refer to an object regardless of whether the destination is local or remote. Therefore, it provides a uniform and general way of naming and addressing all objects in distributed systems. This contributes directly to system transparency, functional modularity, and system reliability/robustness.
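As a rough illustration only (the structures and helper routines below are hypothetical and are not part of MIKE's defined interface), a capability carries the object's system-wide unique identifier, and the holder invokes operations the same way whether the object is local or remote.

    #include <stdint.h>

    /* A system-wide unique identifier, e.g. (creating node, serial number),
       where the serial number is never reused.                             */
    typedef struct {
        uint32_t node_id;
        uint64_t serial;
    } UniqueId;

    /* A capability names exactly one object; it is the only handle through
       which the object may be referenced.  (Access rights are omitted here
       and appear in the next sketch.)                                      */
    typedef struct {
        UniqueId object_id;
    } Capability;

    /* Hypothetical helpers standing in for the local kernel and the IPC
       protocols; they are not MIKE primitives.                             */
    extern uint32_t local_node_id;
    extern void perform_locally(UniqueId id, int op);
    extern void send_remote_request(uint32_t node, UniqueId id, int op);

    /* The holder invokes an operation identically for local and remote
       objects; the system resolves the location.  (Locating an object at
       its creating node is a simplification for this sketch.)              */
    void invoke(const Capability *cap, int operation)
    {
        if (cap->object_id.node_id == local_node_id)
            perform_locally(cap->object_id, operation);
        else
            send_remote_request(cap->object_id.node_id,
                                cap->object_id, operation);
    }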

2.2.3.3.3 Domain-Based Protection

The term "computer security" embraces external threats as well as internal protection. Our work will only concern protection since, for designers of operating systems, protection is legitimately a major concern [PAR76], A protection mechanism serves to protect unauthorized access into the address space of a process and to prevent information from inadvertently escaping from the address space. Ideally, it should be possible to protect every small portion of software from errors originating anywhere else in the system. This is what is meant by fine-grained protection. 48

Protection structures in operating systems have evolved from the simple supervisor vs. user state (e.g., IBM System/360) and hierarchical rings (e.g., Multics) to sophisticated multiple domains (e.g., StarOS, CAP, and Hydra). The key concepts of domain-based protection include:

1. independent address spaces ("access domains") for processes, and

2. capability-based addressing and access control.

A capability is an encoding of a right; it is a pair that names a unique object and lists a set of access rights applicable to that object. Domain-based protection [BER80, COO79, DEN66, DEN76, FAB74, GRA72, LIN76b, POP74, RAT80, SAL75] uses these non-forgeable capabilities, which can be thought of as system-wide protected names for objects, to represent the access domain of a process.
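Viewed operationally, the access domain of a process is just the set of capabilities it currently holds, and the kernel grants an operation only if a matching capability with the required right is present. The following sketch is a simplification under assumed names; it is not drawn from any of the systems cited above.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef uint64_t ObjectName;            /* system-wide unique name    */

    enum { RIGHT_READ = 1, RIGHT_WRITE = 2, RIGHT_ENTER = 4 };

    typedef struct {
        ObjectName object;                  /* which object               */
        uint32_t   rights;                  /* bit set of allowed rights  */
    } Capability;

    /* An access domain: the capabilities a process currently holds.      */
    typedef struct {
        const Capability *caps;
        size_t            count;
    } Domain;

    /* The kernel allows an operation only if the domain holds a
       capability for the object that carries the required right.         */
    bool access_allowed(const Domain *d, ObjectName obj, uint32_t right)
    {
        for (size_t i = 0; i < d->count; i++)
            if (d->caps[i].object == obj && (d->caps[i].rights & right))
                return true;
        return false;                       /* error confined: no access  */
    }

Because a domain can be kept as small as the immediate task requires, a faulty or malicious component can reach only the few objects named in its own capability list.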

Protection is far more complex when the access controls must extend to a network of computers, since distributed systems are particularly vulnerable to hardware and software errors. When failures occur in system components or in intercomponent communications, the system must operate in a resilient manner. Such failures must not destroy the integrity of the system. Ideally, the system should continue functioning, although perhaps in a degraded mode.

The domain-based protection concept provides a powerful, dynamically changing protection environment, and contributes to early fault detection and failure recovery because it allows very small access domains for a process which has no more capabilities than required for its immediate task. Therefore, it achieves maximal error confinement by limiting the propagation of errors, both software and hardware, such that the extent of the damage is known and the quantity of repair work is limited.

Capabilities couple very naturally with data abstraction. Since data abstraction dictates that the integrity of an object is safeguarded by a set of operators that are the only means by which the object can be directly manipulated, capabilities can then be used to name the objects and to validate the access rights which control the use of the individual operators.

Data abstraction, capability-based addressing, and domain-based protection are modern operating system principles to be adopted in the design of our NOS for the DDLCN. While it is possible to analyze and apply these concepts as though they were independent, their full impact is seen when they are considered together. The interactions among these three concepts and their contributions to system characteristics are illustrated in Figure 1. In the figure, arrows between terms are to be read as "supports" or "facilitates."

2.2.3.4 Related Works

Many operating systems today are described in terms of modern design methodology. However, few of them use these advanced concepts consistently throughout their structures. The reasons for this reluctance to adopt advanced design concepts seem to be twofold. First, the classic view of a software system as a collection of procedures and data structures has been slow in giving way to the view of a collection of resource objects. Second, the classic view of hierarchical protection domains implemented with privilege states has been slow in giving way to processes with multiple, disjoint, and small protection domains.

[Figure 1. Interaction Between and Contribution of the Principles of Kernel Design. Capability-based addressing, domain-based protection, and data abstraction support one another and contribute to system transparency/local autonomy, extensibility/configurability, and robustness/reliability.]

CAP supports small protection domains. However, it utilizes a hierarchical structure of capability segments for processes, in contrast to the linear swapping table used by most other systems; the position of a process in the hierarchy determines the resources it may use. The CAP computer permits simultaneous switching of several capability segments upon a protection domain change, thus facilitating entry and exit from protected procedures.

There are three different classes of capability in CAP: software capabilities, segment capabilities, and enter capabilities. A software capability enables a single protected procedure to be used to perform a variety of related functions with separate protection for each of those functions. Hardware-interpreted segment capabilities name and control access to data or capability segments. Enter capabilities name and control access to protected procedures which are parallel to abstract data types.

The kernel, which is responsible for the protection mechanisms, is broken down into two levels: a basic kernel and a high-level kernel. The former provides the bare bones of the CAP kernel, and the latter implements more complicated functions in terms of the primitives of the basic kernel. Each level is designed independently of higher levels of the protection functions and is envisaged as a self-contained protection system in its own right. This design philosophy means that the protection system can maintain its integrity even if the system it supports should fail.

Hydra is partitioned into several pieces. One distinguished piece is called the kernel, whose functions are to provide a uniform protection mechanism, avoid arbitrary policy decisions, and support the existence of the remaining pieces. The remaining pieces, of which there may be an arbitrary number, are called subsystems. Each subsystem specifies the representation of a virtual resource, the nature of the implementation of operations on that type of resource, and all resource allocation (policy) decisions relative to that resource. Nearly all facilities are provided through subsystems at the user level, without special privilege or status. Adding a new facility consists simply of providing a user-level program that implements the facility. Therefore, Hydra is extensible, since it can grow and evolve in unanticipated directions.

The protection system of Hydra is object-oriented. Maintaining the integrity of objects is entirely the responsibility of the kernel. Hydra views an object as the abstraction of a typed storage cell, which is represented as a triple consisting of a unique name, a type, and a representation. Objects are typed. Hydra views a protected subsystem as essentially a "type definition" which defines extended types (called user-defined types). The object universe is partitioned by the type field into disjoint classes, where two objects of the same type represent different instances of the same abstraction.

The representation of each system-defined object or extended-type object has, in addition to a data part, a capability list. A capability is a pair: a unique name and allowed rights. The access rights field in capabilities allows control of user-defined type-specific operations. One of the object types recognized and maintained by Hydra is a procedure activation record called the local name space (LNS). An LNS object defines the instantaneous protection domain of a process through the capability list (C-list) it contains. The C-list inherits capabilities from two sources: the C-list of the called procedure object and the parameters passed to the LNS from the calling procedure. The C-list is associated with objects and therefore presents non-hierarchical protection domains within a single process.

At the interfaces of protected subsystems, a template is employed as a run-time specification for a capability to ensure the validity of parameters passed between protected subsystems. Procedures that do not need the access-rights amplification facilities use simple templates. Procedures implementing subsystems that need to expand the access rights of parameter capabilities passed to them use amplification templates. In Hydra, the mutual suspicion problem (4) [GRA72, WUL81] is solved by the complete isolation of the two procedures.

StarOS is an object-oriented system. Objects are typed and are distinct and unique. Objects contain either data or capabilities, but not both. Strong typing and capability-based authorization are consistently enforced by StarOS. The system supports user-defined "abstract types." Representation types are those types defined by StarOS and are the basis for building all abstract types. StarOS allows a tree structure of capability requests to be accessible to a running process. The interpretation of the access rights specified by a capability depends on the type of the object to which the capability refers.

(4) The mutual suspicion problem arises because the caller of a procedure needs a guarantee that the procedure will not be able to gain access to any of the caller's objects, except those explicitly passed as parameters. The called procedure likewise needs a guarantee that the caller cannot gain access to any objects private to that procedure, except when the procedure explicitly allows it, as stated in [WUL81].

In Octopus, an object model is chosen to structure its NOS framework, although in a rather coarse manner. Objects or resources are entities such as processes, files, directories, virtual I/O devices, databases, etc. The abstract representation of a resource and the operations on the representation are implemented by one or more modules called servers. Capabilities are used, in a more general sense than is common, to resolve network-wide resource naming and sharing. The benefits of small objects and small protection domains, from the standpoint of reliability and robustness, cannot therefore be attained.

Two types of capabilities are supported in Octopus: uncontrolled and controlled. Possession of an uncontrolled capability constitutes proof of right of access, with the access rights represented by that capability. Controlled capabilities will only be accepted if presented by a legitimate holder and can be protected by servers in three ways: with access lists, with encryption using the legitimate holder's address as part of the key, or with capability lists named by the origin address. Resources can be shared by simply passing the appropriate capability in a message.

An integrated approach to NOS and protocol design is taken in Octopus, such that the basic NOS structure does not require the NOS to spring full-blown into existence with all possible services. A multi-layer protocol structure is presented to support the NOS. The protocol hierarchy consists of three layers: the interprocess communication (IPC) layer, the service support layer, and the service layer. The IPC layer uses a timer-based assurance mechanism to manage the communication subnetwork. The service support layer defines operations and parameters common to most services in terms of models for resources and servers. The service layer supports basic resources and services, e.g., authentication, logging, files, directories, processes, clocks, accounting, terminals, etc. For Octopus, design work is still underway to fully specify the protocol structure and identify the basic NOS services.

2.2.4 Software-Directed Architecture

A considerable variety of hardware and firmware mechanisms is needed to effectively realize the proposed MIKE architecture. The design methodology we adopt in the design of MIKE implies heavy overhead if no hardware assistance is available. The problem arises because the basic procedure-call interfaces and protection-domain switching characteristic of all of these advanced concepts are used very frequently. Without hardware assistance, this overhead will have a significant impact on overall performance. Therefore, a software-directed architecture is needed to support object-oriented computation such that MIKE can run sufficiently fast while providing robust resource sharing for the network users.

2.2.4.1 Related Works

Data abstraction, capability-based addressing, and domain-based protection have attained a reputation, according to some computer professionals, for being unimplementable. The current generation of systems incorporating these concepts promises greater reliability/robustness and more flexible sharing. However, more hardware/firmware mechanisms are needed to upgrade conventional hardware in order to narrow the huge gap between the abstraction needed and the features available [DEN76, DEN79, JAG80]. Otherwise, these systems cannot be feasibly implemented.

An integrated hardware/software approach is used in the design of CAP, since it is a specially built computer providing capability-based protection. The CAP microprogram unit contains, in addition to a microprogram which implements the regular instruction set and input/output operations, the microprogram which performs operations on capabilities. Special circuits are provided within "the capability unit" for performing address bound checking, for checking access rights, and for adding the segment base to the offset of the required word. All accesses to main memory in CAP are made via the capability unit, and validation is always performed. The machine contains an associative memory for holding recently referenced capabilities, which allows rapid access to capabilities without the need for programmable capability registers.

Hydra is a software system which has been constructed for conventional hardware and therefore allows more flexibility. It is also inevitably slow and expensive in protection manipulation, since kernel intervention is always required.

Capabilities in StarOS are managed by Kmaps. Collectively, Kmaps mediate each processor reference placed on the bus and sustain the illusion of a single large memory. Kmaps also maintain a table of recently referenced capabilities to expedite accesses to frequently used segments. Although the manipulation of capabilities is supported by hardware/firmware mechanisms, the implementation of capability mechanisms in StarOS is not ideal. The reason is that the LSI-11 microprocessors do not themselves support capability-based addressing, and this fact has led to a certain degree of imperfection in the protection mechanism and addressing scheme. This is a deficiency since the addresses generated by direct-memory-access I/O devices cannot be mapped by a Kmap or Slocal. Functions controlling these devices must make use of the physical addresses of the objects used as buffers. However, the hardware implementation of capabilities has meant that capability operations are speedy enough to be used wherever they logically should be.

Recent trends in operating systems design (either network-oriented or not) have shown that the merits of incorporating advanced design methodology are becoming more appreciated. Further, software-directed architecture has gained wide attention, as can be seen from the recently announced Intel iAPX 432 micro-mainframe and its "silicon" operating system [KAH81, RAT80, RAT81, ZEI81].

To incorporate the functionalities and features mentioned above into the design of MIKE, we can see clearly that ad hoc arrangements will not be adequate; some systematic design methodology has to be used to effectively realize these goals. In search of better designed and more reliable operating systems, we adopt an integrated approach to the design of a NOS for the DDLCN. Advanced operating system design concepts have been applied consistently throughout the NOS framework to generate reliable and robust system software. Moreover, the NOS model and protocol structure have also been integrated to provide an extensible and configurable NOS for the DDLCN.

Finally, since the system-wide executive resides on the LIUs, we have several degrees of freedom in the design process. This makes the system more independent of application hardware and software and allows the incorporation of special hardware to support the modern design methodology. Further, this provides concurrent execution environments for application and executive and enables us to design a better, more efficient, and more reliable system at low cost.

2.3 Breakdown of Research

The first two chapters of this dissertation provided an introduction and background material for the further development of this research work. This chapter also presented several previous systems which overlap with our system design philosophy. The remainder is broken into four main topics, each detailed in one chapter: (1) the system design of the DDLCN (Chapter 3), (2) the model structure of MIKE (Chapter 4), (3) the protocol hierarchy supporting the NOS services (Chapter 5), and (4) the LIU architecture (Chapter 6). The summary and conclusions are presented in Chapter 7.

CHAPTER 3

THE DISTRIBUTED DOUBLE-LOOP COMPUTER NETWORK (DDLCN)

This chapter presents the system design of the Distributed Double-Loop Computer Network (DDLCN). The DDLCN is a local-area distributed system that interconnects midi, mini, and microcomputers using a fault-tolerant double-loop network. The network operating system, MIKE, presented in this dissertation is designed for use in distributed systems in general, and for use in the DDLCN in particular. Therefore, in addition to the fact that MIKE is designed for the DDLCN, we present the system design of the DDLCN here because it is a specific and truly distributed system that we can use as a vehicle to exemplify the roles and functions the NOS performs; hence, we can form a more concrete conceptualization of the nature and function of MIKE.


3.1 General System Overview

Conceived as a means of investigating fundamental problems in distributed processing and local networking, the Distributed Double-Loop Computer Network (DDLCN) [LIU78, LIU79, LIU80, LIU81, WOL78, WOL79a, WOL79b] is designed as a fault-tolerant distributed system that interconnects midi, mini, and micro computers using a double-loop structure in a local environment. It is the successor to our previous single-loop network, called DLCN (the Distributed Loop Computer Network) [LIU75, LIU77, REA76].

The DDLCN is designed in such a manner that its users will see only a single, integrated computing facility with greater power and many available resources, without being aware of the system's actual organization and method of operation. A seven-node prototype of the DDLCN, interconnecting six PDP-11/23 microcomputers and one DECsystem-20 computer system, is currently being implemented under a grant from the National Science Foundation to the Department of Computer and Information Science of The Ohio State University (see Figure 2).

[Figure 2. Prototype of DDLCN. The seven-node prototype interconnects six LSI-11/23 microcomputers (each with 128 KBytes of MOS RAM, a dual floppy disk, and a VT-100 terminal) and one DECsystem-2020 (512 K words of MOS RAM, 2 disk drives, magnetic tape drives, and I/O devices and terminals) through LIUs (Loop Interface Units), which are 16-bit bit-sliced microprogrammable microprocessors (AM2900 based).]

Research concerning the DDLCN has concentrated on three areas of its system design: the communication subnetwork (interface design and communication protocols), distributed programming systems, and distributed database systems. Several new features and innovative ideas have been integrated into the hardware, communication, software, and applications of the DDLCN, so that it can realize its potential of becoming a powerful and unified distributed system.

3.2 Reliable Communication Network

The communication subsystem of the DDLCN consists of a double-loop communication network that uses twisted-wire pairs (called communication links) to interconnect individual nodes through hardware devices called Loop Interface Units (LIUs). Two levels of protocol are supported by the LIU: the loop access protocol and the multi-destination interprocess communication (IPC) protocol. The former controls access to the communication links, whereas the latter provides the mechanism for IPC demanded by the network operating system. The routing algorithm used by the LIU is shortest-distance routing (i.e., a message is sent in the direction having the smaller number of nodes between the source and the destination).
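For a loop of N nodes numbered consecutively around the ring, the shortest-distance rule reduces to comparing the hop counts in the two directions. The routine below is a minimal sketch under that numbering assumption; it is not taken from the LIU microprogram.

    /* Returns +1 to send in the direction of increasing node numbers, or
       -1 for the opposite direction, whichever passes through fewer nodes.
       Assumes nodes 0..n-1 are placed consecutively around the loops.     */
    int choose_direction(int source, int destination, int n)
    {
        int forward = (destination - source + n) % n;   /* hops one way     */
        int reverse = (source - destination + n) % n;   /* hops the other   */
        return (forward <= reverse) ? +1 : -1;
    }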

3.2.1 Loop Interface Design

An intelligent Loop Interface Unit (LIU) [OH77, TSA79] is used to attach each host computer to the communication subnetwork. It is a special-purpose microprogrammable microcomputer (based on Advanced Micro Devices' AM2900 [ADV79]). The LIU handles the following communication control tasks for the attached host computer:

1. accept messages from the attached host,

2. make routing decisions,

3. transmit messages at the next available time,

4. check all incoming message traffic from both loops and remove a message if it is destined for this host, or relay it if it is not,

5. generate an acknowledgment for each message received and transmit it back to the originating host, and

6. continuously monitor both loops to detect link faults and, if a positive detection is made, determine and execute corrective procedures and broadcast status messages to the rest of the network.

Extension to the interface design has also been completed [TSA80a, TSA80b], making the LIU fault-tolerant not only with respect to link failures, but also with respect to the failure of LIU components. The enhanced interface is dynamically reconfigurable, adaptable, and fault-tolerant due to its incorporation of two novel architectural concepts: the Sliced Computer Module (SCM) and bit-sliced processing. It can adapt itself to variations in system workload by redistributing its SCMs and can provide highly reliable operation in hostile environments by isolating/replacing faulty SCMs. These architectural features are made possible by the microprogrammability and dynamic reconfigurability of the loop interface and are critical to the reliability of the distributed operating system that supports concurrent computing activities.

Details of its design and operation can be found in [TSA80b].

3.2.2 Loop Operation

The loop channel access protocol described by Reames and Liu [REA75] is incorporated into the design of the Distributed Loop Computer Network (DLCN) [LIU75, REA76]. Message transmission is accomplished through the use of a shift-register insertion technique performed by the loop interface, whereby the loop may carry multiple variable-length messages at one time. This type of protocol allows the simultaneous and direct transmission of variable-length messages onto the loop by more than one interface without the use of any centralized control.

There is very little fault tolerance in a single-loop design if a communication link should fail. Service is stopped altogether by just one link fault, since either a message or the acknowledgment of its receipt is blocked. For this reason a new double-loop network design has been adopted for the DDLCN [WOL78, WOL79a]. The use of two loops instead of one naturally provides some fault tolerance simply through redundancy. However, the degree of fault tolerance is minimal, since two link faults (one on each loop) may render the network non-operational. Thus some additional design modifications are required to give the network a greater degree of fault tolerance.

The solution incorporated into the DDLCN interface is the use of tri-state control logic connecting the input and output delay buffers together on each side of the interface. These tri-state controls allow dynamic reconfiguration of the physical links, should some of them fail. The addition of the tri-state logic to the interface takes the form of a hardware component controlled by the microprogram within the interface. With this addition, the network is capable of dynamically restructuring its message transmission directions as link faults occur, so that maximum loop utilization may be achieved under all circumstances.

Simulation results [WOL79b, WOL79c] on a six-node DDLCN have shown that the DDLCN provides a much lower message delay time under both fault-free and various fault-present conditions, as compared with DLCN (which has the best simulated performance among several proposed single-loop networks).

3.2.3 Multi-Destination Protocols

Three classes of multi-destination IPC protocols, each providing a different degree of reliability, have been developed for the DDLCN to facilitate efficient and reliable exchange of messages [PAR78, PAR79a, PAR79b]. These three layers of protocols are: the unreliable multi-destination protocol, the reliable best-effort-to-deliver protocol, and the reliable guarantee-to-deliver protocol. Since each layer of protocol provides a different degree of reliability, the distributed software can choose the one best suited to its purpose in a cost-effective way. Furthermore, the multi-destination IPC protocols provided in the DDLCN enable the system to avoid sending separate single-addressed messages, thereby reducing network traffic.
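Conceptually, this layering lets a distributed program name, per message, the reliability it is willing to pay for. The interface below is purely illustrative; neither the enumeration nor the routine belongs to the DDLCN protocol specifications.

    #include <stddef.h>

    /* Illustrative only: the three reliability classes described above.  */
    enum ipc_class {
        IPC_UNRELIABLE,          /* no delivery assurance                  */
        IPC_BEST_EFFORT,         /* reliable best-effort-to-deliver        */
        IPC_GUARANTEED           /* reliable guarantee-to-deliver          */
    };

    /* Hypothetical multi-destination send: one call addresses the whole
       destination set, avoiding separate single-addressed messages.       */
    int md_send(enum ipc_class cls,
                const int *destinations, size_t ndest,
                const void *message, size_t length)
    {
        switch (cls) {
        case IPC_UNRELIABLE:   /* hand the message to the basic protocol   */
            break;
        case IPC_BEST_EFFORT:  /* add acknowledgment and retransmission    */
            break;
        case IPC_GUARANTEED:   /* add end-to-end delivery guarantees       */
            break;
        }
        (void)destinations; (void)ndest; (void)message; (void)length;
        return 0;              /* stub: a real layer would report status   */
    }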

3.3 Distributed Programming Systems

A distributed programming system is under development to minimize the high communication costs and to cope with the absence of shared memory as a synchronization tool in distributed environments [PAR79a]. This language system can be used as a tool for the implementation of assorted distributed algorithms, e.g., distributed synchronization models, distributed operating systems, and distributed database systems.

3.4 Distributed Database System

One of the major services provided by the DDLCN is the Distributed Loop Database System (DLDBS) [CHO79]. DLDBS is concerned with the design of the Distributed Loop Data Base Management System (DLDBMS), the management of the database directory (schema and location directory), the database distribution, and the distributed database architecture incorporating such a database distribution.

The design of DLDBMS involves designing three distributed algorithms (DAs): distributed concurrency control, distributed query processing, and crash recovery. Two new concurrency control mechanisms, one for fully duplicated DLDBS [CHO80a] and one for partially duplicated DLDBS [CHO80b], have been developed. The mechanisms use distributed control and are deadlock-free, simple to implement, and robust with respect to failures of communication links and hosts. They do not use global locking, do not reject transactions, and exploit potential concurrency among transactions. Arguments for the correctness of the algorithms are given in [CHO80a]. A crash recovery mechanism is available to ensure the continuing and correct operation of DLDBS under abnormal conditions (e.g., node crashes or communication link failures). Data definition and manipulation languages for DLDBS, as well as distributed query processing, have also been developed [CHO81].

3.5 Other Research on DDLCN

Other research efforts include the design and implementation of a DDLCN measurement and control center, further work on multi-destination protocols, and the design and development of real-time programming language constructs for the DDLCN. It is hoped that, with the proper design of hardware, software, communications, and applications, the DDLCN can meet its expectations and become a testbed for and forerunner of distributed systems.

The network operating system proposed in this dissertation is intended to fill the gap between the low-level communication subnetwork and high-level application software such as DLDBS. The layered structure of MIKE provides flexible operating system services in which resource sharing and distributed computing can evolve in a modular fashion to meet the needs and the sophistication of the host computer.

CHAPTER 4

NETWORK OPERATING SYSTEM MODEL

4.1 The Hierarchical Framework of MIKE

MIKE is the network operating system (NOS) which is used to integrate a collection of autonomous and heterogeneous computer systems in the DDLCN. It presents to the user a single, powerful computing facility with a high degree of cohesiveness, transparency, and autonomy.

Logically, MIKE is organized as a meta-system and runs on top of existing local operating systems. Therefore, it must deal with the problems due to its distributed nature and the heterogeneous systems on which it is based: naming, protection, sharing, heterogeneity, error recovery, and other performance problems arising from the distributed environment of the network. These functional capabilities are needed to coordinate the operation of autonomous computers and to provide system transparency to the users.

Creating an extensible, coherent set of services in a distributed and heterogeneous environment is a tremendous undertaking. An integrated approach is taken here which uses sound principles to structure the framework and model of MIKE and creates a software-directed hardware environment upon which MIKE will be executed. It is felt that only then can the utility and reliability of MIKE be greatly enhanced, and its complexity and cost of implementation and maintenance be drastically reduced.

Designing a network operating system which will be the only operating system running in a local computer is a much easier job than designing a network operating system such as MIKE, which runs on top of existing local operating systems. In the former case, no restriction (due to existing operating systems) is imposed on the design task, and the "pure" form of modern design methodology can be applied consistently even from the primitive level. However, due to the need to exploit existing software investment wherever possible, MIKE is designed as a meta-system. The approach taken in designing MIKE is to minimize the modification of existing operating systems and software systems while allowing them to be embraced by the model proposed here, which incorporates modern design methodology, as consistently as possible. This leads to a certain degree of imperfection at the conceptual level of the design. Unfortunately, this fault cannot be eliminated due to the specific environment of the DDLCN.

The framework of MIKE is hierarchically structured. The most distinctive characteristic of the MIKE framework is its explicit and enforced modular structure. This enables us to describe the system at three different levels of abstraction. Relevant processes are then segregated in each level to shield the implementation details from other layers.

The MIKE hierarchy, described from the bottom up, consists of three layers:

1. the inter-process communication (IPC) layer,

2. the system support layer, and

3. the virtual machine layer.

The IPC layer uses three types of multi-destination protocols to provide internode message communication for higher layers of protocol. Each of these multi-destination protocol types provides a different level of reliability. The most reliable protocol is implemented with extensions (abstractions) built from primitive IPC protocols. The system programmers can then choose among these protocols to best suit their programming applications.

The system support layer facilitates the reduction of internode communication overhead and abstracts common mechanisms to provide a complete set of primitive services for the upper layer. The virtual machine layer provides a virtual machine in which the distributed and heterogeneous nature of the systems on which MIKE is based is masked out, and user processes interact without awareness of the network architecture.

To give a proper perspective on the MIKE framework, Figure 3 shows a three-node DDLCN based on the "task" notion which we will discuss in Section 4.2.3. Each node in the DDLCN consists of a number of tasks. A task is a logical grouping of processes and objects and forms an autonomous and protected subsystem. Each task has its own local resource management policy, and therefore can accept or refuse resource sharing requests from other tasks as it sees fit. Furthermore, each task guards its internal integrity by controlling access to its subordinate processes and objects.

[Figure 3. An Overview of a Three-Node DDLCN: one DECsystem-20 node and two PDP-11 nodes, each composed of tasks; the figure's legend distinguishes tasks, processes, and data objects.]

This chapter is organized in the following way. Section 2 is devoted to the logical structuring and grouping of system resource entities. Sections 3 through 5 present the interaction protocol among these resource entities. Finally, Section 6 describes how system-transparent resource sharing can be achieved using the notions presented.

4.2 The Object Model

4.2.1 Introduction

The complexity in the design of a robust operating system, when conjoined with multiple computers networked together, is staggering. The object model is used, both as a concept and as a tool, to characterize the components of a software system and to express clearly the relations among these components [FLY79, GLI79, JON78, LIN76a, LIN76b, LIS75, LUN79, MAD81, SHA80, SNY79]. It also provides guidelines to decompose a design via a set of abstractions such that the complexity is more manageable.

Objects are abstract system resources, either logical or physical. In order to preserve its integrity, an object can only be accessed, be modified, or stand in relation to other objects in the way appropriate to that object [JON78]. The object model dictates that the behavior of an object can be observed only through applications of a specific set of operators. That is, to alter or even to determine the state of the object, an appropriate operator must be invoked.

Because many objects essentially have the same invariant and representational properties, it is convenient to define a single set of operators, perhaps parameterized, that are equally applicable to many objects of an equivalence class. Two objects are said to be of the same abstract type (i.e. equivalence class) if they share the same set of operators. Scope rules are defined so that only these well-defined operators are allowed to manipulate the representation of an object of that abstract type.

The external specifications of an abstract type are therefore separated from its internal representation. All knowledge about these representational and operational details is contained and hidden within the data abstraction boundary. Thus an abstract type can be used without knowledge of its implementation and implemented without knowledge of its use [LIS75, LIS77].

To express a new data abstraction, new abstract types (called extended types) are defined in terms of existing abstract types [JON78, LIN76b, WUL74, WUL81]. These extended types are those types that are not directly implemented by the system. Primitive types are provided by the machine at the architectural interface. Extended-type objects are represented in terms of other component objects. The protection mechanisms should control access to objects of an extended type in terms of the operators defined specifically for that extended type.

As pointed out by [JON78], the object model is merely a structuring tool; it does not imply a particular design technique and is flexible in the sense that it is amenable to whatever design technique is adopted. At each step in the design process, the model enables users implementing an abstract type to ignore unnecessary detail. The protection mechanisms can be extended to handle these newly defined types. The users focus only on the representational and operational details of the newly defined abstract type and on the specifications of the more primitive abstract types they are using to create the new abstract type.

4.2.2 The Object Model of MIKE


MIKE adopts the object model to structure its resource management since it is a familiar and sound semantic model of computation. This model also enables us to study more effectively how to achieve cooperative autonomy and system transparency.

As stated in the previous section, it would be a relatively easy task to apply data abstraction to MIKE if MIKE were designed as a base operating system for the DDLCN. However, due to our unique environment, MIKE is a NOS kernel layered on top of existing operating systems. Therefore, the object model of MIKE has to take these local operating systems into consideration. In order to apply the data abstraction concept consistently and uniformly to non-physical resource management, a unique "task" concept has been developed here so that a task can be treated as a unit of autonomous abstract resource. The importance of this task concept is that it enables the object model to embrace both MIKE and the local operating system, regardless of the latter's internal organization. We will describe in the following sections the task concept and its utility for system transparency and cooperative autonomy.

4.2.2.1 Type Definitions

By using the object model, MIKE consists of a set of entities (1), each of which can be thought of as a kind of resource. Some resources have a direct physical realization, such as I/O devices. Others are non-physical resources such as processes, semaphores, mailboxes, files, etc. An entity is a triple: a name, a type, and a representation. A name is a unique identifier for a particular entity that differs from those of all other entities. An entity is addressed through an access descriptor called a "capability," which we will discuss later. The capability indicates, among other things, the starting address of its associated entity. A displacement from the starting address of that entity has to be furnished when accessing an item in that entity. The representation of an entity includes either a data segment, or a capability segment, or both. The data segment contains, of course, data. The capability segment holds a list of capabilities. Thus an entity may reference other entities. This kind of entity representation permits non-hierarchical protection domains within a single process and enables the mutual suspicion problem (see footnote on page 55) to be solved.

(1) Up to this point, we have used the phrase "object model" because it is a term in the lexicon of the computer professional. In order to avoid any future confusion, we will use "entity" to substitute for what we meant before by the word "object" in the following discussion. "Objects" will then be used to indicate passive entities, whereas active entities are called "processes." However, because it is not expected to cause any misinterpretation, we will continue to use the phrase "object model."
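Rendered very roughly in C (all field and type names here are hypothetical, not MIKE definitions), an entity is the triple just described, with an optional data segment and an optional capability segment making up its representation.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t UniqueName;        /* system-wide unique identifier    */

    typedef struct Capability {
        UniqueName entity;              /* which entity this refers to      */
        uint32_t   rights;              /* allowed operations               */
    } Capability;

    /* An entity is a triple: name, type, representation.  The
       representation may hold a data segment, a capability segment, or
       both; capabilities are what let one entity refer to others.          */
    typedef struct Entity {
        UniqueName  name;
        UniqueName  type;               /* name of the type entity          */

        uint8_t    *data;               /* data segment (may be NULL)       */
        size_t      data_len;

        Capability *caps;               /* capability segment (may be NULL) */
        size_t      cap_count;
    } Entity;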

Entities are typed, and each type of entity is specified by:

1. the description of a data structure which implements entities of the given type by constructing them from more basic entities, and

2. a set of access operators, called a "type module" [JON78], which is the only means to manipulate the internal structure.

The type definition gives information about, among other things, the possible operators on entities of that type and their realization. Different operations are possible on different types. These type definitions are represented inside the computer system by type managers [GLI79, LIN76b]. The type manager forms a protected subsystem for entities of its type and safeguards the entities' integrity by ensuring that the manipulation of these entities is allowed only through the well-defined operators.

The type attribute (e.g., INTEGER) (2) of an entity (e.g., 382) is in fact the name of another entity. Of course, this particular entity must itself have a type attribute. We have given this type attribute the special unique name ROOT. Therefore, initially the system requires a single distinguished entity whose name is "TYPE" and whose abstract type is "ROOT." Figure 4 illustrates the hierarchy of the abstract type definition of MIKE. Besides the ROOT, we have two MIKE-defined types (PROCEDURE and FILE) and one user-defined type (STACK). Also shown in the figure are three instances of type PROCEDURE (RUNOFF, PROG1, and PL/1).

New types are definable, and the protection system can be extended to handle these newly defined types. To create a new type, users invoke a MIKE-defined operator, CREATE, on the object named TYPE of type ROOT and specify the new type. Figure 4 illustrates a user-defined type called "STACK." Entities of a type can be created once the type has been defined, either by the system or by the user. These entities are distinct from the type itself.

(2) We will capitalize the type attribute of an entity for emphasis.
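Since the dissertation does not spell out a calling syntax at this level, the following self-contained C fragment only illustrates the calling pattern with stub definitions (every name in it is hypothetical): CREATE is invoked on the distinguished TYPE entity to define a new type, and then on the new type entity to create an instance of it.

    #include <stdio.h>

    /* All names below are hypothetical illustrations of the calling
       pattern, not MIKE's actual interface.                              */
    typedef struct { unsigned long id; } Capability;

    enum { OP_CREATE = 1 };

    /* Stub for the system's operator-invocation primitive.               */
    static Capability invoke(Capability target, int op, const char *arg)
    {
        static unsigned long next_id = 1;
        (void)target; (void)op; (void)arg;
        return (Capability){ .id = next_id++ };
    }

    int main(void)
    {
        /* The distinguished entity named TYPE, whose type is ROOT.       */
        Capability type_entity = { .id = 0 };

        /* Define a new abstract type by invoking CREATE on TYPE ...      */
        Capability stack_type = invoke(type_entity, OP_CREATE, "STACK");

        /* ... and then create an instance of the newly defined type.     */
        Capability my_stack = invoke(stack_type, OP_CREATE, NULL);

        printf("type %lu, instance %lu\n", stack_type.id, my_stack.id);
        return 0;
    }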

[Figure 4. Type Hierarchy of MIKE. At the root is the entity named TYPE, of type ROOT. Below it are the type entities PROCEDURE, FILE, and STACK, each of type TYPE. Below PROCEDURE are three instances of that type: RUNOFF, PROG1, and PL/1.]

4.2.2.2 Processes

Typed entities can be further divided into two distinct kinds: active (called processes) and passive (called objects). A process, which is a schedulable unit for asynchronous computation, is an active entity which moves through the instructions of a procedure as the procedure is executed by a physical processor [BRI73, DEN76]. Technically, a process contains execution information (e.g., a priority for process scheduling) and a stack of objects with the MIKE-defined type called PROTECTION/ACCESS DOMAIN, or PAD for short [WUL81].

In order to know more about the PAD, we need to present another MIKE-defined type called PROCEDURE. A PROCEDURE object contains both data and capability segments. Its data segment usually holds what we call a program or subroutine. A PROCEDURE object is reentrant and potentially recursive and can be called by a process; that is, the process invokes the operator "CALL" (which is also MIKE-defined) on a PROCEDURE object. MIKE responds to the invocation by creating a PAD object and stacking it on top of the stack in that process. The initial access domain after the process has called the procedure is defined by the capability list in the newly created PAD object. This capability list is loaded from the capability segment of the procedure just called and the parameters passed from the calling procedure. Therefore, the top PAD object of the stack defines the instantaneous protection and access domain of its process. The fact that a process can only reference objects through its PAD objects is the fundamental point of domain-based protection and is of paramount importance to the reliability and robustness of MIKE.
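A minimal sketch of that mechanism follows, with all structure and routine names hypothetical: calling a PROCEDURE pushes a new PAD whose capability list is formed from the procedure's capability segment plus the parameters passed in, and the PAD on top of the stack is the caller's current domain.

    #include <stdlib.h>
    #include <string.h>

    typedef struct { unsigned long object; unsigned rights; } Capability;

    /* A PROTECTION/ACCESS DOMAIN (PAD): the capability list in force for
       one procedure activation.                                           */
    typedef struct Pad {
        Capability *clist;
        size_t      count;
        struct Pad *below;               /* the caller's PAD               */
    } Pad;

    typedef struct {
        const Capability *cap_segment;   /* the procedure's capability segment */
        size_t            cap_count;
        /* the data segment would hold the program text itself */
    } Procedure;

    typedef struct {
        Pad *domain_stack;               /* top PAD = current access domain */
    } Process;

    /* Sketch of CALL: build a new PAD from the procedure's own capability
       segment plus the capabilities passed as parameters, and push it.    */
    int call_procedure(Process *p, const Procedure *proc,
                       const Capability *params, size_t nparams)
    {
        Pad *pad = malloc(sizeof *pad);
        if (pad == NULL)
            return -1;
        pad->count = proc->cap_count + nparams;
        pad->clist = malloc(pad->count * sizeof *pad->clist);
        if (pad->count && pad->clist == NULL) { free(pad); return -1; }
        if (proc->cap_count)
            memcpy(pad->clist, proc->cap_segment,
                   proc->cap_count * sizeof *pad->clist);
        if (nparams)
            memcpy(pad->clist + proc->cap_count, params,
                   nparams * sizeof *pad->clist);
        pad->below = p->domain_stack;
        p->domain_stack = pad;           /* the new PAD now defines the domain */
        return 0;
    }

    /* On return, the PAD is popped, restoring the caller's domain.        */
    void return_from_procedure(Process *p)
    {
        Pad *top = p->domain_stack;
        p->domain_stack = top->below;
        free(top->clist);
        free(top);
    }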

Processes can be classified into two categories: transient and cyclic. A process is said to be transient if it terminates after its function execution is completed. A special kind of process called a guardian is said to be cyclic because it is pre-initialized to execute a particular function and, once initialized, will exist "forever." (See Section 4.2.3.1 for an explanation of the guardian concept and the cyclic process.)

Non-process entities, which we call objects, are passive, i.e., they do not originate any activity. Table 1 lists some entity types in the MIKE structure and their associated operators.

Table 1. Examples of Types and Their Type-Specific Operators

Type      Operators
PROCESS   CREATE, DESTROY, FORK, JOIN, START (or SCHEDULE), STOP (or UNSCHEDULE)
STACK     CREATE, DESTROY, PUSH, POP, EMPTY, FULL
FILE      CREATE, DELETE, OPEN, CLOSE, READ, WRITE, APPEND
MESSAGE   CREATE, DESTROY, SEND-REQUEST, SEND-REPLY, RECEIVE-CONDITIONAL, RECEIVE-ANY, WAIT-ANY, ABORT, STATUS

4.2.3 Tasks

By using the object model to structure MIKE, the entity universe is further divided into mutually exclusive sets. Each of these sets is called a task. A task consists of one or more processes and possibly some objects. The entities within a task form the address domain of that task. All the constituent entities, both processes and objects, must reside in the same physical node. A given task can assume the role of either a resource provider (called a server) or a resource user (called a customer), or both at the same time. A process can refer directly to the objects in the address domain of its task, but can only operate indirectly on "non-private" objects (i.e., objects in other tasks) by sending messages to the appropriate "guardian" of those objects.

4.2.3.1 Guardians

One of the processes in a task is called the guardian (3). The guardian, which is an essential and indispensable element of a task, safeguards the integrity of objects in its task's address domain. In MIKE, a guardian is a special kind of process. Each distinct task is associated with one and only one guardian. A guardian contains, among other things, the following specifications for its task:

1. the state descriptors of objects in the address space,

2. the synchronization constraints, and

3. the scheduling policy, including an (optional) local resource management policy.

(3) Although a guardian is a kind of process, we will reserve the word "process" exclusively for transient processes. However, we will use the phrase "process interaction" to indicate the interaction among processes of any type.

At task creation time, the guardian is the only entity in its task. While MIKE is operational, the guardian will spawn processes either to perform some functions or as the result of service requests initiated by processes from other tasks. Guardians are cyclic processes; that is, once created, they exist "forever." The word "cyclic" means that the creation and destruction of guardians is more static than that of transient processes (or just "processes"). Guardians, after satisfying requests from other tasks, block themselves while awaiting the arrival of further service requests. Therefore, the loci of control or processing activities of a guardian are cyclic and give us the illusion that guardians exist "forever." Actually, a guardian ceases to exist at the request of its owner (either system or user).

Requests to operate on objects in the address domain of another task are intercepted by its guardian. That is, a guardian safeguards the integrity of guarded objects by controlling all access to them. To be more specific, a guardian, based on the state information of its objects, initiates requested operations on behalf of other tasks. If two or more requested operations are enabled at the same time, the resulting ambiguity is resolved by the synchronization constraints and the scheduling policy. Requests that cannot be processed immediately are put onto one of its internal queues. These dormant requests are reactivated later, according to the scheduling strategies, when the state permits. Access operations can be delayed at some point during the course of their execution, due to the synchronization constraints, until an appropriate state of the object is reached through the effect of operations performed by other users. Guardians exploit possible parallelism by allowing simultaneous access to the same object if the synchronization constraints are not violated (4).

One of the important functions guardians perform is to enforce the local resource management policy. The guardian of each task can, according to its internal (optional) management policy, selectively discriminate against certain users (e.g., users from different physical nodes) by rejecting their requests for service. The guardian construct allows us to model different degrees of autonomy according to the local management policy, which can be changed dynamically. This also allows cooperative autonomy to be realized in MIKE in a more organized manner and at a finer granularity. We will explore cooperative autonomy further later in this chapter.

(4) Note that Brinch Hansen's monitor concept is a restricted version of our guardian construct. A monitor allows procedures (similar to the operators of a particular entity type here) to be executed strictly one at a time. In contrast, guardians will initiate several processes simultaneously if they operate on different objects. Further, if the synchronization constraints permit, several processes can access a particular object at the same time (e.g., two processes READ the same object simultaneously).

4.2.3.2 Task Classification

In MIKE, tasks are classified into three categories.

Tasks in each category are formed by the same kind of binding force. The first kind of binding force is that the component entities of a task collectively realize one of the data types in the system; that is, the type manager together with the objects of that type form a task. Since operations on objects of a particular type are possible only through invocations of its type manager, the processes initiated by the type manager to perform the well-defined operations on the objects also belong to the same task. Tasks in this category are referred to as type tasks.

The second kind of binding force is that component entities of a task work together harmoniously to provide a certain distributed service as one of the MIKE utilities.

We will call such a task a service task. Examples of these distributed service tasks are the Consistency Enforcer of the Distributed Loop Data Base System (DLDBS) [CHO81], the Virtual User service task, the Virtual Resource service task, and other service tasks in the system support and IPC layers.

The last and also a very important binding force is that the local operating system and its user processes form a task. This is a unique and "super" task, and we refer to it as the operating system task, or OS task for short. Each node has one and only one OS task but has numerous type and service tasks.

The merits of this logical grouping, i.e., the task concept, are twofold. First, it provides us with a higher level structuring tool to analyze and control the components' interaction in a hierarchical way. As outlined above, MIKE consists of three layers: the IPC layer, the system support layer, and the virtual machine layer. Within each layer, relevant tasks are segregated to provide the designated function of that layer. The OS task can only reside in the virtual machine layer. Various type tasks and service tasks can exist in any of the three layers as needed. Second, each task forms a protected subsystem and acts as an (optionally) autonomous unit.

Therefore, each individual task has the capability to manage its resources independently of other tasks, and the management policy can be changed dynamically to reflect its interest. The integrity of this subsystem is the responsibility of its guardian. Figure 5 shows a profile of MIKE in terms of the task concept.

Figure 5. MIKE Profile Based on Task Concept
(Tasks and their guardians, for example the Monitor on the DECsystem-20 and RT-11 on the PDP-11/23, shown in the virtual machine, system support, and IPC layers, with physical node boundaries interconnected by the communication subnetwork.)

4.2.3.3 Type Tasks

In MIKE, the type is encapsulated. A type task is used to realize one of the entity types defined in the system (e.g., STACK, FILE). The guardian of a type task is its type manager. A type manager, as described above, is used to enforce the integrity of the objects of its associated type. Figure 6 illustrates three snapshots of the address domain of a type task called FILE, which implements type FILE, during its lifetime.

As illustrated in Figure 6a, when a type task is just created (5), it only contains the guardian and nothing else.

Figure 6b illustrates the situation where, at that moment, no request has been received by the guardian to manipulate the objects (F1 and F2) it guards. Therefore, type task FILE contains, in addition to the guardian, objects of type FILE created earlier at the request of other tasks.

(5) A type task is created when that type is initially defined. A type is defined either by the system at the system generation phase, if it is a system-defined type, or by the user at the time the user takes the initiative, if it is a user-defined type (i.e., an extended type).

Figure 6. A FILE Task
(Three snapshots of the address domain of type task FILE: (a) at creation time, the guardian only; (b) the guardian and two FILE objects, F1 and F2; (c) the guardian, two processes (READ and WRITE), and the two FILE objects.)

In Figure 6c, there are a guardian, two processes, and two objects in the address space of type task FILE, and the objects are all of type FILE. The data objects are created by the guardian on behalf of other tasks. The processes, which have resulted from the requests of other tasks to operate on the FILE objects, are spawned by the guardian. These processes are actually several instances of the type FILE operators. They are activated to perform operations (which can be the same or different operations) either on different objects or on the same object. Remember that as long as the synchronization constraints and the integrity of the guarded objects are not violated, the guardian will exploit parallelism to improve response time and performance.

4.2.3.4 Service Tasks

Service tasks are used to provide network-oriented services for users. The guardian of a service task is responsible for initiating appropriate actions according to the service requests it receives. An example of this category is the Virtual Resource service task. Its function is to provide a local "virtual" resource which is requested by a user in the same physical node but is not locally available. There is another service task called the Virtual User service task, which is a direct counterpart of the Virtual Resource service task; this task generates a "virtual" user to use a particular resource which is available locally but needed by users at remote nodes. The virtual user uses the resource on behalf of its remote "real" user counterpart. Unlike type tasks, the guardians of these two tasks do not create any data objects in their address domains. They only spawn processes which provide the illusion that either all the distributed network resources are available locally or all the resource accesses are made by local users. The contribution of these two service tasks to system-transparent resource access will be explained later.

4.2.3.5 Operating System Tasks

Each node in the DDLCN has one and only one operating system task (or OS task for short). The OS task resides physically in the local computer system. The OS task contains, in addition to user processes, the local operating system. To end users, this OS task is their universe, and they communicate only with the guardian (i.e., the local operating system) of their OS task. In order to form a solid conceptualization of the nature and the function of the OS task, we will use the TOPS-20 Monitor to exemplify this existing local operating system in the rest of this dissertation. One of the reasons to choose the TOPS-20 Monitor is that it is the operating system for the DECsystem-20 (6), which is one of the nodes in the DDLCN.

The guardian of an OS task is the existing local operating system. In our case, Monitor is the guardian of this OS task, as it safeguards the integrity of the resources it controls. In MIKE, each OS task may have its own "personality" which depends on the original design philosophy of its guardian (i.e., Monitor in our example). Since we stated at the outset that each task is an autonomous and protected subsystem, it can run its own resource management policy independently of other tasks but in a cooperative manner. Furthermore, the OS task is treated by the rest of the system as just another task which acts as a resource provider/user. We will examine closely in Section 4.6 how the MIKE characteristics can be achieved by using this task concept.

(6) The DECsystem-20 is a medium-scale computer system which has both time-sharing and batch computing capabilities and runs the TOPS-20 Monitor [DIG78]. The TOPS-20 Monitor (or Monitor for short) provides each user with a 256K-word virtual memory environment. In addition to many locally developed software packages, a number of compilers, interpreters, cross-assemblers, and utility programs are available under Monitor.

4.3 Naming

Creating a uniform layered approach to resource naming is central to NOS design since it is intimately interrelated with the protection issue and has a great impact on MIKE's system utilities. Both the naming and protection issues are resolved based on the capability concept. We will concentrate on the naming issue in this section; the protection issue will be treated in Section 4.5.

The naming mechanism in MIKE is intended to do the following:

1. To support system transparency by permitting services to be requested only by name. The server's physical (or machine) address does not have to be identified.

2. To support two forms of names, one convenient for people (mnemonic) and one convenient for machines; there should be a clean separation of mechanisms between the two forms.

3. To support distributed name generation.

4. To allow multiple resources (either entities or tasks) to have the same mnemonic name so as to support resource relocation, logically equivalent generic services, multiple copies of resources for reliability and efficiency, and other needs.

5. To support resource sharing and the building of new resources out of old ones without name conflicts.

Capabilities can be used as a form of names for resource entities. Capability-based addressing means the use of a capability to address and control access to an entity. It is adopted in MIKE to resolve naming and to provide regulated access to entities.

A capability is a system-created token which contains a unique system-wide name and is used as an identifier for a particular resource entity. The possession of the token confers access rights to the named entity. The format of a capability consists of 1) a system-wide name unique to the local name space, 2) access rights, and 3) other information needed to distinguish a particular item within the identified resource (e.g., the displacement from the base address of the named entity). In general, a process (either guardian or transient process), at any given time, has a list of name tickets, referred to here as a capability-list (or C-list for short). Every reference to an entity must be made through this C-list.
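As a concrete illustration (not the actual MIKE data structures), a capability with the three parts just listed might be declared as in the following C sketch; the field widths, the access-right bits, and the C-list size are hypothetical.

    #include <stdint.h>
    #include <stdio.h>

    #define RIGHT_READ   0x1u            /* illustrative access-right bits */
    #define RIGHT_WRITE  0x2u
    #define RIGHT_DELETE 0x4u

    /* The three parts of a capability named above. */
    struct capability {
        uint32_t name;          /* system-wide name, unique in the local name space */
        uint32_t rights;        /* access rights conferred by possession             */
        uint32_t displacement;  /* locates an item within the named resource         */
    };

    /* A C-list: the name tickets held by one process at a given time. */
    struct c_list {
        struct capability cap[16];
        int count;
    };

    int main(void)
    {
        struct c_list cl = { { { 0x0155u, RIGHT_READ | RIGHT_WRITE, 0u } }, 1 };
        printf("entity %04X, rights %X\n",
               (unsigned)cl.cap[0].name, (unsigned)cl.cap[0].rights);
        return 0;
    }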

4.3.1 Naming System

The naming system of MIKE is designed to facilitate system transparency and local autonomy. In the DDLCN, MIKE dictates that only the guardians' names are known network-wide and that all message exchanges are initiated by and terminated on guardians. The processes' names, except those of guardians, are known only to the guardians of their respective tasks. This naming scheme is strictly enforced by the underlying hardware, as will be explained later.

In MIKE, "task" is a logical concept for entity grouping; therefore, it does not exist physically. However, we often refer to a specific task by its name; the name of a particular task is actually the name of its associated guardian. To the system, guardians, which are cyclic processes, do physically exist and are known by their names. Every process in MIKE, whether cyclic or transient, has a unique system-wide name (with respect to the local name space). However, only a subset of these names (i.e., those of guardians) are accessible network-wide. As mentioned before, the goal of MIKE is to facilitate system-transparent resource sharing while performance is also taken into consideration whenever possible. Due to the transiency of a process's existence, we do not think it is necessary to provide a facility for communication between these transient processes. This decision contributes to the conceptual clarity of the MIKE protocol structure and to more efficient system operation.

Guardians' names are a set of non-hierarchical identifiers since there is no control hierarchy imposed among guardians within the same physical node. Processes and objects do have their names; however, their names are further qualified by the name of their creator (i.e., the guardian's name). Each task is an autonomous unit and therefore administers its own naming system. However, it should allow other tasks to be able to specify, without any ambiguity, the desired resource and/or service (7).

(7) For example, a command (or request message) such as (STACK, PUSH, S1, TEMP) should indicate clearly that it is a request to the STACK guardian to PUSH an item TEMP onto stack S1. Another example is (Monitor, PASCAL, SOURCE), which indicates to Monitor (which is the guardian of the OS task) the desire to compile and execute a PASCAL program called SOURCE.

The naming system of MIKE has several distinct advantages. It simplifies the Name service task and facilitates the binding process it performs (see the next section). It also simplifies the generation process of globally unique identifiers (in all space and time). It facilitates system-transparent resource sharing and supports cooperative autonomous operation by hiding the resources "behind their guardians," such that resource access is regulated by each guardian according to the local resource management policy. The adopted naming scheme is motivated by the need for conceptual uniformity, by the unique application environment of the DDLCN, and by performance considerations. The usefulness and the impact of the MIKE naming system on other conceptual issues will be elaborated upon later.

4.3.2 Name Service Task

The unique address contained in each capability is machine-oriented and therefore is not convenient for users. To support human-oriented naming of resources, a Name service task can be provided in MIKE. The Name service task is a higher-level naming mechanism which pairs the human-oriented name with the machine-oriented address in capabilities. Every layer in the protocol structure has a Name service task which either performs the binding process (8) or initiates other actions if it cannot bind the name to an entity in that layer. The use of the Name service task will allow the sharing of names and permit names of related resources to be grouped in a way convenient to system programmers.

(8) Binding is the process of resolving a reference to an entity in MIKE by replacing the reference with the entity's machine-oriented address.

4.3.3 A Note on Implementation

It might seem incongruous to inject a note on implementation at this point; however, the presentation of a possible implementation can be used to clarify the conceptual issues we are discussing. In MIKE, a network-wide name (address) consists of a unique (with respect to the DDLCN) node identifier as a prefix and a unique (with respect to the local name space) guardian name. These network-wide addresses are used for internode communication only. Figure 7a gives an example of the machine-oriented name of the Virtual User service task residing in the DECsystem-20 node. Two bits (bits 8 and 9) are used to indicate the task type, where "00" indicates the OS task, "01" indicates a type task, "10" indicates a service task, and "11" indicates a task not in this particular node. The potential use of "11" in bits 8 and 9 will be clarified later. Further, Figures 7b and 7c illustrate two unique (with respect to the local name space) names for a process called PUSH and an object called S1 in type task STACK. The first 16 bits are actually the qualifier (i.e., the name of their task) for their names. Figure 7d shows the guardian name of type task STACK.

The network-wide naming system facilitates system-transparent resource sharing. For example, the Virtual Resource service task can broadcast a message to the DDLCN addressed to any Virtual User service task using its generic name, such as "1011111111111111" illustrated in Figure 7a. The hierarchical naming system for processes and objects facilitates the implementation of the "inter-guardian" communication mechanism, since MIKE can uniquely identify a destination entity's master (i.e., its guardian) from its machine-oriented name. We will use the examples in Figure 7 to help the discussion of other conceptual issues later.

Figure 7. Machine-Oriented Names for Entities
(a) Machine-oriented name for the Virtual User service task: bits 0-7 node identifier (DECsystem-20), bits 8-9 task type (service task), bits 10-23 task name (Virtual User service task).
(b) Machine-oriented name for the STACK object S1.
(c) Machine-oriented name for process PUSH in type task STACK.
(d) Machine-oriented name for guardian STACK.
(For the local names in (b) through (d): bits 0-1 task type, bits 2-15 task name, bit 16 process/object flag, bits 17 onward entity name.)
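As a concrete illustration of the 24-bit network-wide name of Figure 7a, the following C sketch packs and unpacks the node identifier, task type, and task name fields; the helper names are hypothetical, and bit 0 is taken to be the leftmost bit, as in the figure.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative encoding of a 24-bit network-wide name: bits 0-7 node
       identifier, bits 8-9 task type, bits 10-23 task name (bit 0 leftmost). */
    enum task_type { OS_TASK = 0, TYPE_TASK = 1, SERVICE_TASK = 2, NOT_LOCAL = 3 };

    static uint32_t make_name(uint32_t node, uint32_t type, uint32_t task)
    {
        return ((node & 0xFFu) << 16) | ((type & 0x3u) << 14) | (task & 0x3FFFu);
    }
    static uint32_t node_of(uint32_t name) { return (name >> 16) & 0xFFu;   }
    static uint32_t type_of(uint32_t name) { return (name >> 14) & 0x3u;    }
    static uint32_t task_of(uint32_t name) { return  name        & 0x3FFFu; }

    int main(void)
    {
        /* Figure 7a: the Virtual User service task on the DECsystem-20 node
           (node identifier 00000011, task type "10", task name all ones). */
        uint32_t vust = make_name(0x03u, SERVICE_TASK, 0x3FFFu);
        printf("name %06X: node %u, type %u, task %04X\n",
               (unsigned)vust, (unsigned)node_of(vust),
               (unsigned)type_of(vust), (unsigned)task_of(vust));
        return 0;
    }

The value printed for the Virtual User service task has "1011111111111111" as its low-order sixteen bits, which is exactly the generic name used in the broadcast example above.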

4.4 Process Interaction Model

4.4.1 Message Passing Versus Procedure Calling

In their paper, Lauer and Needham [LAU78] state that, for single-processor systems, message passing and procedure calling are essentially equivalent. That is, the message-oriented system and the procedure-oriented system are duals of each other, and a system which is designed according to one model has a direct counterpart in the other. Moreover, neither model is inherently preferable, and the main consideration for choosing between them depends on the machine architecture upon which the system is being built.

This duality of operating system structures cannot be extended to distributed systems. There are potential semantic differences between invocation of an operation on a local entity and on a remote entity. The invocation of an operation on a local entity, which appears to be a simple invocation and does not involve internode communication, will be obscured semantically if the message passing mechanism is used. Further, a local invocation has two possible outcomes: either the operation completes successfully or some error occurs. For an operation on a remote entity, however, another kind of outcome is possible. In particular, if no reply is received from the remote node, it is simply not known what has actually happened. The message to or from the remote node could have been lost, or the node could have failed before, after, or in the middle of processing the request.

A unique combination of the message passing and procedure calling mechanisms has been used to model process interaction in MIKE. For processes residing in the same task, either message passing or procedure calling can be used to structure their interaction, since the duality principle applies in this case. Which mechanism will be used in each instance is based on semantic and performance considerations. To reference a potentially remote entity, message passing is always used. The procedure calling mechanism is not feasible in this case since there is no shared memory to synchronize the activities and no guarantee that there will be a response to the request; these two are the prerequisites for its application.

4.4.2 Two-Level Process Interaction Model

The DDLCN consists of a collection of autonomous and heterogeneous nodes. Most of the existing local operating systems, e.g., the TOPS-20 Monitor, use a mixture of message exchange and procedure invocation for their process interaction. It would not be natural to mandate that the system communicate by message exchange only, since the OS task would then be an abnormality. Therefore, we will adopt a two-level process interaction model for MIKE such that both conceptual uniformity and performance considerations are taken care of. The model dictates that inter-task communication is through message exchange and intra-task communication is through either procedure invocation or message exchange.

4.4.2.1 Intra-Task Communication

The lower level of the process interaction model applies within a task boundary only. Processes executing within a task communicate through procedure invocation or message exchange, and the guardian maintains mailboxes for intra-task message communication. All the components of a task are confined to the same physical node and work harmoniously toward the same problem. Therefore, it is most efficient and semantically sound to model the intra-task activities in this way.

One of the design goals of MIKE states that access by local user processes to local resources should be as efficient as, and no more involved than, is common on existing local operating systems. In MIKE, users see the DDLCN as a single integrated computer; therefore, they access resources as though the resources were locally available, by submitting the request to the guardian (e.g., Monitor) of the OS task to which they are subordinate. The communication between the user and Monitor (both processes reside in the OS task) can be accomplished by either message passing or procedure calling. Which mechanism will be used depends on the original design considerations of the local operating system and is of no concern to MIKE. A scenario which illustrates system-transparent resource access will be presented in the final section of this chapter.

By using this intra-task communication model in MIKE, conceptual uniformity is not compromised, since the computational activities in the OS task can be modeled in the same way as other intra-task activities in service tasks or type tasks.

4.4.2.2 Inter-Task Communication

Logically, MIKE consists of a collection of tasks, each of which acts as an autonomous resource provider/user and is scattered around the DDLCN. Since MIKE is designed to be extensible and configurable, new tasks can be added and old tasks can be deleted dynamically. Based on this assumption, all communication among tasks is entirely message based.

A message is logically a package of information of arbitrary length. Basically, there are two types of messages: request and reply. Request and reply messages consist of address, control, and data parts. The address field includes the source and destination task names (i.e., the guardian names). Request messages contain an operation specification in the control part and parameters in the data part, which are directed to the guardian of the destination task. Reply messages contain the results of requests (e.g., success, failure, etc.) in the control part and results (if any) in the data part.
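The following C sketch shows one plausible concrete layout for such messages; the field names, widths, and the fixed-size data part are hypothetical simplifications of the address, control, and data parts described above.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    enum msg_kind { REQUEST, REPLY };

    struct message {
        /* address part: source and destination guardian (task) names */
        uint32_t src_guardian;
        uint32_t dst_guardian;
        /* control part: operation specification (request) or result code (reply) */
        enum msg_kind kind;
        uint16_t control;
        /* data part: parameters (request) or results, if any (reply) */
        uint16_t length;
        unsigned char data[64];
    };

    int main(void)
    {
        struct message m;
        memset(&m, 0, sizeof m);
        m.src_guardian = 0x03BFFF;   /* e.g., the Virtual User service task name of Figure 7a */
        m.dst_guardian = 0x015555;   /* an arbitrary, purely illustrative destination guardian */
        m.kind = REQUEST;
        m.control = 1;               /* hypothetical operation code, e.g., PUSH */
        m.length = 4;
        memcpy(m.data, "TEMP", 4);   /* parameter, as in the request of footnote (7) */
        printf("request to guardian %06X, op %u, %u data bytes\n",
               (unsigned)m.dst_guardian, (unsigned)m.control, (unsigned)m.length);
        return 0;
    }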

4.4.2.2.1 Message-Based Invocation

A process can operate directly on the objects in the address space of its task. However, a process can only access a remote entity (i.e., an entity not in the same task) by sending messages containing an operation specification and parameters to the guardian of the target entity. The guardian serves as a communication gateway for its subordinate processes when they wish to establish a communication channel with an outside entity. The guardian will send request messages and receive reply messages on behalf of its subordinate processes. These messages will be intercepted by the guardian of the destination task. After the request has been validated, this guardian will then act as an agent and dispatch processes to perform the requested operation on its guarded objects in a well-defined manner, and/or send additional messages to other tasks to aid it in carrying out the original request. When a request is executed, a reply message which indicates the result of the operations will be sent back to the requesting guardian.

Guardians have facilities to allow customers to request a one-time service, or to negotiate the establishment of a connection. Once the connection has been established, continuity can be provided by a session of conversation, and this could be used to improve response and performance.

MIKE is basically concerned with the inter-task activities, with message passing as an underlying semantic concept. Tasks are autonomous units, and their internal integrity and management are the responsibility of their respective guardians. Moreover, nothing can be shared or exchanged among processes from different tasks except by explicit arrangement, i.e., all interaction is prohibited unless explicitly allowed. This "explicit arrangement" is through message passing without regard to the location of the destination task.
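To round out the picture, the following toy C fragment traces one request/reply exchange end to end: a requesting guardian packages an operation and its parameters, the destination guardian validates and performs it, and a reply returns. Here a direct function call stands in for the Messengers and the communication subnetwork, and all names and values (send_request, stack_guardian, and the guardian numbers) are hypothetical.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct message { uint32_t src, dst; char op[8]; int arg; int result; };

    /* Stand-in for the destination guardian of a type task STACK. */
    static struct message stack_guardian(struct message req)
    {
        struct message rep = req;
        rep.src = req.dst; rep.dst = req.src;
        rep.result = (strcmp(req.op, "PUSH") == 0) ? 0 : -1;   /* 0 = success */
        return rep;
    }

    /* Stand-in for the requesting guardian: sends on behalf of its processes
       and waits for the reply (a direct call replaces the Messengers here). */
    static int send_request(uint32_t src, uint32_t dst, const char *op, int arg)
    {
        struct message req = { src, dst, "", arg, 0 };
        strncpy(req.op, op, sizeof req.op - 1);
        struct message rep = stack_guardian(req);
        return rep.result;
    }

    int main(void)
    {
        /* A process in another task asks STACK's guardian to PUSH an item,
           much like the (STACK, PUSH, S1, TEMP) request of footnote (7). */
        int rc = send_request(0x100001u, 0x155400u, "PUSH", 42);
        printf("reply result: %d\n", rc);
        return 0;
    }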

4.4.2.2.2 Messengers

The sending and receiving of messages is controlled by a service task called the Messenger. Message communication in each layer of the protocol structure is serviced by a dedicated Messenger, as depicted in Figure 8. Messages sent by guardians are always trapped by the Messenger at that layer. The Messenger is actually an interface between two adjacent layers of the protocol structure. One of its important functions is to maintain a set of mailboxes, one for each guardian residing in that layer (9), and to communicate with the Messengers in the layers immediately below and above it. Messages destined for a particular guardian will be queued at the corresponding mailbox by the Messenger at that layer.

(9) Messengers only maintain mailboxes for guardians. These mailboxes are for inter-task communication only. Processes other than guardians do not have their own mailboxes in Messengers; hence they cannot communicate with other tasks.

Figure 8. Inter-Guardian Communication
(Guardians grouped in the virtual machine, system support, and IPC layers, with a dedicated Messenger serving each layer and the IPC layer connected to the communication subnetwork.)

The syntax and semantics of message passing are independent of the nodes on which the two communicating guardians reside. However, certain optimizations can be performed by the system if both guardians reside at the same node; e.g., messages destined for a guardian co-residing on the same processor are not actually placed on the communication subnetwork.
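The following toy C sketch illustrates the mailbox bookkeeping a Messenger performs: a message addressed to a guardian registered in this layer is queued in that guardian's mailbox, while anything else is handed toward the layer below. The structures, sizes, and function names are hypothetical.

    #include <stdint.h>
    #include <stdio.h>

    struct msg { uint32_t dst_guardian; const char *text; };

    /* One mailbox per guardian residing in this layer. */
    struct mailbox {
        uint32_t   guardian;
        struct msg queue[8];
        int        count;
    };

    static struct mailbox boxes[4];
    static int nboxes;

    static void register_guardian(uint32_t g)
    {
        boxes[nboxes].guardian = g;
        boxes[nboxes].count = 0;
        nboxes++;
    }

    /* Queue a message for a local guardian, or pass it down otherwise. */
    static void messenger_deliver(struct msg m)
    {
        for (int i = 0; i < nboxes; i++) {
            if (boxes[i].guardian == m.dst_guardian && boxes[i].count < 8) {
                boxes[i].queue[boxes[i].count++] = m;
                printf("queued for guardian %06X: %s\n",
                       (unsigned)m.dst_guardian, m.text);
                return;
            }
        }
        printf("guardian %06X not in this layer; passing down\n",
               (unsigned)m.dst_guardian);
    }

    int main(void)
    {
        register_guardian(0x03BFFFu);
        messenger_deliver((struct msg){ 0x03BFFFu, "request" });  /* queued locally  */
        messenger_deliver((struct msg){ 0x020001u, "request" });  /* handed downward */
        return 0;
    }

A fuller Messenger would also hand queued messages to their guardians on demand and apply the co-residence optimization noted above.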

4.5 Resource Protection

To support rigid reliability and robustness of MIKE and to meet its performance, flexibility, and resource sharing requirements, a sound protection model is needed. In MIKE, the protection model is designed to control the computing environment so that no process has more access rights than required for its immediate operation. Although hardware/software errors cannot be prevented, this protection model will limit error propagation. To be more specific, the protection model for MIKE is aimed at error confinement, since error confinement is the most fundamental issue in system reliability and robustness. Without it, other protection measures, such as error categorization, system reconfiguration and error recovery, cannot realistically be successful [DEN76].

MIKE uses the capability-based architecture to support this protection model. The capability-based architecture has been advocated as a uniform method of implementing the highest possible degree of error confinement [COO79, DEN66, DEN76, ENG74, GRA72, LIN76b]. In this section, we will discuss four issues which enhance the MIKE reliability and robustness by confining errors in their immediate access environments. The issues covered are: small protection domains, residual control, action verification, and error recovery.

4.5.1 Small Protection Domains

The capability-based architecture mandates that every access to a system entity is made via its capability and that the validation is always performed without exception. As stated before, a capability is an unforgeable ticket which, when presented, can be taken as incontestable proof that the presenter is authorized to have access to the entity named in the ticket [LIN76b]. Capability mechanisms provide a natural implementation of two concepts which are essential to successful error confinement: closed environments and the least privilege principle.

4.5.1.1 Closed Environment

A system is a closed environment if no process has any access right which has not been explicitly granted [DEN76]; that is, constraints on access rights must be removed explicitly in a closed environment. In an open environment, constraints on access rights must be imposed explicitly. Although these two kinds of systems are functionally equivalent, in an open environment the access domain of a process tends to be expanded due to errors and omissions, whereas in a closed environment errors and omissions tend to shrink this access domain [DEN76]. Therefore, the latter is more error resistant. Following this philosophy, processes are isolated from each other; nothing can be shared or exchanged among processes except by explicit arrangement. Therefore, it is possible to check them all for consistency and validity as desired.

A set of capabilities which delineates the access domain of a process is called that process's capability-list. The capability-list method is a natural and the most efficient known implementation of a closed environment. Although other methods of implementation are possible, such as authority-based protection [DEN76, RAT80], they are not suitable for distributed environments; hence, they will not be discussed here.
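A minimal sketch of how a C-list realizes the closed environment is given below: an operation is allowed only if a capability naming the entity with the required right is present, so the default is denial. The layout repeats the illustrative capability format sketched earlier and is not the MIKE implementation.

    #include <stdint.h>
    #include <stdio.h>

    #define RIGHT_READ  0x1u
    #define RIGHT_WRITE 0x2u

    struct capability { uint32_t name; uint32_t rights; uint32_t displacement; };
    struct c_list     { const struct capability *cap; int count; };

    /* Every reference is made through the C-list; absence of a matching
       capability (or of the needed right) means the access is refused. */
    static int access_allowed(const struct c_list *cl, uint32_t name, uint32_t right)
    {
        for (int i = 0; i < cl->count; i++)
            if (cl->cap[i].name == name && (cl->cap[i].rights & right) == right)
                return 1;                 /* explicitly granted */
        return 0;                         /* closed environment: default is denial */
    }

    int main(void)
    {
        static const struct capability caps[] = { { 0x0155u, RIGHT_READ, 0u } };
        struct c_list cl = { caps, 1 };
        printf("read 0x0155:  %s\n", access_allowed(&cl, 0x0155u, RIGHT_READ)  ? "ok" : "denied");
        printf("write 0x0155: %s\n", access_allowed(&cl, 0x0155u, RIGHT_WRITE) ? "ok" : "denied");
        return 0;
    }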

4.5.1.2 The Least Privilege Principle

Within a closed environment, a process is isolated by its access domain. This domain-based protection can be further refined, according to the least privilege principle, to implement a tighter confinement of hardware/software errors [DEN80, GLI79]. The least privilege principle is defined as follows: every process of the system should be executed using the minimal set of privileges necessary to complete the computation. The principle of least privilege forces processes to communicate along prespecified channels, thereby eliminating unexpected interference.

Four requirements have been noted which must be met in order to implement the least privilege principle [DEN80, GLI79]. These four requirements are:

1. Every authorization must be explicit;

2. Every access must be checked for authorization;

3. Each small object must be protectable; and

4. The domain of a process must change frequently.

In MIKE, we follow these general concepts closely and consistently to provide a set of reliable yet efficient protection mechanisms to protect individual resource entities. The protection model uses a capability-based architecture to form a good basis for building a reliable and robust system. The following discussion will concentrate only on general protection considerations, as the implementation issue is beyond the scope of this dissertation.

4.5.1.3 System Protection Mechanisms

Capabilities are permanent, confer authorization of access, and can be grouped arbitrarily to define access domains. This is the basis for domain-based protection [BER80, COO79, DEN76, LIN76b, POP74, RAT80, RAT81, SAL75]. Based on the least privilege principle, the protection model demands that the protection domains should be as small as possible while still allowing processes to access what they are entitled to.

4.5.1.3.1 Domain Isolation

MIKE is structured by the object model as a set of entities: active or passive. Entities are further grouped into tasks in which processes can operate directly only on objects residing in the same task, and only if they possess the right capability. Capabilities provide one reasonable way to implement the small protection domain, since a capability corresponds to a set of access rights for a single object in the protection model. The model mandates that not only must the process have the capability to access an object, but also it must be allowed to perform only those operations specified in the access right fields of the capability. By using this protection model, the process's access domain is therefore isolated, and no unexpected form of interference can occur.

At a specific moment of execution, a process typically needs access to far fewer objects than it needs during its entire lifespan. If a process can be further restricted to this immediate domain at that moment, then the protection domains can be kept even smaller; that is, small protection domains can be enforced only if a process executes in many different domains and constantly switches between these protection domains during its execution.

4.5.1.3.2 Protection Domain Switching

It is natural to integrate protection domain switching with the calling of a protected procedure. The phrase "protected procedure" usually means that every time this procedure is called, the invocation involves a domain switch. This concept is usually associated with the mutual suspicion problem (see footnote on page 55), in which two interacting procedures belong to two mutually suspicious users. MIKE resolves the mutual suspicion problem by separating the execution of these two procedures into two processes residing in two different tasks and, hence, in two mutually isolated domains. The mechanism is further reinforced by action validation, which will be addressed in Section 4.5.3.

To support a robust system, the concept of small protection domains is enforced even within the task boundary. As stated earlier, processes within the address domain of the same task work together harmoniously toward one common goal. Therefore, they do not pose a potential threat to each other. However, to further limit the propagation of unexpected hardware/software errors, a single process is allowed to execute in many different protection domains, and each domain is associated with a different procedure.

As previously stated in Section 4.2.2.2, a process contains, among other things, a stack of PAD objects (10).

(10) PAD, which stands for PROTECTION/ACCESS DOMAIN, is one of the MIKE-defined types.

During a process's lifetime, a PAD object will be created each time a PROCEDURE object is called by that process. MIKE will copy the capability-list (or C-list for short) from the capability segment of that newly called PROCEDURE object (this object is what we usually refer to as a program or subroutine) into the capability segment of the just created PAD object. This PAD object is then pushed onto the stack maintained in that process. The C-list of this PAD object, which is the top element of the stack, defines the newly switched access domain for the process. When control is returned from the currently executing procedure to its caller, the top PAD object will be popped off and the calling PAD object will again define the process's access domain. Attenuation or amplification of access rights is allowed during procedure invocation, and this can be accomplished through the "template mechanism" as described in HYDRA of C.mmp [DEN76, WUL81]. (This will not be discussed further here.)
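The following toy C sketch illustrates the mechanics just described: each call of a PROCEDURE object pushes a new PAD whose C-list (copied from the procedure's capability segment) becomes the current access domain, and each return pops it. The types and sizes are hypothetical, and capability contents are reduced to bare numbers.

    #include <stdio.h>
    #include <stdint.h>

    struct c_list    { uint32_t cap[4]; int count; };
    struct procedure { const char *name; struct c_list clist; };
    struct pad       { struct c_list clist; };

    static struct pad stack[16];
    static int top = -1;                        /* index of the current PAD */

    static void call_procedure(const struct procedure *p)
    {
        stack[++top].clist = p->clist;          /* copy C-list into the new PAD */
        printf("enter %s: domain of %d capabilities\n", p->name, p->clist.count);
    }

    static void return_from_procedure(void)
    {
        --top;                                  /* caller's PAD defines the domain again */
        printf("return: domain of %d capabilities\n",
               top >= 0 ? stack[top].clist.count : 0);
    }

    int main(void)
    {
        struct procedure caller = { "CALLER", { { 1, 2, 3 }, 3 } };
        struct procedure callee = { "CALLEE", { { 2 },       1 } };
        call_procedure(&caller);
        call_procedure(&callee);                /* domain switch: a smaller C-list */
        return_from_procedure();
        return_from_procedure();
        return 0;
    }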

The concept of small protection domains contributes significantly to error confinement since the execution environment of each process is rigidly defined. The contributions can be seen in the fact that it facilitates debugging, testing, fault detection, maintenance and modification, and proving properties of programs [LIN76b] (11). However, the feasibility of realizing small protection domains relies on several factors such as the efficiency of protection domain switching, the size of objects, the flexibility for defining different modes of access to objects, etc. We will address the primary factor, the efficiency of domain switching, in Chapter 6.

(11) The benefit of small protection domains was manifested during the development of the CAP operating system, where most programming errors almost immediately violate access rules before the error has done much damage [WIL79].

4.5.2 Residual Control

4.5.2.1 Intra-Node Residual Control

Residual control is exercised due to the belief that no information should be exchanged through residual values left behind in the internal state of resources [DEN76]. Prominent examples of the resources to which this intra-node residual control applies are the central processing unit and main memory. Based on this principle, resources, after being freed, should be set to a predefined state (e.g., a null state) before they can be reassigned. Residual control can isolate access domains completely and prevent information from being disclosed either unintentionally or maliciously.

The concept of residual control is very simple; however, the cost of implementation is not cheap. For example, to nullify a portion of the main memory after the occupant has been preempted, each memory cell has to be accessed individually. Although a carefully engineered solution can be found by either hardware or software means, further reduction of this overhead is achieved by the sound protection model of MIKE without any potential threat to system reliability. In MIKE, the task concept is used to relax the high cost of intra-node residual control. Since processes from the same task work harmoniously toward the same goal and trust each other, if resources change hands between two such processes (12), residual control will not be exercised and the resources can be immediately reassigned without being set to the predefined state. In other cases, residual control should be enforced to ensure complete domain isolation. Based on this principle, the MIKE scheduler can be designed accordingly so that the overhead due to memory page faults or processor multiplexing can be reduced.

(12) Determining that two processes are from the same task can be accomplished easily by checking bits 0-15 (the guardian name) of their machine-oriented names (see Figures 7b, 7c, and 7d).
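The test in footnote (12) can be sketched in C as follows; here the 16-bit task qualifier is assumed to occupy the high-order half of a 32-bit machine-oriented name, and the values are illustrative only.

    #include <stdint.h>
    #include <stdio.h>

    /* Two entities belong to the same task exactly when their 16-bit task
       qualifiers (task type plus task name, i.e., the guardian) agree. */
    static int same_task(uint32_t a, uint32_t b)
    {
        return (a >> 16) == (b >> 16);
    }

    /* Residual control: zero a freed region only when it changes hands across
       a task boundary; between mutually trusting processes of one task the
       region can be reassigned immediately. */
    static void release(unsigned char *mem, unsigned len, uint32_t owner, uint32_t next)
    {
        if (!same_task(owner, next))
            for (unsigned i = 0; i < len; i++)
                mem[i] = 0;            /* set to the predefined (null) state */
    }

    int main(void)
    {
        unsigned char page[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        uint32_t push = 0x55540001u;   /* process in task STACK (qualifier 0x5554) */
        uint32_t s1   = 0x55540002u;   /* object in the same task                  */
        uint32_t file = 0x51550001u;   /* an entity in a different task            */
        release(page, 8, push, s1);    /* same task: no nulling                    */
        printf("after intra-task release: %u\n", page[0]);
        release(page, 8, push, file);  /* different task: residual control applied */
        printf("after inter-task release: %u\n", page[0]);
        return 0;
    }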

4.5.2.2 Inter-Node Residual Control

Another interpretation of residual control by MIKE is as follows. If a user requests a compiler, say PASCAL, to compile and execute his source code, and if the compiler is available only at a certain remote node, then, according to the DDLCN convention, a copy of the source code will be transferred to that remote site for compilation and execution (and not the other way around) (13). The Virtual User service task at the remote node, which acts as an agent for this particular user, should make sure that the source code copy is destroyed immediately after the requested work has been completed. It is possible that the system-supplied Virtual User service task is replaced by some other user-supplied task with a similar function. (This is legitimate in the DDLCN since MIKE is extensible and configurable.) A task template, which resides in the system support layer of the protocol structure for creating such a user-supplied task, can be used to enforce this policy. The issue will be discussed in the next section and again in Chapter 5.

(13) See Section 4.6 for the details of this remote resource access.

4.5.3 Action Verification

Action verification, an important means of error detection, is done in part by MIKE and in part by destination guardians. The data abstraction concept allows us to form various protected subsystems where the integrity of data within the subsystem is guaranteed. Capability-based addressing permits us to further verify the legitimacy of requests by detecting either system malfunctions due to hardware errors or invalid access requests, and this verification is performed without exception on every object reference. MIKE will check the format of the reference, validate the requested access against the rights in the capability it holds, and map and bind the object name to its physical address. The verifications performed by the guardian are as follows:

1. Transmission Check. The guardian will checksum the received messages to see if they are correctly received.

2. Operation Check. The guardian will verify that the operation requested is a valid one.

3. Object Check. The guardian will verify that the object to be operated upon does exist and is under its control.

4. Consistency Check. The guardian will verify that the operation is consistent and is executable with the current state of the object.

The requested access or operation will be honored if the request passes all the checks. The purpose of this mechanism is to locate errors by spotting inconsistencies and to confine errors by rejecting invalid requests.
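The four checks can be pictured as a single validation routine, as in the toy C sketch below; the message layout, checksum, operation range, and object table are hypothetical stand-ins for the corresponding MIKE structures.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    struct message { uint32_t dst; uint16_t op; uint16_t object; uint8_t data[16]; uint8_t sum; };
    struct object  { uint16_t id; int state; };

    static uint8_t checksum(const struct message *m)      /* over everything before sum */
    {
        const uint8_t *p = (const uint8_t *)m;
        uint8_t s = 0;
        for (size_t i = 0; i < offsetof(struct message, sum); i++)
            s = (uint8_t)(s + p[i]);
        return s;
    }

    /* 0 means the request is honored; 1-4 name the check that failed. */
    static int validate_request(const struct message *m,
                                const struct object *objs, int nobjs, uint16_t max_op)
    {
        const struct object *o = NULL;
        if (checksum(m) != m->sum)        return 1;   /* transmission check */
        if (m->op == 0 || m->op > max_op) return 2;   /* operation check    */
        for (int i = 0; i < nobjs; i++)
            if (objs[i].id == m->object) o = &objs[i];
        if (o == NULL)                    return 3;   /* object check       */
        if (o->state < 0)                 return 4;   /* consistency check  */
        return 0;
    }

    int main(void)
    {
        struct object objs[] = { { 7, 0 } };
        struct message m;
        memset(&m, 0, sizeof m);
        m.op = 1; m.object = 7;
        m.sum = checksum(&m);
        printf("verdict: %d\n", validate_request(&m, objs, 1, 4));   /* prints 0 */
        return 0;
    }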

4.5.4 Error Recovery

Several error recovery mechanisms are needed to deal with errors arising from different situations. For example, errors due to the communication subnetwork can be masked out by the reliable IPC protocols, which will be addressed in Chapter 5. Recovery from errors occurring in higher levels, such as parity errors or invalid access requests, can be attempted either by retrying the request or by aborting the originating process.

All in all, the system protection mechanisms provided in MIKE contribute to error confinement and hence to the reliability and robustness of system operation. The robust protection model on which all these protection mechanisms are based is the result of the integrated application of the kernel design principles mentioned in Chapter 2. However, it will be a major undertaking to achieve an effective combination of these ideas since significant overhead will be incurred. Comprehensive hardware and firmware mechanisms have to be incorporated in order to make the protection feasible. We will look at the performance issues again in Chapter 6.

4.6 System-Transparent Resource Sharing

4.6.1 Application Environments

It is essential to maintain a clean separation of policy decisions from basic mechanisms such that arbitrary policy decisions can be made at the user level (i.e., the virtual machine layer) [RAT80, WUL81]. Therefore, the resource management mechanisms presented in this dissertation are distinct from resource management policies and are supported directly by hardware/firmware.

In order to detail the usefulness of these mechanisms in our environment, we will present the resource management policies adopted in the DDLCN. As recalled from the previous discussion, system-transparent resource sharing is one of the main functions of MIKE. Because of this, users of the DDLCN can concentrate on their application logic without the concern of managing the communication subnetwork and its protocols.

Network resources in the DDLCN can be placed into one of two very rough categories: hardware and software. Hardware resources include peripheral devices, main memory, the central processing unit, etc. Software resources include special software packages such as language translators, system utilities, and other locally developed programs. All of these are administered by the local operating system (which is the guardian of the OS task) and are potentially sharable with other nodes in the DDLCN, except those that are specifically excluded (e.g., programs owned by individual users).

Those resources which are usually requested by users are said to be shared in an explicit way. There are some resources which are said to be implicitly shared (e.g., network-oriented service tasks) during the course of system-transparent resource access. Each of these sharable resources, whether explicitly or implicitly shared, is a potentially autonomous subsystem (14), and the success of remote resource access depends on the availability of every resource involved. However, users will only be informed either that the access of the resource is successful or that no such resource is available at that moment. This is the DDLCN system default mode of operation. Non-transparent operation can and does exist and allows users to explicitly manipulate and migrate the processing activities. This can be accomplished by sending an explicit indication to the guardian of their OS task. One more note is in order: the usage of a remote resource is confined to the node where the shared resource is located, and no dynamic migration of processing activities is assumed. All of these are managerial issues and can be changed easily with the underlying mechanisms if so desired.

(14) "Potentially autonomous" is used since a task can choose to respond to service requests unconditionally, e.g., the FILE type task.

4.6.2 Remote Resource Access

We have introduced the underlying object model for NOS services and have described the interaction protocol and protection mechanisms for components of the NOS model. In order to demonstrate that a robust foundation has been laid down for the MIKE framework, we will present a scenario of system-transparent resource sharing in the DDLCN. Through this scenario, we will see that the various system characteristics of MIKE described in Chapter 2 can be achieved through an integration of the conceptual issues presented above. But before we do that, we will give a brief summary of the contributions of these conceptual issues to the utilities of MIKE.

The conceptual issues presented in this dissertation are geared toward providing a realistic and workable design for MIKE. They aim to provide system-transparent operation to users and maintain cooperative autonomy among local computer systems while using sound modern design methodology to increase its utility and reliability. The system design provides the following features, which are essential to the utilities of MIKE.

1. System-Transparent Operation. Users can request resources/services by name regardless of their physical locations. System transparency is actually an illusion created by the active cooperation among autonomous service tasks distributed in the DDLCN. The fact that users see the DDLCN as a single integrated machine greatly increases the utility of distributed systems, especially in our application environment.

2. Cooperative Autonomy. The task concept provides a refined granularity for autonomous and protected subsystems. Each task acts as an autonomous and protected unit, guarding its resources and responding to requests as it sees fit.

3. Extensibility and Configurability. By using the task concept to structure NOS services, each node can selectively configure MIKE to provide the functions it needs without undue penalty from facilities it does not need. It can also extend or replace the MIKE-defined tasks to provide services unique to its applications.

4. Reliability and Robustness. The natural outcome of using the task concept provides atomic transactions (15), which are essential to system transparency and reliability [GRA78, STA79]. From the user's point of view, either everything happens (commits) or nothing happens (aborts); thus, the transaction becomes a unit of recovery. From the point of view of the NOS services (in the virtual machine layer), messages and data cannot be lost or partially changed. The latter functionality is provided by the reliable protocols at the IPC layer.

(15) A transaction is semantically atomic if it either commits or aborts.

Based on the environment laid out above, we will list all the intra-task, inter-task, and inter-node activities which are initiated due to a request from the user, "JOHN," for the compilation of his PASCAL source code PAS.source.

Figure 9 contains a set of compressed pictures of resource access activities. In the figure, each arrow is labelled by a unique number. These numbers correspond chronologically to a series of events detailed below:

Figure 9. System-Transparent Resource Sharing
(A set of compressed pictures of the resource access activities between R.node, a PDP-11 running RT-11, and S.node, a DECsystem-20 running Monitor, showing user JOHN, the JOHN.PASCAL processes of the Virtual Resource and Virtual User service tasks, the FILE tasks, and the PAS.source file; the numbered arrows correspond to the events listed below.)

(1) A user process, JOHN, is executing inside a PDP-11 node, which is designated as the request-node (or R-node for short), and is requesting a PASCAL compiler from its guardian (e.g., the RT-11 operating system of the PDP-11 computer) to compile his source code called PAS.source (see Figure 9b).

(2) The guardian (i.e., RT-11), after realizing no such compiler is available locally, responds to the request by sending a message through the Messenger at the virtual machine layer to the Virtual Resource service task (VRST) and by indicating the desire to access a remote resource called a PASCAL compiler (16) (see Figure 9b).

(3) After receiving the request from the guardian of the OS task, the VRST will create a dedicated process called JOHN.PASCAL in its address domain to handle the remote resource access. The VRST in R.node will then broadcast a message onto the communication subnetwork indicating the request from its OS task to all other Virtual User service tasks (VUSTs) in the DDLCN. Note that every node should have a VUST residing in its LIU, and every such VUST in the DDLCN should receive a copy of the broadcast resource request message (see Figure 9b).

(16) System "transparent" resource access applies to users only.

(4) The VUST in every other node will also create a similar dedicated process (i.e., JOHN.PASCAL) to handle the request. It will then relay the request to its local operating system (i.e., the guardian of the OS task), since it guards all the sharable resources in that node (see Figure 9b).

(5) We assume one node, which is designated as the service-node (or S-node for short), is willing to share its resource with R-node. Let us further assume that the S-node is actually a DECsystem-20 which runs the TOPS-20 Monitor. Monitor, the guardian of the OS task, will then send a message to the VUST indicating the acceptance of the request from R.node (see Figure 9c).

(6) The VUST in S.node will then relay the message back to the VRST of R-node, thereby establishing a resource sharing access channel between R.node and S.node. The establishment of the resource access channel means a session of conversation will be created by the VRST in R.node and the VUST in S.node to handle the remote interaction and to reduce communication overhead. To user JOHN, the illusion is formed that the VRST in the LIU of R.node is the local resource provider with which he communicates through RT-11; that is, the VRST acts as a "virtual resource" to JOHN. The VUST in S.node acts as a "virtual user" of the resource provided by Monitor in S.node. These provide a semantically coherent and simple model for remote resource sharing (see Figure 9c).

(7) The VRST in R.node will then inform the RT-11 about the successful connection of this access channel (see Figure 9c).

(8) A copy of the PAS.source file will be transferred to the FILE task in the LIU of R-node. This copy of the PAS.source file will now be under the control of the VRST (see Figure 9d).

(9) The PAS.source file will be transmitted to the FILE task in the LIU of S-node. So, the VUST also has a copy of the PAS.source file (see Figure 9d).

(10) The FILE task in S-node will pass the PAS.source file, at the demand of the VUST, to Monitor, where the compilation will actually take place (see Figure 9d).

(11) through (14) A newly created file, called PAS.object, which is the result of this compilation, will be sent back to RT-11 of R-node and hence back to user JOHN (see Figures 9d and 9e).

(15) Finally, by following the inter-node residual control principle, the various copies of PAS.source and PAS.object created during this process will be destroyed, and the processes executed in the VRST of R.node and the VUST of S.node will then cease to exist, thereby breaking the resource access channel.

Several issues related to this system-transparent resource sharing should be mentioned in order to clarify the brief presentation of resource access activities. The above scenario only gives an example of a simple request (i.e., compile PAS.source). Actually, a series of requests or interactions can also be achieved in this remote resource access by special facilities available in the VRST of R.node and the VUST of S.node to provide continuity and to reduce overheads.

There are several other interactions which happen from the moment the guardian of S.node expresses its willingness to share the resource with the remote user until the time the resource sharing access channel is established. These activities involve, for example, the selection of a bid from those submitted by potential resource providers. These are related to the resource management policy of distributed systems; hence they will not be considered here.

The heterogeneity among the protection models used by different OS tasks will be dealt with by special mechanisms incorporated in the VUST/VRST of their respective nodes. The notions of "virtual user" and "virtual resource" not only facilitate system-transparent resource sharing, but also help to safeguard the integrity of the protection systems of the individual operating systems. The data translation problems due to heterogeneous machine architectures will be handled by the system support layer of the protocol structure.

The VRST of R.node is the actual resource user (where the bill should be sent) as far as the guardian of the OS task in S.node is concerned. In the DDLCN, sharing of general resources will always be granted provided that the local management policy is observed. However, sharing of special resources, such as a particular utility program owned by an individual user or a classified database at the S.node, can be further regulated by checking the identity of the actual user, e.g., user JOHN in our scenario. Capabilities with passwords or encrypted user identification can be used to demonstrate the access authorization [DON76, DON80].

In Figure 9, all the inter-task communication is conducted as if the messages are sent directly to the destination task. However, this illusion is created by the lower level protocols, especially those tasks at the IPC layer in both nodes. We will explore this in Chapter 5.

In the scenario, atomic transactions can be accomplished as follows: after the compilation is done and when a complete copy of object code PAS.object is received by the VRST in R.node, only then will the output files requested by JOHN (e.g., PAS.object) be passed over to him.

There are some other potential uses of these atomic transaction concepts in the DDLCN. For example, the two-phase write in updating distributed databases [CHO81, GRA78] can be accomplished by a similar scheme.

This chapter has discussed the MIKE framework by presenting the conceptual issues which are used to structure NOS services. A task was logically formed as an autonomous and protected subsystem, and has served as an underlying semantic base by which process interaction is modelled. Comprehensive protection mechanisms have been provided such that reliable and robust operation can be achieved. The notions and models introduced can be applied consistently to embrace both the local operating systems and other network-oriented services, and to cover all three layers of the protocol structure. We have demonstrated the utility of these concepts from the viewpoint of the virtual machine protocol layer in Section 4.6. Chapter 5 will present the underlying protocol structure (i.e., the IPC and system support layers) that supports the virtual machine as seen by users at the virtual machine layer.

CHAPTER 5

PROTOCOL STRUCTURE

5.1 Introduction

A protocol is a formal set of message formats and exchange conventions or rules that the communicating guardians in the same layer of the protocol structure use for control and synchronization of their communication functions. Therefore, it has a set of allowed operations and parameters together with their formats and contains rules for specifying correct sequences of operations [AKK74, DIG74, STA79, SUN75, TAN81, ZIM80]. Between each pair of adjacent layers there is an interface. The interface defines which primitive operations and services the lower layer offers to the upper layer.


Our major emphasis in Chapter 4 was on the system operations and services viewed mainly from the virtual machine layer. This chapter will present the protocol structure which supports the network operating system (NOS) services provided for the virtual machine layer. The protocol hierarchy, described from the bottom up, consists of three layers: the interprocess communication (IPC) layer, the system support layer, and the virtual machine layer, as shown in Figure 10. The figure also depicts the intra-layer structures, which we will explore in this chapter.

The overall protocol structure of MIKE can be described by using the concept of virtual machines. To the users, MIKE coordinates the operation of autonomous computers such that the whole DDLCN appears as a single integrated virtual processing machine that is controlled by its local operating system. This distributed virtual machine is actually an illusion formed by the cooperation of network-oriented service tasks which are part of the virtual machine layer residing at the LIU of each node. To the tasks in the virtual machine layer, the system support layer and the IPC layer form a virtual communication machine such that the tasks in the upper layer think they communicate with one another directly.

Figure 10.

Protocol Hierarchy
(Virtual machine layer: higher-level application-oriented services and basic NOS services. System support layer: abstraction sublayer, with task templates and basic operations, and interaction sublayer, with guaranteed, reliable, and unreliable session protocols. IPC layer: guaranteed, reliable, and unreliable multi-destination protocols.)

The system support layer, in addition to abstracting common mechanisms to enhance the MIKE extensibility and reliability, supervises the inter-node communication executed by the IPC layer and provides continuity for their interactions. The IPC layer is responsible for exchanging messages of arbitrary length among nodes. It is concerned with running the raw communication links (1), and conditions them such that the system support layer can assume it is working with an error-free (virtual) channel.

Each layer of the protocol structure comprises a collection of autonomous tasks replicated in every node, where inter-task communication is achieved uniformly through the message passing mechanisms. Tasks which reside in the same layer, but not necessarily on the same node, are called peer tasks. Logically, peer tasks communicate directly. In reality, no messages are directly transferred from a layer on one node to its peers on another node (except in the IPC layer). Instead, each layer passes messages to the layer immediately below it, through the Messenger at that layer as stated in Chapter 2, until the IPC (the lowest) layer is reached. At the lowest layer it is connected to other nodes by a double-loop communication link. This is an unreliable communication facility, as opposed to the virtual communication machine used by the virtual machine layer.

(1) A link is the communication channel which connects two DDLCN nodes (or LIUs, to be precise).

The Reference Model of Open Systems Interconnection (OSI) [DES81, ISO80, ZIM80] developed by the International Standards Organization (ISO) consists of seven layers of protocol: the physical layer, the data link layer, the network layer, the transport layer, the session layer, the presentation layer, and the application layer (see Figure 11). The MIKE protocol hierarchy, due to its unique environment, consists of three layers: the IPC layer, the system support layer, and the virtual machine layer. This classification is based on the design principles of the MIKE protocol structure and is considered best suited to the DDLCN environment, as we will see in the following discussion.

5.1.1 Principles of Protocol Design

The protocol structure of MIKE is designed based on the following principles:

Layer   ISO              MIKE
  7     Application      Virtual Machine
  6     Presentation
                         System Support
  5     Session
  4     Transport
  3     Network
                         IPC
  2     Data Link
  1     Physical

Figure 11.

Approximate Correspondence Between ISO and MIKE Protocol Hierarchy

1. Layered Hierarchical Structure. Based on the concepts of modularity and hierarchical layered structuring, MIKE is described at different layers of abstraction. Lower layer modules create a foundation upon which higher layers are built. That is, each layer in the protocol structure uses the functions and services provided by the lower layers through their interfaces, and provides additional functions and services to the higher layers above it through its interface. The implementation details of each layer are transparent to all other layers in the protocol hierarchy, and therefore can be modified on a layer-by-layer basis.

2. Reliable and Efficient Operation. Due to the anomalies of the communication subnetwork and processing systems, message exchange between nodes in the DDLCN is considered very unreliable. Robust protocols should be devised and employed to guarantee the reliability of message exchange to the degree a particular application may need. The availability of these reliable protocols enables network-oriented services in the upper layers to concentrate on their application logic without concern for the communication subnetwork. Furthermore, the underlying network topology should be exploited to reduce network traffic, message delay, and queueing time.

3. Session Orientation. The potential NOS services in the DDLCN are session oriented. The protocol structure is therefore designed to support an extended interaction, called a session, among remote guardians. Facilities should be provided to facilitate the establishment and management of the session such that application-oriented functions in the virtual machine layer can have immunity from anomalies due to the distributed nature of the DDLCN.

4. Support of Modular Evolution. Since the implementation details of each layer will be transparent to all other layers above it, a change at one layer should not require a change to the entire system. Thus, MIKE can evolve gracefully. Common aspects of NOS services and resources, such as their logical structures, naming, protection, synchronization, and other common operations applicable to them, should be provided as primitives to facilitate the evolution.

5.2 Interprocess Communication Layer

In distributed systems, a major design issue of the IPC layer is to capitalize on the characteristics of its underlying network topology. The characteristics of a particular IPC protocol should reflect the underlying network topology and provide a set of efficient and reliable communication primitives for the upper layers which are based on its communication medium. According to this design philosophy, distributed systems with different communication subnetworks will have significant differences among their IPC protocol structures.

Providing reliable IPC services in a distributed environment requires a layered set of protocols within the IPC layer. Furthermore, reliability mechanisms should be incorporated to assist in detection, reporting, and recovery from component failures. We will begin working our way up the "inner" protocol hierarchy (of the IPC layer) starting at the bottom. But before doing that, we will present the functions of this IPC layer first.

5.2.1 Functions of IPC Layer

The IPC layer supports the transport of uninterpreted messages of arbitrary length with possibly multiple destinations in the DDLCN. It provides reliability mechanisms to guarantee the delivery of messages delimited by beginning-of-message (BOM) and end-of-message (EOM) marks.
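The following sketch illustrates, in Python and with assumed mark values, how a payload could be delimited by BOM and EOM marks, with byte-stuffing so that the marks never appear inside the data. The actual DDLCN encoding is not specified here, so every constant and function name below is illustrative.

    BOM, EOM, ESC = b"\x02", b"\x03", b"\x10"     # illustrative mark values

    def frame(payload: bytes) -> bytes:
        # Delimit a message with BOM/EOM, escaping any marks inside the payload.
        stuffed = (payload.replace(ESC, ESC + ESC)
                          .replace(BOM, ESC + BOM)
                          .replace(EOM, ESC + EOM))
        return BOM + stuffed + EOM

    def deframe(stream: bytes) -> bytes:
        # Recover the payload between the first BOM and its matching EOM.
        body = stream[stream.index(BOM) + 1:]
        out, escaped = bytearray(), False
        for byte in body:
            b = bytes([byte])
            if escaped:
                out += b; escaped = False
            elif b == ESC:
                escaped = True
            elif b == EOM:
                break
            else:
                out += b
        return bytes(out)

    msg = b"compile PASCAL \x03 source"
    assert deframe(frame(msg)) == msg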

The unpredictable nature of communication links and individual nodes in the DDLCN significantly complicates the mechanisms which provide these functions. Several areas can be identified to substantiate this claim. Lack of shared memory among autonomous computers contributes to this unpredictability of inter-node communication. The issues of delay, bandwidth, and routing resulting from transporting messages along communication links further aggravate the complexity of IPC mechanisms. Based on the abstraction concept presented in Chapters 2 and 4, a layered set of IPC mechanisms for the DDLCN can be designed using the schemes suggested in [LIU75, PAR78, PAR79a, PAR79b, REA75, REA76]. Furthermore, the NOS model and the associated task notion provide a structuring and implementation tool for the realization of these reliable IPC mechanisms.

5.2.2 Communication Facility of DDLCN

This section briefly identifies the important characteristics of the DDLCN communication facility that the lowest level protocol must use in providing an augmented service. Several characteristics of the communication medium which affect network performance are:

1. variable message delay and queueing time
2. duplication of messages
3. loss and damage of messages
4. out-of-order message delivery
5. message size
6. bandwidth

The communication facility of DDLCN is a double-loop structure which uses two high-speed digital communication channels arranged in a closed loop formation and logically assigned to transmit messages (2) in opposite directions. The loop architecture provides an excellent facility to implement multi-destination communication, where messages can be either fully broadcasted (i.e., a message addressed to all other nodes) (see Figure 12) or partially broadcasted (i.e., a message addressed to many, but not all, other nodes) (see Figure 13) [PAR79a, PAR79b]. Message routing is performed by the sending node based on the shortest distance algorithm [WOL79a, WOL79b], and the message on the loops will be checked and forwarded to its destination(s) by the nodes between the source and destination(s). Detailed discussion of the network operation can be found in [LIU75, LIU78, LIU79, LIU81, WOL78, WOL79a, WOL79b, WOL79c].

(2) Logically, a message is a unit of information with arbitrary length. However, a message may have to be divided into fixed-size packets due to physical limitations.

Figure 12.

Full-Broadcast Message Transmission

Figure 13.

Partial-Broadcast Message Transmission

5.2.3 Loop Access Protocol

The loop access protocol deals with the actual message transmitted to and received from the raw communication links. This is the lowest level primitive provided by the IPC layer. The loop access is accomplished by the Transmitter service task (T.task) and the Receiver service task (R.task). R.task controls the receiving of multi-destination messages. Actions taken by R.task on the incoming messages from the loop can be grouped into the following three classes: 1) the message destined to this node will be copied and relayed back onto the loop; 2) the message not destined to this node will simply be passed to the T.task for relaying to the next LIU downstream; and 3) the message sent by this node will be removed.

T.task controls the transmission of multi-destination messages by using Reames' shift-register insertion technique [LIU75, REA75, REA76]. Its function is to place both incoming messages relayed from the R.task and messages generated locally by other service tasks onto the loops. T.task is capable of merging these two message streams into one without interference and without the use of centralized control [TSA79]. These two service tasks mainly concern physical communication channels, and they interact directly with the local communication channel controller, e.g., the universal synchronous/asynchronous receiver/transmitter (USART) [WEI78]. There are other communication-oriented service tasks at this level which cooperate with R.task and T.task to perform the following unique communication control for the DDLCN [REA75, REA76, WOL79b]:

1. Link fault detection and reconfiguration;
2. Lost and damaged message detection and removal; and
3. Network traffic regulation and loop domination prevention.

By using these loop access service tasks, higher-level protocols can be defined which control the transport of messages among the DDLCN nodes as requested by tasks in the system support layer. These high-level protocols should incorporate reliability measures in order to ensure the safe delivery of messages. However, reliability mechanisms do incur potentially large overhead, and their use should not be imposed on all communications. The basic communication primitives do provide a simple, fast, and inexpensive way of inter-node communication. Therefore, rather than insisting on the scheme which ensures reliable transport of messages, programmers should be able to choose from several different multi-destination protocols. Each of these high-level protocols should offer a different level of reliability. The most reliable protocol is implemented with extensions (or abstractions) built from primitive ones. We will now describe these protocols starting at the bottom.

5.2.4 Unreliable Multi-Destination Protocol

This type of multi-destination protocol implements a facility by which a guardian at the system support layer can send a message to multiple remote guardians. Damaged messages are discarded; however, no attempt is made to acknowledge messages. Therefore, the sending site does not know whether or not the message has been correctly delivered. The objective of this type of protocol is to provide a minimum-overhead facility which can be used either by some guardians in the system support layer which can take care of the unreliable situations or by more reliable multi-destination protocols which we will discuss next. The main functions of this type of protocol are making the routing decision and forming the source and destination address (3) fields of the outgoing message.
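A small sketch of these two functions, assuming a seven-node loop numbered 0 through 6 and written in Python for illustration, is shown below. The bit-map layout and the tie-breaking rule between the two loop directions are assumptions of the sketch, not the DDLCN conventions.

    NUM_NODES = 7                       # assumed network size, one bit per DDLCN node

    def destination_bitmap(destinations):
        # Set bit i for every node i that should receive the message.
        bitmap = 0
        for node in destinations:
            bitmap |= 1 << node
        return bitmap

    def shortest_direction(src, dst, n=NUM_NODES):
        # Pick the loop whose hop count to dst is smaller (ties go clockwise here).
        clockwise = (dst - src) % n
        counter = (src - dst) % n
        return "clockwise" if clockwise <= counter else "counter-clockwise"

    dsts = {1, 2, 5}
    header = {"source": 0, "dest_map": destination_bitmap(dsts),
              "routes": {d: shortest_direction(0, d) for d in dsts}}
    print(bin(header["dest_map"]), header["routes"])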

5.2.5 Reliable Multi-Destination Protocol

This type of protocol demands a positive acknowledgment from the receiving ends. Thus, the sending guardian will be notified whether or not the message has been successfully delivered. The reliable multi-destination protocol service task accomplishes its function by requesting the unreliable multi-destination protocol service task to send the message. Furthermore, it will ask the latter to retransmit the message at intervals determined by a "retransmission timeout" if any of the destination nodes has not acknowledged the message. However, this retransmission can be requested only a certain number of times, or the monitoring job will be executed within a certain time bound and then quit. So this reliable multi-destination protocol can be considered a "best effort to deliver" protocol [PAR78, PAR79a, SUN75, SVO79].

(3) A destination bit map can be used in the destination field which contains one bit per DDLCN node and therefore indicates the destination(s) of the message.
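The retransmission discipline just described can be sketched as follows (Python, illustrative only). The unreliable_send and collect_acks callbacks stand in for the unreliable multi-destination protocol service task and the acknowledgment handling, and the timeout and retry values are arbitrary.

    import time

    def reliable_send(message, destinations, unreliable_send, collect_acks,
                      timeout=0.5, max_retries=3):
        # Best-effort delivery: retransmit until every destination has acknowledged
        # or the retry budget is exhausted; report who never answered.
        pending = set(destinations)
        for attempt in range(max_retries + 1):
            unreliable_send(message, pending)      # ask the lower protocol to send
            time.sleep(timeout)                    # wait one retransmission timeout
            pending -= collect_acks()              # drop destinations that acknowledged
            if not pending:
                return True, set()
        return False, pending                      # "best effort" gave up on these nodes

    # A trivial stand-in for the lower layer: node 2 never acknowledges.
    acked = set()
    ok, missing = reliable_send(b"hello", {1, 2, 3},
                                unreliable_send=lambda m, d: acked.update(d - {2}),
                                collect_acks=lambda: acked,
                                timeout=0.01)
    print(ok, missing)        # False {2}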

5.2.6 Guaranteed Multi-Destination Protocol

This type of protocol guarantees to the sending guardian that its multi-destination message will eventually arrive at every destination even if some are not operational at that moment. This guaranteed multi-destination protocol service task will request the reliable multi-destination protocol service task to send the message and resubmit the request until the message safely arrives at every destination. Information regarding the message context and its transport status should be kept at stable storage (e.g., nonvolatile memory), so that if the sending or receiving node fails during the processing of the message, upon becoming active again it will still have the same initial copy of the message. This type of protocol can be used to ensure that every destination will receive the message, even if some node has failed or is unreachable at the time it attempts to send the message.
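A toy model of this behavior is sketched below in Python. The JSON file stands in for stable storage, and guaranteed_send, resubmit_all, and the reliable_send callback are hypothetical names introduced only for this illustration.

    import json, os

    JOURNAL = "outbox.json"              # stands in for stable (nonvolatile) storage

    def persist(entries):
        with open(JOURNAL, "w") as f:
            json.dump(entries, f)

    def load():
        if not os.path.exists(JOURNAL):
            return []
        with open(JOURNAL) as f:
            return json.load(f)

    def guaranteed_send(message, destinations, reliable_send):
        # Record the message before sending; keep resubmitting until every
        # destination is known to have received it, surviving restarts in between.
        entries = load()
        entries.append({"msg": message, "pending": sorted(destinations)})
        persist(entries)
        resubmit_all(reliable_send)

    def resubmit_all(reliable_send):
        entries = load()
        for e in entries:
            delivered = reliable_send(e["msg"], set(e["pending"]))   # returns acked nodes
            e["pending"] = sorted(set(e["pending"]) - delivered)
        persist([e for e in entries if e["pending"]])                # drop completed ones

    # After a crash, the node simply calls resubmit_all() again with the same journal.
    guaranteed_send("PAS.object ready", {1, 4}, reliable_send=lambda m, d: d)
    print(load())            # [] : everything delivered, journal cleared
    os.remove(JOURNAL)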

The IPC layer is assumed to have the capability to handle messages of arbitrary length. Although Reames' shift-register insertion technique is capable of transmitting variable-length messages, the physical buffer space does place a limit on the maximum size of a message it can accommodate at any one time. Therefore, a service task is needed to convert the messages into fixed-length packets and back again, i.e., disassembling, sequencing, buffering, and reassembling. Further, elaborate mechanisms have to be employed to filter the duplication of packets, to rearrange out-of-order packet delivery, and to deal with problems of lost and damaged packets.
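A sketch of such disassembly and reassembly, in Python and with an assumed packet size, is given below. It tolerates duplicated and reordered packets but does not model the timers and retransmissions that a real implementation would also need for lost or damaged packets.

    PACKET_SIZE = 8            # assumed physical buffer limit, in bytes

    def disassemble(msg_id, payload: bytes):
        # Split a variable-length message into sequenced fixed-size packets.
        chunks = [payload[i:i + PACKET_SIZE]
                  for i in range(0, len(payload), PACKET_SIZE)] or [b""]
        total = len(chunks)
        return [{"id": msg_id, "seq": i, "total": total, "data": c}
                for i, c in enumerate(chunks)]

    def reassemble(packets):
        # Rebuild the message even if packets arrive duplicated or out of order.
        by_seq = {}
        for p in packets:                    # duplicates simply overwrite the same slot
            by_seq[p["seq"]] = p["data"]
        total = packets[0]["total"]
        if len(by_seq) < total:
            return None                      # still waiting for lost packets
        return b"".join(by_seq[i] for i in range(total))

    import random
    pkts = disassemble(42, b"object code for PAS.object")
    random.shuffle(pkts)
    assert reassemble(pkts + pkts[:2]) == b"object code for PAS.object"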

The multi-destination protocols per se and their interaction within the IPC layer can be modelled by using the object model and its associated task notion presented in Chapter 4. The more powerful and reliable mechanism, which is similar to an "extended type," accomplishes its job by requesting the guardian of the primitive mechanism to perform a series of more basic operations. The implementation of the IPC layer can then be decomposed into smaller and more manageable tasks.

5.3 System Support Layer

5.3.1 Functions of System Support Layer

The system support layer provides two distinct kinds of services for the virtual machine layer. The first of these two support service groupings (called the abstraction sublayer) is oriented toward system reliability and extensibility. It abstracts common mechanisms to facilitate the introduction of new NOS services while it maintains the integrity of the protection model. A run-time environment is therefore created for the virtual machine layer embodying the common service support features in terms of library tasks, utility service tasks, or other building block mechanisms.

The second of these groupings (called the interaction sublayer) is oriented toward network communication. It deals with anomalies of processing and communication systems at a higher level than those in the IPC layer, and provides a virtual communication channel for the duration of interaction for guardians in the virtual machine layer. All the functions provided can be classified as support-oriented; therefore, it is called the system support layer.

5.3.2 Abstraction Sublayer

The major functions of the abstraction sublayer are to provide templates for the construction of user-defined service/type tasks and to provide primitive system utilities whereby guardians in the virtual machine layer can build more sophisticated or higher-level application-oriented tasks. The facilities added in this way can then be assumed to have an equal status with pre-existing ones. We will not attempt to identify all the system primitives and task templates needed in this layer; instead, we will only present some representative mechanisms.

5.3.2.1 Task Templates

The rationale for providing these templates at the system support layer is based on the need to safeguard the integrity of the protection model, to allow MIKE to evolve in a modular fashion, and to stick to the principle of policy/mechanism separation. At the system support layer, there will be standard library task templates available to allow programmers in the virtual machine layer to create their own versions of type/service tasks in a substantially simpler and less error-prone way. Typically, system programmers will invoke these templates to create a new type out of existing types or to add/replace NOS service tasks.

The former case actually amounts to defining new (user-defined) type tasks. The latter is concerned with the replacement of standard NOS service tasks and the extension of the MIKE system utilities, so that, by providing additional operating system functions, the system can be tuned toward a particular application. The result of this will be that MIKE is running with several co-existing tasks defining essentially the same services, for example, several FILE type tasks or several Virtual Resource service tasks.

The use of these task templates not only provides a convenient facility for programmers to tune the system to their own needs, but also allows the system protection mechanism to be extended to cover these newly created tasks in the same way as it covers the old ones. The templates provide, among other things, a uniform view of a resource entity model which incorporates essential components to ensure the integrity of inter- and intra-task operations. For example, the organization of the data segment and capability segment in an object should conform to the MIKE convention so that protection mechanisms can be applied in a uniform manner. Another example is that the maximum interval for any guardian to check its mailbox in the Messenger should be observed to prevent the mailbox from being flooded with uncollected messages.

Furthermore, for some special NOS service tasks, certain protection measures for inter-node residual control principles have to be enforced in the replacement task in order to conform to the MIKE resource protection model. For example, the Virtual Resource service task can be replaced with a version suitable to the needs of a particular node. However, code has to be inserted to delete the temporary copy of the user's file after the remote resource access has been completed (see step 15 in Figure 9).

5.3.2.2 Primitive NOS Utilities

Primitive and common operations applicable to inter-guardian communication and interaction, especially those network-oriented, should be provided at the system support layer. Higher-order mechanisms can then be built out of these primitive building blocks in the virtual machine layer. This will facilitate the construction of application-oriented NOS services. We will use the synchronization problem to exemplify the utility of this concept.

Many guardians' activities that exhibit concurrency require harmonious cooperation if the results are to be correct. The overall problem of enforcing an orderly cooperation is called synchronization [BRY79, PAR79a]. Distributed synchronization deals with the cooperation among distributed guardians. Any such synchronization problem occurring in the DDLCN can be resolved in a self-contained manner; that is, the algorithm is modified to enforce the order of events without any system intervention. However, such an approach is usually unreliable.

By using our approach, a primitive synchronization model (e.g., a distributed semaphore model) can be provided in the system support layer such that it is simple to use, of high performance, highly reliable, and also suitable for verification. By providing this low-level synchronization model in the system support layer, high-level synchronization mechanisms, e.g., control abstractions [LIU81], can be built in the virtual machine layer such that the solution to a synchronization problem can be highly structured, semantically clear, and suitable for the application on hand.
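As a rough illustration of the kind of primitive intended here, the sketch below models a semaphore owned by a single guardian that serializes P and V requests arriving as messages. It is written in Python purely for exposition and ignores message transport, crash recovery, and fairness issues that a real distributed semaphore in the system support layer would have to address; all names are invented.

    from collections import deque

    class SemaphoreGuardian:
        # One guardian owns the count and serializes P/V requests that arrive as
        # messages from guardians on any node (illustrative model only).
        def __init__(self, count=1):
            self.count, self.waiting, self.replies = count, deque(), []
        def handle(self, sender, op):
            if op == "P":
                if self.count > 0:
                    self.count -= 1
                    self.replies.append((sender, "granted"))
                else:
                    self.waiting.append(sender)      # block the requester by deferring the reply
            elif op == "V":
                if self.waiting:
                    self.replies.append((self.waiting.popleft(), "granted"))
                else:
                    self.count += 1

    sem = SemaphoreGuardian(count=1)
    for msg in [("A", "P"), ("B", "P"), ("A", "V")]:   # A enters, B waits, A releases
        sem.handle(*msg)
    print(sem.replies)        # [('A', 'granted'), ('B', 'granted')]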

5.3.3 Interaction Sublayer

The interaction sublayer deals with, as the name implies, the interaction of message senders and message recipients. By using the same approach we take in designing the IPC layer, we devise three protocols, each with a different level of reliability and sophistication.

In MIKE, it is a rare case that the interaction among remote guardians only involves simple transactions in which a guardian sends a message, waits for a reply, and then forgets about it. The most common inter-node activities are initiated by remote system resource sharing, in which a series of interactions is needed to accomplish the access.

A connection will therefore be established among cooperating guardians residing in different nodes of DDLCN to reduce communication overhead and to provide coherent interactions, and we call this connection a session.

Due to our special environment, we will present a layered set of protocols mainly dealing with sessions; however, a simple transaction is only a restricted version of a session, and the protocols we present can be applied easily to that situation. One more note is in order concerning the difference between the session protocols we are going to present next and the multi-destination protocols in the IPC layer. The multi-destination protocols are concerned with the transport of a unit of message. Therefore, a message is either delivered or not delivered to a destination. In the session protocols at the system support layer, the emphasis is placed on the interrelation and coherence of multiple messages which are exchanged to accomplish some work in the virtual machine layer.

5.3.3.1 Unreliable Session Protocol

The main function of this unreliable session protocol is to invoke the multi-destination IPC mechanisms provided in the IPC layer to transmit a message to remote destinations. A variable-length message together with its naming association will be transferred to the particular multi-destination protocol service task it chooses, depending on its reliability requirement. The naming association is an n-tuple (n ≥ 2) where the first element in a naming association is the name of the source guardian and the rest of the elements indicate the multiple destinations the message addresses. The multi-destination protocol service tasks in the IPC layer can then use this naming association to form a destination bit map for the outgoing message and make the appropriate routing decision. Forming the naming association for a message involves the binding of remote guardians' names, which will be executed by the Name service task at the system support layer. Neither delivery assurance nor crash recovery will be attempted for the messages during a session of conversation among interacting guardians.

5.3.3.2 Reliable Session Protocol

This type of protocol uses the unreliable session protocol to implement a reliable exchange of messages. This is analogous to implementing a reliable multi-destination protocol using the unreliable multi-destination protocol in Section 5.2.5. This protocol relieves the guardians in the virtual machine layer from the responsibility of masking out the anomalies of remote processing systems and the communication subnetwork.

The main function of this reliable session protocol, in addition to invoking the unreliable session protocol to form the naming association and to supervise the message transport, is to provide delivery assurance and crash recovery, but within a certain time limit. The delivery assurance guarantees that messages will be correctly received, and that the messages received will not be duplicated, missequenced, or misdelivered. The result is that the same request will not be repeated and out-of-order activities will not happen. Further, messages which have been safely delivered to the destination guardians at the virtual machine layer of other nodes will not be lost due to node failure while the messages are being processed. This reliable session protocol will make its best effort, within the pre-specified time limit, to help the system recover from a crash and return to a known state.
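The delivery-assurance bookkeeping on the receiving side might look like the following Python sketch, which uses per-session sequence numbers to discard duplicates and release messages in sending order. The class and field names are invented for this illustration.

    class SessionReceiver:
        # Receive-side bookkeeping for one session: deliver each message exactly
        # once and in order, buffering anything that arrives early (illustrative).
        def __init__(self):
            self.next_seq, self.buffer, self.delivered = 0, {}, []
        def accept(self, seq, payload):
            if seq < self.next_seq or seq in self.buffer:
                return                               # duplicate: drop it
            self.buffer[seq] = payload
            while self.next_seq in self.buffer:      # release any run that is now in order
                self.delivered.append(self.buffer.pop(self.next_seq))
                self.next_seq += 1

    rx = SessionReceiver()
    for seq, data in [(1, "b"), (0, "a"), (1, "b"), (2, "c")]:
        rx.accept(seq, data)
    print(rx.delivered)      # ['a', 'b', 'c']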

5.3.3.3 Guaranteed Session Protocol

This type of protocol guarantees that a session of interaction will be precisely carried out as specified by the communicating guardians. Again, stable storage is needed to store the "session record" in order to survive the crash if the node fails abruptly. It fulfills its goal by repeatedly invoking the reliable session protocol at pre-specified intervals.

This layered set of session protocols provides support for extended inter-guardian interactions in the virtual machine layer. Together with the layered set of multi-destination protocols in the IPC layer, they form a virtual communication machine that guardians at the virtual machine layer will see.

5.4 Virtual Machine Layer

All of the mechanisms available in the lower layers of the protocol hierarchy, i.e., the system support layer and the IPC layer, form a protective layer shielding the virtual machine layer from the anomalies of processing systems and the communication subnetwork. Further, the system support layer also oversees the integrity of the protection model by supplying task templates for task creation at the upper layer. Guardians in the virtual machine layer, therefore, can concentrate only on their application logic. This abstraction concept is crucial to the design of MIKE. Without this concept, it would be impossible to partition the design of MIKE, an unmanageable problem, into several smaller, manageable design problems, namely the design of the individual layers.

A complete set of standard and basic network-oriented services for system-transparent resource sharing, such as the Virtual User service task, Virtual Resource service task, FILE type task, Name service task, or even high-level error handling and recovery tasks, should be defined in the virtual machine layer. Other tasks, such as a high-level distributed synchronization service task and a distributed database service task, can also be provided depending on the application environment on hand.

In the MIKE protocol hierarchy, each guardian in one layer always communicates with other guardians in the same layer, although they can be in different physical nodes. The peer guardians in each layer conceptually think of their communication as being "horizontal," using that layer's protocol (see Figure 14). However, this perception is formed through the Messenger at that layer. The Messenger will first attempt to perform the binding by searching the local name space. If the destination guardian is not at that node, then the message will be redirected to the Messenger immediately below after appending a header to the message. At the receiving node, the message moves upward from layer to layer, with headers being stripped off as it progresses. None of the headers appended by a Messenger at a certain layer will be passed up to layers above it.

Figure 14.

Inter-Guardian Message Exchange
(Messengers at the virtual machine, system support, and IPC layers of two nodes; dashed lines denote pseudo communication paths and solid lines denote real communication paths. VML: Virtual Machine Layer; SSL: System Support Layer; IPC: Interprocess Communication Layer.)
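The Messenger behavior just described can be sketched as follows (Python, illustrative only). The message format, the class names, and the toy LoopLink object standing in for the loop access tasks are all assumptions of this sketch.

    class Messenger:
        # One Messenger per layer per node: deliver locally when the destination
        # guardian is in the local name space, otherwise append a header and push
        # the message to the Messenger one layer below (illustrative model only).
        def __init__(self, layer, local_guardians, lower):
            self.layer, self.local, self.lower = layer, local_guardians, lower
        def send(self, message):
            handler = self.local.get(message["to"])
            if handler:                              # binding found in the local name space
                return handler(message)
            wrapped = {"header": self.layer, "to": "peer", "body": message}
            return self.lower.send(wrapped)          # redirected one layer down
        def strip(self, wrapped):
            # At the receiving node, each Messenger removes its own layer's header
            # before passing the message upward.
            return wrapped["body"]

    class LoopLink:                                  # stands in for the loop access tasks
        def send(self, message):
            print("placed onto the loop:", message)

    ipc = Messenger("IPC", {}, LoopLink())
    vml = Messenger("VML", {"VUST": lambda m: print("delivered locally:", m)}, ipc)
    vml.send({"to": "VUST", "body": "request PASCAL compiler"})         # stays on this node
    vml.send({"to": "remote VUST", "body": "request PASCAL compiler"})  # wrapped twice, onto the loop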

Let us now use the scenario in Section 4.6 to illustrate the usefulness of this concept in particular and the protocol hierarchy in general. At this point we can pose a question, which we have purposely avoided previously, about how the service tasks residing in different nodes can communicate with one another. We will elaborate upon the events occurring during the execution of step 3 in the scenario (see Figures 9 and 15).

(1) The Virtual Resource service task (VRST) at the resource request node (R.node) broadcasts a message to all Virtual User service tasks (VUSTs) in the DDLCN to request a PASCAL compiler.

(2) The message to be broadcasted is actually sent to the Messenger at the virtual machine layer.

(3) After realizing that the destination guardians are not at the same physical node, the Messenger will route the message to the Messenger at the system support layer.

(4) The Messenger at the system support layer will direct the message to one of the session protocol service tasks to request the establishment of a session.

Figure 15.

Inter-Guardian Communication Using the Virtual Communication Machine
(Messengers at the VML, SSL, and IPC layers of R.node and the other nodes, with dashed lines denoting pseudo communication paths and solid lines denoting real communication paths)

(5) The Session Protocol service task (SPST) will then communicate with all the peer SPSTs in the DDLCN to establish connections. This situation is similar to (1) above.

(6) The connection establishment request is actually sent to the Messenger at the system support layer.

(7) The Messenger will realize that the destination guardians are not at the same node, and therefore will relay the message to the Messenger at the IPC layer.

(8) The Messenger at the IPC layer will use one of the multi-destination protocols to multiplex the message onto the loop.

(9) The multi-destination protocol service task will then send the message to its peer multi-destination protocol service tasks in the DDLCN. Again, this is similar to (1) and (5) above.

(10) This request is actually sent to the Messenger at the IPC layer.

(11) The Messenger at the IPC layer will direct this "non-local" message to the T.task, which will actually place the message onto the loop.

After receiving the message from the multi-destination protocol service task at R.node, the receiving node will first establish a connection between the two SPSTs at the system support layer. Then, based on these connections, a series of conversations can be conducted between the VRST at R.node and the VUSTs at the receiving nodes. We can see from this example that the system support layer and the IPC layer actually form a virtual communication machine which enables the guardians in the virtual machine layer to communicate "horizontally," and that the protocol structure of MIKE achieves its goal to support coherent and object-based NOS services at the virtual machine layer.

CHAPTER 6

LIU ARCHITECTURE

6.1 Introduction

In this chapter, we will demonstrate how to use existing techniques to provide a software-directed architecture for the LIU in which MIKE is running. Up to this point and throughout this dissertation, the term "software-directed architecture" has been used but not defined. Figure 16 illustrates the relationship of an operating system to its basic computer hardware. A conventional architecture implements a set of primitive entity types (e.g., integer, real) and a set of associated primitive operators (e.g., ADD, SUBTRACT, LOAD). System software is used to extend these primitive types and operators to provide higher-level extended types and operators for the programmers so that they see a more friendly (virtual) machine. A software-directed architecture means that the extended types and their associated operators of the operating system are supported at the hardware/firmware interface (1).

Figure 16.

Relationship Between Virtual and Physical Machines
(high-level language machine, extended machine, architecture interface (assembly language level machine), and bare machine)

The motivation to have a software-directed architecture for LIU is largely due to performance considerations. The design of MIKE is based on advanced software methodology. The realization of these advanced concepts in a distributed environment has often proven to be costly in both time and space [DEN76, DEN79, DEN80, FLY80, GEH79a, LIN76b]. The problems arise because all of these concepts characterize the basic procedure call interface (which results in frequent protection domain switching) and demand reliable message exchanges (which result in a significantly higher volume of message traffic, both intra-node and inter-node). Therefore, a heavy overhead is associated with imposing the logical structure of MIKE onto conventional hardware, since the NOS model and protection mechanisms cannot be supported at the architectural interface, and software means have to be employed to realize these high-level operating system concepts.

(1) A software-directed architecture is different from a language-directed architecture. The latter is one in which the high-level types and operators provided by the language translator are supported at the hardware/firmware interface [DEN79, FLY80, JAG80].

To improve MIKE performance, we will incorporate additional hardware/firmware mechanisms in LIU to facilitate the inter-task communication and intra-task protection. The design of MIKE is based on the concepts of inter-task communication by message exchange and intra-task protection by domain switch. The feasibility of this approach is largely dependent upon having minimum-overhead messages and minimum-overhead processes.

6.1.1 Minimum-Overhead Messages

"Minimum-overhead messages" means minimum overhead in generating, processing, and delivering messages. Message passing, which is used heavily in MIKE, appears to be very expensive. Most approaches to distributed systems are based on the concept of multiple processes communicating by exchanging messages. In these systems, the message passing concept is enforced uniformly on every communication among processes for the reason of semantic purity, even if the interacting processes are known to exist in the same physical node. The overhead of adopting this concept for inter-process communication is high, especially if the machine architecture upon which the system is built is better suited to procedure calling [LAU78].

Furthermore, inter-node message communication is unreliable due to the nondeterministic behavior of the communication subnetwork and remote nodes. Therefore, the network protocols which provide reliable message exchanges require the overhead of additional messages.

Although mechanisms have been built into the process interaction model of MIKE to reduce message traffic, we still need to have minimum-overhead messages in order to afford the use of the message passing mechanism in MIKE.

6.1.2 Minimum-Overhead Processes

"Minimum-overhead processes" means minimum overhead in multiplexing processors among active processes, and in switching protection domains when a process enters/exits a protected procedure. The need to have fast context swaps during process switching has been understood and is the focus of every operating system design [SIT80].

The need to have efficient domain switching is due to the use of small protection domains in MIKE. We recall that error confinement is the emphasis of our protection model, and a small protection domain is the most effective way to realize error confinement. However, the use of the small protection domain concept necessitates frequent domain switching. Domain switching occurs when a process enters/exits a protected procedure. At that time, certain objects are made accessible (or inaccessible) by altering the capability-list to which the process may refer.

It has often been assumed that a protected procedure call must be extremely rapid if it is not to degrade overall system performance. Systems built without specialized hardware for domain management must use software to implement domain changes, which will be very expensive. In MIKE, we need to have minimum-overhead processes in order to afford frequent domain and process switching.

This chapter examines those expensive software functions associated with the management of messages and processes which can be effectively supported by hardware/firmware means. For MIKE in LIU, we integrate various hardware/firmware mechanisms, which are developed under different contexts, to keep time and space overheads low. An attempt will be made to restrict these proposals to features which can be implemented in current technology.

6.2 Inter-Task Communication

This section is concerned with expediting inter-guardian message communication. The strategies we use in MIKE are as follows:

1. reducing the number of messages generated,
2. expediting the process of message mapping, and
3. optimizing message flow with hardware assistance.

6.2.1 Message Generation

In Section 4.4, we presented the two-level process interaction model of MIKE. The model dictates that inter-task communication is through message exchange and intra-task communication is through either procedure invocation or message exchange. This discipline considerably improves system performance without impairing the conceptual uniformity of MIKE.

The rationale for using message passing as an underlying semantic concept for inter-task communication has been presented in Chapter 4. Besides the reason that there is no shared memory to synchronize the inter-node concurrent activities, the message passing mechanism elegantly models the nondeterministic nature of communication links and the autonomous characteristics of each individual task. The autonomous behavior of tasks can be exemplified by using our scenario for system-transparent resource access (see Section 4.6). In the scenario, after the VRST broadcasts the resource request message onto the DDLCN, it is not necessary for every VUST in the DDLCN to respond to the request should it choose not to. The semantics of message passing allows this kind of response to the request to occur, whereas procedure invocation, which always associates a procedure return with each procedure call, cannot be adopted for inter-task communication.

As for the intra-task communication, both message passing and procedure calling are allowed. It is this level of the process interaction model which reduces the message traffic. Components associated with a particular task coexist in the same physical node, as mandated by our NOS model. The fact that they have shared memory and/or are in the same address space can be used to coordinate the interaction. Asynchronously cooperating processes can still communicate by exchanging messages, which will be monitored by their guardians. While within the same address space, the procedure invocation facility can take a process rapidly from one context to another without exchanging messages. This process interaction discipline will result in fewer messages being generated, which in turn will expedite message flow.

6.2.2 Message Mapping

Part of the overhead involved in using the message passing mechanism is incurred during message delivery. Specifically, the overhead is associated with transforming human-oriented names into machine-oriented names and with locating their physical addresses. The naming system of MIKE allows only the guardians' names to be known by other tasks. That is, all messages exchanged are initiated by and terminated on guardians. Therefore, Messengers only maintain mailboxes for guardians. Processes other than guardians do not have their own mailboxes in Messengers; hence they cannot directly communicate with other tasks.

The rationale for using this naming system is as follows:

1. Guardians are responsible for safeguarding the integrity of their respective tasks; therefore, a guardian should have control over the activities occurring inside its task. That is, inter-task message communication has to be monitored and regulated by the destination guardian such that the internal scheduling and resource management policies can be observed.

2. Since processes have a relatively transient nature, it would be less likely that one process knows the name of another process and interacts with it directly.

3. This naming system significantly decreases the overhead in the mapping process, since the number of entries (names) to be searched is reduced.

This unique naming discipline of MIKE contributes to the availability of minimum-overhead messages for inter-task communication.

6.2.3 Multiprocessor Configuration

Support of message handling software by the hardware is one of the MIKE design goals which will undoubtedly provide minimum-overhead messages. We will exploit the hierarchical framework of MIKE by using a specifically configured system organization to carry a higher volume of message traffic and to increase overall system performance.

The MIKE protocol hierarchy consists of three layers. Peer guardians in each layer think they communicate with one another horizontally. In reality, messages sent by guardians for inter-task communication are always trapped by the Messenger at that layer. The Messenger will deposit the message into a locally maintained mailbox if the communicating guardians coexist in the same physical node. Otherwise, the messages will be routed to the Messenger at the layer immediately below.

Based on this observation, we can see that the Messenger at each layer is a potential bottleneck for the message flow. Therefore, a hardware processor can be dedicated to the Messenger at each layer to speed up the message mapping process. An intelligent mailbox memory [FOR75], which provides hardware features for efficiently coordinating the mailbox access, can further reduce the overhead of the message passing mechanism. Moreover, at least one hardware processor can be dedicated to each layer of the MIKE protocol hierarchy to execute its designated functions. With this multiprocessor organization, LIU can improve the overall performance considerably.

We show in Figure 17 this specially configured multiprocessor organization. This integrated software and hardware approach to expediting the message flow can handle a higher volume of message traffic, both intra-node and inter-node, and provide minimum-overhead messages for inter-task communication.

Figure 17.

A Multiprocessor Organization for LIU
(a dedicated hardware processor for each of the virtual machine, system support, and IPC layers and for their Messengers, with arrows showing the directions of message flow)

6.3 Reconfigurable LIU Architecture

Most "distributed systems" increase the system performance and mitigate the complexity of local operating systems by offloading protocol interpreters and NOS primitives into a network front-end processor (FEP) (2) [HEA70, MAN76, LIU81, TSA80a, TSA80b]. The upsurge in demand for network FEPs can be traced back to the day when teleprocessing just became economically and technically feasible [NEW72]. Due to recent advances in LSI technology, hardware and software means have been used to increase the intelligence and reliability of FEPs for computer networks [PIE77]. A good example is the Pluribus system of ARPAnet, which uses a tightly-coupled, but flexibly configurable, multiprocessor architecture to provide its modularity, reliability, and functionality [HEA73, KAT78, ORN75].

(2) A FEP is functionally equivalent to our LIU in the DDLCN.

In this section, we will present a hardware architecture for LIU which will not only improve the overall throughput, but will also adapt dynamically to the variation of system workload. The architecture is based on two architectural concepts: the Sliced Computer Module (SCM) and bit-sliced processing [LIU81, TSA80a, TSA80b].

6.3.1 Sliced Computer Module and Bit-Sliced Processing

A computer system can be viewed as an ensemble which comprises a number of identical sections (slices). Each slice, called the sliced computer module (SCM), is a self-contained CPU with a sliced random access memory. The SCM, which is a computer in its own right, is conceived as a basic and sole building block such that a variety of architectural configurations can be assembled from it. A powerful computer can be formed by assembling in parallel a number of SCMs to obtain the desired word length. Since SCMs are identical and can perform different tasks under microprogram control, they can take any position in the configuration by properly executing an appropriate microprogram and, as a whole, can process data coherently.

Furthermore, a bit-sliced processing technique is devised to coordinate the processing of individual SCMs in the ensemble. By adopting this bit-sliced processing technique, the integrated ensemble can operate at its full speed to process information of a pre-specified word size in parallel if there is a large enough number of SCMs available. Otherwise, it can still operate in a degraded mode to process the pre-specified word size of information once the failed SCMs are discarded without replacement [LIU81, TSA80a, TSA80b] (3).

(3) For example, by using the bit-sliced processing technique, a 4-bit SCM can still process 8-bit information coherently by alternately processing the most and least significant 4 bits.
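The idea in footnote (3) can be made concrete with a small Python sketch: an 8-bit addition carried out on a 4-bit slice by processing the low nibble and then the high nibble, with the carry passed between the two passes. The function names are illustrative, and the sketch models only the arithmetic, not the actual SCM microprogram.

    def add_4bit(a, b, carry_in=0):
        # One 4-bit slice: add two nibbles, producing a 4-bit sum and a carry out.
        total = a + b + carry_in
        return total & 0xF, total >> 4

    def add_8bit_with_4bit_slice(x, y):
        # Process 8-bit operands on a 4-bit slice: low nibble first, then high.
        low, carry = add_4bit(x & 0xF, y & 0xF)
        high, carry = add_4bit(x >> 4, y >> 4, carry)
        return (carry << 8) | (high << 4) | low

    assert add_8bit_with_4bit_slice(0x9C, 0x77) == 0x9C + 0x77
    print(hex(add_8bit_with_4bit_slice(0x9C, 0x77)))   # 0x113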

6.3.2 System Architecture of LIU

Depicted in Figure 18 is one of the possible architectural configurations for the LIU of DDLCN based on the two architectural concepts presented. From the hardware point of view, the machine architecture is homogeneous in the sense that it consists of an adjustable number of SCMs (see Figure 18a). Logically, LIU consists of six dedicated processors (4): the virtual machine processor, the system support processor, the IPC processor, and three Messenger processors (5) (see Figure 18b). It also has an optional spare pool of SCMs, a system main memory, a system software read-only memory, and a synchronization module.

The system is flexible because it can assume many architectural configurations. Each of these configurations is characterized by the number of processors in the system and the number of SCMs in each individual processor. Increasing the number of dedicated processors in the system will enable the system to execute more concurrent tasks, whereas increasing the number of SCMs in a processor will enable that particular processor to process more bits in parallel. The optimal system configuration is determined by the trade-off between cost and performance and is also determined by the functions and the size of the particular attached host system in the DDLCN. The profusion of these options provides a flexible foundation for the cost-effective evolution of LIU, since it can be easily tailored to meet the demand of its workload.

Furthermore, the architectural configuration can be dynamically changed since the system is firmware reconfigurable. The degradation of the system performance can be used to trigger the system into a reconfiguration phase, whereby the SCMs in the system can be redistributed according to the system workload and the new configuration can better match the processing need and increase the throughput (6). Details of the SCM design, the bit-sliced processing technique, and the reconfigurable system operation can be found in [TSA80a, TSA80b].

(4) In this section, the word "processor" is used to indicate a processing system which consists of either a uniprocessor or a multiple-processor organization.

(5) This is a full-fledged configuration for LIU. Other less powerful architectural configurations can also be used for LIU; e.g., we can use only one processor to execute the functions of the IPC layer and the system support layer.

(6) For example, if the processing load is heavy in the IPC layer processor, then SCMs which belong to other lightly-loaded processors can be re-assigned to process the IPC functions. This resource redistribution is at the expense of the other processors, since they may then run at half of their normal speed because some of their SCMs have been re-assigned to the IPC layer processor.

Figure 18.

General System Architecture
((a) from the physical point of view: an ensemble of SCMs with a synchronization module and system memories; (b) from the logical point of view: the layer and Messenger processors, a spare pool, a synchronization module, and system memories)

6.4 Software-Directed Architecture

We have demonstrated in the previous sections that minimum-overhead messages can be provided by using the robust mechanisms built into the conceptual model and the dedicated hardware configuration for LIU. In this section, we will examine the hardware/firmware assistance needed to support the mechanisms used for domain-based protection.

6.4.1 Notion of Processes

A "process" is an abstract entity that demands and releases various resources as it carries out a computation. The abstraction of a process permits more efficient management of the physical processor(s) and also indirectly contributes to the ease of management of all other resource entities. The users also benefit from the process abstraction. With it, they can establish sets of cooperating concurrent processes which not only take maximum advantage of the system's parallelism, but also result in a clear formulation of the problem to be solved.

Although procedure abstraction is familiar to conventional architecture, the notion of processes and data abstraction are not fully supported by the hardware. We will examine the hardware/firmware mechanisms needed to support the notion of processes, especially concentrating on the following areas:

1. Efficient dispatching which guarantees finite progress of all progressable processes,

2. Facilities which synchronize process interaction, and

3. Hardware components which facilitate domain switching.

6.4.1.1 Dispatching

The Scheduler, which deals with medium-term scheduling [BRI73], is a service task at the virtual machine layer. The Scheduler should guarantee finite progress for all active processes. Since each node can have several coexisting Schedulers (due to the extensibility of MIKE), a certain discipline will have to be incorporated in its task template to ensure the progress of active processes.

The Dispatcher, which deals with short-term scheduling [BRI73], will multiplex the physical processor among active processes. Hardware/firmware mechanisms have to be employed in order to speed up the dispatching process. There exist a variety of schemes for implementing the dispatcher, e.g., sorting networks [THU74], associative memory [BER70], and a microcoded dispatcher. A detailed investigation of these schemes is beyond the scope of this dissertation. We merely note that they can provide efficient process dispatching and are valuable in an environment which encourages process forking.

6.4.1.2 Process Synchronization

Scheduling and synchronization are closely related. If a process performs a P(semaphore) operation [DIJ68] and blocks itself, the Dispatcher has to select an active process to run on the processor. Mechanisms for coordinating asynchronous activities are required by users and system alike. The major means for the exchange of information among processes and the coordination of concurrent operations are synchronization mechanisms [WET80].

In MIKE, the system in principle supports arbitrarily small objects; hence, it also supports fine-grained synchronization. By providing fine-grained synchronization, a large number of objects can be independently locked; therefore, the probability of contention for any one of them will be decreased. However, the price paid for this fine-grained synchronization is more frequent execution of the synchronization primitives. Thus, we need a set of fast synchronization mechanisms to provide a reasonable throughput.

Furthermore, multiple levels of synchronization should be provided in MIKE to suit a variety of needs. Low-level mechanisms can be used by the Dispatcher for fast, but short, processor synchronization, whereas high-level mechanisms can be used by MIKE for process synchronization. Among these synchronization mechanisms, high-level ones are more expensive than their lower-level counterparts in execution time, but less expensive in terms of the resources tied up by a blocked process.

The classic hardware feature to support process synchronization is in rudimentary form. In some systems, a hardware instruction like "test and set" is used to synchronize concurrent activities. In other systems, the only way is through software means while the interrupt is disabled. Although it is adequate to accomplish any process synchronization by using these primitive mechanisms, the overhead is apparently intolerable in our environment. We need special hardware to realize the otherwise time-consuming synchronization primitives such as semaphores, monitors, conditional critical regions [BRI73, BRI77, BRI78], path expressions [CAM74], etc.
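For example, the classic spinlock built on a test-and-set instruction can be sketched as follows. Since Python has no such instruction, the atomicity of test_and_set is emulated here with a small internal lock, so the sketch shows the logic rather than the hardware mechanism; the class and variable names are invented.

    import threading

    class TestAndSetLock:
        # A spinlock built on a test-and-set primitive; the hardware atomicity is
        # emulated with an internal lock, since Python lacks such an instruction.
        def __init__(self):
            self._flag = False
            self._atomic = threading.Lock()
        def test_and_set(self):
            with self._atomic:             # models the indivisible hardware operation
                old, self._flag = self._flag, True
                return old
        def acquire(self):
            while self.test_and_set():     # spin until the old value was 'clear'
                pass
        def release(self):
            self._flag = False

    counter, lock = 0, TestAndSetLock()
    def worker():
        global counter
        for _ in range(1000):
            lock.acquire()
            counter += 1
            lock.release()
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)     # 4000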

To provide an architecture that directly supports MIKE, we need to realize some synchronization primitives in hardware and build other higher-level mechanisms on top of these primitives. Various hardware/firmware schemes have been proposed, such as the semaphores using associative hardware in [SNE79] and the monitor construct in [BRI78]. With these hardware/firmware assisted synchronization mechanisms, the LIU architecture can better support the advanced concepts in MIKE and increase overall system performance.

6.4.1.3 Domain Switching

We recall from Section 4.2.2.2 that a process contains execution information and a stack of objects with the MIKE-defined type called PROTECTION/ACCESS DOMAIN, or PAD for short. A PAD object defines the instantaneous protection/access domain of a process. In order to enforce domain-based protection, it is necessary to change the set of objects accessible to a process during procedure activation and termination. This amounts to creating a new PAD object and pushing it on top of the process stack. This frequent domain switching is one of the characteristics of the MIKE protection model, and some form of appropriate hardware support is necessary in order to make the model feasible.
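The effect of this discipline can be modelled with a short Python sketch in which entering a protected procedure pushes a new PAD object and returning pops it, so that only the capabilities of the top PAD object are accessible at any instant. The class names and capability strings are invented for this illustration.

    class PADObject:
        # A protection/access domain: the set of objects (capabilities) a process
        # may touch while executing one protected procedure (illustrative model).
        def __init__(self, procedure, capabilities):
            self.procedure, self.capabilities = procedure, set(capabilities)

    class Process:
        def __init__(self):
            self.stack = []                              # stack of PAD objects
        def enter(self, procedure, capabilities):
            self.stack.append(PADObject(procedure, capabilities))   # domain switch on call
        def exit(self):
            self.stack.pop()                             # domain switch on return
        def can_access(self, obj):
            return bool(self.stack) and obj in self.stack[-1].capabilities

    p = Process()
    p.enter("VUST.request", {"mailbox", "request-record"})
    p.enter("FILE.read", {"PAS.source"})                 # smaller domain inside the FILE type task
    print(p.can_access("mailbox"), p.can_access("PAS.source"))   # False True
    p.exit()
    print(p.can_access("mailbox"))                       # True again after returning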

6.4.1.3.1 Stack Architecture

Two types of computer architecture might be suitable for LIU: memory-to-memory (e.g., Texas Instruments TI9900) and stack (e.g., Hewlett-Packard HP3000, Burroughs B6700) [BLA77, BUL77, MCK80]. A register-oriented architecture (e.g., IBM System/370) is definitely not suitable for LIU, since the entire contents of the registers must be saved during domain switching, which amounts to a substantial overhead. Although a memory-to-memory architecture exhibits a compact machine state for domain switching, it lacks the favorable characteristics a stack architecture possesses for domain-based protection, namely support for the allocation of procedure activation records and locality of reference [GEH79a].

A stack architecture provides natural support for the allocation of procedure activation records (PAD objects) and also for the domain-switch mechanism [BLA77, BUL77, GEH79a, MCK80, SIT80]. By using hardware stack processors in LIU, a set of procedure activation records can be kept on the process stack (7), which represents a restricted addressing environment of the process. The instantaneous protection/access domain of the process is defined by the top PAD object in the stack.

The stack is actually contained in ordinary main memory, which is a large array accessed via subscripts (i.e., absolute addresses). In a simple stack architecture, this stack must be accessed each time an element is pushed onto or popped from the stack. An improvement can be made if the top elements of the stack are held in the register file of the stack processor (called the stack file). This speeds up computation considerably, since these are the most frequently referenced elements. Figure 19 illustrates this stack architecture.

(7) This stack is supported by hardware/firmware and is not simulated by software.

In Figure 19, TSM (top of stack in memory) is a CPU register pointing to the top element of the stack in main memory. Note that TSM does not necessarily point to the actual top of the stack, as the top elements of the stack may be in the stack file. Another CPU register, called TOP, points to the top element of the stack file in the stack processor. Actually, TOP contains the number of top-of-stack registers in the stack file that hold data valid for the current process.

Each time an element is pushed onto the stack, TOP is incremented by one. If the stack file is full and a push is executed, the bottom element in the stack file is pushed onto the main-memory portion of the stack to make room for the newly pushed data, and TSM is incremented by one. On the other hand, each time an element is popped from the stack, TOP is decremented by one. If the stack file is empty, an element is pulled from main memory and put into the stack file, and TSM is decremented by one.

Figure 19. An Extended Stack Architecture
(The stack file in the stack processor holds the top elements of the stack, including the top and second-to-top PAD objects; the remainder of the stack resides in main memory. Legend: TOP = top of stack in stack file; TSM = top of stack in main memory.)

These pushes and pulls between the stack file and main memory are managed automatically by the hardware/firmware, so that they are normally transparent to the programmer.
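
The following C sketch models the push/pull behavior just described, for illustration only; in the extended stack architecture these transfers are carried out by hardware/firmware, not by software. The stack-file size, the word type, and the use of counters named after the TOP and TSM registers are assumptions of the sketch.

/* Behavioral sketch of the extended stack of Figure 19 (software stand-in). */
#include <assert.h>

#define FILE_SIZE 16          /* registers in the stack file (assumption)      */
#define MEM_SIZE  1024        /* words reserved for the stack in main memory   */

typedef int word;

static word stack_file[FILE_SIZE];
static word memory[MEM_SIZE];
static int TOP = 0;           /* number of valid top-of-stack registers        */
static int TSM = 0;           /* number of stack words held in main memory     */

void push(word w)
{
    if (TOP == FILE_SIZE) {               /* stack file full: spill bottom register */
        assert(TSM < MEM_SIZE);
        memory[TSM++] = stack_file[0];
        for (int i = 1; i < FILE_SIZE; i++)
            stack_file[i - 1] = stack_file[i];
        TOP--;
    }
    stack_file[TOP++] = w;
}

word pop(void)
{
    if (TOP == 0) {                       /* stack file empty: refill from memory */
        assert(TSM > 0);
        stack_file[TOP++] = memory[--TSM];
    }
    return stack_file[--TOP];
}

int main(void)
{
    for (word w = 0; w < 20; w++) push(w);   /* forces a spill into main memory   */
    while (TOP > 0 || TSM > 0) pop();        /* drains the stack file and memory  */
    return 0;
}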

As shown in Figure 19, the stack file can contain more than one PAD object. Therefore, the contents of the stack file need not all belong to one protection/access domain, and the "running" process can access only those registers in the stack file that belong to the top PAD object. This is an essential difference from a register-oriented architecture, where all registers in the register file are accessible to the running process and the contents of the register file have to be saved during domain switching, which incurs a substantial overhead [GEH79a].

While a stack processor does not have a "general-purpose register file," it does have certain special-purpose registers, such as the program counter and other protection/access-domain-related registers. During process switching, these special-purpose registers can be saved at the top of the process stack as part of the execution information of the process.

Additional hardware support can be provided to smooth the transfer between the stack file and the stack in main memory. For example, some form of lookahead can be used to transfer more than one word of data at a time, reducing the need for memory references during domain switching [BLA77, BUL77]. The optimal number of registers in the stack file is closely related to the instruction set. A detailed investigation of a stack-based instruction set and its relationship with the optimal number of registers in the stack file is beyond the scope of this dissertation and will not be discussed further.

6.4.2 Capability Mechanisms

Capability-based addressing is a uniform mechanism oriented toward both the protection and the sharing of information and resources. As explained before, it contributes significantly to reliable and robust system operation. The use of this capability mechanism incurs some overhead, as can be expected, chiefly the time taken to validate access rights and the storage space required for capabilities.

6.4.2.1 Authority Checks

A capability is generally considered to be a data type containing a protected name of an object and a definition of the access rights its holder possesses for that object. The only way to make a reference to an object is by possessing a capability for that object. The possession of a capability is then the sole determinant of access rights; hence, authority checks must be performed every time an instruction attempts to operate on an object. The authority checks involve fetching a capability and verifying its type, its validity, the authority of the process to operate on the object, and the existence of the named object.
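
The sketch below is a simplified, hypothetical rendering of such an authority check. The field names, the rights encoding, and the linear search that stands in for associative name-translation hardware are assumptions; in MIKE the check would be carried out by firmware on every object reference.

/* Hypothetical authority check: type, rights, and existence checks plus
 * object-name translation.  All names and encodings are assumptions. */
#include <stdbool.h>
#include <stddef.h>

enum { RIGHT_READ = 1, RIGHT_WRITE = 2, RIGHT_EXECUTE = 4 };

struct capability {
    bool     is_capability;   /* distinguishes a capability from ordinary data */
    unsigned object_name;     /* protected (system-wide unique) object name    */
    unsigned rights;          /* access rights granted to the holder           */
};

struct object {
    bool     exists;
    unsigned name;
    /* ... representation ... */
};

/* Return the object if every check passes, NULL otherwise. */
struct object *authority_check(const struct capability *c,
                               unsigned requested_rights,
                               struct object table[], size_t n)
{
    if (!c->is_capability)                        /* type check */
        return NULL;
    if ((c->rights & requested_rights) != requested_rights)
        return NULL;                              /* insufficient authority */
    for (size_t i = 0; i < n; i++)                /* existence check + name translation */
        if (table[i].exists && table[i].name == c->object_name)
            return &table[i];
    return NULL;                                  /* named object does not exist */
}

int main(void)
{
    struct object table[] = { { true, 7 } };
    struct capability c = { true, 7, RIGHT_READ | RIGHT_WRITE };
    return authority_check(&c, RIGHT_READ, table, 1) ? 0 : 1;
}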

Because of a desire to use existing technology and because of the changing nature of the architecture, the authority checks in MIKE should be embodied largely in firmware, which can then offer a faster and safer authentication process. A fast mechanism such as an associative storage array should also be used for the translation of object names into physical storage locations.

Furthermore, other fundamental system operations, such as domain switching and capability manipulation, should be microcoded to safeguard system integrity and increase system performance [MYE80].

6.4.2.2 Typed Memory

Capability-based architectures have traditionally used one of two kinds of memory organization to enforce the distinction between capabilities and data and to safeguard the integrity of capability mechanisms: tagged memory and partitioned memory [BUL77, DEN80, GEH79a, GEH79b, HAY78, KUC78]. All segments in a partitioned memory organization are divided into two classes: capability segments and data segments. Although this scheme is the most widely used implementation of capability mechanisms (e.g., CAP and HYDRA), it has some severe problems. For example, it requires many small segments and enforces an unnatural treatment of data structures, as noted in [GEH79a, GEH79b].

In a tagged memory, each word in computer storage is associated with a small tag, interpreted by the hardware, that distinguishes capabilities from data [DEN80, DOR75, FAB74, ILI68, JAG76]. A tagged architecture has many advantages: it promotes small object code, it removes the need to segregate capabilities, and it facilitates hardware implementation of commonly used data structures [DEN80, GEH79a].

In LIU, a more general version of tagged memory, called typed memory, will be used; it is adopted from [GEH79a, GEH79b]. In a system with typed memory, each elementary object is associated with a type indicator. Objects can be aggregated into sets of two kinds: vectors and records. Figure 20 illustrates:

1. a single elementary object (e.g., an integer),

2. a vector, which is a set of homogeneous objects (e.g., an array), and

3. a record, which is a set of heterogeneous objects (e.g., a PAD object).

In Figure 20b, the component objects of a vector are of the same type, and a single type indicator gives the type of all objects in the vector. In Figure 20c, each component object of a record has its own type indicator, because the components may be of mixed types.
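
To make the three entity kinds of Figure 20 concrete, the following declarations sketch one possible data layout for them. The tag values and field names are assumptions used only for illustration; they are not the LIU formats.

/* Illustrative layouts for the entities of Figure 20 (assumed, not LIU's). */
#include <stddef.h>

enum type_tag { T_INTEGER, T_CAPABILITY, T_VECTOR, T_RECORD };

/* (a) Elementary object: one type indicator plus the representation. */
struct elementary {
    enum type_tag type;            /* e.g., T_INTEGER */
    long          value;           /* e.g., 153821    */
};

/* (b) Vector: homogeneous, so a single indicator covers every element. */
struct vector {
    enum type_tag type;            /* T_VECTOR */
    size_t        n_elements;
    enum type_tag element_type;
    long         *elements;        /* n_elements representations of element_type */
};

/* (c) Record: heterogeneous, so each component carries its own indicator. */
struct record {
    enum type_tag      type;       /* T_RECORD */
    size_t             n_elements;
    struct elementary *elements;   /* each with its own type indicator */
};

int main(void)
{
    struct elementary i = { T_INTEGER, 153821 };   /* the object of Figure 20a */
    return i.type == T_INTEGER ? 0 : 1;
}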

Typed memory, which can be realized using off-the-shelf semiconductor memory, incurs less overhead than tagging each addressable unit of memory. Further, it provides a type indicator for each object to support dynamic checking.

Figure 20. Entity Representations in a Typed Memory
(a) An elementary object: a type indicator (e.g., "integer") and its representation (e.g., "153821").
(b) A vector object: a type indicator ("vector"), the number of elements, the type of the elements, and the representations of the elements.
(c) A record object: a type indicator ("record"), the number of elements, and a type indicator and representation for each element.

Entity Representations in a Typed Memory 216 than the mechanisms used by the other two memory organizations, can not only distinguish capability from data, but can also indicate any object type, i.e., either the hardware-defined type or MIKE/user-defined extended type. Additional hardware buffer registers can be used to speed up the microcoded run-time checking. The use of typed memory coupled with other software schemes, such as the variable-size capabilities [GEH79a, GEH79a], can support very small objects and, therefore, can achieve flexible and efficient protection in MIKE.

In summary, this chapter has proposed several hardware/firmware mechanisms to support the NOS model and safeguard the protection system of MIKE. All of these hardware/firmware mechanisms can be constructed using off-the-shelf components. The cost of dealing with the complexity of MIKE may be lower if it is paid once by the system designers in cheap hardware than if it is paid by each of many users in expensive software. By incorporating these hardware/firmware mechanisms, the resulting LIU architecture can narrow the gap between the abstractions called for by MIKE and the capabilities directly realized by conventional hardware, thereby reducing the overhead associated with the use of advanced software design methodology.

CHAPTER 7

SUMMARY AND DIRECTIONS FOR FUTURE RESEARCH

We have presented in the previous chapters the NOS model, the protocol structure, and the underlying architecture of the Multicomputer Integrator Kernel (MIKE) for the Distributed Double-Loop Computer Network (DDLCN). The DDLCN was envisioned as a local-area, distributed-control computer network. Research concerning the DDLCN was directed towards geographically local communities of autonomous computer users who occasionally have need of resources or computing services which are present elsewhere in the system, yet which are not available locally. Such an environment is typical of that frequently found today in many commercial, industrial, university, and research settings. Bringing the resource sharing, cost advantages, and performance improvements of distributed computer systems to such groups would be a very significant achievement and is, therefore, the major design goal of MIKE.


This study of MIKE was a major undertaking, since it is to be superimposed on top of a collection of existing operating systems. Furthermore, the design of MIKE should not be just a pure research project, but should lead to a feasibly implementable product. The design goal for MIKE therefore emphasizes the following areas:

1. To provide reliable and structured network system software that is easy to implement and maintain,

2. To provide system-transparent resource sharing for the users while maintaining cooperative autonomy among local computer systems,

3. To minimize the surgery on local operating systems, and

4. To use existing technology so that it can be easily implemented.

7.1 Summary of MIKE's Significant Features

Overall, the main contribution of this research is the NOS model we devised, such that the network operating system (i.e., MIKE) in LIU and the local operating systems in the host computers can be fitted consistently into this conceptual model. Most researchers are reluctant to design this type of distributed system, which is based on heterogeneous local operating systems; instead, they throw away the existing operating systems and start all over again with a single homogeneous "distributed" operating system. The most important reason for this is that the existing heterogeneous operating systems usually cause imperfections or exceptions in their otherwise coherent conceptual models. In our DDLCN, the conceptual model embraces both MIKE and the local operating systems coherently, in such a way that the system operation of DDLCN can be described consistently in terms of the components and protocols of the model.

The MIKE structure is based consistently on the object model. Any non-physical resource is typed and is so indicated in its memory representation. Each object can only be manipulated in terms of well-defined functions or operations, according to the principle of data abstraction.

A novel "task" concept is used to further group resource entities. It is this entity grouping that allows the system

operation to be modeled in an elegant way. Each task is

safeguarded by one and only one guardian. Processes running

c" in a given task only deal with their superior guardian.

This is exactly the situation in existing operating systems, PLEASE NOTE:

This page not included with original material. Filmed as received.

University Microfilms International 221

conforms with the least privilege principle by allowing minimum access capabilities for processes; consequently it limits the propagation of errors, both software and hardware. Other measures such as resource residual control and action validation have also been incorporated to safeguard the integrity of the system operation.

The protocol, which is based on a layered design, provides abstraction of commonality to support the NOS services. The protocol hierarchy, described from the bottom up, consists of three layers: the IPC layer, the system support layer, and the virtual machine layer.

A uniform treatment of user and operating-system processes is adopted in MIKE at the virtual machine layer to provide an extensible and configurable environment. The only distinction between them is largely a matter of privilege. This leads to a very flexible and cleanly structured end product containing no artificial boundaries to complicate the design. To the tasks in the virtual machine layer, the system support layer and the IPC layer form a virtual communication machine, such that those tasks think they communicate with one another through a well-defined communication link.

For MIKE in LIU, we integrate various hardware/firmware mechanisms to reduce time and space overhead arising from the use of advanced software design methodology. The resulting LIU architecture narrows the gap between the abstractions called for by MIKE and the features directly realized by conventional hardware.

These features can be used to make MIKE a very reliable and efficient network operating system for the DDLCN. It can provide system-transparent resource sharing for the users while allowing individual guardians to guard their respective resources and to respond to requests as they see fit. Furthermore, the users retain the original local operating system with which they are familiar and see the DDLCN as a single integrated computer system controlled by their local operating systems. Thus, it seems that the design of MIKE is rather successful in accomplishing its original goals.

7.2 Areas of Future Research

The task concept and related inter-task interaction protocols are quite general and can be applied to other distributed systems to facilitate their system software design. However, we anticipate that the most fruitful extension of this work will be an elaboration of the design it presents, since the implementation of MIKE and the operation of DDLCN can be used as a testbed for a variety of other application-oriented research projects.

As stated in the introduction, this dissertation research is concerned with the conceptual design of the MIKE framework, which includes the NOS model and the protocol structure. There are many additional areas of interest that it is not possible to investigate at this time. The following topics should yield some additional refinement of the MIKE design.

Conceptual Model. Many areas are still left uncovered in the conceptual model of MIKE, especially the specification of constructs and interaction protocols. For example, the syntactical constructs for active entities (both process and guardian) and for objects have to be precisely specified. The format of capabilities and messages should also be defined. The luxury of uniformity and flexibility should always be balanced against performance considerations when specifying these constructs.

System Configuration. The many important tasks needed in each layer of the protocol hierarchy have to be identified. Task templates should be defined for every task in the virtual machine layer so that system integrity can be observed. Based on the good features of the MIKE object model (e.g., the natural support for the concept of an atomic transaction), other application software, such as a fully or partially duplicated distributed database, should also be designed in order to increase the DDLCN functionality.

LIU Architecture. The essential hardware and firmware mechanisms needed to support the advanced software design methodology should be pinpointed. Exact formats and schemes should be carefully evaluated and analyzed against system performance and the sophistication of individual hosts, so that the incorporation of these mechanisms is warranted.

These areas of future research should provide sufficient direction to follow, and when all of them are completed, MIKE will provide a computer network with a high degree of cohesiveness, transparency, and autonomy for the users.

BIBLIOGRAPHY

[ADV79] Advanced Micro Devices, Inc., The AM2900 Family Data Book, Sunnyvale, California, December 1979.

[AKK74] Akkoyunlu, E., et al., "Interprocess Communication Facilities for Network Operating Systems," IEEE Computer, pp. 46-55, June 1974.

[AVI77] Avizienis, A., "Fault-Tolerant Computing - Progress, Problems and Prospects," IFIP Congress Proceedings, pp. 405-418, 1977.

[BAS77] Baskett, F. H., et al., "Task Communication in Demos," Proceedings of the 6th Symposium on Operating Systems Principles, pp. 23-31, November 1977.

[BER70] Berg, R. O. and Johnson, M. D., "An Associative Memory for Executive Control Functions in an Advanced Avionics Computer System," Proceedings of the 1970 IEEE International Computer Group Conference, pp. 336-342, 1970.

[BER80] Berstis, V., "Security and Protection of Data in the IBM System/38," Proceedings of the 7th Annual Symposium on Computer Architecture, pp. 245-252, May 1980.

[BHA79] Bhandarkar, D. P., "The Impact of Semiconductor Technology on Computer Systems," IEEE Computer, Vol. 12, No. 9, pp. 92-98, September 1979.

[BLA77] Blake, R. P., "Exploring a Stack Architecture," IEEE Computer, Vol. 10, No. 5, pp. 30-39, May 1977.

[BOC79] Bochmann, G. V., Architecture of Distributed Computer Systems, Lecture Notes in Computer Science, Vol. 77, Springer-Verlag, New York, 1979.


[BRI73] Brinch Hansen, P., Operating System Principles, Prentice-Hall, Englewood Cliffs, N.J., 1973.

[BRI77] Brinch Hansen, P., The Architecture of Concurrent Programs, Prentice-Hall, Englewood Cliffs, N.J., 1977.

[BRI78] Brinch Hansen, P., "Multiprocessor Architecture for Concurrent Programs," ACM SIGARCH Computer Architecture News, Vol. 7, No. 4, pp. 4-23, December 1978.

[BRY79] Bryant, R. E. and Dennis, J. B., "Concurrent Programming," in Research Directions in Software Technology, (Wegner, P., editor), pp. 584-610, The MIT Press, Cambridge, Mass., 1979.

[BRY80] Bryant, R. M. and Finkel, R. A., "A Stable Distributed Scheduling Algorithm," Technical Report, Computer Science Department, University of Wisconsin, Madison, Wisconsin, September 1980.

[BUL77] Bulman, D. M., "Stack Computers: An Introduction," IEEE Computer, Vol. 10, No. 5, pp. 18-28, May 1977.

[CAM74] Campbell, R. H. and Habermann, A. N., "The Specification of Process Synchronization by Path Expressions," Lecture Notes in Computer Science, Vol. 16, pp. 89-102, Springer-Verlag, New York, 1974.

[CAR79] Carter, W. C., "Hardware Fault Tolerance," in Computing Systems Reliability, (Anderson, T. and Randell, B., editors), pp. 211-263, Cambridge University Press, London, 1979.

[CHA80] Champine, G. A., Distributed Computer Systems, North-Holland Publishing Company, New York, 1980.

[CHO79] Chou, C. P., Liu, M. T., and Pardo, R., "Distributed Data Base Design for a Local Computer Network (DDLCN)," Proceedings of the First International Symposium on Policy Analysis and Information Systems, pp. 42-49, June 1979.

[CHO&Oa] Chou, C. P. and Liu, M. T., "A Concurrency Control Mechanism and Crash Recovery for a Distributed Database System (DLDBS)," in Pis tributed Data Bases, (Delobel, C. and Litwin, W editors), pp. 201-214, North-Ilolland , New 227

York, March 1980.

[CH08 Ob] Chou, C. P. and Liu, M. T., "A Concurrency Control Mechanism for a Partially Duplicated Distributed Database System," Proceedings of 19 80 Compute r Ne tworking Symposium, pp. 26-34 , De ceraber 19 80.

[CH081] Chou, C. P., "Design of the Distributed Loop Data Base System (DLDBS)," Ph.D. Dissertation, Department of Computer and Information Science, The Ohio State University, Columbus, Ohio, June 1981 .

[CLA80] Clark, D. D. and Svobodova, L., "Design of Distributed Systems Supporting Local Autonomy," Proceedings of COMPCON'8 0 Spring, pp. 438-444, February 1980.

[C0079] Cook, D., "In Support of Domain Structure for Operating Systems," Proceedings o f 7th Sympos ium 0 n Operating Systems Principles, pp. 128-30, December 1979.

[DAH7 2] Dahl, 0-J , Dijkstra, E. W., and Hoare C. A. R., Structured Programming, Academic Press, New York, 1 972 .

[DEN66] Dennis, J. B. and Van Horn, E. C ., "Programming Semantics for Multiprogrammed Computations," Communicat ions ACM, Vol. 9, No. 3, pp. 143-155 , Ma rc h 1 9 6 6 .

[DEN 7 6] Denning, P. J., "Fault Tolerant Operating Systems," ACM Comput ing Surveys, Vol. 8, No. 4, pp. 361-486, December 19 76.

[DEN7 9] Dennis, J. B., et al., "Research Directions in Computer Architecture," in Research Directions in Sof tware Technology, (Weger, P., editor), pp. 514-555, The MIT Press, Cambridge, Mass., 1979.

[DEN80] Dennis, T. D., "A Capability Architecture," Ph.D. Dissertation, Department of Computer Science, Purdue University, West Lafayette, Indiana, May 1980.

[DES81 ] desJardins, R., "Overview and Status of the ISO Reference Model of Open System Interconnection," ACM SIGCOMM Computer Communication Review, Vol. 11, No. 2, pp. 9-14, April 1981. 228

[DIG74] Digital Equipment Corporation, Introduction to Minicomputer Networks, Maynard, Mass., 1974.

[DIG78] Digital Equipment Corporation, TOPS-20 Version 3A Operating System: Monitor Calls Reference Manual, Maynard, Mass., September 1978.

[DIJ68] Dijkstra, E. W., "Cooperating Sequential Processes," in Programming Languages, pp. 43-112, Academic Press, London, 1968.

[DON76] Donnelley, J. E., "A Distributed Capability Computing System," Proceedings of the 3rd International Conference on Computer Communication, pp. 432-440, August 1976.

[DON79] Donnelley, J. E., "Components of a Network Operating System," Computer Networks, Vol. 3, No. 6, pp. 389-399, December 1979.

[DON80] Donnelley, J. E. and Fletcher, J. G., "Resource Access Control in a Network Operating System," Proceedings of the ACM Pacific'80 Conference, November 1980.

[DOR75] Doran, R. W., "The International Computer Ltd. ICL 2900 Computer Architecture," ACM SIGARCH Computer Architecture News, Vol. 4, No. 3, pp. 24-47, September 1975.

[ENG74] England, D. M., "Capability Concept Mechanisms and Structure in System 250," International Workshop on Protection in Operating Systems, IRIA/LABORIA, pp. 63-82, August 1974.

[ENS78] Enslow, P. H., "What is a 'Distributed' Data Processing System?" IEEE Computer, Vol. 11, No. 1, pp. 13-21, January 1978.

[FAB74] Fabry, R. S., "Capability-Based Addressing," Communications ACM, Vol. 17, No. 7, pp. 403-412, July 1974.

[FLE73] Fletcher, J. G., "The Octopus Computer Network," Datamation, Vol. 19, No. 4, pp. 58-63, April 1973.

[FLE80] Fletcher, J. G. and Watson, R. W., "Service Support in a Network Operating System," Proceedings of COMPCON'80 Spring, pp. 415-424, February 1980.

[FLY79] Flynn, M. J., et al., Operating Systems: An Advanced Course, Springer-Verlag, New York, 1979.

[FLY80] Flynn, M. J., "Directions and Issues in Architecture and Language," IEEE Computer, Vol. 13, No. 10, pp. 5-22, October 1980.

[FOR78] Forsdick, H., Schantz, R. E., and Thomas, R., "Operating Systems for Computer Networks," IEEE Computer, Vol. 11, No. 1, pp. 48-57, January 1978.

[GEH79a] Gehringer, E. F., "Functionality and Performance in Capability-Based Operating Systems," Ph.D. Dissertation, Department of Computer Science, Purdue University, West Lafayette, Indiana, May 1979.

[GEH79b] Gehringer, E. F., "Variable-Length Capabilities as a Solution to the Small-Object Problem," Proceedings of the 7th Symposium on Operating Systems Principles, pp. 131-142, December 1979.

[GLI79] Gligor, V. D., "Architectural Implementation of Abstract Data Type Implementation," Proceedings of the 6th Annual Symposium on Computer Architecture, April 1979.

[GRA72] Graham, G. S. and Denning, P. J., "Protection - Principles and Practice," AFIPS Conference Proceedings, Spring Joint Computer Conference, pp. 417-429, 1972.

[GRA78] Gray, J., "Notes on Data Base Operating Systems," Technical Report RJ2188, IBM Research Laboratory, San Jose, California, February 1978.

[HAB76] Habermann, A. N., Flon, L., and Cooprider, L., "Modularization and Hierarchy in a Family of Operating Systems," Communications ACM, Vol. 19, No. 5, pp. 266-272, May 1976.

[HAL80] Halstead, R. H. and Ward, S. A., "The MuNet: A Scalable Decentralized Architecture for Parallel Computation," Proceedings of the 7th Annual Symposium on Computer Architecture, pp. 139-145, May 1980.

[HAY78] Hayes, J. P., Computer Architecture and Organization, McGraw-Hill Book Company, New York, 1978.

[HEA70] Heart, F. E., et al., "The Interface Message Processor for the ARPA Computer Network," AFIPS Conference Proceedings, Spring Joint Computer Conference, Vol. 36, pp. 551-567, June 1970.

[HEA73] Heart, F. E., et al., "A New Minicomputer/Multiprocessor for the ARPA Network," AFIPS Conference Proceedings, National Computer Conference, Vol. 42, pp. 529-537, June 1973.

[ILI68] Iliffe, J. K., Basic Machine Principles, American Elsevier, Inc., New York, 1968.

[ISO80] International Standards Organization (ISO), Data Processing - Open System Interconnection - Basic Reference Model, ISO Draft Proposal 7498, New York, December 1980.

[JAG76] Jagannathan, A., "An Implementation Model for a Multiprocessor Operating System on a Descriptor Oriented Architecture," Master's Thesis, Rice University, Houston, Texas, July 1976.

[JAG80] Jagannathan, A., "A Technique for the Architectural Implementation of Software Subsystems," Proceedings of the 7th Annual Symposium on Computer Architecture, pp. 236-244, May 1980.

[JEN78] Jensen, E. D., "The Honeywell Experimental Distributed Processor - An Overview," IEEE Computer, Vol. 11, pp. 28-38, January 1978.

[JEN80] Jensen, E. D., et al., "Decentralized Resource Management in Distributed Computer Systems," Final Year Interim Report, RADC, PR B-1-3503, October 1980.

[JON77] Jones, A. K., Chansel, R. J., Durham, I., Feiler, P., and Schwans, K., "Software Management of Cm* - A Distributed Multiprocessor," AFIPS Conference Proceedings, Vol. 46, pp. 637-644, 1977.

[JON78] Jones, A. K., "The Object Model: A Conceptual Tool for Structuring Software," in Lecture Notes in Computer Science, Vol. 60, (Bayer, R., Graham, R. H., and Seegmuller, G., editors), pp. 8-18, Springer-Verlag, Berlin, 1978.

[JON79a] Jones, A. K. and Schwans, K., "TASK Forces: Distributed Software for Solving Problems of Substantial Size," Proceedings of the 4th International Conference on Software Engineering, pp. 315-330, September 1979.

[JON79b] Jones, A. K., et al., "StarOS, A Multiprocessor Operating System for the Support of Task Forces," Proceedings of the 7th Symposium on Operating Systems Principles, pp. 117-127, December 1979.

[KAH81] Kahn, K. C. and Pollack, F., "An Extensible Operating System for the Intel 432," Proceedings of COMPCON'81 Spring, pp. 398-404, February 1981.

[KAP80] Kapur, D., "Towards a Theory for Abstract Data Types," Ph.D. Dissertation, Laboratory for Computer Science, MIT, Cambridge, Mass., May 1980.

[KAT78] Katsuki, D., et al., "Pluribus - An Operational Fault-Tolerant Multiprocessor," Proceedings of the IEEE, Vol. 66, No. 10, pp. 1146-1159, October 1978.

[KUC78] Kuck, D. J., The Structure of Computers and Computations, Vol. 1, John Wiley & Sons, New York, 1978.

[LAU78] Lauer, H. C. and Needham, R. M., "On the Duality of Operating Systems Structures," Proceedings of the Second International Symposium on Operating Systems, IRIA, October 1978; reprinted in ACM SIGOPS Operating Systems Review, Vol. 13, No. 2, pp. 3-19, April 1979.

[LIN76a] Linden, T. A., "The Use of Abstract Data Types to Simplify Program Modifications," Proceedings of the Conference on Data: Abstraction, Definition and Structure, pp. 12-23, March 1976.

[LIN76b] Linden, T. A., "Operating Systems Structures to Support Security and Reliable Software," ACM Computing Surveys, Vol. 8, No. 4, pp. 409-445, December 1976.

[LIN81] Lindsay, D. C., "On Binding Layers of Software," ACM SIGOPS Operating Systems Review, Vol. 15, No. 2, pp. 33-37, April 1981.

[LIS75] Liskov, B. H. and Zilles, S. N., "Specification Techniques for Data Abstractions," IEEE Transactions on Software Engineering, Vol. SE-1, No. 1, pp. 7-19, March 1975.

[LIS77] Liskov, B., et al., "Abstraction Mechanisms in CLU," Communications ACM, Vol. 20, No. 8, pp. 564-576, August 1977.

[LIU75] Liu, M. T. and Reames, C. C., "The Design of the Distributed Loop Computer Network," Proceedings of the 1975 International Computer Symposium, Vol. 1, pp. 273-283, August 1975.

[LIU77] Liu, M. T. and Reames, C. C., "Message Communication Protocol and Operating System Design for the Distributed Loop Computer Network (DLCN)," Proceedings of the 4th Annual Symposium on Computer Architecture, pp. 193-200, March 1977.

[LIU78] Liu, M. T., "Distributed Loop Computer Networks," in Advances in Computers, Vol. 17, (Yovits, M. C., editor), pp. 163-221, Academic Press, New York, 1978.

[LIU79] Liu, M. T., et al., "System Design of the Distributed Double-Loop Computer Network (DDLCN)," Proceedings of the First International Conference on Distributed Computing Systems, pp. 95-105, October 1979.

[LIU80] Liu, M. T., Mamrak, S. A., and Ramanathan, J., "The Distributed Double-Loop Computer Network (DDLCN)," Proceedings of the 1980 ACM Annual Conference, pp. 164-178, October 1980.

[LIU81] Liu, M. T., et al., "Design of the Distributed Double-Loop Computer Network (DDLCN)," Journal of Digital Systems, Vol. 4, No. 4, April 1981.

[LUN79] Luniewski, A. W., "The Architecture of an Object Based Personal Computer," Ph.D. Dissertation, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, 1979.

[MAD81] Madsen, J., "A Computer System Supporting Data System Abstraction," ACM SIGOPS Operating Systems Review, Vol. 15, No. 2, pp. 38-78, April 1981.

[MAM81] Mamrak, S. A. and Berk, T. S., "The Desperanto Research Project," Technical Report OSU-CISRC-TR-81-2, Computer and Information Science Research Center, The Ohio State University, Columbus, Ohio, February 1981.

[MAN76] Mann, W. F., Ornstein, S. M., and Kraley, M. F., "A Network Oriented Multiprocessor Front-End Handling Many Hosts and Hundreds of Terminals," AFIPS Conference Proceedings, National Computer Conference, Vol. 45, pp. 533-540, June 1976.

[MCC80] McCreery, T. D., "The X-Tree Operating System: Bottom Layer," Proceedings of COMPCON'80 Spring, pp. 340-343, February 1980.

[MCK80] McKeeman, W. M., "Stack Computers," in Introduction to Computer Architecture, 2nd ed., (Stone, H. S., editor), pp. 319-362, Science Research Associates, Inc., Chicago, Illinois, 1980.

[MYE80] Myers, G. J. and Buckingham, B. R. S., "A Hardware Implementation of Capability-Based Addressing," ACM SIGARCH Computer Architecture News, Vol. 8, No. 6, pp. 12-24, October 1980.

[NEE77a] Needham, R. M. and Walker, R. D. H., "The Cambridge CAP Computer and its Protection System," Proceedings of the Sixth Symposium on Operating Systems Principles, pp. 1-10, November 1977.

[NEE77b] Needham, R. M. and Birrell, A. D., "The CAP Filing System," Proceedings of the Sixth Symposium on Operating Systems Principles, pp. 11-16, November 1977.

[NEE77c] Needham, R. M., "The CAP Project: An Interim Evaluation," Proceedings of the Sixth Symposium on Operating Systems Principles, pp. 17-22, November 1977.

[NEW72] Newport, C. B. and Ryzlak, J., "Communication Processors," Proceedings of the IEEE, Vol. 60, No. 11, pp. 1321-1332, November 1972.

[OH77] Oh, Y. and Liu, M. T., "Interface Design for Distributed Control Loop Networks," Proceedings of the 1977 National Telecommunication Conference, pp. 31.4.1-6, December 1977.

[ORN75] Ornstein, S. M., et al., "Pluribus - A Reliable Multiprocessor," AFIPS Conference Proceedings, National Computer Conference, Vol. 44, pp. 551-559, 1975.

[OUS80] Ousterhout, J. K., et al., "Medusa: An Experiment in Distributed Operating System Structure," Communications ACM, Vol. 23, No. 2, pp. 92-105, February 1980.

[PAR76] Parker, D. B., "The Future of Computer Abuse," Proceedings of Man and the Computer Symposium, pp. 59-68, November 1976.

[PAR78] Pardo, R., Liu, M. T., and Babic, G. A., "An N-Process Communication Protocol for Distributed Processing," Proceedings of the Symposium on Computer Network Protocols, pp. D7.1-10, February 1978.

[PAR79a] Pardo, R., "Interprocess Communication and Synchronization for Distributed Systems," Ph.D. Dissertation, Department of Computer and Information Science, The Ohio State University, Columbus, Ohio, August 1979.

[PAR79b] Pardo, R. and Liu, M. T., "Multi-Destination Protocols for Distributed Systems," Proceedings of the 1979 Computer Networking Symposium, pp. 176-185, December 1979.

[PEE78] Peebles, R. and Manning, E., "System Architecture for Distributed Data Management," IEEE Computer, Vol. 11, No. 1, pp. 40-47, January 1978.

[PEE80] Peebles, R. and Dopirak, T., "ADAPT: A Guest System," Proceedings of COMPCON'80 Spring, pp. 445-454, February 1980.

[PIE77] Pierce, R. A. and Moore, D. H., "Network Operating Systems Functions and Microprocessor Front-End," Proceedings of COMPCON'77 Spring, pp. 325-328, February 1977.

[POP74] Popek, G. J., "Protection Structures," IEEE Computer, Vol. 7, No. 6, pp. 22-31, June 1974.

[RAT80] Rattner, J. and Cox, G., "Object-Based Computer Architecture," ACM SIGARCH Computer Architecture News, Vol. 8, No. 6, pp. 4-11, October 1980.

[RAT81] Rattner, J. and Lattin, W. W., "Ada Determines Architecture of 32-bit Microprocessor," Electronics, pp. 119-126, February 1981.

[REA75] Reames, C. C. and Liu, M. T., "A Loop Network for Simultaneous Transmission of Variable-Length Messages," Proceedings of the Second Annual Symposium on Computer Architecture, pp. 7-12, January 1975. (Also reprinted in Distributed Processing, Liebowitz, B. H. and Carson, J. H., editors, IEEE Catalog EH 0127-1, pp. 3.31-3.36, September 1977.)

[REA76] Reames, C. C., "System Design of the Distributed Loop Computer Network (DLCN)," Ph.D. Dissertation, Department of Computer and Information Science, The Ohio State University, Columbus, Ohio, March 1976.

[ROW75] Rowe, L. A., "The Distributed Computing Operating System," Technical Report 66, University of California, Irvine, June 1975.

[SAL75] Saltzer, J. H. and Schroeder, M. D., "The Protection of Information in Computer Systems," Proceedings of the IEEE, Vol. 63, No. 9, pp. 1278-1308, September 1975.

[SCH78] Schaffert, J. C., "A Formal Definition of CLU," MIT Technical Report MIT-LCS-TR-193, MIT, Cambridge, Mass., January 1978.

[SHA80] Shankar, K. S., "Data Structures, Types, and Abstractions," IEEE Computer, Vol. 13, No. 4, pp. 67-77, April 1980.

[SIT80] Sites, R. L., "Operating Systems and Computer Architecture," in Introduction to Computer Architecture, 2nd ed., (Stone, H. S., editor), pp. 591-643, Science Research Associates, Inc., Chicago, Illinois, 1980.

[SNE79] Van de Snepscheut, J. L. A., "Introducing the Notion of Processes to Hardware," ACM SIGARCH Computer Architecture News, Vol. 7, No. 7, pp. 13-23, April 1979.

[SNY79] Snyder, A., "A Machine to Support an Object-Oriented Language," Technical Report TR-209, Laboratory for Computer Science, MIT, Cambridge, Mass., March 1979.

[SOL79] Solomon, M. H. and Finkel, R. A., "The Roscoe Distributed Operating System," Proceedings of the 7th Symposium on Operating Systems Principles, pp. 108-114, December 1979.

[STA79] Stankovic, J. A. and Van Dam, A., "Research Directions in (Cooperative) Distributed Processing," in Research Directions in Software Technology, (Wegner, P., editor), pp. 611-638, The MIT Press, Cambridge, Mass., 1979.

[SUN75] Sunshine, C. A., "Interprocess Communication Protocol for Computer Networks," Ph.D. Dissertation, Digital System Laboratory, Stanford University, Stanford, California, December 1975.

[SVO79] Svobodova, L., Liskov, B., and Clark, D., "Distributed Computer Systems: Structure and Semantics," Technical Report TR-215, Laboratory for Computer Science, MIT, Cambridge, Mass., April 1979.

[TAN81] Tanenbaum, A. S., Computer Networks, Prentice-Hall, Englewood Cliffs, N.J., 1981.

[TER77] Terman, L. M., "The Role of Microelectronics in Data Processing," in Microelectronics, pp. 78-87, W. H. Freeman and Company, San Francisco, 1977.

[THU74] Thurber, K. J., "Interconnection Networks - A Survey and Assessment," Proceedings of the National Computer Conference, Vol. 43, pp. 909-919, 1974.

[TSA79] Tsay, D. P. and Liu, M. T., "Interface Design for the Distributed Double-Loop Computer Network (DDLCN)," Proceedings of the 1979 National Telecommunications Conference, pp. 31.4.1-6, December 1979.

[TSA80a] Tsay, D. P. and Liu, M. T., "Design of a Reconfigurable Front-End Processor for Computer Networks," Proceedings of the 1980 International Symposium on Fault-Tolerant Computing, pp. 369-371, October 1980.

[TSA80b] Tsay, D. P. and Liu, M. T., "Design of a Robust Network Front-End for the Distributed Double-Loop Computer Network (DDLCN)," Proceedings of the Distributed Data Acquisition, Computing, and Control Symposium, pp. 141-155, December 1980.

[TSA81] Tsay, D. P. and Liu, M. T., "MIKE: A Network Operating System for the Distributed Double-Loop Computer Network (DDLCN)," to appear in the Fifth International Computer Software and Applications Conference (COMPSAC'81), Chicago, Illinois, November 18, 1981.

[TSA82] Tsay, D. P., Liu, M. T., and Lian, R. C., "Design of a Network Operating System for the Distributed Double-Loop Computer Network (DDLCN)," submitted to the International Symposium on Local Computer Networks, Florence, Italy, April 1982.

[VAN76] Van Dam, A. and Michel, J., "Experience with Distributed Processing on a Host/Satellite Graphics System," Proceedings of SIGGRAPH, July 1976.

[WAR80] Ward, S. A., "TRIX: A Network-Oriented Operating System," Proceedings of COMPCON'80 Spring, pp. 344-349, February 1980.

[WAT80] Watson, R. W. and Fletcher, J. G., "An Architecture for Support of Network Operating System Services," Computer Networks, Vol. 4, No. 1, pp. 33-49, February 1980.

[WEI78] Weissberger, A. J., Data Communication Handbook, Signetics Corporation, Sunnyvale, California, August 1978.

[WET80] Wettstein, D. and Merbeth, G., "The Concept of Asynchronization," ACM SIGOPS Operating Systems Review, Vol. 14, No. 4, pp. 50-70, October 1980.

[WIL79] Wilkes, M. V. and Needham, R. M., The Cambridge CAP Computer and its Operating System, North-Holland, New York, 1979.

[WIT80] Wittie, L. D. and Van Tilborg, A. M., "MICROS, A Distributed Operating System for MICRONET, A Reconfigurable Network Computer," IEEE Transactions on Computers, Vol. C-29, No. 12, pp. 1133-1144, December 1980.

[WOL78] Wolf, J. J. and Liu, M. T., "A Distributed Double-Loop Computer Network (DDLCN)," Proceedings of the Seventh Texas Conference on Computing Systems, pp. 6.19-6.34, November 1978.

[WOL79a] Wolf, J. J., Liu, M. T., Weide, B. W., and Tsay, D. P., "Design of a Distributed Fault-Tolerant Loop Network," Proceedings of the 1979 International Symposium on Fault-Tolerant Computing, pp. 17-24, June 1979.

[WOL79b] Wolf, J. J., "Design and Analysis of the Distributed Double-Loop Computer Network (DDLCN)," Ph.D. Dissertation, Department of Computer and Information Science, The Ohio State University, Columbus, Ohio, August 1979.

[WOL79c] Wolf, J. J., Weide, B. W., and Liu, M. T., "Analysis and Simulation of the Distributed Double-Loop Computer Network (DDLCN)," Proceedings of the 1979 Computer Networking Symposium, pp. 32-89, December 1979.

[WUL74] Wulf, W. A., et al., "HYDRA: The Kernel of a Multiprocessor Operating System," Communications ACM, Vol. 17, No. 6, pp. 337-345, June 1974.

[WUL78] Wulf, W. A., "A Formal Definition of Alphard (Preliminary)," Technical Report CMU-CS-78-185, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, Pennsylvania, February 1978.

[WUL81] Wulf, W. A., Levin, R., and Harbison, S. P., Hydra/C.mmp: An Experimental Computer System, McGraw-Hill, New York, 1981.

[ZEI81] Zeigler, S., et al., "The Intel 432 Ada Programming Environment," Proceedings of COMPCON'81 Spring, pp. 405-410, February 1981.

[ZIM80] Zimmermann, H., "OSI Reference Model - The ISO Model of Architecture for Open System Interconnection," IEEE Transactions on Communications, Vol. COM-28, pp. 423-432, April 1980.