A Fault-Tolerant Distributed Operating System*
Total Page:16
File Type:pdf, Size:1020Kb
GEORGIA INSTITUTE OF TECHNOLOGY OFFICE OF CONTRACT ADMINISTRATION PROJECT ADMINISTRATION DATA SHEET El ORIGINAL REVISION NO. — 36 — 6 83 Project No. : - GTRINIVIX DATE 11/ 30, , Project Director et".1115"—b School/jam ICS Sponsor: NASA - Langley Research Center Hampton, VA 23665 ,-71 , Type Agreement: Grant No. -430 $ Award Period: From 1/9/83 (Performance) /?•••44.4 i(Reports) it Sponsor Amount: s mange Total to Date Estimated: $ 47,402 Funded: $ 47,402 Cost Sharing Amount: $ Cost Sharing No: Title: "A Support Architecture for Reliable Distributed Computing Systems" ADMINISTRATIVE DATA OCA Contact John W. Burdette ext. 4820 1) Sponsor Technical Contact: 2) Sponsor Admin/Contractual Matters: •.•.•. Mrs. A.S. Reed (MS 126) V NASA - Langley Research Center National Aeronautics and Space Admin. .144.1149•441ft• /4/5 I/LT Langley Research Center Hampton, VA 23665 Hampton, VA 23665 (804) 865-043-7-7— (804) 865-3215 3515 Defense Priority Rating: N/A Military Security Classification: N/A N/A (or) Company/Industrial Proprietary: ESTR ICTIONS See Attached NASA Supplemental Information Sheet for Additional Requirements. Travel: Foreign travel must have prior approval — Contact OCA in each case. Domestic travel requires sponsor approval where total will exceed greater of $500 or 125% of approved proposal budget category. Equipment: Title vests with GIT; however, Government reserves the right to acquire transfer ro itself title to items of $1,000 or more. COMMENTS: 23456 % .9, DEC1983 c%) co (,) (t`'_ RECEIVED Research Reports ch N Offico IZ" COPIES TO: Project Director Procurement/EES Supply Services GTRI Research Administrative Network Research Security Services Library ... Research Property Management R•pettrOotordinetti-teem - , Project File, GEORGIA xmmorEdof TECHNOLOGY OFFICE NOTICE OF PROJECT CLOSEOUT Closeout Notice Date 06/26/91 Center' to R5702-0A0 .0115. A SUPPORT ARCHITECTURE FOR RELIA14;101TRIBUTED COMPUTING' r ;" ffective CoSeleilen'llete 900410 (Perfii'--iienCi) 900410 (Repo stribution Required' fi Project Director ainistratite Network Represent TRI Accounting/Wants and Carib' tocureuent/MeMIY Services' eseara0Pripairty Man Research Security Services /.° A SUPPORT ARCHITECTURE FOR RELIABLE DISTRIBUTED COMPUTING SYSTEMS Semi-Annual Status Report November 9, 1983 - May 8, 1984 From Georgia Tech Research Institute Atlanta, Georgia 30332 To National Aeronautics and Space Administration Langley Research Center Grant No. NAG-1-430 ENDORSEMENT: /a Martin S. hckendry Date Principal Investigator (404)894-3816 Summary During the past six months, we have consolidated many of the concepts con- tained in the proposal for this grant, and we have begun design and construc- tion of systems to validate those concepts experimentally. An overview of the system design as it currently stands is contained in [McKe84c], attached as an appendix. This is the only paper completed during the period. Several papers are underway, or almost completed. These are described as appropriate below. Equipment The National Science Foundation has agreed to purchase test equipment to support our experimentation. This equipment comprises three VAX 11/750's interconnected via Ethernet and a VAX CI. Four well-configured IBM PC's are to be used as interface machines. The total award amount, including Georgia Tech matching contribution, is approximately $240,000. We are cur- rently negotiating a discount with Digital, the manufacturer. We anticipate delivery towards the end of calendar year 1984. Naturally, our design and construction efforts reflect this timeframe. Kernel Support Mechanisms Our work in this area has focused on mechanisms to maintain ordering require- ments in the nested action environment. This work is described in a tech- nical report [McKe84a] that is currently undergoing revision before release. Kernel Design An operating system kernel to support the operating system is under con- struction. Initial design concepts are discussed in a technical report [Spaf84] to be released soon. This report focuses on the operation invoc- ation mechanisms, network servers, the virtual memory system, and the file system. Work is now progressing on the construction of several of the low- level components of the operating system. Compiler Development In conjunction with this grant, although not funded directly by it, we are constructing a compiler to support system programming for the new operating system. The design of the language to be supported is now substantially complete, and construction of the compiler is underway. 1 Scheduling Mechanisms Preliminary work on the design of fault-tolerant scheduling mechanisms began during the past few months. This work is described in an upcoming technical report [McKe84b]. We anticipate that considerable effort will be expended here over the next two years. User Interface Work is progressing on the design and construction of a user interface for the operating system. Currently, this involves construction of a windowing operating system for the IBM PC. This will ultimately be used to interface to the VAXen, but immediately we hope to get experience using it in a Unix- style environment. Publications Possibly due to the youth of this effort, our publication record during the past six months has not been as good as we would have liked. However, many of the early concepts are now well understood and in draft form. Thus, we anticipate up to five journal submissions and several conference submissions in the next period. The topics to be covered include action management protocols, non-consistent database algorithms, action ordering mechanisms, specification techniques for fault-tolerant jobs, and scheduling mechanisms for fault-tolerant job execution. The last two papers mentioned also address dynamic reconfiguration. Possibly a paper on the file system will be com- pleted during this period also. The paper attached as an appendix was invited by the IEEE Interest group on distributed computing for a special issue of the newsletter. The issue is to focus on distributed operating systems. References [McKe84a] McKendry, M.S., "Ordering Actions for Visibility," School of In- formation and Computer Science, Georgia Institute of Technology, Tech- nical Report, 1984. [McKe84b] McKendry, M.S. "Fault-Tolerant Scheduling Mechanisms," School of Information and Computer Science, Georgia Institute of Technology, Technical Report. [McKe84c] Attached as an appendix. [Spaf84] Spafford, E.H., and M.S. McKendry, "Kernel Structures for Clouds," School of Information and Computer Science, Georgia Institute of Tech- nology, Technical Report. 2 Appendix Clouds: A Fault-Tolerant Distributed Operating System* Martin S. McKendry t School of Information and Computer Science Georgia Institute of Technology Atlanta, Georgia 30332 Introduction Clouds supports objects and (nested) actions as fundamental storage and computation mechanisms Over the past few years, several research groups have respectively. Objects are implemented as components been studying techniques to exploit the potential of of virtual address spaces. Actions are supported by distributed systems for supporting fault-tolerance. At low-level network protocols that handle lost and Georgia Tech, the Clouds group is working on the duplicated messages, remote procedure calls, and design and construction of a fault-tolerant, orphan detection. There is a tightly-controlled reconfigurable distributed operating system. We view relationship between volatile (main) memory and a single operating system as controlling a set of permanent memory (usually disk). This architectural computers called a multicomputer or computer cluster. support provides a sound basis for consistency. That is, Without impacting services provided to users, we it is possible to establish the restart state of any object intend to provide transparent support for upward after a failure. Even if that object is not available to reconfiguration -- addition of machines to the participate in establishing its state, the state is multicomputer -- and downward reconfiguration -- determined by a predefined protocol. failure or removal of machines. Thus, a fundamental difference of our work from much other work in the For a system to survive component failures (i.e., area is our attitude toward autonomy [McKe83]. We tolerate faults), mechanisms are needed in addition to regard autonomy as an issue of operating system basic action support. First, some representation of policy, not structure. Work is assigned to the work is needed. In Clouds, we represent work with machine(s) most able to handle it. In this way, service action networks. Action networks are programmed to all users can be enhanced. using a Petri-Net notation [Pete77]. The transitions of Petri-nets correspond to action executions. One of the The foundation of our approach depends on a attractions of this approach is that the state of such a combination of actions, objects, and fault-tolerant job net (called a job) is easily defined: actions execute scheduling mechanisms. Objects are instances of (transitions fire) atomically, and thus the net changes abstract data types. They encapsulate a data part, state atomically. which represents their state, and a procedural (or operation) part, which specifies the changes that can be The second requirement for survivability is that made to the data. Actions are partial orders of continuity of job execution be maintained through operations on objects. In considering reliability, the failures. To meet this requirement,