From: AAAI-84 Proceedings. Copyright ©1984, AAAI (www.aaai.org). All rights reserved.

YES/MVS: A Continuous Real Time Expert System

J.H. Griesmer, S.J. Hong, M. Karnaugh, J.K. Kastner, M.I. Schor
Expert Systems Group, Mathematical Science Department

R.L. Ennis, D.A. Klein, K.R. Milliken, H.M. VanWoerkom
Installation Management Group, Computer Science Department

IBM T. J. Watson Research Center, Yorktown Heights, NY 10598

ABSTRACT:

YES/MVS (Yorktown Expert System for MVS operators) is a continuous, real time expert system that exerts interactive control over an operating system as an aid to computer operators. This paper discusses the YES/MVS system, its domain of application, and issues that arise in the design and development of an expert system that runs continuously in real time.

I INTRODUCTION

Each installation has a different configuration and different local policies for computer operations, both of which change over time. The software running in a large computer installation represents hundreds of man-years of development and is comprised of many interacting subsystems. To deal with such complexity, operators and system programmers often rely on many rules of thumb gained through experience. The development of installation management tools which can be easily tailored and modified, and which can incorporate such "rules of thumb", is highly desirable. An expert systems approach was a natural choice because of its flexibility and maintainability. The Yorktown Expert System for MVS operators (YES/MVS) is a real time interactive control system that operates continuously.

The idea of on-line monitoring or controlling of one computer by another is not new. Watch-dog processors [3] and maintenance processors [4] have been designed to assist in the recovery from software errors and hardware errors while the subject computer is in operation. What is new is the application of an expert system approach to the control of computer operations.

A. Importance of the Domain

Computer operations is a monitoring and problem solving activity that must be conducted in real time. It is becoming increasingly complex as data processing installations grow. Large data processing installations often involve multiple CPU's and a large number of peripherals networked together, representing a multi-million dollar investment. Many of these installations run real time applications (e.g., banking, reservation systems). The control of a typical large system rests largely in the hands of just a few operators. Besides carrying on such routine activities as mounting tapes, loading and changing forms in printers, and answering phones, an operator continuously monitors the condition of the subject operating system and initiates queries and/or commands to diagnose and solve problems as they arise. A long training period is required to produce a skilled operator; trained operators, in turn, are often promoted to systems programmers. The resulting shortage of skilled operators and the increasing complexity of the operator's job call for more powerful installation management tools. We have chosen the management of a Multiple Virtual Storage (MVS) operating system, the most widely used operating system on large IBM mainframe computers, as an example of the application of expert systems to problems in computer installation management.

B. Use of Expert System Techniques

Expert systems techniques are beginning to be successfully applied to real problems in industry, although only a handful are reportedly in use so far. Most of the applications are consultation oriented, run in a session or in a batch mode, and deal with a static world. The nuclear reactor monitoring expert system, REACTOR [1], and the patient monitoring expert system for intensive care units, VM [2], are among the few attempts at continuous on-line operation and real time processing. However, neither of these systems exercises any real time interactive control over the subject being monitored.

C. Special Challenges

There are many new requirements in building a real time expert system to assist a computer operator. The environment is too complex and dynamic to allow information to be obtained by querying the human operator. Unlike many of the consultation style expert systems (e.g., MYCIN [5], CASNET [6]), this means that conclusions are based on primitive facts obtained directly from the system being monitored and not from human interpreted inputs. In addition, the dynamic nature of the subject system introduces potential inconsistencies in the expert system's model of the subject MVS system. By the time conclusions are to be put into effect (recommendations made, actions taken), many of the facts from which those conclusions were derived may have changed. This complexity and dynamic character also make it very difficult to simulate or model the subject system. Developing and debugging such an expert system presents yet another interesting challenge.

D. The OPS5 Base

To be able to handle real time on-line data, the inference engine needs to be mainly data driven (a requirement also recognized by REACTOR [1] and VM [2]). The OPS5 production system developed by C. L. Forgy [7] was chosen as our tool primarily for this reason. Also, significant applications based on the OPS family of production systems have been reported (R1/XSEL/PTRANS [8-10] and ACE [11]). Of importance was our ability to convert OPS5 to run in our computing environment. While OPS5 was not directly usable for our application, it possessed the important properties of flexibility (stemming from being a low level language) and easy modifiability. We have made a number of important extensions to OPS5 so that our continuous real time requirements could be met.
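OPS5 programs are condition-action rules that fire whenever matching elements appear in working memory. As a rough, hypothetical sketch of that data-driven control style (written here in Python rather than OPS5, and not the YES/MVS rule base), the fragment below asserts a fact as a monitored message arrives and lets rules chain forward from it; the fact shapes, rule names, and the "command" it emits are invented for illustration.

```python
# Hypothetical sketch of a data-driven (forward-chaining) rule cycle in the
# spirit of an OPS5-style production system; not the YES/MVS rule base.

working_memory = []   # facts asserted as the subject system emits messages
issued_commands = []  # commands the expert system side would issue

def rule_low_resource(fact):
    """If a monitored resource is critically low, derive a problem fact."""
    if fact.get("type") == "resource-level" and fact["percent_free"] < 10:
        return {"type": "problem", "resource": fact["resource"]}
    return None

def rule_act_on_problem(fact):
    """If a problem fact exists, issue a (hypothetical) corrective command."""
    if fact.get("type") == "problem":
        issued_commands.append(f"relieve {fact['resource']}")
    return None

RULES = [rule_low_resource, rule_act_on_problem]

def run_rules():
    """Match rules against facts until no rule produces a new fact."""
    tried = set()
    changed = True
    while changed:
        changed = False
        for i, fact in enumerate(list(working_memory)):
            for rule in RULES:
                if (i, rule.__name__) in tried:
                    continue
                tried.add((i, rule.__name__))
                new_fact = rule(fact)
                if new_fact is not None:
                    working_memory.append(new_fact)
                    changed = True

def assert_fact(fact):
    """Data-driven entry point: a newly asserted fact drives the rule cycle."""
    working_memory.append(fact)
    run_rules()

# A message from the monitored system arrives; no human is queried.
assert_fact({"type": "resource-level", "resource": "spool", "percent_free": 7})
print(issued_commands)  # -> ['relieve spool']
```

The point of the style is that control flows from the arrival of data rather than from a fixed consultation script, which is what a continuously monitored operating system requires.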
II THE DOMAIN

The MVS operating system, running with a Job Entry System (JES), puts out various system messages to the operator. While there are literally hundreds of different types of messages, the number that are relevant to an operator is much smaller. The majority of purely informational messages may usually be statically filtered and diverted to a log. Even then, the peak message rate from MVS to the operator sometimes exceeds 100 per minute (e.g., a response to an operator request for a list of jobs in a particular category).

When a (potential) problem is detected, the operator may query MVS for additional information and send one or more corrective commands. The operator must often anticipate informational needs and dynamically keep track of a number of relevant status variables (out of thousands).

There are many different subdomains in the domain of operator activities. The six subdomains described below were selected for early implementation, since they touch a majority of those operator activities which involve no physical intervention.

A. JES Queue Space Management

All batch jobs processed under MVS are staged from a central spool file, called the Job Entry System (JES) queue space, before, during, and after execution. The operator is concerned with the remaining available queue space, because the job staging subsystem, JES, cannot recover if queue space is exhausted. When the level of remaining queue space becomes critically low, many actions are initiated to free additional space, such as forcing the printing of jobs which have finished execution and dumping large print jobs to tape. In extreme cases, the system can be made to refuse new jobs and to stop data being transmitted from other systems. To initiate such actions the operator makes use of the available facilities connected to the subject MVS system.
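To make the escalation concrete, the following is a small hypothetical sketch of this kind of threshold-driven response; the percentages and action strings are invented for the example and are not taken from YES/MVS or from MVS/JES documentation.

```python
# Hypothetical sketch of threshold-based escalation for JES queue space.
# Thresholds and "actions" are illustrative only.

def queue_space_actions(percent_free: float) -> list[str]:
    """Return operator-style actions for a given level of free queue space."""
    actions = []
    if percent_free < 20:
        # First relieve pressure by flushing completed output.
        actions.append("force printing of jobs that have finished execution")
    if percent_free < 15:
        actions.append("dump large print jobs to tape")
    if percent_free < 10:
        # Extreme case: stop new work from entering the spool.
        actions.append("refuse new job submissions")
        actions.append("stop data transmission from other systems")
    return actions

if __name__ == "__main__":
    for level in (25.0, 18.0, 8.0):
        print(f"{level:5.1f}% free -> {queue_space_actions(level)}")
```

In the expert system itself, such responses would be expressed as rules matching facts about queue space in working memory, so that they fire as soon as the monitored level crosses a threshold.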
C. Scheduling Large Batch Jobs Off Prime Shift

Large batch jobs must be scheduled to balance considerations of system throughput and user satisfaction. These considerations may vary in detail from one installation to another. They include: ensuring that no jobs are indefinitely delayed, employing round-robin scheduling among users submitting multiple jobs, giving priority to users who are waiting on site or who require some other special consideration, running longer jobs early in the shift, and running only those jobs that can finish before a scheduled shutdown. Since new jobs may arrive or be withdrawn during the shift, the initial scheduling may have to be changed among the jobs that are still in the queue. A separate paper [12] describes a truth maintenance approach using OPS5 for keeping a dynamically correct priority assignment of the jobs.

D. MVS Detected Hardware Errors

When MVS fails to recover from a detected hardware error, the system notifies the operator so that he or she may attempt to solve the problem. Due to the time criticality of possible remedies (such as speedy reconfiguration), recoverable situations may result in a system crash because a human operator cannot respond in time. Responses to the most frequent hardware problems have been implemented in rules. These rules are not tied to a particular hardware configuration but rather make use of hardware configuration data placed in the OPS5 working memory. The hardware configuration data is initially loaded from the files used in the MVS system generation process.

E. Monitoring Software Subsystems

The main activity in this area is to generate informative incident reports for the systems programmers who are responsible for specific software subsystems. When an incident occurs, such as an abnormal end of execution, relevant information is captured and an appropriate incident report is prepared. In limited cases, reallocation of resources may allow recovery from software failures.

F. Performance Monitoring

This task goes beyond the usual scope of an operator's activities. A short term goal is to interpret the data from existing performance monitoring software, and to automatically detect and classify performance problems in real time, generating summary reports in hard copy as well as in computer graphics. An eventual goal is to diagnose the
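As a minimal sketch of that short term goal, interpreting samples from existing performance monitors and classifying threshold violations might look like the following; the metric names, limits, and problem categories are assumptions made for the example, not part of YES/MVS.

```python
# Hypothetical sketch: classify performance problems from monitored samples.
# Metric names, thresholds, and categories are illustrative assumptions.

THRESHOLDS = {
    # metric             (limit, problem category)
    "cpu_busy_pct":      (95.0,  "CPU saturation"),
    "paging_rate":       (200.0, "excessive paging"),
    "avg_response_sec":  (5.0,   "poor interactive response"),
}

def classify(sample: dict) -> list[str]:
    """Return a report line for every metric in the sample that exceeds its limit."""
    report = []
    for metric, value in sample.items():
        if metric in THRESHOLDS:
            limit, category = THRESHOLDS[metric]
            if value > limit:
                report.append(f"{category}: {metric}={value} exceeds {limit}")
    return report

if __name__ == "__main__":
    sample = {"cpu_busy_pct": 97.2, "paging_rate": 120.0, "avg_response_sec": 6.4}
    for line in classify(sample):
        print(line)
```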