Highly Available Systems for Database Applications

WON KIM, IBM Research Laboratory, San Jose, California 95193
Author's current address: Microelectronics and Computer Technology Corporation, 9430 Research Boulevard, Austin, Texas 78759.

As users entrust more and more of their applications to computer systems, the need for systems that are continuously operational (24 hours per day) has become even greater. This paper presents a survey and analysis of representative architectures and techniques that have been developed for constructing highly available systems for database applications. It then proposes a design of a distributed software subsystem that can serve as a unified framework for constructing database application systems that meet various requirements for high availability.

Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)--interconnection architectures; C.4 [Computer Systems Organization]: Performance of Systems--reliability, availability, and serviceability; H.2.4 [Database Management]: Systems--distributed systems, transaction processing

General Terms: Reliability

Additional Key Words and Phrases: Database concurrency control and recovery, relational database

CONTENTS

INTRODUCTION
1. MACHINE AND STORAGE ORGANIZATION TAXONOMY
2. INTUITIVE CRITERIA FOR HIGH AVAILABILITY
3. LOOSELY COUPLED SYSTEMS WITH A PARTITIONED DATABASE
   3.1 Tandem NonStop System
   3.2 AT&T's Stored Program Controlled Network
   3.3 Bank of America's Distributive Computing Facility
   3.4 Stratus/32 Continuous Processing System
   3.5 Auragen System 4000
   3.6 System D Prototype
4. TIGHTLY COUPLED SYSTEMS WITH A SHARED DATABASE
5. LOOSELY COUPLED MULTIPROCESSOR SYSTEM WITH A SHARED DATABASE
   5.1 GE MARK III
   5.2 Computer Consoles' Power System
6. REDUNDANT COMPUTATION SYSTEMS: SYNTREX'S GEMINI FILE SERVER
7. A FRAMEWORK FOR THE MANAGEMENT OF SYSTEM CONFIGURATION
8. CONCLUDING REMARKS
ACKNOWLEDGMENTS
REFERENCES

INTRODUCTION

In this paper we examine major hardware and software aspects of highly available systems. Its scope is limited to those systems designed for database applications. Database applications require multiple paths from the processor to the disks, which gives rise to some difficult issues of system architecture and engineering. Further, they involve the software issues of concurrency control, recovery from crashes, and transaction management.

In a typical business data processing environment, a user message from a terminal invokes an application program. The application program interacts with a transaction manager to initiate and terminate (commit or abort) a transaction. Once a transaction has been initiated, the application program repeatedly interacts with a database manager to retrieve and update records in the database.

A transaction is a collection of reads and writes against a database that is treated as a unit [Gray 1978]. If a transaction completes, its effect becomes permanently recorded in the database; otherwise, no trace of its effect remains in the database. To support the notion of a transaction, an undo log of data before updates and a redo log of data after updates are used to allow a transaction to be undone or redone after crashes. The Write Ahead Log protocol often is used to ensure that the log is flushed to the disk before the updated database records are written to the disk.
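The Write Ahead Log discipline just described can be illustrated with a minimal sketch. Python is used only for illustration; the Log and BufferPool classes, their method names, and the page values are hypothetical and are not taken from any of the systems surveyed in this paper.

    # Minimal sketch of the Write Ahead Log rule: log records describing an
    # update (before and after images) are forced to disk before the updated
    # data page itself is written. All names are illustrative only.

    class Log:
        def __init__(self):
            self.stable = []      # records already forced to disk
            self.tail = []        # records still in memory

        def append(self, record):
            self.tail.append(record)

        def force(self):
            self.stable.extend(self.tail)   # simulate a synchronous log write
            self.tail = []

    class BufferPool:
        def __init__(self, log):
            self.log = log
            self.dirty = {}       # page id -> new value not yet on disk
            self.disk = {}        # page id -> value on disk

        def update(self, tid, page, new_value):
            old = self.disk.get(page)
            # undo (before image) and redo (after image) data are logged first
            self.log.append((tid, page, old, new_value))
            self.dirty[page] = new_value

        def flush(self, page):
            # WAL: never write a dirty page before its log records are stable
            self.log.force()
            self.disk[page] = self.dirty.pop(page)

    log = Log()
    pool = BufferPool(log)
    pool.update("T1", "P5", "new balance")
    pool.flush("P5")
    print(log.stable, pool.disk)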

The most fundamental requirement in constructing a highly available system for database applications is that each major hardware and software component must at least be duplicated. At minimum, the system requires two processors. There may have to be two paths connecting the processors, and it is desirable to have at least two paths from the processors to the database, that is, two I/O subsystems consisting of a channel (I/O processor), controller, and disk drives. The disk controllers must be multiported, so that they may be connected to more than one processor.

On the software side, the system needs five essential ingredients: (1) a network communication subsystem, (2) a data communication subsystem, (3) a database manager (or a file manager), (4) a transaction manager, and (5) the operating system.

The network communication subsystem must support interprocess(or) communication within a cluster of locally distributed processors. If the highly available system is a node on a geographically distributed system, the communication subsystem must also support internode communication.

The data communication subsystem is the terminal handler that receives user requests, invokes application programs, and delivers the results that it receives from the database/transaction manager.

The database manager must support the two fundamental capabilities of concurrency control and recovery. That is, it must guarantee that the database remains consistent despite interleaved reads and writes to the database by multiple concurrent transactions. Techniques such as locking and time stamping have been developed and extensively studied for this purpose [Bernstein and Goodman 1981; Kohler 1981].

The transaction/database manager must ensure that the database consistency is not compromised by system failures or transaction failures caused by software errors or deadlocks. In particular, it must be able to recover from transaction failures and soft crashes, which corrupt only the contents of the main memory, as well as from hard crashes, which destroy the contents of the disks. For recovery from soft crashes, the undo and redo logs of transactions are used [Gray et al. 1981; Haerder and Reuter 1983]. To recover from hard crashes, systems rely on periodic dumping of the database into archival storage.

The operating system is needed not only to run the other software components, but also to detect most of the common, low-level software/hardware errors that occur. Most computer systems rely on program checks and machine checks to detect such errors. Program checks are used to detect exceptions and events that occur during execution of the program [IBM 1980]. Exceptions include the improper specification or use of instructions and data, for example, arithmetic overflow or underflow, and addressing or protection violation. Machine checks are used to report machine malfunctions, such as memory parity errors, I/O errors, and missing interrupts. The hardware provides information that assists the operating system in determining the location of the malfunction and the extent of the damage caused by it.

Most systems have been designed to survive the failure of a single software/hardware component. If a system is to be truly continuously operational, however, it must guarantee availability during multiple concurrent failures of software/hardware components, during on-line changes of such components, and during on-line physical reconfiguration of the database itself. To support the full range of high-availability requirements, the operating system must be supplemented with a software subsystem that can manage all software and hardware components of a system. Such a software subsystem will receive failure reports from the system components and reconfiguration requests from the system operator. It will analyze the status of all the resources it manages, and compute the optimal configuration of the system both in response to multiple concurrent failures of components and requests for load balancing. Further, it will initiate and monitor system reconfiguration, effecting mid-course correction of a reconfiguration that does not succeed. Finally, it will diagnose a class of failures that other components fail to recognize.

The systems introduced in the following paragraphs are often considered highly available, and are used as examples of various architectures and stratagems throughout the remainder of this paper, particularly in Sections 3 through 6. A certain number of these systems do not, in my opinion, meet the criteria for classification as highly available; these are discussed in Section 2.

Tandem Computers has been successful for several years in marketing fault-tolerant computer systems, which shield the users from various types of failures of the software and hardware components. Other companies have entered this market, including August Systems, Auragen Systems, Computer Consoles, Stratus Computer, Synapse Computer, Syntrex, Sequoia, Tolerant Systems, and others [Electronic Business 1981; IEEE 1983]. Of these, Auragen, Computer Consoles, Stratus, and Synapse have focused on transaction processing; August Systems aims at industrial and commercial control.

A number of other systems address the issue of reliability and availability: Syntrex is marketing a local network file server called GEMINI; the MARK III Cluster File System, developed by General Electric, provides time-sharing services for its telephone-switching network users; and the Distributive Computing Facility developed by Bank of America automates the teller functions for customer accounts.

In addition, the late 1970s SRI SIFT project for aircraft control evoked the August Systems' products; Bolt Beranek and Newman (BBN) developed the PLURIBUS system for use as a highly reliable communications processor on ARPANET; and during the 1960s AT&T developed No. 1 ESS and No. 2 ESS (Electronic Switching System) for telephone switching services.

System D, a distributed transaction-processing system, was created as a prototype at IBM Research in San Jose with availability and modular growth as its major objectives, and there are other research projects currently under way at IBM Research to investigate availability and performance issues under various software/hardware structures.

The remainder of this paper is organized as follows. In Section 1, a taxonomy of system structures that has been used to construct highly available systems is developed, and a discussion is provided of the advantages and disadvantages of four possible structures. An intuitive set of criteria for highly available systems is given in Section 2. Systems belonging to each of the four system structures are surveyed and critiqued, where possible, in Sections 3 through 6. The discussions in these sections focus on various philosophies of system structure and transaction processing. In Section 7 the functions and structure of a software subsystem that provides a framework for high availability are described.

In view of the fact that such issues as concurrency control, recovery, transaction model, and network communications have been extensively addressed elsewhere, these aspects of highly available systems will not be given detailed treatment. Further, it is generally recognized that such mundane sources as downed telephone lines, careless computer operators, lack of defensive coding, and the way in which the operating system reacts to failures that it detects can seriously limit the availability of a system. Although important, such aspects are not within the scope of this paper.


1. MACHINE AND STORAGE ORGANIZATION TAXONOMY

Two fundamental decisions in constructing a multiple-processor system are the choice of machine organization and physical storage organization. The discussion in this section of the advantages and disadvantages of typical machine and storage organizations significantly benefits from Traiger [1983]. Two conventional techniques for organizing multiple processors are loosely coupled and tightly coupled multiprocessor organizations. In tightly coupled systems, two or more processors share main memory and disks, typically through an interconnection switch, and execute one copy of the operating system residing in the shared main memory. A local cache memory is usually associated with each processor to enhance access speed. In loosely coupled systems, each processor has not only a local cache memory but also its own main memory, and may or may not share disks with other processors. Each processor executes its own copy of the operating system from its own main memory. There are various ways to loosely couple the processors, including shared bus structures, cross-point switches, point-to-point links such as channel-to-channel adapters, and globally shared memories, as in Cm* [Swan et al. 1977].

Tightly coupled multiprocessor systems, such as the Synapse N+1 System and BBN's PLURIBUS, offer important potential advantages. First, they naturally present a single-system image, since multiple processors execute one copy of the operating system and share a common job queue. Second, the processors do not need to communicate via interprocessor messages, with their inherent overhead. However, this performance advantage may be offset by certain problems imposed by this architecture. First, there is contention among processors for the use of shared memory and other shared resources. This must somehow be reduced, especially if the cost/performance of the system is to keep up with its expansion as extra processors are added. Second, potentially complex techniques must be supported to ensure that the contents of each processor's cache memory are up-to-date.

In addition to these performance concerns, there is a potential availability problem with tightly coupled multiprocessors. All processors run the same operating system from shared main memory, and thus, when the operating system is corrupted or the shared memory system fails, the entire system must be restarted. Therefore application systems designed to run on a tightly coupled multiprocessor system must be able to restart very quickly in order to guarantee high overall availability.

Just as there are two techniques for organizing multiple processors, there are two ways to organize disks, and thus the database. One is to assign a set of disks to one processor and allow access to it only through that processor; the other is to have all of the processors share all the disks. The two techniques of organizing multiple processors and the two techniques of organizing disks are combined to give rise to four distinct system structures. The tightly coupled multiprocessor organization and the shared database organization result in a system structure that is called a tightly coupled system with a shared database. The loosely coupled multiprocessor organization gives rise to three other system structures. When a database is split into N partitions and each partition is stored in one set of disks assigned to one of N processors, the resulting system structure is called a loosely coupled multiprocessor with a partitioned database. When each of N processors can directly access the entire database, stored in one set of disks, the system structure is called a loosely coupled multiprocessor with a shared database. When an entire database is replicated in each set of disk volumes attached to each of N processors, and each processor computes the same user request in parallel, the resulting structure is called a loosely coupled multiprocessor with redundant computation.

The Tandem NonStop System, the Auragen System 4000, the Stratus Continuous Processing System, IBM's System D prototype (and its sequel, the Highly Available Systems project), Bank of America's Distributive Computing Facility, and AT&T's Stored Program Controlled Network are loosely coupled multiprocessors with partitioned databases. In this architecture, a database manager residing in each processor owns and manages the partition of the database assigned to that processor. Any user request (transaction) that requires access to more than one database partition is satisfied by message communication among the database managers that own the necessary partitions. Communication overhead is the single most significant disadvantage of the loosely coupled system with a partitioned database. There are two aspects to this interprocess(or) communication overhead: One is the messages sending requests to servers and receiving results from servers; another aspect is the messages and processing involved in the distributed commit protocol that ensures that the database, which is distributed across processors, is left in a globally consistent state when the transaction completes or aborts.

Some variation of the two-phase commit protocol described by Gray is used in committing or aborting a distributed transaction [Gray 1978]. One of the participating transaction managers is designated as the commit coordinator. During phase 1, the commit coordinator sends a "prepare to commit" message to all other participants. The participants reply with "yes" or "no" messages to the coordinator and enter phase 2. If the coordinator receives "yes" votes from all participants, it sends a "commit" message to all participants. If any participant replied with a "no" vote, the coordinator sends an "abort" message to all the participants. During phase 1 all participants retain the right to unilaterally abort the transaction. However, once they enter phase 2, they no longer can unilaterally abort the transaction; they must obey the decision of the commit coordinator.
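The two-phase exchange just described can be sketched as follows. This is a minimal single-threaded illustration of the protocol, not the implementation of any particular system; the Participant class, its vote values, and the message passing by direct method call are all invented for the example.

    # Minimal sketch of two-phase commit as described above: phase 1 collects
    # votes, phase 2 broadcasts the coordinator's decision. Logging, message
    # transport, and failure handling are omitted; all names are illustrative.

    class Participant:
        def __init__(self, name, will_commit=True):
            self.name = name
            self.will_commit = will_commit
            self.state = "active"

        def prepare(self):
            # During phase 1 a participant may still unilaterally abort.
            self.state = "prepared" if self.will_commit else "aborted"
            return "yes" if self.will_commit else "no"

        def decide(self, decision):
            # After voting "yes" the participant must obey the coordinator.
            self.state = decision

    def two_phase_commit(participants):
        votes = [p.prepare() for p in participants]          # phase 1
        decision = "committed" if all(v == "yes" for v in votes) else "aborted"
        for p in participants:                               # phase 2
            p.decide(decision)
        return decision

    nodes = [Participant("S1"), Participant("S2"), Participant("S3", will_commit=False)]
    print(two_phase_commit(nodes), [p.state for p in nodes])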
Loosely coupled multiprocessor systems with a shared database architecture, such as GE's MARK III Cluster File System, Computer Consoles' Power System, and IBM's AMOEBA research project [Traiger 1983], offer a potentially enhanced availability over a tightly coupled multiprocessor system, since the operating system is not shared among the processors. One disadvantage, however, is the lack of a single-system image; that is, system operators and users of such a system must contend with multiple copies of the operating system. One important advantage of this architecture over a loosely coupled multiprocessor system with a partitioned database is that it avoids the difficult problem of deciding which partition of the database should be stored in which processors' disks. Processors may be added to the system without having to repartition the database, and new disk drives may be added without having to worry about which processors should own them.

However, contention on the shared disks is a potential problem, with each processor moving the disk arms to random positions. Further, algorithms for coordinating the global locking and logging of database updates must be carefully designed. A global locking technique in which all database managers must acquire and release locks through a single global lock manager will cause excessive communication overhead and create a bottleneck for performance and availability.

Loosely coupled multiprocessor systems with redundant computation, such as Syntrex's GEMINI file server and SRI's SIFT (as well as its offspring, the Basic Controller of August Systems, Inc.), achieve fault tolerance by having more than one task perform the same computation and then comparing the results of the computation. In such applications as spacecraft control and process control in a nuclear power plant, correct results of computations are more critical than is the case in typical database applications. Further, the computations are well defined, and the results are often known in advance. For such applications, it makes sense to have multiple processors perform the same computations in order to detect (and even correct) conflicting results. But for office word-processing or transaction-processing applications, this approach may not be desirable.

For applications that require time-consuming computations and/or disk accesses, there tend to be ample opportunities for program checks and machine checks to detect low-level failures, and for time-outs or defensive coding to detect high-level failures. Unless the need for error detection and correction is highly critical, the redundant-computation approach appears to waste the processing power of the system. Further, the exchange of data and status information among the replicated tasks for each input and output is a considerable performance overhead.

2. INTUITIVE CRITERIA FOR HIGH AVAILABILITY

Now we must address the problem of what is meant by availability. The overall availability of a system may intuitively be defined as the ratio between the time when the end user and applications actually have access to all the database and the time when the end user and applications require access to the database. For example, if users require the system to be up for 8 hours a day and the system is actually up for 6 hours during the 8 hours, the availability of the system is 6/8 = .75 during the 8-hour period.
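For concreteness, the ratio just defined works out as follows. The figures are the 8-hour example from the text plus a hypothetical 99 percent target (a level the text later cites as typical for highly available systems); the function name is invented.

    # Availability as defined above: time the database is actually accessible
    # divided by the time users require access.

    def availability(up_hours, required_hours):
        return up_hours / required_hours

    print(availability(6, 8))              # 0.75, the 8-hour example above

    # A 99 percent target against a 24-hour-per-day requirement allows roughly
    # 0.01 * 24 * 60 = 14.4 minutes of inaccessibility per day.
    target = 0.99
    print((1 - target) * 24 * 60)          # 14.4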
This cri- becomes inaccessible in a loosely coupled terion excludes systems that rely on manual multiprocessor architecture with a parti- replacement of failed processors to survive tioned database, one may take the view that a single failure. the system is no longer available. In fact, The problem with the manual replace- this is my view for the purposes of this ment approach is that (1) the responsibility paper. However, one may equally well take of detecting failure often falls on users or the more charitable view that the system is the operators, and (2) the new processor is still largely available, since users may ac- aware of either the database or the termi-

Japan National Railways' MARS train seat reservation system, a loosely coupled multiprocessor system with a partitioned database, does not support the concept of backup processes at the present time [Tsukigi and Hasegawa 1983]. Hence, when the processor that manages one partition of the database crashes, no user can access the database partition until the database manager that owns it can be restarted. This system is not given detailed treatment here.

IBM's Information Management System (IMS) with the data-sharing feature [Strickland et al. 1982] is a loosely coupled multiprocessor system with a shared database, with an IMS/VS system in two different processors, each able to access the database in shared disks. When one IMS/VS crashes, its terminals lose access to the database until it can be restarted, and the surviving IMS/VS is not aware of the transactions in process on the system that failed, and hence cannot abort or complete them. We do not consider this system highly available.

The Stratus system does not currently support the concept of primary-backup processes; however, in view of the great extent to which the system incorporates hardware and software capabilities for on-line system reconfiguration, it is classified as a highly available system.

(3) The system must survive at least a single failure of such major components as processor, I/O channel, I/O controller, disk drives, and interprocessor communication medium. In particular, a single failure should not make any part of the database inaccessible to the users for beyond a reasonable recovery duration. A reasonable recovery duration may be 1 minute for a mini- or microprocessor system and perhaps 10 minutes for a mainframe, because of the larger number of terminals and applications dealt with by a mainframe system.

This criterion does not imply that a single point of failure is unacceptable, rather that when a single point of failure exists, it must not cause overall availability to suffer. For example, the Synapse N+1 System has a single point of failure in its shared main memory, but is designed to recover quickly and provides high overall availability. System D has a single point of failure in the electronic switch that connects a pair of processors with the shared disks and terminals, but the probability of failure of such a switch is very low and hence does not severely compromise overall availability.

(4) The system should support on-line integration of repaired or new hardware/software components. Further, in the case of a partitioned database, the system should support on-line migration of the database from one disk system to another.

(5) Additional features aimed at making component failures transparent to the users may be useful. One is the ability to automatically restart transactions in progress when a system crash occurs, which may require the data communication subsystem to log the transaction request on the disk. Another is for the interprocessor communication subsystem to reroute messages originally targeted to a failed process to its backup. In view of the fact that transactions in typical business data processing are short-lived, that is, they complete within a few seconds, it does not appear that important to burden a system with these additional capabilities.

3. LOOSELY COUPLED SYSTEMS WITH A PARTITIONED DATABASE

This section provides overviews of architectures and transaction-processing strategies as employed in the Tandem NonStop System, Auragen System 4000, Stratus/32 Continuous Processing System, AT&T's Stored Program Controlled Network, Bank of America's Distributive Computing Facility, and System D.

It is noted that the Auragen, Stratus, AT&T, and Bank of America systems are actually loosely coupled clusters of processors, in which each cluster consists of two or more processors and manages one partition of the database. The cluster itself is not necessarily a loosely coupled multiprocessor architecture. Further, the Auragen, AT&T, and Bank of America systems currently do not support distributed on-line transaction processing: User requests are completely processed within one cluster, without requiring the participation of other clusters.

Figure 1. Tandem NonStop system architecture. [Figure: processor modules, each with its own power supply, bus control, CPU, memory, and I/O channel, connected by the dual Dynabus; dual-ported I/O controllers (CNTL) attach below the modules.]

3.1 Tandem NonStop System

The NonStop System, developed by Tandem Computers about 1976, has made important contributions to the area of high availability [Bartlett 1978; Katzman 1977, 1978]. As shown in Figure 1, each processor module consists of a central processing unit (CPU), memory, an interface to an interprocessor bus system called Dynabus, and an I/O channel. Each of the I/O controllers is connected to two processors via its dual-port arrangement, and each processor is connected to all other processors via a dual Dynabus. Further, as shown in Figure 2, each processor is connected to a pair of disk controllers, which in turn maintain a string of up to four pairs of (optionally) mirrored disk drives. Mirroring is supported in the I/O supervisor, which issues two disk writes for each page of data to be written to the disk. Thus it is clear that the system provides many paths to data, and hence the data are available to the user regardless of any single failure of a disk drive, disk controller, I/O channel, or processor.

The Tandem system was designed to continue operation through any single component failure, and also to allow the failure to be repaired without affecting the availability of the rest of the system. Each processor module has a separate power supply, which can be shut off to replace the failed module without affecting the rest of the system. Similarly, each I/O controller is powered by two power supplies associated with the two processors to which it is attached, and can be powered down by a corresponding switch, without affecting the rest of the system. Thus the I/O controllers survive a single power failure, and each I/O controller and processor module can be repaired without shutting down the rest of the system.

In order to detect a processor failure in the Tandem system, each processor broadcasts an "I-AM-ALIVE" message every 1 second and checks for an "I-AM-ALIVE" message from every other processor every 2 seconds [Bartlett 1978]. If a processor decides that another processor has failed to send the "I-AM-ALIVE" message, it initiates recovery actions, as described later. Although there is a possibility that different processors may reach different decisions as to which processor has crashed, the single-failure assumption precludes any consideration (and prevention) of such a possibility.
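The "I-AM-ALIVE" scheme can be sketched as a broadcast-and-check cycle. The two timing constants below follow the text; everything else (the Processor class, the single-process simulation of message receipt) is hypothetical and not a Tandem interface.

    # Sketch of Tandem-style "I-AM-ALIVE" failure detection: every processor
    # broadcasts once a second and checks all other processors every two
    # seconds. This is a simulation in one process; names are illustrative.

    BROADCAST_INTERVAL = 1.0      # seconds, per the text
    CHECK_INTERVAL = 2.0          # seconds, per the text

    class Processor:
        def __init__(self, pid):
            self.pid = pid
            self.last_heard = {}  # peer id -> time of last I-AM-ALIVE message

        def receive(self, peer, now):
            self.last_heard[peer] = now

        def check(self, peers, now):
            # A peer not heard from within the check interval is presumed
            # failed, and recovery actions would be initiated for it.
            return [p for p in peers
                    if now - self.last_heard.get(p, -1e9) > CHECK_INTERVAL]

    p0 = Processor(0)
    clock = 0.0
    for step in range(5):
        clock += BROADCAST_INTERVAL
        p0.receive(1, clock)          # processor 1 keeps broadcasting
        if step < 2:
            p0.receive(2, clock)      # processor 2 falls silent after 2 seconds
    print(p0.check([1, 2], clock))    # [2] -> processor 2 is presumed failed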

Figure 2. Tandem NonStop disk subsystem organization. [Figure: two processors, each attached to a pair of disk controllers that manage a string of up to 8 disks.]

This "active" failure-detection approach of the Tandem system helps to detect a processor failure soon after it occurs. However, it is not very useful to say that a processor is "alive" simply because it can send the "I-AM-ALIVE" message, when tasks running in it may have crashed. Software failures must be detected by message time-outs.

The operating system that runs on the Tandem NonStop System is called Guardian. It is constructed of processes that communicate by using messages. Guardian provides high availability of processes by maintaining a primary-backup pair of processes, each in a different processor. The primary process periodically sends checkpoint information to its paired backup process, so that the backup will stand ready to take over as soon as the primary process fails. The checkpoint data from a primary I/O process contain information about the files that are opened and closed. The backup I/O process opens and closes the checkpointed files while the primary is still active, so that in the event of failure of the primary, the backup recovers and proceeds with normal processing without the time overhead of opening the files.

An "ownership" bit is associated with each of the two ports of an I/O controller, which indicates to each port whether it is the primary or backup. Only one port is active for the primary I/O process; the other is used only in the event of a failure to the primary port, and any attempt to access data through the backup port is rejected. Upon detecting or being notified of the failure of the primary process, the backup process instructs Guardian to issue a TAKE OWNERSHIP command to the backup port. This command causes the I/O controller to swap its two ownership bits and do a controller reset.

The database/transaction manager that runs on the Tandem system is called ENCOMPASS [Borr 1981]. ENCOMPASS consists of four functional components: a database manager, a terminal manager, a transaction manager, and a distributed transaction manager. The ENCOMPASS database manager is implemented as a primary/backup I/O process (called the DISCPROCESS) pair per disk volume. In other words, a primary database manager in one processor and its backup in another processor own the partition of the database stored in the disk volume to which the two processors are connected. The primary database manager checkpoints to the backup, so that if the primary fails before completing a transaction, the backup may take over the disk volume and redo committed transactions and abort incomplete ones.
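A process pair of the kind just described can be sketched roughly as follows. The checkpoint content (the set of open files) follows the description above; the class and method names are invented for illustration and are not Tandem interfaces.

    # Rough sketch of a primary/backup process pair: the primary checkpoints
    # its open-file set to the backup, so the backup can take over without
    # reopening files. All names are illustrative.

    class BackupProcess:
        def __init__(self):
            self.open_files = set()

        def on_checkpoint(self, open_files):
            # Mirror the primary's opens/closes while the primary is active.
            self.open_files = set(open_files)

        def take_over(self):
            return f"backup resumes with files already open: {sorted(self.open_files)}"

    class PrimaryProcess:
        def __init__(self, backup):
            self.backup = backup
            self.open_files = set()

        def open(self, name):
            self.open_files.add(name)
            self.backup.on_checkpoint(self.open_files)   # checkpoint to the pair

        def close(self, name):
            self.open_files.discard(name)
            self.backup.on_checkpoint(self.open_files)

    backup = BackupProcess()
    primary = PrimaryProcess(backup)
    primary.open("ACCOUNTS")
    primary.open("AUDIT")
    primary.close("AUDIT")
    print(backup.take_over())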
To run transactions against the database, the user provides two sets of programs: the Screen COBOL program and application server programs. The Screen COBOL program performs screen formatting and sequencing, data mapping, and field validation, and sends transaction requests to application server programs. The application server programs perform application functions against the database by invoking the database manager, the DISCPROCESS.

The Screen COBOL program is interpreted by the terminal management component of ENCOMPASS, called the Terminal Control Process (TCP). TCP is also configured as a process pair; the primary TCP checkpoints the backup TCP with data extracted by the Screen COBOL program from input screens.

The transaction management component of ENCOMPASS, which implements the conventional model of transactions, is called the Transaction Monitoring Facility (TMF). TMF consists of a lock manager, a log manager (called the AUDITPROCESS), and a recovery manager (called the BACKOUTPROCESS) to provide concurrency control and recovery of interleaved execution of concurrent transactions.


The user's Screen COBOL program interfaces with TMF to indicate the beginning and end of a transaction. The Screen COBOL program receives a transaction identification from TMF at the beginning of a transaction, and attaches the transaction identification to all transaction request messages that it sends to the application server programs. When the Screen COBOL program notifies end of transaction, TMF initiates a transaction commit protocol to complete the transaction and make the effect of the transaction permanent. TMF does not support a global deadlock detection mechanism; deadlocks are detected by time-out, where the time limit is specified as part of lock requests.

3.2 AT&T's Stored Program Controlled Network

AT&T's Stored Program Controlled (SPC) Network [Cohen et al. 1983] consists of two key components: the Network Control Point (NCP) and the Action Point (ACP), as shown in Figure 3. The NCP is a database system which manages a database of customer records that is geographically distributed over a network of computers interconnected by the Common Channel Interoffice Signaling (CCIS) network. The ACP is a telephone call processing system, and is a highly reliable No. 4 ESS.

Figure 3. AT&T's Stored Program Controlled Network. [Figure: a caller, the CCIS network, primary and standby NCPs, the OSN, and terminals.]

When a call is made to a customer of the expanded 800 services or Direct Services Dialing Capability (DSDC) services, the call is routed to an ACP. The ACP transmits the request to an NCP, which maintains the customer record. The NCP returns the response to the ACP, which in turn routes the call according to the response from the NCP. An administrative system called the User Support System (USS) is used to insert new customer records and update existing ones. The USS sends records to be inserted or updated through the Operations Support Network (OSN).

This system may require several NCPs, according to NCP capacity, performance requirements, and market forecasts. The database is partitioned and each partition is assigned to two NCPs, one primary and one backup, for call processing. A database partition is replicated in an NCP and its backup. One NCP may be a backup to another NCP with respect to one database partition, and a primary to that NCP with respect to another partition of the database. When a primary NCP fails and there are insufficient data at the site to restore its operation, its backup takes over while continuing to function as the primary for another database partition.

Each NCP is constructed using a 3B-20D processor [Mitze et al. 1983]. A 3B-20D consists of two identical processors: One component processor is active and the other is a standby at any given time. Each has its own main memory and control unit. Further, both processors share all the disks, and each processor has indirect access to the main memory of the other.

During normal operation, the active processor updates the main memory of the standby processor, which is ready at all times to take over if the primary should fail. At each NCP, the I/O channel, disk controller, and links to its standby NCP are duplicated. Further, the database partition managed by an NCP is quadruplicated, and stored in four separate sets of disk drives connected to the 3B-20D. The standby NCP in turn keeps four copies of the database partition.

For a customer record to be updated under normal conditions, a transaction is sent to the primary NCP with which the record is associated.

Figure 4. A module of Bank of America's DCF. [Figure: MHPs with communication lines and intermodule links, and processors with disk controllers and disks, connected by the intramodule link.]

The primary NCP checks the transaction for consistency and authorization; if it is valid, the primary NCP logs the transaction on the disk and sends an acknowledgment to the user. The primary NCP makes changes to its database and sends the update to its backup. The backup NCP applies the update to its database and acknowledges the primary, at which point the primary inserts a record in the transaction log and finishes the transaction. In the event of a network partition, when a primary NCP is disconnected from its backup NCP due to failure of the communication line, the primary updates its database without requiring agreement from its backup, but maintains a special history log of database changes, which it sends to its backup when communication is restored.

3.3 Bank of America's Distributive Computing Facility

Bank of America developed the Distributive Computing Facility (DCF) around 1978 to automate teller functions for customer checking and savings accounts [Good 1983]. By leased lines, branch offices access a customer accounts database in two data centers in Los Angeles and San Francisco. Each data center houses a DCF cluster, which consists of eight DCF modules interconnected by a local-area network. Each DCF module is a local-area network of four GA16/440 minicomputers, which manages one partition of the customer accounts database. Two of the four processors in a DCF module are communications front ends called Message Handling Processors (MHP). The other two are database back ends called File Management Transaction Processors (FMTP). The four processors communicate via a bus called the intramodule link.

As shown in Figure 4, each module is configured such that, under normal operation, each MHP is paired with one FMTP to operate on half the module's lines and database. When one MHP fails, the other takes control of all the lines. When one FMTP fails, the other takes control of the entire database of the module.

The DCF uses a simple scheme for detecting processor failures. Each processor has a watchdog timer, which it periodically resets. If the timer is not reset on schedule, the DCF module shuts the processor down and notifies the peer processor to assume the full workload. When an MHP times out on a transaction request to an FMTP, it assumes that the FMTP is dead and starts sending subsequent transactions to the other FMTP. The transaction messages are not logged, and so any transaction in progress on a failed MHP or FMTP is lost.
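The watchdog scheme just described can be sketched as follows: a healthy processor resets its timer on schedule, and a module-level monitor shuts down and fails over any processor whose timer has expired. The timer period, processor names, and monitor function are all hypothetical.

    # Sketch of watchdog-based failure detection as in the DCF: each processor
    # must reset its watchdog periodically; if it misses the deadline, the
    # module shuts it down and the peer takes the full load. Illustrative only.

    WATCHDOG_PERIOD = 1.0   # hypothetical deadline, in seconds

    class Watchdog:
        def __init__(self):
            self.last_reset = 0.0

        def reset(self, now):
            self.last_reset = now

        def expired(self, now):
            return now - self.last_reset > WATCHDOG_PERIOD

    def monitor(processors, now):
        """Shut down any processor whose watchdog expired; the peer takes over."""
        failed = [name for name, dog in processors.items() if dog.expired(now)]
        for name in failed:
            peer = "FMTP-B" if name == "FMTP-A" else "FMTP-A"
            print(f"{name} shut down; {peer} assumes the full workload")
        return failed

    dogs = {"FMTP-A": Watchdog(), "FMTP-B": Watchdog()}
    dogs["FMTP-A"].reset(0.5)        # FMTP-A resets on schedule
    # FMTP-B never resets after time 0
    print(monitor(dogs, now=1.5))    # ['FMTP-B']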


Figure 5. A Processing Module of Stratus/32. [Figure: duplicated CPUs, memories and memory controllers, disk controllers, communications controllers, a tape controller, and StrataLINK bus controllers attached to the dual StrataLINK buses.]

3.4 Stratus/32 Continuous Processing System

The Stratus/32 Continuous Processing System [Kastner 1983; Stratus 1982] consists of 1-32 Processing Modules, where each Processing Module consists of duplicated CPU, memory, controllers, and I/O, as shown in Figure 5. The memory may be configured to be redundant or nonredundant, as the two memory subsystems are not paired with the two CPUs. In a redundant configuration, the CPUs read from and write to both memory subsystems simultaneously; in a nonredundant configuration, each memory subsystem becomes an independent unit and the memory capacity is doubled. Each Processing Module has duplicated power supplies. The Processing Modules are connected through a dual-bus system called the StrataLINK.

The duplicated components (CPU and controllers) of a Processing Module each perform the same computation in parallel. Each component (board), in turn, consists of two identical sets of hardware components on the same board. As shown in Figure 6 for a disk controller, hardware logic compares the results of the computation by the two identical sets of hardware on the board. If the results are identical, they are sent to the bus or device. Otherwise, the results are not sent, the board is automatically disabled, and a signal is sent to the Stratus VOS operating system. However, processing continues with the board's duplexed partner in the Processing Module.
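The pair-and-compare principle can be mimicked in software as below. In the real Stratus/32 the comparison is done in hardware at the board's bus interface, so this is only an analogy, and every name in it is invented.

    # Software analogy of the pair-and-compare scheme described above: two
    # identical halves compute the same result; on a mismatch the output is
    # suppressed and the board is taken out of service. Purely illustrative.

    class SelfCheckingBoard:
        def __init__(self, half_a, half_b):
            self.halves = (half_a, half_b)
            self.in_service = True

        def request(self, *args):
            if not self.in_service:
                raise RuntimeError("board is out of service")
            a = self.halves[0](*args)
            b = self.halves[1](*args)
            if a != b:                       # comparison logic detects a fault
                self.in_service = False      # disable the board, signal the OS
                return None                  # nothing is sent to the bus/device
            return a

    checksum = lambda data: sum(data) % 256
    faulty = lambda data: sum(data) % 256 + (1 if len(data) > 3 else 0)

    good_board = SelfCheckingBoard(checksum, checksum)
    bad_board = SelfCheckingBoard(checksum, faulty)
    print(good_board.request([1, 2, 3, 4]))   # 10
    print(bad_board.request([1, 2, 3, 4]))    # None, and the board is disabled
    print(bad_board.in_service)               # False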


Figure 6. Self-checking disk controller in a Stratus/32. [Figure: two disk-control logic blocks and comparison logic on one board, connected to Bus A and Bus B.]

All detected hardware malfunctions are reported to maintenance software, which determines the cause and nature of the malfunction. The board is automatically restarted if the malfunction was caused by a transient error, whereas permanent errors result in the board remaining out of service and a report being sent to an operator terminal.

The Stratus system supports optional mirroring of disk volumes. It also allows on-line removal and replacement of all duplexed boards and associated peripheral devices; in particular, it allows on-line integration of a new Processing Module. When a duplexed component is replaced, the new component is automatically brought to the same state as its partner. For example, the second disk in a dual-disk system is brought up-to-date while the first disk is used for normal processing. The VOS operating system accomplishes this by writing new blocks of pages to both disks and copying blocks from the first disk to the second concurrently with normal processing.

Further, the Stratus system allows the memory subsystems to be dynamically reconfigured to redundant or nonredundant mode, without taking the Processing Module off line.

The Stratus system supports transaction processing by providing a Transaction Processing Facility (TPF), the VOS File System, the StrataNET network communications subsystem, and a Forms Management Facility. The system does not currently support a database management system; TPF invokes the File System to manipulate the database. TPF supports a two-phase commit protocol for on-line distributed transaction processing.

The Stratus system's continuous comparison of the results of computation from duplicated hardware components on the same board significantly reduces the probability of a hardware-induced error propagating and corrupting the system and data integrity, provided that the hardware that compares the results does not malfunction. Further, the Stratus hardware and operating system provide protection against some system crashes induced by software errors, such as attempts by one user's program to cross into another user's address space, to read or write into the operating system, to write into executable code, or to execute data.

The Stratus system does not currently support the concept of a backup subsystem, and hence applications running on Stratus may not survive software-induced crashes.


The Stratus fault-tolerance philosophy is based on the view that the hardware can detect errors and automatically shut down a malfunctioning component while an identical component operating in parallel continues to function, thus ensuring database integrity and providing continuous processing of user requests. In my opinion, this approach does not safeguard the system against crashes induced by a class of errors that even the most sophisticated operating systems cannot cope with. For instance, IBM's System/370 and its Multiple Virtual Storage (MVS) operating system [IBM 1979] provide extensive measures to detect hardware and software failures and to repair and recover from them. Yet applications running on MVS, and MVS itself, do occasionally crash, usually as a result of software failures.

3.5 Auragen System 4000

The Auragen System 4000, developed at Auragen Systems Corp., New Jersey, consists of 2-32 clusters of tightly coupled multimicroprocessors interconnected by a dual-bus system [Gostanian 1983]. Each cluster consists of three MC68000s, its own local memory, several types of I/O controllers, a power supply, and battery backup. One of the 68000s is used exclusively to execute the operating system, whereas the other two are used to execute user tasks. Each cluster periodically broadcasts an "I-AM-ALIVE" message to detect failure of other clusters; the time-out mechanism is used to detect process failures. All peripheral devices are dual ported and are attached to two different clusters. The Auragen system allows disks to be configured in mirrored pairs, and mirroring may be specified on a file basis.

The Auragen database system, called AURELATE, is based on the ORACLE relational database system [Weiss 1980]. As in the Tandem system, Auragen 4000 supports a primary-backup pair of processes, implemented within the AUROS operating system, which is an enhanced version of UNIX System III. To reduce the number of checkpoint messages, the AUROS operating system sends a collection of transaction messages to both the primary and the backup. After the primary has processed a predetermined number of these transaction messages, the backup is notified to process the checkpoint message. Once the backup finishes processing the checkpoint message, it is removed from the message queue.
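The message-based checkpointing just described can be sketched roughly as follows. The batch size and class names are hypothetical, and real AUROS delivers messages through the operating system rather than by direct method calls; this is only an illustration of the batching idea.

    # Sketch of Auragen-style checkpointing: every transaction message goes to
    # both primary and backup; after the primary has processed N of them the
    # backup is told to consume the batch as one checkpoint. Illustrative only.

    CHECKPOINT_BATCH = 3    # hypothetical "predetermined number" of messages

    class Backup:
        def __init__(self):
            self.queue = []          # messages retained for possible takeover

        def deliver(self, msg):
            self.queue.append(msg)

        def process_checkpoint(self, count):
            # The backup catches up and discards messages it no longer needs.
            del self.queue[:count]

    class Primary:
        def __init__(self, backup):
            self.backup = backup
            self.unacked = 0

        def handle(self, msg):
            self.backup.deliver(msg)          # message goes to both processes
            self.unacked += 1                 # primary does the real work here
            if self.unacked == CHECKPOINT_BATCH:
                self.backup.process_checkpoint(self.unacked)
                self.unacked = 0

    backup = Backup()
    primary = Primary(backup)
    for i in range(4):
        primary.handle(f"txn-{i}")
    print(backup.queue)    # only 'txn-3' is still queued after the checkpoint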
The Auragen system's current recovery technique is based on the roll-forward approach. When the backup takes over for the primary, it begins execution at the last point of synchronization with the primary. That is, it begins with the last checkpoint message in its message queue. A technique has been proposed that does not allow the backup to redo work that already may have been done by the primary before the crash [Gostanian 1983].

The recovery technique requires an undo log to allow transaction abort. However, a redo log is optional and is used for recovery from a single disk crash, when disk mirroring is not used. Since a backup process will redo in-progress transactions, Auragen's proposal for commit processing does not call for the Write-Ahead Log protocol. However, this leaves the system unprotected from simultaneous failures of both the primary and backup processes.

3.6 System D Prototype

System D is a distributed transaction-processing system designed and prototyped at IBM Research, San Jose, as a vehicle for research into availability and incremental growth of a locally distributed network of computers [Andler et al. 1982]. The system was implemented on a network of Series/1 minicomputers interconnected with an insertion ring, and was the predecessor to the Highly Available Systems project currently under way at IBM Research, San Jose [Aghili et al. 1983; Kim 1982]. Although System D is not itself a highly available system, its rather novel transaction-processing and failure-diagnosis strategies warrant discussion here.

The System D transaction-processing software consists of three distinct types of modules: application, data manager, and storage manager modules. A module is a function that exists in a node and may consist of one or more processes called agents. The application module, called A, provides user interfaces for interactive users or application programmers.

Figure 7. Processor pair in System D. [Figure: two processors sharing disks and terminals through a two-channel switch.]

The data manager module, called D, transforms the record-level requests from the A module into page-level requests for the storage manager module. All the changes made by an application are kept locally in the data manager module, which sends them to the storage manager only when the application commits the data changes to stable storage. The storage manager module, called S, supports multiple concurrent transactions against the physical database and database recovery from failures.

As shown in Figure 7, terminals and shared disks are connected to a pair of processors through an electronic switch, called a Two-Channel Switch (TCS). Only one of the processors has access to the shared disks and terminals at any given time. The processor periodically resets the timer in the TCS; if this does not occur on schedule, the TCS switches the shared disks and terminals over to the other processor. Thus, rather than requiring the processors to communicate with each other, System D gives the TCS the task of detecting the failure of a processor. Moreover, the TCS prevents a processor that is declared dead from writing to the disk.

System D provides a software subsystem called the Resource Manager (RM), which is responsible for diagnosing and taking appropriate actions to recover from the failures of modules and agents. The basic premise of its design is that the time-out mechanism detects all failures, including deadlock, agent or module crash, communication medium failure, or processor failure. Failures are detected only when service requests are sent.

Rather than the Tandem-like notion of primary and backup processes, System D supports multiple agents of a module running in the same processor. The Resource Manager attempts to bring down and restart failed agents, while normal service requests are handled by other agents. In the event of a processor failure, agents are brought up in the backup processor and all transactions in progress are aborted. The initial program load (IPL) of a low-end processor normally takes under 1 minute, which was considered a reasonable recovery duration, and System D was designed to avoid the overhead of maintaining synchronized pairs of primary and backup processes.

The failure-diagnosis logic described below was necessary because of the inadequate failure-detection capabilities of the operating system on which System D ran. The RM attempts to diagnose the problem by first attempting to establish communication with the node in which the agent is running. If the response is positive, the RM issues an "ABORT_TRANSACTION" command to the agent. The rationale here is that the service-request message may have experienced a data- or timing-dependent error that may not recur if the transaction is aborted and resubmitted. Successful processing of the "ABORT_TRANSACTION" command also indicates that the agent involved probably has not crashed.

If the transaction cannot be aborted, the RM attempts to bring down and restart the agent, since the code or the control structures of the agent may have been destroyed. If the agent cannot be brought down, the RM attempts to shut down and restart the module itself, since the control structures shared by the agents of the module may have crashed.

If the RM fails to shut down and restart the module, probably the remote RM or the operating system has crashed, in which case an IPL must be executed remotely for the node in question.

Now, if the RM fails to establish communication with the node in which the resource resides, or if the remote IPL was not successful, it will try to communicate with the RM in the backup node of the resource, since the TCS has probably switched. If the TCS has not switched over, the RM attempts to remotely execute an IPL for the node in which the resource resides. The reason is that initially it may have failed to communicate with the RM in that node because the RM, the CSS, or the operating system in that node may have crashed.
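The escalation order used by the Resource Manager can be summarized in a sketch like the one below; each step is attempted only if the previous one fails, mirroring the description above. The function and step names are invented stand-ins, not System D interfaces.

    # Sketch of the System D Resource Manager's escalating diagnosis, as
    # described above. Each action returns True on success; the real checks
    # and commands are replaced by illustrative stand-ins.

    def diagnose(actions):
        """Try recovery actions in order; stop at the first one that succeeds."""
        for name, action in actions:
            if action():
                return name
        return "remote IPL of backup node also failed"

    # Hypothetical stand-ins for the RM's steps, in the order the text gives.
    steps = [
        ("abort and resubmit the transaction",                lambda: False),
        ("restart the failed agent",                          lambda: False),
        ("restart the whole module",                          lambda: False),
        ("remote IPL of the node",                            lambda: False),
        ("contact the RM in the backup node (TCS switched)",  lambda: True),
    ]
    print(diagnose(steps))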
4. TIGHTLY COUPLED SYSTEMS WITH A SHARED DATABASE

The Synapse N+1 Computer System is an on-line transaction-processing system, developed by Synapse Computer Corporation [Jones 1983], which provides a dual path from a processor to the database stored in secondary storage devices. As Figure 8 shows, Synapse N+1 is a tightly coupled multiprocessor system with shared memory, in which processing is divided between general-purpose processors (GPP) and I/O processors (IOP). Currently, up to 28 processors may be attached to a dual-bus system called the Synapse Expansion Bus. GPPs execute user programs and most of Synapse's Synthesis operating system from the shared memory. IOPs each manage up to 16 I/O controllers or communications subsystems. IOPs have direct memory access (DMA) capability to the shared main memory, but execute part of Synthesis from their own local memory. The Advanced Communications Subsystem (ACS) is a 68000-based communications controller, and the Multiple-Purpose Controller (MPC) is a controller for various devices. Whereas the Tandem NonStop system powers each processor with a separate power supply, the entire Synapse N+1 is powered by one set of duplicated power supplies.

Synapse enhances availability simply by having one more processor, disk controller, and disk drive than is necessary for satisfactory performance. This extra component is not an idle backup; it is used for normal transaction processing. When a component fails, the system sometimes has to restart and reconfigure itself without the failed component.

The Synapse system also supports mirroring of disks. Mirroring is supported in the IOP, where two disk writes are issued for each page of data to be written to the disk. It is interesting that the Synapse system allows mirroring of disks on a logical volume basis, rather than a physical volume basis. A logical volume may be all or part of a physical volume. Mirroring a logical volume is more flexible than mirroring a physical volume; for example, it allows storage on the same physical volume of a separate database that does not require high availability (and its associated overhead).
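As a small illustration of per-logical-volume mirroring, the sketch below issues the second physical write only for volumes configured as mirrored; the class names and page-numbering scheme are invented for the example and are not Synapse interfaces.

    # Hypothetical sketch: mirroring chosen per logical volume, as described above.
    class PhysicalVolume:
        def __init__(self, name):
            self.name = name
            self.pages = {}

        def write(self, page_no, data):
            self.pages[page_no] = data

    class LogicalVolume:
        """A logical volume occupies all or part of a physical volume."""
        def __init__(self, primary, base, size, mirror=None):
            self.primary, self.base, self.size = primary, base, size
            self.mirror = mirror      # second physical volume only if mirrored

        def write_page(self, page_no, data):
            assert 0 <= page_no < self.size
            self.primary.write(self.base + page_no, data)
            if self.mirror is not None:
                self.mirror.write(page_no, data)   # the IOP's second disk write

    disk_a, disk_b = PhysicalVolume("disk-a"), PhysicalVolume("disk-b")
    critical_db = LogicalVolume(disk_a, base=0, size=1000, mirror=disk_b)
    scratch_db = LogicalVolume(disk_a, base=1000, size=500)   # same disk, unmirrored
    critical_db.write_page(7, b"payroll page")    # two physical writes
    scratch_db.write_page(7, b"temporary page")   # one physical write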


Figure 8. Synapse N+1 System architecture. (The original figure shows up to 28 processors and up to 16 Mbytes of shared main memory attached to the dual Synapse Expansion Bus, with each IOP supporting up to 16 adapters, up to 16 communication lines, and up to 8 disks.)

Synapse N+1 supports recoverable transactions using the Write-Ahead Log protocol. The Synapse database manager is a relational system. Further, the database may be migrated from one physical disk volume to another and may undergo structural changes without taking the applications off line.

The Synthesis operating system incorporates various techniques to optimize the performance of production transaction-processing systems. One is the placement of its database system directly above the kernel of Synthesis, rather than on top of the file system as has been the case with most database systems. The reason for this decision was to reduce the overhead associated with I/O requests from the database system to the operating system, both to retrieve data from disk and to store the log of database changes after transaction processing.

An interesting side benefit of this approach is that higher levels of the Synthesis operating system can use the capabilities of the database system to query and manipulate data about operating system objects, such as files and devices. In particular, a higher level of the Synthesis operating system, called the transaction-processing domain, records the state of currently active applications in the database. An application consists of a number of programs; each program takes as input a screenful of information, processes it, and outputs a screen. The state information of an active application consists of a user identification, the identification of the current screen, the contents of its variable fields, and the next program to execute. During restart following a crash, the transaction-processing domain creates and dispatches a task for each terminal using this state information.
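The per-terminal state record and the restart loop can be sketched as follows; the data layout and function names are hypothetical and only illustrate the bookkeeping described above, not the Synthesis interfaces.

    # Hypothetical sketch of restart from application state kept in the database.
    from dataclasses import dataclass, field

    @dataclass
    class ApplicationState:
        user_id: str
        current_screen: str
        field_contents: dict = field(default_factory=dict)
        next_program: str = "main_menu"

    # One entry per active terminal, recorded by the transaction-processing domain.
    active_applications = {
        "term-01": ApplicationState("alice", "order_entry", {"qty": "3"}, "price_lookup"),
        "term-02": ApplicationState("bob", "login"),
    }

    def dispatch(terminal, state):
        print(f"{terminal}: resume {state.next_program} for {state.user_id} "
              f"at screen {state.current_screen} with fields {state.field_contents}")

    def restart_after_crash(state_table):
        """Create and dispatch a task for each terminal from its saved state."""
        for terminal, state in state_table.items():
            dispatch(terminal, state)

    restart_after_crash(active_applications)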

ComputingSurveys, "Col.16, No. 1, March 1984 88 • Won Kim program to execute. During restart follow- this risk is still present. This problem can- ing a crash, the transaction-processing do- not be resolved even if shared memory is main creates and dispatches a task for each made redundant. Synapse N+I does not terminal using this state information. support a redundant shared memory. Another performance optimization in Synthesis is the elimination of task-switch 5. LOOSELY COUPLED MULTIPROCESSOR overhead by replacing task switches with SYSTEM WITH A SHARED DATABASE cross-domain (address space) calls. In other General Electric's MARK III Cluster File words, each of the domains (layers) of Syn- System and Computer Consoles' Power thesis has direct addressability to a parti- System use the loosely coupled multi- tion of the segmented virtual address space, processor architecture, in which each pro- and a request from a task for an operating cessor may access any of the disks. The system service is implemented as a jump to AMOEBA project [Traiger 1983] at IBM the address space of the server's domain. Research, San Jose, also uses the same In order to reduce contention on the architecture, but is still in the research shared memory, Synapse adopted a caching stage. scheme in which modifications to the cache in each GPP are not written through (to 5.1 GE MARK III the shared memory), and fetch requests for the portion of the shared memory read and MARK III is a time-sharing system devel- modified by another GPP in its cache are oped by the Information Services Business resolved between the processors. Division of General Electric to provide its The Synapse system has implemented customers with local-call access to MARK failure detection, reconfiguration, and re- III computing capabilities [Weston 1978]. start procedures in a read-only memory The primary objectives of the system were (ROM). Its failure detection distinguishes high availability, reliability, and maintain- two classes of failures: process-fatal failure ability. Three supercenters (computing and system-fatal failure. A process-fatal centers), located in Ohio, Maryland, and failure causes the process and its associated Amsterdam, provide computing power to transaction to be aborted and restarted, the users. The computing facilities at a whereas a system-fatal failure results in a supercenter typically consist of front-end restart of the entire system. processors, MARK III foreground proces- The kernel of the operating system is sors, and MARK III background proces- capable of recognizing any failures caused sors, as shown in Figure 9. by internal machine checks and is respon- The front-end processors are network sible for initiating the reconfiguration and front-end processors (central concentra- restart of the system. In particular, it acti- tors). The foreground processors support vates self-test code to verify whether each interactive users, whereas the background major hardware component is operational. processors provide batch-processing capa- After any failed components are configured bilities. The foreground and background out of the system, each domain (layer) of processors are interconnected via a Bus the operating system is reinitialized. Since Adapter, which allows job and file - the database system is a layer directly ment between the foreground and back- above the kernal operating system, trans- ground systems. action restart and recovery take place at The Bus Adapter consists of a micro- this time. 
5. LOOSELY COUPLED MULTIPROCESSOR SYSTEM WITH A SHARED DATABASE

General Electric's MARK III Cluster File System and Computer Consoles' Power System use the loosely coupled multiprocessor architecture, in which each processor may access any of the disks. The AMOEBA project [Traiger 1983] at IBM Research, San Jose, also uses the same architecture, but is still in the research stage.

5.1 GE MARK III

MARK III is a time-sharing system developed by the Information Services Business Division of General Electric to provide its customers with local-call access to MARK III computing capabilities [Weston 1978]. The primary objectives of the system were high availability, reliability, and maintainability. Three supercenters (computing centers), located in Ohio, Maryland, and Amsterdam, provide computing power to the users. The computing facilities at a supercenter typically consist of front-end processors, MARK III foreground processors, and MARK III background processors, as shown in Figure 9.

The front-end processors are network front-end processors (central concentrators). The foreground processors support interactive users, whereas the background processors provide batch-processing capabilities. The foreground and background processors are interconnected via a Bus Adapter, which allows job and file movement between the foreground and background systems.

The Bus Adapter consists of a microprocessor, programmable read-only memory (PROM) control memory, and channel interfaces to the background systems. The microprocessor polls each of the channel interfaces for data transfer requests; each request is fully processed before the next request is serviced.

In 1975 GE developed new software to allow a single foreground processor of the

Figure 9. GE's MARK III Cluster File System architecture. (The original figure shows front-end communications processors (GEPAC 4020s) connected to the MARK III foreground processors (6080s), which are linked through the Bus Adapter to the MARK III background processors.)

MARK III system to access more than one disk system at a time. This new system, called the Cluster File System, controls concurrent access to multiple disk systems from multiple foreground processors by maintaining the access-conflict tables in one stable memory device accessible to all foreground processors. The Scratch Pad (SPAD) was developed to provide the high reliability, nonvolatility, and fast access time that such a memory device requires. All data transfers to and from the disk systems take place over normal I/O channels, but access-control decisions are made through use of the access tables in the SPAD.

Since the SPAD is the central point through which all requests are funneled, it was built with redundancy to prevent total failure of the cluster system resulting from SPAD and Bus Adapter failures. As shown in Figure 10, the MARK III Cluster File System has two Bus Adapters, primary and secondary. Each foreground processor has access paths to both Bus Adapters. The memory and devices in the SPAD dedicated to each disk system are themselves duplicated. Each of the Bus Adapters can access both the primary and backup elements of the SPAD. The Bus Adapter was augmented to support the functions of, and dual access paths to, the SPAD, and the microcode in the Bus Adapter does dual reads and writes to the SPAD memory.

The SPAD contains 16 memory and logic devices (8 primary and 8 backup), each for a separate disk system. The Maryland supercenter supports seven disk systems. Each device contains the access-conflict table, which indicates whether a particular file in a disk system is in use and, if so, whether the file is sharable by other users. One device, called the Cluster Control device (CLUSCON), is used for such global functions as processor status and recovery status. Each foreground processor periodically places its status in CLUSCON and checks to see if any other foreground processor has failed to do so in the previous interval. If it finds that another processor failed to update its status, it proceeds to clean up all resources belonging to the failed processor.
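The CLUSCON status check amounts to a heartbeat table kept in the stable SPAD memory. The sketch below is a hypothetical rendering of that check (invented names and a fixed interval); it is not GE code.

    # Hypothetical sketch of the CLUSCON processor-status protocol described above.
    INTERVAL = 5.0                      # seconds between status updates

    cluscon = {}                        # processor id -> time of last status update

    def post_status(processor_id, now):
        cluscon[processor_id] = now

    def check_peers(my_id, now):
        for peer, last in list(cluscon.items()):
            if peer != my_id and now - last > INTERVAL:
                clean_up_resources(peer)   # release files owned by the failed processor
                del cluscon[peer]

    def clean_up_resources(peer):
        print(f"cleaning up resources held by failed processor {peer}")

    post_status("fg-1", now=0.0)
    post_status("fg-2", now=0.0)
    post_status("fg-1", now=6.0)        # fg-2 missed the previous interval
    check_peers("fg-1", now=6.0)        # fg-1 cleans up after fg-2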
The MARK III Cluster File System is protected from single failures by redundancy in the interconnections between the front-end and foreground processors and between the foreground and background processors. The front-end processors (network central concentrators) are connected to remote concentrators, to which user terminals are connected. The connection between the central concentrators and remote concentrators is accomplished via redundant network-switching computers.

An obvious drawback of the MARK III Cluster File System is that the SPAD may become a performance bottleneck, especially with an expanded system, since each foreground processor must access it before accessing a shared file.

Figure 10. Redundancy in the Bus Adapter and SPAD. (The original figure shows the MARK III foreground processors connected to primary and backup Bus Adapters, each with access paths to the duplicated SPAD devices 0-3 and 4-7.)

5.2 Computer Consoles' Power System

The Power Series systems have been developed at Computer Consoles, Inc., Rochester, New York [West et al. 1983]. The system is based on Motorola 68000s, and consists of a number of application processors and two coordination processors with front-end processors. The processors communicate via a dual-bus system called the Data Highway, as shown in Figure 11.

Each application processor (AP) is directly connected to all disks, and independently executes different user applications in parallel. The interprocessor coordination controllers (ICC) synchronize global operations among the APs, which consist mostly of lock requests to gain access to the shared database and system status changes due to reconfiguration. At any given time, one ICC is active and the other is a standby. The standby ICC is the only idle component of the system. The front-end processors (FEP) perform screen formatting, distribute transactions to APs, receive replies from APs, and assist in recovery from some system failures. Further, FEPs attempt to balance the load on the APs by distributing the transactions to the APs on the basis of application configuration and flow control information.

Unlike many systems that implement transaction management on top of a general-purpose operating system, the Power System combines process management and transaction management in its PERPOS operating system. The PERPOS operating system supports a number of features to enhance the performance of critical applications. To allow concurrent execution of transactions with different response requirements, it provides facilities to fix critical applications in memory and run them before other applications.

Since the database manager in each AP can access the entire database, the Power System only needs the standard concurrency control and recovery techniques used for a central database. In particular, it does not need the coordinated commit protocol required by loosely coupled multiprocessors with partitioned databases.

Transaction recovery is done in a straightforward manner. The AP that receives the transaction from the FEP logs the transaction on the disk. When the FEP detects that the AP has crashed, it requests another AP to abort the transaction and restart it from the log of the failed AP.

As in other systems, interprocessor message time-outs are used to detect processor failures. In addition, the primary ICC periodically polls the APs and FEPs to detect failures of processors that do not happen to be in communication with other processors. The ICC supports on-line system reconfiguration after a disk crash and on-line integration of new or repaired disks.

The primary ICC does not keep the standby up-to-date on the global lock table and the system configuration information. Rather, when the primary ICC fails, the standby requests status and lock information from all the APs and reconstructs the global lock table.

In order to reduce the communication overhead resulting from lock requests to the ICC, the system distinguishes shared files and nonshared files. When an AP opens a file for the first time, it considers the file nonshared, and does not make a lock request to the global lock manager in the ICC. When an AP opens a file currently owned by another AP, that file becomes a shared file under the jurisdiction of the global lock manager.
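One way to render the nonshared-to-shared promotion in code is sketched below; the paper does not give this level of detail, so the structures, the owner table, and the decision to register both openers with the ICC are assumptions made for the illustration.

    # Hypothetical sketch of the shared/nonshared file distinction described above.
    class GlobalLockManager:                      # plays the role of the primary ICC
        def __init__(self):
            self.locks = {}                       # file -> set of APs holding a lock

        def lock(self, ap, filename):
            self.locks.setdefault(filename, set()).add(ap)

    class ApplicationProcessor:
        def __init__(self, name, icc):
            self.name, self.icc = name, icc

        def open_file(self, filename, owners):
            owner = owners.get(filename)
            if owner is None:
                owners[filename] = self.name       # nonshared: no message to the ICC
            elif owner != "SHARED":
                owners[filename] = "SHARED"        # promote to a shared file
                self.icc.lock(owner, filename)     # assumed: prior owner registers too
                self.icc.lock(self.name, filename)
            else:
                self.icc.lock(self.name, filename) # already shared: ask the ICC

    icc = GlobalLockManager()
    owners = {}                                    # file -> owning AP, or "SHARED"
    ap1, ap2 = ApplicationProcessor("AP1", icc), ApplicationProcessor("AP2", icc)
    ap1.open_file("accounts.dat", owners)          # first open: nonshared, local only
    ap2.open_file("accounts.dat", owners)          # second AP: file becomes shared
    print(icc.locks)                               # {'accounts.dat': {'AP1', 'AP2'}}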

Figure 11. Power System architecture. (The original figure shows the front-end processors, application processors, and interprocessor coordination controllers connected by the dual-bus Data Highway, with links to terminals and other systems.)

One potential drawback of the Power System architecture is that, as the number of APs increases, the ICCs may become a performance bottleneck. Further, the Power System's performance may be enhanced by generalizing its locking technique to the lock hierarchy technique similar to that found in IMS/VS [Strickland et al. 1982]. In this scheme, the database is logically partitioned, with each partition assigned to a different data manager that can acquire and release locks on data objects within its partition. The data manager consults the global lock manager only when it must lock and unlock objects outside its partition. This strategy potentially allows transfer of updated data pages from the buffer pool of one partition's data manager to another data manager. Traiger [1983] speculates on this in more detail.
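A sketch of this lock hierarchy is given below. The routing of out-of-partition requests through the owning data manager is one plausible arrangement, chosen so the example is self-consistent; the paper leaves the details open, and all names are invented.

    # Hypothetical sketch of the suggested lock hierarchy for the Power System.
    class LockManager:
        def __init__(self):
            self.held = set()

        def lock(self, obj):
            if obj in self.held:
                return False              # conflict (a single lock mode, for brevity)
            self.held.add(obj)
            return True

    class GlobalLockManager:
        """Handles objects outside the requester's partition by forwarding the
        request to the owning data manager (an assumption for this sketch)."""
        def __init__(self):
            self.owners = {}              # object id -> owning data manager

        def register(self, dm):
            for obj in dm.partition:
                self.owners[obj] = dm

        def lock(self, obj):
            return self.owners[obj].local_locks.lock(obj)

    class DataManager:
        def __init__(self, partition, glm):
            self.partition = set(partition)
            self.local_locks = LockManager()
            self.glm = glm
            glm.register(self)

        def lock(self, obj):
            if obj in self.partition:
                return self.local_locks.lock(obj)   # no traffic to the ICC
            return self.glm.lock(obj)               # out-of-partition request

    glm = GlobalLockManager()
    dm_a = DataManager({"acct:1", "acct:2"}, glm)
    dm_b = DataManager({"acct:3"}, glm)
    assert dm_a.lock("acct:1")            # local, no global lock manager involved
    assert dm_a.lock("acct:3")            # routed through the global lock manager
    assert not dm_b.lock("acct:3")        # the owner sees the conflict locally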

6. REDUNDANT COMPUTATION SYSTEMS: SYNTREX'S GEMINI FILE SERVER

GEMINI is a file server developed by Syntrex, Inc., with the objective of uninterrupted operation, without backup, in the event of a single failure of any hardware or software component [Cohen et al. 1982]. Up to 14 workstations (word-processing terminals) connected to a GEMINI file server on a local-area network can share files and printers. A GEMINI system may be connected to other GEMINI systems through an Ethernet-like network.

Figure 12. Syntrex's GEMINI file server architecture. (The original figure shows the two identical halves, each with an Aquarius interface, a disk controller, a shared memory, and a secondary storage system.)

As shown in Figure 12, GEMINI consists of two identical halves. Each half has the Aquarius interface (AI), a disk controller (DC), and a shared memory (SM), as well as a secondary storage system. The AI is a communications subsystem, implemented on an Intel 8088 microprocessor, that operates between GEMINI and the workstations, and between the AIs of the two halves. The DC, implemented on an Intel 8086, provides file storage and management for the workstations. Communication between an AI and a DC takes place through the shared memory (SM).

The two halves of GEMINI perform identical computations. Each half receives the same request from the workstations and retrieves and updates files in its secondary storage. When both halves are operational, one is designated the master and the other the slave. The only difference between them is that only the results from the master are returned to the workstations.

Each half continuously monitors the well-being of the other half. When one half

crashes, it is powered off and the other half continues as the master. While one half is down, its secondary storage system becomes out of date; when it is repaired, a utility is run to bring its secondary storage system up-to-date. The active system is suspended until copying is completed. Thus GEMINI does not support on-line reintegration of repaired components.

A time-out mechanism is used in the communication between the AIs of GEMINI and the workstations. If the workstation times out on its request, it retransmits the request; if GEMINI times out, it goes to receive mode and waits for the workstation to time out and retransmit the request. This means that the workstation time-out value is longer than the GEMINI time-out.

Another aspect of the reliability measures incorporated in the AI is the periodic self-checks, including auditing of input buffers, checking of the clock, memory tests, and testing of the DC and the communication link between the AIs. If any test fails, the AI logs the failure and, if possible, informs the other AI.

The heart of mutual checking in GEMINI is embedded in the AI-AI communications procedures. The availability and reliability of GEMINI critically depend on the assumption that the disk controllers (DC) of both halves receive and perform the same computation, and therefore that the two secondary storage systems are left with the same data at the end of computation. Since it is possible for one AI to receive a correctly transmitted request while the other receives the request with a transmission error, the two AIs are required to exchange status information about the requests that they received. The request is processed only when both AIs agree that they received the same request [without a cyclic redundancy check (CRC) error]. Similarly, the AIs exchange information about the results of the computation to verify their correctness. It is possible for one AI to have completed a computation before the other is done. In such a situation, the slow half sends a notice that it is "working on the request," to prevent the other half from concluding that it is down. GEMINI takes precautions against an infinite sequence of "I am done" and "I am working on it" messages between the AIs. Since the results of computations may be too long, sometimes only the types of results are exchanged between the AIs. Therefore it appears that sometimes GEMINI may not detect conflicting results generated by the two AIs. Further, when GEMINI does detect conflicting results, it arbitrarily assumes that the master is correct. A meaningful vote really cannot be taken with fewer than three processors.
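The request-agreement step can be sketched as follows; this is a loose illustration with invented names (the published description does not give the AI-AI message formats), and a CRC stands in for whatever error check the hardware actually applies.

    # Hypothetical sketch of the AI-AI mutual check described above.
    import zlib

    def receive(raw_bytes, corrupted=False):
        """Each AI reports the request it received together with its checksum."""
        crc = zlib.crc32(raw_bytes) ^ (1 if corrupted else 0)
        return raw_bytes, crc

    def agree_on_request(master_rx, slave_rx):
        (m_req, m_crc), (s_req, s_crc) = master_rx, slave_rx
        # Process only if both halves received the same request without error.
        return m_req == s_req and m_crc == s_crc == zlib.crc32(m_req)

    def cross_check_results(master_result, slave_result):
        # Long results are compared only by type/summary, so some conflicts can
        # escape detection; on a detected conflict, the master is trusted.
        if type(master_result) is not type(slave_result):
            print("conflict detected: keeping the master's result")
        return master_result

    request = b"read /budget/q1"
    if agree_on_request(receive(request), receive(request, corrupted=True)):
        print("both halves process the request")
    else:
        print("disagreement: wait for the workstation to retransmit")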

7. A FRAMEWORK FOR THE MANAGEMENT OF SYSTEM CONFIGURATION

It is clear from the preceding discussions that the design of existing systems has been guided by the single-failure assumption; that is, these systems can become unavailable if a software or hardware component fails while another related component has failed. Despite the general success of some of these systems, notably the Tandem system, the single-failure assumption may not be valid. Future systems may be required to tolerate multiple concurrent failures.

Most existing systems and those that are currently being developed are constructed with mini- and microcomputers. If relatively expensive medium- to high-end processors were to be used, it might not be economically feasible to keep spares around for use as replacements for malfunctioning processors. In that case, the mean time to repair such processors could be relatively long, increasing the probability that other subsystems or processors might go down before the failed processors are repaired. In addition, the software in most existing systems was developed from scratch to run on minicomputers and to support only prospective new customers. Existing database and operating systems required by medium- to high-end processors tend to be complex, and the mean time between failures for these systems due to software-induced failures is expected to be shorter than that for simple transaction-processing systems that run under relatively simple operating systems.

If a system is to be continuously operational, it must guarantee availability not only during multiple concurrent failures of software and hardware components, but also during on-line changes of software and hardware components and on-line physical reconfiguration of the database and database backups. The latter problems do not appear to be properly addressed by most systems. Syntrex GEMINI, for example, requires system shutdown when a repaired processor is reintegrated into the system, and in general most systems force applications off line when the database is physically reorganized.

An architecture of a distributed software subsystem that can serve as a framework for constructing database application systems to meet most availability requirements is outlined in the remainder of this section. This software subsystem is called an auditor. The description of the functions and architecture of the auditor given here is based largely on my own research. A design based on this is currently being implemented for the Highly Available Systems project at IBM Research, San Jose [Aghili et al. 1983].

An auditor is a framework for total coherent management of the software and hardware components of a highly available distributed system. It will serve as the repository of failure reports from various components of the system and reconfiguration requests from the system operator (for on-line changes). It will analyze the status of all resources it manages, and compute the optimal configuration of the system in response to multiple concurrent failures of components and requests for load balancing. Further, it will initiate system reconfiguration and monitor its progress in order to effect mid-course correction of a reconfiguration that does not succeed, and, finally, it will diagnose a class of failures that other components fail to recognize.

From the discussions of the survey portion of this paper, it should have become clear to the reader that most systems provide many of the functions outlined for the auditor. All systems discussed support automatic detection of process and processor failures, followed by automatic switchover to a backup or notification to the service center. Many systems also support on-line integration of new or repaired software (process) and hardware components (processor, I/O controller, disk drives).

However, the implementation of the auditor functions in many systems suffers from two shortcomings. First, these functions have often been implemented as a loose collection of specialized routines, rather than as a single coherent subsystem.

Second, the functions often are implemented to tolerate only a single failure of the system resources; as a result, the systems cannot cope with multiple failures, even when they have sufficient hardware redundancy.

Within this framework, one auditor will reside in each processor, but only one of the auditors may be designated as the audit coordinator. As pointed out by Garcia-Molina [1982], to allow each auditor to initiate reconfiguration may cause confusion or result in a less than optimal system configuration, and the notion of the audit coordinator is therefore central to the operation of the audit mechanism. If the coordinator crashes, a new coordinator must first be established, either by an election, as suggested by Garcia-Molina [1982], or by means of a dynamic succession list to which all the auditors have previously agreed [Kim 1982]. A succession list contains the system-wide unique rank for each auditor, to indicate which subordinate auditor will take over the responsibilities of the audit coordinator once the current coordinator crashes. The authenticated version of the Byzantine consensus protocol proposed by Dolev and Strong [1982] and the version-number method discussed by Kim [1982] are possible techniques to ensure agreement on the succession list in the presence of failures of communication lines, processes, and processors.

The audit coordinator should be responsible for analyzing the states of all the subordinate auditors, analyzing their reports, and initiating system reconfiguration: the replacement of failed subsystems or processors with their backups, and the (re)integration of repaired (or new) subsystems or processors. The audit coordinator will also serve as the arbitrator of conflicting reports from different auditors, and is responsible for maintaining a stable configuration database, which contains information about the status and physical location of each of the subsystems.

Such an auditor may be implemented as a collection of six asynchronous tasks: the configuration-database task, audit task, reconfiguration task, state-report task, diagnose task, and operator-control task. The task structure of an auditor and the flow of control among the auditor tasks are illustrated in Figure 13.

The configuration-database task maintains a consistent and up-to-date configuration database. All queries and updates to the configuration database by other auditor tasks are directed to this task. Changes to the configuration database are exclusively handled by the configuration-database task of the audit coordinator. After each change, the configuration-database tasks of the subordinate auditors are given the most up-to-date copy of the configuration database. Although the configuration database viewed by a subordinate auditor may be temporarily out of date, consistency of the system configuration is not compromised, since critical decisions can be made only by the audit coordinator.

The audit task is primarily responsible for coordinated surveillance of process and processor failures. The audit task of a subordinate auditor collects the local state reports from the state-report task and delivers them to the audit task of the coordinator. The audit task of the coordinator receives these state reports from subordinate auditors and analyzes them to determine whether any process or processor has failed.

One way for the audit coordinator to receive state reports is to periodically poll the subordinate auditors. An interesting alternative to polling by the audit coordinator is the approach proposed by Walter [1982], which requires each "auditor" to periodically send an "I-AM-ALIVE" message to its immediate neighbor on a virtual ring of "auditors." When an auditor does not receive the "I-AM-ALIVE" message within a certain time interval, it may request the audit coordinator to initiate reconfiguration.

When the audit task of the coordinator decides that a failure has occurred, or when it receives a reconfiguration request from the operator-control task, it activates the reconfiguration task. Changes to the system configuration are reflected in the configuration database after the reconfiguration task completes. The audit task of the coordinator is also responsible for preparing the succession list and securing its transmission to the audit tasks of subordinate auditors.
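The virtual-ring alternative mentioned above can be sketched as a few functions; the ring membership, interval, and bookkeeping below are invented for the illustration, and Walter's protocol itself is more elaborate.

    # Hypothetical sketch of I-AM-ALIVE monitoring on a virtual ring of auditors.
    RING = ["auditor-A", "auditor-B", "auditor-C"]   # each sends to its successor
    INTERVAL = 10.0                                  # seconds

    last_heard = {a: 0.0 for a in RING}              # when each auditor last heard
                                                     # from its predecessor

    def predecessor(auditor):
        return RING[RING.index(auditor) - 1]

    def send_i_am_alive(sender, now):
        receiver = RING[(RING.index(sender) + 1) % len(RING)]
        last_heard[receiver] = now

    def check_predecessor(auditor, now, request_reconfiguration):
        if now - last_heard[auditor] > INTERVAL:
            # Ask the audit coordinator to reconfigure around the silent neighbor.
            request_reconfiguration(predecessor(auditor))

    send_i_am_alive("auditor-A", now=0.0)            # B hears from A
    send_i_am_alive("auditor-B", now=0.0)            # C hears from B
    # auditor-C stays silent, so A's check implicates its predecessor C:
    check_predecessor("auditor-A", now=12.0,
                      request_reconfiguration=lambda a: print("reconfigure around", a))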


Figure 13. The task structure of an auditor. (The original figure shows state reports flowing from the operating system, the application subsystems, and the operator into the auditor's tasks, and the configuration-database, audit, reconfiguration, and diagnose tasks communicating with the corresponding tasks of other auditors and with the application subsystems.)

The reconfiguration task is responsible for processing the reconfiguration requests received from the audit task of the audit coordinator. Upon receiving a request, it initiates reconfiguration and monitors its successful completion. If the audit coordinator fails during a reconfiguration process, then the new coordinator completes the reconfiguration. Any change to the system configuration is stored in the configuration database. A report is sent back to the audit task of the coordinator upon completion of a reconfiguration request.

The state-report task receives state reports from the local database subsystems, the interprocessor communication subsystem, and the operating system, as well as reconfiguration requests from the system operator. It manages this collection of state reports and makes it available to the local audit task and other auditor tasks.

The diagnose task is responsible for exposing process failures that may have gone undetected by the operating system or the process itself. It may also collect complaints from database subsystems about possible misbehavior of other subsystems (e.g., time-outs and lost messages). To establish the availability or misbehavior of subsystems, it may resort to functional tests, such as checking whether messages pass through queues, tracking down the messages exchanged by subsystems, and executing simple transactions whose results are known. This sometimes will require collaboration among the diagnose tasks of several auditors. Its findings are packaged into a state report and sent to a database subsystem (e.g., to report the loss of a message and the need for its retransmission) or to the state-report task (e.g., to report a subsystem crash that has remained undetected by the operating system).

The complexity of the diagnose task depends on the extent to which other software subsystems assist in identifying failures. If the operating system is capable of serving as a repository of hardware and program failures, and application software contains a reasonable amount of defensive code to detect impossible software states, the diagnose task can be quite simple. If this is not the case, the diagnose task may have to be designed in a manner similar to the Resource Manager of System D to expose the nature of failures.

The operator-control task is the auditor's interface to the system operator. By using this interface, the operator may query the system configuration or request a system reconfiguration.
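To make the task decomposition concrete, the skeleton below wires three of the six tasks together with message queues (the state-report, audit, and reconfiguration tasks; the configuration database is reduced to a dictionary, and the diagnose and operator-control tasks are omitted). It is entirely hypothetical: the auditor described above is a design, and these names and message formats are invented.

    # Hypothetical skeleton of auditor tasks exchanging reports through queues.
    import asyncio

    async def state_report_task(inbox, to_audit):
        while True:
            report = await inbox.get()        # from the OS, DB subsystems, operator
            await to_audit.put(report)

    async def audit_task(from_state, to_reconfig):
        while True:
            report = await from_state.get()
            if report.get("kind") == "failure":        # the coordinator's analysis
                await to_reconfig.put({"replace": report["component"]})

    async def reconfiguration_task(from_audit, config_db):
        while True:
            request = await from_audit.get()
            config_db[request["replace"]] = "switched to backup"   # record new layout

    async def main():
        inbox, to_audit, to_reconfig = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
        config_db = {}                        # stands in for the configuration database
        tasks = [asyncio.create_task(coro) for coro in (
            state_report_task(inbox, to_audit),
            audit_task(to_audit, to_reconfig),
            reconfiguration_task(to_reconfig, config_db),
        )]
        await inbox.put({"kind": "failure", "component": "disk-3"})
        await asyncio.sleep(0.1)              # let the report flow through the tasks
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        print("configuration database:", config_db)

    asyncio.run(main())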

8. CONCLUDING REMARKS

This paper has provided a survey and analysis of the architectures and availability techniques used in database application systems designed with availability as a primary objective. We found that all existing systems have been designed under the single-failure assumption, but that some of the systems contain single points of failure and cannot survive failures of some single components. Rather, these systems are designed to restart quickly to provide high overall availability.

All of the systems may be classified into four distinct architectures: loosely coupled multiprocessor systems with a partitioned database, tightly coupled multiprocessor systems with a shared database, loosely coupled multiprocessor systems with a shared database, and multiprocessor systems that perform redundant computations and compare the results. Of these, the loosely coupled multiprocessor with either a partitioned or a shared database appears to offer the best framework for building a highly available system. Either architecture is conducive to incremental expansion and offers a natural boundary between data managers, which makes it difficult for a malfunctioning data manager to corrupt other data managers. Of course, both architectures require a low-overhead communications subsystem to process user requests that require access to more than one database partition. A difficult problem posed by the partitioned database, however, is that of deciding which database partition should be owned by which processor, so as to minimize the volume of processing requiring collaboration among more than one data manager. Potential drawbacks of the shared database approach are contention on shared disks and the difficulty of coordinating the global locking and the logging of database changes.

The tightly coupled multiprocessor architecture with a shared database compromises availability in favor of a potential performance advantage over the loosely coupled system with a partitioned database. However, before this potential advantage in performance can be realized, the problems of contention among processors for the use of shared memory and other shared resources, especially as more processors are added, must be resolved.

Although the redundant-computation approach may make sense for applications such as spacecraft and industrial process control, it does not appear particularly suitable for typical database applications. The exchange of status information among the replicated tasks to verify the correctness of each input and output could seriously impede the performance of a production system.

A continuously operational database application system must guarantee availability not only during multiple concurrent failures of software and hardware components but also during on-line changes of software and hardware components, on-line physical reconfiguration of the database, and generation of backup databases. An architecture of a distributed software subsystem called an auditor, which can serve as a framework for constructing database application systems to meet these requirements, was outlined in Section 7.

ACKNOWLEDGMENTS

Irv Traiger (IBM Research, San Jose) read an initial version of this paper and made numerous valuable suggestions that helped to significantly improve the technical content and presentation of the paper. John West of Computer Consoles, Richard Gostanian of Auragen Systems, Bob Good of Bank of America, Steve Jones of Synapse Computer, and Dave Cohen of AT&T kindly provided answers to various technical questions I had about their systems. The referees and Randy Katz made various constructive comments on an earlier version of this paper. Finally, a technical editor did a superb job of shaping the paper into publishable form.

REFERENCES

AGHILI, H., ASTRAHAN, M., FINKELSTEIN, S., KIM, W., MCPHERSON, K., SCHKOLNICK, M., AND STRONG, M. 1983. A prototype for a highly available database system. IBM Res. Rep. RJ3755, IBM Research, San Jose, Calif., Jan. 17.

ANDLER, S., DING, I., ESWARAN, K., HAUSER, C., KIM, W., MEHL, J., AND WILLIAMS, R. 1982. System D: A distributed system for availability. In Proceedings of the 8th International Conference on Very Large Data Bases (Mexico City, Mexico, Sept.), pp. 33-44.

BARTLETT, J. 1978. A nonstop operating system. In Proceedings of the 1978 International Conference on System Sciences (Honolulu, Hawaii, Jan.).

BERNSTEIN, P., AND GOODMAN, N. 1981. Concurrency control in distributed database systems. ACM Comput. Surv. 13, 2 (June), 185-221.

BORR, A. 1981. Transaction monitoring in ENCOMPASS (TM): Reliable distributed transaction processing. In Proceedings of the 7th International Conference on Very Large Data Bases (Cannes, France, Sept. 9-11), pp. 155-165.

COHEN, D., HOLCOME, J. E., AND SURY, M. B. 1983. Database management strategies to support network services. IEEE Q. Bull. Database Eng. 6, 2 (June), special issue on Highly Available Systems.

COHEN, N. B., HALEY, C. B., HENDERSON, S. E., AND WON, C. L. 1982. GEMINI: A reliable local network. In Proceedings of the 6th Workshop on Distributed Data Management and Computer Networks (Berkeley, Calif., Feb.), pp. 1-22.

DOLEV, D., AND STRONG, H. R. 1982. Polynomial algorithms for multiple processor agreement. In Proceedings of the 14th ACM Symposium on Theory of Computing (San Francisco, May 5-7). ACM, New York, pp. 401-407.

ELECTRONIC BUSINESS 1981. October issue.

GARCIA-MOLINA, H. 1982. Elections in a distributed computing system. IEEE Trans. Comput. C-31, 1 (Jan.), 48-59.

GOOD, B. 1983. Experience with Bank of America's distributive computing system. In Proceedings of the IEEE CompCon (Mar.). IEEE Computer Society, Los Angeles.

GOSTANIAN, R. 1983. The Auragen System 4000. IEEE Q. Bull. Database Eng. 6, 2 (June), special issue on Highly Available Systems.

GRAY, J. N. 1978. Notes on data base operating systems. IBM Res. Rep. RJ2188, IBM Research, San Jose, Calif., Feb.

GRAY, J. N., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLU, F., AND TRAIGER, I. 1981. Recovery manager of a data management system. ACM Comput. Surv. 13, 2 (June), 223-242.

HAERDER, T., AND REUTER, A. 1983. Principles of transaction-oriented database recovery. ACM Comput. Surv. 15, 4 (Dec.), 287-317.

IBM 1979. OS/VS2 MVS multiprocessing: An introduction and guide to writing, operating, and recovery procedures. Form No. GC28-0952-1, File No. S370-34, International Business Machines.

IBM 1980. IBM System/370 principles of operation. Form No. GA22-7000-6, File No. S370-01, International Business Machines.

IEEE 1983. IEEE Q. Bull. Database Eng. 6, 2 (June), special issue on Highly Available Systems.

JONES, S. 1983. Synapse's approach to high application availability. In Proceedings of the IEEE Spring CompCon (Mar.). IEEE Computer Society, Los Angeles.

KASTNER, P. C. 1983. A fault-tolerant transaction processing environment. IEEE Q. Bull. Database Eng. 6, 2 (June), special issue on Highly Available Systems.

KATSUKI, D., ELSAM, E. S., MANN, W. F., ROBERTS, E. S., ROBINSON, J. G., SKOWRONSKI, F. S., AND WOLF, E. W. 1978. Pluribus--An operational fault-tolerant multiprocessor. Proc. IEEE 66, 10 (Oct.), 1146-1159.

KATZMAN, J. A. 1977. System architecture for NonStop computing. In Proceedings of the CompCon (Feb.). IEEE Computer Society, Los Angeles, pp. 77-80.

KATZMAN, J. A. 1978. A fault tolerant computer system. In Proceedings of the 1978 International Conference on System Sciences (Honolulu, Hawaii, Jan.).

KIM, W. 1982. Auditor: A framework for highly available DB/DC systems. In Proceedings of the 2nd Symposium on Reliability in Distributed Software and Database Systems (Pittsburgh, Pa., July). IEEE Computer Society, Silver Spring, Md., pp. 76-84.

KINNUCAN, P. 1981. An industrial computer that 'can't fail.' Mini-Micro Syst. (Mar.), 29-34.

KOHLER, W. 1981. A survey of techniques for synchronization and recovery in decentralized computer systems. ACM Comput. Surv. 13, 2 (June), 149-183.

MITZE, R. W., ET AL. 1983. The 3B-20D processor and DMERT as a base for telecommunications applications. Bell Syst. Tech. J. Comput. Sci. Syst. 62, 1 (Jan.), 171-180.

SPENCER, A. C., AND VIGILANTE, F. S. 1969. System organization and objectives. Bell Syst. Tech. J. (Oct.), 2607-2618, special issue on No. 2 ESS.

STRATUS 1982. Stratus/32 System Overview. Stratus Computers, Natick, Mass.

STRICKLAND, J. P., UHROWCZIK, P. P., AND WATTS, V. L. 1982. IMS/VS: An evolving system. IBM Syst. J. 21, 4, 490-510.

SWAN, R., FULLER, S. H., AND SIEWIOREK, D. P. 1977. Cm*--A modular, multi-microprocessor. In Proceedings of the National Computer Conference (Dallas, Tex., June 13-16), vol. 45. AFIPS Press, Reston, Va., pp. 637-644.

TRAIGER, I. 1983. Trends in systems aspects of database management. In Proceedings of the British Computer Society 2nd International Conference on Databases (Cambridge, England, Aug. 30-Sept. 2).

TSUKIGI, K., AND HASEGAWA, Y. 1983. The travel reservation on-line network system. IEEE Q. Bull. Database Eng. 6, 1 (Mar.), special issue on Database Systems in Japan.

WALTER, B. 1982. A robust and efficient protocol for checking the availability of remote sites. In Proceedings of the 6th Workshop on Distributed Data Management and Computer Networks (Berkeley, Calif., Feb.), pp. 45-68.

WEISS, H. M. 1980. The ORACLE data base management system. Mini-Micro Syst. (Aug.), 111-114.

WENSLEY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M., LEVITT, K., MELLIAR-SMITH, P. M., SHOSTAK, R., AND WEINSTOCK, C. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.

WEST, J. C., ISMAN, M. A., AND HANNAFORD, S. G. 1983. Transaction processing in the PERPOS operating system. IEEE Q. Bull. Database Eng. 6, 2 (June), special issue on Highly Available Systems.

WESTON, J. 1978. General Electric's MARK III Cluster System. Presented to the American Institute of Industrial Engineers, Jan. 31.

Received April 1983; final revision accepted February 1984.
