US005692120A

United States Patent [19]
Forman et al.

[11] Patent Number: 5,692,120
[45] Date of Patent: *Nov. 25, 1997

[54] FAILURE RECOVERY APPARATUS AND METHOD FOR DISTRIBUTED PROCESSING SHARED RESOURCE CONTROL

Inventors: Ira Richard Forman; Hari Haranath Madduri, both of Austin, Tex.

[73] Assignee: International Business Machines Corporation, Armonk, N.Y.

Notice: The term of this patent shall not extend beyond the expiration date of Pat. No. 5,463,733.

[21] Appl. No.: 453,610

[22] Filed: May 30, 1995

Related U.S. Application Data

[62] Division of Ser. No. 287,046, Aug. 8, 1994, Pat. No. 5,463,733.

[51] Int. Cl.: G06F 11/00
[52] U.S. Cl.: 395/182.08; 395/729
[58] Field of Search: 395/182.08, 182.09, 395/182.02, 182.1

[56] References Cited

U.S. PATENT DOCUMENTS

4,634,110    1/1987   Julich et al.    371/1.1
4,827,399    5/1989   Shibayama        364/200
5,003,464    3/1991   Ely et al.       364/200
5,058,056   10/1991   Hammer et al.    364/900
5,214,778    5/1993   Glider et al.    395/575
5,235,700    8/1993   Alaiwan et al.   395/575
5,253,359   10/1993   Spix et al.      395/575
5,463,733   10/1995   Forman et al.    395/182.08

Primary Examiner: Robert W. Beausoliel, Jr.
Assistant Examiner: Joseph E. Palys
Attorney, Agent, or Firm: Mark S. Walker

[57] ABSTRACT

Communicating the failure of a master process controlling one or more shared resources to all processes sharing the resources. A shared resource control file is established that contains the identities of all sharing processes. Master process failure triggers a race to establish exclusive access over the shared control file. The new master reads address data from the old shared control file, marks it as invalid and establishes a new control file based on renewed registrations from the sharing processes. The master process maintains the sharing process list as processes begin and end sharing.

3 Claims, 3 Drawing Sheets

U.S. Patent    Nov. 25, 1997    Sheet 1 of 3    5,692,120

[FIG. 1: block diagram of computer system 100, showing CPU 102, MEMORY 104, COMM. ADAPT. 106, network interface 108, I/O CNTL 112, DISK CONTROL 120, and I/O devices 114, 116 and 118.]

U.S. Patent    Nov. 25, 1997    Sheet 2 of 3    5,692,120

[FIG. 2: drawing not reproduced; only reference numeral 212 survives from the sheet.]

[FIG. 3 (PRIOR ART): flowchart with the following boxes.
150  Request for common resource
152  Is there a shared control file? (yes/no)
154  Create the shared control file (SCF)
156  Hold exclusive write mode
158  Decision (yes/no)
160  Write destination in SCF
162  Master of resource
164  Inform system that it lost contention
166  Open shared control file to read winner's info
168, 170  Box text lost in extraction
172  Decision (yes/no)
174  Access resource
176  Negotiate with master to access resource]
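The race depicted in FIG. 3 might be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: atomic creation of a separate lock file with `os.O_EXCL` stands in for "exclusive write mode", the shared control file is stored as JSON, and the names `race_for_resource`, the `.lock` suffix, and the `nodeA:5000` style addresses are all hypothetical.

```python
import json
import os
import tempfile


def race_for_resource(scf_path, my_address):
    """Return ("master", my_address) if this caller wins, else ("shadow", winner)."""
    lock_path = scf_path + ".lock"  # hypothetical lock file, not named in the patent
    try:
        # Steps 156-158: O_CREAT | O_EXCL creates the file atomically, so only
        # one contender can succeed; this stands in for exclusive write mode.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Steps 164-166: lost contention, so read the winner's information.
        # A real system would retry here if the winner has not written yet.
        with open(scf_path) as scf:
            return "shadow", json.load(scf)["master"]
    os.close(fd)
    # Steps 160-162: the winner records its address and becomes master.
    with open(scf_path, "w") as scf:
        json.dump({"master": my_address, "shadows": []}, scf)
    return "master", my_address


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "resource.scf")
        print(race_for_resource(path, "nodeA:5000"))
        print(race_for_resource(path, "nodeB:5000"))
```

Run sequentially, the first caller becomes master and the second becomes a shadow holding the first caller's address. The sketch leaves out the later boxes of the flowchart (connecting to the master and negotiating access).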

[FIG. 4 (Sheet 3 of 3): flowchart with the following boxes.
202  Shadow processes detect failure
204  Race for control
206  New master process selected
208  Previous SCF invalidated
210  Message sent to each shadow process
212  Shadows re-register; data added to SCF]
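The recovery sequence of FIG. 4, from the point at which a new master has won the race, could be sketched as below. The sketch assumes an in-memory dictionary in place of the shared control file and a caller-supplied `send` function that returns `True` when a shadow re-registers; both are illustrative assumptions, as are the message fields and the function name `recover_from_master_failure`.

```python
def recover_from_master_failure(old_scf, new_master, send):
    """Build a new shared control file after the race has been won.

    old_scf    -- dict with keys "valid", "master", "shadows" (hypothetical layout)
    new_master -- communication address of the race winner
    send       -- callable(address, message) -> True if the shadow re-registers
    """
    # Step 208: read the shadow addresses, then mark the old file invalid.
    shadows = [addr for addr in old_scf["shadows"] if addr != new_master]
    old_scf["valid"] = False

    new_scf = {"valid": True, "master": new_master, "shadows": []}
    # Carrying the new master's address in the notice is an assumption.
    notice = {"type": "MASTER_FAILED", "new_master": new_master}

    for addr in shadows:
        # Step 210: tell each previously registered shadow about the failure.
        if send(addr, notice):
            # Step 212: only shadows that re-register enter the new file.
            new_scf["shadows"].append(addr)
    return new_scf
```

Because the list of shadows comes from the old control file, a shadow that has not yet noticed the failure is still notified, and one that no longer responds is simply absent from the new file.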

[FIG. 5: shared control file record. The header fields include MACHINE ID and PROTOCOL ID; below them are five rows, each pairing a SHADOW ID (230) with SHADOW COMM data (232).]
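The record layout of FIG. 5 could be modeled roughly as follows. The field names are chosen for illustration and follow the items the description lists for the table (machine identifier, process id, port number, protocol id); the `valid` flag and the `add_shadow` and `remove_shadow` helpers are assumptions added to show how a master might maintain the record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Endpoint:
    """Identity plus communications data for one process (hypothetical names)."""
    machine_id: str
    process_id: int
    port: int
    protocol_id: str


@dataclass
class SharedControlFile:
    """A single record: one master entry and an array of shadow entries."""
    master: Endpoint
    shadows: List[Endpoint] = field(default_factory=list)
    valid: bool = True  # assumed flag; cleared when a new master invalidates the file

    def add_shadow(self, shadow: Endpoint) -> None:
        # Called by the master when a shadow requests access to the resource.
        if shadow not in self.shadows:
            self.shadows.append(shadow)

    def remove_shadow(self, shadow: Endpoint) -> None:
        # Called by the master when a shadow stops accessing the resource.
        if shadow in self.shadows:
            self.shadows.remove(shadow)
```

A list is used here because the preferred embodiment describes an array of shadow addresses; a linked list would serve equally well.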

FAILURE RECOVERY APPARATUS AND METHOD FOR DISTRIBUTED PROCESSING SHARED RESOURCE CONTROL

This application is a Division of application Ser. No. 08/287,046, filed Aug. 8, 1994, now U.S. Pat. No. 5,463,733.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the operation of distributed processing computer systems. In particular it relates to failure recovery in those systems that have a plurality of processing nodes each one having access to a number of shared resources controlled by a master process. Still more particularly, the present invention relates to the management of shared access including the passing of a token or write lock that grants permission to one of a number of distributed processes allowing that process to update a data item.

2. Background and Related Art

Distributed computer systems are created by linking a number of computer systems using a communications network. Distributed systems frequently have the ability to share data resident on an individual system. Sharing can take many forms. Simple file sharing allows any of the distributed processes to access files regardless of the physical system on which they reside. Device sharing similarly allows use of physical devices regardless of location. Replicated data systems implement data sharing by providing a replica copy of a data object to each process using that data object. Replication reduces the access time for each processor by eliminating the need to send messages over the network to retrieve and supply the necessary data. A replicated object is a logical unit of data existing in one of the computer systems but physically replicated to multiple distributed computer systems. Replicated copies are typically maintained in the memories of the distributed systems.

Replicated data objects also speed the update process by allowing immediate local update of a data object. Replication introduces a control problem, however, because many copies of the data object exist. The distributed system must have some means for controlling data update to ensure that all copies of the data remain consistent.

Prior art systems control data consistency by establishing a master data object copy in one of the distributed systems. The master copy is always assumed to be valid. Data object update by a system other than that of the master copy requires sending of the update request to the master for update and propagation to all replicas. This approach has the disadvantage of slowing local response time as the master data object update and propagation are performed.

Another means for controlling replicated data is described in Moving Write Lock for Replicated Objects, commonly assigned, filed on Oct. 16, 1992 as Ser. No. 07/961,757 and having attorney docket number AT992-046. The apparatus and method of that invention require that a single "write lock" exist in a distributed system and be passed to each process on request. Data object updates can only be performed by the holder of the "write lock." The "write lock" holder may update the local object copy and then send that update to the master processor for its update and propagation to other processes. The above patent application is incorporated by reference.

The method for determining which of a number of distributed processes is to be master is described in pending patent application Ser. No. 07/961,750 filed Oct. 16, 1992 and entitled Determining a Winner of a Race in a Data Processing System, commonly assigned and bearing attorney docket number AT992-117. The "race" between each process potentially controlling a resource results in the assignment of master status to the process first establishing write control over a Shared Control File. Once control has been established by one process, other processes are assigned "shadow" status. Master process failure causes reevaluation of master status. This patent application is also incorporated by reference.

The technical problem addressed by the present invention is providing fault tolerant features to a distributed processing system controlling resource sharing by designating a master process for each shared resource. The problem of systems using write locks or tokens to manage replicated data objects is also addressed. Fault tolerance is required to ensure that no data or updates are lost due to the failure of a master process. Prior art systems, including those referenced above, require the master determination and write lock control to be reinitialized. This could result in loss of data if a locally updated data object replica has not been propagated to the master or other replicas.

SUMMARY OF THE INVENTION

The present invention is directed to an improved system and method for managing write locks in distributed processing systems. The present invention improves on failure notification in a distributed environment by notifying all shadow processes of the failure. Notification is accomplished by having each master collect the names of all shadow processes and causing a new master to notify all previously logged shadows of the changed master status.

The present invention offers an improved system and method for recovering from master failure in a write lock control system. The present invention is directed to a system and method that ensures the designation of a new master considers data integrity by determining which of the shadow processes has the most current data object and attempting to make that shadow the master.

The present invention is directed to a method of managing recovery of a distributed processing system in which shared resources are each controlled by a master process, the distributed processing system having a plurality of processors, each of said processors having memory and each of said processors interconnected to the other processors by means of a communications network. The method comprises detecting failure of a master process for a shared resource; requesting exclusive access to a shared resource control file; establishing exclusive access to said shared resource control file, if said request is granted; determining from said control file the communications addresses of all other processes accessing said shared resource via the failed master process; sending a message to each of said other processes indicating failure of said master process.

It is therefore an object of the present invention to provide a fault-tolerant distributed system having replicated data objects.

It is another object of the invention to manage new master process "race" conditions to ensure that master process designation does not result in loss of data.

It is yet another object of the invention to ensure failure of a master process for a given data object is detected by all shadow processes having replicas of that data object.

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention, as illustrated in the accompanying drawing wherein like reference numbers represent like parts of the invention.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a block diagram of a computer system of the type in which the present invention is embodied.

FIG. 2 is a block diagram of a distributed network according to the present invention.

FIG. 3 is a flowchart depicting the master resolution logic of prior art systems.

FIG. 4 is a flowchart of the preferred embodiment failure recovery logic.

FIG. 5 is a diagram showing the shared control file according to the preferred embodiment.

DETAILED DESCRIPTION

The present invention is practiced in a distributed processing computer environment. This environment consists of a number of computer processors linked together by a communications network. Alternatively, the present invention could be practiced in a multiprogramming system in which a single computer (e.g. single CPU) supports the execution of multiple processes each having a separate address space.

The preferred embodiment is practiced with linked computers. Each computer has the components shown generally for the system 100 in FIG. 1. Processing is provided by central processing unit or CPU 102. CPU 102 acts on instructions and data stored in random access memory 104. Long term storage is provided on one or more disks 122 operated by disk controller 120. A variety of other storage media could be employed including tape, CD-ROM, or WORM drives. Removable storage media may also be provided to store data or computer process instructions. Operators communicate with the system through I/O devices controlled by I/O controller 112. Display 114 presents data to the operator while keyboard 116 and pointing device 118 allow the operator to direct the computer system. Communications adapter 106 controls communications between this processing unit and others on a network to which it is connected by network interface 108.

Computer system 100 can be any known computer system including microcomputers, mini-computers and mainframe computers. The preferred embodiment envisions the use of computer systems such as the IBM Personal System/2 (PS/2) or IBM RISC System/6000 families of computers. (IBM, Personal System/2, PS/2 and RISC System/6000 are trademarks of the IBM Corp.) However, workstations from other vendors such as Sun or Hewlett Packard may be used, as well as computers from Compaq or Apple.

A distributed processing system is shown in FIG. 2. Each of the processing nodes 202, 204, 206, 208, 210 is connected to a network 200 that enables communications among the processors. Additional permanent storage may be associated with the network as shown by disk storage unit 212. In the alternative, persistent storage in one of the processing nodes could be used for network persistent storage.

Network 200 can be any type of network including LAN, WAN, ATM or other. Physical network protocols such as Ethernet or Token Ring can be used and communications protocols such as TCP/IP or Netbios or Novell Netware can control the network. File system management can be provided by a program based on the Sun Microsystems NFS technology or CMU AFS technology. Each of these file system programs allows distributed processes to access and manage data residing on remote systems. These systems create a single logical file system for each processor regardless of the physical location of individual files. NFS is described in greater detail in the IBM Corp. publication Communication Concepts and Procedures, Order No. SC23-2203-00.

The variety of permitted networks means that the processing nodes may be distributed throughout a building, across a campus, or even across national boundaries.

The preferred embodiment of the present invention is practiced in a distributed network of peer processing nodes. Peer nodes each have equal status in the network with none being master or slave nodes. Using peer nodes improves network efficiency because there is no single bottleneck through which requests must be funnelled. Instead each node can act independently to perform its functions. Another advantage is that failure of any particular node will not cause the entire network to fail as would be the case where a master processor existed. The disadvantage of peer networks is that there is no focal point for controlling data integrity of replicated data.

The above referenced patent application for Determining the Winner of a Race in a Data Processing System teaches a procedure for "racing" for control of a resource. FIG. 3 illustrates the steps of this process. The process starts by generating a request for a common resource 150. The processor requesting the resource tests to determine whether or not a shared control file exists 152. If not, the process creates a shared control file 154. In either case, the process attempts to acquire exclusive write access 156. If this is successful 158 the system updates the shared control file 160 and becomes master of that resource 162. If the attempt to acquire the exclusive write lock failed, the process is not the master 164 and must read the name of the master from the shared control file 166 and connect to the master 168 as a shadow 170. If the requesting process is the master, it can directly access the resource; otherwise, it is a shadow process and must negotiate with the master for access 176.

The shared control file of the preferred embodiment is a storage file in the logical file system. As such, it resides on one of the permanent storage devices in the distributed system. The present invention is equally applicable, however, to a shared resource control file managed in volatile memory (RAM) that is sharable among the distributed processes.

The system described in the above patent application provided handling for master process failure by reinitiating the control race. This approach has several disadvantages. First, only those processes that know about the failure of the master will participate in the race. Shadow processes will find out about master failure in a number of ways. Some communication systems, such as NFS, will notify any process linked to a failing process of that process failure. In this case the shadow process will be quickly notified. In other cases shadow processes will detect master failure only when the shadow process attempts to communicate with the master. A shadow process that is read intensive may not contact the master for long periods and thus will not participate in the race.

A second disadvantage exists after the new master process is established. The new master has no knowledge of the shadow processes previously accessing the shared resource. These shadows will not necessarily become aware of the master process failure until they attempt to communicate with that process, fail, and seek to determine the new master.

The process of the preferred embodiment will be described with reference to FIG. 4. The process starts when one or more shadow processes detect master process failure 202. These processes will then "race" to determine which will be the new master process by attempting to gain exclusive control of a shared control file 204. The new master process is determined as the process first acquiring exclusive access to the shared control file 206. The new master process then reads and marks the existing shared control file as invalid 208.

The data in the shared control file includes the communication address for each of the shadow processes accessing the shared resource. A message is sent to each of these processes indicating master process failure and a need to reregister 210. A new shared control file is created by the new master and data about each shadow process is added to the file 212. An example of a shared control file record is shown in FIG. 5. The preferred embodiment implements the shared control file as a single record having an array containing all shadow process communication addresses. Alternatives, such as linked lists of shadows, could also be implemented.

The shared control file table of FIG. 5 contains a master process machine identifier 222, a master process id 224, a master port number 226, a master protocol id 228, and a series of shadow ids 230 (containing machine and process ids) and shadow communications data 232 containing port and protocol information.

The master process maintains the shared control file by adding shadow data as shadows request access to the shared resource and removing the data from the table when they cease accessing the resource. Master process maintenance of a single file is a distinct advantage over prior art systems that required sending the identities of all shadow processes to all other processes for use in the event of master process failure.

It will be understood from the foregoing description that various modifications and changes may be made in the preferred embodiment of the present invention without departing from its true spirit. It is intended that this description is for purposes of illustration only and should not be construed in a limiting sense. The scope of this invention should be limited only by the language of the following claims.

We claim:

1. A computer program product having a computer readable medium having computer program logic recorded thereon for causing a computer system to detect a master process failure in system having a plurality processes each controlled by a separate operating system and a plurality of shared resources each controlled by a master process and to notify all other processes of master process failure, the computer program product comprising:

   program product means for causing said computer system to detect failure of a master process for a shared resource;

   program product means for causing said computer system to request exclusive access to a shared resource control file using a network file management system procedure independent of said operating system for said processor;

   program product means for causing said computer system to establish exclusive access to said shared resource control file, if said request is granted;

   program product means for causing said computer system to determine from said control file the communications addresses of all other processes accessing said shared resource via the failed master process;

   program product means for causing said computer system to invalidate said control file;

   program product means for causing said computer system to send a message to each of said other processes indicating failure of said master process.

2. The computer program product of claim 1, further comprising:

   program product means for causing said computer system to create a new control file and enter data for each process responding to said messages.

3. The computer program product of claim 2, wherein said program product means for detecting failure comprises:

   program product means for causing said computer system to attempt to communicate with a master process for a shared resource; and

   program product means for causing said computer system to signal master process failure if no response is received.
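The detection step recited in claim 3 (attempt to communicate with the master, signal failure if no response is received) might look like the sketch below. It assumes, purely for illustration, that "communicating" means opening a TCP connection within a timeout; the function names, the two-second default, and the `on_failure` callback are hypothetical and not taken from the patent.

```python
import socket


def master_responds(host, port, timeout=2.0):
    """Return True if a TCP connection to the master succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, unreachable host or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def detect_master_failure(host, port, on_failure, probe=master_responds):
    """Probe the master once; call on_failure(host, port) if it does not respond.

    Returns True when a failure was signalled, False when the master answered.
    """
    if probe(host, port):
        return False
    on_failure(host, port)  # e.g. start the race for the shared control file
    return True
```

Passing the probe as a parameter keeps the decision logic independent of the transport, so a different check (for example an application-level ping) could be substituted without changing the failure signalling.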