
The Process Group Approach to Reliable Distributed Computing

Kenneth P. Birman
Communications of the ACM, December 1993/Vol. 36, No. 12

One might expect the reliability of a distributed system to correspond directly to the reliability of its constituents, but this is not always the case. The mechanisms used to structure a distributed system and to implement cooperation between components play a vital role in determining the reliability of the system. Many contemporary distributed operating systems have placed emphasis on communication performance, overlooking the need for tools to integrate components into a reliable whole. The communication primitives supported give generally reliable behavior, but exhibit problematic semantics when transient failures or system configuration changes occur. The resulting building blocks are, therefore, unsuitable for facilitating the construction of systems where reliability is important.

This article reviews 10 years of research on ISIS, a system that provides tools to support the construction of reliable distributed software. The thesis underlying ISIS is that the development of reliable distributed software can be simplified using process groups and group programming tools. This article describes the approach taken, surveys the system, and discusses experiences with real applications.

It will be helpful to illustrate group programming and ISIS in a setting where the system has found rapid acceptance: brokerage and trading systems. These systems integrate large numbers of demanding applications and require timely reaction to high volumes of pricing and trading information.¹ It is not uncommon for brokers to coordinate trading activities across multiple markets.

¹ Although this class of systems certainly demands high performance, there are no real-time deadlines or hard time constraints, such as in the FAA's Advanced Automation System [14]. This issue is discussed further in the section "ISIS and Other Distributed Computing Technologies."

Trading strategies rely on accurate pricing and market-volatility data, dynamically changing databases giving the firm's holdings in various equities, news and analysis data, and elaborate financial and economic models based on relationships between financial instruments. Any distributed system in support of this application must serve multiple communities: the firm as a whole, where reliability and security are key considerations; the brokers, who depend on speed and the ability to customize the trading environment; and the system administrators, who seek uniformity, ease of monitoring and control. A theme of this article is that all of these issues revolve around the technology used to "glue the system together." By endowing the corresponding software layer with predictable, fault-tolerant behavior, the flexibility and reliability of the overall system can be greatly enhanced.

Figure 1 illustrates a possible interface to a trading system. The display is centered around the current position of the account being traded, showing purchases and sales as they occur. A broker typically authorizes purchases or sales of shares in a stock, specifying limits on the price and the number of shares. These instructions are communicated to the trading floor, where agents of the brokerage or bank trade as many shares as possible, remaining within this authorized [...]. The display illustrates several points:

• Information backplane. The broker would construct such a display by interconnecting elementary widgets (e.g., graphical windows, computational widgets) so that the output of one becomes the input to another. Seen in the large, this implies the ability to publish messages and subscribe to messages sent from program to program on topics that make up the "corporate information backplane" of the brokerage. Such a backplane would support a naming structure, communication interfaces, access restrictions, and some sort of selective history mechanism. For example, when subscribing to a topic, an application will often need key messages posted to that topic in the past.
• Customization. The display suggests that the system must be easily customized. The information backplane must be organized in a systematic way (so that the broker can easily track down the name of communication streams of interest) and flexible (allowing the introduction of new communication streams while the system is active).
• Hierarchical structure. Although the trader will treat the wide-area system in a seamless way, communication disruptions are far more common on wide-area links (say, from New York to Tokyo or Zurich) than on local-area links. This gives the system a hierarchical structure composed of local-area systems which are closely coupled and rich in services, interconnected by less reliable and higher-[...] wide-area communication links.

What about the reliability implications of such an architecture? In Figure 1, the trader has graphed a computed index of technology stocks against the price of IBM, and it is easy to imagine that such customization could include computations critical to the trading strategy of the firm. In Figure 2, the analysis program is "shadowed" by additional copies, to indicate that it has been made fault-tolerant (i.e., it would remain available even if the broker's workstation failed). A broker is unlikely to be a sophisticated programmer, so fault-tolerance such as this would have to be introduced by the system--the trader's only action being to request it, perhaps by specifying the degree of reliability needed for this analytic program. This means the system must automatically replicate or checkpoint the computation, placing the replicas on processors that fail independently from the broker's workstation, and activating a backup if the primary fails.

The requirements of modern trading environments are not unique to the application. It is easy to rephrase this example in terms of the issues confronted by a team of seismologists cooperating to interpret the results of a seismic survey under way in some remote and inaccessible region, a doctor reviewing the status of patients in a hospital from a workstation at home, a design group collaborating to develop a new product, or application programs cooperating in a factory-floor process control setting. The software of a modern telecommunications switching product is faced with many of the same issues, as is software implementing a database that will be used in a large distributed setting. To build applications for the networked environments of the future, a technology is needed that will make it as easy to solve these types of problems as it is to build graphical user interfaces (GUIs) today.

A central premise of the ISIS project, shared with several other efforts [2, 14, 19, 22, 25], is that support for programming with distributed groups of cooperating programs is the key to solving problems such as the ones previously mentioned. For example, a fault-tolerant data analysis service can be implemented by a group of programs that adapt transparently to failures and recoveries. The publication/subscription style of interaction involves an anonymous use of process groups: here, the group consists of a set of publishers and subscribers that vary dramatically as brokers change the instruments they trade. Each interacts with the group through a group name (the topic), but the group membership is not tracked or used within the computation. Although the processes publishing or subscribing to a topic do not cooperate directly, when this structure is employed, the reliability of the application will depend on the reliability of group communication. It is easy to see how problems could arise if, for example, two brokers monitoring the same stock see different pricing information.

Process groups of various kinds arise naturally throughout a distributed system. Yet, current distributed computing environments provide little support for group communication patterns and programming. These issues have been left to the application programmer, and application programmers have been largely unable to respond to the challenge. In short, contemporary distributed computing environments prevent users from realizing the potential of the distributed computing infrastructure on which their applications run.

Process Groups

Two styles of process group usage are seen in most ISIS applications:

Anonymous groups: These arise when an application publishes data under some "topic," and other processes subscribe to that topic. For an application to operate automatically and reliably, anonymous groups should provide certain properties:

1. It should be possible to send messages to the group using a group address. The high-level programmer should not be involved in expanding the group address into a list of destinations.
2. If the sender and subscribers remain operational, messages should be delivered exactly once. If the sender fails, a message should be delivered to all or none of the subscribers. The application programmer should not need to worry about message loss or duplication.
3. Messages should be delivered to subscribers in some sensible order. For example, one would expect messages to be delivered in an order consistent with causal dependencies: if a message m is published by a program that first received m1 ... mi, then m might be dependent on these prior messages. If some other subscriber will receive m as well as one or more of these prior messages, one would expect them to be delivered first. Stronger ordering properties might also be desired, as discussed later.
4. It should be possible for a subscriber to obtain a history of the group--a log of key events and the order in which they were received (the application itself would distinguish messages that need to be retained from those that can be discarded). If n messages are posted and the first message seen by a new subscriber will be message mi, one would expect messages m1 ... m(i-1) to be reflected in the history, and messages mi ... mn to all be delivered to the new process. If some messages are missing from the history, or included both in
the history and in the subsequent postings, incorrect behavior might result.

[Figure 1. Broker's trading system. The display shows quote topics (/japan/quotes/ibm, /nys/quotes/ibm), market status (Zurich: up; Japan: closed), a computed index over IBM, DEC and HP, the overall position, and trades in progress (columns: Stock, Shares, B/O, Price).]

[Figure 2. Making an analytic service fault-tolerant]

Explicit groups: A group is explicit when its members cooperate directly: they know themselves to be members of the group, and employ algorithms that incorporate the list of members, relative rankings within the list, or in which responsibility for responding to requests is shared.

Explicit groups have additional needs stemming from their use of group membership information: in some sense, membership changes are among the information being published to an explicit group. For example, a fault-tolerant service might have a primary member that takes some action and an ordered set of backups that take over, one by one, if the current primary fails. Here, group membership changes (failure of the primary) trigger actions by group members. Unless the same changes are seen in the same order by all members, situations could arise in which there are no primaries, or several. Similarly, a parallel database search might be done by ranking the group members and then dividing the database into n parts, where n is the number of group members. Each member would do 1/n'th of the work, with the ranking determining which member handles which fragment of the database. The members need consistent views of the group membership to perform such a search correctly; otherwise, two processes might search the same part of the database while some other part remains unscanned, or they might partition the database inconsistently.

Thus, a number of technical problems must be considered in developing software for implementing distributed process groups:

• Support for group communication, including addressing, failure atomicity, and message delivery ordering.
• Use of group membership as an input. It should be possible to use the group membership or changes in membership as input to a distributed algorithm (one run concurrently by multiple group members).
• Synchronization. To obtain globally correct behavior from group applications, it is necessary to synchronize the order in which actions are taken, particularly when group members will act independently on the basis of dynamically changing, shared information.

The first and last of these problems have received considerable study. However, the problems cited are not independent: their integration within a single framework is nontrivial. This integration issue underlies our virtual synchrony execution model.

Building Distributed Services Over Conventional Technologies

In this section we review the technical issues raised in the preceding section. In each case, we start by describing the problem as it might be approached by a developer working over a contemporary computing system, with no special tools for group programming. Obstacles to solving the problems are identified, and used to motivate a general approach to overcoming the problem in question. Where appropriate, the actual approach used in solving the problem within ISIS is discussed.

Conventional Message-Passing Technologies

Contemporary operating systems offer three classes of communication services [14]:

• Unreliable datagrams: These services automatically discard corrupted messages, but do little additional processing. Most messages get through, but under some conditions messages might be lost in transmission, duplicated, or delivered out of order.
• Remote procedure call: In this approach, communication results from a procedure invocation that returns a result. RPC is a relatively reliable service, but when a failure does occur, the sender is unable to distinguish among many possible outcomes: the destination may have failed before or after receiving the request, or the network may have prevented or delayed delivery of the request or the reply.
• Reliable data streams: Here, communication is performed over channels that provide flow control and reliable, sequenced message delivery. Standard stream protocols include TCP, the ISO protocols, and TP4. Because of pipelining, streams generally outperform RPC when an application sends large volumes of data. However, the standards also prescribe rules under which a stream will be broken, using conditions based on timeout or excessive retransmissions. For example, suppose that processes c, s1 and s2 have connections with one another--perhaps, s1 and s2 are the primary and backup, respectively, for a reliable service of which c is a client.

Now, consider the state of this system if the connection from c to s1 breaks due to a communication failure, while all three processes and the other two connections remain operational (Figure 3). Much like the situation after a failed RPC, c and s1 will now be uncertain regarding one another's status. Worse, s2 is totally unaware of the problem. In such a situation, the application may easily behave in an inconsistent manner. In our primary-backup example, c would cease sending requests to s1, expecting s2 to handle them. s2, however, will not respond (it expects s1 to do so).

[Figure 3. Inconsistent connection states. A client C is connected to a primary S1 and a backup S2.]

In a system with more components, the situation would be greatly exacerbated. From this, one sees that a reliable data stream has guarantees little stronger than an unreliable one: when channels break, it is not safe to infer that either endpoint has failed; channels may not break in a consistent manner, and data in transit may be lost. Because the conditions under which a stream breaks are defined by the standards, one has a situation in which potentially inconsistent behavior is unavoidable.

These considerations lead us to make a collection of assumptions about the network and message communication in the remainder of the article. First, we will assume the system is structured as a wide-area network (WAN) composed of local-area networks (LANs) interconnected by wide-area communication links. (WAN issues will not be considered in this article due to space constraints.) We assume that each LAN consists of a collection of machines (as few as two or three, or as many as one or two hundred), connected by a collection of high-speed, low-latency communication devices. If shared memory is employed, we assume it is not used over the network. Clocks are not assumed to be closely synchronized.

Within a LAN, we assume messages may be lost in transit, arrive out of order, be duplicated, or be discarded because of inadequate buffering capacity. We also assume that LAN communication partitions are rare. The algorithms described later in this article and the ISIS system itself may pause (or make progress in only the largest partition) during periods of partition failure, resuming normal operation only when normal communication is restored.

We will assume the lowest levels of the system are responsible for flow control and for overcoming message loss and unordered delivery. In ISIS, these tasks are accomplished using a windowed acknowledgement protocol similar to the one used in TCP, but integrated with a failure-detection subsystem. With this (nonstandard) approach, a consistent system-wide view of the state of components in the system and of the state of communication channels between them can be presented to higher layers of software. For example, the ISIS transport layer will only break a communication channel to a process in situations in which it would also report to any application monitoring that process that the process has failed. Moreover, if one channel to a process is broken, all channels are broken.

Failure Model

Throughout this article, processes and processors are assumed to fail by halting, without initiating erroneous actions or sending incorrect messages. This raises a problem: transient problems--such as an unresponsive swapping device or a temporary communication outage--can mimic halting failures. Because we will want to build systems guaranteed to make progress when failures occur, this introduces a conflict between "accurate" and "timely" failure detection.

One way ISIS overcomes this problem is by integrating the communication transport layer with the failure detection layer to make processes appear to fail by halting, even when this may not be the case: a fail-stop model [30]. To implement such a model, a system uses an agreement protocol to maintain a system membership list: only processes included in this list are permitted to participate in the system, and nonresponsive or failed processes are dropped [12, 28]. If a process dropped from the list later resumes communication, the application is forced to either shut down gracefully or to run a "reconnection" protocol. The message transport layer plays an important role, both by breaking connections and by intercepting messages from faulty processes.

In the remainder of this article we assume a message transport and failure-detection layer with the properties of the one used by ISIS. To summarize, a process starts execution by joining the system, interacts with it over a period of time during which messages are delivered in the order sent, without loss or duplication, and then terminates (if it terminates) by halting detectably. Once a process terminates, we will consider it to be permanently gone from the system, and assume that any state it may have recorded (say, on a disk) ceases to be relevant. If a process experiences a transient problem and then recovers and rejoins the system, it is treated as a completely new entity--no attempt is made to automatically reconcile the state of the system with its state prior to the failure (recovery of this nature is left to higher layers of the system and applications).

Building Groups Over Conventional Technologies

Group Addressing. Consider the problem of mapping a group address to a membership list, in an application in which the membership could change dynamically due to processes joining the group or leaving. The obvious way to approach this problem involves a membership service [9, 12]. Such a service maintains a map from group identifiers to membership lists. Deferring fault-tolerance issues, one could implement such a service using a simple program that supports remotely callable procedures to register a new group or group member, obtain the membership of a group, and perhaps forward a message to the group. A process could then transmit a message either by forwarding it via the naming service, or by looking up the membership information, caching it, and transmitting messages directly.³ The first approach will perform better for one-time interactions; the second would be preferable in an application that sends a stream of messages to the group.

³ In the latter case, one would also need a mechanism for invalidating cached addressing information when the group membership changes (this is not a trivial problem, but the need for brevity precludes discussing it in detail).

This form of addressing also raises a scheduling question. The designer of a distributed application will want to send messages to all members of the group, under some reasonable interpretation of the term "all." The question, then, is how to schedule the delivery of messages so that the delivery is to a reasonable set of processes. For example, suppose that a process group contains three processes, and a process sends many messages to it. One would expect these messages to reach all three members, not some other set reflecting a stale view of the group composition (e.g., including processes that have left the group).

The solution to this problem favored in our work can be understood by thinking of the group membership as data in a database shared by the sender of a multidestination message (a multicast⁴), and the algorithm used to add new members to the group. A multicast "reads" the membership of the group to which it is sent, holding a form of read-lock until the delivery of the message occurs. A change of membership that adds a new member would be treated like a "write" operation, requiring a write-lock that prevents such an operation from executing while a prior multicast is under way. It will now appear that messages are delivered to groups only when the membership is not changing.

⁴ In this article the term multicast refers to sending a single message to the members of a process group. The term broadcast, common in the literature, is sometimes confused with the hardware broadcast capabilities of devices like Ethernet. While a multicast might make use of hardware broadcast, this would simply represent one possible implementation strategy.

A problem with using locking to implement address expansion is cost. Accordingly, ISIS uses this idea, but does not employ a database or any sort of locking. And, rather than implement a membership service, which could represent a single point of failure, ISIS replicates knowledge of the membership among the members of the group itself. This is done in an integrated manner, in order to perform address expansion with no extra messages or unnecessary delays and guarantee the logical instantaneity property that the user expects. For practical purposes, any message sent to a group can be thought of as reaching all members at the same time.

Logical time and causal dependency. The phrase "reaching all of its members at the same time" raises an issue that will prove to be fundamental to message-delivery ordering. Such a statement presupposes a temporal model. What notion of time applies to distributed process group applications?

In 1978, Leslie Lamport published a seminal paper that considered the role of time in distributed algorithms [21]. Lamport asked how one might assign timestamps to the events in a distributed system to correctly capture the order in which events occurred. Real time is not suitable for this: each machine will have its own clock, and clock synchronization is at best imprecise in distributed systems. Moreover, operating systems introduce unpredictable software delays, execution speeds can vary widely due to cache affinity effects, and [...] is often unpredictable. These factors make it difficult to compare timestamps assigned by different machines.

As an alternative, Lamport suggested, one could discuss distributed algorithms in terms of the dependencies between the events making up the system execution. For example, suppose a process first sets some variable x to 3, and then sets y = x. The event corresponding to the latter operation would depend on the former one--an example of a local dependency. Similarly, receiving a message depends on sending it. This view of a system leads one to define the potential causality relationship between events in the system. It is the irreflexive transitive closure of the message send-receive relation and the local dependency relation for processes in the system. If event a happens before event b in a distributed system, the causality relation will capture this.

In Lamport's view of time, we would say that two events are concurrent if they are not causally related: the issue is not whether they actually executed simultaneously in some run of the system, but whether the system was sensitive to their respective ordering. Given an execution of a system, there exists a large set of equivalent executions arrived at by rescheduling concurrent events
¢OIMIMUNICATIONSOFTHIEAC:NI December 1993/Vol,36, Nos12 41 Business Computing while retaining the event ordering receives message m3. Message ms was proposed in the previous section is constraints represented by causality sent by s] after receiving m2, and violated: were m4 to trigger a concur- relation. The key observation is that might refer to or depend on m 2. For rent database search, process Sl the causal ,event ordering captures all the example, m 2 might authorize a cer- would search the first third of the essential ordering information needed to tain broker to trade a particular ac- database, while s~ searches the sec- describe the execution: any two physical count, and m3 could be a trade the ond ha/f--one-sixth of the database execution.s with the same causal broker has initiated on behalf of that would not be searched. However, the event ordering describe indistin- account. Our execution is such that ss figure could easily be changed to guishable runs of the system. has not yet received m2 when ms is simultaneously violate other order- Recall our use of the phrase delivered. Perhaps m2 was discarded ing properties. "reaching all of its members at the by the due to a lack same time." Lamport has suggested of buffering space. It will be retrans- State transfer. Figure 4, part (B) that for a system described in terms mitted, but only after a brief delay illustrates a slightly different prob- of a causal event ordering, any set of during which ms might be received. lem. Here, we wish to transfer the concurrent events, one per process, Why might this matter? Imagine state of the service to process s3: per- can be thought of as representing a that ss is displaying buy/sell orders on haps ss represents a program that logical instant in time. 
Thus, when we say that all members of a group receive a message at the same time, we mean that the message delivery events are concurrent and totally ordered with respect to group membership change events. Causal dependency provides the fundamental notion of time in a distributed system, and plays an important role in the remainder of this section.

Message delivery ordering. Consider Figure 4, part (A), in which messages m1, m2, m3 and m4 are sent to a group consisting of processes s1, s2, and s3. Messages m1 and m2 are sent concurrently and are received in different orders by s2 and s3. In many applications, s2 and s3 would behave in an uncoordinated or inconsistent manner if this occurred. A designer must, therefore, anticipate possible inconsistent message ordering. For example, one might design the application to tolerate such mixups, or explicitly prevent them from occurring by delaying the processing of m1 and m2 within the program until an ordering has been established. The real danger is that a designer could overlook the whole issue--after all, two simultaneous messages to the program that arrive in different sequences may seem like an improbable scenario--yielding an application that usually is correct, but may exhibit abnormal behavior when unlikely sequences of events occur, or under periods of heavy load. (Under load, multicast delivery latencies rise, increasing the probability that concurrent multicasts could overlap.)

This is only one of several delivery ordering problems illustrated in Figure 4. Consider the situation when s3

the trading floor, s3 will consider m3 invalid, since it will not be able to confirm that the trade was authorized. An application with this problem might fail to carry out valid trading requests. Again, although the problem is solvable, the question is whether the application designer will have anticipated the problem and programmed a correct mechanism to compensate when it occurs.

In our work on ISIS, this problem is solved by including a context record on each message. If a message arrives out of order, this record can be used to detect the condition, and to delay delivery until prior messages arrive. The context representation we employ has size linear in the number of members of the group within which the message is sent (actually, in the worst case a message might carry multiple such context records, but this is extremely rare). However, the average size can be greatly reduced by taking advantage of repetitious communication patterns, such as the tendency of a process that sends to a group to send multiple messages in succession [11]. The imposed overhead is variable, but on the average small. Other solutions to this problem are described in [9, 26].

Message m4 exhibits a situation that combines several of these issues. m4 is sent by a process that previously sent m1 and is concurrent with m2, m3, and a membership change of the group. One sees here a situation in which all of the ordering issues cited thus far arise simultaneously, and in which failing to address any of them could lead to errors within an important class of applications. As shown, only the group addressing property

has restarted after a failure (having lost prior state) or a server that has been added to redistribute load. Intuitively, the state of the server will be a data structure reflecting the data managed by the service, as modified by the messages received prior to when the new member joined the group. However, in the execution shown, a message has been sent to the server concurrent with the membership change. A consequence is that s3 receives a state which does not reflect message m4, leaving it inconsistent with s1 and s2. Solving this problem involves a complex synchronization algorithm (not presented here), probably beyond the ability of a typical distributed applications programmer.

Fault tolerance. Up to now, our discussion has ignored failures. Failures cause many problems; here, we consider just one. Suppose the sender of a message were to crash after some, but not all, destinations receive the message. The destinations that do have a copy will need to complete the transmission or discard the message. The protocol used should achieve "exactly-once delivery" of each message to those destinations that remain operational, with bounded overhead and storage. Conversely, we need not be concerned with delivery to a process that fails during the protocol, since such a process will never be heard from again (recall the fail-stop model). Protocols to solve this problem can be complex, but a fairly simple solution will illustrate the basic techniques. This protocol uses three rounds of RPCs as illustrated in Figure 5. During the first round, the

sender sends the message to the destinations, which acknowledge receipt. Although the destinations can deliver the message at this point, they need to keep a copy: should the sender fail during the first round, the destination processes that have received copies will need to finish the protocol on the sender's behalf. In the second round, if no failure has occurred, then the sender tells all destinations that the first round has finished. They acknowledge this message and make a note that the sender is entering the third round. During the third round, each destination discards all information about the message--deleting the saved copy of the message and any other data it was maintaining.

When a failure occurs, a process that has received a first- or second-round message can terminate the protocol. The basic idea is to have some member of the destination set take over the round that the sender was running when it failed; processes that have already received messages in that round detect duplicates and respond to them as they responded after the original reception. The protocol is straightforward, and we leave the details to the reader.

Figure 4. Message-ordering problems (parts A and B)

Figure 5. Three-round reliable multicast (round 1: OK to deliver message; round 2: acknowledge, no other action; round 3: OK to garbage collect)

This three-round multicast protocol does not obtain any form of pipelined or asynchronous data flow when invoked many times in succession, and the use of RPC limits the degree of communication concurrency during each round (it would be better to send all the messages at once, and to collect the replies in parallel). These features make the protocol expensive. Much better solutions have been described in the literature (see [9, 11] for more detail on the approach used in ISIS, and for a summary of other work in the area).

Recall that in the subsection "Conventional Message-Passing Technologies," we indicated that systemwide agreement on membership was an important property of our overall approach. It is interesting to realize that a protocol such as this is greatly simplified because failures are reported consistently to all processes in the system. If failure detection were by an inconsistent mechanism, it would be very difficult to convince oneself that the protocol is correct (indeed, as stated, the protocol could deliver duplicates if failures are reported inaccurately). The merit of solving such a problem at a low level is that we can then make use of the consistency properties of the solution when reasoning about protocols that react to failures.

Summary of issues. The previous discussion pointed to some of the potential pitfalls that confront the developer of group software working over a conventional operating system: (1) weak support for reliable communication, notably inconsistency in the situations in which channels break, (2) group address expansion, (3) delivery ordering for concurrent messages, (4) delivery ordering for sequences of related messages, (5) state transfers, and (6) failure atomicity. This list is not exhaustive: we have overlooked questions involving real-time delivery guarantees, and persistent databases and files. However, our work on ISIS treats process group issues under the assumption that any real-time deadlines are long compared to communication latencies, and that process states are volatile, hence we view these issues as beyond the scope of the current article.5 The list does cover the major issues that arise in this more restrictive domain [5].

At the beginning of this section, we asserted that modern operating systems lack the tools needed to develop group-based software. This assertion goes beyond standards such as Unix to include next-generation systems such as NT, Mach, CHORUS and Amoeba.6 A basic premise of this article is that, although all of these problems can be solved, the complexity associated with working out the solutions and integrating them in a single system will be a significant barrier to application developers.

5 These issues can be addressed within the tools layer of ISIS, and in fact the current system includes an optional subsystem for management of persistent data.

6 Mach IPC provides strong guarantees of reliability in its communication subsystem. However, Mach may experience unbounded delay when a node failure occurs. CHORUS includes a port-group mechanism, but with weak semantics, patterned after earlier work on the V system [15]. Amoeba, which initially lacked group support, has recently been extended to a mechanism apparently motivated by our work on ISIS [19].

The only practical approach is to solve these problems in the distributed computing environment itself, or in the operating system. This permits a solution to be engineered in a way that will give good, predictable performance and takes full advantage of hardware and operating system features. Furthermore, providing process groups as an underlying tool permits the programmer to concentrate on the problem at hand. If the implementation of process groups is left to the application designer, nonexperts are unlikely to use the approach. The brokerage application of the introduction would be extremely difficult to build using the tools provided by a conventional operating system.

ISIS Tools at Process Group Level

Process groups: Create, delete, join (transferring state).
Group multicast: CBCAST, ABCAST, collecting 0, 1, QUORUM or ALL replies (0 replies gives an asynchronous multicast).
Synchronization: Locking, with symbolic strings to represent locks. Deadlock detection or avoidance must be addressed at the application level. Token passing.
Replicated data: Implemented by broadcasting updates to group having copies. Transfer values to processes that join using state transfer facility. Dynamic system reconfiguration using replicated configuration data. Checkpoint/update logging, spooling for state recovery after failure.
Monitoring facilities: Watch a process or site, trigger actions after failures and recoveries. Monitor changes to process group membership, site failures, and so forth.
Distributed execution facilities: Redundant computation (all take same action). Subdivided among multiple servers. Coordinator-cohort (primary/backup).
Automated recovery: When a site recovers, programs automatically restart. For the first site to recover, group state is restored from logs (or initialized by software). For other sites, a process group join and transfer state is initiated.
WAN communication: Reliable long-haul and file transfer facility.

Virtual Synchrony

It was observed earlier in this article that integration of group programming mechanisms into a single environment is also an important problem. Our work addresses this issue through an execution model called virtual synchrony, motivated by prior work on transaction serializability. We will present the approach in two stages. First, we discuss an execution model called close synchrony. This model is then relaxed to arrive at the virtual synchrony model. A comparison of our work with the serializability model appears in the section "ISIS and Other Distributed Computing Technologies." The basic idea is to encourage programmers to assume a closely synchronized style of distributed execution [10, 31]:

• Execution of a process consists of a sequence of events, which may be internal computation, message transmissions, message deliveries, or changes to the membership of groups that it creates or joins.
• A global execution of the system consists of a set of process executions. At the global level, one can talk about messages sent as multicasts to process groups.
• Any two processes that receive the same multicasts or observe the same group membership changes see the corresponding local events in the same relative order.
• A multicast to a process group is delivered to its full membership. The send and delivery events are considered to occur as a single, instantaneous event.

Close synchrony is a powerful guarantee. In fact, as seen in Figure 6, it eliminates all the problems identified in the preceding section:

• Weak communication reliability guarantees: A closely synchronous communication subsystem appears to the programmer as completely reliable.
• Group address expansion: In a closely synchronous execution, the membership of a process group is fixed at the logical instant when a multicast is delivered.
• Delivery ordering for concurrent messages: In a closely synchronous execution, concurrently issued multicasts are distinct events. They would, therefore, be seen in the same order by any destinations they have in common.
• Delivery ordering for sequences of related messages: In Figure 6, part (A), process s1 sent message m3 after receiving m2, hence m3 may be causally dependent on m2. Processes executing in a closely synchronous model would never see anything inconsistent with this causal dependency relation.
• State transfer: State transfer occurs at a well-defined instant in time in the model. If a group member checkpoints the group state at the instant when a new member is added, or sends something based on the state to the new member, the state will be well defined and complete.
• Failure atomicity: The close synchrony model treats a multicast as a single logical event, and reports failures through group membership changes that are ordered with respect to multicast. The all or nothing behavior of an atomic multicast is thus implied by the model.

Figure 6. Closely synchronous execution (parts A and B)

Unfortunately, although closely synchronous execution simplifies distributed application design, the approach cannot be applied directly in a practical setting. First, achieving close synchrony is impossible in the presence of failures. Say that processes s1 and s2 are in group G and message m is multicast to G. Consider s1 at the instant before it delivers m. According to the close synchrony model, it can only deliver m if s2 will do so also. But s1 has no way to be sure that s2 is still operational, hence s1 will be unable to make progress [36]. Fortunately, we can finesse this issue: if s2 has failed, it will hardly be in a position to dispute the assertion that m was delivered to it first!

A second concern is that maintaining close synchrony is expensive. The simplicity of the approach stems in part from the fact that the entire process group advances in lockstep. But, this also means that the rate of progress each group member can make is limited by the speed of the other members, and this could have a huge performance impact. What is needed is a model with the conceptual simplicity of close synchrony, but that is capable of efficiently supporting very high throughput applications.

In distributed systems, high throughput comes from asynchronous interactions: patterns of execution in which the sender of a message is permitted to continue executing without waiting for delivery. An asynchronous approach treats the communications system like a bounded buffer, blocking the sender only when the rate of data production exceeds the rate of consumption, or when the sender needs to wait for a reply or some other input (Figure 7). The advantage of this approach is that the latency (delay) between the sender and the destination does not affect the data transmission rate--the system operates in a pipelined manner, permitting both the sender and destination to remain continuously active. Closely synchronous execution precludes such pipelining, delaying execution of the sender until the message can be delivered.

Figure 7. Asynchronous pipelining

This motivates the virtual synchrony approach. A virtually synchronous system permits asynchronous executions for which there exists some closely synchronous execution indistinguishable from the

asynchronous one. In general, this means that for each application, events need to be synchronized only to the degree that the application is sensitive to event ordering. In some situations, this approach will be identical to close synchrony. In others, it is possible to deliver messages in different orders at different processes, without the application noticing. When such a relaxation of order is permissible, a more asynchronous execution results.

Order sensitivity in distributed systems. We are led to a final technical question: "when can synchronization be relaxed in a virtually synchronous distributed system?" Two forms of ordering turn out to be useful; one is "stronger" than the other, but also more costly to support.

Consider a system with two processes, s1 and s2, sending messages into a group G with members g1 and g2. s1 sends message m1 to G and, concurrently, s2 sends m2. In a closely synchronous system, g1 and g2 would receive these messages in identical orders. If, for example, the messages caused updates to a data structure replicated within the group, this property could be used to ensure that the replicas remain identical through the execution of the system. A multicast with this property is said to achieve an atomic delivery ordering, and is denoted ABCAST. ABCAST is an easy primitive to work with, but costly to implement. This cost stems from the following consideration: An ABCAST message can only be delivered when it is known that no prior ABCAST remains undelivered. This introduces latency: messages m1 and m2 must be delayed before they can be delivered to g1 and g2. Such a delivery latency may not be visible to the application. But, in cases in which s1 and s2 need responses from g1 and/or g2, or where the senders and destinations are the same, the application will experience a significant delay each time an ABCAST is sent. The latencies involved can be very high, depending on how the ABCAST protocol is engineered.

Not all applications require such a strong, costly, delivery ordering. Concurrent systems often use some form of synchronization or mutual exclusion mechanism to ensure that conflicting operations are performed in some order. In a parallel shared-memory environment, this is normally done using semaphores around critical sections of code. In a distributed system, it would normally be done by using some form of locking or token passing. Consider such a distributed system, having the property that two messages can be sent concurrently to the same group only when their effects on the group are independent. In the preceding example, either s1 and s2 would be prevented from sending concurrently (i.e., if m1 and m2 have potentially conflicting effects on the states of the members of G), or if they are permitted to send concurrently, the delivery orders could be arbitrarily interleaved, because the actions on receiving such messages commute.

Figure 8. Causal ordering

It might seem that the degree of delivery ordering needed would be first-in, first-out (FIFO). However, this is not quite right, as illustrated in Figure 8. Here we see a situation in which s1, holding mutual exclusion, sends message m1, but then releases its mutual exclusion lock to s2, which sends m2. Perhaps, m1 and m2 are updates to the same data item; the order of delivery could therefore be quite important. Although there is certainly a sense in which m1 was sent "first," notice that a FIFO delivery order would not enforce the desired ordering, since FIFO order is usually defined for a (sender, destination) pair, and here we have two senders.

The ordering property needed for this example is that if m1 causally precedes m2, then m1 should be delivered before m2 at shared destinations, corresponding to a multicast primitive denoted CBCAST. Notice that CBCAST is weaker than ABCAST, because it permits messages that were sent concurrently to be delivered to overlapping destinations in different sequences.7

7 The statement that CBCAST is "weaker" than ABCAST may seem imprecise: as we have stated the problem, the two protocols simply provide different forms of ordering. However, the ISIS version of ABCAST actually extends the partial CBCAST ordering into a total one: it is a causal atomic multicast primitive. An argument can be made that an ABCAST protocol that is not causal cannot be used asynchronously, hence we see strong reasons for implementing ABCAST in this manner.

The major advantage of CBCAST over ABCAST is that it is not subject to the type of latency cited previously. A CBCAST message can be delivered as soon as any prior messages have been delivered, and all the information needed to determine whether any prior messages are outstanding can be included, at low overhead, on the CBCAST message itself. Except in unusual cases where a prior message is somehow delayed in the network, a CBCAST message will be delivered immediately on receipt.

The ability to use a protocol such as CBCAST is highly dependent on the nature of the application. Some applications have a mutual exclusion structure for which causal delivery ordering is adequate, while others would need to introduce a form of locking to be able to use CBCAST instead of ABCAST. Basically, CBCAST can be used when any conflicting multicasts are uniquely ordered along a single causal chain. In this case, the CBCAST guarantee is strong enough to ensure that all the conflicting multicasts are seen in the same order by all recipients--specifically, the causal dependency order. Such an execution system is virtually synchronous, since the outcome of the execution is the same as if an atomic delivery order had been used.

The CBCAST communication pattern arises most often in a process group that manages replicated (or coherently cached) data using locks to order updates. Processes that update such data first acquire the lock, then issue a stream of asynchronous updates, and then release the lock.
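The causal delivery rule for CBCAST (deliver a message as soon as every causally prior message has been delivered, using information carried on the message itself) can be illustrated with a small sketch. This is an illustration under stated assumptions, not the ISIS implementation: it assumes the information carried on each message is a vector timestamp with one counter per sending member, and the names `CausalReceiver`, `receive`, and `log` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed context record: for each sender, how many of its multicasts
# the sending process had delivered (or sent) when it issued this one.
Timestamp = Dict[str, int]


class CausalReceiver:
    """Delays each multicast until every causally prior multicast is delivered."""

    def __init__(self, senders: List[str]) -> None:
        # delivered[p] = number of multicasts from p delivered here so far
        self.delivered: Timestamp = {p: 0 for p in senders}
        self.pending: List[Tuple[str, Timestamp, str]] = []
        self.log: List[str] = []  # payloads, in delivery order

    def _deliverable(self, sender: str, ts: Timestamp) -> bool:
        # Must be the next multicast from this sender ...
        if ts[sender] != self.delivered[sender] + 1:
            return False
        # ... and everything the sender had seen must already be delivered here.
        return all(ts[p] <= self.delivered[p] for p in ts if p != sender)

    def receive(self, sender: str, ts: Timestamp, payload: str) -> None:
        self.pending.append((sender, ts, payload))
        progress = True
        while progress:  # one delivery may unblock other pending messages
            progress = False
            for msg in list(self.pending):
                s, t, body = msg
                if self._deliverable(s, t):
                    self.pending.remove(msg)
                    self.delivered[s] += 1
                    self.log.append(body)
                    progress = True
```

In the scenario of Figure 8, s1 sends m1 with timestamp `{"s1": 1, "s2": 0}` and s2, having seen m1, sends m2 with `{"s1": 1, "s2": 1}`. A receiver that gets m2 first holds it until m1 arrives, then delivers both in causal order. Two messages with timestamps `{"s1": 1, "s2": 0}` and `{"s1": 0, "s2": 1}` are concurrent, and the sketch delivers them in arrival order, which may differ between receivers.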

There will generally be one update lock for each class of related data items, so that acquisition of the update lock rules out conflicting updates.8 However, mutual exclusion can sometimes be inferred from other properties of an algorithm, hence such a pattern may arise even without an explicit locking stage. By using CBCAST for this communication, an efficient, pipelined data flow is achieved. In particular, there will be no need to block the sender of a multicast, even momentarily, unless the group membership is changing at the time the message is sent.

8 In ISIS applications, locks are used primarily for mutual exclusion on possibly conflicting operations, such as updates on related data items. In the case of replicated data, this results in an algorithm similar to a primary copy update in which the "primary" copy changes dynamically. The execution model is nontransactional, and there is no need for read-locks or for a two-phase locking rule. This is discussed further in the section "ISIS and Other Distributed Computing Technologies."

The tremendous performance advantage of CBCAST over ABCAST may not be immediately evident. However, when one considers how fast modern processors are in comparison with communication devices, it should be clear that any primitive that unnecessarily waits before delivering a message could introduce substantial overhead. For example, it is common for an application that replicates a table of pending requests within a group to multicast each new request, so that all members can maintain identical copies of the table. In such cases, if the way that a request is handled is sensitive to the contents of the table, the sender of the multicast must wait until the multicast is delivered before acting on the request. Using ABCAST the sender will need to wait until the delivery order can be determined. Using CBCAST, the update can be issued asynchronously, and applied immediately to the copy maintained by the sender. The sender thus avoids a potentially long delay, and can immediately continue computation or reply to the request. When a sender generates bursts of updates, also a common pattern, the advantage of CBCAST over ABCAST is even greater, because multiple messages can be buffered and sent in one packet, giving a pipelining effect.

The distinction between causal and total event orderings (CBCAST and ABCAST) has parallels in other settings. Although ISIS was the first distributed system to enforce a causal delivery ordering as part of a communication subsystem [7], the approach draws on Lamport's prior work on logical notions of time. Moreover, the approach was in some respects anticipated by work on primary copy replication in database systems [6]. Similarly, close synchrony is related both to Lamport and Schneider's state machine approach to developing distributed software [32] and to the database serializability model, to be discussed further. Work on parallel processor architectures has yielded a memory update model called weak consistency [16, 35], which uses a causal dependency principle to increase parallelism in the cache of a parallel processor. And, a causal correctness property has been used in work on lazy update in shared memory multiprocessors [1] and distributed database systems [18, 20]. A more detailed discussion of the conditions under which CBCAST can be used in place of ABCAST appears in [10, 31].

Summary of Benefits Due to Virtual Synchrony

Brevity precludes a more detailed discussion of virtual synchrony, or how it is used in developing distributed algorithms within ISIS. It may be useful, however, to summarize the benefits of the model:

• Allows code to be developed assuming a simplified, closely synchronous execution model;
• Supports a meaningful notion of group state and state transfer, both when groups manage replicated data, and when a computation is dynamically partitioned among group members;
• Asynchronous, pipelined communication;
• Treatment of communication, process group membership changes and failures through a single, event-oriented execution model;
• Failure handling through a consistently presented system membership list integrated with the communication subsystem. This is in contrast to the usual approach of sensing failures through timeouts and broken channels, which does not guarantee consistency.

The approach also has limitations:

• Reduced availability during LAN partition failures: only allows progress in a single partition, and hence tolerates at most ⌊n/2⌋ - 1 simultaneous failures, if n is the number of sites currently operational;
• Risks incorrectly classifying an operational site or process as faulty.

The virtual synchrony model is unusual in offering these benefits within a single framework. Moreover, theoretical arguments exist that no system that provides consistent distributed behavior can completely evade these limitations. Our experience has been that the issues addressed by virtual synchrony are encountered in even the simplest distributed applications, and that the approach is general, complete, and theoretically sound.

The ISIS Toolkit

The ISIS toolkit provides a collection of higher-level mechanisms for forming and managing process groups and implementing group-based software. This section illustrates the specifics of the approach by discussing the styles of process groups supported by the system and giving a simple example of a distributed database application.

ISIS is not the first system to use process groups as a programming tool: at the time the system was initially developed, Cheriton's V system had received wide visibility [15]. More recently, group mechanisms have become common, exemplified by the Amoeba system [19], the CHORUS operating system [26], the Psync system [29], a high availability system developed by Ladin and Liskov [20], IBM's AAS system [14], and Transis [3]. Nonetheless, ISIS was first to propose the virtual synchrony model and to offer high-performance, consistent solutions to a wide variety of problems through its toolkit. The approach is now gaining wide acceptance.9

9 At the time of this writing our group is working with the Open Software Foundation on integration of a new version of the technology into Mach (the OSF 1/AD version) and with Unix International, which plans a reliable group mechanism for UI Atlas.

Figure 9. Styles of groups (panels: peer group, client-server group, diffusion group, hierarchical group)

Styles of Groups

The efficiency of a distributed system is limited by the information available to the protocols employed for communication. This was a consideration in developing the ISIS process group interface, in which a trade-off had to be made between simplicity of the interface and the availability of accurate information about group membership for use in multicast address expansion. Consequently, the ISIS application interface introduces four styles of process groups that differ in how processes interact with the group, illustrated in Figure 9 (anonymous groups are not distinguished from explicit groups at this level of the system). ISIS is optimized to detect and handle each of these cases efficiently. The four styles of process groups are:

Peer groups: These arise where a set of processes cooperate closely, for example, to replicate data. The membership is often used as an input to the algorithm used in handling requests, as for the concurrent database search described earlier.

Client-server groups: In ISIS, any process can communicate with any group given the group's name and appropriate permissions. However, if a nonmember of a group will multicast to it repeatedly, better performance is obtained by first registering the sender as a client of the group; this permits the system to optimize the group addressing protocol.

Diffusion groups: A diffusion group is a client-server group in which the clients register themselves but in which the members of the group send messages to the full client set and the clients are passive sinks.

Hierarchical groups: A hierarchical group is a structure built from multiple component groups, for reasons of scale. Applications that use the hierarchical group initially contact its root group, but are subsequently redirected to one of the constituent "subgroups." Group data would normally be partitioned among the subgroups. Although tools are provided for multicasting to the full membership of the hierarchy, the most common communication pattern involves interaction between a client and the members of some subgroup.

There is no requirement that the members of a group be identical, or even coded in the same language or executed on the same architecture. Moreover, multiple groups can be overlapped and an individual process can belong to as many as several hundred different groups, although this is uncommon. Scaling is discussed later in this article.

The Toolkit Interface

As noted earlier, the performance of a distributed system is often limited by the degree of communication pipelining achieved. The development of asynchronous solutions to distributed problems can be tricky, and many ISIS users would rather employ less efficient solutions than risk errors. For this reason, the toolkit includes asynchronous implementations of the more important distributed programming paradigms. These include a synchronization tool that supports a form of locking (based on distributed tokens), a replication tool for managing replicated data, a tool for fault-tolerant primary-backup server design that load-balances by making different group members act as the primary for different requests, and so forth (a partial list appears in the sidebar "ISIS Tools at a Process Group Level"). Using these tools, and following programming examples in the ISIS manual, even nonexperts have been successful in developing fault-tolerant, highly asynchronous distributed software.

Figures 10 and 11 show a complete, fault-tolerant database server for maintaining a mapping from names (ASCII strings) to salaries (integers). The example is in the C programming language. The server initializes ISIS and declares the procedures that will handle update and inquiry requests. The isis_mainloop dispatches incoming messages to these procedures as needed (other styles of main loop are also supported). The formatted-I/O style of message generation and scanning is specific to the C interface, where type information is not available at run time.

The "state transfer" routines are concerned with sending the current contents of the database to a server that has just been started and is joining the group. In this situation, ISIS arbitrarily selects an existing server to do a state transfer, invoking its state sending procedure. Each call that this procedure makes to xfer_out will cause an invocation of rcv_state on the receiving side; in our example, the latter simply passes the message to the update procedure (the same message format is used by send_state and update). Of course, there are many variants on this basic scheme. For example, it is possible to indicate to the system that only certain servers should be allowed to handle state transfer requests, to refuse to allow certain processes to join, and so forth.

The client program does a pg_lookup to find the server. Subsequently, calls to its query and update procedures are mapped into messages to the server. The BCAST calls are mapped to the appropriate


Figure 10. A simple database server

Figure 11. A client of the simple database service

default for the group--ABCAST in this case.

The database server of Figure 10 uses a redundant style of execution in which the client broadcasts each request and will receive multiple, identical replies from all copies. In practice, the client will wait for the first reply and ignore all others. Such an approach provides the fastest possible reaction to a failure, but has the disadvantage of consuming n times the resources of a fault-intolerant solution, where n is the size of the process group. An alternative would have been to subdivide the search so that each server performs 1/n'th of the work. Here, the client would combine responses from all the servers, repeating the request if a server fails instead of replying, a condition readily detected in ISIS.

ISIS interfaces have been developed for C, C++, Fortran, Common Lisp, Ada and Smalltalk, and ports of ISIS exist for Unix workstations and mainframes from all major vendors, as well as for Mach, CHORUS, ISC and SCO Unix, the DEC VMS system, and Honeywell's Lynx operating system. Data within messages is represented in the binary format used by the sending machine, and converted to the format of the destination on receipt (if necessary), automatically and transparently.
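The two execution styles discussed for the database server of Figure 10 (redundant replies, and a search subdivided into 1/n shares) can be contrasted in a small model. The sketch below is not ISIS code and does not reproduce the C programs of Figures 10 and 11. It is an in-process Python illustration in which every name (`Replica`, `NoReply`, `multicast_update`, `query_first_reply`, `search_subdivided`) is hypothetical, a loop stands in for the multicast, and a failed server is modeled by an exception where ISIS would report the failure.

```python
from typing import Dict, List


class NoReply(Exception):
    """Raised when a (simulated) failed replica is asked for a reply."""

    def __init__(self, replica: "Replica") -> None:
        super().__init__("replica did not reply")
        self.replica = replica


class Replica:
    """One member of a hypothetical server group; holds a full copy of the table."""

    def __init__(self, table: Dict[str, int]) -> None:
        self.table = dict(table)
        self.alive = True

    def _check(self) -> None:
        if not self.alive:
            raise NoReply(self)

    def update(self, name: str, salary: int) -> None:
        if self.alive:  # a failed replica simply misses the update
            self.table[name] = salary

    def lookup(self, name: str) -> int:
        self._check()
        return self.table[name]

    def scan_share(self, index: int, n: int, min_salary: int) -> List[str]:
        # Search only every n-th name (in sorted order), starting at `index`.
        self._check()
        names = sorted(self.table)[index::n]
        return [k for k in names if self.table[k] >= min_salary]


def multicast_update(group: List[Replica], name: str, salary: int) -> None:
    # Stand-in for an ordered multicast: every live replica applies the update.
    for replica in group:
        replica.update(name, salary)


def query_first_reply(group: List[Replica], name: str) -> int:
    # Redundant style: every replica could answer; keep the first reply obtained.
    for replica in group:
        try:
            return replica.lookup(name)
        except NoReply:
            continue
    raise RuntimeError("no operational servers")


def search_subdivided(group: List[Replica], min_salary: int) -> List[str]:
    # Subdivided style: each server scans 1/n of the names; the client combines
    # the partial results and repeats the request if a server fails to reply.
    members = list(group)
    while members:
        n = len(members)
        try:
            parts = [r.scan_share(i, n, min_salary) for i, r in enumerate(members)]
        except NoReply as failure:
            members.remove(failure.replica)
            continue
        return sorted(name for part in parts for name in part)
    raise RuntimeError("no operational servers")
```

The subdivided search relies on all replicas holding identical tables, so that the shares they scan are disjoint and together cover every name. In this model that property comes from applying every update at every live replica.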

Who Uses ISIS, and How?

Brokerage

A number of ISIS users are concerned with financial computing systems such as the one cited at the beginning of this article. Figure 12 illustrates such a system, now seen from an internal perspective in which groups underlying the services employed by the broker become evident. A client-server architecture is used, in which the servers filter and analyze streams of data. Fault-tolerance here refers to two very different aspects of the application. First, financial systems must rapidly restart failed components and reorganize themselves so that service is not interrupted by software or hardware failures. Second, there are specific system functions that require fault-tolerance at the level of files or database, such as a guarantee that after rebooting, a file or database manager will be able to recover local data files at low cost. ISIS was designed to address the first type of problem, but includes several tools for solving the latter one.

The approach generally taken is to represent key services using process groups, replicating service state information so that even if one server process fails the other can respond to requests on its behalf. During periods when n service programs are operational, one can often exploit the redundancy to improve response time; thus, rather than asking how much such an application must pay for fault-tolerance, more appropriate questions concern the level of replication at which the overhead begins to outweigh the benefits of concurrency, and the minimum acceptable performance assuming k component failures. Fault-tolerance is something of a side effect of the replication approach.

A significant theme in financial computing is use of a subscription/publication style. The basic ISIS communication primitives do not spool messages for future replay, hence an application running over the system, the NEWS facility, has been developed to support this functionality.

A final aspect of brokerage systems is that they require a dynamically varying collection of services. A firm may work with dozens or hundreds of financial models, predicting market behavior for the financial instruments being traded under varying market conditions. Only a small subset of these services will be needed at any time. Thus, systems of this sort generally consist of a processor pool on which services can be started as necessary, and this creates a need to support an automatic remote execution and load balancing mechanism. The heterogeneity of typical networks complicates this problem, by introducing a pattern-matching aspect (i.e., certain programs may be subject to licensing restrictions, or require special processors, or may simply have been compiled for some specific hardware configuration). This problem is solved using the ISIS network resource manager, an application described later.

Database Replication and Triggers

Although the ISIS computation model differs from a transactional model (see also the section "ISIS and Other Distributed Computing Technologies"), ISIS is useful in constructing distributed database applications. In fact, as many as half of the applications with which we are familiar are concerned with this problem.

Typical uses of ISIS in database applications focus on replicating a database for fault-tolerance or to support concurrent searches for improved performance [2]. In such an architecture, the database system need not be aware that ISIS is present. Database clients access the database through a layer of software that multicasts updates (using ABCAST) to the set of servers, while issuing queries directly to the least loaded server. The servers are supervised by a process group that informs clients of load changes in the server pool, and supervises the restart of a failed server from a checkpoint and log of subsequent updates. It is interesting to realize that even such an unsophisticated approach to database replication addresses a widely perceived need among database users. In the long run, of course, comprehensive support for applications such as this would require extending ISIS to support a transactional execution model and to implement the XA/XOpen standards.

Beyond database replication, ISIS users have developed WAN databases by placing a local database system on each LAN in a WAN system. By monitoring the update traffic on a LAN, updates of importance to remote users can be intercepted and distributed through the ISIS WAN architecture. On each LAN, a server monitors incoming updates and applies them to the database server as necessary. To avoid a costly concurrency control problem, developers of applications such as these normally partition the database so that the data associated with each LAN is directly updated only from within that LAN. On remote LANs, such data can only be queried and could be stale, but this is still sufficient for many applications.

A final use of ISIS in database settings is to implement database triggers. A trigger is a query that is incrementally evaluated against the database as updates occur, causing some action immediately if a specified condition becomes true. For example, a broker might request that an alarm be sounded if the risk associated with a financial position exceeds some threshold. As data enters the financial database maintained by the brokerage, such a query would be evaluated repeatedly. The role of ISIS is in providing tools for reliably notifying applications when such a trigger becomes enabled, and for developing programs capable of taking the desired actions despite failures.

Major ISIS-based Utilities

In the preceding subsection, we alluded to some of the fault-tolerant utilities that have been built over ISIS. There are currently five such systems:

• NEWS: This application supports a collection of communication topics to which users can subscribe (obtaining a replay of recent postings) or post messages. Topics are identified with file-system style names, and it is possible to post to topics on a remote network using a "mail address" notation; thus, a Swiss brokerage firm might post some quotes to "/GENEVA/QUOTES/IBM@NEW-YORK." The application creates a process group for each topic, monitoring each such group to maintain a history of messages posted to it for replay to new subscribers, using a state transfer when a new member joins.

• NMGR: This program manages batch-style jobs and performs load sharing in a distributed setting. This involves monitoring candidate machines, which are collected into a processor pool, and then scheduling jobs on the pool. A pattern-matching mechanism is used for placement. If several machines are suitable for a given job, criteria based on load and available memory are used

[Figure 12 diagram. Labels legible in the extracted image include: historical price, analysis and volatility database modules, trader, data feeds, monitor and control, Japan, Zurich, etc., LAN manager, long-haul/WAN spooler.]

Figure 12. Process group architecture of brokerage system

to select one (these criteria can readily be changed). When employed to manage critical system services (as opposed to running batch-style jobs), the program monitors each service and automatically restarts failed components. Parallel make is an example of a distributed application program that uses NMGR for job placement: it compiles applications by farming out compilation subtasks to compatible machines.

• DECEIT: This system [33] provides fault-tolerant NFS-compatible file storage. Files are replicated both to increase performance (by supporting parallel reads on different replicas) and for fault tolerance. The level of replication is varied depending on the style of access detected by the system at run time. After a failed node recovers, any files it managed are automatically brought up to date. The approach conceals file replication from the user, who sees an NFS-compatible file-system interface.

• META/LOMITA: META is an extensive system for building fault-tolerant reactive control applications [24, 37]. It consists of a layer for instrumenting a distributed application or environment, by defining sensors and actuators. A sensor is any typed value that can be polled or monitored by the system; an actuator is any entity capable of taking an action on request. Built-in sensors include the load on a machine, the status of software and hardware components of the system, and the set of users on each machine. User-defined sensors and actuators extend this initial set.

The "raw" sensors and actuators of the lowest layer are mapped to abstract sensors by an intermediate layer, which also supports a simple database-style interface and a triggering facility. This layer supports an entity-relation data model and conceals many of the details of the physical sensors, such as polling frequency and fault tolerance. Sensors can be aggregated, for example by taking the average load on the servers that manage a replicated database. The interface supports a simple trigger language that will initiate a prespecified action when a specified condition is detected.

Running over META is a distributed language for specifying control actions in high-level terms, called LOMITA. LOMITA code is embedded into the Unix CSH command interpreter. At run time, LOMITA control statements are expanded into distributed finite state machines triggered by events that can be sensed local to a sensor or system component; a process group is used to implement aggregates, perform these state transitions, and to notify applications when a monitored condition arises.

• SPOOLER/LONG-HAUL FACILITY: This subsystem is responsible for wide-area communication [23] and for saving messages to groups that are only active periodically. It conceals link failures and presents an exactly-once communication interface.

Other ISIS Applications

Although this section covered a variety of ISIS applications, brevity precludes a systematic review of the full range of software that has been developed over the system. In addition to the problems cited, ISIS has been applied to telecommunications switching and "intelligent networking" applications, military systems, such as a proposed replacement for the AEGIS aircraft tracking and combat engagement system, medical systems, graphics and virtual reality applications, seismology, factory automation and production control, reliable management and resource scheduling for shared computing facilities, and a wide-area weather prediction and storm tracking system [2, 17, 35]. ISIS has also proved popular for scientific computing at laboratories such as CERN and Los Alamos, and has been applied to such

problems as a programming environment for automatically introducing parallelism into data-flow applications [4], a beam focusing system for a particle accelerator, a weather simulation that combines a highly parallel ocean model with a vectorized atmospheric model and displays output on advanced graphics workstations, and resource management software for shared supercomputing facilities.

It should also be noted that although this article has focused on LAN issues, ISIS also supports a WAN architecture and has been used in WANs composed of up to 10 LANs.¹⁰ Many of the applications cited are structured as LAN solutions interconnected by a reliable, but less responsive, WAN layer.

¹⁰ The WAN architecture of ISIS is similar to the LAN structure, but because WAN partitions are more common, encourages a more asynchronous programming style. WAN communication and link state are logged to disk files (unlike LAN communication), which enables ISIS to retransmit messages lost when a WAN partition occurs and to suppress duplicate messages. WAN issues are discussed in more detail in [23].

ISIS and Other Distributed Computing Technologies

Our discussion has overlooked the types of real-time issues that arise in the Advanced Automation System, a next-generation air-traffic control system being developed by IBM for the FAA [13, 14], which also uses a process-group-based computing model. Similarly, one might wonder how the ISIS execution model compares with transactional database execution models. Unfortunately, these are complex issues, and it would be difficult to do justice to them without a lengthy digression. Briefly, a technology like the one used in AAS differs from ISIS in providing strong real-time guarantees provided that timing assumptions are respected. This is done by measuring timing properties of the network, hardware, and scheduler on which the system runs and designing protocols to have highly predictable behavior. Given such information about the environment, one could undertake a similar analysis of the ISIS protocols, although we have not done so. As noted earlier, experience suggests that ISIS is fast enough for even very demanding applications.¹¹

¹¹ A process that experiences a timing fault in the protocols on which the AAS was originally based could receive messages that other processes reject, or reject messages others accept, because the criteria for accepting or rejecting a message use the value of the local clock [13]. This can lead to consistency violations. Moreover, if a fault is transient (e.g., the clock is subsequently resynchronized with other clocks), the inconsistency of such a process could spread if it initiates new multicasts, which other processes will accept. However, this problem can be overcome by changing the protocol, and the author understands this to have been done as part of the implementation of the AAS system.

The relationship between ISIS and transactional systems originates in the fact that both virtual synchrony and transactional serializability are order-based execution models [6]. However, where the "tools" offered by a database system focus on isolation of concurrent transactions from one another, persistent data and rollback (abort) mechanisms, those offered in ISIS are concerned with direct cooperation between members of groups, failure handling, and ensuring that a system can dynamically reconfigure itself to make forward progress when partial failures occur. Persistency of data is a big issue in database systems, but much less so in ISIS. For example, the commit problem is a form of reliable multicast, but a commit implies serializability and permanence of the transaction being committed, while delivery of a multicast in ISIS provides much weaker guarantees.

Conclusions

We have argued that the next generation of distributed computing systems will benefit from support for process groups and group programming. Arriving at an appropriate semantics for a process group mechanism is a difficult problem, and implementing those semantics would exceed the abilities of many distributed application developers. Either the operating system must implement these mechanisms or the reliability and performance of group-structured applications is unlikely to be acceptable.

The ISIS system provides tools for programming with process groups. A review of research on the system leads us to the following conclusions:

• Process groups should embody strong semantics for group membership, communication, and synchronization. A simple and powerful model can be based on closely synchronized distributed execution, but high performance requires a more asynchronous style of execution in which communication is heavily pipelined. The virtual synchrony approach combines these benefits, using a closely synchronous execution model, but deriving a substantial performance benefit when message ordering can safely be relaxed.
• Efficient protocols have been developed for supporting virtual synchrony.
• Nonexperts find the resulting system relatively easy to use.

This article was written as the first phase of the ISIS effort approached conclusion. We feel the initial system has demonstrated the feasibility of a new style of distributed computing. As reported in [11], ISIS achieves levels of performance comparable to those afforded by standard technologies (RPC and streams) on the same platforms. Looking to the future, we are now developing an ISIS "microkernel" suitable for integration into next-generation operating systems such as Mach, NT, and CHORUS. This new system will also incorporate a security architecture [26] and a real-time communication suite. The programming model, however, will be unchanged.

Process group programming could ignite a wave of advances in reliable distributed computing, and of applications that operate on distributed platforms. Using current technologies, it is impractical for typical developers to implement high-reliability software, self-managing distributed systems, to employ replicated data or simple coarse-grained parallelism, or to develop software that reconfigures automatically after a failure or recovery. Consequently, although current networks embody tremendously powerful computing resources, the programmers who develop software for these environments are severely constrained by a deficient software infrastructure. By removing these unnecessary obstacles, a vast groundswell of reliable


distributed application development can be unleashed.

Acknowledgments

The ISIS effort would not have been possible without extensive contributions by many past and present members of the project, users of the system, and researchers in the field of distributed computing. Thanks are due to: Ozalp Babaoglu, Micah Beck, Tim Clark, Robert Cooper, Brad Glade, Barry Gleeson, Holger Herzog, Guerney Hunt, Tommy Joseph, Ken Kane, Cliff Krumvieda, Jacob Levy, Messac Makpangou, Keith Marzullo, Mike Reiter, Aleta Ricciardi, Fred Schneider, Andre Schiper, Frank Schmuck, Stefan Sharkansky, Alex Siegel, Don Smith, Pat Stephenson, Robbert van Renesse, and Mark Wood. In addition, the author also gratefully acknowledges the help of Mauren Robinson, who prepared the original figures for this article.

References

1. Ahamad, M., Burns, J., Hutto, P. and Neiger, G. Causal memory. Tech. Rep., College of Computing, Georgia Institute of Technology, Atlanta, Ga., July 1991.
2. Allen, T.A., Sheppard, W. and Condon, S. Imis: A distributed query and report formatting system. In Proceedings of the SUN Users Group Meeting, Sun Microsystems Inc., 1992, pp. 94-101.
3. Amir, Y., Dolev, D., Kramer, S. and Malki, D. Transis: A communication subsystem for high availability. Tech. Rep. TR 91-13, Computer Science Dept., The Hebrew University of Jerusalem, Nov. 1991.
4. Babaoglu, O., Alvisi, L., Amoroso, S., Davoli, R. and Giachini, L.A. Paralex: An environment for parallel programming distributed systems. In Proceedings of the Sixth ACM International Conference on Supercomputing (Washington, D.C., July 1992), pp. 178-187.
5. Bache, T.C. et al. The intelligent monitoring system. Bull. Seismological Soc. Am. 80, 6 (Dec. 1990), 59-77.
6. Bernstein, P.A., Hadzilacos, V. and Goodman, N. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, Mass., 1987.
7. Birman, K.P. Replication and availability in the ISIS system. In Proceedings of the Tenth ACM Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1985), ACM SIGOPS, pp. 79-86.
8. Birman, K. and Cooper, R. The ISIS project: Real experience with a fault tolerant programming system. European SIGOPS Workshop, Sept. 1990. To be published in Oper. Syst. Rev. (Apr. 1991). Also available as Cornell University Computer Science Department Tech. Rep. TR90-1138.
9. Birman, K.P. and Joseph, T.A. Exploiting virtual synchrony in distributed systems. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (Austin, Tex., Nov. 1987), ACM SIGOPS, pp. 123-138.
10. Birman, K. and Joseph, T. Exploiting replication in distributed systems. In Distributed Systems, Sape Mullender, Editor, ACM Press, Addison-Wesley, New York, 1989, pp. 319-368.
11. Birman, K., Schiper, A. and Stephenson, P. Lightweight causal and atomic group multicast. ACM Trans. Comput. Syst. 9, 3 (Aug. 1991).
12. Cristian, F. Reaching agreement on processor group membership in synchronous distributed systems. Tech. Rep. RJ5964, IBM Research Laboratory, March 1988.
13. Cristian, F., Aghili, H., Strong, H.R. and Dolev, D. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the Fifteenth International Symposium on Fault-Tolerant Computing (Ann Arbor, Michigan, June 1985), Institution of Electrical and Electronic Engineers, pp. 200-206. A revised version as IBM Tech. Rep. RJ5244.
14. Cristian, F. and Dancey, R. Fault-tolerance in the advanced automation system. Tech. Rep. RJ7424, IBM Research Laboratory, San Jose, Calif., Apr. 1990.
15. Cheriton, D. and Zwaenepoel, W. The distributed V kernel and its performance for diskless workstations. In Proceedings of the Ninth ACM Symposium on Operating Systems Principles (Bretton Woods, New Hampshire, Oct. 1983), ACM SIGOPS, pp. 129-140.
16. Dubois, M., Scheurich, C. and Briggs, F. Memory access buffering in multiprocessors. In Proceedings of the Thirteenth Annual International Symposium on Computer Architecture (June 1986), pp. 434-442.
17. Johansen, D. Stormcast: Yet another exercise in distributed computing. In Distributed Open Systems in Perspective, Dag Johansen and Frances Brazier, Eds., IEEE, New York, 1993.
18. Joseph, T. and Birman, K. Low cost management of replicated data in fault-tolerant distributed systems. ACM Trans. Comput. Syst. 4, 1 (Feb. 1989), 54-70.
19. Kaashoek, M.F., Tanenbaum, A.S., Flynn-Hummel, S. and Bal, H.E. An efficient reliable broadcast protocol. Oper. Syst. Rev. 23, 4 (Oct. 1989), 5-19.
20. Ladin, R., Liskov, B. and Shrira, L. Lazy replication: Exploring the semantics of distributed services. In Proceedings of the Tenth ACM Symposium on Principles of Distributed Computing (Quebec City, Quebec, Aug. 1990), ACM SIGOPS-SIGACT, pp. 43-58.
21. Lamport, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558-565.
22. Liskov, B. and Ladin, R. Highly-available distributed services and fault-tolerant distributed garbage collection. In Proceedings of the Fifth ACM Symposium on Principles of Distributed Computing (Calgary, Alberta, Aug. 1986), ACM SIGOPS-SIGACT, pp. 29-39.
23. Makpangou, M. and Birman, K. Designing application software in wide area network settings. Tech. Rep. 90-1165, Department of Computer Science, Cornell University, 1990.
24. Marzullo, K., Cooper, R., Wood, M. and Birman, K. Tools for distributed application management. IEEE Comput. (Aug. 1991).
25. Peterson, L. Preserving context information in an ipc abstraction. In Sixth Symposium on Reliability in Distributed Software and Database Systems, IEEE (March 1987), pp. 22-31.
26. Peterson, L.L., Bucholz, N.C. and Schlichting, R. Preserving and using context information in interprocess communication. ACM Trans. Comput. Syst. 7, 3 (Aug. 1989), 217-246.
27. Reiter, M., Birman, K.P. and Gong, L. Integrating security in a group oriented distributed system. In Proceedings of the IEEE Symposium on Research in Security and Privacy (May 1992), pp. 18-32.
28. Ricciardi, A. and Birman, K. Using process groups to implement failure detection in asynchronous environments. In Proceedings of the Eleventh ACM Symposium on Principles of Distributed Computing (Montreal, Quebec, Aug. 1991), ACM SIGOPS-SIGACT.
29. Rozier, M., Abrossimov, V., Armand, M., Hermann, F., Kaiser, C., Langlois, S., Leonard, P. and Neuhauser, W. The CHORUS distributed system. Comput. Syst. (Fall 1988), pp. 299-328.

30. Schlichting, R.D. and Schneider, F.B. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM Trans. Comput. Syst. 1, 3 (Aug. 1983), 222-238.
31. Schmuck, F. The use of efficient broadcast primitives in asynchronous distributed systems. Ph.D. dissertation, Cornell University, 1988.
32. Schneider, F.B. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4 (Dec. 1990), 299-319.
33. Siegel, A., Birman, K. and Marzullo, K. Deceit: A flexible distributed file system. Tech. Rep. 89-1042, Department of Computer Science, Cornell University, 1989.
34. Tanenbaum, A. Computer Networks. Prentice-Hall, second ed., 1988.
35. Torrellas, J. and Hennessey, J. Estimating the performance advantages of relaxing consistency in a shared memory multiprocessor. Tech. Rep. CSL-TN-90-365, Computer Systems Laboratory, Stanford University, Feb. 1990.
36. Turek, J. and Shasha, D. The many faces of Consensus in distributed systems. IEEE Comput. 25, 6 (1992), 8-17.
37. Wood, M. Constructing reliable reactive systems. Ph.D. dissertation, Cornell University, Department of Computer Science, Dec. 1991.

CR Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design--network communications; C.2.2 [Computer-Communication Networks]: Network Protocols--protocol architecture; C.2.4 [Computer-Communication Networks]: Distributed Systems--distributed applications; D.4.4 [Operating Systems]: Communications management--message sending, network communications; D.4.5 [Operating Systems]: Reliability--fault tolerance; D.4.7 [Operating Systems]: Organization and Design--distributed systems

General Terms: Algorithms, Reliability

Additional Key Words and Phrases: Fault-tolerant process groups, message ordering, multicast communication

About the Author:
KENNETH P. BIRMAN is an associate professor in the Computer Science department at Cornell University and president and CEO of ISIS Distributed Systems, Inc. Current research interests include a range of problems in distributed computing and fault-tolerance, and he has developed a number of systems in this general area. Author's Present Address: Cornell University, Department of Computer Science, 4105A Upson Hall, Ithaca, NY 14853; email: [email protected]

Funding for the work presented in this article was provided under DARPA/NASA grant NAG-2-593, and by grants from IBM, HP, Siemens, GTE and Hitachi.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© ACM 0002-0782/93/1200-036 $1.50
