Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey

XAVIER DEFAGO´ Japan Advanced Institute of Science and Technology and PRESTO, Japan Science and Technology Agency

ANDRE´ SCHIPER Ecole´ Polytechnique F´ed´erale de Lausanne, Switzerland

AND

PETER´ URBAN´ Japan Advanced Institute of Science and Technology

Total order broadcast and multicast (also called atomic broadcast/multicast) present an important problem in distributed systems, especially with respect to fault-tolerance. In short, the primitive ensures that messages sent to a of processes are, in turn, delivered by all those processes in the same total order.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems; C.2.2 [Computer-Communication Networks]: Network Protocols—Applications; Protocol architecture;D.4.4 [Operating Systems]: Communications Management—Message sending; Network communication;D.4.5 [Operating Systems]: Reliability—Fault-tolerance; H.2.4 [Database Management]: Systems—Distributed databases; H.3.4 [Information Storage and Retrieval]: Systems and Software—Distributed systems General Terms: Algorithms, Reliability, Design Additional Key Words and Phrases: Distributed systems, distributed algorithms, group communication, fault-tolerance, agreement problems, message passing, total ordering, global ordering, atomic multicast, atomic broadcast, classification, taxonomy, survey

Part of this research was conducted for the program “Fostering Talent in Emergent Research Fields” in Spe- cial Coordination Funds for Promoting Science and Technology by the Japan Ministry of Education, Culture, Sports, Science and Technology. This work was initiated during Xavier Defago’s´ Ph.D. research at the Swiss Federal Institute of Technology in Lausanne [Defago´ 2000]. Peter´ Urban´ was supported by the Japan Society for the Promotion of Science, a Grant-in-Aid for JSPS Fellows from the Japanese Ministry of Education, Cul- ture, Sports, Science and Technology, the Swiss National Science Foundations, and the CSEM Swiss Center for Electronics and Microtechnology, Inc., Neuchatel.ˆ Authors’ addresses: X. Defago´ and P. Urban,´ School of Information Science, JAIST, 1-1 Asahidai, Tatsunokuchi, Nomigun, Ishikawa 923-1292, Japan; email: defago,urban @jaist.ac.jp; A. Schiper, IC- LSR, School of Information and Communication, EPFL, CH-1015{ Lausanne,} Switzerland; email: Andre. Schiper@epfl.ch. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copy- rights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: 1 (212) 869-0481, or [email protected]. + c 2004 ACM 0360-0300/04/1200-0372 $5.00 !

ACM Computing Surveys, Vol. 36, No. 4, December 2004, pp. 372–421. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 373

The problem has inspired an abundance of literature, with a plethora of proposed algorithms. This article proposes a classification of total order broadcast and multicast algorithms based on their ordering mechanisms, and addresses a number of other important issues. The article surveys about sixty algorithms, thus providing by far the most extensive study of the problem so far. The article discusses algorithms for both the synchronous and the asynchronous system models, and studies the respective properties and behavior of the different algorithms.

1. INTRODUCTION Literature on total order broadcast. There exists a considerable amount of Distributed systems and applications are literature on total order broadcast, and notoriously difficult to build. This is many algorithms, following various ap- mostly due to the unavoidable concur- proaches, have been proposed to solve this rency in such systems, combined with problem. It is, however, difficult to com- the difficulty of providing a global con- pare them as they often differ with respect trol. This difficulty is greatly reduced by to their actual properties, assumptions, relying on group communication primi- objectives, or other important aspects. It tives that provide higher guarantees than is hence difficult to know which solution is standard point-to-point communication. best suited to a given application context. One such primitive is called total order1 When confronted with new requirements, broadcast.2 Informally, the primitive en- the absence of a roadmap to the problem sures that messages sent to a set of pro- of total order broadcast can lead engineers cesses are delivered by all those processes and researchers to either develop new al- in the same order. Total order broadcast gorithms rather than adapt existing solu- is an important primitive that plays a tions (thus reinventing the wheel), or use central role, for instance, when imple- a solution poorly suited to the application menting the state machine approach (also needs. An important step to improve the called active replication) [Lamport 1978a; present situation is to provide a classifica- Schneider 1990; Poledna 1994]. It also tion of existing algorithms. has other applications, such as clock syn- Related work. Previous attempts have chronization [Rodrigues et al. 1993], com- been made at classifying and compar- puter supported cooperative writing, dis- ing total order broadcast algorithms tributed shared memory, and distributed [Anceaume 1993b; Anceaume and Minet mutual exclusion [Lamport 1978b]. More 1992; Cristian et al. 1994; Friedman and recently, it was also shown that an ad- van Renesse 1997; Mayer 1992]. However, equate use of total order broadcast can none is based on a comprehensive survey significantly improve the performance of of existing algorithms, and hence they all replicated databases [Agrawal et al. 1997; lack generality. Pedone et al. 1998; Kemme et al. 2003]. The most complete comparison so far was done by Anceaume and Minet [1992] 1Total order broadcast is also known as atomic broad- (an extended version was later published cast. Both terminologies are currently in use. There is a slight controversy with respect to using one over in French by Anceaume [1993b]), who the other. We opt for the former, that is, “total or- take an interesting approach based on the der broadcast,” because the latter is somewhat mis- properties of the algorithms. Their paper leading. Indeed, atomicity suggests a property re- raises some fundamental questions that lated to agreement rather than to total order (de- inspired a part of our work. It is, how- fined in Sect. 2), and the ambiguity has already been a source of misunderstandings. In contrast, “total or- ever, a little outdated now. In addition, der broadcast” unambiguously refers to the property the authors only study seven different al- of total order. gorithms, which are not truly represen- 2Total order multicast is sometimes used instead of tative; for instance, none is based on a total order broadcast. The distinction between the communication history approach (one of two primitives is explained later in the article (Sec- tion 3). When the distinction is not important, we use the five classes of algorithms; details in the term total order broadcast. Section 4.4).

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 374 X. D´efago et al.

Cristian et al. [1994] take a different ap- assumptions of each algorithm. This is, proach, focusing on the implementation of however, not always possible because the the algorithms, rather than their proper- information available in the papers is of- ties. They study four different algorithms, ten not sufficient to accurately character- and compare them using discrete event ize the behavior of the algorithm (e.g., in simulation. They find interesting results the face of a failure). regarding the respective performance of Structure. The article is logically or- different implementation strategies. Nev- ganized into four main parts: specifica- ertheless, they fail to discuss the respec- tion, ordering mechanisms and taxon- tive properties of the different algorithms. omy, fault-tolerance, and survey. More Besides, as they compare only four al- precisely, the article is structured as fol- gorithms, this work is less general than lows. Section 2 presents the specification Anceaume’s [1993b]. of the total order broadcast problem (also Friedman and van Renesse [1997] study known as atomic broadcast). Section 3 the impact of packing messages on the extends the specification by considering performance of algorithms. To this pur- the characteristics of destination groups pose, they study six algorithms, includ- (e.g., single versus multiple groups). In ing those studied by Cristian et al. [1994]. Section 4, we define five classes of total They measure the actual performance of order broadcast algorithms, according to the algorithms and confirm the observa- the way messages are ordered: commu- tions made by Cristian et al. [1994]. They nication history, privilege-based, moving show that packing several protocol mes- sequencer, fixed sequencer, and destina- sages into a single physical message in- tions agreement. Section 5 discusses sys- deed provides an effective way to improve tem model issues in to failures. the performance of algorithms. The com- Section 6 presents the main mechanisms parison also lacks generality, but this is on which total order broadcast algorithms quite understandable as this is not the rely to ensure fault-tolerance. Section 7 main concern of their paper. gives a broad survey of total order broad- Mayer [1992] defines a framework in cast algorithms found in the literature. Al- which total order broadcast algorithms gorithms are grouped along their respec- can be compared from a performance point tive classes, and we discuss their principal of view. The definition of such a framework characteristics. Section 8 discusses some is an important step toward an extensive other issues of interest that are related and meaningful comparison of algorithms. to total order broadcast. Finally, Section 9 However, the paper does not actually com- concludes the article. pare the numerous existing algorithms. Contributions.In this article, we pro- pose a classification of total order broad- 2. SPECIFICATION OF TOTAL ORDER cast algorithms based on the mechanism BROADCAST used to order messages. The reason for In this section, we give the formal specifi- this choice is that the ordering mechanism cation of the total order broadcast prob- is the characteristic with the strongest lem. As there are many variants of the influence on the communication pattern problem, we present here the simplest of the algorithm: two algorithms of the specification, and discuss other variants in same class are likely to exhibit similar be- Section 3. haviors. We define five classes of order- ing mechanisms: communication history, 2.1. Notation privilege-based, moving sequencer, fixed sequencer, and destinations agreement. Table I summarizes some of the nota- In this article, we also provide a vast tions used throughout the article. survey of about sixty published total or- is the set containing all possible validM der broadcast algorithms. Wherever pos- messages. ! denotes the set of all pro- sible, we mention the properties and the cesses in the system. Given some arbitrary

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 375

Table I. Notation Byzantine component is allowed any ar- set of all valid messages. bitrary behavior. For instance, a faulty !M set of all processes in the system. sender(m) sender of message m. process may change the content of mes- Dest(m) set of destination processes for sages, duplicate messages, send unso- message m. licited messages, or even maliciously try !sender set of all sending processes in the to break down the whole system. system. !dest set of all destination processes in the A correct process is defined as a process system. that never expresses any of the faulty be- haviors mentioned above. message m, sender(m) designates the pro- cess in ! from which m originates, and Dest(m) denotes the set of all destination 2.3. Basic Specification of Total Order processes for m. Broadcast In addition, !sender is the set of all pro- We can now give the simplest specification cesses in ! that can potentially send some of total order broadcast. Formally, the valid message. problem is defined in terms of two prim- itives, which are called TO-broadcast(m) !sender p p can send some message ={ | and TO-deliver(m), where m is m . (1) ∈ M ∈ M} some message. When a process p ex- ecutes TO-broadcast(m) (respectively Likewise, !dest is the set of all potential TO-deliver(m)), we may say that p TO- destinations of valid messages. broadcasts m (respectively TO-delivers m). We assume that every message m ! def Dest(m). (2) dest = can be uniquely identified, and carries m !∈M the identity of its sender, denoted by sender(m). In addition, we assume that, 2.2. Process Failures for any given message m, and any run, The specification of total order broad- TO-broadcast(m)is executed at most cast requires the definition of the notion once. In this context, total order broadcast of a correct process. The following set is defined by the following properties of process failure classes are commonly [Hadzilacos and Toueg 1994; Chandra considered: and Toueg 1996]:

—Crash failures. When a process crashes, (VALIDITY)Ifacorrect process TO- it ceases functioning forever. This means broadcasts a message m, then it that it stops performing any activity in- eventually TO-delivers m. cluding sending, transmitting, or receiv- (UNIFORM AGREEMENT)Ifaprocess TO- ing any message. delivers a message m, then all correct —Omission failures. When a process fails processes eventually TO-deliver m. by omission, it omits performing some (UNIFORM INTEGRITY)For any message m, actions, such as sending or receiving a every process TO-delivers m at most message. once, and only if m was previously TO- —Timing failures. A timing failure occurs broadcast by sender(m). when a process violates some of the tim- (UNIFORM TOTAL ORDER)If processes p and ing assumptions of the system model q both TO-deliver messages m and m , (details in Section 5.1). Obviously, this # then p TO-delivers m before m#, if and type of failures does not exist in asyn- only if q TO-delivers m before m . chronous system models, because of the # absence of timing assumptions in such A broadcast primitive that satisfies all systems. these properties except Uniform Total Or- —Byzantine failures. Byzantine failures der (i.e., that provides no ordering guar- are the most general type of failures. A antee) is called a reliable broadcast.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 376 X. D´efago et al.

Fig. 1.Violation of Uniform Agreement (example).

Validity and Uniform Agreement are form Agreement and Total Order are spec- liveness properties. Roughly speaking, ified as follows: this means that, at any point in time, no matter what has happened up to that (AGREEMENT) If a correct process TO- point, it is still possible for the property delivers a message m, then all correct to eventually hold [Charron-Bost et al. processes eventually TO-deliver m. 2000]. Uniform Integrity and Uniform (TOTAL ORDER) If two correct processes p Total Order are safety properties. This and q both TO-deliver messages m and means that, if, at some point in time, the m#, then p TO-delivers m before m#, if property does not hold, no matter what and only if q TO-delivers m before m#. happens later, the property cannot even- tually hold. The combinations of uniform and nonuniform properties define four differ- 2.4. Nonuniform Properties ent specifications to the problem of fault- tolerant total order broadcast. These defi- In the above definition of total order broad- nitions constitute a hierarchy of problems, cast, the properties of Agreement and To- as discussed extensively by Wilhelm and tal Order are uniform. This means that Schiper [1995]. However, for simplicity, we these properties do not only apply to cor- say that a total order broadcast algorithm rect processes, but also to faulty ones. For is uniform when it satisfies both Uniform instance, with Uniform Total Order, a pro- Agreement and Uniform Total Order, and cess is not allowed to deliver any message we say that an algorithm is nonuniform out of order, even if it is faulty. Conversely, when it enforces neither (i.e., only their (nonuniform) Total Order applies only to nonuniform counterparts). We give no spe- correct processes, and hence does not put cial name to the two hybrid definitions. any restriction on the behavior of faulty Figure 1 illustrates a violation of the processes. Uniform Agreement property with a sim- Uniform properties are strong guar- ple example. In this example, the se- antees that might make life easier for quencer p1 sends a message m, using total application developers. Not all applica- order broadcast. It first assigns a tions need uniformity, however, and en- number to m, then sends m to all pro- forcing uniformity often has a cost. For cesses, and finally, delivers m. Process p1 this reason, it is also important to consider crashes shortly afterwards, and no other weaker problems specified using nonuni- process receives m (due to message loss). form properties, though nonuniform prop- As a result, no correct process (e.g., p2) erties may lead to inconsistencies at the will ever be able to deliver m. Uniform application level. However, an application Agreement is violated, but not (nonuni- might protect itself from nonuniformity by form) Agreement: no correct process ever voting (e.g., given an application that col- delivers m (p1 is not correct). lects replies from the destinations of a to- Note 1. Byzantine failures and unifor- tal order broadcast, the application may mity. Algorithms tolerant to Byzantine vote on the replies received, and consider failures can guarantee none of the uni- a reply to be effective only after receiving form properties given in Section 2.3. the same reply from a majority). Nonuni- This is understandable as no behavior

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 377

Fig. 2. Contamination of correct processes (p1, p2), by a message (m4), based on an inconsistent state (p3 delivered m3 but not m2). can be enforced on Byzantine processes. inconsistent state, and thus contaminate In other words, nothing can prevent a correct processes [Gopal and Toueg 1991; Byzantine process from (1) delivering a Anceaume and Minet 1992; Anceaume message more than once (violates Uni- 1993b; Hadzilacos and Toueg 1994]. form Integrity), (2) delivering a message that is not delivered by other processes 2.5.1. Illustration. Figure 2 illustrates (violates Agreement), or (3) delivering an example [Charron-Bost et al. 1999; two messages in the wrong order (violates Hadzilacos and Toueg 1994] in which an Total Order). incorrect process contaminates the cor- Reiter [1994] proposes a more useful rect processes. Process p3 delivers mes- definition of uniformity for Byzantine sys- sages m1 and m3, but not m2.So, its state tems. He distinguishes between crashes is inconsistent when it multicasts m4 to and Byzantine failures. He says that a the other processes before crashing. The process is honest if it behaves according correct processes p1 and p2 deliver m4, to its specification, and corrupt otherwise thus becoming contaminated by the in- (i.e., Byzantine), where honest processes consistent state of p3.Itis important to can also fail by crashing. In this context, stress again that the situation depicted in uniform properties are those which are en- Figure 2 satisfies even the strongest spec- forced by all honest processes, regardless ification presented so far. of whether they are correct or not. This 2.5.2. Specification. It is possible to ex- definition is more sensible that the stricter tend or reformulate the specification of to- of definition of Section 2.3, as nothing is tal order broadcast in such a way that required from corrupt processes. it disallows contamination. The solution Note 2. Safety/liveness and uniformity. consists of preventing any process from de- Charron-Bost et al. [2000] have shown livering a message that may lead to an in- that, in the context of failures, some consistent state. nonuniform properties that are commonly Aguilera et al. [2000] propose a reformu- believed to be safety properties are actu- lation of Uniform Total Order which, un- ally liveness properties. They have pro- like the traditional definition, is not prone posed refinements of the concept of safety to contamination, as it does not allow gaps and liveness that avoid the counterintu- in the delivery sequence: itive classification. (GAP-FREE UNIFORM TOTAL ORDER)If some process delivers message m# after mes- 2.5. Contamination sage m, then a process delivers m# only The problem of contamination comes after it has delivered m. from the observation that, even with the As an alternative, an older formulation strongest specification (i.e., with Uniform uses the history of delivery and requires Agreement and Uniform Total Order), to- that, for any two given processes, the his- tal order broadcast does not prevent a tory of one is a prefix of the history of faulty process p from reaching an incon- the other. This is expressed by the follow- sistent state before it crashes. This is a ing property [Anceaume and Minet 1992; serious problem because p can “legally” Cristian et al. 1994; Keidar and Dolev TO-broadcast a message based on this 2000]:

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 378 X. D´efago et al.

(PREFIX ORDER)For any two processes p [1978b]. It is based on the relation and q, either hist(p)isaprefix of hist(q), “precedes”3 (denoted by ), defined or hist(q)isaprefix of hist(p), where in his seminal paper and extended−→ in a hist(p) and hist(q) are the of later paper [Lamport 1986b]. The relation messages delivered by p and q, respec- “precedes” is defined as follows. tively. Definition 1. Let ei and e j be two Note 3. The specification of total order events in a distributed system. The tran- broadcast using Prefix Order precludes sitive relation ei e j holds if any one of the dynamic join of processes (e.g., with the following three−→ conditions is satisfied: a group membership). This can be cir- cumvented, but the resulting property (1) ei and e j are two events on the same is much more complicated. For this rea- process, and ei comes before e j ; son, the simpler alternative proposed by (2) ei is the sending of a message m by one Aguilera et al. [2000] is preferred. process, and e j is the receipt of m by another process; or, Note 4. Byzantine failures and contami- (3) There exists a third event e , such that, nation. Contamination cannot be avoided k e e and e e (transitivity). in the face of arbitrary failures. This is i −→ k k −→ j because a faulty process may be inconsis- This relation defines an irreflexive par- tent even if it delivers all messages cor- tial ordering on the set of events. The rectly. It may then contaminate the other causality of messages can be defined by the processes by broadcasting a bogus mes- “precede” relationship between their re- sage that seems correct to every other pro- spective sending events. More precisely, a cess [Hadzilacos and Toueg 1994]. message m is said to precede a message m# (denoted m m ), if the sending event of m ≺ # 2.6. Other Ordering Properties precedes the sending event of m#. The property of causal order for broad- The Total Order property (see Section 2.3), cast messages is defined as follows restricts the order of message delivery [Hadzilacos and Toueg 1994]: based solely on the destinations, that is, the property is independent of the sender (CAUSAL ORDER)If the broadcast of a mes- processes. The definition can be further re- sage m causally precedes the broadcast stricted by two properties related to the of a message m#, then no correct process senders, namely, FIFO Order and Causal delivers m#, unless it has previously de- Order. livered m. Hadzilacos and Toueg [1994] also prove 2.6.1. FIFO Order. Total Order alone that the property of Causal Order is equiv- does not guarantee that messages are de- alent to combining the property of FIFO livered in the order in which they are sent Order with the following property of Local (i.e., in first-in/first-out order). Yet, this Order. property is sometimes required by applica- tions in addition to Total Order. The prop- (LOCAL ORDER)Ifaprocess broadcasts a erty is called FIFO Order: message m and a process delivers m before broadcasting m#, then no correct (FIFO ORDER)Ifacorrect process TO- process delivers m , unless it has previ- broadcasts a message m before it TO- # ously delivered m. broadcasts a message m#, then no cor- rect process delivers m#, unless it has Note 5. State-machine approach. A total previously delivered m. order broadcast ensuring causal order

2.6.2. Causal Order. The notion of 3Lamport initially called the relation “happened be- causality in the context of distributed fore” [Lamport 1978b], but he renamed it “precedes” systems was first formalized by Lamport in later work [Lamport 1986a, 1986b].

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 379 is, for instance, required by the state messages are sent within a group of pro- machine approach [Lamport 1978a; cesses. This originally came from the fact Schneider 1990]. However, we think that that early work on this topic was done in some applications may require causality, the context of parallel machines [Lamport some others not. 1978a], or highly available storage sys- tems [Cristian et al. 1995]. However, most 2.6.3. Source Ordering. Some papers distributed applications are now devel- (e.g., Garcia-Molina and Spauster [1991] oped by considering more open interaction and Jia [1995]) make a distinction be- models, such as the client-server model, tween single source and multiple source N-tier architectures, or publish/subscribe. ordering. These papers define single For this reason, it is necessary for a pro- source ordering algorithms as algorithms cess to be able to multicast messages to a that ensure total order only if a single group to which it does not belong. Conse- process broadcasts messages. This is a quently, we consider it an important char- special case of FIFO broadcast, easily acteristic of algorithms that they be easily solved using sequence numbers. Source adaptable to open interaction models. ordering is not particularly interesting in itself, and hence we do not discuss the 3.1.1. Closed Group Algorithms. In closed issue further in this article. groups algorithms, the sending process is always one of the destination processes: 3. PROPERTIES OF DESTINATION GROUPS m (sender(m) Dest(m)). (5) So far, we have presented the problem of ∀ ∈ M ∈ total order broadcast, wherein messages So, these algorithms do not allow external are sent to all processes in the system. processes (processes that are not members In other words, all valid messages are ad- of the group) to multicast messages to the dressed to the entire system: destination group. m (Dest(m) !). (3) ∀ ∈ M = 3.1.2. Open Group Algorithms. Con- A multicast primitive is more general in versely, open group algorithms allow the sense that it can send messages to any any arbitrary process in the system to chosen subset of the processes. In other multicast messages to a group, whether words, we can have two valid messages or not the sender process belongs to the sent to different destinations sets, or the destination group. More precisely, there destination set may not include the mes- are some valid messages where the sender sage sender: is not one of the destinations:

m (sender(m) Dest(m)) m (sender(m) Dest(m)). (6) ∃ ∈ M )∈ ∃ ∈ M )∈ m , m (Dest(m ) Dest(m )). ∧∃ i j ∈ M i )= j Open group algorithms are more general (4) than closed group algorithms: the former can be used with closed groups, while the Although in wide use, the distinction opposite is not true. between broadcast and multicast is not precise enough. This leads us to discuss 3.2. Single Versus Multiple Groups a more relevant distinction, namely, be- tween closed versus open groups, and be- Most algorithms presented in the litera- tween single versus multiple groups. ture assume that all messages are multi- cast to one single group of destination pro- 3.1. Closed Versus Open Groups cesses. Nevertheless, a few algorithms are designed to support multiple groups. In In the literature, many algorithms are de- this context, we consider three situations: signed with the implicit assumption that single group, multiple disjoint groups, and

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 380 X. D´efago et al. multiple overlapping groups.We also dis- intersection: cuss how useless, trivial solutions can be ruled out with the notion of minimality. m , m (Dest(m ) Dest(m ) ∃ i j ∈ M i )= j Since the ability to multicast messages to Dest(m ) Dest(m ) ). (9) multiple destination sets is critical for cer- ∧ i ∩ j )=∅ tain classes of applications, we regard this The real difficulty of designing total ability as an important characteristic of an order multicast algorithms for multiple algorithm. groups arises when the groups can over- lap. This is easily understood when one 3.2.1. Single Group Ordering. With single considers the problem of ensuring total or- group ordering, all messages are multi- der at the intersection of groups. In this cast to one single group of destination pro- context, Hadzilacos and Toueg [1994] give cesses. As mentioned above, this is the three different properties for total order model considered by a vast majority of the in the presence of multiple groups: Lo- algorithms that are studied in this article. cal Total Order, Pairwise Total Order, and Single group ordering can be defined by Global Total Order.5 the following property:4 (LOCAL TOTAL ORDER)If correct processes p and q both TO-deliver messages m mi, m j (Dest(mi) Dest(m j )). (7) and m and Dest(m) Dest(m ), then p ∀ ∈ M = # = # TO-delivers m before m#,if and only if q 3.2.2. Multiple Groups Ordering (Disjoint). TO-delivers m before m#. In some applications, the restriction to one Local Total Order is the weakest of the single destination group is not acceptable. three properties. It requires that total or- For this reason, algorithms have been pro- der be enforced only for messages that are posed that support multicasting messages multicast within the same group. to multiple groups. The simplest case oc- Note also that multiple unrelated curs when the multiple groups are disjoint groups can be considered as disjoint groups. More precisely, if two valid mes- groups even if they overlap. Indeed, des- sages have different destination sets, then tination processes belonging to the inter- these sets do not intersect: section of two groups can be seen as having two distinct identities, one for each group. mi, m j (Dest(mi) Dest(m j ) It follows that an algorithm for distinct ∀ ∈ M )= Dest(mi) Dest(m j ) ). (8) multiple groups can be trivially adapted ⇒ ∩ =∅ to support overlapping groups with Local Adapting algorithms designed for one Total Order. single group to work in a system with mul- As pointed out by Hadzilacos and Toueg tiple disjoint groups is almost trivial. [1994], the total order multicast primi- tive of the first version of Isis [Birman and Joseph 1987] guaranteed Local Total 3.2.3. Multiple Groups Ordering (Overlap- Order.6 ping). In case of multiple groups order- ing, it can happen that groups overlap. (PAIRWISE TOTAL ORDER)If two correct pro- This can be expressed by the fact that cesses p and q both TO-deliver mes- some pairs of valid messages have dif- sages m and m#, then p TO-delivers m ferent destination sets with a nonempty 5The ordering properties cited here are subject to contamination, see Section 2.5. Contamination can 4This definition and the following ones are static. be avoided by formulating these properties similarly They do not take into account the fact that processes to the Gap-free Uniform Total Order property. can join groups and leave groups. Nevertheless, we 6It should be noted that, if the transformation is triv- prefer these simple static definitions, rather than ial from a conceptual point of view, the implementa- more complex ones that would take dynamic desti- tion was certainly a totally different matter, espe- nation groups into account. cially in the mid-80’s.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 381

before m#,if and only if q TO-delivers m expressed as an I/O automaton [Lynch and before m#. Tuttle 1989; Lynch 1996] and uses the no- tion of pseudo-time to impose an order on Pairwise Total Order is strictly stronger the delivery of messages. than Local Total Order. Most notably, it requires that total order be enforced for all messages delivered at the intersection 3.2.4. Minimality and Trivial Solutions. Any of two groups. algorithm that solves the problem of total As far as we know, there is no straight- order broadcast in a single group can eas- forward algorithm to transform a total ily be adapted to solve the problem for mul- order multicast algorithm that enforces tiple groups with the following approach: Local Total Order into one that also (1) form a super-group with the union of guarantees Pairwise Total Order (except all destination groups; for trivial solutions; see Section 3.2.4). Hadzilacos and Toueg [1994] observe that, (2) whenever a message m is multicast to a for instance, Pairwise Total Order is the group, multicast it to the super-group, order property guaranteed by the al- and gorithm of Garcia-Molina and Spauster (3) processes not in Dest(m) discard m. [1989, 1991]. Pairwise Total Order alone may lead The problem with this approach is its in- to unexpected situations when there are herent lack of scalability. Indeed, in very three or more overlapping destination large distributed systems, even if the des- groups. For instance, Fekete [1993] illus- tination groups are individually small, trates the problem with the following sce- their union is likely to cover a very large nario. Consider three processes p , p , p , number of processes. i j k To avoid this sort of solution, Guerraoui and three messages m1, m2, m3 that are respectively sent to three different over- and Schiper [2001] require the implemen- lapping groups G p , p , G tation of total order multicast for multiple 1 ={i j } 2 = groups to satisfy the following minimality pj , pk , and G3 pk, pi .Pairwise To- tal{ Order} allows the={ following} histories on property: p , p , p : i j k (STRONG MINIMALITY) The execution of the algorithm implementing total order pi : TO-deliver(m3) multicast for a message m involves ··· −→ ··· TO-deliver(m1) only sender(m), and the processes in −→ ··· pj : TO-deliver(m1) Dest(m). ··· TO-deliver(−→m ) ··· −→ 2 ··· This property is often too strong: it disal- pk : TO-deliver(m2) lows many interesting algorithms that use ··· TO-deliver(−→m ) ··· −→ 3 ··· a small number of external processes for message-ordering (e.g., algorithms which This situation is prevented by the speci- disseminate messages along some propa- fication of Global Total Order [Hadzilacos gation tree). A weaker property would al- and Toueg 1994], which is defined as fol- low an algorithm to involve a small set of lows: external processes.

(GLOBAL TOTAL ORDER) The relation < is acyclic, where < is defined as follows: 3.2.5. Transformation Algorithm. Delporte- Gallet and Fauconnier [2000] propose a m < m# if and only if any correct pro- cess delivers m and m ,in that order. generic algorithm that transforms a to- # tal order broadcast algorithm for a single Note 6. Fekete [1993] gives another closed group into one for multiple groups. specification for total order multicast The algorithm splits destination groups which also prevents the scenario just men- into smaller entities and supports multi- tioned. The specification, called AMC, is ple groups with Strong Minimality.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 382 X. D´efago et al.

3.3. Dynamic Groups that belong to the primary partition are al- lowed to deliver messages, while the other The specification in Section 2 is the stan- processes must wait until they can merge dard specification of total order broadcast back with the primary partition. in a static system, that is, a system in In contrast, the partitionable group which all processes are created at system membership allows all processes to de- initialization. In practice, however, it is of- liver messages, regardless of the partition ten desirable that processes join and leave they belong to. Doing so requires adapting groups at runtime. the specification of total order broadcast. A dynamic group is a group of processes Chockler, Keidar, and Vitenberg [2001] with a membership that can change dur- define three order properties in a parti- ing the computation: processes can dy- tionable system: Strong Total Order (mes- namically join or leave the group, or can sages are delivered in the same order by be removed from the group (removal in all processes that deliver them), Weak To- the face of failures is discussed later in tal Order (the order requirement is re- Section 6.2). With a dynamic group, the stricted within a view), and Reliable To- successive memberships of the group are tal Order (extends the Strong Total Order called the views of the group [Chockler, property to require processes to deliver a Keidar, and Vitenberg 2001]. prefix of a common sequence of messages With dynamic groups, the basic com- within each view). In other words, with munication abstraction is called view syn- only slight differences, Strong Total Or- chrony, which can be seen as the coun- der corresponds to the Uniform Total Or- terpart of reliable broadcast in static sys- der property of Section 2.3, and Reliable tems. Reliable broadcast is defined by Total Order to the Prefix Ordering prop- the Validity, Agreement, and Uniform In- erty of Section 2.5. Other properties, such tegrity properties of Section 2. Roughly as Validity, are also defined differently in speaking, view synchrony adopts a sim- partitionable systems. This is explained in ilar definition, while relaxing the Agree- considerably more detail by Chockler, Kei- ment property.7 Total order broadcast in a dar, and Vitenberg [2001] and Fekete et al. system with dynamic groups can thus be [2001]. specified as view synchrony, plus a prop- erty of total order. 4. MECHANISMS FOR MESSAGE 3.4. Partitionable Groups ORDERING In a wide-area network, the network can In this section, we propose a classification temporarily become partitioned; that is, of total order broadcast algorithms in the some of the nodes can no longer com- absence of failures. The first question that municate, as all links between them are we ask is: “who builds the order?” More broken. When this happens, destination specifically, we are interested in the entity groups can be split into several isolated that generates the information necessary subgroups (or partitions). There are two for defining the order of messages (e.g., main approaches to coping with parti- timestamp or sequence number). tioned groups: (1) the primary partition We identify three different roles that membership, and (2) the partitionable a participating process can take with re- membership. spect to the algorithm: sender, destination, With the primary partition member- or sequencer. A sender process is a pro- ship, one of the partitions is recognized cess ps from which a message originates as the primary partition.8 Only processes (i.e., ps !sender). A destination process is ∈

7Discussing this primitive in detail is beyond the partition only one which retains a majority of the scope of this survey (see paper by Chockler, Keidar, processes from the previous view. This does not en- and Vitenberg [2001] for details). sure that a primary partition always exists, but it 8A simple way to do this is to recognize as primary guarantees that, if one exists, it is unique.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 383

Fig. 4.Fixed sequencer algorithms.

of classes is specific to a client-server ar- chitecture. In the remainder of this section, we present each of the five classes and il- lustrate each class with a simple algo- Fig. 3. Classes of total order broadcast algorithms. rithm. The algorithms are merely pre- sented for the purpose of illustrating the corresponding category, and should not be a process pd to which a message is sent regarded as full-fledged working exam- (i.e., pd !dest). Finally, a sequencer pro- ples. Although inspired by existing algo- ∈ cess is not necessarily a sender or a desti- rithms, they are largely simplified, and nation, but is somehow involved in the or- none of them is fault-tolerant. dering of messages. A given process may Note 7. Atomic blocks. The algorithms simultaneously take several roles (e.g., are written in pseudocode, with the as- sender and sequencer and destination). sumption that blocks associated with a However, we represent these roles sepa- when-clause are executed atomically with rately as they are conceptually different. respect to when-clauses of the same pro- According to the three different roles, cess, except when a process is blocked on we define three basic classes for total or- await statement. This assumption greatly der broadcast algorithms, depending on simplifies the expression of the algorithms whether the order is respectively built with respect to concurrency. by a sequencer, the sender, or destina- tion processes. Among algorithms of the 4.1. Fixed Sequencer same class, significant differences remain. To account for this problem, we intro- In a fixed sequencer algorithm, one pro- duce a further division, leading to five cess is elected as the sequencer and is re- subclasses in total. These classes are sponsible for ordering messages. The se- named as follows (see Figure 3): fixed quencer is unique, and the responsibility sequencer, moving sequencer, privilege- is not normally transfered to another pro- based, communication history, and des- cesses (at least in the absence of failure). tinations agreement. Privilege-based and The approach is illustrated in Figure 4 moving sequencer algorithms are com- and Figure 5. One specific process takes monly referred to as token-based algo- the role of a sequencer and builds the total rithms. order. To broadcast a message m,asender The terminology defined in this arti- sends m to the sequencer. Upon receiv- cleis partly borrowed from other au- ing m, the sequencer assigns it a sequence thors. For instance, “communication his- number and relays m with its sequence tory” and “fixed sequencer” were proposed number to the destinations. The latter by Cristian and Mishra [1995]. The term then deliver messages according to the se- “privilege-based” was suggested by Dahlia quence numbers. This algorithm does not Malkhi in a private discussion. Finally, tolerate the failure of the sequencer. Le Lann and Bres [1991] group algorithms In fact, three variants of fixed se- into three classes, based on where the or- quencer algorithms exist. We call these der is built. Unfortunately, their definition three variants “UB” (unicast-broadcast),

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 384 X. D´efago et al.

Fig. 5. Simple fixed sequencer algorithm.

Fig. 6. Common variants of fixed sequencer algorithms.

“BB” (broadcast-broadcast), and “UUB” messages than the previous approach, ex- (unicast-unicast-broadcast), taking inspi- cept in broadcast networks. However, it ration from Kaashoek and Tanenbaum can reduce the load on the sequencer, and [1996]. makes it easier to tolerate the crash of the In the first variant, called “UB” (see sequencer. Isis (sequencer) [Birman et al. Figure 6(a)), the protocol consists of a uni- 1991] is an example of the second variant. cast to the sequencer, followed by a broad- The third variant, called “UUB” cast from the sequencer. This variant gen- (Figure 6(c)), is less common than the erates few messages, and it is the simplest others. In short, the protocol consists of of the three approaches. It is, for instance, the following steps. The sender requests adopted by Navaratnam et al. [1988], and a sequence number from the sequencer corresponds to the algorithm in Figure 5. (unicast). The sequencer replies with a In the second variant, called “BB” sequence number (unicast). Then, the (Figure 6(b)), the protocol consists of a sender broadcasts the sequenced message broadcast to all destinations plus the se- to the destination processes.9 quencer, followed by a second broadcast from the sequencer. This generates more 9The protocol to tolerate failures is complex.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 385

thus avoiding the bottleneck caused by a single process. This is illustrated in sev- eral studies (e.g., Cristian et al. [1994] and Urban´ et al. [2000]). One could then wonder why a fixed sequencer algorithm should be preferred to a moving sequencer algorithm. There are, in fact, at least three possible reasons. First, fixed sequencer al- Fig. 7. Moving sequencer algorithms. gorithms are considerably simpler, leaving 4.2. Moving Sequencer less room for implementation errors. Sec- ond, the latency of fixed sequencer algo- Moving sequencer algorithms are based on rithms is often better, as shown by Urban´ the same principle as fixed sequencer al- et al. [2000]. Third, it is often the case gorithms, but allow the role of sequencer that some machines are more reliable, to be transferred between several pro- more trusted, better connected, or simply cesses. The motivation is to distribute the faster than others. When this is the case, it load among them. This is illustrated in makes sense to use one of them as a fixed Figure 7, where the sequencer is cho- sequencer (see MTP in Section 7.1.2). sen among several processes. The code executed by each process is, however, 4.3. Privilege-Based more complex than with a fixed sequencer, which explains the popularity of the lat- Privilege-based algorithms rely on the ter approach. Notice that with moving se- idea that senders can broadcast mes- quencer algorithms, the roles of sequencer sages only when they are granted the and destination processes are normally privilege to do so. Figure 9 illustrates combined. this class of algorithms. The order is de- Figure 8 shows the principle of mov- fined by the senders when they broadcast ing sequencer algorithms. To broadcast their messages. The privilege to broad- a message m,asender sends m to the cast (and order) messages is granted to sequencers. Sequencers circulate a token only one process at a time, but this priv- message that carries a sequence num- ilege circulates from process to process, ber and a list of all messages to which among the senders. In other words, due a sequence number has been attributed to the arbitration between senders, build- (i.e., all sequenced messages). Upon re- ing the total order requires solving the ceipt of the token, a sequencer assigns problem of FIFO broadcast (easily solved a sequence number to all received, but with sequence numbers at the sender), yet unsequenced, messages. It sends the and ensuring that passing the privilege newly sequenced messages to the destina- to the next sender does not violate this tions, updates the token, and passes it to order. the next sequencer. Figure 10 illustrates the principle Note 8. Similar to fixed sequencer algo- of privilege-based algorithms. Senders rithms, it is possible to develop a moving circulate a token message that carries a sequencer algorithm according to one of sequence number to be used when broad- three variants. However, the difference be- casting the next message. When a pro- tween the variants is not as clear cut as it cess wants to broadcast a message m, it is for a fixed sequencer. It turns out that all must first wait until it receives the to- of the moving sequencer algorithms sur- ken message. Then, it assigns a sequence veyed follow the equivalent of the fixed se- number to each of its messages and sends quencer variant BB. Hence, we do not dis- them to all destinations. Following this, cuss this issue any further. the sender updates the token and sends it Note 9. As mentioned, the main motiva- to the next sender. Destination processes tion for using a moving sequencer is to dis- deliver messages in increasing sequence tribute the load among several processes, numbers.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 386 X. D´efago et al.

Fig. 8. Simple moving sequencer algorithm.

Note 10. In privilege-based algorithms, senders usually need to know each other in order to circulate the privilege. This con- straint makes privilege-based algorithms poorly suited to open groups, where there is no fixed and previously known set of senders. Note 11. In synchronous systems, Fig. 9. privilege-based algorithms. privilege-based algorithms are based on the idea that each sender process is allowed to send messages only during predetermined time slots. These time on a token passing mechanism. However, slots are attributed to each process in they differ in one significant aspect: the such a way that no two processes can send total order is built by senders in privilege- messages at the same time. By ensur- based algorithms, while it is built by se- ing that the communication medium is quencers in moving sequencer algorithms. accessed in mutual exclusion, the total This has at least two major consequences. order is easily guaranteed. The technique First, moving sequencer algorithms are is also known as time division multiple easily adapted to open groups. Second, in access (TDMA). privilege-based algorithms, the passing of Note 12. It is tempting to consider that the token is necessary to ensure the live- privilege-based and moving sequencer al- ness of the algorithm, while with moving gorithms are equivalent, since both rely sequencer algorithms, it is mostly used for

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 387

Fig. 10. Simple privilege-based algorithm. improving performance, for example, by ical) timestamp. The destinations observe load balancing. the messages generated by the other pro- Note 13. It is difficult to ensure fairness cesses and their timestamps, that is, the with privilege-based algorithms. Indeed, history of communication in the system, if a process has a very large number of to learn when delivering a message will messages to broadcast, it could keep the no longer violate the total order. token for an arbitrarily long time, thus There are two fundamentally differ- preventing other processes from broad- ent variants of communication history casting their own messages. To overcome algorithms. In the first variant, called this problem, algorithms often enforce an causal history, communication history al- upper limit on the number of messages gorithms use a partial order, defined by and/or the time that some process can keep the causal history of messages, and trans- the token. Once the limit is passed, the form this partial order into a total order. process is compelled to release the token, Concurrent messages are ordered accord- regardless of the number of messages re- ing to some predetermined function. In maining to be broadcast. the second variant, known as determinis- tic merge, processes send messages times- tamped independently (thus not reflecting 4.4. Communication History causal order), and delivery takes place ac- In communication history algorithms, as cording to a deterministic policy of merg- in privilege-based algorithms, the delivery ing the streams of messages coming from order is determined by the senders. How- each process. ever, in contrast to privilege-based algo- Figure 11 illustrates a typical communi- rithms, processes can broadcast messages cation history algorithm of the first vari- at any time, and total order is ensured ant. The algorithm, inspired by Lamport by delaying the delivery of messages. The [1978b], works as follows. The algorithm messages usually carry a (physical or log- uses logical clocks [Lamport 1978b] to

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 388 X. D´efago et al.

Fig. 11. Simple communication history algorithm (causal history).

Fig. 12. Simple communication history algorithm (deterministic merge).

“timestamp” each message m with the log- Note 14. The algorithms of Figure 11 ical time of the TO-broadcast(m) event, de- and Figure 12 are not live. Indeed, con- noted ts(m). Messages are then delivered sider the algorithm of Figure 11 and a in the order of their timestamps. However, scenario where a single process p broad- we can have two messages, m and m#, with casts a single message m, while no other the same timestamp. To arbitrate between process ever broadcasts any message. Ac- these messages, the algorithm uses the cording to the algorithm in Figure 11, a lexicographical order on the identifiers of process q can deliver m only after it has sending processes. In Figure 11, we refer received, from every process, a message to this order as the (ts(m), sender(m)) or- that was broadcast after the reception der, where sender(m)is the identifier of the of m. This is, of course, impossible if at sender process. least one of the processes never broad- A simple example of the second variant casts any message. To overcome this prob- is illustrated in Figure 12. The algorithm lem, communication history algorithms assumes that communication is FIFO, and proposed in the literature usually send that sender processes broadcast messages empty messages when no application mes- at the same rate. Destination processes ex- sages are broadcast. ecute an infinite loop where they accept, Note 15. In synchronous systems, com- in a round-robin fashion, a single message munication history algorithms rely on from each sender process. Aguilera and synchronized clocks, and use physical Strom [2000] (Section 7.4.9), for instance, timestamps (timestamps coming from the propose a more elaborate algorithm based synchronized clocks), instead of logical on the same principle. ones. The nature of such systems makes

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 389

The most representative algorithm of the second variant of agreement is the al- gorithm proposed by Chandra and Toueg [1996] (Section 7.5.4). The algorithm transforms total order broadcast into a se- quence of consensus problems.10 Each in- stance of the consensus decides on a set Fig. 13. Destinations agreement algorithms. of messages to deliver, that is, consensus number k allows the processes to agree k it unnecessary to send empty messages in on a set Msg of messages. For k < k#, order to ensure liveness. Indeed, this can the messages in Msgk are delivered before be seen as an example of the use of time to the messages in Msgk# . The messages in a communicate [Lamport 1984]. set Msgk are delivered according to some predetermined order (e.g., in lexical order 4.5. Destinations Agreement of their identifiers). In destinations agreement algorithms, as With the third variant of agreement, a the name indicates, the delivery order re- tentative message delivery order is first sults from an agreement between des- proposed (usually by one of the desti- tination processes (see Figure 13). We nations). Then, the destination processes distinguish three different variants of must agree to either accept or reject the agreement: (1) agreement on a message proposal. In other words, this variant sequence number, (2) agreement on a mes- of destinations agreement relies on an sage set, or (3) agreement on the accep- atomic commitment protocol. The algo- tance of a proposed message order. rithm proposed by Luan and Gligor [1990] Figure 14 illustrates an algorithm of typically belongs to the third variant. the first variant: for each message, the Note 16. There is a thin line between destination processes reach an agreement the second and the third variants of agree- on a unique (yet not consecutive) se- ment. For instance, Chandra and Toueg’s quence number. The algorithm is adapted [1996] total order broadcast algorithm re- from Skeen’s algorithm (Section 7.5.1), al- lies on consensus, as described. However, though it operates in a decentralized man- when it is combined with the rotating co- ner. Briefly, the algorithm works as fol- ordinator consensus algorithm [Chandra lows. To broadcast a message m,asender and Toueg 1996], the resulting algorithm sends m to all destinations. Upon receiv- can be seen as an algorithm of the third ing m,adestination assigns it a local form. Indeed, the coordinator proposes a timestamp and sends this timestamp to tentative order (given as a set of message all destinations. Once a destination pro- plus message identifiers) that it tries to cess has received a local timestamp for validate. Thus it is important to note that m from all destinations, a unique global two seemingly identical algorithms may timestamp sn(m)is assigned to m, calcu- use different forms of agreement, simply lated as the maximum of all local times- because they are described at different lev- tamps. Messages are delivered in the order els of abstraction. of their global timestamp, that is, a mes- sage m can only be delivered once it has 4.6. Time-Free Versus Time-Based Ordering been assigned its global timestamp sn(m), and no other undelivered message m# We introduce a further distinction be- can possibly receive a timestamp sn(m#) tween algorithms, orthogonal to the above smaller or equal to sn(m). As with the com- munication history algorithm (Figure 11), 10The consensus problem is informally defined as the identifier of the message sender is used follows: every process proposes some value, and all to break ties between messages with the processes must eventually decide on the same value, same global timestamp. which must be one (any one) of the proposed values.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 390 X. D´efago et al.

Fig. 14. Simple destinations agreement algorithm. classification. The distinction is between 5.1. Synchrony and Timeliness algorithms that use physical time for mes- The synchrony of a system defines the sage ordering, and algorithms that do timing assumptions that are made on the not use physical time. For instance, in behavior of processes and communication Section 4.4 (see Figure 11), we presented a channels. More specifically, one usually simple communication-history algorithm considers two major parameters. The first based on logical time. It is indeed pos- parameter is the process speed inter- sible to design a similar algorithm that val, which is given by the difference be- uses the physical time (and synchronized tween the speed of the slowest and the clocks) instead. fastest processes in the system. The sec- In short, we distinguish algorithms with ond parameter is the communication de- time-based ordering, that rely on physical lay, which is given by the time elapsed time, and algorithms with time-free order- between the sending and the receipt of ing that do not use physical time. messages. The synchrony of the system is defined by considering various bounds on 5. CONCEPTUAL ISSUES RELATED TO these two parameters. FAILURES A system where both parameters have a known upper bound is called a syn- In Section 4, we discussed ordering mecha- chronous system.At the other extreme, nisms, ignoring the problem of fail- a system in which process speed and ures. Mechanisms for fault-tolerance are communication delays are unbounded is discussed below in Section 6. How- called an asynchronous system. Between ever, fault-tolerance cannot be discussed those two extremes lie the definition without some prior discussion on sys- of various partially synchronous system tem model issues. This is done in this models [Dolev et al. 1987; Dwork et al. section. 1988].

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 391

A third model that is considered by sev- for instance, whether a given process has eral total order broadcast algorithms is crashed or not. the timed asynchronous model defined by The notion of failure detectors has been Cristian and Fetzer [1999]. In its most formalized by Chandra and Toueg [1996]. simple form, this model can be seen as Briefly, a failure detector is modeled as a an asynchronous model with the notion set of distributed modules, one module FDi of physical time and an assumption that attached to each process pi. Any process pi “most messages are likely to reach their can query its failure detector module FDi destination within a known delay δ” [Cris- about the status of other processes. tian et al. 1997; Cristian and Fetzer 1999]. Failure detectors may be unreliable, in the sense that they provide information 5.2. Impossibility Results that may not always correspond to the real state of the system. For instance, a failure There is an important theoretical result detector module FDi may provide the er- related to the consensus problem (see roneous information that some process pj Footnote 10). It has been proven that there has crashed while, in reality, pj is correct is no deterministic solution to the prob- and running. Conversely, FDi may provide lem of consensus in asynchronous systems the information that a process pk is cor- if just a single process can crash [Fischer rect, while pk has actually crashed. et al. 1985]. Dolev et al. [1987] have shown To reflect the unreliability of the infor- that total order broadcast can be trans- mation provided by failure detectors, we formed into consensus, thus proving that say that a process pi suspects some pro- the impossibility of consensus also holds cess pj whenever FDi, the failure detector for total order broadcast. These impossi- module attached to pi, returns the (unreli- bility results were the motivation to ex- able) information that pj has crashed. In tend the asynchronous system with the other words, a suspicion is a belief (e.g., “pi introduction of oracles to make consensus believes that pj has crashed”) as opposed and total order broadcast solvable.11 to a known fact (e.g., “pj has crashed and pi knows that”). 5.3. Oracles There exist several classes of failure de- tectors, depending on how unreliable the In short, a (distributed) oracle can be seen information provided by the failure de- as some component that processes can tector can be. Classes are defined by two query. An oracle provides information that properties, called completeness and accu- algorithms can use to guide their choices. racy, that constrain the range of possi- The oracles most frequently considered in ble mistakes. In this article, we consider distributed systems are failure detectors four different classes of failure detectors, and coin flips. Since the information pro- called (perfect), (eventually perfect), vided by these oracles make consensus (strong),P and .P (eventually strong). and total order broadcast solvable, they TheS four classes share.S the same property augment the power of the asynchronous of completeness, and only differ by their system model. accuracy property [Chandra and Toueg 1996]: 5.3.1. Failure Detectors. A failure detec- tor is an oracle that provides informa- (STRONG COMPLETENESS) Eventually every tion about the current status of processes, faulty process is permanently sus- pected by all correct processes.

11Chandra and Toueg [1996] show that consensus (STRONG ACCURACY)No process is sus- can be transformed into total order broadcast. The pected before it crashes. [class ] P result holds also for arbitrary failures. So, consensus (EVENTUAL STRONG ACCURACY) There is a and total order broadcast are equivalent problems, that is, if there exists an algorithm that solves one time after which correct processes are problem, then it can be transformed into an algo- not suspected by any correct process. rithm that solves the other problem. [class ] .P

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 392 X. D´efago et al.

(WEAK ACCURACY) Some process is never with , , or , but does not solve suspected. [class ] uniform.P totalS order.S broadcast. Using the S (EVENTUAL WEAK ACCURACY) There is a transformation of total order broadcast time after which some correct process to consensus (see Section 5.2), this algo- is never suspected by any correct rithm could be used to obtain an algo- process. [class ] rithm that solves nonuniform consensus, .S but not consensus. This is in contradiction A failure detector of class with a .S to Guerraoui [1995]. Hence, enforcing uni- majority of correct processes allows us formity has no additional cost in the asyn- to solve consensus [Chandra and Toueg chronous models with , , and fail- 1996]. Moreover, Chandra et al. [1996] ure detectors. .P S .S have shown that a failure detector of Note however that the result does not class is the weakest failure detector hold for total order broadcast algorithms .S 12 that allows us to solve consensus. that rely on a perfect ( ), or almost perfect failure detector (see SectionP 5.5). 5.3.2. Random Oracle. Another ap- proach to extend the power of the asyn- 5.5. Process Controlled Crash chronous system model is to introduce the ability to generate random values. Process controlled crash is the ability For instance, processes could have access given to processes to kill other processes or to a module that generates a random bit to commit suicide. In other words, this is when queried (i.e., a Bernoulli random the ability to artificially force the crash of a variable). process. Allowing process controlled crash This approach is used by a class of al- in a system model augments its power. In- gorithms called randomized algorithms. deed, this makes it possible to transform These algorithms can solve problems such severe failures (e.g., omission, Byzantine) as consensus (and so total order broadcast) into less severe failures (e.g., crash), and in a probabilistic manner. The probabil- to emulate an “almost perfect” failure de- ity that such algorithms terminate before tector. However, this power does not come some time t, goes to one, as t goes to in- without a price. finity (e.g., Ben-Or [1983] and Chor and Automatic transformation of failures. Dwork [1989]). Note that solving a prob- Neiger and Toueg [1990] present a tech- lem deterministically and solving it with nique that uses process controlled crash probability 1 are not the same. to transform severe failures (e.g., omis- sion, Byzantine) into less severe ones (i.e., 5.4. Uniformity for Free crash failures). In short, the technique is based on the idea that processes have In Section 2, we explained the difference their behavior monitored. Then, whenever between uniform and nonuniform specifi- a process begins to behave incorrectly (e.g., cations. Guerraoui [1995] shows that any omission, Byzantine), it is killed.13 algorithm that solves Consensus with .P However, this technique cannot be used ( , , respectively), also solves Uniform in systems with lossy channels, or those ConsensusS .S with ( , , respectively). .P S .S subject to partitions. Indeed, in such con- It is easy to show that this result also texts, processes might end up killing each holds for total order broadcast. Assume other until not a single one is left alive in that there exists an algorithm that solves the system. nonuniform total order broadcast (nonuni- Emulation of an almost perfect failure form Agreement, nonuniform Total Order) detector. A perfect failure detector ( ) sat- isfies both strong completeness andP strong 12The weakest failure detector to solve consensus is accuracy (no process is suspected before usually said to be , which differs from by satis- fying a weak completeness.W property instead.S of Strong Completeness. However, Chandra and Toueg [1996] 13The actual technique is more complex than what is prove the equivalence of and . described here, but this gives the basic idea. .S .W

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 393 it crashes [Chandra and Toueg 1996]). In in terms of two properties: accuracy and practical systems, perfect failure detectors completeness (see Section 5.3.1). Com- are extremely hard to implement because pleteness prevents the blocking problem of the difficulty in distinguishing crashed just mentioned. Accuracy prevents algo- processes from very slow ones. The idea of rithms from running forever without solv- the emulation is simple: whenever a fail- ing the problem. ure detector suspects a process p, then p is Unreliable failure detectors might be killed (forced to crash). Fetzer [2003] pro- too weak for some total order broadcast poses a different emulation, based on reli- algorithms which require reliable failure able watchdogs, to ensure that no process detection information provided by a per- is suspected before it crashes. fect failure detector, known as (see Cost of a free lunch. Process controlled Section 5.5). P crash has a price. A fault-tolerant algo- rithm can only tolerate the crash of a 6.2. Group Membership Service bounded number of processes. In a sys- tem with process controlled crash, this The low-level failure detection mechanism limit includes not only genuine failures, is not the only way to address the block- but also failures provoked through process ing problem mentioned in the previous controlled crash. This means that each section. Blocking can also be prevented provoked failure effectively decreases the by relying on a higher-level mechanism, number of genuine failures that can be tol- namely a group membership service. erated, thus degrading the actual fault- A group membership service is a dis- tolerance of the system. tributed service that is responsible for managing the membership of groups of processes (see Section 3.4 and survey by 6. MECHANISMS FOR FAULT-TOLERANCE Chockler, Keidar, and Vitenberg [2001]). The total order broadcast algorithms de- The successive memberships of a group scribed in Section 4 are not tolerant to are called the views of the group. When- failures: if a single process crashes, the ever the membership changes, the service properties specified in Section 2.3 are not reports change to all group members by satisfied. To be fault-tolerant, total or- providing them with the new view. der broadcast algorithms rely on vari- A group membership service usually ous techniques presented in this section. provides strong completeness: if a pro- Note that it is difficult to discuss these cess p member of some group G crashes, techniques without getting into specific the membership service provides to the implementation details. Nevertheless, we surviving members of G a new view from try to keep the discussion as general which p is excluded. In the primary- as possible. Notice also that algorithms partition model (see Section 3.4), the ac- may actually combine several of these curacy of failure notifications is ensured techniques, for example, failure detection by forcing the crash of processes that (Section 6.1) with resilient communica- have been incorrectly suspected and ex- tion patterns (Section 6.3). cluded from the membership, a mecha- nism called process-controlled crash (see 6.1. Failure Detection Section 5.5). Moreover in the primary-partition A recurrent pattern in all distributed al- model, the group membership service gorithms is for a process p to wait for a provides consistent notifications to the message from some other process q. If q group members: the successive views of a crashes, process p is blocked. Failure de- group are notified in the same order to all tection is one basic mechanism to prevent of its members. p from being blocked. To summarize, while failure detectors Unreliable failure detection has been provide inconsistent failure notifications, formalized by Chandra and Toueg [1996] a group membership service provides

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 394 X. D´efago et al. consistent failure notifications. Moreover, sage m is said to be k-stable, if m has total order algorithms that rely on a group been received by k processes. In a system membership service for fault-tolerance, in which at most f processes may crash, exploit another property that is usu- f 1-stability is the important property to ally provided along with the member- detect.+ If some message m is f 1-stable, ship service, namely view synchrony (see then m is received by at least one+ correct Section 3.3). Roughly speaking, view syn- process. With such a guarantee, an algo- chrony ensures that between two suc- rithm can easily ensure that m is even- cessive views v and v#, processes in the tually received by all correct processes. intersection v v# deliver the same set f 1-stability is often simply called stabil- of messages. Group∩ membership service ity+. The detection of stability is generally and view synchrony have been used to based on some acknowledgment scheme or implement complex group communica- token passing. tion systems (e.g., Isis [Birman and van Another use for message stability is the Renesse 1994], Totem [Moser et al. 1996], reclaiming of resources. Indeed, when a Transis [Dolev and Malkhi 1994, 1996; process detects that a message has be- Amir et al. 1992], Phoenix [Malloth et al. come stable throughout the system, it 1995; Malloth 1996]). can release resources associated with that message. 6.3. Resilient Communication Patterns 6.5. Consensus As shown in the previous sections, an algo- rithm can rely on a failure detection mech- The mechanisms described so far are low- anism, or on a group membership service, level mechanisms on which fault-tolerant to avoid the blocking problem. To be fault- total broadcast algorithms may rely. tolerant, another solution is to avoid any Another option for a fault-tolerant to- potential blocking pattern. tal order broadcast algorithm is to rely Consider, for example, a process p wait- on higher-level mechanisms that solve all ing for n f messages, where n is the num- the problems related to fault-tolerance ber of processes− in the system, and f the (i.e., the problems previously mentioned). maximum number of processes that may The consensus problem (see Footnote 10) crash. If all correct processes send a mes- is such a mechanism. Some algorithms sage to p, then the above pattern is non- solve total order broadcast by reducing blocking. We call such a pattern a resilient it to a consensus problem. This way, pattern. If an algorithm uses only resilient fault-tolerance, including failure detection patterns, it avoids the blocking problem and message stability detection, is hidden without using any failure detector mecha- within the consensus abstraction. nism or group membership service. Such algorithms have, for instance, been pro- 6.6. Mechanisms for Lossy Channels posed by Rabin [1983], Ben-Or [1983], and Pedone et al. [2002] (the first two are con- Apart from the mechanisms used to tol- sensus algorithms, see Footnote 10). erate process crashes, we need to say a few words about mechanisms to tolerate 6.4. Message Stability channel failures. First, it should be men- tioned that several total order broadcast Avoiding blocking is not the only problem algorithms avoid the issue by relying on that fault-tolerant total order broadcasts some communication layer that takes care algorithms have to address. Figure 1 il- of message loss (i.e., these algorithms as- lustrates a violation of the Uniform Agree- sume reliable channels, and hence do not ment property. Notice that this problem is discuss message loss). In contrast, other unrelated to blocking. algorithms are built directly on top of lossy The mechanism that solves the prob- channels, and so address message loss lem is called message stability.Ames- explicitly.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 395

To address message loss, the stan- Table II. Abbreviations Used in Tables III–V dard solution is to rely on a positive or yes ! somewhat explained in the a negative acknowledgment mechanism. / text With positive acknowledgment, the re- no ceipt of messages is acknowledged; with spec.× special explained in the negative acknowledgment, the detection text inf. informal explained in the of a missing message is signaled. The two text schemes can be combined. NS not specified means also “not Token-based algorithms (i.e., moving discussed” sequencer or privilege-based algorithms) n/a not applicable rely on token passing to detect message a positive acknowledgment +a negative acknowledgment losses: the token can be used to convey ac- GM− group membership knowledgments, or to detect missing mes- FD failure detector/detection sages. So token-based algorithms use the Cons. consensus token for ordering purpose, but also for im- RCP resilient communication patterns plementing reliable channels. ByzA. Byzantine agreement 7. SURVEY OF EXISTING ALGORITHMS whether the mechanism is time-based or This section provides an extensive survey not (Section 4.6). of total order broadcast algorithms. We (2) The General information rows are present about sixty algorithms published followed by rows describing the assump- in scientific journals or conference pro- tions upon which the algorithm is based, ceedings over the past three decades. We that is, what is provided to it: have made every possible effort to be ex- haustive, and we are quite confident that (a) The System model rows specify the this article presents a good picture of the synchrony assumptions, the assump- field at the time of writing. However, be- tions made about process failures cause of the continuous flow of papers on (rows: crash, omission, Byzantine), and the subject, we might have overlooked one communication channels (rows: reli- algorithm or two. able, FIFO). Reliable channels guaran- In Tables III–V, we present a condensed tee that if a correct process p sends a overview of all surveyed algorithms, in message m to a correct process q, then which we summarize the important char- q will eventually receive m [Aguilera acteristics of each algorithm. The Tables et al. 1999]. The row partitionable in- present only factual information about the dicates whether or not the algorithm algorithms as it appears in the relevant works with partitionable membership papers. In particular, the Tables do not semantics (see Section 3.4). In particu- present information that is the result of lar, algorithms in which only processes extrapolation, or nonobvious deduction; in a primary partition can work are not the exception is when we had to inter- considered partitionable. pret information to overcome differences (b) The rows called Condition for liveness in terminology. Also, properties that are discuss the assumptions necessary to discussed in the original paper, yet not ensure the liveness of the algorithm: proved correct, are reported as “informal” —The row live...X means that the in the Tables. For the sake of conciseness, liveness of the algorithm requires several symbols and abbreviations have the liveness of the building block X been used throughout the Tables. They are (on which the algorithm relies). explained in Table II. For each algorithm, For example, live...GM means Tables III–V provide the following infor- that the algorithm is live, if the mation: group membership building block on (1) General information, that is, the or- which the algorithm relies, is itself dering mechanism (see Section 4), and live.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 396 X. D´efago et al. ! ! ! NS NS NS NS n/a n/a GM 7.3.8 MARS § ! ! ! ! ! ! n/a GM GM 7.3.7 synchronous RTCAST § synch. clocks oueg ! ! ! ! ! ! ! ! n/a n/a Gopal T 7.3.6 Cons. § (spec.) synch. a − / ! ! ! ! ! a, TPM NS GM 7.3.5 § + × × a / / ! ! ! ! ! ! otem − NS GM 7.3.4 T ! ! § privilege-based ! ! ! ! ! FD n/a 7.3.3 oken-FD spec. § asynchronous T ! NS inf. inf. Train GM GM 7.3.2 § a − ! ! On- a, NS inf. inf. GM 7.3.1 demand § + timed a × ! ! Pin- − NS inf. wheel GM 7.2.4 § asynchronous a − ! ! ! DTP a, inf. GM GM 7.2.3 § + a − ! ! ! ! a, NS RMP inf. inf. GM 7.2.2 § + a moving sequencer − ! a, NS inf. inf. GM 7.2.1 Chang Maxem. § + ! ! ! ! ! ! ! ! ! ! n/a NS GM 7.1.9 Rampart § × × / / ! ! ! ! ! n/a GM GM 7.1.8 ! ! Phoenix § Overview of Total Order Broadcast Algorithms (Part I) ! ! n/a NS et al. inf. inf. GM 7.1.7 Navarat. § III. e ! ! ! ! ! ! ! Isis n/a inf. inf. bl (seq.) GM GM 7.1.6 § asynchronous Ta a × / / Jia ! ! ! ! − NS NS GM 7.1.5 § a fixed sequencer × × recovery ! ! ! ! ! − 7.1.4 Spauster § block. Garcia-M. a ! + NS inf. inf. andem GM 7.1.3 T § a × / ! ! − MTP inf. 7.1.2 spec. § a − ! a, NS inf. inf. GM 7.1.1 Amoeba § + otal Order ault tolerance mechanism lass multiple open process comm. other Agreement Unif. A T Unif. TO FIFO order causal ord. FIFO live ... other view sync. reliable b. causal b. consensus synchrony crash omission Byzantine partitionable reliable Algorithm c time-based Destination groups F Properties ensured Condition needed for liveness Building blocks System model Ordering mechanism

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 397 ! ! ! ! NS n/a NS inf. inf. 7.6.4 in WAN Opt. TO § ! ! ! ! ! ! ! ! n/a 7.6.3 Indulg. unif. TO Cons. Cons. § hybrid ! ! ! ! n/a NS inf. inf. GM 7.6.2 Hybrid § asynchronous × × ! ! ! ! ! ! ! ! ! ! n/a NS GM 7.6.1 (asym.) Newtop § × × ! ! ! ! QoS n/a n/a n/a GM preserv. 7.4.15 § / ! ! ! ! ! ! ! ! n/a n/a Atom sync. clocks GM synchronous 7.4.14 § a × ! ! ! ! ABP + n/a GM sync. 7.4.13 § × × / / ! ! ! ! ! ! n/a n/a n/a n/a ! ! spec. Quick-S 7.4.12 ByzA. ByzA. § × × ! ! ! ! han. n/a c RCP spec. 7.4.11 Redund. § × / × × ! ! ! ! ! HAS n/a sync. clocks ! synchronous RCP 7.4.10 flood. § ! ! ! ! ! ! n/a n/a n/a n/a 7.4.9 merge Determ. § ! ! ! ! ! ! ! ! ! ! n/a GM 7.4.8 COReL spec. spec. § × ! ! ! ! ! communication history n/a n/a inf. GM GM ATOP 7.4.7 § Overview of Total Order Broadcast Algorithms (Part II) a × × − / / otal / ! ! ! ! a, NS n/a T 7.4.6 ! ! RCP § + × × / / Table IV. ! ! ! ! ! ! ! ! ! ToTo n/a NS GM 7.4.5 ! ! § asynchronous Ng ! ! ! ! FD n/a NS inf. 7.4.4 § × × ! ! ! ! ! ! ! ! ! ! n/a NS GM (sym.) 7.4.3 Newtop § a / / ! ! ! + FD NS inf. Psync 7.4.2 § ! ! ! ! n/a n/a n/a n/a inf. inf. 7.4.1 Lamport § otal Order ault tolerance mechanism lass multiple open process comm. other Agreement Unif. A T Unif. TO FIFO order causal ord. FIFO live ... other view sync. reliable b. causal b. consensus synchrony crash omission Byzantine partitionable reliable Algorithm c time-based Destination groups F Properties ensured Condition needed for liveness Building blocks System model Ordering mechanism

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 398 X. D´efago et al. × ! ! ! ! ! ! ! AMp n/a GM xAMp spec. sync. 7.5.15 § × × / / ! ! ! ! ! ! n/a n/a n/a ! ! 7.5.14 Quick-A Cons. ByzA. ByzA. § / ! ! ! ! ! ! n/a weak order. RCP spec. 7.5.13 § / ! ! ! ! ! ! ! ! ! n/a n/a thrifty generic spec. 7.5.12 § ! ! ! ! ! ! ! ! n/a bcast generic 7.5.11 Cons. Cons. § ! ! ! ! ! ! ! n/a prefix spec. agreem. 7.5.10 Cons. Cons. § × × ! ! ! ! ! ! ! n/a 7.5.9 optim. ABcast Cons. Cons. § ! ! ! ! ! ! ! ! ! ritzke n/a et al. 7.5.8 F Cons. Cons. § asynchronous destinations agreement ! ! ! ! ! ! ! ! ! ! Scal- n/a atom 7.5.7 Cons. Cons. § P ! ! ! ! ! ! ! ATR . n/a GM 7.5.6 § ! ! ! ! ! 7.5.5 Overview of Total Order Broadcast Algorithms (Part III). Raynal spec. Cons. Cons. § Rodrigues gossip. × × oueg ! ! ! ! ! ! n/a T 7.5.4 Chandra Cons. Cons. § Table V. × × ! ! ! Bres NS n/a 7.5.3 RCP RCP Le Lann § a ! − FD NS inf. inf. Luan 7.5.2 Gligor § ! ! ! ! n/a NS inf. inf. GM Skeen 7.5.1 § otal Order ault tolerance mechanism lass multiple open process comm. other Agreement Unif. A T Unif. TO FIFO order causal ord. live ... other view sync. reliable b. causal b. consensus time-based synchrony crash omission Byzantine partitionable reliable FIFO Algorithm c F Properties ensured Destination groups Condition needed for liveness Building blocks System model General

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 399

—The row other adds the following in- We first discuss Agreement and formation: NS not specified means Uniform Agreement, then Total Order that liveness is= not discussed in the and Uniform Total Order. Finally, article; n/a not applicable means we mention whether the algorithm that no additional= assumption is additionally ensures FIFO order or needed to ensure liveness (this ap- causal order. In all these entries, one plies mostly to algorithms that as- would expect either a yes or a no. sume a synchronous model); Unfortunately, many papers do not somewhat and spec. special refers/= provide proofs (often only informal to a discussion of liveness= later in the arguments), which means that these article; recovery means that the al- properties can be questioned. In this gorithm is blocking, that is, liveness case, inf. informal appears in the requires the recovery of crashed pro- Table. If an= algorithm does not discuss cesses; / refers to the failure the properties of total order broad- detector. neededP .S to ensure liveness. cast at all, the corresponding entry mentions NS not specified. If the (c) The next group of rows indicate = the building block(s) used by the nonuniform property is only discussed algorithm. The building blocks informally, then the corresponding considered are: view synchrony entry for the uniform property is left (Section 6.2), which encompasses empty (in an informal discussion, the a group membership service; reli- distinction between the uniform and able broadcast (Section 2.3) causal the nonuniform property usually does not appear). / ( yes/no) appears broadcast (Section 2.6.2); consensus ! × = (Section 4.5); or other. Other can be in some entries for the uniform prop- either ByzA. Byzantine agreement14 erty, meaning that these algorithms or spec. special= , which means that provide several levels of Quality of the explanation= is in the text. Service (QoS), which include a uni- form and a nonuniform version of (3) After discussing what is provided “to” the algorithm, where the nonuniform the algorithms, we discuss what is pro- version is more efficient. Moreover, vided “by” the algorithms. for being able to compare nonparti- (a) The first rows give the Properties en- tionable algorithms with partitionable sured by the algorithms. As discussed algorithms, we consider the properties in Section 2, total order broadcast is enforced by the former, when exe- specified by the following properties: cuted in a nonpartitionable system Validity, Uniform Agreement, Uni- model. form Integrity, Uniform Total Order. For the rows FIFO order and causal Validity and Uniform Integrity do not order, = yes appears only if this appear in the Tables. The reason is characteristic! is explicit in the paper. that these properties are rarely dis- Otherwise the entry is simply left cussed in the papers (authors usually blank. Finally, if an algorithm is not assume they are trivially ensured). fault-tolerant, then the distinction between the uniform and the nonuni- 14In the Byzantine agreement problem, also com- form properties does not make sense. monly known as the “Byzantine generals problem” In this case, the entry mentions n/a = [Lamport et al. 1982], every process has an a priori not applicable. knowledge that a particular process s is supposed to broadcast a single message m. Informally, the prob- (b) The rows called destination groups in- lem requires that all correct processes deliver the dicate whether the algorithm supports same message, which must be m,if the sender s is correct. As the name indicates, Byzantine agree- the total order broadcast of a message ment has mostly been studied in relation to Byzan- to multiple groups (row multiple), and tine failures. whether the algorithms support open

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 400 X. D´efago et al.

groups (see Section 3). The entry is We think that it is useful to stress left blank if the issue is not discussed again the respective roles of the Tables explicitly in the paper. and the accompanying text in Sections 7.1 to 7.6. The Tables provide factual infor- (4) The last group of rows, called fault- mation about each algorithm, as it was tolerant mechanisms, discusses the mech- published in the relevant papers. In con- anisms used to provide fault-tolerance. trast, the text provides complementary The row process mentions the mecha- information, including information that nisms used to tolerate process crashes we have extrapolated.In particular, the (see Section 6). Note that some of these text explains the originality of each algo- fault-tolerant mechanisms also appear as rithm, and complements items that are building blocks. However, not all build- left vague in the Tables (i.e., these points ing blocks have been reported as fault- are vague in the paper itself). Specifically, tolerant mechanisms (e.g., reliable broad- for some of the algorithms, the properties cast, causal broadcast).15 reported in the Tables are weaker than The row comm. mentions the mecha- those the algorithm might ensure. In such nisms used to address message losses. a case, the text mentions (and discusses) Most of the algorithms assume underly- the stronger property that might hold. We ing reliable channels, in which case the emphasize this point, as misunderstand- entry mentions n/a not applicable. The ing the respective roles of text and Ta- acronyms a and =a indicate a posi- bles might lead to the erroneous impres- tive, respectively+ negative,− acknowledg- sion that the text and the Tables are in ment mechanism. The other entries are contradiction. flood (flooding), special (explanation in the text below), and GM group member- ship.In the context of= unreliable chan- 7.1. Fixed Sequencer Algorithms nels, the GM mechanism is used in the Regardless of the variant they adopt (see case were some process p waits for a mes- Section 4.1), all sequencer algorithms sage from some other process q and if no assume an asynchronous system model message is received (e.g., due to loss), then and use time-free ordering. They toler- p requests the exclusion of q from the ate crash failures, except for Rampart, membership. which also tolerates Byzantine failures. In Sections 7.1 through 7.6, we give a They all rely on process-controlled crash brief description of each individual algo- to cope with failure; either explicitly (e.g., rithm to the information pro- Tandem), or through group membership vided in the Tables. Unlike the Tables, the and exclusion (e.g., Isis, Rampart). text descriptions also present information that we have deduced from the relevant 7.1.1. Amoeba. The Amoeba [Kaashoek papers. In some cases, the lack of technical and Tanenbaum 1996] group communi- details about the algorithms (in particular, cation system supports algorithms of the in the case of failures) leads us to extrap- first two variants of fixed sequencer algo- olate their behavior. In this case, we have rithms. The first one corresponds to the attempted to avoid being too assertive (by, variant UB (unicast-broadcast) illustrated e.g., using the conditional) and kindly rec- in Figure 6(a) (Section 4.1). The second ommend that the reader treat this specu- variant corresponds to BB (broadcast- lative information with an appropriate de- broadcast), see Figure 6(b). The two vari- gree of skepticism. ants share the same properties. Amoeba assumes lossy channels and 15The decision of what is a fault-tolerance mecha- implements message retransmission as nism, and what is not is somewhat arbitrary.We have decided to keep the number of mechanisms men- part of the total order broadcast algo- tioned in Section 6 low, that is, to mention only key rithm. Amoeba uses a combination of pos- mechanisms. itive and negative acknowledgments. The

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 401 actual protocol is quite complex because cast at a time, and thus does not need it is combined with flow control, and also sequence numbers. Later, Cristian et al. tries to minimize the communication cost. [1994] describe a variant UB of Tandem Amoeba tolerates failures using a group that allows concurrent broadcasts (and membership service. Suspected processes thus needs sequence numbers). are excluded from the group as the re- sult of the unilateral decision of a single 7.1.4. Garcia-Molina and Spauster. The al- process. gorithm proposed by Garcia-Molina and The properties of the Amoeba algo- Spauster [1991] is based on a propagation rithms are only discussed informally in graph (a forest) to support multiple over- the paper. However, since messages are lapping groups. The propagation graph delivered before they are stable, the al- is constructed is such a way that each gorithm can only satisfy the nonuniform group is assigned a starting node. Senders properties of Agreement and Total Order. send their messages to the corresponding starting nodes and messages travel along 7.1.2. MTP. MTP [Armstrong et al. the edges of the propagation graph. Or- 1992] is an algorithm primarily designed dering decisions are resolved along the for video streaming and similar multime- path. When used in a single group setting, dia applications. The algorithm assumes the algorithm behaves like other fixed se- that the system is not uniform with re- quencer algorithms (i.e., the propagation spect to the probability of process fail- graph is a tree of depth 1). ures. In particular, it assumes that a The algorithm assumes an asyn- process, called the master process, never chronous model and requires synchro- fails. The master is then designated as the nized clocks. However, synchronized sequencer, and the protocols follow vari- clocks are only needed to yield bounds on ant UUB (unicast-unicast-broadcast, see the behavior of the algorithm when crash Figure (c) 6). When a process p has a mes- failures occur. Neither the ordering mech- sage m to broadcast, p requests a sequence anism nor the fault-tolerance mechanism number for m from the sequencer. Once actually need them. it has obtained the sequence number, it In the event of failures, the algorithm sends m, together with the sequence num- behaves in an unconventional manner. In- ber, to all destinations and the master. At deed, if a nonleaf process p crashes, then the same time, destination processes learn its descendants in the propagation graph about the status of previous messages and do not receive any message until p has deliver those that have been accepted by recovered. Hence, the algorithm tolerates the master. process crashes only if those processes are The protocol tolerates crash failures of guaranteed to eventually recover. destination processes and senders, since all parts involving decisions are executed 7.1.5. Jia. Jia [1995] proposed another by the master. The failure of the master algorithm based on propagation graphs, is briefly discussed at the end of the pa- which creates simpler graphs than the al- per. The authors suggest that the master gorithm of Garcia-Molina and Spauster could be rendered more resilient by intro- [1991] (see Section 7.1.4). Unfortunately, ducing redundancy and using replication the algorithm originally proposed by Jia techniques. [1995] is incorrect. Chiu and Hsiao [1998] provide a correction to the algorithm 7.1.3. Tandem. The Tandem global up- which works only in a more restricted date protocol [Carr 1985] is a fixed se- model (i.e., only for closed groups). Also, quencer algorithm of variant UUB (see Shieh and Ho [1997] provide a correction Figure 6(c)). The algorithm allows, at to the message complexity calculated by most, one application message to be broad- Jia [1995].

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 402 X. D´efago et al.

Jia’s algorithm relies on the notion of Finally, the authors also briefly mention meta-groups, defined in the paper as “the that moving the role of the sequencer in set of processes which have exactly the the absence of failures might be a way to same group memberships” (i.e., the set avoid a bottleneck. However, the idea is of processes which belong to the exact not developed further. same set of destination groups). The meta- groups are organized into propagation 7.1.7. Navaratnam et al.. Navaratnam trees, according to the membership they et al. [1988] propose a fixed sequencer represent. The flow of messages is stream- protocol of variant UB (see Figure 6(a)). lined down the trees, thus creating the de- The fault-tolerance of the algorithm livery order. relies on a group membership service Jia [1995] describes a form of group and the ability to exclude wrongly sus- membership mechanism that is used to pected processes. Similar to Amoeba (Sec- redefine the parts of the propagation tion 7.1.1), the decision to exclude a sus- graph that must change when a process pected process can be taken unilaterally is deleted. Jia also suggests that, un- by one single process. like Garcia-Molina and Spauster’s algo- The properties of this algorithm are dis- rithm [1991] (Section 7.1.4), the nodes cussed informally, and it is easy to see that in the tree consist of entire meta-groups, it satisfies the nonuniform properties of rather than single processes. Thus, mes- Agreement and Total Order. The authors sages would not be stopped unless all also make a brief remark suggesting that members in an intermediary meta-group the algorithm does not guarantee uniform fail. The issue is, however, only addressed properties, but the wording is a little am- informally. biguous and the information provided in the paper is not sufficient to verify this interpretation. 7.1.6. Isis (Sequencer). Birman et al. [1991] describe several broadcast primi- tives of the Isis system, including a total 7.1.8. Phoenix. Phoenix [Wilhelm and order broadcast primitive called ABCAST. Schiper 1995] consists of three algorithms The ABCAST primitive is implemented which provide different levels of guar- using a fixed sequencer algorithm (differ- antees. The first algorithm (weak order) ent from the algorithm used in earlier ver- only guarantees Total Order and Agree- sions of the system; see Section 7.5.1). ment. The second algorithm (strong or- The Isis (sequencer) algorithm is a fixed der) guarantees both Uniform Total Or- sequencer algorithm of variant BB (see der and Uniform Agreement. Then, the Figure 6 (b)), which uses a causal broad- third algorithm (hybrid order) combines cast primitive. The algorithm assumes both guarantees on a per-message basis. crash failures. The three algorithms are based on a Constructed over a causal broadcast group membership service and view syn- primitive, the Isis ABCAST algorithm pre- chrony (see Section 3.3). serves causal order. Moreover, although the algorithm does not support total 7.1.9. Rampart. Unlike other sequencer order for multiple overlapping groups, algorithms, which only assume crash fail- causal order is nevertheless preserved ures, the algorithm of Rampart [Reiter in this context. The total order broad- 1994, 1996] is designed to tolerate Byzan- cast algorithm ensures only the nonuni- tine failures. This sets this algorithm form properties of Agreement and Total somewhat apart from the other sequencer Order. algorithms. For fault-tolerance, the total order Rampart assumes an asynchronous sys- broadcast algorithm relies on a group tem model with reliable FIFO chan- membership service and view synchrony nels, and a public key infrastructure in (Section 6.2). which every process initially knows the

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 403 public key of every other process. In ad- and negative acknowledgments. More pre- dition, communication channels are as- cisely, the token carries positive acknowl- sumed to be authenticated, so that the in- edgments, but when a process detects that tegrity of messages between two honest a message is missing, it sends a negative (i.e., non-Byzantine) processes is always acknowledgment to the token site. The guaranteed. negative acknowledgment scheme is used Unlike most early work on Byzan- for message retransmission, while the pos- tine failures, Rampart treats honest and itive scheme is used to detect message Byzantine processes separately. In partic- stability. ular, the paper defines uniformity as a property that applies to honest processes 7.2.1. Chang and Maxemchuck. The algo- only (see Note 1 in Section 2.4). With this rithm proposed by Chang and Maxemchuk definition, Rampart satisfies both Uni- [1984] is based on the existence of a logi- form Agreement and Uniform Total Order. cal ring along which a token is passed. The The algorithm is based on a group process that holds the token, also known membership service, which requires that as the token site, is responsible for se- at least one third of all processes in quencing the messages that it receives. the current view reach an agreement on The passing of the token simultaneously the exclusion of some process from the serves two purposes: (1) the transmission group. This condition is necessary because of the sequencer role, and (2) the detec- Byzantine processes could otherwise pur- tion of message stability. Point (2) requires posely exclude correct processes from the that the logical ring spans all destina- group. tion processes. This requirement is, how- ever, not necessary for ordering messages 7.2. Moving Sequencer Algorithms (point (1)), and hence the algorithm qual- ifies as a sequencer-based algorithm ac- We describe here four moving sequencer cording to our classification. algorithms, all of which are time-free. To When a process failure is detected (per- the best of our knowledge, there is no time- haps wrongly) or when a process recovers, based moving sequencer algorithm. It is the algorithm goes through a reformation actually questionable whether time-based phase. The reformation phase redefines ordering would even make sense for algo- the logical ring and elects a new initial to- rithms of this class. ken holder. The reformation algorithm can The four algorithms behave in a be seen as an ad hoc implementation of a very similar fashion. Actually, three of group membership service. them—Pinwheel (Section 7.2.4), RMP The properties of the total order broad- (Section 7.2.2), and DTP (Section 7.2.3)— cast algorithm are discussed only infor- are based on the fourth—Chang and mally. Nevertheless, it seems plausible Maxemchuck’s algorithm [1984] (Sec- that the algorithm ensures Uniform Total tion 7.2.1)—which they each improve in Order and Uniform Agreement. a different way. Pinwheel is optimized for a uniform message arrival pattern, RMP provides various levels of guaran- 7.2.2. RMP. RMP [Whetten et al. 1994] tees, and DTP provides a faster detection differs from the other three algorithms in of message stability. The four algorithms this group in that it is designed to op- also handle process failures very similarly, erate with open groups. Additionally, the using a reformation algorithm (see Sec- authors claim that “RMP provides multi- tion 7.2.1). Except for DTP (Section 7.2.4), ple multicast groups, as opposed to a sin- all algorithms rely on a logical ring along gle broadcast group.” However, according which the token circulates. to their description, supporting multiple The four algorithms tolerate message multicast groups is merely a character- loss by relying on a message retrans- istic associated with the group member- mission protocol that combines positive ship service. It is, therefore, dubious that

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 404 X. D´efago et al.

“multiple groups” is used with the mean- serves causal order, but this is valid only ing that total order is guaranteed for pro- under certain restrictions that make the cesses that are at the intersection of two problem trivial to solve.16 groups (see discussion in Section 3.2). Depending on the user’s choice, RMP 7.3. Privilege-Based Algorithms satisfies Agreement, Uniform Agreement, or neither of these properties. However, Like moving sequencer algorithms, most in order to ensure the strong guarantees, privilege-based algorithms are based on RMP must assume that a majority of the a logical ring, and for most of them rely processes remain correct and always con- on some kind of group membership or re- nected. Also, RMP does not preclude the configuration protocol to handle process contamination of the group. failures.

7.3.1. On-Demand. The On-demand pro- 7.2.3. DTP. As mentioned, DTP [Kim tocol [Cristian et al. 1997], unlike other and Kim 1997] differs from the other al- privilege-based algorithms, does not rely gorithms of this class in that it does not on a logical ring. Instead, processes with rely on a logical ring for the passing of a message to broadcast must obtain the the token. Instead, DTP follows a heuris- token by issuing a request to the current tic, where the token is always passed to the token holder. As a consequence, the pro- process seen as the least active. Doing this tocol is more efficient if senders send long ensures that messages are acknowledged bursts of messages and such bursts rarely more quickly when the activity (i.e., broad- overlap. Also, in contrast to the other algo- casting messages) is not uniformly spread rithms, all processes must be aware of the among processes. identity of the token holder. So, the pass- ing of the token is done using a broadcast. 7.2.4. Pinwheel. The originality of Pin- The On-demand protocol relies on the wheel [Cristian et al. 1997] is that the to- same model as the Pinwheel protocol (Sec- ken circulates among the processes at a tion 7.2.4). In other words, it assumes speed proportional to the global activity a timed asynchronous system model and of the sending processes (i.e., broadcasting physical clocks for timeouts. rate). A similar algorithm, called Reqtoken, is Pinwheel assumes that a majority of the also described by Friedman and van Re- processes remains correct and connected nesse [1997]. at all times (majority group). The algo- rithm is based on the timed asynchronous 7.3.2. Train. The Train protocol model of Cristian and Fetzer [1999]. Al- [Cristian 1991] is inspired by the im- though it relies on physical clocks for time- age of a train that transports messages outs, Pinwheel does not need to assume and circulates among processes. More that these clocks are synchronized. Fur- concretely, a token (a.k.a., the train) thermore, the algorithm is time-free, since moves along a logical ring and carries time is not used for ordering messages. the messages. When a process gets the Pinwheel can ensure Uniform Total Or- token, it receives the new messages der, given an adequate support from its carried by the token, acknowledges them, group membership (not detailed in the pa- and appends its own messages to the per). Pinwheel only satisfies (nonuniform) token. Then, it passes the token to the Agreement, but the authors argue that the next process. The Train protocol, where algorithm could easily be modified to sat- messages are carried by the token, is in isfy Uniform Agreement [Cristian et al. 1997]. Doing this would only require that 16In systems with a single closed group, where pro- destination processes wait until a message cesses are only allowed to communicate using total is known to be stable before delivering it. order broadcast, causal order is satisfied trivially by The authors claim that the algorithm pre- simply enforcing FIFO order.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 405 clear contrast to the other algorithms cast a message) have good latency at low of the same class, where messages are loads, latency increases at high loads and broadcast directly to the destinations. in the presence of processor crashes. More- The Train protocol is hence less attractive over, according to Agarwal et al. [1998], than the others in a broadcast network. the ring and the token-passing scheme make privilege-based algorithms highly efficient in broadcast LANs, but less suited 7.3.3. Token-FD. Ekwall et al. [2004] to interconnected LANs. To overcome this also present an algorithm based on token- problem, they extend Totem to an envi- passing in a ring. The algorithm is spe- ronment consisting of multiple intercon- cial because it relies on an unreliable fail- nected LANs. The resulting algorithm per- ure detector to tolerate failures, while all forms better in such an environment, but other token-based algorithms use a form otherwise has the same properties as the of group membership. original single-ring one. In the basic version of the algorithm, the token is the only carrier of informa- tion, just as in the Train protocol (an op- 7.3.5. TPM. TPM [Rajagopalan and timization is also described, in which the McKinley 1989] is closely related to token carries message identifiers). How- Totem. The main difference is that TPM ever, the token is sent not only to the is not partitionable (it only supports pri- immediate successor in the ring, but to mary partition membership). Moreover, f 1 successors, where f is the number of TPM only provides uniform agreement crashes+ that the algorithm tolerates. The and total order. Finally, while TPM only additional copies are only used if a process supports a closed group, the authors suspects the crash of its predecessor. For discuss some ideas on how to extend its liveness, the algorithm requires a fail- the algorithm to support multiple closed ure detector defined specifically for rings. groups. This failure detector is stronger than , Rajagopalan and McKinley [1989] also but weaker than (see Section 5.3.1)..S propose a modification of TPM in which re- .P transmission requests are sent separately from the token in order to improve the be- 7.3.4. Totem. The specificity of Totem havior in networks with a high rate of mes- [Amir et al. 1995] compared to other sage loss. privilege-based algorithms is that it is de- signed for partitionable systems. The or- dering guarantee ensured is Strong Total 7.3.6. Gopal and Toueg. Gopal and Order. Totem provides both (nonuniform) Toueg’s [1989] algorithm is based on the agreement and total order (called agreed round synchronous model. The round order), and uniform agreement and total synchronous model is a computation order (called safe order), when operated in model in which the execution of processes a nonpartitionable system. Causal order is is synchronized according to rounds. Dur- also ensured. ing each round, every process performs The algorithm uses a membership pro- the same actions: (1) send a message tocol, which has the responsibility for de- to all processes, (2) receive a message tecting processor failures, network parti- from all noncrashed processes, and then tioning, and loss of the token. When such (3) perform some computations. failures are detected, the membership pro- The algorithm works as follows. For tocol reconstructs a new ring, generates each round, one of the processes is des- a new token, and recovers messages that ignated as the transmitter. The transmit- had not been received by some of the pro- ter of some round r is the only process cessors when the failure occurred. which is allowed to broadcast new appli- The authors observe that, while mov- cation messages in round r.In that round, ing sequencer algorithms (in which hold- the other processes broadcast acknowl- ing the token is not required to broad- edgments of previous messages. Messages

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 406 X. D´efago et al. are delivered once they are acknowledged, in Section 4.4 (see Figure 11). Actually, three rounds after their initial broadcast. the paper describes a mutual exclusion al- gorithm. However, it is straightforward to 7.3.7. RTCAST. RTCAST [Abdelzaher derive a total order broadcast algorithm et al. 1996] was designed for applica- from the mutual exclusion algorithm. tions that need real-time guarantees. Since the delivery order of a message m The algorithm assumes a synchronous is determined by the timestamp of the system with synchronized clocks. These broadcast event of m, the total order is an strong guarantees allow for simplification extension of causal order. The algorithm in the protocol. The paper also shows is not tolerant to failures. how the maximum token rotation time A similar algorithm is described by can be used for the admission control Attiya and Welch [1994], when comparing and schedulability analysis of real-time consistency criteria. messages (with the goal to guarantee the delivery deadline of these messages). 7.4.2. Psync. The Psync algorithm 7.3.8. MARS. MARS [Kopetz et al. [Peterson et al. 1989] is used in several 1991] is based on the technique of time group communication systems: Consul division multiple-access (TDMA; see [Mishra et al. 1993], Coyote [Bhatti et al. Note 11). TDMA consists of predeter- 1998], and Cactus [Hiltunen et al. 1999]. mined periodic time slots assigned to each In Psync, processes dynamically build a process. Processes are then allowed to causality graph of messages they receive. send or broadcast messages only during Psync then delivers messages according their own time slots. The system assumes to a total order that is an extension of the that processes have synchronized clocks, causal order. whereby they are able to accurately Psync assumes an asynchronous sys- determine the beginning and the end of tem model with (permanent) crash fail- their own time slot. In addition, commu- ures and (transient) lossy communication. nication is assumed to be reliable, and To tolerate process failures, the algorithm with bounded delays. seems to assume a perfect failure detec- Based on the mutual exclusion provided tor, although this is not stated explicitly by TDMA and the communication model, in the paper. To implement reliable chan- total order broadcast is easily imple- nels, the algorithm uses negative acknowl- mented. The ordering mechanism can be edgments (to request the retransmission seen as similar to Gopal and Toueg’s algo- of lost messages). rithm (Section 7.3.6), but in a time-based Psync is specified only informally. Nev- model, where communication uses time ertheless, we believe that the protocol en- rather than messages [Lamport 1984]. sures Total Order in the absence of fail- Kopetz et al. [1991] do not discuss the ures. The behavior in the face of failures behavior of their total order broadcast al- is unfortunately not described in enough gorithm in the presence of failures. This detail to make a confident claim about it. makes it difficult to determine whether Agreement is a little more complex. In the algorithm is uniform or not. We be- the absence of message loss, Psync en- lieve that it is not uniform, simply because sures Agreement. However, with certain uniformity induces a cost in performance combinations of process crash and mes- that the authors are unlikely to consider sage loss, it is possible that some correct affordable. processes discard messages that are oth- erwise delivered by others. Hence, when 7.4. Communication History Algorithms message loss is considered, Agreement can be violated. This problem is discussed in 7.4.1. Lamport. The principle of detail by the authors, who relate it to Lamport’s algorithm [Lamport 1978b], an instance of the “last acknowledgment which uses logical clocks, was explained problem.”

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 407

Malhis et al. [1996] provide an anal- 7.4.5. ToTo. The ToTo algorithm [Dolev ysis of the performance of Psync in the et al. 1993] ensures Weak Total Order presence of message loss. They conclude (see Section 3.4; it is called “agreed mul- that Psync performs well if broadcasts are ticast” in Dolev et al. [1993]). It is built frequent and message loss rare, but per- on top of the Transis partitionable group forms poorly when broadcasts are infre- communication system [Dolev and Malkhi quent and message loss common. They 1996]. ToTo extends the order of an un- show that the performance can be im- derlying causal broadcast algorithm. It is proved by regularly sending empty mes- based on dynamically building a causal- sages, as is done by other communica- ity graph of received messages. The Tran- tion history algorithms (see Note 14 in sis system offers both a uniform and a Section 4.4). nonuniform variant of the algorithm. A particularity of ToTo (nonuniform variant) is that, to deliver a message m,apro- 7.4.3. Newtop (Symmetric). Ezhilchelvan cess must have received acknowledgments et al. [1995] propose two algorithms: for m, from as few as a majority of the pro- a symmetric one and an asymmetric cesses in the current view (instead of all one. The symmetric algorithm extends view members). Lamport’s algorithm (Section 7.4.1) in several ways: it makes it fault-tolerant, allows a process to be member of multiple 7.4.6. Total. The Total algorithm [Moser groups, and allows the broadcast of a mes- et al. 1993] is built on top of a reliable sage to multiple groups. As for Lamport’s broadcast algorithm called Trans (Trans algorithm, Newtop preserves causal is defined together with Total). However, order. Trans is not used as a black box (which Newtop is based on a partitionable explains why we did not list reliable broad- group membership service (see Sec- cast as a building block for this algorithm tion 3.4). The Newtop platform leaves it to in Table IV). Trans uses an acknowledg- applications to decide whether or not they ment mechanism that defines a partial or- should maintain more than one subgroup der on messages. Total extends the par- in such a situation. Newtop satisfies the tial order of Trans into a total order. Two property of Weak Total Order mentioned variants are defined: the more efficient one in Section 3.4. tolerates f < n/3 crashes, and the other The asymmetric algorithm belongs to a tolerates f < n/2 crashes. different class, and is hence discussed in The Total algorithm fulfills the Agree- the relevant section (Section 7.6.1). The ment property (in fact, Uniform Agree- two algorithms (symmetric and asymmet- ment) with high probability.Actually Total ric) can easily be combined to allow the requires the underlying Trans reliable use of the symmetric algorithm in some broadcast protocol to provide probabilis- groups, and the asymmetric algorithm in tic guarantees about not reordering mes- others. sages. This has some similarities with the notion of oracles (see Section 7.5.13). 7.4.4. Ng. Ng [1991] presents a com- Moser and Melliar-Smith [1999] pro- munication history algorithm that uses pose an extension of Total to tolerate a minimum-cost spanning tree to propa- Byzantine failures. gate messages. The ordering of messages is based on Lamport’s clocks, similar to Lamport’s algorithm. However, messages 7.4.7. ATOP. ATOP [Chockler et al. and acknowledgments are propagated and 1998] is an algorithm following the deter- gathered, using a minimum-cost spanning ministic merge approach (Section 4.4). The tree. The use of a spanning tree improves focus of the paper is on adapting the algo- the scalability of the algorithm and makes rithm to different, and possibly changing it adequate for wide-area networks. sending rates. A pseudo-random number

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 408 X. D´efago et al. generator is used in computing the deliv- sage is delayed to ensure total order, and ery order. to have as few messages as possible sent The paper is mostly concerned with en- by destination processes. The algorithm suring an ordering property. This property is designed for systems in which several is Strong Total Order, defined in the con- senders send a constant stream of mes- text of partitionable systems (Section 3.4). sages (at an approximately fixed rate). In The algorithm ensures FIFO order, and this algorithm, each received message de- ensures causal order trivially (see Foot- terministically defines the sender of the note 16). next message to be accepted. The algo- rithm relies on approximately synchro- nized clocks that are used by senders to 7.4.8. COReL. The COReL algorithm put a physical timestamp on their mes- [Keidar and Dolev 2000] is built on top sages. Upon receiving such a timestamped of a partitionable group membership ser- message, a destination process computes vice like Transis. The underlying service (using the timestamp and the sending must also offer Strong Total Order (Sec- rates of messages) the next sender from tion 3.4), as well as causal order. COReL which it will accept a message. The qual- gradually builds a global order (Reliable ity of the synchronization is important Total Order) by tagging messages accord- to ensure good performance of the algo- ing to three different color levels (red, yel- rithm, but it is not required for its cor- low, green). A message starts as red (no rectness. The algorithm is most efficient if knowledge about its position in the global clocks are synchronized (but works even order), then passes to yellow (received and if they are not) and each sender sends acknowledged when the process is a mem- messages at some fixed rate known a pri- ber of a majority component), and green ori (the rate may be different for each (all members of the majority component sender). To ensure the liveness of the al- acknowledged the message, and its posi- gorithm, senders need to send empty mes- tion in the global order is known). Green sages when they have no message to send messages are delivered to the application. (these messages divide the execution into Messages are retransmitted and promoted independent epochs). The algorithm, as to green whenever partitions merge. All described, is not fault-tolerant. acknowledgments sent by the algorithm are piggybacked. COReL provides the fol- lowing liveness guarantee: if eventually 7.4.10. HAS. Cristian et al. [1995] pro- there is a stable majority component, all pose a collection of total order broad- messages sent by the members of this com- cast algorithms (called HAS) that as- ponent are delivered. sume a synchronous system model with COReL also supports process recovery #-synchronized clocks. The authors de- if processes are equipped with stable stor- scribe three algorithms—HAS- , HAS- , age. This requires that processes log each and HAS- —that are respectivelyO toler-T message that is sent (before sending the ant to omissionB failures, timing failures, message), and each message that is re- and Byzantine failures. These algorithms ceived (before sending an acknowledg- are based on the principle of information ment). diffusion, which is itself based on the no- Fekete et al. [2001] formalize a variant tion of flooding, or gossiping. In short, of the COReL algorithm and the guaran- when a process wants to broadcast a mes- tees offered by the underlying group mem- sage m,it timestamps it with the time of bership service, using I/O automata. emission T, according to its local clock, and sends it to all neighbors. Whenever a 7.4.9. Deterministic Merge. The main mo- process receives m for the first time, it re- tivation for the deterministic merge algo- lays it to its neighbors. Processes deliver rithm of Aguilera and Strom [2000] is to message m at time T $, according to minimize the expected time that a mes- their local clocks (where+$ is constant that

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 409 depends on the topology of the network, The paper also presents an algorithm the number of failures tolerated, and the for asynchronous systems. However, this maximum clock drift #). algorithm belongs to the class of desti- The paper proves that the three HAS nations agreement algorithms and is dis- algorithms satisfy Agreement. The au- cussed there (Quick-A; Section 7.5.14). thors do not prove Total Order, but by the properties of synchronized clocks and the 7.4.13. ABP. The principle of ABP timestamps, Uniform Total Order is not [Minet and Anceaume 1991b; An- too difficult to enforce. However, if the syn- ceaume 1993a] is close to the principle chronous assumptions do not hold, the al- of Lamport’s algorithm (Section 7.4.1): gorithms could violate the safety of the messages are delivered according to protocol (i.e., Total Order), rather than timestamps attached to messages by just its liveness. their sender. Each process manages a local sequence number variable, used 7.4.11. Redundant Broadcast Channels. to timestamp messages. Let process p Cristian [1990] presents an adaption of broadcast message m.In the first phase, the HAS- algorithm (omission failures) m and its timestamp value tsm are sent to to broadcastO channels. The system model all processes. Any process q that receives assumes the availability of f 1 inde- message m sends back a reply to p. The pendent broadcast channels (or networks)+ reply of process q to p may also include that connect all processes together, thus some message m#, if q had previously creating f 1 independent communi- broadcast m# with the same timestamp + cation paths between any two processes value (tsm# tsm). Upon reception of all = (where f is the maximum number of fail- replies from correct processes, process p ures). Compared to HAS- , the algorithm knows the set Msg(tsm)of all messages for redundant broadcastO channels issues with the same timestamp value tsm. Pro- significantly fewer messages. cess p delivers those messages (ordered according to the identifier of the sender of each message). Process p also broadcasts 7.4.12. Quick-S. Berman and Bharali the set Msg(tsm), thus allowing the other [1993] present several closely related to- processes to deliver the same sequence of tal order broadcast algorithms in a vari- messages. ety of system models. In synchronous sys- tems (three variants in the paper), the algorithms are similar to the HAS algo- 7.4.14. Atom. In Atom [Bar-Joseph rithms: messages are timestamped (with et al. 2002], streams of messages from physical or logical timestamps, depend- all senders are merged in a round-robin ing on the system model), and a message, fashion. To make the algorithms live, timestamped with T, can be delivered at senders need to send empty messages T $, for some value of $. The difference if they have no message to send. This is that+ they use a Byzantine agreement al- approach can be seen as a special case of gorithm with a bounded termination time deterministic merge (see Section 7.4.9). to send messages. There are algorithms that work with Byzantine failures, and 7.4.15. QoS Preserving Atomic Broadcast. ones that work with crash failures only; Bar-Joseph et al. [2000] present another the latter ensure Uniform Prefix Order. algorithm, based on the same ordering For Byzantine failures, the algorithm en- mechanism as Atom (Section 7.4.14). As sures only nonuniform properties. This is its name indicates, the QoS preserving al- because, unlike the specification of Ram- gorithm provides support for quality of part (Section 7.1.9), the specification used service (QoS), unlike Atom. On the other by Quick-S does not distinguish between hand, the QoS preserving algorithm does Byzantine processes and those that only not guarantee Agreement (i.e., uniform or fail by crashing. not), and only nonuniform Total Order.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 410 X. D´efago et al.

7.5. Destinations Agreement Algorithms liveness is ensured are not discussed (and difficult to infer). 7.5.1. Skeen. Skeen’s algorithm, des- cribed by Birman and Joseph [1987], was used in an early version of the Isis toolkit. 7.5.3. Le Lann and Bres. Le Lann and The algorithm corresponds roughly to the Bres [1991] wrote a position paper dis- algorithm in Figure 14. The main dif- cussing total order broadcast in a system ference is that Skeen’s algorithm com- with omission faults. The paper sketches putes the global timestamp in a cen- a total order broadcast algorithm based on tralized manner, while the algorithm in quorums. Figure 14 does it in a decentralized way. Fault-tolerance is achieved using a group 7.5.4. Chandra and Toueg. Chandra and membership service, which excludes sus- Toueg [1996] propose a transformation pected processes from the group. of atomic broadcast into a sequence of Dasser [1992] propose a simple op- consensus problems, where each consen- timization of Skeen’s algorithm called sus decides on a set of messages, easily TOMP, where additional information is transformed into a sequence of messages. appended to protocol messages in order The transformation uses reliable broad- to deliver application messages a little cast. The idea of this transformation, de- earlier. scribed in Section 4.5, is not repeated here. The algorithm assumes an asynchro- nous system model, reliable broadcast, 7.5.2. Luan and Gligor. Luan and Gligor and a black box that solves consensus. [1990] proposed an algorithm based on The algorithm is extremely elegant, in the majority voting. The idea is the following. sense that all difficult issues related to Upon execution of TO-broadcast(m), mes- fault-tolerance are hidden in the consen- sage m is sent to all processes. Upon re- sus black box. ception of m by some process q, m is put There have been several proposals to into q’s receiving buffer. The message de- optimize this algorithm. For example, livery order is then decided by a voting Mostefaoui´ and Raynal [2000] propose an protocol, which can be initiated by any of optimistic approach in which the consen- the processes. In case of concurrent initia- sus algorithm is split into two parts. The tion of the protocol, an arbitration rule is first phase is optimized, but does not al- used. ways succeed. If this happens, the full con- Voting is initiated by broadcasting an sensus algorithm is executed. “invitation” message. Consider this mes- sage broadcast by process p. Processes reply by sending the content of their re- 7.5.5. Rodrigues and Raynal. Rodrigues ceiving buffer to p. Process p waits for a and Raynal [2000] present a total order majority of replies. Based on the messages broadcast algorithm in a model where pro- received, process p then constructs a se- cesses have access to stable storage and quence of message identifiers, and broad- may recover after a crash. The algorithm casts this sequence. A process receiving is very close to the Chandra-Toueg algo- the sequence sends an acknowledgment rithm (Section 7.5.4): it uses the same to p. Once p has received acknowledg- transformation of total order broadcast to ments from a majority of processes, the consensus. The only difference is that, be- proposed sequence is committed. cause of the crash-recovery model, the al- To summarize, the protocol tries to gorithm relies on the crash-recovery con- reach consensus among the destination sensus algorithm of Aguilera, Chen, and processes on a sequence of messages. How- Toueg [2000]). ever, the authors did not identify consen- sus as a subproblem to solve, which makes 7.5.6. ATR. Delporte-Gallet and Fau- the protocol more complex. The conse- connier [1999] describe the ATR algo- quence is that the conditions under which rithm, which is based on an abstraction

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 411 called Synchronized Phase System (SPS). ally received in the same order by every The SPS abstraction is defined in an asyn- process. When this assumption is met, the chronous system. An SPS decomposes the algorithm delivers messages very quickly. execution of an algorithm in rounds, al- However, if the assumption does not hold, most like a synchronous round model. the algorithm is less efficient than other The ATR algorithm distinguishes between algorithms (but still delivers messages in even and odd rounds. In even rounds, pro- total order). cesses send ordered sets of messages to Unlike most optimistic algorithms, the each other. Upon reception of these mes- optimistic atomic broadcast of Pedone and sages, each process constructs a sequence Schiper [2003] is optimistic internally. of messages. In the subsequent odd round, This means that the optimistic mech- processes try to validate the order and de- anism of the algorithm is not appar- liver messages. ent to the application. In other words, there is no weakening of the delivery 7.5.7. SCALATOM. SCALATOM Ro- properties. drigues et al. [1998] is based on Skeen’s algorithm (Section 7.5.1) and supports the 7.5.10. Prefix Agreement. Anceaume broadcast of messages to multiple groups. [1997] defines a variant of consensus, The algorithm satisfies the Strong Mini- called prefix agreement, where processes mality property (Section 3.2.4). The global agree on a stream of values, rather than timestamp is computed using a variant on a single value. Considering streams of Chandra and Toueg’s [1996] consensus rather than single values makes the algorithm (Section 7.5.4). SCALATOM prefix agreement algorithm particularly corrects an earlier algorithm, called MTO well suited to solve total order broadcast. [Guerraoui and Schiper 1997]. The total order broadcast algorithm uses prefix agreement to repeatedly decide on 7.5.8. Fritzke et al. Fritzke et al. [2001] the sequence of messages to be delivered also propose an algorithm for the broad- next. cast of messages to multiple groups. The algorithm satisfies the Strong Minimality 7.5.11. Generic Broadcast. Generic broad- property (Section 3.2.4). Consider a mes- cast [Pedone and Schiper 1999; 2002] is sage m broadcast to multiple groups. First, not a total order broadcast per se. In- the algorithm uses consensus to decide on stead, the algorithm assumes a conflict the timestamp of m within each destina- relation on the messages, and two mes- tion group. The destination groups then sages m and m# are delivered in the same exchange information to compute the final order at each destination process, only timestamp, and a second consensus is ex- if they conflict. Two messages m and m# ecuted in each group to update the logical that do not conflict are not ordered by clock. the algorithm. If all messages conflict, then generic broadcast provides the same 7.5.9. Optimistic Atomic Broadcast. Opti- guarantee as total order broadcast. If no mism is a technique used for several messages conflict, then generic broadcast years in the context of concurrency con- provides the guarantees of (uniform) re- trol [Bernstein et al. 1987] and file system liable broadcast. The strong point of this replication [Guy et al. 1993]. However, it algorithm is that performance varies ac- has only recently been considered in the cording to the required “amount of or- context of total order broadcast [Pedone dering”. The generic broadcast algorithm 2001]. uses a consensus algorithm only in case of The optimistic atomic broadcast algo- conflicts. rithm of Pedone and Schiper [1998, 2003] is based on the experimental observation 7.5.12. Thrifty Generic Broadcast. Aguil- that messages broadcast in a LAN are usu- era, Delporte-Gallet et al. [2000] also

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 412 X. D´efago et al. propose a generic broadcast algorithm. decide on the round in which messages are When conflicting messages are detected, to be delivered. Pedone and Schiper [2002] solve generic broadcast by reduction to consensus, while 7.5.15. AMp /xAMp. The AMp [Ver´ıssimo Aguilera et al. [2000] solve generic broad- et al. 1989] and xAMp [Rodrigues and cast by reduction to total order broadcast. Ver´ıssimo 1992] algorithms rely on the as- In addition, the algorithm is thrifty in the sumption that, most of the time, broadcast sense that, if there is a time after which messages are received by all destination broadcast messages do not conflict with processes in the same order (a realistic as- each other, then eventually atomic broad- sumption in LANs, as already mentioned). cast is no longer used. The algorithm of So, when a process broadcasts a message, Pedone and Schiper [2002] also satisfies it initiates a commitment protocol. If the this property with respect to consensus, messages are received in order by all des- but the property was not identified in the tination processes, then the outcome is paper. positive: all destination processes commit and deliver the message. Otherwise, the 7.5.13. Weak Ordering Oracles. Pedone message is rejected and the sender must et al. [2002] define a weak ordering ora- try again (thus potentially leading to a cle as an oracle that orders messages that livelock). are broadcast, but is allowed to make mis- takes (i.e., the messages broadcast may be 7.6. Hybrid Algorithms delivered out of order). This oracle models Here we discuss algorithms that do not the behavior observed in local-area net- fit into one of our five classes of total works, where broadcast messages are of- order broadcast algorithms. These algo- ten spontaneously delivered in total order. rithms usually combine two different or- The paper shows that total order broad- dering mechanisms. cast can be solved using a weak ordering oracle. If the optimistic assumption is met, 7.6.1. Newtop (Asymmetric). Ezhilchel- the proposed algorithm, which assumes van et al. [1995] propose two algorithms; f < n , solves total order broadcast in two 3 one symmetric and the other asymmetric. communication steps. The symmetric algorithm was described Interestingly, the algorithm has the earlier (Section 7.4.3). same structure as the randomized consen- The asymmetric algorithm uses a se- sus algorithm proposed by Rabin [1983]. quencer process, and allows a process to The authors also mention that the weak be a member of multiple groups (each ordering oracle could be used to design group has an independent sequencer). For a total order broadcast algorithm with ordering, the algorithm uses Lamport’s the same structure as the randomized logical clocks [1978b] in addition to the consensus algorithm proposed by Ben-Or sequencer. Hence the asymmetric algo- [1983]. rithm is a hybrid between a communica- tion history algorithm (due to the use of 7.5.14. Quick-A. Berman and Bharali Lamport’s clocks) and a fixed sequencer al- [1993] present a series of four algorithms, gorithm. The asymmetric algorithm, like three of which belong to another class (see the symmetric one, preserves causal or- Quick-S, Section 7.4.12). Their algorithm der delivery. However, note that a pro- for asynchronous systems is quite differ- cess p, which is a member of more than ent from their algorithms for synchronous one group, cannot broadcast a message systems (Section 7.4.12). Processes main- m to a group immediately after broad- tain a round number, and broadcast mes- casting some message m# to a different sages are timestamped with this round group. Process p can only deliver m# after number. The processes then execute a se- it has delivered m. Hence, the asymmet- quence of randomized binary consensus, to ric algorithm does not technically allow a

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 413 message to be broadcast to more than one (Section 7.6.2). The delivery order is deter- group. mined by sequence numbers attached to As mentioned in Section 7.4.3, Newtop messages. A sequence number attached to [Ezhilchelvan et al. 1995] supports the a message m must be validated by a major- combination of groups, even if one group ity of processes before the total order of m uses the asymmetric algorithms and the is decided. Nevertheless, the algorithm op- other group uses the symmetric one. Also, timistically delivers m according to its se- Newtop is based on a partitionable group quence number before the sequence num- membership service. ber is actually validated.

7.6.2. Hybrid. Rodrigues et al. [1996] 7.6.4. Optimistic Total Order in WANs. Op- present an algorithm optimized for large timistic total order broadcast algorithms networks. The algorithm is hybrid: on a lo- rely heavily on the assumption that mes- cal scale, a sequence number is attached to sages are very often received by all pro- each message by a fixed sequencer, and on cesses in some spontaneous total order. a global scale, the ordering is of type com- This assumption was motivated by obser- munication history. More precisely, each vations made in local networks often over sender p has an associated sequencer pro- a single hub. The assumption is, however, cess that issues a sequence number for questionable for wide-area networks, in each message of p. The original message which the spontaneous total order is sig- and its sequence number are sent to all, nificantly less likely to occur. Sousa et al. and messages are finally ordered using [2002] propose a time-based solution to ad- a standard communication history tech- dress this problem and increase the prob- nique (see Section 7.4.1). The authors also ability of spontaneous total order in wide- describe interesting heuristics to change area networks. The technique, called delay the sequencer process. The reasons for compensation, consists of artificially de- such changes can be failures, membership laying received messages, so that all des- changes, or changes in the traffic pattern. tinations will process them at roughly the same time. A delay is kept for each incom- ing communication channel, and the dura- 7.6.3. Indulgent Uniform Total Order. Vi- tion of this delay is adapted dynamically. cente and Rodrigues [2002] propose an op- timistic algorithm for wide-area networks. 8. OTHER WORK ON TOTAL ORDER AND The algorithm is based on external op- RELATED ISSUES timism, as initially proposed by Kemme et al. [1999, 2003]. This means that the Apart from papers proposing total algorithm distinguishes between two de- order broadcast algorithms, there is livery events following the broadcast of other closely related work that is worth message m: the optimistic delivery, de- mentioning. noted Opt-deliver(m), and the traditional Backoff Protocol. Chockler, Malkhi, total order delivery, denoted Adeliver(m). and Reiter [2001] describe a replication Upon Opt-deliver(m), the delivery order protocol which emulates state machine of m is not yet decided. However, the ap- replication [Lamport 1978a; Schneider plication can start processing m.If later, 1990]. The protocol is based on quorum To-deliver(m) invalidates the optimistic systems and relies on a mutual exclusion delivery order, then the application must protocol. Basically, a client process want- rollback and undo the processing of m. The ing to perform some operation op on the optimism of Kemme et al. [2003] is re- replicated servers proceeds as follows: the lated to the spontaneous total ordering in client first waits to enter the critical sec- LANs. tion, and then (1) accesses a quorum of The optimistic algorithm of Vicente replicas to get an up-to-date state σ of the and Rodrigues [2002] extends the hy- replicated servers, (2) performs the opera- brid algorithm of Rodrigues et al. [1996] tion op on σ which leads to a new state σ #,

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 414 X. D´efago et al. and (3) updates a quorum of replicas with and Lee 1996], or the ability to reserve the new state σ #. The protocol is safe even buffers [Chen et al. 1996]. if the mutual exclusion protocol violates Performance of Total Order Broadcast safety (more than one process in the criti- Algorithms. Compared to the host of cal section): safety of the mutual exclusion publications describing algorithms, rel- protocol is only needed to ensure progress atively few papers are concerned with of the replication protocol. evaluating the performance of total or- Optimistic Active Replication. Felber der broadcast (e.g., Cristian et al. [1994], and Schiper [2001] describe another repli- Friedman and van Renesse [1997], Mayer cation protocol that is integrated with [1992], described in Section 1). Recently, a total order broadcast algorithm. The we presented a comparative performance replication protocol is based on an opti- analysis based on the classification devel- mistic fixed-sequencer total-order broad- oped in this survey [Defago´ et al. 2003]: cast algorithm, which is executed among algorithms are taken from all five classes the servers. The optimistic algorithm may of ordering mechanisms, and both uni- lead some servers to deliver messages form and nonuniform algorithms are con- out of order, in which case these servers sidered. Urban´ et al. [2003] go beyond have to rollback. Rollback is limited to simply evaluating some algorithm or com- servers; client processes never have to paring different algorithms. They propose rollback. benchmarks including well-defined perfor- Probabilistic Protocols. Recently, mance metrics, workloads, and faultloads Felber and Pedone [2002] have proposed describing how failures and related events a total ordered broadcast algorithm with occur. probabilistic safety. This means that their Formal Methods. Formal methods algorithms enforce the properties of total have been applied both to the problem of order broadcast with a known probability. total order broadcast, in order to verify Doing so makes room for extremely scal- the properties of algorithms [Zhou and able solutions, but it is only acceptable Hooman 1995; Toinard et al. 1999; Fekete for applications with very weak require- et al. 2001], and to the problem of consen- ments. In particular, Felber and Pedone sus, in order to construct a truly formal [2002] propose a specification where proof for an algorithm [Nestmann et al. agreement is guaranteed with probabil- 2003]. The proofs of Fekete et al. [2001] ity γa, total order with probability γo, and were subsequently checked by a theorem validity with probability γv. The authors prover. Liu et al. [2001] use the notion of propose an algorithm based on gossiping meta-properties to describe and analyze and discuss sufficient conditions under a protocol which switches between two which their algorithm can enforce the total order broadcast algorithms. above properties with probability one. Group Communication Controversy. Hardware-Based Protocols. Due to Eleven years ago, Cheriton and Skeen their specificity, we have deliberately [1993] published a polemic about group omitted algorithms that make explicit communication systems that provide use of dedicated hardware. However, they causally and totally ordered communi- deserve to be cited here. Some protocols cation primitives. Their major argument are based on a modification of the network against group communication systems controllers (e.g., Jalote [1998] and Minet was that systems based on transactions and Anceaume [1991a]). The idea is to are more efficient, while providing a slightly modify the network so that it stronger consistency model. This was can be used as a virtual sequencer. In subsequently answered by Birman [1994] our classification system, these proto- and Shrivastava [1994]. More than a cols can be classified as fixed sequencer decade later, it appears that work on protocols. Some other protocols rely on transaction systems and on group com- the characteristics of specific networks munication systems tended to influence such as a specific topology [Cordova´ each other for a mutual benefit [Schiper

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 415 and Raynal 1996; Agrawal et al. 1997; define the limits within which the algo- Pedone et al. 1998; Kemme and Alonso rithm can be used. 2000; Wiesmann et al. 2000; Kemme et al. 2003]. ACKNOWLEDGMENTS

We are grateful to the following people for their 9. CONCLUSION helpful advice and constructive criticisms: David E. Bakken, Bernadette Charron-Bost, Adel Cherif, The vast literature on total order broad- Alan D. Fekete, Takuya Katayama, Tohru Kikuno, cast and the large number of published Dahlia Malkhi, Keith Marzullo, Rui C. Oliveira, algorithms show the complexity of the Fernando Pedone, Michel Raynal, Luis T. Rodrigues, problem. However, this abundance of in- Richard D. Schlichting, Tatsuhiro Tsuchiya, and formation is a problem in itself, because it Matthias Wiesmann, as well as Fred B. Schneider, makes it difficult to understand the exact and the anonymous reviewers. We are also thankful tradeoffs associated with each proposed to Judith Steeh for proofreading this text. solution. The main contribution of this article is REFERENCES the definition of a classification for total ABDELZAHER, T., SHAIKH, A., JAHANIAN, F., AND SHIN, K. order broadcast algorithms, that makes it 1996. RTCAST: Lightweight multicast for real- easier to understand the relationship be- time process groups. In IEEE Real-Time Tech- tween them. It also provides a good basis nology and Applications Symposium (RTAS’96). for comparing the algorithms and under- (Boston, MA). IEEE Computer Society Press. 250–259. standing some tradeoffs. Furthermore, the AGARWAL,D.A., MOSER, L. E., MELLIAR-SMITH,P.M., article has presented a vast survey of most AND BUDHIA,R.K. 1998. The Totem multiple- of the existing algorithms and discussed ring ordering and topology maintenance proto- their respective characteristics. col. ACMTrans. Comput. Syst. 16,2(May), 93– In spite of the large number of total or- 132. der broadcast algorithms published, most AGRAWAL, D., ALONSO, G., EL ABBADI, A., AND STANOI, I. 1997. Exploiting atomic broadcast in repli- are merely improvements or variants of cated databases (extended abstract). In Proceed- each other (even if this is not immedi- ings of EuroPar’97 (Passau, Germany). Lecture ately obvious to the untrained eye). Ac- Notes in Computer Science, vol. 1300. Springer- tually, there are only a few truly original Verlag. 496–503. algorithms, but a large collection of vari- AGUILERA, M. K., CHEN, W., AND TOUEG,S. 1999. Us- ous optimizations. Nevertheless, it is im- ing the heartbeat failure detector for quiescent reliable communication and consensus in par- portant to stress that clever optimizations titionable networks. Theor. Comput. Sci. 220,1 of existing algorithms often have a very (June), 3–30. significant impact on performance. For in- AGUILERA, M. K., CHEN, W., AND TOUEG,S. 2000. stance, Friedman and van Renesse [1997] On quiescent reliable communication. SIAM J. show that piggybacking messages, in spite Comput. 29,6, 2040–2073. of its simplicity, can significantly improve AGUILERA, M. K., DELPORTE-GALLET, C., FAUCONNIER, H., AND TOUEG,S. 2000. Thrifty generic broad- the performance of algorithms. cast. In Proceedings of 14th International Sym- Even though the specification of the to- posium on Distributed Computing (DISC’00) tal order broadcast problem dates back (Toledo, Spain). M. Herlihy, Ed. Lecture Notes to some of the earliest publications about in Computer Science, vol. 1914. Springer-Verlag. the subject, few papers actually specify 268–282. the problem that they solve. In fact, too AGUILERA, M. K. AND STROM,R.E. 2000. Efficient atomic broadcast using deterministic merge. few algorithms are properly specified, let In Proceedings of 19th ACM Symposium on alone proven correct. Fortunately, this is Principles of Distributed Computing (PODC-19) changing and we hope that this article will (Portland, OR). ACM Press. 209–218. contribute to more rigorous work in the AMIR, Y., DOLEV, D., KRAMER, S., AND MALKI,D. future. Without pushing formalism to ex- 1992. Transis: A communication sub-system for high availability. In Proceedings of 22nd tremes, a clear specification and a sound International Symposium on Fault-Tolerant proof of correctness are as important as Computing (FTCS-22) (Boston, MA). IEEE Com- the algorithm itself. Indeed, they clearly puter Society Press. 76–84.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 416 X. D´efago et al.

AMIR, Y., MOSER, L. E., MELLIAR-SMITH,P.M., AGARWAL, BIRMAN,K. 1994. A response to Cheriton and D. A., AND CIARFELLA,P. 1995. The Totem Skeen’s criticism of causal and totally ordered single-ring ordering and membership protocol. communication. ACM Operat. Syst. Rev. 28,1 ACMTrans. Comput. Syst. 13,4(Nov.), 311–342. (Jan.), 11–21. ANCEAUME,E. 1993a. Algorithmique de fiabilisa- BIRMAN, K. AND VAN RENESSE,R. 1994. Reliable Dis- tion de systemes` repartis.´ Ph.D. thesis, Univer- tributed Computing with the Isis Toolkit (Los sited´ eParis-sud (Paris XI), Paris, France. Alamitos, CA). IEEE Computer Society Press. ANCEAUME,E. 1993b. A comparison of fault- BIRMAN, K. P. AND JOSEPH,T.A. 1987. Reliable com- tolerant atomic broadcast protocols. In Proceed- munication in the presence of failures. ACM ings of 4th IEEE Computer Society Workshop on Trans. Comput. Syst. 5,1(Feb.), 47–76. Future Trends in Distributed Computing Sys- BIRMAN, K. P., SCHIPER, A., AND STEPHENSON,P. 1991. tems (FTDCS-4) (Lisbon, Portugal). IEEE Com- Lightweight causal and atomic group multicast. puter Society Press. 166–172. ACMTrans. Comput. Syst. 9,3(Aug.), 272– ANCEAUME,E. 1997. A lightweight solution to uni- 314. form atomic broadcast for asynchronous sys- CARR,R. 1985. The Tandem global update proto- tems. In Proceedings of 27th International Sym- col. Tandem Syst. Rev. 1,2(June), 74–85. posium on Fault-Tolerant Computing (FTCS-27) (Seattle, WA). IEEE Computer Society Press. CHANDRA,T.D., HADZILACOS, V., AND TOUEG,S. 1996. 292–301. The weakest failure detector for solving consen- sus. J. ACM 43,4(July), 685–722. ANCEAUME, E. AND MINET,P. 1992. Etude´ de pro- tocoles de diffusion atomique. TR 1774, INRIA CHANDRA,T.D.AND TOUEG,S. 1996. Unreliable fail- (Oct.) Rocquencourt, France. ure detectors for reliable distributed systems. J. ACM 43,2, 225–267. ARMSTRONG, S., FREIER, A., AND MARZULLO,K. 1992. Multicast transport protocol. RFC 1301, IETF CHANG,J.-M. AND MAXEMCHUK,N.F. 1984. Reli- (Feb.) able broadcast protocols. ACMTrans. Comput. Syst. 2,3(Aug.), 251–273. ATTIYA, H. AND WELCH,J.L. 1994. Sequential con- sistency versus linearizability. ACMTrans. Com- CHARRON-BOST, B., PEDONE, F., AND DEFAGO´ ,X. 1999. put. Syst. 12,2(May), 91–122. Private communications. (Showed an example il- lustrating the fact that even the combination of BAR-JOSEPH, Z., KEIDAR, I., ANKER, T., AND LYNCH,N. strong agreement, strong total order, and strong 2000. QoS preserving totally ordered multi- integrity does not prevent a faulty process from cast. In Proceedings of 4th International Con- reaching an inconsistent state.) ference on Principles of Distributed Systems (OPODIS). Studia Informatica, Paris, France, CHARRON-BOST, B., TOUEG, S., AND BASU,A. 2000. 143–162. Revisiting safety and liveness in the context of failures. In Concurrency Theory, 11th Interna- BAR-JOSEPH, Z., KEIDAR, I., AND LYNCH,N. 2002. tional Conference (CONCUR 2000) (University Early-delivery dynamic atomic broadcast. In Park, PA). Lecture Notes in Computer Science, Proceedings of 16th International Symposium vol. 1877. Springer-Verlag. 552–565. on Distributed Computing (DISC’02) (Toulouse, France). D. Malkhi, Ed. Lecture Notes in Com- CHEN, X., MOSER, L. E., AND MELLIAR-SMITH, P. M. puter Science, vol. 2508. Springer-Verlag. 1–16. 1996. Reservation-based totally ordered mul- ticasting. In Proceedings of 16th IEEE Interna- BEN-OR,M. 1983. Another advantage of free tional Conference on Distributed Computing Sys- choice: Completely asynchronous agree- tems (ICDCS-16) (Hong Kong). IEEE Computer ment protocols. In Proceedings of 2nd ACM Society Press. 511–519. SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC-2) (Montreal CHERITON, D. R. AND SKEEN,D. 1993. Understand- Quebec, Canada). ACM Press. 27–30. ing the limitations of causally and totally or- dered communication. In Proceedings of 14th BERMAN,P.AND BHARALI,A.A. 1993. Quick atomic broadcast (extended abstract). In Pro- ACM Symposium on Operating Systems Prin- ceedings of 7th International Workshop on ciples (SoSP-14) (Ashville, NC). ACM Press. Distributed Algorithms (WDAG’93) (Lausanne, 44–57. Switzerland). A. Schiper Ed. Lecture Notes in CHIU,G.-M. AND HSIAO,C.-M. 1998. A note on to- Computer Science, vol. 725. Springer-Verlag. tal ordering multicast using propagation trees. 189–203. IEEE Trans. Parall. Distrib. Syst. 9, 2 (Feb.), 217–223. BERNSTEIN,P.A., HADZILACOS, V., AND GOODMAN,N. 1987. Concurrency Control and Recovery in CHOCKLER, G., KEIDAR, I., AND VITENBERG,R. 2001. Database Systems. Addison-Wesley, Boston, MA. Group communication specifications: A compre- http://research.microsoft.com/pubs/ccontrol/. hensive study. ACM Comput. Surv. 33,4(Dec.), 427–469. BHATTI,N.T., HILTUNEN, M. A., SCHLICHTING, R. D., AND CHIU,W. 1998. Coyote: A system for con- CHOCKLER,G.V., HULEIHEL, N., AND DOLEV,D. structing fine-grain configurable communication 1998. An adaptive total ordered multicast services. ACMTrans. Comput. Syst. 16,4(Nov.), protocol that tolerates partitions. In Proceed- 321–366. ings of 17th ACM Symposium on Principles

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 417

of Distributed Computing (PODC-17) (Puerto tiple groups. In Proceedings of 4th International Vallarta, Mexico). ACM Press. 237–246. Conference on Principles of Distributed Systems CHOCKLER,G.V., MALKHI, D., AND REITER,M.K. 2001. (OPODIS). Studia Informatica, Paris, France, Backoff protocols for distributed mutual exclu- 143–162. sion and ordering. In Proceedings of 21st IEEE DOLEV, D., DWORK, C., AND STOCKMEYER,L. 1987. On International Conference on Distributed Com- the minimal synchrony needed for distributed puting Systems (ICDCS-21) (Phoenix, AZ). IEEE consensus. J. ACM 34,1(Jan.), 77–97. Computer Society Press. 11–20. DOLEV, D., KRAMER, S., AND MALKI,D. 1993. Early CHOR,B.AND DWORK,C. 1989. Randomization in delivery totally ordered multicast in asyn- Byzantine Agreement. Adv. Comput. Res. 5, 443– chronous environments. In Proceedings of 23rd 497. International Symposium on Fault-Tolerant CORDOVA´ ,J.AND LEE,Y.-H. 1996. Multicast trees Computing (FTCS-23) (Toulouse, France). IEEE to provide message ordering in mesh networks. Computer Society Press. 544–553. Comput. Syst. Sci. Eng. 11,1(Jan.), 3–13. DOLEV,D.AND MALKHI,D. 1994. The design of the CRISTIAN,F. 1990. Synchronous atomic broadcast Transis system. In Theory and Practice in Dis- for redundant broadcast channels. Real-Time tributed Systems,K. Birman, F. Mattern, and Syst. 2,3(Sept.), 195–212. A. Schiper, Eds. Lecture Notes in Computer Sci- CRISTIAN,F. 1991. Asynchronous atomic broad- ence, vol. 938. Springer-Verlag, 83–98. cast. IBM Technical Disclosure Bulletin 33, 9, DOLEV,D.AND MALKHI,D. 1996. The Transis ap- 115–116. proach to high availability cluster communnica- CRISTIAN, F., AGHILI, H., STRONG, R., AND DOLEV,D. tion. Comm. ACM 39,4(April), 64–70. 1995. Atomic broadcast: From simple message DWORK, C., LYNCH, N. A., AND STOCKMEYER,L. 1988. diffusion to Byzantine agreement. Inf. Com- Consensus in the presence of partial synchrony. put. 18,1, 158–179. J. ACM 35,2(April), 288–323. CRISTIAN, F., DE BEIJER, R., AND MISHRA,S. 1994. EKWALL, R., SCHIPER, A., AND URBAN´ ,P. 2004. Token- A performance comparison of asynchronous based atomic broadcast using unreliable fail- atomic broadcast protocols. Distrib. Syst. Eng. ure detectors. In Proceedings of 23nd IEEE In- J. 1,4(June), 177–201. ternational Symposium on Reliable Distributed CRISTIAN,F.AND FETZER,C. 1999. The timed asyn- Systems (SRDS’04) (Florianopolis,´ Brazil). IEEE chronous distributed system model. IEEE Trans. Computer Society Press. 52–65. Parall. Distrib. Syst. 10,6(June), 642–657. EZHILCHELVAN, P. D., MACEDOˆ , R. A., AND SHRIVASTAVA, CRISTIAN,F.AND MISHRA,S. 1995. The pinwheel S. K. 1995. Newtop: A fault-tolerant group asynchronous atomic broadcast protocols. In communication protocol. In Proceedings of 15th Proceedings of 2nd International Symposium IEEE International Conference on Distributed on Autonomous Decentralized Systems (Phoenix, Computing Systems (ICDCS-15) (Vancouver, AZ). IEEE Computer Society Press. 215– Canada). IEEE Computer Society Press. 296– 221. 306. CRISTIAN, F., MISHRA, S., AND ALVA REZ,G. 1997. FEKETE,A. 1993. Formal models of communica- High-performance asynchronous atomic broad- tion services: A case study. IEEE Comput. 26,8 cast. Distrib. Syst. Eng. J. 4,2(June), 109–128. (Aug.), 37–47. DASSER,M. 1992. TOMP: A total ordering multi- FEKETE, A., LYNCH, N., AND SHVARTSMAN,A. 2001. cast protocol. ACM Operat. Syst. Rev. 26,1(Jan.), Specifying and using a partitionable group 32–40. communication service. ACMTrans. Comput. DEFAGO´ ,X. 2000. Agreement-related problems: Syst. 19,2(May), 171–216. From semi-passive replication to totally ordered FELBER,P.AND PEDONE,F. 2002. Probabilistic broadcast. Ph.D. thesis, Ecole´ Polytechnique atomic broadcast. In Proceedings of 21st IEEE Fed´ erale´ de Lausanne, Lausanne, Switzerland. International Symposium on Reliable Dis- Number 2229. tributed Systems (SRDS’02) (Osaka, Japan). IEEE Computer Society Press. 170–179. DEFAGO´ , X., SCHIPER, A., AND URBAN´ ,P. 2003. Comparative performance analysis of order- FELBER,P.AND SCHIPER,A. 2001. Optimistic active ing strategies in atomic broadcast algorithms. replication. In Proceedings of 21st IEEE Inter- IEICE Trans. Inf. Syst. E86–D,12 (Dec.), 2698– national Conference on Distributed Computing 2709. Systems (ICDCS-21) (Phoenix, AZ). IEEE Com- DELPORTE-GALLET,C.AND FAUCONNIER,H. 1999. puter Society Press. 333–341. Real-time fault-tolerant atomic broadcast. In FETZER,C. 2003. Perfect failure detection in timed Proceedings of 18th IEEE International Sympo- asynchronous systems. IEEE Trans. Com- sium on Reliable Distributed Systems (SRDS’99) put. 52,2(Feb.), 99–112. (Lausanne, Switzerland). IEEE Computer Soci- FISCHER, M. J., LYNCH,N.A., AND PATERSON, M. S. ety Press. 48–55. 1985. Impossibility of distributed consensus DELPORTE-GALLET,C.AND FAUCONNIER,H. 2000. with one faulty process. J. ACM 32,2(April), Fault-tolerant genuine Atomic Broadcast to mul- 374–382.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 418 X. D´efago et al.

FRIEDMAN, R. AND VAN RENESSE,R. 1997. Packing JALOTE,P. 1998. Efficient ordered broadcasting messages as a tool for boosting the performance in reliable CSMA/CD networks. In Proceed- of total ordering protocols. In Proceedings of ings of 18th IEEE International Conference 6th IEEE Symposium on High Performance Dis- on Distributed Computing Systems (ICDCS-18) tributed Computing (Portland, OR). IEEE Com- (Amsterdam, The Netherlands). IEEE Com- puter Society Press. 233–242. puter Society Press. 112–119. FRITZKE, U., INGELS, P., MOSTEFAOUI´ , A., AND RAYNAL, JIA,X. 1995. A total ordering multicast protocol M. 2001. Consensus-based fault-tolerant to- using propagation trees. IEEE Trans. Parall. tal order multicast. IEEE Trans. Parall. Distrib. Distrib. Syst. 6,6(June), 617–627. Syst. 12,2(Feb.), 147–156. KAASHOEK, M. F. AND TANENBAUM,A.S. 1996. An GARCIA-MOLINA, H. AND SPAUSTER,A. 1989. Mes- evaluation of the Amoeba group communication sage ordering in a multicast environment. In system. In Proceedings of 16th IEEE Interna- Proceedings of 9th IEEE International Con- tional Conference on Distributed Computing Sys- ference on Distributed Computing Systems tems (ICDCS-16) (Hong Kong). IEEE Computer (ICDCS-9) (Newport Beach, CA). IEEE Com- Society Press. 436–447. puter Society Press. 354–361. KEIDAR, I. AND DOLEV,D. 2000. Totally ordered GARCIA-MOLINA, H. AND SPAUSTER,A. 1991. Ordered broadcast in the face of network partitions. In and reliable multicast communication. ACM Dependable Network Computing,D.R.Avresky, Trans. Comput. Syst. 9,3(Aug.), 242–271. Ed. Kluwer Academic Pub., Chapter 3, 51– GOPAL, A. AND TOUEG,S. 1989. Reliable broadcast 75. in synchronous and asynchronous environment. KEMME,B.AND ALONSO,G. 2000. Don’t be lazy, be In Proceedings of 3rd International Workshop consistent: Postgres-r, a new way to implement on Distributed Algorithms (WDAG’89) (Nice, database replication. In Proceedings of 26th In- France). Lecture Notes in Computer Science, vol. ternational Conference on Very Large Data Bases 392. Springer-Verlag. 111–123. (VLDB 2000) (Cairo, Egypt). Morgan Kaufmann. GOPAL, A. AND TOUEG,S. 1991. Inconsistency and 134–143. contamination. In Proceedings of 10th ACM KEMME, B., PEDONE, F., ALONSO, G., AND SCHIPER, Symposium on Principles of Distributed Com- A. 1999. Processing transactions over opti- puting (PODC-10). ACM Press. 257–272. mistic atomic broadcast protocols. In Proceed- GUERRAOUI,R. 1995. Revisiting the relationship ings of 19th IEEE International Conference between non-blocking atomic commitment and on Distributed Computing Systems (ICDCS-19) consensus. In Proceedings of 9th International (Austin, TX). IEEE Computer Society Press. Workshop on Distributed Algorithms (WDAG’95) 424–431. (Le Mont-St-Michel, France). Lecture Notes in KEMME, B., PEDONE, F., ALONSO, G., SCHIPER, A., AND Computer Science, vol. 972. Springer-Verlag. WIESMANN,M. 2003. Using optimistic atomic 87–100. broadcast in transaction processing systems. IEEE Trans. Know. Data Eng. 15,4(July), 1018– GUERRAOUI, R. AND SCHIPER,A. 1997. Genuine atomic multicast. In Proceedings of 11th Inter- 1032. national Workshop on Distributed Algorithms KIM,J.AND KIM,C. 1997. A total ordering protocol (WDAG’97) (Saarbrucken,¨ Germany). Lecture using a dynamic token-passing scheme. Distrib. Notes in Computer Science, vol. 1320. Springer- Syst. Eng. J. 4,2(June), 87–95. Verlag. 141–154. KOPETZ, H., GRUNSTEIDL¨ , G., AND REISINGER,J. 1991. GUERRAOUI, R. AND SCHIPER,A. 2001. Genuine Fault-tolerant membership service in a syn- atomic multicast in asynchronous distributed chronous distributed real-time system. In Pro- systems. Theor. Comput. Sci. 254, 297– ceedings of 2nd IFIP International Working Conf. 316. on Dependable Computing for Critical Appli- cations (DCCA-1) (Tucson, AZ). A. Avizienisˇ GUY, R. G., POPEK, G. J., AND PAGE JR., T. W. 1993. Consistency algorithms for optimistic replica- and J.-C. Laprie, Eds. Springer-Verlag. 411– tion. In Proceedings of 1st IEEE International 429. Conference on Network Protocols (ICNP’93) LAMPORT,L. 1978a. The implementation of reli- (San Francisco, CA). IEEE Computer Society able distributed multiprocess systems. Comput. Press. Netw. 2, 95–114. HADZILACOS,V.AND TOUEG,S. 1994. A modular ap- LAMPORT,L. 1978b. Time, clocks, and the order- proach to fault-tolerant broadcasts and related ing of events in a distributed system. Comm. problems. TR 94-1425, Dept. of Computer Sci- ACM 21,7(July), 558–565. ence, Cornell University, Ithaca, NY (May.) LAMPORT,L. 1984. Using time instead of time-outs HILTUNEN, M. A., SCHLICHTING, R. D., HAN, X., CARDOZO, in fault-tolerant systems. ACMTrans. Program. M. M., AND DAS,R. 1999. Real-time depend- Lang. Syst. 6,2, 256–280. able channels: Customizing QoS attributes for LAMPORT,L. 1986a. The mutual exclusion prob- distributed systems. IEEE Trans. Parall. Dis- lem: Part I—a theory of interprocess communi- trib. Syst. 10,6(June), 600–612. cation. J. ACM 33,2(April), 313–326.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 419

LAMPORT,L. 1986b. On interprocess communica- MOSER, L. E., MELLIAR-SMITH,P.M., AGRAWAL,D.A., tion. part I: Basic formalism. Distrib. Com- BUDHIA, R. K., AND LINGLEY-PAPADOPOULOS, C. A. put. 1,2, 77–85. 1996. Totem: A fault-tolerant multicast group LAMPORT, L., SHOSTAK, R., AND PEASE,M. 1982. The communication system. Comm. ACM 39,4 Byzantine generals problem. ACMTrans. Pro- (April), 54–63. gram. Lang. Syst. 4,3, 382–401. MOSER, L. E., MELLIAR-SMITH, P. M., AND AGRAWALA,V. LE LANN,G.AND BRES,G. 1991. Reliable atomic 1993. Asynchronous fault-tolerant total order- broadcast in distributed systems with omission ing algorithms. SIAM J. Comput. 22,4(Aug.), faults. ACM Operat. Syst. Rev. SIGOPS 25,2 727–750. (April), 80–86. MOSTEFAOUI´ , A. AND RAYNAL,M. 2000. Low cost LIU, X., VAN RENESSE, R., BICKFORD, M., KREITZ, C., AND consensus-based atomic broadcast. In Proceed- CONSTABLE,R. 2001. Protocol switching: Ex- ings of 7th IEEE Pacific Rim Symposium on De- ploiting meta-properties. In Proceedings of 21st pendable Computing (PRDC’00) (Los Angeles, IEEE International Conference on Distributed CA). IEEE Computer Society Press. 45–52. Computing Systems Workshops (ICDCSW’01), NAVARATNAM, S., CHANSON, S. T., AND NEUFELD,G.W. Intl. Workshop on Applied Reliable Group Com- 1988. Reliable group communication in dis- munication (WARGC 2001) (Mesa, AZ). IEEE tributed systems. In Proceedings of 8th IEEE In- Computer Society Press. 37–42. ternational Conference on Distributed Comput- ing Systems (ICDCS-8) (San Jose, CA). IEEE LUAN,S.-W. AND GLIGOR,V.D. 1990. A fault- tolerant protocol for atomic broadcast. IEEE Computer Society Press. 439–446. Trans. Parall. Distrib. Syst. 1,3(July), 271–285. NEIGER,G.AND TOUEG,S. 1990. Automatically in- creasing the fault-tolerance of distributed algo- LYNCH,N.A. 1996. Distributed Algorithms. Mor- gan Kaufmann, San Francisco, CA. rithms. J. Algorithms 11,3, 374–419. NESTMANN, U., FUZZATI, R., AND MERRO,M. 2003. LYNCH, N. A. AND TUTTLE,M.R. 1989. An intro- Modeling consensus in a process calculus. In duction to input/output automata. CWI Quar- (CONCUR 2003) Concurrency Theory, 14th In- terly 2,3(Sept.), 219–246. ternational Conference (Marseille, France). R. M. MALHIS, L. M., SANDERS, W. H., AND SCHLICHTING, R. D. Amadio and D. Lugiez, Eds. Lecture Notes in 1996. Numerical performability evaluation of Computer Science, vol. 2761. Springer-Verlag. a group multicast protocol. Distrib. Syst. Eng. 393–407. J. 3,1(March), 39–52. NG,T.P. 1991. Ordered broadcasts for large ap- MALLOTH,C.P. 1996. Conception and implementa- plications. In Proceedings of 10th IEEE Interna- tion of a toolkit for building fault-tolerant dis- tional Symposium on Reliable Distributed Sys- tributed applications in large scale networks. tems (SRDS’91) (Pisa, Italy). IEEE Computer Ph.D. thesis, Ecole´ Polytechnique Fed´ erale´ de Society Press. 188–197. Lausanne, Switzerland. Number 1557. PEDONE,F. 2001. Boosting system performance MALLOTH, C. P., FELBER, P., SCHIPER, A., AND WILHELM, with optimistic distributed protocols. IEEE U. 1995. Phoenix: A toolkit for building fault- Comput. 34,12 (Dec.), 80–86. tolerant distributed applications in large scale. PEDONE, F., GUERRAOUI, R., AND SCHIPER,A. 1998. In Workshop on Parallel and Distributed Plat- Exploiting atomic broadcast in replicated forms in Industrial Products; 7th IEEE Sym- databases. In Proceedings of EuroPar (Eu- posium on Parallel and Distributed Processing. roPar’98) (Southampton, UK). Lecture Notes in San Antonio, TX. Computer Science, vol. 1470. Springer-Verlag. MAYER,E. 1992. An evaluation framework for 513–520. multicast ordering protocols. In Proceedings PEDONE,F.AND SCHIPER,A. 1998. Optimistic atomic of ACM International Conference on Appli- broadcast. In Proceedings of 12th Interna- cations, Technologies, Architecture, and Proto- tional Symposium on Distributed Computing cols (SIGCOMM) (Baltimore, Maryland). ACM (DISC’98) (Andros, Greece). Lecture Notes in Press. 177–187. Computer Science, vol. 1499. Springer-Verlag. MINET,P.AND ANCEAUME,E. 1991a. ABP: An atomic 318–332. broadcast protocol. TR 1473, INRIA (June). Roc- PEDONE,F.AND SCHIPER,A. 1999. Generic broad- quencourt, France. cast. In Proceedings of 13th International Sym- MINET,P.AND ANCEAUME,E. 1991b. Atomic broad- posium on Distributed Computing (DISC’99) cast in one phase. ACM Operat. Syst. Rev. 25,2 (Bratislava, Slovak Republic). Lecture Notes in (April), 87–90. Computer Science, vol. 1693. Springer-Verlag. MISHRA, S., PETERSON, L. L., AND SCHLICHTING, R. D. 94–108. 1993. Consul: a communication substrate for PEDONE,F.AND SCHIPER,A. 2002. Handling mes- fault-tolerant distributed programs. Distrib. sage semantics with generic broadcast protocols. Syst. Eng. J. 1,1, 87–103. Distrib. Comput. 15,2, 97–107. MOSER, L. E. AND MELLIAR-SMITH,P.M. 1999. PEDONE,F.AND SCHIPER,A. 2003. Optimistic atomic Byzantine-resistant total ordering algorithms. broadcast: A pragmatic viewpoint. Theor. Com- Inf. Comput. 150,1(April), 75–111. put. Sci. 291, 79–101.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. 420 X. D´efago et al.

PEDONE, F., SCHIPER, A., URBAN´ , P., AND CAVIN,D. Symposium on Reliable Distributed Systems 2002. Solving agreement problems with weak (SRDS’93) (Princeton, NJ). IEEE Computer So- ordering oracles. In Proceedings of 4th European ciety Press. 115–124. Dependable Computing Conference (EDCC-4) SCHIPER, A. AND RAYNAL,M. 1996. From group com- (Toulouse, France). Lecture Notes in Com- munication to transactions in distributed sys- puter Science, vol. 2485. Springer-Verlag. 44– tems. Comm. ACM 39,4(April), 84–87. 61. SCHNEIDER,F.B. 1990. Implementing fault- PETERSON, L. L., BUCHHOLZ, N. C., AND SCHLICHTING, tolerant services using the state machine R. D. 1989. Preserving and using context in- approach: a tutorial. ACM Comput. Surv. 22,4 formation in interprocess communication. ACM (Dec.), 299–319. Trans. Comput. Syst. 7,3, 217–246. SHIEH,S.-P. AND HO,F.-S. 1997. A comment on POLEDNA,S. 1994. Replica determinism in dis- “a total ordering multicast protocol using prop- tributed real-time systems: A brief survey. Real- agation trees”. IEEE Trans. Parall. Distrib. Time Syst. 6,3(May), 289–316. Syst. 8,10 (Oct.), 1084. RABIN,M. 1983. Randomized Byzantine generals. SHRIVASTAVA,S.K. 1994. To CATOCS or not to In Proceedings of 24th Annual ACM Symposium CATOCS, that is the... ACM Operat. Syst. on Foundations of Computer Science (FOCS) Rev. 28,4(Oct.), 11–14. (Tucson, AZ). ACM Press. 403–409. SOUSA, A., PEREIRA, J., MOURA, F., AND OLIVEIRA, R. RAJAGOPALAN,B.AND MCKINLEY,P. 1989. A token- 2002. Optimistic total order in wide area net- based protocol for reliable, ordered multicast works. In Proceedings of 21st IEEE Interna- communication. In Proceedings of 8th IEEE In- tional Symposium on Reliable Distributed Sys- ternational Symposium on Reliable Distributed tems (SRDS’02) (Osaka, Japan). IEEE Computer Systems (SRDS’89) (Seattle, WA). IEEE Com- Society Press. 190–199. puter Society Press. 84–93. TOINARD, C., FLORIN, G., AND CARREZ,C. 1999. A for- REITER,M.K. 1994. Secure agreement protocols: mal method to prove ordering properties of mul- Reliable and atomic group multicast in Rampart. ticast algorithms. ACM Operat. Syst. Rev. 33,4 In Proceedings of 2nd ACM Conf. on Computer (Oct.), 75–89. and Communications Security (CCS-2) (Fairfax, URBAN´ , P., DEFAGO´ , X., AND SCHIPER,A. 2000. VA). ACM Press. 68–80. Contention-aware metrics for distributed algo- REITER,M.K. 1996. Distributing trust with the rithms: Comparison of atomic broadcast algo- Rampart toolkit. Comm. ACM 39,4(April), 71– rithms. In Proceedings of 9th IEEE International 74. Conference on Computer Communications and RODRIGUES, L., FONSECA, H., AND VER´ıSSIMO,P. 1996. Networks (IC3N’00) (Las Vegas, NV). IEEE Com- Totally ordered multicast in large-scale sys- puter Society Press. 582–589. tems. In Proceedings of 16th IEEE International URBAN´ , P., SHNAYDERMAN, I., AND SCHIPER,A. 2003. Conference on Distributed Computing Systems Comparison of failure detectors and group mem- (ICDCS-16) (Hong Kong). IEEE Computer Soci- bership: Performance study of two atomic broad- ety Press. 503–510. cast algorithms. In Proceedings of IEEE Interna- RODRIGUES, L., GUERRAOUI, R., AND SCHIPER,A. 1998. tional Conference on Dependable Systems and Scalable atomic multicast. In Proceedings of 7th Networks (DSN’03) (San Francisco, CA). IEEE IEEE International Conference on Computer Computer Society Press. 645–654. Communications and Networks (Lafayette, VER´ıSSIMO, P., RODRIGUES, L., AND BAPTISTA,M. 1989. LA). IEEE Computer Society Press. 840– AMp: A highly parallel atomic multicast pro- 847. tocol. Comput. Comm. Rev. 19,4(Sept.), 83– RODRIGUES, L. T. AND RAYNAL,M. 2000. Atomic 93. broadcast in asynchronous crash-recovery VICENTE,P.AND RODRIGUES,L. 2002. An indulgent distributed systems. In Proceedings of 20th uniform total order algorithm with optimistic IEEE International Conference on Distributed delivery. In Proceedings of 21st IEEE Interna- Computing Systems (ICDCS-20) (Taipei, tional Symposium on Reliable Distributed Sys- Taiwan). IEEE Computer Society Press. 288– tems (SRDS’02) (Osaka, Japan). IEEE Computer 295. Society Press. 92–101. RODRIGUES, L. T. AND VER´ıSSIMO,P. 1992. xAMP: WHETTEN, B., MONTGOMERY, T., AND KAPLAN,S. 1994. a multi-primitive group communications ser- A high performance totally ordered multicast vice. In Proceedings of 11th IEEE International protocol. In Theory and Practice in Distributed Symposium on Reliable Distributed Systems Systems (Dagstuhl Castle, Germany). K. P. (SRDS’92) (Houston, TX). IEEE Computer So- Birman, F. Mattern, and A. Schiper, Eds. Lec- ciety Press. 112–121. ture Notes in Computer Science, vol. 938. RODRIGUES, L. T., VER´ıSSIMO, P., AND CASIMIRO, A. Springer-Verlag. 33–57. 1993. Using atomic broadcast to implement WIESMANN, M., PEDONE, F., SCHIPER, A., KEMME, B., a posteriori agreement for clock synchroniza- AND ALONSO,G. 2000. Understanding repli- tion. In Proceedings of 12th IEEE International cation in databases and distributed systems.

ACM Computing Surveys, Vol. 36, No. 4, December 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 421

In Proceedings of 20th IEEE International Distributed Systems (SRDS’95) (Bad Neuenahr, Conference on Distributed Computing Systems Germany). IEEE Computer Society Press. 106– (ICDCS-20) (Taipei, Taiwan). IEEE Computer 115. Society Press. 264–274. ZHOU,P.AND HOOMAN,J. 1995. Formal specifica- WILHELM,U.AND SCHIPER,A. 1995. A hierarchy tion and compositional verification of an atomic of totally ordered multicasts. In Proceedings of broadcast protocol. Real-Time Syst. 9,2(Sept.), 14th IEEE International Symposium on Reliable 119–145.

Received September 2000; revised August 2003; accepted July 2004

ACM Computing Surveys, Vol. 36, No. 4, December 2004.