IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 4, APRIL 2004

The Information Structure of Indulgent Consensus

Rachid Guerraoui, Member, IEEE, and Michel Raynal

Abstract—To solve consensus, distributed systems have to be equipped with oracles such as a failure detector, a leader capability, or a random number generator. For each oracle, various consensus algorithms have been devised. Some of these algorithms are indulgent toward their oracle in the sense that they never violate consensus safety, no matter how the underlying oracle behaves. This paper presents a simple and generic indulgent consensus algorithm that can be instantiated with any specific oracle and be as efficient as any ad hoc consensus algorithm initially devised with that oracle in mind. The key to combining genericity and efficiency is to factor out the information structure of indulgent consensus executions within a new distributed abstraction, which we call "Lambda." Interestingly, identifying this information structure also promotes a fine-grained study of the inherent complexity of indulgent consensus. We show that instantiations of our generic algorithm with specific oracles, or combinations of them, match lower bounds on oracle-efficiency, zero-degradation, and one-step-decision. We show, however, that no leader-based or failure detector-based consensus algorithm can be, at the same time, zero-degrading and configuration-efficient. Moreover, we show that leader-based consensus algorithms that are oracle-efficient are inherently zero-degrading, but some failure detector-based consensus algorithms can be both oracle-efficient and configuration-efficient. These results highlight some of the fundamental trade-offs underlying each oracle.

Index Terms—Asynchronous distributed system, consensus, crash failure, fault tolerance, indulgent algorithm, information structure, leader oracle, modularity, random oracle, unreliable failure detector.

1 INTRODUCTION

1.1 Context
UNDERSTANDING the deep structure and the basic design principles of algorithms solving fundamental distributed computing problems is an important and challenging task. This task has been undertaken for basic problems such as distributed mutual exclusion [17], [30] and distributed deadlock detection [6], [20]. Another such basic problem is consensus [2], [13], [23]. This problem consists, for a set of n processes, to propose each an initial value and, eventually, agree on one of the proposed values, even if some of the processes fail by crashing. Consensus is at the heart of reliable distributed computing and it is tempting to seek the fundamental structure of its algorithms, in particular, consensus algorithms that are optimal in terms of resilience and performance.

1.2 Resilience Optimality
Given that it is impossible to solve consensus deterministically in the presence of crash failures in a purely asynchronous system [13], several proposals have been made to augment the system with oracles that circumvent the impossibility. A first approach consists of introducing a random oracle [3] allowing us to design consensus algorithms that provide eventual decision with probability 1. Another approach considers a failure detector oracle [7] encapsulating eventual synchrony assumptions [12]. In particular, failure detector ◊S has received a lot of attention. It provides each process with a list of processes suspected to have crashed, in such a way that every process that crashes is eventually suspected (completeness property) and there is a time after which some correct process is no longer suspected (accuracy property). Another approach consists of equipping the system with a leader oracle [21]. This oracle, denoted Ω in [8], provides the processes with a function leader which eventually always delivers the same correct process identity to all processes.

Oracles ◊S and Ω are, in a precise sense, minimal in that each provides a necessary and sufficient amount of information about failures to solve consensus with a deterministic algorithm. Oracles ◊S and Ω actually have the same computational power [8], [10]. The random oracle solves a nondeterministic variant of consensus [3] and is, strictly speaking, incomparable with the other two. This oracle is, however, in the sense of [9], as strong as consensus and, hence, also somehow minimal. Interestingly, the algorithms that rely on any of those oracles all have the common inherent flavor that consensus safety is never violated, no matter how the oracle behaves: They are indulgent toward their oracle [15]. In other words, the oracles are only necessary for the liveness property of consensus. A price to pay for this indulgence is that the upper bound f on the number of processes that are allowed to crash has to be smaller than n/2 (where n is the total number of processes) and this is needed for each of these oracles [3], [7], [8].

. R. Guerraoui is with LPD-I&C-EPFL, CH 1015 Lausanne, Switzerland. E-mail: [email protected].
. M. Raynal is with IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France. E-mail: [email protected].

Manuscript received 13 Mar. 2003; revised 2 July 2003; accepted 10 July 2003. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 118448.

1.3 Performance Optimality
This paper focuses on the performance of indulgent consensus algorithms in terms of time complexity (i.e.,

latency measured in terms of communication steps). That is, we consider the number of communication steps needed for the processes to reach a decision in certain runs of the execution. We overview here different optimality metrics along this direction and we define them precisely later in the paper.

. Oracle-efficiency. When an oracle behaves perfectly, the consensus decision can typically be expedited. More precisely, for all oracles discussed above, two communication steps are necessary and sufficient to reach consensus in failure-free runs where the oracle behaves perfectly, e.g., [19]. Algorithms that match this lower bound (e.g., [18], [26], [31]) are said to be oracle-efficient in the sense that they are optimized for the good behavior of the oracle.
. Zero-degradation. This property extends oracle-efficiency from failure-free runs to runs with initial crashes [11]. Algorithms that have this property also match the two communication steps lower bound in runs with initial crashes. For instance, the consensus algorithms of [11] need only two communication steps to reach consensus when the oracle behaves perfectly, even if some processes had crashed initially. This is particularly important because consensus is typically used in a repeated form and a process failure during one consensus instance appears as an initial failure in a subsequent consensus instance. In a zero-degrading algorithm, a failure in a given instance does not impact the performance of any future instances.
. One-step-decision. If the processes exploit an initial knowledge on a privileged value or on a specific subset of processes, they can even, sometimes, reach consensus in a single communication step [5], e.g., when all noncrashed processes propose that privileged value. This can, for instance, be very useful if that specific value has a reasonable chance of being proposed more often than others. Algorithms that exploit such a knowledge to expedite a decision are called here one-step-decision algorithms.
. Configuration-efficiency. Finally, when all processes propose the same initial value, no underlying oracle is actually necessary to obtain a decision. In that case, two communication steps are also necessary and sufficient to reach a consensus decision, no matter how the underlying oracle behaves. Algorithms that match this lower bound are said to be configuration-efficient.

1.4 Objective
The objective of this paper is an indulgent consensus algorithm that is generic and efficient. Genericity means here that we could easily instantiate the algorithm with any oracle, whereas efficiency means that the resulting algorithm should be as efficient as any ad hoc consensus algorithm designed for that specific oracle.

The first difficulty underlying this objective lies in factoring out the appropriate information structure that is common to efficient indulgent consensus algorithms, each of which might be using a different oracle and making use of specific algorithmic techniques. In fact, it is not entirely clear whether such a common structure could be precisely defined and whether the same generic algorithm could encompass the specific characteristics of a random oracle, a failure detector, and a leader oracle. The second difficulty has to do with the possible conflicting nature of the lower bounds. We know of no algorithm that matches all lower bounds recalled above and it is not clear whether such an algorithm can indeed be devised.

1.5 Related Work
In [18], several consensus algorithms were unified within the same framework, all, however, relying on ◊S-like failure detectors. A similar unification was proposed in [4] for Ω-based consensus algorithms. In [1], [28], consensus algorithms that make use of several oracles at the same time were presented. In these hybrid algorithms, however, efficiency was not the issue and the oracles are used in a hardwired manner, e.g., they cannot be interchangeable. A first attempt to build a common consensus framework, unifying a leader oracle, a random oracle, and a failure detector oracle, was proposed in [25]. Unfortunately, algorithms derived by instantiating that framework with a given oracle are clearly not as efficient as ad hoc algorithms devised directly with that oracle. Efficient indulgent consensus algorithms were presented in [11]. However, for each oracle, a specific consensus algorithm is given.

1.6 Contribution
This paper factors out the information structure of efficient indulgent consensus algorithms within a new distributed abstraction, which we call Lambda. This abstraction encapsulates the use of any oracle (random, leader, or failure detector) during every individual round of indulgent consensus executions. Using this abstraction, we construct a generic indulgent consensus algorithm that can be instantiated with any oracle, while assuming the highest number of possible failures, i.e., f < n/2.

0018-9340/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society.

2.1 Processes
The computation model we consider is basically the asynchronous system model of [2], [7], [13], [23]. The system consists of a finite set of n > 1 processes: Π = {p1, ..., pn}. A process can fail by crashing, i.e., by prematurely halting. Until it possibly crashes, the process behaves according to its specification. If it crashes, the process never executes any other action. By definition, a faulty process is a process that crashes and a correct process is a process that never crashes. Let f denote the maximum number of processes that may crash. We assume f < n/2.

In the Consensus problem, every process pi is supposed to propose a value vi and the processes have to decide on the same value v that has to be one of the proposed values. More precisely, the problem is defined by two safety properties (validity and uniform agreement) and a liveness property (termination) [7], [13]:

. Validity: If a process decides v, then v was proposed by some process.
. Uniform Agreement: No two processes decide differently.
. Termination: Every correct process eventually decides.
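The consensus properties above can be phrased as executable predicates over a run's proposals and decisions. The following is a minimal sketch, assuming a run is recorded as two dictionaries mapping process names to values (these data structures and function names are ours, not part of the paper's model):

```python
def validity(proposals, decisions):
    """Validity: every decided value was proposed by some process."""
    return all(d in proposals.values() for d in decisions.values())

def uniform_agreement(decisions):
    """Uniform Agreement: no two processes decide differently."""
    return len(set(decisions.values())) <= 1

# A run where p1, p2, p3 propose and only p1 and p3 have decided so far.
proposals = {"p1": 0, "p2": 1, "p3": 0}
decisions = {"p1": 0, "p3": 0}
print(validity(proposals, decisions), uniform_agreement(decisions))  # True True
```

Termination, being a liveness property, cannot be checked on a finite prefix of a run in this way; only the two safety properties lend themselves to such run-by-run checking.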

For presentation simplicity, we consider only consensus in its binary form: 0 and 1 are the only values that can be proposed.

3.2 Leader Oracle
A leader oracle is a distributed entity that provides the processes with a function leader() that returns a process name each time it is invoked. A unique correct leader is eventually elected, but there is no knowledge of when the leader is elected. Several leaders can coexist during an arbitrarily long period of time, and there is no way for the processes to learn when this "anarchy" period is over. The leader oracle, denoted Ω, satisfies the following property:2

. Eventual Leadership: There is a time t and a correct process p such that, after t, every invocation of leader() by any correct process returns p.

We say that the oracle is perfect if, from the very beginning, it provides the processes with the same correct leader.

2. This property refers to a notion of global time. This notion is not accessible to the processes.

Ω-based consensus algorithms are described in [4], [11], [21], [27]. These algorithms are (or can be easily made) oracle-efficient: They can reach consensus in two communication steps when no process crashes and the oracle behaves perfectly. The Ω-algorithm of [11] is not only oracle-efficient, but also zero-degrading: It reaches consensus in two communication steps when the oracle is perfect and no process crashes during the consensus execution, even if some processes had initially crashed. The notion of zero-degradation means here that a failure in one consensus instance does not impact the performance of future consensus instances (where the failure appears as an initial failure).

3.3 Failure Detector Oracle
A failure detector ◊S is defined as follows [7]: Each process pi is provided with a set denoted suspectedi. If pj ∈ suspectedi, we say that "pi suspects pj." Failure detector ◊S satisfies the following properties:

. Strong Completeness: Eventually, every process that crashes is permanently suspected by every correct process.
. Eventual Weak Accuracy: There is a time after which some correct process is never suspected by the correct processes.

A ◊S oracle is perfect if it suspects exactly the processes that have crashed and, hence, behaves as a perfect failure detector. That is, in addition to strong completeness, the oracle never suspects any noncrashed process. Several ◊S-based consensus algorithms have been proposed in [7], [11], [18], [26], [31]. The algorithms of [18], [26], [31] reach consensus in two communication steps when the oracle is perfect and no process crashes: They are oracle-efficient. The algorithm of [11] is also zero-degrading.

3.4 Random Oracle
A random oracle provides each process pi with a function random that outputs a binary value randomly chosen. Basically, random() outputs 0 (resp. 1) with probability 1/2. A binary consensus algorithm based on such a random oracle is described in [3]. In the case where the processes which have not initially crashed have the same initial value, this algorithm reaches consensus in two communication steps. (Note that, in this case, the algorithm does not actually use the underlying random oracle. This situation does actually correspond to the notion of configuration-efficiency investigated in Section 6.)

4 A GENERIC CONSENSUS ALGORITHM
Our generic consensus algorithm is described in Fig. 1. As announced in the Introduction, its combination of genericity and efficiency lies in the use of an appropriate information structure, called Lambda. This abstraction exports a single function, itself denoted lambda(), and this function encapsulates the use of any underlying oracle. The algorithm borrows its skeleton from [26].3

3. More precisely, our algorithm has the same structure as the ◊S-based algorithm of [26], but differs in its first phase. This phase relies on the Lambda abstraction instead of the particular ◊S failure detector.

4.1 Two Phases per Round
A process pi starts a consensus execution by invoking function consensus(vi), where vi is the value proposed by pi for consensus. Function consensus() is made up of two concurrent tasks: T1 (the main task) and T2 (the decision dissemination task). The execution of statement return(v) by any process pi terminates the consensus execution (as far as pi is concerned) and returns the decided value v to pi.

In their main task (i.e., T1), the processes proceed through consecutive asynchronous rounds. Each round is made up of two phases. During the first phase of a round, the processes strive to select the same value, called an estimate value. Then, the processes try to decide during the second phase. This occurs when they get the same value at the end of the selection phase.

Each process pi manages three local variables: ri (current round number), est1i (estimate of the decision value at the beginning of the first phase), and est2i (estimate of the decision value at the beginning of the second phase). The specific value ⊥ denotes a default value (which cannot be proposed by the processes).

4.1.1 First Phase (Selection)
The aim of this phase (line 104) is to provide all processes with the same estimate (est2i). When this occurs, a decision will be obtained during the second phase. To attain this goal, this phase is made up of a call to the function lambda(). For any process pi, the function has two input parameters: a round number r (an integer) and est1i, representing either a possible consensus value (i.e., a binary value in our case) or the specific value ⊥. The function returns as an output parameter est2i, representing either a possible consensus value or the specific value ⊥. A fundamental property ensured by this function is the
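The three oracles of Section 3 share a simple query interface. The following sketch (class names and fields are ours) models each of them in Python, with a "perfect" behavior for the leader oracle; an asynchronous system would drive the failure detector's suspicions from timeouts:

```python
import random

class LeaderOracle:
    """Omega: leader() eventually always returns the same correct process.
    This toy version is 'perfect': it returns a fixed process from the start."""
    def __init__(self, leader_id):
        self._leader = leader_id
    def leader(self):
        return self._leader

class FailureDetector:
    """Diamond-S: each process reads a set of suspected processes.
    The environment (e.g., timeouts) adds suspicions; they may be wrong."""
    def __init__(self):
        self.suspected = set()
    def suspect(self, p):
        self.suspected.add(p)

class RandomOracle:
    """Binary random oracle: random() outputs 0 or 1, each with probability 1/2."""
    def random(self):
        return random.randrange(2)

omega = LeaderOracle("p2")
print(omega.leader())                     # p2, on every invocation
fd = FailureDetector(); fd.suspect("p3")
print("p3" in fd.suspected)               # True
print(RandomOracle().random() in (0, 1))  # True
```

Note that indulgence is visible in these interfaces: nothing constrains the answers a module receives from them at any given time, so a consensus algorithm built on top must stay safe under arbitrary answers.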

following: For any two processes pi and pj that, during a round r, return from lambda(r, -), est2i and est2j, respectively, we have:

((est2i ≠ ⊥) ∧ (est2j ≠ ⊥)) ⇒ (est2i = est2j = v).

Fig. 1. The Generic Consensus Algorithm.

4.1.2 Second Phase (Commitment)
The second phase (lines 105-111) starts with an exchange of the new estimates (note that these are equal to the same value v or to ⊥). Then, the behavior of a process pi depends on the set reci of estimate values it has gathered. There are three cases to consider [26].

1. If reci = {v}, then pi decides on v. Note that, in this case, as pi receives v from (n − f) > n/2 processes, any process pj receives v from at least one process. (Obviously, the algorithm remains correct if a process waits for y messages, with n/2 < y ≤ n − f.) All est2 values different from ⊥ are equal to the same value v: The value is indeed locked, and the configuration is monovalent as only v can be decided.
2. If reci = {v, ⊥}, then pi sets est1i to v and proceeds to the next round.
3. If reci = {⊥}, then pi sets est1i to ⊥ and proceeds to the next round.

The use of Reliable Broadcast (line 108) has the following motivation: Given that any process that decides stops participating in the sequence of rounds and all processes do not necessarily decide during the same round, it is possible that processes that proceed to round r + 1 wait forever for messages from processes that have terminated at r. By disseminating the decided value, the Reliable Broadcast primitive prevents such occurrences.

4.2 The Lambda Abstraction
This section defines the properties of our Lambda abstraction. It states the properties any infinite sequence of invocations of the lambda() function has to satisfy. Section 5 will then show how these properties can be ensured using
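The two-phase round structure of task T1 can be sketched as follows. This is a sequential, single-process illustration only: lambda_fn abstracts whichever module implements lambda(), exchange stands in for the second-phase message exchange, and BOTTOM plays the role of ⊥ (all of these names are ours, not the paper's):

```python
BOTTOM = None  # plays the role of the default value ⊥

def consensus(v_i, lambda_fn, exchange):
    """Skeleton of task T1: rounds made up of (selection, commitment).

    lambda_fn(r, est1) -> est2 (a proposable value or BOTTOM);
    exchange(r, est2)  -> set of est2 values gathered from n - f processes."""
    est1, r = v_i, 0
    while True:
        r += 1
        est2 = lambda_fn(r, est1)          # first phase: selection
        rec = exchange(r, est2)            # second phase: commitment
        non_bottom = {x for x in rec if x is not BOTTOM}
        if non_bottom and BOTTOM not in rec:
            return next(iter(non_bottom))  # case 1: rec = {v}, decide v
        # case 2: rec = {v, BOTTOM} -> adopt v; case 3: rec = {BOTTOM} -> BOTTOM
        est1 = next(iter(non_bottom), BOTTOM)

# Toy run: a lambda() that echoes its estimate, and an exchange in which
# every process reports the same est2 value; decision in the first round.
print(consensus(1, lambda r, e: e, lambda r, e: {e}))  # -> 1
```

Quasi-agreement guarantees that non_bottom holds at most one value per round, which is what makes the case analysis above well defined; the dissemination task T2 and the Reliable Broadcast of the decision are omitted from this sketch.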

. Termination: For any round r, if all correct processes invoke lambda(r, -), then all correct processes return from the invocation.
. Eventual convergence: If all correct processes keep repeatedly invoking lambda() with increasing round numbers, then there is a round r such that lambda(r, -) returns the same non-⊥ value to all correct processes.

4.3 Correctness of the Generic Algorithm

Proof. This follows from 1) f being the maximum number of processes that can crash, 2) the termination properties of Lambda, as well as 3) the termination and integrity properties of the Broadcast primitives. We show this more precisely below.

If a process decides, then, by the termination property of the Reliable Broadcast of the DECIDE message, all correct processes decide. Hence, no correct process blocks forever in a round. Assume by contradiction that no process decides. Let r be the smallest round number
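Quasi-agreement (Section 4.1.1) and eventual convergence are both statements about the values lambda() returns across a round; over a recorded (finite prefix of a) run they can be checked directly. A sketch, with names of our own choosing:

```python
BOTTOM = None  # stands for ⊥

def quasi_agreement(round_outputs):
    """Per round: all non-⊥ values returned by lambda() coincide."""
    return len({v for v in round_outputs if v is not BOTTOM}) <= 1

def converges(rounds):
    """Over a finite prefix of a run: some round returned the same
    non-⊥ value to all processes (witnessing eventual convergence)."""
    return any(BOTTOM not in outs and len(set(outs)) == 1 for outs in rounds)

# Outputs of lambda() for three processes across three rounds.
run = [[1, BOTTOM, 1], [BOTTOM, BOTTOM, 0], [0, 0, 0]]
print(all(quasi_agreement(outs) for outs in run))  # True
print(converges(run))                              # True
```

As with termination of consensus itself, eventual convergence is a liveness property: a finite prefix can only witness that it has already happened, never refute it.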

Fig. 2. Ω-based module.

Fig. 3. ◊S-based module.

progress in turn. During each round, the control flow at pi moves during the first phase from the main co-routine (i.e., the algorithm) to the co-routine implementing Lambda, and then returns to the main co-routine for the second phase before progressing to the next round.

5.1 Leader Module
An implementation of lambda(ri, esti) based on Ω is described in Fig. 2. After resetting est1i to its last non-⊥ value (if est1i = ⊥), every process pi first invokes the oracle Ω. The latter provides pi with the identity of some process (i.e., the name of a leader—line 202). Then, pi exchanges with all other processes its current estimate value plus the name of the process it considers leader (line 203). When pi has got such messages from at least (n − f) processes (line 204), pi checks if some process pℓ is considered leader by a majority of the processes. If there is such a process pℓ, and pi has got its current estimate value est1ℓ, then pi considers est1ℓ as the value of est2i. In the other cases, pi sets est2i to ⊥ (lines 206-208).

Theorem 4. The algorithm of Fig. 2 implements Lambda using Ω.5

Proof. We have to show that the LO module, described in Fig. 2, satisfies the validity, quasi-agreement, fixed point, termination, and eventual convergence properties defined in Section 4.2. Validity and termination follow directly from the algorithm and f being the maximum number of processes that can crash. Quasi-agreement follows from the fact that an est2i variable is equal to ⊥ or to the est1ℓ value of a process pℓ (let us notice that there is at most one process pℓ that is considered leader by a majority of processes). The fixed point property follows from the fact that, if all est1i are equal to some value v, then only v or ⊥ can be output at line 208 (notice that the second phase of the consensus algorithm can set est1i only to v or ⊥). Therefore, if all est1i are equal to v at the beginning of r, due to the management of the prev_est1i variables, all processes (that have a value) will have the same value v after executing line 201 during the next round. The eventual convergence property follows from the fact that there is a time after which all processes have the same correct leader pℓ; when this occurs, pℓ imposes its estimate to all processes. □

5. Line 205 is related to the zero-degradation property (see the discussion below). It concerns neither the safety nor the liveness of the Lambda abstraction and is consequently not used in this proof.

The generic consensus algorithm, instantiated with such an implementation of the function lambda(), boils down to the Ω-based algorithm of [11] which, as we pointed out, is the most efficient algorithm we know of. When Ω is perfect, it provides the processes with the same correct process as a leader from the very beginning. Moreover, due to line 205, the test of lines 206-207 is then satisfied as pi has got the estimate value from the common leader pℓ (which is leaderi). It follows that, at line 208, est2i is set to the value of est1ℓ (which is different from ⊥). Thus, it is easy to see that, in this case, all processes get the same estimate value after the execution of lambda(1, -) and the protocol terminates in two communication steps despite initial process crashes. So, the algorithm is zero-degrading (hence, also oracle-efficient).

5.2 Failure Detector Module
A ◊S-based implementation of lambda(ri, esti) is described in Fig. 3. Its principle is particularly simple. Each round has a coordinator (line 302) that tries to impose its estimate value to all the processes (line 303). As the coordinator can crash, a process relies on the strong completeness property of failure detector ◊S in order not to wait indefinitely (line 304). If a process pi gets a value v from the round
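The selection test of the LO module (a majority-supported leader whose estimate has been received) can be sketched for one process's view as follows; the function name, the shape of the received map, and the guard for an empty map are ours, not the figure's:

```python
from collections import Counter

BOTTOM = None  # stands for ⊥

def lo_select(received, n):
    """One process's view of the LO module's test (lines 204-208, sketched).
    received maps each sender to (leader_it_voted_for, its_est1).
    Returns est2: the estimate of a majority-supported leader, else ⊥."""
    if not received:
        return BOTTOM
    votes = Counter(leader for leader, _ in received.values())
    leader, count = votes.most_common(1)[0]
    if count > n // 2 and leader in received:  # majority leader, and its
        return received[leader][1]             # estimate est1_l was received
    return BOTTOM

# n = 4: three processes consider p1 the leader, and p1's estimate (0) arrived.
rcv = {"p1": ("p1", 0), "p2": ("p1", 0), "p3": ("p1", 1)}
print(lo_select(rcv, 4))  # -> 0
```

Since at most one process can be considered leader by a majority, any two non-⊥ outputs of lo_select in the same round are equal, which is exactly the quasi-agreement property.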

Fig. 4. Random-based module.

coordinator, pi sets est2i to v. If pi suspects the current round coordinator to have crashed, pi sets est2i to ⊥ (line 305). In order not to miss the correct process that is eventually no longer suspected (eventual weak accuracy), all processes have to be considered in turn as coordinator until a value is decided. This is realized with the help of the mod function at line 302.

Theorem 5. The algorithm of Fig. 3 implements Lambda using ◊S.

Proof. The proof is similar to the one used for Theorem 4. □

The generic consensus algorithm instantiated with such an implementation of the Lambda abstraction boils down to the ◊S-based consensus algorithm described in [26]. It is easy to see that the resulting algorithm needs only two communication steps to decide when the first coordinator is correct and the ◊S failure detector is perfect. So, this algorithm is oracle-efficient. However, it is not zero-degrading because, if the first coordinator has crashed, the algorithm has to proceed to the second round. In this case, at least three communication steps are required to decide. (One to proceed from the first to the second round and the others in the second round.) A simple way to get a zero-degrading ◊S-based consensus algorithm, despite the crash of the first coordinator, consists of customizing its first round. More precisely, the lambda() function is then implemented as follows:

. Round r = 1: This round uses a module similar to the one described in Fig. 2, except for its line 202 (the line where the Ω oracle is used), which is replaced by the following statement: leaderi ← min(Π − suspectedi).
. Round r > 1: These rounds use the module described in Fig. 3.

When the failure detector is perfect, all processes get the same correct process as the "leader" of the first round, do not suspect it, and, consequently, the decision is obtained during that round. When we consider this implementation of the lambda() function, the first round satisfies validity, quasi-agreement, fixed point, and termination, whereas the other rounds additionally satisfy eventual convergence.

5.3 Random Module
A random-based implementation of Lambda is described in Fig. 4. When a process starts a new round with est1i = ⊥, it sets est1i randomly to 0 or 1. The processes then exchange their current estimate values and each process pi looks for a value that is a majority value. If such a value is obtained, process pi assigns it to est2i. If pi does not see a majority estimate value, pi sets est2i to ⊥. Note that, if est1i = ⊥ at the beginning of a round, process pi can conclude that both values have been proposed. Note also that, at the beginning of the first round, no estimate value is equal to ⊥.

Theorem 6. Using random, the algorithm of Fig. 4 implements the validity, quasi-agreement, and termination properties of Lambda.

Proof. Straightforward from the algorithm. □

The randomized consensus algorithm obtained when using this implementation of the Lambda abstraction boils down to the algorithm proposed in [3]. As noticed in Section 3.4, in the particular case where the processes that have not initially crashed propose the same initial value, this algorithm does not use the underlying random oracle and reaches consensus in two communication steps. Actually, this algorithm uses the random oracle to allow the processes to "eventually" start a round with the same estimate value. When that round is reached, the processes can decide without the help of the oracle. The algorithm satisfies the eventual convergence property with probability 1.

5.4 Module Composition
Interestingly, it is possible to provide implementations of the Lambda abstraction based on combinations of the previous LO, FD, and RO modules (or even any XY module satisfying the validity, quasi-agreement, fixed point, termination, and eventual convergence properties of the Lambda abstraction). Such combinations provide hybrid consensus algorithms that generalize the specific combinations that have been proposed so far (namely, the algorithms that combine a failure detector oracle with a random oracle [1], [28]).

As examples, let us consider two module combinations that merge the LO and FD modules.

. The first combination is the LO_FD_1 module defined as follows:
  - The odd rounds of lambda() are implemented with the LO module (Fig. 2),
  - The even rounds are implemented with the FD module (Fig. 3), where the coordinator pc is defined as follows: c = ((ri/2) mod n) + 1.6

6. In that way, no process is a priori prevented from being the correct process that eventually is not suspected by the other processes.

(Note

Fig. 5. Privileged value module.

Fig. 6. Privileged set of processes module.

that this composition requires a slight modification of the round coordinator definition.)

. The second combination is the LO_FD_2 module defined as follows. Each round of lambda() is implemented by the concatenation made up of the LO module immediately followed by the FD module.

It is easy to see that each of the resulting LO_FD_1 and LO_FD_2 modules satisfies the properties associated with the Lambda abstraction. Other combinations could be defined in a similar way. Such combinations have to be such that, given a round r, all processes use the same module composition during r.

Appropriate combinations, merging the RO module with the LO and FD modules, provide implementations of Lambda that satisfy its validity, quasi-agreement, fixed point, and termination properties. As far as the eventual convergence property is concerned, it is satisfied if the LO (or FD) module is involved in an infinite number of rounds. In the other cases, it is only satisfied with probability 1 (assuming RO is involved in an infinite number of rounds). In that sense, the generic algorithm provides indulgent consensus algorithms that can benefit from the best of several "worlds" (leader oracle, failure detector oracle, or random oracle).

5.5 One-Step Decision

This section considers two additional assumptions that, when satisfied by an actual run, allow the consensus algorithm to expedite the decision, i.e., to terminate in one communication step [5]. Each of these assumptions relies on a specific a priori agreement. More precisely, the first one assumes an a priori agreement on a particular value and allows a one-step decision when enough processes do propose that value. The other assumes that there is a predefined majority set of processes and allows a one-step decision when those processes do propose the same value. Interestingly, these additional assumptions can be merged in any deterministic implementation of Lambda (i.e., LO, FD, or SV—defined in the next section). In the following, we illustrate this idea by combining a specific implementation with the Ω-based implementation of Lambda given in Fig. 2. For a specific initial configuration, the processes can reach a decision in one communication step. Interestingly, the one-step decision characteristic of the resulting algorithm does not impact the performance of the algorithm when the initial configuration is not a specific one.

When combined with the LO module, each of the modules introduced below (named PV—Fig. 5—and PS—Fig. 6) assumes that there is a leader pℓ defined in the LO module (line 202 of Fig. 2). The merging of the LO module with the PV module (respectively, PS) concerns only the first round. This merging is achieved through the LO module of Fig. 2. More precisely, we test if there is a leader as defined in the LO module (line 202 of Fig. 2). In that case, we invoke the PV (respectively, PS) module.

5.5.1 Existence of a Privileged Value

Some applications have the property that some predetermined, privileged value appears much more often than other values. This means that this value is usually proposed much more often than other values. The a priori knowledge of such a predetermined value can help expedite the decision. This can be done by plugging in the module PV, described in Fig. 5, as explained above. The idea underlying this module is the following: If there is a leader pℓ (Fig. 2, line 206) and a majority of processes including pℓ have their current estimates equal to the privileged value, then pi decides it (line 502). Otherwise, if pi has Delivered a PHASE1_LO message carrying the privileged value, then pi updates its prev_est1_i local variable to that value. It is easy to see that, in any run where the processes that have not initially crashed propose the privileged value and the oracle is perfect, the processes decide in one communication step.

Theorem 7. The previous combination merging the LO and PV modules provides a correct implementation of the Lambda abstraction. When used in the consensus protocol described in Fig. 1, it allows the processes to decide, in one communication step, when the privileged value is the only proposed value and the oracle is perfect.
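The one-step test performed by the PV module can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's pseudocode: the names `pv_one_step`, `estimates`, and `privileged` are assumptions introduced here.

```python
# Sketch of the PV module's one-step decision test (illustrative names).
# A process decides the privileged value in one communication step when
# a majority of processes, including the current leader, report it as
# their estimate (cf. the description of the module of Fig. 5).

def pv_one_step(estimates: dict, leader: int, n: int, privileged):
    """estimates maps a sender id to the estimate carried by its
    PHASE1_LO message; returns the decided value or None."""
    supporters = {p for p, est in estimates.items() if est == privileged}
    if leader in supporters and len(supporters) > n // 2:
        return privileged          # one-step decision
    return None                    # fall back to the normal rounds

# If every non-initially-crashed process proposes the privileged value
# and the oracle is perfect, the test succeeds at every process:
assert pv_one_step({1: "v", 2: "v", 3: "v"}, leader=1, n=3, privileged="v") == "v"
assert pv_one_step({1: "v", 2: "w", 3: "w"}, leader=1, n=3, privileged="v") is None
```

Note that when the test fails, nothing is decided in the first step, which is why the optimization does not affect runs whose initial configuration is not the "specific" one.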

Proof. Let us first notice that, if no process executes line 502, then the only difference between the LO+PV merging and LO lies in the fact that some processes possibly set their prev_est1_i variables to the privileged value (which is then a proposed value). This does not modify the output of the lambda() function for that round.

Let us now consider the case where a process decides at line 502. In this case, the process decides the privileged value, which is then the estimate value of the unique leader pℓ of that round. Moreover, in this case, that value has been sent by a majority of processes and f < n/2.

Let us finally observe that, when the set of proposed values does not allow the PV (or PS) module to terminate in one communication step, the consensus algorithm remains zero-degrading: It still terminates in two communication steps despite initial crashes if the underlying (Ω or ⋄S) oracle behaves perfectly. So, combining the additional assumption that "there is a privileged value" or "there is a predefined majority set of processes" with the use of an Ω or ⋄S oracle does not prevent zero-degradation. In a precise sense, one-step decision and zero-degradation are compatible.
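The requirement that, in a given round, all processes use the same module composition (e.g., the LO_FD_2 concatenation, or LO merged with PV in the first round only) is easy to meet: each process derives the composition deterministically from the round number alone. A hypothetical sketch, with module names as strings:

```python
# Sketch: every process derives the round's module composition from the
# round number r alone, so all processes necessarily agree on it
# (illustrative; "LO", "PV", "FD" name the Lambda modules of the paper).

def composition(r: int) -> list:
    if r == 1:
        return ["LO", "PV"]   # one-step attempt merged into the first round
    return ["LO", "FD"]       # LO_FD_2 style: LO immediately followed by FD

# Every process computes the same schedule for every round:
assert composition(1) == ["LO", "PV"]
assert all(composition(r) == ["LO", "FD"] for r in range(2, 100))
```

Because the schedule is a pure function of r, no extra agreement protocol is needed to coordinate which module a round uses.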

Fig. 7. Same value module.

Proof. Straightforward from the algorithm. □

This implementation does not satisfy eventual convergence, so the termination of the consensus algorithm is not guaranteed in all cases. The implementation is particularly interesting when there is a high likelihood that the processes do propose the same value. In such cases, the resulting consensus algorithm is zero-degrading and terminates in two communication steps. Interestingly, this module can be used in combination with other oracle-based modules to provide Lambda implementations, e.g., SV enriched with a "relatively weak oracle" whose aim is to help obtain eventual convergence.

Recall that a zero-degrading algorithm terminates in two communication steps (rounds) in any run where no process crashes, except possibly initially, and the oracle behaves perfectly. Let us recall here that Ω behaves perfectly in a run if it always outputs the same correct process to all processes in that run; ⋄S behaves perfectly when every process that crashes is eventually suspected (in a permanent way) by all correct processes and no process is suspected before it crashes. Note that, when we say here that a run reaches consensus, we mean that all correct processes have decided. Furthermore, we first consider a communication-closed model [14]: If a process does not deliver, in a round r, a message Broadcast to it in that round, the process never delivers that message. We shall later discuss the generalization.

6.2 Impossibility Result

Theorem 9. Assuming f ≥ ⌈n/3⌉, no Ω-based or ⋄S-based consensus algorithm can be both zero-degrading and configuration-efficient.
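The test performed by the same value module (Fig. 7) can be sketched as follows; this is a minimal sketch under the assumption that returning `None` means "no value obtained this round," with the function name `sv` chosen here for illustration:

```python
# Sketch of the SV ("same value") module's test (illustrative names).
# A value is returned only if all delivered estimates are identical,
# which is what lets the consensus algorithm decide in two steps
# whenever all processes propose the same value.

def sv(delivered: list):
    """delivered holds the estimates received in the current round."""
    if delivered and all(v == delivered[0] for v in delivered):
        return delivered[0]        # everybody reported the same value
    return None                    # no convergence: rely on another module

assert sv([1, 1, 1]) == 1
assert sv([1, 2, 1]) is None
```

Since `sv` never forces distinct estimates toward a common value, it cannot provide eventual convergence on its own, matching the remark above.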

Fig. 8. Runs R1-R4 used in the proof of Theorem 9.

after two rounds (zero-degradation) and, hence, p1 eventually decides 0 as well.

2. Process p3 cannot distinguish R1 from run R2 up to the second round: p3 receives exactly the same messages from p2 and gets the same output from its oracle. Hence, p3 decides 0 after two rounds in run R2 as well. Processes p1 and p2 have to eventually decide 0 in R2, even if p3 crashes immediately after deciding at round 2.

3. Consider now run R3. After two rounds, p1 and p2 cannot distinguish run R3 from R2, where p3 might have decided 0 after two steps. Assume again that p3 crashes immediately after the second round in R3. Hence, p1 and p2 also have to eventually decide 0 in run R3 as well. Process p3 cannot distinguish run R3 from R4.

4. In run R4, all processes might have the same initial value and, if we assume that A is configuration-efficient, then p3 must decide 1 after two steps: a contradiction, as p3 cannot distinguish R4 from R3.

Case n > 3. Let us divide the set of processes into three subsets. Given that f ≥ ⌈n/3⌉, all processes of a given subset can crash in a run. The generalization of the previous argument is then similar: all the messages broadcast are eventually received by all correct processes. The point here is that, given the asynchronous characteristic of the channels, these messages might arrive after the decision was made. They simply do not change the contradiction argument. □

6.3 Trading Zero-Degradation

This section discusses the possibility of trading zero-degradation for configuration-efficiency. The issue is to devise algorithms that are oracle-efficient and configuration-efficient instead of zero-degrading. Recall that oracle-efficiency (which concerns crash-free runs) is a weaker property than zero-degradation (which concerns runs with only initial crashes).

6.3.1 The Case of Ω

We show below that any Ω-based consensus algorithm that is oracle-efficient is also zero-degrading. As a consequence of our previous impossibility result (Theorem 9), there is no way to trade the zero-degrading characteristic of oracle-efficient Ω-based consensus algorithms for configuration-efficiency.

Theorem 10. Assuming f ≥ 1, any Ω-based consensus algorithm that is oracle-efficient is also zero-degrading.

Proof. Let A be any Ω-based consensus algorithm that is oracle-efficient but not zero-degrading. There is thus a run R1 where

Fig. 9. Runs R1-R2 used in the proof of Theorem 10.

1) some process, say p1, crashes initially, 2) the oracle behaves perfectly, say by permanently electing p2, and 3) either p2 or p3 decides after round 2. Let us observe that the oracle might also output p2 in a run R2 that is similar to R1, except for p1, which does not crash. Processes p2 and p3 cannot distinguish R1 from R2 as, in both runs, they get the same information from Ω. But then, in R2, Ω behaves perfectly, no process crashes, and some process decides after round 2: a contradiction, as A is an oracle-efficient Ω-based consensus algorithm. □

6.3.2 The Case of ⋄S

We give here a ⋄S-based consensus algorithm, assuming f < n/2, that is oracle-efficient and configuration-efficient. Its rounds r > 1 use the module described in Fig. 3. When we consider this implementation of the lambda() function, the first round satisfies validity, quasi-agreement, fixed point, and termination, and the other rounds additionally satisfy eventual convergence. Consider a consensus algorithm using this implementation of lambda. Clearly, if all processes propose the same value, the processes return that value within lambda and all correct processes decide in two steps (configuration-efficiency). Similarly, if the oracle behaves perfectly and no process crashes, then all processes get all values and return the same value within lambda: Thus, the processes decide in two steps (oracle-efficiency).

6.3.3 On Perfect Ω and ⋄S Oracles

While Ω and ⋄S have the same computational power [8], [10], it is important to notice that a perfect Ω and a perfect ⋄S are not equivalent. This observation might help get an intuition of why any oracle-efficient Ω-based consensus algorithm is also zero-degrading, which is not the case with ⋄S.

7 CONCLUSION

This paper dissects the information structure of consensus algorithms that rely on oracles such as the random oracle [3], the leader oracle [21], and the failure detector oracle [7]. The algorithms we consider are indulgent toward their oracle: No matter how the underlying oracle behaves, consensus safety is never violated.

It is important to notice that our approach does not aim at unifying all algorithmic principles underlying consensus. In particular, we focused on indulgent consensus algorithms [15]. Figuring out how to include, for instance, the synchronous dimension in our general information structure might be feasible along the lines of [14], but requires a careful study. Similarly, we did not consider the memory dimension of consensus algorithms to unify models with crash-stop message passing, crash-recovery message passing, and shared memory [4]. Integrating this dimension with the oracle one is yet another nontrivial challenge.

ACKNOWLEDGMENTS

The authors are grateful to the referees for their useful comments that helped us improve the presentation of this paper. They would also like to thank Partha Dutta, Achour Mostéfaoui, Bastian Pochon, and Sergio Rajsbaum for discussions on consensus algorithms and lower bounds. This paper is a revised and extended version of the paper "A Generic Framework for Indulgent Consensus," which appeared in the Proceedings of the 23rd IEEE International Conference on Distributed Computing Systems (ICDCS '03), May 2003.

REFERENCES

[1] M.K. Aguilera and S. Toueg, "Failure Detection and Randomization: A Hybrid Approach to Solve Consensus," SIAM J. Computing, vol. 28, no. 3, pp. 890-903, 1998.
[2] H. Attiya and J. Welch, Distributed Computing: Fundamentals, Simulations and Advanced Topics. McGraw-Hill, 1998.
[3] M. Ben-Or, "Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols," Proc. Second ACM Symp. Principles of Distributed Computing (PODC '83), pp. 27-30, 1983.
[4] R. Boichat, P. Dutta, S. Frolund, and R. Guerraoui, "Deconstructing Paxos," SIGACT News, Distributed Computing Column, 2003.
[5] F. Brasileiro, F. Greve, A. Mostéfaoui, and M. Raynal, "Consensus in One Communication Step," Proc. Sixth Int'l Conf. Parallel Computing Technologies (PaCT '01), pp. 42-50, 2001.
[6] J. Brzezinsky, J.-M. Hélary, M. Raynal, and M. Singhal, "Deadlock Models and a General Algorithm for Distributed Deadlock Detection," J. Parallel and Distributed Computing, vol. 31, no. 2, pp. 112-125, 1995.
[7] T. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems," J. ACM, vol. 43, no. 2, pp. 225-267, 1996.
[8] T. Chandra, V. Hadzilacos, and S. Toueg, "The Weakest Failure Detector for Solving Consensus," J. ACM, vol. 43, no. 4, pp. 685-722, 1996.
[9] B. Chor and L. Nelson, "Solvability in Asynchronous Environments: Finite Interactive Tasks," SIAM J. Computing, vol. 29, no. 2, pp. 351-377, 1999.
[10] F. Chu, "Reducing Ω to ⋄W," Information Processing Letters, vol. 67, no. 6, pp. 289-293, 1998.
[11] P. Dutta and R. Guerraoui, "Fast Indulgent Consensus with Zero Degradation," Proc. Fourth European Dependable Computing Conf. (EDCC '02), pp. 191-208, 2002.
[12] C. Dwork, N. Lynch, and L. Stockmeyer, "Consensus in the Presence of Partial Synchrony," J. ACM, vol. 35, no. 2, pp. 288-323, 1988.
[13] M.J. Fischer, N. Lynch, and M.S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process," J. ACM, vol. 32, no. 2, pp. 374-382, 1985.
[14] E. Gafni, "A Round-by-Round Failure Detector: Unifying Synchrony and Asynchrony," Proc. 17th ACM Symp. Principles of Distributed Computing (PODC '98), pp. 143-152, 1998.
[15] R. Guerraoui, "Indulgent Algorithms," Proc. 19th ACM Symp. Principles of Distributed Computing (PODC '00), pp. 289-298, 2000.
[16] V. Hadzilacos and S. Toueg, "Reliable Broadcast and Related Problems," Distributed Systems, S. Mullender, ed., pp. 97-145, New York: ACM Press, 1993.
[17] J.-M. Hélary, A. Mostéfaoui, and M. Raynal, "A General Scheme for Token- and Tree-Based Distributed Mutual Exclusion Algorithms," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 11, pp. 1185-1196, Nov. 1994.
[18] M. Hurfin, A. Mostéfaoui, and M. Raynal, "A Versatile Family of Consensus Protocols Based on Chandra-Toueg's Unreliable Failure Detectors," IEEE Trans. Computers, vol. 51, no. 4, pp. 395-408, Apr. 2002.
[19] I. Keidar and S. Rajsbaum, "On the Cost of Fault-Tolerant Consensus when There Are No Faults: A Tutorial," SIGACT News, Distributed Computing Column, vol. 32, no. 2, pp. 45-63, 2001.
[20] A.D. Kshemkalyani and M. Singhal, "On the Characterization and Correctness of Distributed Deadlock Detection," J. Parallel and Distributed Computing, vol. 22, no. 1, pp. 44-59, 1994.
[21] L. Lamport, "The Part-Time Parliament," ACM Trans. Computer Systems, vol. 16, no. 2, pp. 133-169, 1998.
[22] L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem," ACM Trans. Programming Languages and Systems, vol. 4, no. 3, pp. 382-401, 1982.
[23] N. Lynch, Distributed Algorithms. San Francisco: Morgan Kaufmann, 1996.
[24] A. Mostéfaoui, S. Rajsbaum, and M. Raynal, "Conditions on Input Vectors for Consensus Solvability in Asynchronous Distributed Systems," Proc. 33rd ACM Symp. Theory of Computing (STOC '01), pp. 153-162, 2001.
[25] A. Mostéfaoui, S. Rajsbaum, and M. Raynal, "A Versatile and Modular Consensus Protocol," Proc. Int'l IEEE Conf. Dependable Systems & Networks (DSN '02), pp. 364-373, 2002.
[26] A. Mostéfaoui and M. Raynal, "Solving Consensus Using Chandra-Toueg's Unreliable Failure Detectors: A General Quorum-Based Approach," Proc. 13th Int'l Symp. Distributed Computing (DISC '99), pp. 49-63, 1999.
[27] A. Mostéfaoui and M. Raynal, "Leader-Based Consensus," Parallel Processing Letters, vol. 11, no. 1, pp. 95-107, 2001.
[28] A. Mostéfaoui, M. Raynal, and F. Tronel, "The Best of Both Worlds: A Hybrid Approach to Solve Consensus," Proc. Int'l IEEE Conf. Dependable Systems & Networks (DSN '00), pp. 513-522, 2000.
[29] L. Rodrigues and P. Verissimo, "Topology-Aware Algorithms for Large Scale Communications," Advances in Distributed Systems, pp. 127-156, Springer-Verlag, 2000.
[30] B. Sanders, "The Information Structure of Distributed Mutual Exclusion Algorithms," ACM Trans. Computer Systems, vol. 5, no. 3, pp. 284-299, 1987.
[31] A. Schiper, "Early Consensus in an Asynchronous System with a Weak Failure Detector," Distributed Computing, vol. 10, pp. 149-157, 1997.

Rachid Guerraoui received the PhD degree in 1992 from the University of Orsay. He has been a professor in computer science since 1999 at EPFL (Ecole Polytechnique Fédérale de Lausanne) in Switzerland, where he founded the distributed programming laboratory. Prior to that, he was with HP Labs in Palo Alto, California, the Center of Atomic Energy (CEA) in Saclay, France, and the Centre de Recherche de l'Ecole des Mines de Paris. His research interests include distributed algorithms and distributed programming languages. In these areas, he has been principal investigator for a number of research grants and has published papers in various journals and conferences. He has served on program committees for various conferences and chaired the program committees of ECOOP 1999, Middleware 2001, and SRDS 2002. He is a member of the IEEE.

Michel Raynal has been a professor of computer science since 1981. At IRISA (CNRS-INRIA-University joint computing research laboratory located in Rennes), he founded a research group on distributed algorithms in 1983. His research interests include distributed algorithms, distributed computing systems, networks, and dependability. His main interest lies in the fundamental principles that underlie the design and the construction of distributed computing systems. He has been principal investigator for a number of research grants in these areas and has been invited by many universities to give lectures about operating systems and distributed algorithms in Europe, South and North America, Asia, and Africa. He serves as an editor for several international journals. He has published more than 75 papers in journals and more than 160 papers in conferences. He has also written seven books devoted to parallelism, distributed algorithms, and systems (MIT Press and Wiley). He has served on program committees for more than 70 international conferences and chaired the program committees of more than 15 international conferences. He currently serves as the chair of the steering committee leading the DISC symposium series. He received the IEEE ICDCS best paper award three times in a row: 1999, 2000, and 2001.