IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 4, APRIL 2004 453 The Information Structure of Indulgent Consensus
Rachid Guerraoui, Member, IEEE, and Michel Raynal
Abstract—To solve consensus, distributed systems have to be equipped with oracles such as a failure detector, a leader capability, or a random number generator. For each oracle, various consensus algorithms have been devised. Some of these algorithms are indulgent toward their oracle in the sense that they never violate consensus safety, no matter how the underlying oracle behaves. This paper presents a simple and generic indulgent consensus algorithm that can be instantiated with any specific oracle and be as efficient as any ad hoc consensus algorithm initially devised with that oracle in mind. The key to combining genericity and efficiency is to factor out the information structure of indulgent consensus executions within a new distributed abstraction, which we call “Lambda.” Interestingly, identifying this information structure also promotes a fine-grained study of the inherent complexity of indulgent consensus. We show that instantiations of our generic algorithm with specific oracles, or combinations of them, match lower bounds on oracle-efficiency, zero-degradation, and one-step-decision. We show, however, that no leader or failure detector-based consensus algorithm can be, at the same time, zero-degrading and configuration-efficient. Moreover, we show that leader-based consensus algorithms that are oracle-efficient are inherently zero-degrading, but some failure detector-based consensus algorithms can be both oracle-efficient and configuration-efficient. These results highlight some of the fundamental trade offs underlying each oracle.
Index Terms—Asynchronous distributed system, consensus, crash failure, fault tolerance, indulgent algorithm, information structure, leader oracle, modularity, random oracle, unreliable failure detector. æ
1INTRODUCTION 1.1 Context encapsulating eventual synchrony assumptions [12]. In NDERSTANDING the deep structure and the basic design particular, failure detector }S has received a lot of Uprinciples of algorithms solving fundamental distrib- attention. It provides each process with a list of processes uted computing problems is an important and challenging suspected to have crashed, in such a way that every process task. This task has been undertaken for basic problems such that crashes is eventually suspected (completeness prop- as distributed mutual exclusion [17], [30] and distributed erty) and there is a time after which some correct process is deadlock detection [6], [20]. Another such basic problem is no longer suspected (accuracy property). Another approach consensus [2], [13], [23]. This problem consists, for a set of consists of equipping the system with a leader oracle [21]. n processes, to propose each an initial value and, This oracle, denoted in [8], provides the processes with a eventually, agree on one of the proposed values, even if function leader which eventually always delivers the same some of the processes fail by crashing. Consensus is at the correct process identity to all processes. heart of reliable distributed computing and it is tempting to Oracles }S and are, in a precise sense, minimal in that seek the fundamental structure of its algorithms, in each provides a necessary and sufficient amount of particular, consensus algorithms that are optimal in terms information about failures to solve consensus with a of resilience and performance. deterministic algorithm. Oracles }S and actually have the same computational power [8], [10]. The random oracle 1.2 Resilience Optimality solves a nondeterministic variant of consensus [3] and is, Given that it is impossible to solve consensus determinis- strictly speaking, incomparable with the other two. This tically in the presence of crash failures in a purely oracle is, however, in the sense of [9], as strong as consensus asynchronous system [13], several proposals have been and, hence, also somehow minimal. Interestingly, the made to augment the system with oracles that circumvent algorithms that rely on any of those oracles have all the the impossibility. A first approach consists of introducing a common inherent flavor that consensus safety is never random oracle [3] allowing us to design consensus algo- violated, no matter how the oracle behaves: They are rithms that provide eventual decision with probability 1. indulgent toward their oracle [15]. In other words, the oracles are only necessary for the liveness property of Another approach considers a failure detector oracle [7] consensus. A price to pay for this indulgence is that the upper bound f on the number of processes that are allowed . R. Guerraoui is with LPD-I&C-EPFL, CH 1015 Lausanne, Switzerland. to crash has to be smaller than n=2 (where n is the total E-mail: [email protected]. number of processes) and this is needed for each of these . M. Raynal is with IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France. E-mail: [email protected]. oracles [3], [7], [8]. Manuscript received 13 Mar. 2003; revised 2 July 2003; accepted 10 July 1.3 Performance Optimality 2003. For information on obtaining reprints of this article, please send e-mail to: This paper focuses on the performance of indulgent [email protected], and reference IEEECS Log Number 118448. consensus algorithms in terms of time complexity (i.e.,
0018-9340/04/$20.00 ß 2004 IEEE Published by the IEEE Computer Society 454 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 4, APRIL 2004 latency measured in terms of communication steps). That is, algorithm that is generic and efficient. Genericity means we consider the number of communication steps needed for here that we could easily instantiate the algorithm with any the processes to reach a decision in certain runs of the oracle, whereas efficiency means that the resulting algo- execution. We overview here different optimality metrics rithm should be as efficient as any ad hoc consensus along this direction and we define them precisely later in algorithm designed for that specific oracle. the paper. The first difficulty underlying this objective lies in factoring out the appropriate information structure that is . Oracle-efficiency. When an oracle behaves perfectly, common to efficient indulgent consensus algorithms, each the consensus decision can typically be expedited. of which might be using a different oracle and making use More precisely, for all oracles discussed above, two of specific algorithmic techniques. In fact, it is not entirely communication steps are necessary and sufficient to clear whether such a common structure could be precisely reach consensus in failure-free runs where the oracle defined and whether the same generic algorithm could behaves perfectly, e.g., [19]. Algorithms that match encompass the specific characteristics of a random oracle, a this lower bound (e.g., [18], [26], [31]) are said to be failure detector, and a leader oracle. The second difficulty oracle-efficient in the sense that they are optimized for has to do with the possible conflicting nature of the lower the good behavior of the oracle. bounds. We know of no algorithm that matches all lower . Zero-degradation.Thispropertyextendsoracle- bounds recalled above and it is not clear whether such an efficiency from failure-free runs to runs with initial algorithm can indeed be devised. crashes [11]. Algorithms that have this property also match the two communication steps lower bound in 1.5 Related Work runs with initial crashes. For instance, the consensus In [18], several consensus algorithms were unified within algorithms of [11] need only two communication the same framework, all, however, relying on }S-like steps to reach consensus when the oracle behaves failure detectors. A similar unification was proposed in perfectly, even if some processes had crashed [4], for -based consensus algorithms. In [1], [28], consensus initially. This is particularly important because algorithms that make use of several oracles at the same time consensus is typically used in a repeated form and were presented. In these hybrid algorithms, however, a process failure during one consensus instance efficiency was not the issue and the oracles are used in a appearsasaninitialfailureinasubsequent hardwired manner, e.g., they cannot be interchangeable. A consensus instance. In a zero-degrading algorithm, first attempt to build a common consensus framework, a failure in a given instance does not impact the unifying a leader oracle, a random oracle, and a failure performance of any future instances. detector oracle, was proposed in [25]. Unfortunately, . One-step-decision. If the processes exploit an initial algorithms derived by instantiating that framework with a knowledge on a privileged value or on a specific given oracle are clearly not as efficient as ad hoc algorithms subset of processes, they can even, sometimes, reach devised directly with that oracle. Efficient indulgent consensus in a single communication step [5], e.g., consensus algorithms were presented in [11]. However, when all noncrashed processes propose that privi- for each oracle, a specific consensus algorithm is given. leged value. This can, for instance, be very useful if that specific value has a reasonable chance of being 1.6 Contribution proposed more often than others. Algorithms that This paper factors out the information structure of efficient exploit such a knowledge to expedite a decision are indulgent consensus algorithms within a new distributed called here one-step-decision algorithms. abstraction, which we call Lambda. This abstraction . Configuration-efficiency. Finally, when all processes encapsulates the use of any oracle (random, leader, or propose the same initial value, no underlying oracle failure detector) during every individual round of indul- is actually necessary to obtain a decision. In that gent consensus executions. case, two communication steps are also necessary Using this abstraction, we construct a generic indulgent and sufficient to reach a consensus decision, no consensus algorithm that can be instantiated with any oracle, matter how the underlying oracle behaves. Algo- while assuming the highest number of possible failures, i.e., rithms that match this lower bound are said to be f 2.1 Processes In the Consensus problem, every process pi is supposed to The computation model we consider is basically the propose a value vi and the processes have to decide on the asynchronous system model of [2], [7], [13], [23]. The same value v that has to be one of the proposed values. system consists of a finite set of n>1 processes: More precisely, the problem is defined by two safety ¼fp1; ...;png. A process can fail by crashing, i.e., by properties (validity and uniform agreement) and a liveness prematurely halting. Until it possibly crashes, the process property (termination) [7], [13]: behaves according to its specification. If it crashes, the process never executes any other action. By definition, a . Validity: If a process decides v, then v was proposed faulty process is a process that crashes and a correct process by some process. is a process that never crashes. Let f denote the maximum . Uniform Agreement: No two processes decide dif- 1 number of processes that may crash. We assume f For presentation simplicity, we consider only consensus in 3.4 Random Oracle its binary form: 0 and 1 are the only values that can be A random oracle provides each process pi with a function proposed. random that outputs a binary value randomly chosen. Basically, random() outputs 0 (resp. 1) with probability 1=2. 3.2 Leader Oracle A binary consensus algorithm based on such a random A leader oracle is a distributed entity that provides the oracle is described in [3]. In the case where the processes processes with a function leader() that returns a process which have not initially crashed have the same initial value, name each time it is invoked. A unique correct leader is this algorithm reaches consensus in two communication eventually elected, but there is no knowledge of when the steps. (Note that, in this case, the algorithm does not leader is elected. Several leaders can coexist during an actually use the underlying random oracle. This situation does actually correspond to the notion of configuration- arbitrarily long period of time, and there is no way for the efficiency investigated in Section 6.) processes to learn when this “anarchy” period is over. The leader oracle, denoted , satisfies the following property:2 4AGENERIC CONSENSUS ALGORITHM . Eventual Leadership: There is a time t and a correct Our generic consensus algorithm is described in Fig. 1. As process p such that, after t, every invocation of announced in the Introduction, its combination of genericity leader() by any correct process returns p. and efficiency lies in the use of an appropriate information We say that the oracle is perfect if, from the very structure, called Lambda. This abstraction exports a single beginning, it provides the processes with the same correct function, itself denoted lambda(), and this function en- leader. capsulates the use of any underlying oracle. The algorithm -based consensus algorithms are described in [4], [11], borrows its skeleton from [26].3 [21], [27]. These algorithms are (or can be easily made) oracle-efficient: They can reach consensus in two commu- 4.1 Two Phases per Round nication steps when no process crashes and the oracle A process pi starts a consensus execution by invoking behaves perfectly. The -algorithm of [11] is not only oracle- function consensusðviÞ, where vi is the value proposed by pi for consensus. Function consensus() is made up of two efficient, but also zero-degrading: It reaches consensus in concurrent tasks: T1 (the main task) and T2 (the decision two communication steps when the oracle is perfect and no dissemination task). The execution of statement returnðvÞ process crashes during the consensus execution, even if by any process pi terminates the consensus execution (as far some processes had initially crashed. The notion of zero- as pi is concerned) and returns the decided value v to pi. degradation means here that a failure in one consensus In their main task (i.e., T1), the processes proceed instance does not impact the performance of future through consecutive asynchronous rounds. Each round is consensus instances (where the failure appears as an initial made up of two phases. During the first phase of a round, failure). the processes strive to select the same value, called an estimate value. Then, the processes try to decide during the 3.3 Failure Detector Oracle second phase. This occurs when they get the same value at A failure detector }S is defined as follows [7]: Each process the end of the selection phase. pi is provided with a set denoted by suspectedi.If Each process pi manages three local variables: ri (current pj 2 suspectedi, we say that “pi suspects pj.”.Failure detector round number), est1i (estimate of the decision value at the }S satisfies the following properties: beginning of the first phase), and est2i (estimate of the decision value at the beginning of the second phase). The . Strong Completeness: Eventually, every process that specific value ? denotes a default value (which cannot be crashes is permanently suspected by every correct proposed by the processes). process. . Eventual Weak Accuracy: There is a time after which 4.1.1 First Phase (Selection) some correct process is never suspected by the The aim of this phase (line 104) is to provide all processes correct processes. with the same estimate (est2i). When this occurs, a decision A }S oracle is perfect if it suspects exactly the processes that will be obtained during the second phase. To attain this have crashed and, hence, behaves as a perfect failure goal, this phase is made up of a call to the function detector. That is, in addition to strong completeness, the lambda ðÞ. For any process pi, the function has two input oracle never suspects any noncrashed process. parameters: a round number r (an integer) and est1i Several }S-based consensus algorithms have been pro- representing either a possible consensus value (i.e., a binary posed in [7], [11], [18], [26], [31]. The algorithms of [18], [26], value in our case) or the specific value ?. The function returns as an output parameter est2 , representing either a [31] reach consensus in two communication steps when the i possible consensus value or the specific value ?.A oracle is perfect and no process crashes: They are oracle- fundamental property ensured by this function is the efficient. The algorithm of [11] is also zero-degrading. 3. More precisely, our algorithm has the same structure as the }S-based 2. This property refers to a notion of global time. This notion is not algorithm of [26], but differs in its first phase. This phase relies on the accessible to the processes. Lambda abstraction instead of the particular }S failure detector. GUERRAOUI AND RAYNAL: THE INFORMATION STRUCTURE OF INDULGENT CONSENSUS 457 Fig. 1. The Generic Consensus Algorithm. following: For any two processes, pi and pj that, during a different from ? are equal to the same value v: The value is round r, return from lambdaðr; Þ, est2i and est2j, respec- indeed locked. tively, we have: The use of Reliable Broadcast (line 108) has the following motivation: Given that any process that decides stops ððest2i 6¼?Þ^ðest2j 6¼ ?ÞÞ ) ðest2i ¼ est2j ¼ vÞ: participating in the sequence of rounds and all processes do not necessarily decide during the same round, it is 4.1.2 Second Phase (Commitment) possible that processes that proceed to round r þ 1 wait The second phase (lines 105-111) starts with an exchange of forever for messages from processes that have terminated at the new estimates (note that these are equal to the same r. By disseminating the decided value, the Reliable Broad- value v or to ?). Then, the behavior of a process pi depends cast primitive prevents such occurrences. on the set reci of estimate values it has gathered. There are three cases to consider [26]. 4.2 The Lambda Abstraction This section defines the properties of our Lambda abstrac- 1. If reci ¼fvg, then pi decides on v. Note that, in this tion. It states the properties any infinite sequence of case, as p receives v from ðn fÞ >n=2 processes, i invocations of lambda () function has to satisfy. Section 5 any process p receives v from at least one process. j will then show how these properties can be ensured using (Obviously, the algorithm remains correct if a process waits for y messages, with n=2