IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 5, MAY 2002 561 On the Quality of Service of Failure Detectors

Wei Chen, Sam Toueg, and Marcos Kawazoe Aguilera

AbstractÐWe study the quality of service QoS) of failure detectors. By QoS, we mean a specification that quantifies 1) how fast the failure detector detects actual failures and 2) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some distributions. We then give a new failure detector and analyze its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we suggest some ways to make our failure detector adaptive to changes in the probabilistic behavior of the network.

Index TermsÐFailure detectors, quality of service, , distributed algorithm, probabilistic analysis. æ

1INTRODUCTION AULT-TOLERANT distributed systems are designed to crashes is eventually suspected). Such specifications are Fprovide reliable and continuous service despite the appropriate for asynchronous systems in which there is no failures of some of their components. A basic building block timing assumption whatsoever.1 Many applications, how- of such systems is the failure detector. Failure detectors are ever, have some timing constraints and, for such applica- used in a wide variety of settings, such as network tions, failure detectors with eventual guarantees are not communication protocols [10], computer cluster manage- sufficient. For example, a failure detector that starts ment [24], group membership protocols [5], [9], [7], [29], suspecting a one hour after it crashed can be used [23], [22], etc.1a Roughly speaking, a failure detector provides some to solve asynchronous consensus, but it is useless to an information on which processes have crashed. This infor- application that needs to solve many instances of consensus mation, typically given in the form of a list of suspects, is not per minute. Applications that have timing constraints always up-to-date or correct: A failure detector may take a require failure detectors that provide a quality of service long time to start suspecting a process that has crashed and QoS) with some quantitative timeliness guarantees. it may erroneously suspect a process that has not crashed In this paper, we study the QoS of failure detectors in systems where message delays and message losses follow .in practice, this can be due to message losses and delays). some probability distributions. We first propose a set of Chandra and Toueg [12] provide the first formal metrics that can be used to specify the QoS of a failure specification of unreliable failure detectors and show that detector; these QoS metrics quantify 1) how fast it detects they can be used to solve some fundamental problems in actual failures and 2) how well it avoids false detections. We , namely, consensus and atomic broad- then give a new failure detector algorithm and analyze its cast. This approach was later used and generalized in other QoS in terms of the proposed metrics. We show that, among works, e.g., [21], [16], [17], [1], [3], [2]. a large class of failure detectors, the new algorithm is In all of the above works, failure detectors are specified optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to in terms of their eventual behavior .e.g., a process that compute the parameters of our algorithm so that it satisfies 1a. Editor's Note: This paper originally appeared in TC, vol. 51, no. 1. these requirements and we show how this can be done even Unfortunately, errors were introduced into the final paper which have led if the probabilistic behavior of the system is not known. us to reprint the paper in its entirety in this issue. Finally, we give simulation results showing that the new failure detector algorithm provides a better QoS than an . W. Chen is with Oracle Corp., One Oracle Dr., Nashua, NH 03062. algorithm that is commonly used in practice. The QoS E-mail: [email protected]. specification and the analysis of our failure detector . S. Toueg is with the Department of , University of algorithm are based on the theory of stochastic processes. Toronto, 6 King's College Rd., Toronto, Ontario, Canada M5S 3H5. E-mail: [email protected]. To the best of our knowledge, this work is the first . M.K. Aguilera is with Compaq Systems Research Center, 130 Lytton Ave., comprehensive and systematic study of the QoS of failure Palo Alto, CA 94301-1044. E-mail: [email protected]. detectors using probability theory. Manuscript received 20 Oct. 2000; revised 18 June 2001; accepted 22 June 2001. 1. Even though the fail-aware failure detector of [17] is implemented in For information on obtaining reprints of this article, please send e-mail to: the ªtimed asynchronousº model, its specification is for the asynchronous [email protected], and reference IEEECS Log Number 113022. model.

0018-9340/02/$17.00 ß 2002 IEEE 562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 5, MAY 2002

Fig. 1. Detection time TD.

1.1 On the QoS Specification of Failure Detectors Fig. 2. FD1 and FD2 have the same query accuracy probability of 0:75,

We consider message-passing distributed systems in which but the mistake rate of FD2 is four times that of FD1. processes may fail by crashing and messages may be 2 3 delayed or dropped by communication links. A failure than FD1. In some applications, every mistake causes a detector can be slow, i.e., it may take a long time to suspect a costly interrupt and, for such applications, the mistake rate is process that has crashed, and it can make mistakes, i.e., it an important accuracy metric. may erroneously suspect some processes that are actually Note,however,thatthemistakeratealoneisnot up .such a mistake is not necessarily permanent: The failure sufficient to characterize accuracy: As shown in Fig. 3, detector may later stop suspecting this process). To be two failure detectors can have the same mistake rate, but useful, a failure detector has to be reasonably fast and different query accuracy . accurate. Even when used together, the above two accuracy In this paper, we propose a set of metrics for the QoS metrics are still not sufficient. In fact, it is easy to find two specification of failure detectors. In general, these QoS failure detectors, FD1 and FD2, such that 1) FD1 is better metrics should be able to describe the failure detector's than FD2 in both measures .i.e., it has a higher query speed .how fast it detects crashes) and its accuracy .how well accuracy probability and a lower mistake rate), but 2) FD2 is it avoids mistakes). Note that speed is with respect to better than FD1 in another respect: Specifically, whenever processes that , while accuracy is with respect to FD2 makes a mistake, it corrects this mistake faster than processes that do not crash. FD1; in other words, the mistake durations in FD2 are A failure detector's speed is easy to measure: This is smaller than in FD1. Having small mistake durations is simply the time that elapses from the moment when a important to some applications .e.g., applications that process p crashes to the time when the failure detector starts operate in a degraded mode while operating processes are suspecting p permanently. This QoS metric, called detection incorrectly suspected). time, is illustrated in Fig. 1. As can be seen from the above, there are several different How do we measure a failure detector's accuracy? It aspects of accuracy that may be important to different turns out that determining a good set of accuracy metrics is applications and each aspect has a corresponding accuracy a delicate task. To illustrate some of the subtleties involved, metric. consider a system of two processes, p and q, connected by a In this paper, we identify six accuracy metrics .since the lossy communication linkand suppose that the failure behavior of a failure detector is probabilistic, most of these detector at q monitors process p. The output of the failure metrics are random variables). We then use the theory of stochastic processes to quantify the relation between these detector at q is either ªI suspect that p has crashedº or ªI metrics. This analysis allows us to select two accuracy trust that p is upº and it may alternate between these two metrics as the primary ones in the sense that: 1) They are not outputs from time to time. For the purpose of measuring the redundant .one cannot be derived from the other) and accuracy of the failure detector at q, suppose that p does not 2) together, they can be used to derive the other four crash. accuracy metrics. Consider an application that queries q's failure detector The QoS metrics defined in this paper are similar to some at random times. For such an application, a natural measure measures. For example, the query accuracy of accuracy is the probability that, when queried at a random probability metric is similar to the availability measure, time, the failure detector at q indicates correctly that p is up. mistake duration is similar to time to recover, and mistake This QoS metric is the query accuracy probability.For recurrence time .Section 2.2) is similar to time between example, in Fig. 2, the query accuracy probability of FD1 failures. However, there are significant differences and one at q is 12= 12 ‡ 4†ˆ0:75. must be careful not to be confused. Dependability measures The query accuracy probability, however, is not suffi- refer to failures of the system, whereas our QoS metrics cient to fully describe the accuracy of a failure detector. To refer to mistakes of the failure detector which, itself, tracks see this, we show in Fig. 2 two failure detectors FD1 and failures of the system. Because of that, not all of our metrics FD2 such that 1) they have the same query accuracy have corresponding dependability measures. In particular, probability, but 2) FD2 makes mistakes more frequently the time to detect a failure is a natural metric for failure

2. We assume that process crashes are permanent or, equivalently, that a 3. The failure detector makes a mistake each time its output changes from process that recovers from a crash assumes a new identity. ªtrustº to ªsuspectº while p is actually up. CHEN ET AL.: ON THE QUALITY OF SERVICE OF FAILURE DETECTORS 563

heartbeat. In the simple algorithm, q would permanently suspect p only d ‡ TO time units after p crashes. Thus, the worst-case detection time for this algorithm is the maximum message delay plus TO. This is impractical because, in many systems, the maximum message delay is orders of magnitude larger than the average message delay. The source of the above problems is that, even though the heartbeats are sent at regular intervals, the timers to ªcatchº them expire at irregular times, namely the receipt times of the heartbeats plus a fixed TO. The algorithm that we propose eliminates this problem. As a result, the Fig. 3. FD1 and FD2 have the same mistake rate 1=16, but the query probability of a premature timeout on heartbeat mi does accuracy probabilities of FD and FD are 0:75 and 0:50, respectively. 1 2 not depend on the behavior of the heartbeats that precede m and the detection time does not depend on the maximum detectors, but it has no counterpart among dependability i measures. message delay. In summary, we show that the QoS specification of 1.2.2 A New Algorithm and Its QoS Analysis failure detectors can be given in terms of three basic metrics, namely, the detection time and the two primary In the new algorithm, process p sends heartbeat messages accuracy metrics that we identified. Taken together, these m1;m2; ... to q periodically every  time units .just as in the metrics can be used to characterize and compare the QoS simple algorithm). To determine whether to suspect p, q of failure detectors. Note that these QoS metrics are uses a sequence 1;2; ... of fixed time points, called applicable to all failure detectors, regardless of how they freshness points, obtained by shifting the sending time of are implemented, even though the second part of the the heartbeat messages by a fixed parameter . More paper focuses on the QoS of the failure detectors precisely, i ˆ i ‡ , where i is the time when mi is sent. implemented through heartbeats. For any time t, let i be so that t 2‰i;i‡1†; then, q trusts p at 1.2 The Design and Analysis of a New Failure time t if and only if q has received heartbeat mi or higher. Detector Algorithm Given the probabilistic behavior of the system .i.e., the In this paper, we consider a simple system of two processes, probability of message losses and the distribution of p and q, connected through a communication link. Process p message delays) and the parameters  and  of the may fail by crashing and the linkbetween p and q may algorithm, we determine the QoS of the new algorithm delay or drop messages. Message delays and message using the theory of stochastic processes. Simulation results losses follow some probabilistic distributions. Process q has given in Section 7 are consistent with our QoS analysis and a failure detector that monitors p and outputs either they show that the new algorithm performs better than the ªI suspect that p has crashedº or ªI trust that p is upº common one. .ªsuspect pº and ªtrust pº in short, respectively). In contrast to the common algorithm, the new algorithm guarantees an upper bound on the detection time. More- 1.2.1 A Common Failure Detection Algorithm and Its over, the new algorithm is optimal in the sense that it has Drawbacks the best possible query accuracy probability with respect to A simple failure detection algorithm, commonly used in any given bound on the detection time. More precisely, we practice, works as follows: At regular time intervals, process p sends a heartbeat message to q; when q receives a show that, among all failure detectors that send heartbeats heartbeat message, it trusts p and starts a timer with a fixed at the same rate .they use the same networkbandwidth) timeout value TO; if the timer expires before q receives a and satisfy the same upper bound on the detection time, the newer heartbeat message from p, then q starts suspecting p. new algorithm has the best query accuracy probability. This algorithm has two undesirable characteristics; one The first version of our algorithm .described above) regards its accuracy and the other its detection time, as we assumes that p and q have synchronized clocks. This now explain. Consider the ith heartbeat message mi. assumption is not unrealistic, even in large networks. For Intuitively, the probability of a premature timeout on mi example, GPS and Cesium clocks are becoming accessible should depend solely on mi and, in particular, on mi's and they can provide clocks that are very closely synchro- delay. With the simple algorithm, however, the probability nized .see, e.g., [31]). When synchronized clocks are not of a premature timeout on m also depends on the heartbeat i available, we propose a modification to this algorithm that m that precedes m ! In fact, the timer for m is started iÀ1 i i performs equally well in practice, as shown by our upon the receipt of m and, so, if m is ªfast,º the timer iÀ1 iÀ1 simulations. The basic idea is to use past heartbeat for mi starts early and this increases the probability of a messages to obtain accurate estimates of the expected premature timeout on mi. This dependency on past heart- beats is undesirable. arrival times of future heartbeats and then use these To see the second problem, suppose p sends a heartbeat estimates to find the freshness points. This is explained in just before it crashes and let d be the delay of this last Section 6. 564 IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 5, MAY 2002

1.2.3 Configuring Our Algorithm to Meet the Failure The probabilistic networkmodel used in this paper is Detector Requirements of an Application similar to the ones used in [14], [6] for probabilistic clock synchronization. The method of estimating the expected Given a set of failure detector QoS requirements .provided arrival times of heartbeat messages is close to the method of by an application), we show how to compute the remote clockreading of [6]. parameters of our algorithm to achieve these requirements. The rest of the paper is organized as follows: In Section 2, We first do so assuming that one knows the probabilistic we propose a set of metrics to specify the QoS of failure behavior of the system .i.e., the probability distributions of detectors. In Section 3, we describe a new failure detector message delays and message losses). We then drop this algorithm and analyze its QoS in terms of these metrics; we assumption and show how to configure the failure detector also present an optimality result. We then explain how to to meet the QoS requirements of an application even when set the algorithm's parameters to meet some given QoS the probabilistic behavior of the system is not known. requirementsÐfirst, in the case when we know the 1.3 Related Work probabilistic behavior of messages .Section 4) and then in In [20], Gouda and McGuire measure the performance of the case when this is not known .Section 5). In Section 6, we deal with unsynchronized clocks. We present the results of some failure detector protocols under the assumption that some simulations in Section 7 and we conclude the paper the protocol stops as soon as some process is suspected to with some discussion in Section 8. Appendix A lists the have crashed .even if this suspicion is a mistake). This class main symbols used in the paper and Appendices B to D of failure detectors is less general than the one that we give the proofs of the main theorems. More detailed proofs studied here: In our work, a failure detector can alternate can be found in [13]. between suspicion and trust many times. In [30], van Renesse et al. propose a scalable gossip- style randomized failure detector protocol. They measure 2ON THE QOSSPECIFICATION OF FAILURE the accuracy of this protocol in terms of the probability of DETECTORS premature timeouts.4 The probability of premature timeouts, We consider a system of two processes, p and q. We assume however, is not an appropriate metric for the specification that the failure detector at q monitors p and that q does not of failure detectors in general: It is implementation-specific crash. Henceforth, real time is continuous and ranges from and it cannot be used to compare failure detectors that use 0 to 1. timeouts in different ways. This point is further explained at the end of Section 2.3. 2.1 The Failure Detector Model In [25], Raynal and Tronel present an algorithm that The output of the failure detector at q at time t is either S or detects member failures in a group: If some process detects T, which means that q suspects or trusts p at time t, a failure in the group .perhaps a false detection), then all respectively. A transition occurs when the output of the processes report a group failure and the protocol termi- failure detector at q changes: An S-transition occurs when nates. The algorithm is based on heartbeats and its timeout the output at q changes from T to S;aT-transition occurs mechanism is the same as the simple algorithm that we when the output at q changes from S to T. We assume that described in Section 1.2. there are only a finite number of transitions during any In [31], VerõÂssimo and Raynal study QoS failure detectors finite time interval. Ðthese are detectors that indicate when a service does not Since the behavior of the system is probabilistic, the meet its quality-of-service requirements. In contrast, this precise definition of our model and of our QoS metrics uses paper studies the QoS of failure detectors, i.e., how well a the theory of stochastic processes. To keep our presentation failure detector works. at an intuitive level, we omit the technical details related to Failure detector implementations based on heartbeats are this theory .they can be found in [13]). commonly used in practice. To keep both good detection We consider only failure detectors whose behavior time and good accuracy, many implementations rely on eventually reaches steady state, as we now explain infor- special features of the and communication mally. When a failure detector starts running and, for a system to try to ensure that heartbeat messages are received while after, its behavior depends on the initial condition at regular intervals .see discussion in Section 12.9 of [24]). .such as whether, initially, q suspects p or not) and on how This is not easy, even for closely-connected computer long it has been running. Typically, as time passes, the clusters, and it is very hard in wide-area networks. Fetzer effect of the initial condition gradually diminishes and its and Cristian [19] describe a failure detector implementation, behavior no longer depends on how long it has been which they call the independent assessment protocol. This runningÐi.e., eventually, the failure detector behavior implementation is identical to the common algorithm, reaches equilibrium or steady state. In steady state, the except that it has a mechanism to discard heartbeats that probability law governing the behavior of the failure are too old, based on a time threshold. However, in practical detector does not change over time. Typically, failure systems, the threshold must be much larger than the detector implementations reach steady state very quickly. average message delay .lest too many messages be For example, our failure detector does so soon after the first discarded). The resulting scheme has exactly the same heartbeat message is sent .see Section 3.2). drawbacks as the common algorithm. The QoS metrics that we propose refer to the behavior of a failure detector after it reaches steady state. Most of these 4. This is called ªthe probability of mistakesº in [30]. metrics are random variables. CHEN ET AL.: ON THE QUALITY OF SERVICE OF FAILURE DETECTORS 565

2.2 Primary Metrics We propose three primary metrics for the QoS specification of failure detectors. The first one measures the speed of a failure detector. It is defined with respect to the runs in which p crashes. Detection time .TD): Informally, TD is the time that elapses from p's crash to the time when q starts suspecting p permanently. More precisely, TD is a random variable Fig. 4. Mistake duration TM , good period duration TG, and mistake representing the time that elapses from the time that p recurrence time TMR. crashes to the time when the final S-transition .of the failure detector at q) occurs and there are no transitions afterward started at a random time in a good period. If the remaining .Fig. 1). If there is no such final S-transition, then TD ˆ1;if part of the good period is long enough, the short-lived 5 such an S-transition occurs before p crashes, then TD ˆ 0. application will be able to complete its task. The metric that We next define some metrics that are used to specify the measures the remaining part of the good period is: accuracy of a failure-detector. Throughout the paper, all Forward good period duration .TFG): This is a random accuracy metrics are defined with respect to failure-free runs, variable representing the time that elapses from a random 6 i.e., runs in which p does not crash. There are two primary time at which q trusts p to the time of the next S-transition. accuracy metrics: At first sight, it may seem that, on average, TFG is just Mistake recurrence time .T ): This measures the time MR half of TG .the length of a good period). But, this is incorrect between two consecutive mistakes. More precisely, T is a MR and, in Section 2.4, we give the actual relation between TFG random variable representing the time that elapses from an and TG. S-transition to the next one .Fig. 4). An important remarkis now in order. For timeout-based Mistake duration .TM ): This measures the time it takes failure detectors, the probability of premature timeouts has the failure detector to correct a mistake. More precisely, TM sometimes been used as the accuracy measure: This is the is a random variable representing the time that elapses from probability that, when the timer is set, it will prematurely an S-transition to the next T-transition .Fig. 4). timeout on a process that is actually up. The measure, As we discussed in the introduction, there are many however, is not appropriate because: 1) It is implementa- aspects of failure detector accuracy that may be important tion-specific and 2) it is not useful to applications unless it is to applications. Thus, in addition to T and T ,we MR M given together with other implementation-specific mea- propose four other accuracy metrics in the next section. We sures, e.g., how often timers are started, whether the timers selected T and T as the primary metrics because, given MR M are started at regular or variable intervals, whether the these two, one can compute the other four .this will be timeout periods are fixed or variable, etc. .many such shown in Section 2.4). variations exist in practice [10], [20], [30]). Thus, the 2.3 Derived Metrics probability of premature timeouts is not a good metric for We propose four additional accuracy metrics: the specification of failure detectors, e.g., it cannot be used to compare the QoS of failure detectors that use timeouts in Average mistake rate .M ): This measures the rate at which a failure detector makes mistakes, i.e., it is the different ways. The six accuracy metrics that we identified average number of S-transitions per time unit. This metric is in this paper do not refer to implementation-specific important to long-lived applications where each failure features, in particular, they do not refer to timeouts at all. detector mistake .each S-transition) results in a costly 2.4 How the Accuracy Metrics are Related interrupt. This is the case for applications such as group membership and cluster management. Theorem 1 below explains how our six accuracy metrics are related. We then use this theorem to justify our choice of the Query accuracy probability .PA): This is the probability that the failure detector's output is correct at a random time. primary accuracy metrics. Henceforth, Pr A† denotes the k This metric is important to applications that interact with probability of event A; E X†, E X †, and V X† denote the the failure detector by querying it at random times. expected value .or mean), the kth moment, and the variance Many applications can make progress only during good of random variable X, respectively. periodsÐperiods in which the failure detector makes no Parts 2 and 3 of Theorem 1 assume that, in failure-free mistakes. This observation leads to the following two runs, the probabilistic distribution of failure detector output metrics: histories is ergodic. Roughly speaking, this means that, in failure-free runs, the failure detector slowly ªforgetsº its Good period duration .TG): This measures the length of a good period. More precisely, TG is a random variable past history: From any given time on, its future behavior representing the time that elapses from a T-transition to the may depend only on its recent behavior. We call failure next S-transition .Fig. 4). detectors satisfying this ergodicity condition ergodic failure For short-lived applications, however, a closely related detectors. Ergodicity is a basic concept in the theory of metric may be more relevant. Suppose that an application is stochastic processes [27], but the technical details are substantial and outside the scope of this paper. 5. We omit the boundary cases of other metrics since they can be We have also determined the relations between our similarly defined. 6. As explained in [13], these metrics are also meaningful for runs in accuracy metrics in the case that ergodicity does not hold. which p crashes. The resulting expressions are more complex .they are 566 IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 5, MAY 2002 generalized versions of those given below). More details, 1) message loss probability pL, which is the probability that a including a generalized form of Theorem 1 and its proof, message is dropped by the link, and 2) message delay D, can be found in [13]. which is a random variable with range 0; 1† representing Theorem 1. For any ergodic failure detector, the following the delay from the time a message is sent to the time it is received, under the condition that the message is not results hold: 1) TG ˆ TMR À TM .2)If0 y†dy=E TG†, 3b) k k‡1 E TFG†ˆE TG †=‰ k ‡ 1†E TG†Š.Inparticular,3c) to many practical systems. 2 E TFG†ˆ‰1 ‡ V TG†=E TG† ŠE TG†=2. Processes p and q have access to their own local clocks. For simplicity, we assume that there is no clockdrift, i.e., The fact that TG ˆ TMR À TM holds is immediate by local clocks run at the same speed as real time. In practice, definition. The proofs of parts 2) and 3) use the theory of clockdrift rate is usually very small .on the order of 10À6 stochastic processes. Part 2) is intuitive, while part 3), which [14]). Thus, for the purpose of failure detection, clockdrift is relates TG and TFG, is more complex. In particular, part 3c) usually negligible because, in most cases, only messages is counterintuitive: One may thinkthat E TFG†ˆE TG†=2, from a short period of time are used for detection and the but part 3c) says that E TFG† is, in general, larger than clockdrift in this short period is not significant. In E TG†=2 .this is a version of the ªwaiting time paradoxº in Sections 3, 4, and 5, we further assume that clocks are the theory of stochastic processes [4]). synchronizedÐan assumption that holds in many systems. We now explain how Theorem 1 guided our selection of We explain how to remove this assumption in Section 6. the primary accuracy metrics. Parts 2) and 3) show that M , For simplicity, we assume that the probabilistic behavior PA, and TFG can be derived from TMR, TM , and TG. This of the networkdoes not change over time. In Section 8, we suggests that the primary metrics should be selected among suggest some ways to modify the algorithm so that it T , T , and T . Moreover, since T ˆ T À T , it is clear MR M G G MR M dynamically adapts to changes in the probabilistic behavior that, given the joint distribution of any two of them, one can of the system. derive the remaining one. Thus, two of T , T , and T MR M G We assume that crashes cannot be predicted, i.e., the should be selected as the primary metrics, but which two? state of the system at any given time has no information By choosing T and T as our primary metrics, we get the MR M whatsoeverontheoccurrenceoffuturecrashes.this following convenient property that helps to compare failure excludes a system with program-controlled crashes [11]). detectors: If FD1 is better than FD2 in terms of both E TMR† Moreover, the delay and loss behaviors of the messages that and E TM † .the expected values of the primary metrics), a process sends are independent of whether .and when) the then we can be sure that FD1 is also better than FD2 in process crashes. terms of E TG† .the expected values of the other metric). We would not get this useful property if TG were selected as 3.2 The Algorithm one of the primary metrics.7 A final remarkis now in order on the seven QoS The new algorithm works as follows: The monitored metrics that we proposed. Although these generic metrics process p periodically sends heartbeat messages should be sufficient for most applications that use failure m1;m2;m3; ... to q every  time units, where  is a detectors, there may be other specialized metrics that are parameter of the algorithm. Every heartbeat message mi relevant to specific applications. Such metrics could be is tagged with its sequence number i. Henceforth, i derivable from ours. denotes the sending time of message mi. The monitoring process q shifts the is forward by Ðthe other parameter of the algorithmÐto obtain the sequence of times 3THE DESIGN AND QOSANALYSIS OF A NEW 1 <2 <3 < ..., where i ˆ i ‡ . Process q uses the is FAILURE DETECTOR ALGORITHM and the times it receives heartbeat messages to determine 3.1 The Probabilistic Network Model whether to trust or suspect p, as follows: Consider time

We assume that processes p and q are connected by a link period ‰i;i‡1†. At time i, q checks whether it has received 8 that does not create or duplicate messages, but may delay some message mj with j  i. If so, q trusts p during the or drop messages. Note that the linkhere represents an end- entire period ‰i;i‡1† .Fig. 5a). If not, q starts suspecting p. to-end connection and does not necessarily correspond to a If, at some time before i‡1, q receives some message mj physical link. with j  i, then q starts trusting p from that time until i‡1. We assume that the message loss and message delay .Fig. 5b). If, by time i‡1, q has not received any message mj behavior of any message sent through the linkis probabil- with j  i, then q suspects p during the entire period istic and is characterized by the following two parameters: ‰i;i‡1† .Fig. 5c). This procedure is repeated for every time period. The detailed algorithm with parameters  and  is 7. For example, FD1 may be better than FD2 in terms of both E TG† and denoted by NFD-S and is given in Fig. 6.9 E TM †, but worse than FD2 in terms of E TMR†. 8. Message duplication can be easily taken care of: Whenever we refer to a message being received, we change it to the first copy of the message being 9. This version of the algorithm is convenient for illustrating the main received. With this modification, all definitions and analyses in the paper go idea and for performing the analysis. We have omitted some obvious through and, in particular, our results remain correct without any change. optimizations. CHEN ET AL.: ON THE QUALITY OF SERVICE OF FAILURE DETECTORS 567

Fig. 5. Three scenarios of the failure detector output in one interval ‰i;i‡1†.

Fig. 6. Failure detector algorithm NFD-S with parameters  and  Bclocks are synchronized).

Note that, from time i to i‡1, only messages mj with Definition 1. j  i can affect the output of the failure detector. For this 1. For any i  1, let k be the smallest integer such that, reason, i is called a freshness point: From time i to i‡1, messages m with j  i are still fresh .useful). So, our for all j  i ‡ k, mj is sent at or after time i. j 2. For any i  1, let p x† be the probability that q algorithm is characterized by the following property: q j does not receive message m by time  ‡ x, for trusts p at time t if and only if q received a message that is i‡j i every j  0 and every x  0; let p ˆ p 0†. still fresh at time t. 0 0 3. For any i  2, let q0 be the probability that q receives This property immediately implies that the failure message miÀ1 before time i. detector reaches its steady state very quickly: It does so at 4. For any i  1, let u x† be the probability that q time 1, i.e.,  time after the first heartbeat message is sent. suspects p at time i ‡ x, for every x 2‰0;†. This is because, after time j, the state of process q only 5. For any i  2, let ps be the probability that an depends on what happens at or after time j .the time when S-transition occurs at time i. the jth heartbeat message is sent). The above definitions are given in terms of i, a positive 3.3 The QoS Analysis of the Algorithm integer. Proposition 3, however, shows that they are actually independent of i. We now give the QoS of the algorithm .the analysis is given in Appendix B). We assume that the linkfrom p to q satisfies Proposition 3. the following message independence property: The behaviors 1. k ˆd=e. 10 of any two heartbeat messages sent by p are independent. 2. For all j  0 and for all x  0, def Henceforth, let 0 ˆ 0, and i ˆ i ‡  for i  1 .as in line 3 of the algorithm). pj x†ˆpL ‡ 1 À pL†Pr D>‡ x À j†: We first formalize the intuition behind freshness points and fresh messages: 3. q0 ˆ 1 À pL†Pr D<‡Q†. 4. For all x 2‰0;†, u x†ˆ k p x†. Lemma 2. For all i  0 and all time t 2‰i;i‡1†, q trusts p at jˆ0 j 5. p ˆ q Á u 0†. time t if and only if q has received some message mj with j  i s 0 by time t. By definition, if p0 ˆ 0, then, for every i  1, the

probability that q receives mi by time i is 1. Thus, if The following definitions are for runs where p does not p ˆ 0, then, with probability one, q trusts p forever after crash. 0 time 1. Similarly, it is easy to see that if q0 ˆ 0, then, with

10. In practice, this holds only if consecutive heartbeats are sent more probability one, q suspects p forever. So, p0 ˆ 0 and q0 ˆ 0 than some Á time units apart, where Á depends on the system. So, are degenerated cases of no interest. We henceforth assume assuming that the behavior of heartbeats is independent is equivalent to assuming that >Á. that p0 > 0 and q0 > 0. 568 IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 5, MAY 2002

The following lemma indicates that Theorem 1 is applicable to failure detector NFD-S. Lemma 4. NFD-S is an ergodic failure detector.

The following theorem summarizes our QoS analysis of the new failure detector algorithm. Theorem 5. Consider a system with synchronized clocks, where the probability of message losses is pL and the distribution of message delays is Pr D  x†. The failure detector NFD-S of Fig. 6 with parameters  and  has the following properties. Fig. 7. Meeting QoS requirements with NFD-S. The probabilistic 1. The detection time is bounded as follows and the bound behavior of heartbeats is given, and clocks are synchronized. is tight:

TD   ‡ : 3:1† is easy to see that the probability that q trusts p at a random time in Aà must be at least as high as the probability that q trusts p at a random time in any A 2C. The detailed proof is 2. The average mistake recurrence time is: given in Appendix C.  E T †ˆ : 3:2† MR p s 4CONFIGURING THE FAILURE DETECTOR TO SATISFY QOSREQUIREMENTS 3. The average mistake duration is: Suppose we are given a set of failure detector QoS R  requirements .the QoS requirements could be given by 0 u x† dx E TM †ˆ : 3:3† the application that uses this failure detector). We now ps show how to compute the parameters  and  of our failure detector algorithm, so that these requirements are From E TMR† and E TM † given in the theorem above, we can easily derive the other accuracy measures using satisfied. We assume that 1) the local clocks of processes Theorem 1. For example, we can get the query accuracy are synchronized and 2) one knows the probabilistic R  probability PA ˆ 1 À E TM †=E TMR†ˆ1 À 1= Á 0 u x† dx. behavior of the messages, i.e., the message loss prob- Theorem 5.1 shows an important property of the ability pL and the distribution of message delays algorithm: The detection time is bounded and the bound Pr D  x†. In Sections 5 and 6, we consider the cases does not depend on the behavior of message delays and when these assumptions do not hold. losses. We assume that the QoS requirements are expressed In Sections 4, 5, and 6, we show how to use Theorem 5 to using the primary metrics. More precisely, a set of QoS compute the failure detector parameters so that the failure requirements is a tuple T U ;TL ;TU † of positive numbers, detector satisfies some QoS requirements .given by an D MR M U L application). where TD is an upper bound on the detection time, TMR is a lower bound on the average mistake recurrence time, and 3.4 An Optimality Result U TM is an upper bound on the average mistake duration. In Among all failure detectors that send heartbeats at the same other words, the requirements are that:11 rate and satisfy the same upper bound on the detection U L U time, the new algorithm provides the best query accuracy TD  TD ;E TMR†TMR;E TM †TM : 4:1† probability. More precisely, let C be the class of failure detector A such that, in every run of A, process p Our goal, illustrated in Fig. 7, is to find a configuration sends heartbeats to q every  time units and A satisfies procedure that takes as inputs 1) the QoS requirements, U U à namely T U ;TL ;TU , and 2) the probabilistic behavior of the TD  TD for some constant TD . Let A be the instance of the D MR M new failure detector algorithm NFD-S with parameters  heartbeat messages, namely pL and Pr D  x†, and outputs U the failure detector parameters  and  so that the failure and  ˆ TD À . By part 1 of Theorem 5, we know that Aà 2C. We can show that: detector satisfies the QoS requirements in .4.1). Further- more, to minimize the networkbandwidth takenby the Theorem 6. For any A 2C, let PA be the query accuracy à failure detector, we want a configuration procedure that probability of A. Let PA be the query accuracy probability of à à finds the largest intersending interval  that satisfies these A . Then, PA  PA. QoS requirements. The theorem is a consequence of the following important From Theorem 5, our goal can be restated as a property of algorithm AÃ: Consider any algorithm A 2C. mathematical programming problem: Let rà be any failure-free run of Aà and r be any failure-free run of A in which the heartbeat delays and losses are 11. Note that the bounds on the primary metrics E TMR† and E TM † also à impose bounds on the derive metrics according to Theorem 1. More exactly as in r . We can show that if q suspects p at time t in L L U L precisely, we have M  1=TMR, PA  TMR À TM †=TMR, à L U L U r , then q also suspects p at time t in r. With this property, it E TG†TMR À TM , and E TFG† TMR À TM †=2. CHEN ET AL.: ON THE QUALITY OF SERVICE OF FAILURE DETECTORS 569

maximize  4:2† U subject to  ‡   TD

 L  TMR 4:3† ps

R  0 u x† dx U  TM ; 4:4† ps where the values of u x† and ps are given by Proposition 3. Solving this problem is hard, so, instead, we show how to find some  and  that satisfy .4.2)-.4.4) .but the  that we find may not be the largest possible). To do so, we replace Fig. 8. Meeting QoS requirements with NFD-S. The probabilistic .4.4) with a simpler and stronger constraint and then behavior of heartbeats is not known, and clocks are synchronized. compute the optimal solution of this modified problem .see Appendix D). We obtain the following procedure to find  the distribution of the message delay and the message loss. and : However, we can provide a conservative bound on the

0 U optimal  that always holds regardless of the distribution: . Step 1: Compute q0 ˆ 1 À pL†Pr DTU ††;  max L L D f †ˆ : QdT U =eÀ1 0 D U where max is defined in Step 1 of the configuration procedure. q0 jˆ1 ‰pL ‡ 1 À pL†Pr D>TD À j†Š 4:5† 5DEALING WITH UNKNOWN MESSAGE BEHAVIOR Find the largest    such that f †T L . Such max MR In Section 4, our procedure to compute the parameters  an  always exists. To find such an , we can use a simple numerical method, such as binary search and  of NFD-S to meet some QoS requirements assumed .this works because, when  decreases, f † in- that one knows the probability pL of message loss and the creases exponentially fast). distribution Pr D  x† of message delays. This assumption U . Step 3: Set  ˆ TD À  and output  and . is not unrealistic, but, in some systems, the probabilistic behavior of messages may not be known. In that case, it is Theorem 7. Consider a system in which clocks are synchronized still possible to compute  and , as we now explain. We and the probabilistic behavior of messages is known. Suppose proceed in two steps: 1) We first show how to compute  we are given a set of QoS requirements as in 4.1). The above procedure has two possible outcomes: 1) It outputs  and .In and  using only pL, E D†, and V D† .recall that E D† and this case, with parameters  and , the failure detector NFD-S V D† are the expected value and variance of message of Fig. 6 satisfies the given QoS requirements. 2) It outputs delays, respectively); 2) we then show how to estimate pL, ªQoS cannot be achieved.º In this case, no failure detector can E D†, and V D†. In this section, we still assume that local achieve the given QoS requirements. clocks are synchronized .we drop this assumption in the next section). See Fig. 8. As an example of the configuration procedure of the failure detector, suppose we have the following QoS 5.1 Computing Failure Detector Parameters  and  requirements: 1) A crash failure is detected within Using pL, E D†, and V D† U 30 seconds, i.e., TD ˆ 30 s; 2) on average, the failure detector With E D† and V D†, we can bound Pr D>t† using the makes at most one mistake per month, i.e., following One-Sided Inequality of probability theory .e.g., see L TMR ˆ 30 days ˆ 2; 592; 000 s; 3) on average, the failure detector corrects its mistakes within one minute, i.e., [4, p. 79]): For any random variable D with a finite expected U value and a finite variance, TM ˆ 60 s. Assume that the message loss probability is pL ˆ 0:01, the distribution of message delay D is exponen- V D† Pr D>t† ; for all t>E D†: 5:1† tial, and the average message delay E D† is 0:02 s.By 2 inputting these numbers into the configuration procedure, V D†‡ t À E D†† we get  ˆ 20:03 s and  ˆ 9:97 s. With these parameters, With this, we can derive the following bounds on the our failure detector satisfies the given QoS requirements. QoS metrics of algorithm NFD-S: Note that the above procedure may not find the optimal .largest) possible  that satisfies the QoS .but, as Theorem 7 Theorem 9. Consider a system with synchronized clocks and states, the  found satisfies the QoS). How close is the  assume >E D†. For algorithm NFD-S, we have E TMR† found by our procedure to the optimal ? This depends on = and E TM †= , where 570 IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 5, MAY 2002

Yk0 V D†‡p  À E D†Àj†2 ˆ L ; take the same example as in Section 4, except that we do not 2 jˆ0 V D†‡  À E D†Àj† assume that the distribution of D is exponential. Specifi- cally, suppose that the failure detector QoS requirements k0 ˆd  À E D††=eÀ1; are that: 1) A crash failure is detected within 30 seconds, i.e., and U TD ˆ 30 s; 2) on average, the failure detector makes at most one mistake per month, i.e., T L ˆ 30 days ˆ 2; 592; 000 s; 1 À p †  À E D†‡†2 MR ˆ L : 2 3) on average, the failure detector corrects its mistakes V D†‡  À E D†‡† U within one minute, i.e., TM ˆ 60 s. Assume that the message Note that, in Theorem 9, we assume that >E D†, loss probability is pL ˆ 0:01, the average message delay where  is a parameter of NFD-S. This assumption is E D† is 0:02 s, and the variance V D† is also 0:02.By reasonable because if   E D†, then NFD-S would gen- inputting these numbers into the configuration procedure, erate a false suspicion every time the heartbeat message is we get  ˆ 20:29 s and  ˆ 9:71 s. With these parameters, delayed by more than the average message delay. But then, failure detector NFD-S satisfies the given QoS requirements. NFD-S would make too many mistakes to be a useful failure Note that, when we go from the case that the distribution of detector. D is known .example of Section 4) to the case that D is not Theorem 9 can be used to compute the parameters  and known,  decreases from 9:97 s to 9:71 s. This corresponds to  of the failure detector NFD-S so that it satisfies the QoS a slight increase in the heartbeat sending rate .in order to requirements given in .4.1). Recall that these QoS require- achieve the same given QoS). U L U U ments are given as a tuple TD ;TMR;TM †, where TD is an L 5.2 Estimating pL, E D†, and V D† upper bound on the worst-case detection time, TMR is a lower bound on the average mistake recurrence time, and It is easy to estimate pL, E D†, and V D† using heartbeat U TM is an upper bound on the average mistake duration. The messages. For example, to estimate pL, one can use the configuration procedure is given below. This procedure sequence numbers of the heartbeat messages to count the U assumes that TD >E D†, i.e., the required detection time is number of ªmissingº heartbeats and then divide this count greater than the average message delay .a reasonable by the highest sequence number received so far. To estimate assumption). E D† and V D†, we use the synchronized clocks as follows: When p sends a heartbeat m, p timestamps m with the . Step 1: Compute 0 ˆ 1 À p † T U À E D††2= V D†‡ L D sending time S and, when q receives m, q records the receipt T U À E D††2† and let  ˆ min 0 T U ;TU À E D††. D max M D time A. In this way, A À S is the delay of m. We then If  ˆ 0, then output ªQoS cannot be achievedº max compute the average and variance of A À S for multiple and stop; else continue. past heartbeat messages and, thus, obtain accurate esti- . Step 2: Let mates for E D† and V D†. f †ˆ de T U ÀE D††= À1 6DEALING WITH UNKNOWN MESSAGE BEHAVIOR D Y V D†‡ T U À E D†Àj†2  Á D : U 2 AND UNSYNCHRONIZED CLOCKS jˆ1 V D†‡pL TD À E D†Àj† So far, we assumed that the clocks of p and q are 5:2† synchronized. More precisely, in the algorithm NFD-S of L Fig. 6, q sets the freshness points  s by shifting the sending Find the largest   max such that f †TMR. Such i an  always exists. times of heartbeats by a constant. When clocks are not U synchronized, the local sending times of heartbeats at p . Step 3: Set  ˆ TD À  and output  and . Notice that the above procedure does not use the cannot be used by q to set the is and, thus, q needs to do it in a different way. The basic idea is that q sets the  sby distribution Pr D  x† of message delays; it only uses pL, i E D†, and V D†. shifting the expected arrival times of the heartbeats and q estimates the expected arrival times accurately .to compute Theorem 10. Consider a system in which clocks are synchronized these estimates, q does not need synchronized clocks). and the probabilistic behavior of messages is not known. Suppose we are given a set of QoS requirements, as in 4.1), 6.1 NFD-U: An Algorithm that Uses Expected Arrival U and suppose TD >E D†. The above procedure has two Times possible outcomes: 1) It outputs  and . In this case, with We now present NFD-U, a new failure detector algorithm parameters  and , the failure detector NFD-S of Fig. 6 for systems with unsynchronized clocks. The new algorithm satisfies the given QoS requirements. 2) It outputs ªQoS is very similar to NFD-S; the only difference is that q now cannot be achieved.º In this case, no failure detector can sets the is by shifting the expected arrival times of the achieve the given QoS requirements. heartbeats, rather than the sending times of heartbeats. We assume that local clocks do not drift with respect to real The above configuration procedure works when the time, i.e., they accurately measure time intervals. Let i distribution of the message delay D is not known .only denote the sending time of mi with respect to q's local clock. E D† and V D† are known). To illustrate this procedure, we Then, the expected arrival time of mi at q is CHEN ET AL.: ON THE QUALITY OF SERVICE OF FAILURE DETECTORS 571

Fig. 9. Failure detector algorithm NFD-U with parameters  and Bclocks are not synchronized, but the EAis are known).

EAi ˆ i ‡ E D†, where E D† is the expected message 6.2.1 Computing Failure Detector Parameters  and delay. Assume that q knows the EA s .we will soon show i Using pL and V D† how q can accurately estimate them). To set the is, q shifts By replacing  with E D†‡ in Theorem 9, we obtain the the EAis forward by time units .i.e., i ˆ EAi ‡ ), where following bounds on the accuracy metrics of NFD-U: is a new failure detector parameter that replaces . The Theorem 11. Consider a system with drift-free clocks and assume intuition here is that EAi is the time when mi is expected to >0. For algorithm NFD-U, we have E TMR†= and be received and is a slackadded to EAi to mitigate the 12 E TM †= , where effects of the possible extra delay or loss of mi. Fig. 9 shows the whole algorithm, denoted by NFD-U. Yk0 V D†‡p À j†2 ˆ L ;kˆd =eÀ1; and We restructured the algorithm a little to show explicitly 2 0 jˆ0 V D†‡ À j† when q uses the EAis. Variable ` keeps the largest heartbeat sequence number received so far and  refers to the 1 À p † ‡ †2 `‡1 ˆ L : ªnextº freshness point. Note that, when q updates `, it also V D†‡ ‡ †2 changes `‡1. If the local clockof q ever reaches time `‡1 .an event which might never happen), then, at this time, none of Note that the bounds given in Theorem 11 use only pL and the heartbeats received is still fresh and, so, q starts V D†; on the other hand, E D† is not used. suspecting p .lines 5-6). When q receives mj, it checks Theorem 11 can be used to compute the parameters  whether this is a new heartbeat .j>`) and, in this case, 1) q and of the failure detector NFD-U so that it satisfies some updates `,2)q sets the next freshness point `‡1 to QoS requirements. We assume the QoS requirements are u L U EA`‡1 ‡ , and 3) q trusts p if the current time is less than given as a tuple TD;TMR;TM † of positive numbers. The `‡1 .lines 9-11). Note that this algorithm is identical to requirements are that: NFD-S, except in the way in which q sets the  s. In i u L U TD  TD ‡ E D†;E TMR†TMR;E TM †TM : 6:1† particular, for any time t, let i be so that t 2‰i;i‡1†; then, with NFD-U, q trusts p at time t if and only if q has received Note that the upper bound on the detection time TD is heartbeat m or higher by time t. u u i not TD, but TD plus the unknown average message delay E D†. So, the actual upper bound T U on the detection time 6.2 Analysis and Configuration of NFD-U D

NFD-U and NFD-S differ only in the way they set the is: in NFD-S, i ˆ i ‡ , while, in NFD-U, i ˆ EAi ‡ ˆ i ‡ E D†‡ .the last equality holds because EAi ˆ i ‡ E D†). Thus, the QoS analysis of NFD-U is obtained by simply replacing  with E D†‡ in Proposition 3, Theorem 5, and Theorem 9. To configure the parameters  and of NFD-U to meet some QoS requirements, we use a method similar to the one in Section 5. We proceed in two steps: 1) We first show how to compute  and using only pL and V D† .note that E D† is not used); 2) we then show how to estimate pL and V D†. See Fig. 10. Fig. 10. Meeting QoS requirements with NFD-U. The probabilistic behavior of heartbeats is not known; clocks are not synchronized, but 12. To deal with the loss of mi, the slack must be large enough to allow q to ªcatchº the next heartbeat message mi‡1. they are drift-free. 572 IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 5, MAY 2002

u is TD ‡ E D†. In other words, the QoS requirement on detection time is not absolute as in .4.1), but relative to E D†. This is justified as follows: Note that, when local clocks are not synchronized and only one-way messages are U used, an absolute bound TD on detection time cannot be enforced by any nontrivial failure detector. Moreover, it is reasonable to specify an upper bound requirement relative to the average delay E D† of a heartbeat. In fact, a failure detector that guarantees to detect crashes faster than E D† makes too many mistakes to be useful. The following is the configuration procedure for algo- rithm NFD-U, modified from the one in Section 5. Fig. 11. Meeting QoS requirements with NFD-E Bsame as with NFD-U, 0 u 2 u 2 except that the expected arrival times EAis of heartbeats are . Step 1: Compute ˆ 1 À pL† TD† = V D†‡ TD† † 0 U u estimated). and let max ˆ min TM ;TD†.Ifmax ˆ 0, then out- put ªQoS cannot be achievedº and stop; else Let s ; ...;s be the sequence numbers of such messages continue. 1 n and A0 ; ...;A0 be their receipt times according to q's local . Step 2: Let 1 n clock. Then, EA`‡1 can be estimated by: deT u = À1 ! DY u 2 n V D†‡ TD À j† 1 X f †ˆ Á : 6:2† EA  A0 À s ‡ ` ‡ 1†: 6:3† V D†‡p T u À j†2 `‡1 n i i jˆ1 L D iˆ1 Find the largest    such that f †T L . Such 0 max MR Intuitively, this formula first ªnormalizesº each Ai by an  always exists. shifting it backward in time by si. Then, it computes the . Step 3: Set ˆ T u À  and output  and . 0 D average of the normalized Ais. Finally, it shifts forward the computed average by ` ‡ 1†. It is easy to see that this is a Theorem 12. Consider a system with unsynchronized, drift-free good estimate of EA`‡1. We denote by NFD-E the algorithm clocks, where the probabilistic behavior of messages is not obtained from Fig. 9 by replacing EA`‡1 with this estimate. known. Suppose we are given a set of QoS requirements, as Our simulations show that NFD-E and NFD-U are in 6.1). The above procedure has two possible outcomes: 1) practically indistinguishable for values of n as low as 30. It outputs  and . In this case, with parameters  and , Thus, for large values of n, the configuration procedure for the failure detector NFD-U of Fig. 9 satisfies the given QoS NFD-U can also be used to configure NFD-E. See Fig. 11. requirements. 2) It outputs ªQoS cannot be achieved.º In this case, no failure detector can achieve the given QoS 7SIMULATION RESULTS requirements. We simulate both the new failure detector algorithm that

6.2.2 Estimating pL and V D† we developed and the simple algorithm commonly used in practice .as described in Section 1.2). In particular, 1) we When local clocks are not synchronized, we can estimate pL simulate the algorithm NFD-S .the one with synchronized and V D† using the procedure of Section 5.2. To estimate pL, this procedure did not use clocks and, so, it works just as clocks) and show that the simulation results validate our before. For V D†, the procedure did use clocks, but it works QoS analysis of NFD-S in Section 3.3; 2) we simulate the even though the clocks are not synchronized. To see why, algorithm NFD-E .the one without synchronized clocks that recall that the procedure estimates V D† by computing the estimates the expected arrival times) and show that it variance of A À S of multiple heartbeat messages, where A provides essentially the same QoS as NFD-S; and 3) we is the time .with respect to q's local clocktime) when q simulate the simple algorithm and compare it to the new receives a message m and S is the time .with respect to p's algorithms NFD-S and NFD-E and show that the new local clocktime) when p sends m. When clocks are not algorithms provide a much better accuracy than the simple synchronized, A À S is not the actual delay of m, but rather algorithm. the delay of m plus a constant, namely, the skew between The settings of the simulations are as follows: For the the clocks of p and q. Thus, the variance of A À S is the same purpose of comparison, we normalize the intersending time as the variance V D† of message delays.  of heartbeat messages in both the new algorithm and the simple algorithm to 1. The message loss probability pL is set 6.3 NFD-E: An Algorithm that Uses Estimates of to 0:01. The message delay D follows the exponential Expected Arrival Times distribution .i.e., Pr D  x†ˆ1 À eÀx=E D† for all x  0). We Failure detector NFD-U .Fig. 9) assumes that q knows the choose the exponential distribution for the following two exact value of all the EAis .the expected arrival time of reasons: First, it has the characteristic that a large portion of messages). In practice, q may not know such values and messages have fairly short delays while a small portion of needs to estimate them. To do so, every time q executes messages have large delays, which is also the characteristic line 10 of algorithm NFD-U in Fig. 9, q considers the n most of message delays in many practical systems [14]; second, 0 0 recent heartbeat messages .for some n), denoted m1; ...;mn. it has a simple analytical representation which allows us CHEN ET AL.: ON THE QUALITY OF SERVICE OF FAILURE DETECTORS 573

Fig. 12. The average mistake recurrence times obtained by: Ba) simulating the new algorithms NFD-S and NFD-E Bshown by ‡ and Â), Bb) simulating the simple algorithm Bshown by ÀÀand ÀÅÀ), and Bc) plotting the analytical formula for E TMR† of the new algorithm NFD-S Bshown by Ð). to compare the simulation results with the analytical similar and 2) the simulation results of both algorithms results given in Theorem 5. The average message delay match the analytical formula for E TMR†. E D† is set to 0:02, which is a small value compared to 7.2 Simulation Results of the Simple Algorithm the intersending time . This corresponds to a system in which message delays are in the order of tens of The simple algorithm has no upper bounds on the detection milliseconds .typical for messages transmitted over the time. However, such an upper bound can be guaranteed Internet), while heartbeat messages are sent every few with a simple modification: The general idea is to discard seconds. Note that, since D follows an exponential heartbeats which have very large delays. More precisely, distribution, the standard deviation is  D†ˆE D†ˆ the modified algorithm has another parameter, the cutoff 0:02 and the variance is V D†ˆ D†2 ˆ 4  10À4. time c, such that all heartbeats delayed by more than c time 13 To compare the accuracy of different algorithms, we first units, called slow heartbeats, are discarded. With this set their parameters so that: 1) They send messages at the modification, the detection time TD is at most TD  c ‡ TO. Given a bound T U on the detection time, there is a trade- same rate .recall that  ˆ 1) and 2) they satisfy the same D off in setting the cutoff time c and the timeout value TO: bound T U on the detection time. We simulated runs for D The larger the cutoff time c, the smaller the number of slow values of T U ranging from 1 to 3.5 and, for each value of T U , D D heartbeats being discarded, but the shorter the timeout we measured the accuracy of the failure detectors in terms value TO, and vice versa. In our simulations, we choose two of the average mistake recurrence time E TMR† and the cutoff times, c ˆ 0:16 and c ˆ 0:08, i.e., eight and four times U average mistake duration E TM †. For each value of TD ,we the average message delay, respectively. The timeout TO is plotted E TMR† by considering a run with 500 mistake U set to TD À c. The algorithm with c ˆ 0:16 is denoted by recurrence intervals and computing the average length of SFD-L, and the one with c ˆ 0:08 is denoted by SFD-S. these intervals. We do not show the plots for E TM † because The simulation results on the average mistake recurrence the E TM † of all the algorithms were similar and bounded times of SFD-L and SFD-S .Fig. 12) show that the accuracy above by approximately  ˆ 1. of the new algorithms .with or without synchronized clocks) is betterÐsometimes by an order of magnitudeÐ 7.1 Simulation Results of NFD-S and NFD-E than the accuracy of the simple algorithm. Intuitively, this is U To ensure that NFD-S meets the given upper bound TD on because the use of a cutoff time to bound the detection time U U the detection time, we set  to TD À  ˆ TD À 1 .as in the simple algorithm is detrimental to its accuracy: If the prescribed by Theorem 5.1). In algorithm NFD-E, we choose simple algorithm uses a large cutoff time, then it must use a to estimate the expected arrival time using the most recent small timeout value and this decreases the accuracy of the 32 heartbeat messages. To ensure NFD-E meets the given failure detector; if it uses a small cutoff time, then it discards U U U more heartbeats and this is equivalent to an increase in the upper bound TD , we set ˆ TD À E D†À ˆ TD À 1:02. In Fig. 12, we show the simulation results for algorithms message loss probability; this in turn also decreases the NFD-S and NFD-E, together with the analytical formula of 13. This assumes that the algorithm can detect slow messages; this is not E TMR† derived in Section 3.3. These results show that: easy when local clocks are not synchronized, but a fail-aware datagram service 1) the accuracy of algorithms NFD-S and NFD-E are very [18] may be used. 574 IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 5, MAY 2002 accuracy of the failure detector .a detailed explanation of considers only the very latest messages and, thus, quickly the simulation results can be found in [13]). reacts to sudden changes in the network.e.g., due to bursty traffic) and 2) a long-term component that considers a longer time sample and, thus, is insensitive to momentary 8DISCUSSION fluctuations. One then combines the estimates from both 8.1 Making the Failure Detector Adaptive components, e.g., by selecting the most conservative one. 8.1.1 Dealing with Gradual Changes to Network Traffic This scheme is still under development [28] and is the In this paper, we assumed that the probabilistic behavior of subject of future research. heartbeat messages does not change. In some networks, this 8.2 Trade-Off between QoS and Cost may not be the case. For instance, a corporate networkmay It is important to distinguish between two distinct failure have one behavior during working hours .when the detector concepts: the QoS requirement of a failure detector, message traffic is high) and a completely different behavior i.e., its specification, and the cost of running the failure during lunch time or at night .when the system is mostly detector, i.e., the complexity of a particular failure detector idle): During peakhours, the heartbeat messages may have implementation .in terms of networkbandwidth, processor a higher loss rate, a higher expected delay, and a higher cycles, etc.). Even though these two concepts are distinct, variance of delay, than during off-peakhours. Such there is an obvious trade-off between the two: Achieving a networks require a failure detector that adapts to the better QoS may require a higher cost .e.g., it may require an changing conditions, i.e., it dynamically reconfigures itself increase in the heartbeat sending rate). Note also that a to meet some given QoS requirements. failure detector implementation may be inherently more For this kind of gradual changes in the probabilistic efficient than another, in the sense that it can provide the behavior of the network, we suggest the following way to same QoS with lower costs. make our failure detectors adaptive: For the case when Our results take into consideration the network band- clocks are synchronized, we make NFD-S adaptive by width cost .measured by , the heartbeat intersending time). periodically reexecuting the configuration outlined in Fig. 8. In particular, the analysis of our failure detector relates The basic idea is to periodically run the estimator, which networkbandwidth cost to the QoS achieved. Moreover, the uses the n most recent heartbeats to estimate the current optimality result of our failure detector implementation is values of pL;E D†, and V D†. These estimates are then fed with respect to implementations that have the same net- into the configurator to recompute the new failure detector workbandwidth cost .in terms of one-way heartbeat parameters  and . messages) and our configuration procedures try to find Similarly, when clocks are not synchronized, we can the failure detector parameters that yield the lowest cost make NFD-E adaptive by periodically reexecuting the possible. configuration outlined in Fig. 11. The only difference here Nevertheless, there are a number of possible directions is that the estimator also outputs EAiÐthe estimated arrival for further research. One interesting question is how can time of the next heartbeatÐwhich is input into the failure one meet a given QoS specification while minimizing the detector NFD-E. cost? We answered this question partially by focusing on The above adaptive algorithms form the core of a failure one type of failure detector implementations, namely, the detection service that is currently being implemented and ones based on one-way heartbeat messages. But, there are evaluated [15]. This service is intended to be shared among other types of implementationÐe.g., implementations many different concurrent applications, each with a based on two-way ping messages or on a logical rings of different set of QoS requirements. The failure detector in heartbeats. It remains an open question what failure this architecture dynamically adapts itself not only to detectors with what parameters achieve a given QoS with changes in the networkcondition, but also to changes in the absolute minimum cost. the current set of QoS demands .as new applications are Another research direction is to combine the QoS started and old ones terminate). specification with the cost requirements through a utility function. Such a function could include parameters, such as 8.1.2 Dealing with Bursty Traffic how costly is networkbandwidth and how desirable is a What if the networkconditions change very frequently due better QoS, and the goal is to maximize utility. One then to bursty traffic? The above ideas still workprovided that needs to come up with a more complex configuration 1) the occurrences of bursts are independent of each other procedure that takes utility into account. One possibility is and follow some slowly changing probabilistic distribution to use our current configuration procedure as a subroutine and 2) the duration of each burst is short .smaller than the in an iterative process that tries to find local utility maxima. sending period ). In this case, note that heartbeat messages This is definitely an open area that requires more work. behave independently of each other, according to some new slowly changing probability distribution that takes into account the occurrence of bursts. Thus, the situation is no APPENDIX A different than before. MAIN SYMBOLS USED IN THE PAPER What if 1) or 2) does not hold? Then, we need to use other techniques to attempt to estimate the current behavior System model of the networkbased on past behavior. One possibility is to pL message loss probability Section 3.1 use two components: 1) a short-term component that D message delay .random variable) Section 3.1 CHEN ET AL.: ON THE QUALITY OF SERVICE OF FAILURE DETECTORS 575

Primary QoS metrics Lemma 2. For all i  0 and all time t 2‰i;i‡1†, q trusts p at TD detection time .random variable) Section 2.2 time t if and only if q has received some message mj with j  i TMR mistake recurrence time Section 2.2 by time t. .random variable) Proof. Fix an i  0 and a time t 2‰i;i‡1†. Suppose first that T mistake duration .random variable) Section 2.2 M q has received some message mj with j  i by time t. Let 0 0 t  t be the time when mj is received. Choose i such 0 0 that t 2‰i0 ;i0‡1†. Thus, i  i  j. According to line 6 of 0 Derived QoS metrics the algorithm, q trusts p at time t . For every ` in the 0 0 M average mistake rate Section 2.3 period t ;tŠ, since mj is received at t and `  i  j, the PA query accuracy probability Section 2.3 output of the failure detector does not change to S, TG good period duration Section 2.3 according to lines 3-4. Therefore, q trusts p at time t. .random variable) Suppose now that q has not received any message mj TFG forward good period duration Section 2.3 with j  i by time t. Then, at time i, q suspects p .random variable) according to lines 3-4. During the period i;tŠ, since no message mj with j  i is received, the output of the QoS requirements failure detector does not change to T. So, q suspects p at U time t. tu TD upper bound on detection time Section 4 L TMR lower bound on average mistake Section 4 recurrence time We now analyze the accuracy metrics of the algorithm U NFD-S and, to do so, we assume that p does not crash. TM upper bound on average Section 4 mistake duration Definition 1. T u upper bound on detection time Section 6.2 D 1. For any i  1, let k be the smallest integer such that, relative to the average message for all j  i ‡ k, m is sent at or after time  . delay E.D) j i 2. For any i  1, let pj x† be the probability that q does not receive message mi‡j by time i ‡ x, for Failure detector algorithms every j  0 and every x  0; let p ˆ p 0†. NFD-S algorithm with synchronized clocks Section 3.2 0 0 3. For any i  2, let q0 be the probability that q receives NFD-U algorithm with unsynchronized clocks Section 6.1 message miÀ1 before time i. and known expected arrival times of 4. For any i  1, let u x† be the probability that q heartbeats suspects p at time i ‡ x, for every x 2‰0;†. NFD-E algorithm with unsynchronized clocks Section 6.3 5. For any i  2, let ps be the probability that an and estimates of expected arrival S-transition occurs at time i. times of heartbeats Proposition 3. Variables of failure detector algorithms 1. k ˆd=e. i sending time of the ith heartbeat at p Section 3.2 2. For all j  0 and for all x  0, i time of the ith freshness point at q Section 3.2 EAi expected arrival time of the ith Section 6.1 p x†ˆp ‡ 1 À p †Pr D>‡ x À j†: heartbeat at q .in Algorithm NFD-U) j L L

Parameters of failure detector algorithms 3. q0 ˆ 1 À pL†Pr D<‡Q†. 4. For all x 2‰0;†, u x†ˆ k p x†.  intersending time between two Section 3.2 jˆ0 j consecutive heartbeats 5. ps ˆ q0 Á u 0†.  shift of freshness point w.r.t. Section 3.2 As stated in Section 3.3, we only consider the non-

heartbeat sending time, i.e., i ˆ i ‡  degenerated cases in which p0 > 0 and q0 > 0. shift of freshness point w.r.t. Section 6.1 Proposition 13. 1) An S-transition can only occur at time i for heartbeat expected arrival time, i.e., some i  2 and it occurs at i if and only if message miÀ1 is i ˆ i ‡ E D†‡ received by q before time i and no message mj with j  i is received by q by time i; 2) Lemma 2 remains true if j  i in the statement is replaced by i  j  i ‡ k; 3) part 1) above APPENDIX B remainstrueifj  i in the statement is replaced by PROOF OF THEOREM 5 i  j

Proof. For all i  1, let Pi be the probability that, at any Lemma 17. f TMR;n;TM;n†;nˆ 1; 2; ...g is a delayed renewal random time T 2‰i;i‡1†, q is suspecting p. Note that T reward process. is uniformly distributed on ‰i;i‡1† with density 1= i‡1 À i†ˆ1=. Thus, We omit the proof of the lemma here since it is technical Z Z 1 i‡1 1  and lengthy. It can be found in [13]. P ˆ u x À  † dx ˆ u x† dx: i  i  It is immediate from the above lemma that, for any n  2, i 0 the joint probability distribution of TMR;n;TM;n† is identical Note that the value of Pi does not depend on i. Let this to that of T ;T †. This provides a simple way to analyze value be P. Thus, we have that P , the probability that q MR M A the QoS metrics T , T , and T . Moreover, the above trusts p at a random time, is 1 À P. This shows the MR M G lemma. tu lemma directly leads to the following result: Lemma 4. NFD-S is an ergodic failure detector. We now analyze the average mistake recurrence time Proof .sketch). With the rigorous definition of the ergodi- E T † of the failure detector. We will show that: MR city of failure detectors in [13], with Lemma 17, and with Lemma 16. E TMR†ˆ=ps. the fact that any delayed renewal reward process is ergodic .see, for example, Section 2.6 of [27]), it follows If, at each time point i with i  2, the test of whether an S-transition occurs were an independent Bernoulli trial, that NFD-S is ergodic. tu then the above result would be very easy to obtain: ps is the probability of success in one Bernoulli trial, i.e., an The above corollary implies that Theorem 1 is applicable S-transition occurs at i, and  is the time between two to our failure detector. We now prove Lemma 16 using Bernoulli trials and, so, =ps is the expected time between Lemma 17. two successful Bernoulli trials, which is just the expected Proof of Lemma 16. For all i  2, let A be the event that an time between two S-transitions. Unfortunately, this is not i the case because the tests of whether S-transitions occur at S-transition occurs at time i. By definition and Proposi- tion 14, we have that Pr A †ˆp ˆ q Á u 0†q Á pk. is are not independent. In fact, by Proposition 13, the event i s 0 0 0 that an S-transition occurs at i dependents on the behavior Since, in the nondegenerated case q0 > 0 and p0 > 0,we of messages mi; ...;mi‡kÀ1. Thus, two such events may have Pr Ai† > 0. By Proposition 13.3, Ai is also the event depend on the behavior of common messages and, so, they that miÀ1 is received before time i, but no message mj are not independent in general. with i  j

Now, consider S-transitions of the failure detector output Hence, we have, for t  k‡2, as the recurrent events. Let T be the random variable MR;n  representing the time that elapses from the n À 1†th Xk‡2 t À  E N t†† ˆ p j ‡ 1 : S-transition to the nth S-transition .as a convention, s k ‡ 1† consider time 0 to be the time at which the 0th S-transition jˆ2 occurs). Let TM;n be the random variable representing the By Lemma 17, fTMR;n;n 1g is a delayed renewal time that elapses from the n À 1†th S-transition to the nth process. Then, by the Elementary Renewal Theorem .see, T-transition. Thus, TM;n  TMR;n for all n  1. for example, [26, p. 61]), CHEN ET AL.: ON THE QUALITY OF SERVICE OF FAILURE DETECTORS 577 t t E TMR†ˆ lim ˆ lim P jk APPENDIX C t!1 E N t†† t!1 k‡2 p tÀj ‡ 1 jˆ2 s k‡1† PROOF OF THEOREM 6 1 ˆ jk A message delay pattern PD is a sequence fd1;d2;d3; ...g with Pk‡2 tÀ j di 2 0; 1Š representing the delay time of the ith message jˆ2 ps limt!1 k‡1† ‡ 1 =t sent by p; di ˆ1 means that the ith message is lost. The 1  ˆ ˆ : distribution of message delay patterns is governed by the Pk‡2 p s ps jˆ2 k‡1† message loss probability pL and the message delay time D tu and, thus, it is the same for all algorithms in C. We first consider a subclass C0 of C such that, for any 0 By Lemma 16, we know that 0 0 Since q0 > 0, with nonzero probability, mi is received 14 such that the output is also X in the period t; t ‡ †. before i‡1 and, thus, q trusts p just before i‡1. In these à cases, the detection time is i ‡  ‡  À t. Since the time t Lemma 19. Given any message delay pattern PD, let r be the à when p crashes can be arbitrarily close to i, we have that failure-free run of A with PD and let r be a failure-free run of the bound  ‡  is tight. tu 0 U some algorithm A 2C with PD. Then, for every time t  TD , Theorem 5. Consider a system with synchronized clocks, where if q suspects p at time t in run rÃ, then q suspects p at time t in the probability of message losses is pL and the distribution of run r. message delays is Pr D  x†. The failure detector NFD-S of Proof. Suppose that, in run rà of AÃ, q suspects p at time Fig. 6 with parameters  and  has the following properties: U U t  TD . Note that TD ˆ  ‡  ˆ 1,sot  1. Suppose t 2 ‰ ; † for some i  1. By Lemma 2, in run rÃ, q does not 1. The detection time is bounded as follows and the bound i i‡1 receive any message m with j  i by time t. Since, in run is tight: j r, p sends messages at the same times as in run rà and

TD   ‡ : both runs have the same message delay pattern PD, then, in run r, by time t, q does not receive any message sent 2. The average mistake recurrence time is: by p at time j with j  i. Consider first that t 2 i;i‡1†. Suppose, for a contra-  E T †ˆ : diction, that q trusts p at time t in run r. Let  ˆ t À i and MR 0 ps let t ˆ i À 1† ‡ =2. Thus >0. We now construct a new run r0 of A as follows: 1) p crashes at time t0; 2) before 0 3. The average mistake duration is: t , p sends the same messages at the same times as in run r; this is possible because p's state before its crash is R  0 u x† dx independent of its crash in the future; 3) the delays and E TM †ˆ : losses of all messages sent before t0 are the same as in run ps r; this is possible because the message loss and delay Proof. Parts 1 and 2 of the theorem are direct from Lemmas behaviors of messages sent by p are independent of p's 18 and 16. Part 3 is derived from the relation between crash. Note that, for messages to be sent after time t0,in run r, none of them is received by q by time t and, in run E TM †, PA, and E TMR†, as given in part 2 of Theorem 1 r0, they are not sent since p crashes at time t0. Therefore, and the results on PA and E TMR† as given in Lemmas 15 and 16. tu in run r0, up to time t, q receives the same messages at the same times as in run r. Thus, q cannot distinguish run r0 from r at time t and, so, q trusts p at time t in run r0 as in run r. The detection time in run r0, however, is at least

0 t À t ˆ i ‡ †À i À 1† ‡ =2† 14. Even though q0 is defined with respect to runs in which p does not crash, it also applies to runs in which p crashes since the system behavior U U ˆ  ‡  ‡ =2 ˆ TD ‡ =2 >TD ; before p crashes is independent of if and when p crashes in the future. 578 IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 5, MAY 2002

U contradicting the assumption that A satisfies TD  TD . APPENDIX D Now, suppose t ˆ  . Since the failure detector output i PROOF OF THEOREM 7 AND PROPOSITION 8 is right continuous, there exists >0 such that q suspects p in the period t; t ‡ † in run rÃ. Then, by the above Proposition 21. If p0 > 0 and q0 > 0 the nondegenerated case), argument, q suspects p in the period t; t ‡ † in run r.By then E TM †=q0. the right continuity again, we have that q suspects p at Proof. By Proposition 14, we have, for all x 2‰0;†, time t in run r. tu u 0†u x†. Thus, by equality .3.3), we have 0 Corollary 20. For any A 2C, let PA be the query accuracy R  R  à 0 u x† dx 0 u 0† dx  probability of A. Let PA be the query accuracy probability of E TM †ˆ  ˆ : à à p q u 0† q A . Then, P  PA. s 0 0 A tu Proof .sketch). We first fix a message delay pattern PD. For the run rà of Aà and any run r of A with message delay U pattern PD, Lemma 19 shows that, for any time t  TD ,if Theorem 7. Consider a system in which clocks are synchronized q suspects p in rà at time t, then q suspects p in r at time t. and the probabilistic behavior of messages is known. Suppose Thus, given a fixed message delay pattern PD, at any we are given a set of QoS requirements as in 4.1). The random time t, the probability that q trusts p at time t in procedure in Section 4 has two possible outcomes: 1) It outputs algorithm Aà is at least as high as the probability that q  and . In this case, with parameters  and , the failure à trusts p at time t in algorithm A. So, PA  PA given a detector NFD-S of Fig. 6 satisfies the given QoS requirements. fixed message delay pattern PD. When summing .or 2) It outputs ªQoS cannot be achieved.º In this case, no failure integrating) both sides of the inequality over all message detector can achieve the given QoS requirements. delay patterns according to their distribution, we have Proof. We prove the theorem in the following three parts: à PA  PA. tu 1. Suppose that the procedure outputs para- The above corollary shows that the new algorithm Aà has meters  and . Then, by Step 3, we have the best query accuracy probability in C0, the class of failure U U TD ˆ  ‡ . By part 1 of Theorem 5, TD  TD detector algorithms in which p sends messages at exactly is satisfied. By Step 1 and Proposition 3, à 0 the same times as in A . We now generalize this result to q0 ˆ 1 À pL†Pr D<‡ †ˆq0. Note that we class C, where p still sends messages every  time units, but have q0 > 0 since, otherwise, max ˆ 0 and the it may do so at times different from those in AÃ. procedure would output ªQoS cannot be A message sending pattern PS is a sequence of time points achievedº instead of  and . Consider first f1;2;3; ...g at which p sends messages. The message that p0 > 0. Then, by Proposition 21, U U sending pattern is determined by the algorithm. For any E TM †=q0  max=q0 ˆ q0TM =q0 ˆ TM .So, U algorithm A 2C, its message sending pattern is in the form E TM †TM is satisfied. Note that fs; s ‡ ; s ‡ 2; ...g for some s 2‰0; 1†. Different runs of dT U =eÀ1 algorithm A may have different message sending patterns DY U à ‰pL ‡ 1 À pL†Pr D>TD À j†Š due to the possible nondeterminism of A. Let As be the algorithm in which p sends heartbeat messages according to jˆ1 the sending pattern fs; s ‡ ; s ‡ 2; ...g and q behaves the d ‡Y†=eÀ1 à à à ˆ ‰pL ‡ 1 À pL†Pr D>‡  À j†Š same way as in A . Thus, As is a shifted version of A and, à jˆ1 so, the behavior of the failure detector output in As is also a shifted version of that of AÃ. Since the behaviors of the two d=YeÀ1 ˆ ‰p ‡ 1 À p †Pr D>À j†Š failure detectors only differ in some initial period, their L L jˆ0 steady state behaviors are the same. Therefore, the QoS d=e metrics of Aà and Aà are the same. In particular, they have Y s ˆ ‰pL ‡ 1 À pL†Pr D>À j†Š ˆ u 0†: the same query accuracy probability. jˆ0 Theorem 6. For any A 2C, let P be the query accuracy A Thus, f †ˆ= q u 0†† ˆ =p ˆ E T † by .3.2). probability of A. Let P à be the query accuracy probability of 0 s MR A By Step 2, f †T L and, so, E T †T L is AÃ. Then, P à  P . MR MR MR A A satisfied. Proof .sketch). We first fix a message sending pattern Consider now that p0 ˆ 0. This is the degener- PS ˆfs; s ‡ ; s ‡ 2; ...g. For any algorithm A 2C, con- ated case, where q trusts p forever after time 1 sider the runs of A with the sending pattern PS. In these and, so, E TMR†ˆ1 and E TM †ˆ0. Thus, the runs, p sends messages at exactly the same times as in requirements in .4.1) are also satisfied. à algorithm As. By a similar argument as in Lemma 19 and 2. Suppose that the procedure outputs ªQoS cannot Corollary 20, we can show that the query accuracy be achieved.º Then, the procedure stops at Step 1 à 0 probability of As is at least as high as the query accuracy and, thus, max ˆ 0. This implies q0 ˆ 0 .since U probability of A given the message sending pattern PS. TM > 0), which in turn implies that either pL ˆ 1 à à U Since As and A havethesamequeryaccuracy or Pr D

U U   max= pL ‡ 1 À pL†Pr D>T ††; claim that, at any time t>TD , any failure detector D has to suspect p. In fact, since all messages q has where  is defined in Step 1 of the configuration procedure. U max received by time t are sent before time t À TD , q does not obtain any information about whether p Proof. By Theorem 5, it is easy to see that, for failure U U detector NFD-S, constraint .4.1) can be restated as crashes at time t À TD . Thus, to satisfy TD  TD , q has to suspect p at time t. Hence, for any failure constraints .4.2)-.4.4). By Proposition 3, for all x 2‰0;†, detector, we have E T †ˆ1and, thus, it fails to Qk Qk QkÀ1 M u x†ˆ pj x† pj †ˆp0 † pj 0†.Thus, satisfy E T †T U . Therefore, no failure detector jˆ0 jˆ0 jˆ0 M M we have can satisfy the given QoS in this case. R 3. We now show that the procedure only has two R   QkÀ1 u x† dx p0 † pj 0† dx  Á p †  Á p † possible outcomes: It either outputs parameters  0  0 Q jˆ0 ˆ 0  0 : p k q Á p 0† q and  or outputs ªQoS cannot be achieved.º To s q0 Á jˆ0 pj 0† 0 k 0 show this, it is enough to show that if Step 1 of the By Proposition 3 and constraint .4.2), we have procedure succeeds, then Step 2 always succeeds L in finding an  such that f †TMR. p0 †ˆpL ‡ 1 À pL†Pr D>‡ † Let r x†ˆp ‡ 1 À p †Pr D>TU À x†,and U L L D  pL ‡ 1 À pL†Pr D>TD †: QdT U =eÀ1 s †ˆ D r j†. Thus, f †ˆ= q0 s ††.It jˆ1 0 Then, by constraint .4.4) and the fact that 0 U U is easy to see that, for all x, 0  r x†1, and for max ˆ q0TM ˆ q0TM , we have all x1  x2, r x1†r x2†. q T U    0 M  max : We first claim that there exists an >0 such U p0 † pL ‡ 1 À pL†Pr D>TD † that r † < 1. Indeed, since max > 0, we have 0  tu U pL < 1 and Pr D 0. Thus, there must be U an >0 such that Pr D  TD À † > 0. Other- wise, we would have that, for all >0, Pr D  ACKNOWLEDGMENTS T U À †ˆ0 and, since the probability measure is D The authors would like to thank Carole Delporte-Gallet, continuous from below .see, e.g., Theorem 2.1 .i) Hugues Fauconnier, and the anonymous referees for their of [8]), we would have Pr D

U Foundation grant CCR-9711403 and an Olin Fellowship. r †ˆpL ‡ 1 À pL†Pr D>TD À † U ˆ pL ‡ 1 À pL† 1 À Pr D  TD À †† < 1: REFERENCES Let  ˆ 1 À r †. Thus, 0 < 1. Let [1] M.K. Aguilera, W. Chen, and S. Toueg, ªUsing the Heartbeat Failure Detector for Quiescent Reliable Communication and U 1 ˆ min ; max;TD =2†: Consensus in Partitionable Networks,º Theoretical Computer Science, vol. 220, no. 1, pp. 3-30, June 1999. Let i ˆ 1=i for i ˆ 1; 2; 3; .... We have [2] M.K. Aguilera, W. Chen, and S. Toueg, ªFailure Detection and Consensus in the Crash-Recovery Model,º Distributed Computing, diT U = eÀ1 vol. 13, no. 2, pp. 99-125, Apr. 2000. DY1 s  †ˆ r j=i Á  † [3] M.K. Aguilera, W. Chen, and S. Toueg, ªOn Quiescent Reliable i 1 Communication,º SIAM J. Computing, vol. 29, no. 6, pp. 2040-2073, jˆ1 Apr. 2000. Yi Yi [4] A.O. Allen, Probability, , and Queueing Theory with  r j=i Á 1† r 1† Computer Science Applications, second ed. Academic Press, 1990. jˆ1 jˆ1 [5] Y. Amir, D. Dolev, S. Kramer, and D. Malkhi, ªTransis: A Communication Sub-System for High Availability,º Proc. 22nd Yi i Ann. Int'l Symp. Fault-Tolerant Computing, pp. 76-84, July 1992.  r †ˆ 1 À † : [6] K. Arvind, ªProbabilistic ClockSynchronization in Distributed jˆ1 Systems,º IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 5, pp. 475-487, May 1994. Therefore, [7] O. Babaoglu, R. Davoli, L.-A. Giachini, and M.G. Baker, ªRelacs: A Communications Infrastructure for Constructing Reliable Appli- 0 0 i f i†ˆi= q0s i††  1= q0i 1 À † †!1 cations in Large-Scale Distributed Systems,º BROADCAST Project deliverable report, Dept. of Computing Science, Univ. of New- when i !1. Hence, there is always some i such castle upon Tyne, U.K., 1994. that f  †T L .Wealsoseethat,wheni [8] P. Billingsley, Probability and Measure, third ed. John Wiley & Sons, i MR 1995. increases linearly .i decreases linearly), f i† [9] Reliable Distributed Computing with the Isis Toolkit, K.P. Birman and increases exponentially. tu R. van Renesse, eds. IEEE CS Press, 1993. [10] Requirements for Internet Hosts-Communication Layers, R. Braden, ed., RFC 1122, Oct. 1989. Proposition 8. To satisfy the QoS constraint 4.1) with NFD-S, [11] T.D. Chandra, V. Hadzilacos, S. Toueg, and B. Charron-Bost, ªOn the Impossibility of Group Membership,º Proc. 15th ACM Symp. parameter  has to satisfy Principles of Distributed Computing, pp. 322-330, May 1996. 580 IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 5, MAY 2002

[12] T.D. Chandra and S. Toueg, ªUnreliable Failure Detectors for Wei Chen received the BEng and MEng Reliable Distributed Systems,º J. ACM, vol. 43, no. 2, pp. 225-267, degrees from the Department of Computer Mar. 1996. A preliminary version appeared in Proc. 10th ACM Science and Technology at Tsinghua University Symp. Principles of Distributed Computing, pp. 325-340, Aug. 1991. in China and the PhD degree from the Depart- [13] W. Chen, ªOn the Quality of Service of Failure Detectors,º PhD ment of Computer Science at Cornell University. thesis, Cornell Univ., May 2000, available at www.cs.cornell.edu/ He is a senior member of the technical staff at home/weichen/research/mypapers/thesis.ps.gz. Oracle Corporation. His research interests in- [14] F. Cristian, ªProbabilistic ClockSynchronization,º Distributed clude distributed computing and fault tolerance. Computing, vol. 3, no. 3, pp. 146-158, 1989. [15] B. Deianov and S. Toueg, ªFailure Detector Service for Depend- able Computing .Fast Abstract),º Proc. 2000 Int'l Conf. Dependable Systems and Networks, pp. B14-B15, June 2000. [16] D. Dolev, R. Friedman, I. Keidar, and D. Malkhi, ªFailure Sam Toueg received the BSc degree from the Detectors in Omission Failure Environments,º Technical Report Technion Israel Institute of Technology and the 96-1608, Dept. of Computer Science, Cornell Univ., Ithaca, N.Y., PhD degree from the Computer Science Depart- Sept. 1996. ment at Princeton University. He was a profes- [17] C. Fetzer and F. Cristian, ªFail-Aware Failure Detectors,º Proc. sor in the Department of Computer Science of 15th Symp. Reliable Distributed Systems, pp. 200-209, Oct. 1996. Cornell University from 1981 to 2001. He also [18] C. Fetzer and F. Cristian, ªA Fail-Aware Datagram Service,º Proc. served as the Chair of the Departement d'Infor- Second Ann. Workshop Fault-Tolerant Parallel and Distributed matique, at the Ecole Polytechnique in France. Systems, Apr. 1997. He is now a professor in the Computer Science [19] C. Fetzer and F. Cristian, ªA Fail-Aware Membership Service,º Department of the University of Toronto. His Proc. 16th Symp. Reliable Distributed Systems, pp. 157-164, Oct. 1997. research area is distributed computing. [20] M.G. Gouda and T.M. McGuire, ªAccelerated Heartbeat Proto- cols,º Proc. 18th Int'l Conf. Distributed Computing Systems, May 1998. [21] R. Guerraoui, M. Larrea, and A. Schiper, ªNon Blocking Atomic Marcos Kawazoe Aguilera holds the BEng Commitment with an Unreliable Failure Detector,º Proc. 14th IEEE degree in computer science from the Uni- Symp. Reliable Distributed Systems, pp. 41-50, Sept. 1995. versidade Estadual de Campinas in Brazil [22] M.G. Hayden, ªThe Ensemble System,º PhD thesis, Dept. of and the PhD degree in computer science Computer Science, Cornell Univ., Jan. 1998. from Cornell University. He is a member of [23] L.E. Moser, P.M. Melliar-Smith, D.A. Argarwal, R.K. Budhia, and the research staff at Compaq's Systems C.A. Lingley-Papadopoulos, ªTotem: A Fault-Tolerant Multicast Research Center in Palo Alto, California. His Group Communication System,º Comm. ACM, vol. 39, no. 4, research interests include distributed algo- pp. 54-63, Apr. 1996. rithms and fault-tolerant computing. [24] G.F. Pfister, In Search of Clusters, second ed. Prentice Hall, 1998. [25] M. Raynal and F. Tronel, ªGroup Membership Failure Detection: A Simple Protocol and Its Probabilistic Analysis,º Distributed Systems Eng. J., vol. 6, no. 3, pp. 95-102, 1999. [26] S.M. Ross, Stochastic Processes. John Wiley & Sons, 1983. [27] K. Sigman, Stationary Marked Point Processes, an Intuitive Approach. . For more information on this or any computing topic, please visit Chapman & Hall, 1995. our Digital at http://computer.org/publications/dlib. [28] S. Toueg and D. Ivan, private communication, May 2001. [29] R. van Renesse, K.P. Birman, and S. Maffeis, ªHorus: A Flexible Group Communication System,º Comm. ACM, vol. 39, no. 4, pp. 76-83, Apr. 1996. [30] R. van Renesse, Y. Minsky, and M. Hayden, ªA Gossip-Style Failure Detection Service,º Proc. '98, Sept. 1998. [31] P. VerõÂssimo and M. Raynal, ªTime in Distributed System Models and Algorithms,º Advances in Distributed Systems: Advanced Distributed Computing from Algorithms to Systems, S. Krakowiak and S. K. Shrivastava, eds., chapter 1, Springer-Verlag, 2000.