Adaptare-FD: a Dependability-Oriented Adaptive Failure Detector

Adaptare-FD: A dependability-oriented adaptive failure detector Monicaˆ Dixit and Antonio´ Casimiro Department of Informatics Faculty of Sciences, University of Lisboa Lisboa, Portugal mdixit,[email protected] Abstract—Unreliable failure detectors are a fundamental abstracting from the specific implementation. Building on building block in the design of reliable distributed systems. the basic Push and Pull styles of algorithm structuring, some But unreliability must be bounded, despite the uncertainties works focus on algorithmic techniques to improve perfor- affecting the timeliness of communication. This is why it is important to reason in terms of the quality of service (QoS) of mance [4], others on the provision of adequate interfaces failure detectors, both in their specification and evaluation. to support the varying requirements of applications [5], [6], We propose a novel dependability-oriented approach for and others on methods for detecting environment changes specifying the QoS of failure detectors, and introduce Adaptare- and adapting parameters accordingly [3], [7]–[10]. FD, an autonomous and adaptive failure detector that executes In this paper we propose Adaptare-FD, which advances on according to this new specification. The main distinguishing features of Adaptare-FD with respect to existing adaptive failure existing work by introducing a new dependability-oriented detection approaches are discussed and explained in detail. approach for the specification of the failure detector QoS A comparative evaluation of Adaptare-FD is presented. We and by integrating a recently developed framework for the highlight the practical differences between our approach and dependable characterization of environments with varying the well known Chen et al. approach for the specification stochastic behavior [11], [12]. We claim that our approach of QoS requirements. We show that Adaptare-FD is easily configured, independently of the specific network environment. is well suited and can be easily integrated in the design Furthermore, the results obtained using the PlanetLab plat- of dependable systems. We discusse the relative merits of form indicate that Adaptare-FD outperforms other timeout- Adaptare-FD, based on a comparative analysis with other based solutions, combining versatility with improved QoS and approaches. In particular, we focus on the comparison with dependability assurance. Chen’s approach [3], to reveal subtle differences that are not Keywords-dependability; adaptation; failure detection; easily observed in a simple outlook. An evaluation of Adaptare-FD is also provided, based I. INTRODUCTION on experiments performed in the PlanetLab platform [13]. Unreliable failure detectors (FDs) are a fundamental Several results are presented, which allow to compare building block in the design of reliable distributed sys- Adaptare-FD with other timeout-based and adaptive failure tems [1]. One important aspect is that they encapsulate the detectors [7]–[9]. The results show that Adaptare-FD is able temporal uncertainties observed in asynchronous systems, to perform better, especially in more dynamic environments. freeing the system designer from the need to deal with them. A particularly interesting result is that the well known trade- However, as shown in [2], the actual implementation of the off between the mistake recurrence time (TMR) and the FD is fundamental to the overall system performance. mistake duration (TM ) can be reduced with Adaptare-FD. Two generic performance-related attributes of failure de- An additional advantage of our solution is that it is tectors are their speed (how fast they detect a failure) and fully adaptive, even when the stochastic behavior of the their accuracy (how well they avoid making mistakes). With environment changes. While Adaptare-FD is like a plug- timeout-based failure detectors, as we consider in this paper, and-play solution, the remaining approaches require some these attributes depend on the timeout values and on the a priori configuration, either because they assume a given period of heartbeats. Therefore, a fundamental problem in stochastic behavior, or because they define static parameters the implementation of failure detectors in asynchronous and according to the expected conditions. They are thus unable evolving environments concerns the configuration of the to dependably adapt to significant environment changes. operational parameters, which involves the need to handle The paper is organized as follows. In the next section we (non-functional) user-level requirements and the ability to describe the assumed system model and the related adaptive characterize the state of the operational environment. failure detectors considered in our evaluation. In Section III Different facets of the problem of configuring and build- we introduce Adaptare-FD, explaining its operation. In Sec- ing adaptive failure detectors have been previously addressed tion IV we discuss why our approach is more dependability- by many authors. The work in [3] introduced a set of metrics oriented. Section V presents the practical evaluation of for evaluating (and specifying) the QoS of failure detectors, Adaptare-FD. Finally, in Section VI we conclude the paper. Process p: B. Chen’s failure detector upon receive “are you alive?” message mqi from q: send “I’m alive” message mi to q The first systematic study of the QoS of failure detectors Process q: was presented in [3]. In this work, Chen et. al. define a Initialization: set of metrics to quantify the QoS of failure detectors, output ! S which are independent of their implementations. Then, they " ! compute_" //Initialize interrogation period # ! compute_# //Initialize timeout propose a failure detector that is configured according to lastReceived ! -1 some expected message behavior and to the required QoS. $0 ! currentTime More specifically, the messages behavior is described by for all i%1, at time $i = $i-1 + " send “are you alive?” message mqi to p the message loss probability pL, and by the expected value for all i%1, at time &i = $i + # (E(D)) and the variance (V (D)) of message delays. The if lastReceived < i then QoS of the failure detector is specified by three metrics: an output ! S U upper bound on the detection time (T ), a lower bound on upon receive “I’m alive” message m at time t: D j L if j > lastReceived then the average mistake recurrence time (TMR) and an upper output ! T bound on the average mistake duration (T U ). " ! compute_" //Adapting interrogation period M # ! compute_# //Adapting timeout Figure 2 (from [3]) presents a schematic view of Chen’s lastReceived ! j FD, corresponding to the version that more closely matches our proposal. The authors mention that an adaptive version may be implemented by re-executing the configurator, lead- Figure 1. Failure detection algorithm (process q monitoring process p) ing to varying values for the period (η) and timeout (δ). Estimator of the probabilistic behavior of message delays II. BASIC CONTEXT pL E(D) V(D) A. System model QoS requirements Configurator U L U TD , TMR , TM We consider an asynchronous distributed system with η δ a finite set of processes Π = fp; q; :::; sg. Processes are interconnected trough unreliable channels, which can loose Chen’s failure detector messages or discard corrupted messages. Messages can also Figure 2. Schematic view of Chen’s failure detector be arbitrarily delayed due to unbounded transmission and/or processing times. Regarding processes, we assume that they only fail by crashing, but otherwise behave correctly. C. Other timeout estimation methods We consider a pull-style crash failure detection model Most of the adaptive failure detectors described in the where a process q (running in one node) monitors a process literature are based on fixed interrogation periods, and a p (in another node), by periodically sending it “are you common approach to estimate timeouts: the use of an alive?” messages. Process p responds to every received estimator, plus a safety margin. “are you alive?” message with a corresponding “I’m alive” Estimators can vary from simple methods, like using the message. Depending on the reception times of p’s responses, delay of the last received message [9] or the average delay the output of the failure detector may be T (trust), or S of the last n received messages [7]–[9], to more elaborated (suspect). Since we adopt a pull-style failure detector model, approaches, e.g. applying an ARIMA (autoregressive inte- no assumptions about synchronized clocks are needed. grated moving average) model to build a time series from Figure 1 shows the basic algorithm that is typically used the last delays, and using this model to predict the delay of in pull-style failure detectors, which we also adopted. The the next message, also studied in [8], [9]. distinguishing factor between the different adaptive failure Safety margins may be static or dynamic. Static safety detector approaches lies in the solutions used for computing margins, as used in [3], although simpler, require some a the interrogation period, η, and the timeout, δ. priori analysis of the execution environment in order to be These two parameters are dynamically adapted during appropriately defined. The flexibility provided by dynamic the execution, depending on the required QoS (initially and safety margins [7] makes them more suitable to network statically defined) and on the observed behavior of the com- environments susceptible to frequent changes. munication environment (which may change). Therefore,

Adaptare-FD: a Dependability-Oriented Adaptive Failure Detector

Failure Detectors for Wireless Sensor-Actuator Systems

The Dynamic Enterprise Bus∗

Low-Overhead Accrual Failure Detector

A Literature Review of Failure Detection Within the Context of Solving the Problem of Distributed Consensus

Failure Detectors for Wireless Sensor-Actuator Systems Hamza A

INFORMATION to USERS This Manuscript Has Been Reproduced from the Microfilm Master. UMI Films the Text Directly from the Origina

Self-Healing Distributed Systems

A Self-Tuning Failure Detection Scheme for Cloud Computing Service

The Failure Detector Abstraction

AVR1003: Using the XMEGA Clock System

Distributed Algorithms

Failure Detectors Outline