A Comprehensive Study on Failure Detectors of Distributed Systems

Volume 64, Issue 2, 2020
Journal of Scientific Research
Institute of Science, Banaras Hindu University, Varanasi, India.

Bhavana Chaurasia* and Anshul Verma
Department of Computer Science, Banaras Hindu University, Varanasi, India.
* Corresponding Author
DOI: 10.37398/JSR.2020.640235

Abstract: In distributed systems, failure detectors are used to monitor the processes and to reduce the risk of failures by detecting them before the system crashes. Accuracy and completeness are the key attributes for measuring the quality of failure detectors. Failure detectors are said to be unreliable because they sometimes suspect a correct process as faulty or treat a faulty process as correct. In this paper, various failure detector algorithms are discussed. A comprehensive study is presented based on the properties, the methodologies used, the applicability to systems, and the outcomes of the failure detectors. The paper helps readers build their knowledge of the basics of failure detectors and of the different algorithms developed to solve the failure detection problems of distributed systems.

Index Terms: Asynchronous System, Consensus, Distributed Systems, Failure Detectors, Failure Detector Classes.

I. INTRODUCTION

Reliability is a key attribute for measuring the quality of any system. The reliability of a system depends on a low probability of failures. A failure detector is used to detect the failures in a system. In a distributed system, a failure detector is a module or algorithm running at every process/node to collect the operational states of other processes. The information about the operational status of a process given by two failure detectors at different processes may vary (Cortinas, 2011). In this situation, the reliability of the failure detector is measured with the help of two abstract properties: completeness and accuracy. Completeness is classified into strong completeness and weak completeness. Similarly, accuracy is classified into strong accuracy and weak accuracy; because these are hard to attain, accuracy is further classified into eventual strong accuracy and eventual weak accuracy. Each combination of a completeness property and an accuracy property forms a class of failure detector, as shown in Table I (Chandra & Toueg, 1996). The two completeness properties and four accuracy properties together yield the different classes of failure detectors. The conceptual view of unreliable failure detectors was presented in Chandra and Toueg, 1996, for reliable distributed systems. Sometimes, failure detectors suspect a correct process as faulty or treat a faulty process as correct. Therefore, failure detectors are said to be unreliable due to their inaccuracy in detecting faulty processes.

The conceptual view of failure detectors proposed by Chandra and Toueg (1991) is used to solve consensus and related problems, such as reliable broadcast and atomic broadcast, in an asynchronous distributed system. A distributed system has a set of processes that coordinate with each other through message passing. Asynchrony and failures are the fundamental issues of distributed computing (Raynal, 2016).

Distributed systems are classified into two categories, based on their topological arrangement and on process or event completion time. The first category consists of Fully connected (Mesh), Partially connected, Hierarchical (Tree), Ring, and Star topology-based distributed systems. The second category consists of Synchronous, Asynchronous, and Partially-synchronous systems, which are distinguished by two time attributes: the message transmission time and the task execution time (Cortinas, 2011).

The primary issue in a distributed system is coordination. Every component/node in the distributed system has only a partial view of the global state of the system, which is why all the processes have to agree on some value. The consensus problem is a form of agreement in which all participating processes propose values and every correct process has to agree on one of the proposed values (Pease, Shostak, & Lamport, 1980). Consensus is solvable in a synchronous system by considering lower and upper time bounds on the processes' processing speed and on communication delays (Lynch, 1996). In an asynchronous system, however, consensus cannot be solved deterministically due to the lack of proper time bounds on events. In an asynchronous system, the condition of reliable failure detection cannot be fulfilled because a crashed process cannot be distinguished from a slow process (Fischer, Lynch, & Paterson, 1985). Dolev et al. (1987) examined 32 partially synchronous systems and proved that 4 of those 32 systems can solve the problem of consensus. Dwork et al. (1988) assumed two partially synchronous systems and proved that consensus can be solved as long as f < n/2, where f denotes the number of faulty processes and n denotes the total number of processes; they also proposed some eventually synchronous consensus algorithms.

The previously defined failure detectors are based on eight failure detector classes and are used to solve different types of agreement problems, such as Consensus and related problems like Atomic Broadcast, k-set agreement, non-blocking atomic commitment, and uniform reliable broadcast. Consensus is a model that describes a family of agreement problems (Pease et al., 1980; Turek and Shasha, 1992; Barborak & Malek, 1993). A failure detector can detect different types of failures, such as crash failure, crash recovery failure, omission failure, link failure, timing failure, and Byzantine failure, and can work for synchronous, asynchronous, or partially synchronous systems. Readers can refer to (Chandra and Toueg, 1992; Chandra and Toueg, 1996; Aguilera, Chen and Toueg, 1997; Larrea et al., 1999; Larrea & Fernández, 2004; Gallet et al., 2007; Macêdo & Gorender, 2009; Soraluze et al., 2011; Xiaohui and Yan, 2014; Verma et al., 2016; Sozinov & Hammar, 2017; Jiménez, López-Presa, & Martín-Rueda, 2020) for more information about the various failure detectors.

The rest of the paper is organized as follows. Section II describes the system model and the basic terminologies. In Section III, a detailed introduction to failure detectors and their properties, together with the failure detector classes, is presented. Section IV contains a detailed analysis and comparison of various types of failure detector algorithms. The final section presents the conclusion and future directions.

II. SYSTEM MODELS

Models are the building blocks used to design and find the solution for a given problem. A model is a representation of a problem as well as its solution. A good model has properties like accuracy and traceability, which make validation and evaluation of the solution easier (Schneider, 1993). The components are modeled in the distributed systems based

which any implementation should efficiently satisfy its requirements. The properties are classified into two classes: safety and liveness (Lamport, 1977). Classifying the properties of a system into these classes leads to a better understanding of those properties, better specifications, and subsequently clearer implementations and proofs. The liveness property implies that the process or event will eventually produce a result. Similarly, the safety property implies that a wrong result will never be executed. According to the formal definitions of safety and liveness, the properties of a dependable system can be classified into one or both classes (Alpern and Schneider, 1985). Many papers give more information on safety and liveness properties (Lamport, 1977 and 1979; Charron-Bost, Toueg, and Basu, 2000; Alpern and Schneider, 1985 and 1987; Benenson, Freiling, Holz, Kesdogan, and Penso, 2006b).

A further property, named uniformity, was presented by Neiger and Toueg (1993). It is considered that a process that holds the uniform property is potentially correct because it does not fail (Hadzilacos and Toueg, 1994).

A. Processes

A process is a computational entity. In the distributed system model, there is a set of processes Π = {p1, p2, p3, …, pn}, where Π, n and p denote the set of processes, the number of processes in the set, and a process, respectively. The set of processes is finite and constant throughout the execution of the system. All processes in the system are connected by pairwise bidirectional communication channels in a fully connected topology (Cortinas, 2011).

Processes may be characterized as correct processes, which never face any failure, and faulty processes, which are affected by some type of failure or crash. Correct processes are represented by c and faulty processes are represented by f.

Time-driven and message-driven are the two classical approaches to event generation. In time-driven algorithms, event generation depends on the passage of time, measured by clocks or timers. In message-driven algorithms, event generation depends only on message responses.

B. Communication Channels/Links

An abstraction of the network is represented by communication channels or links. The processes communicate with each other by sending messages to their
