The Failure Detector Abstraction

Felix C. Freiling, University of Mannheim
Rachid Guerraoui, EPFL and MIT CSAIL
Petr Kuznetsov, TU Berlin/Deutsche Telekom Laboratories

A failure detector is a fundamental abstraction in distributed computing. This paper surveys this abstraction through two dimensions. First, we study failure detectors as building blocks to simplify the design of reliable distributed algorithms. In particular, we illustrate how failure detectors can factor out timing assumptions to detect failures in distributed agreement algorithms. Second, we study failure detectors as computability benchmarks. That is, we survey the weakest failure detector question and illustrate how failure detectors can be used to classify problems. We also highlight some limitations of the failure detector abstraction along each of the dimensions.

Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; C.4 [Computer Systems Organization]: Performance of Systems - fault tolerance; modeling techniques; reliability, availability, and serviceability

General Terms: Algorithms, Design, Reliability, Theory

Additional Key Words and Phrases: distributed system, agreement problem, consensus, atomic commit, fault tolerance, liveness, message passing, safety, synchrony

The first author's work was supported by the Deutsche Forschungsgemeinschaft (DFG) as part of the Emmy Noether programme.

Contact author's address: University of Mannheim, Department of Computer Science, D-68131 Mannheim, Germany, contact author email: [email protected]

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc.
To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 20YY ACM 0000-0000/20YY/0000-0001 $5.00
ACM Journal Name, Vol. V, No. N, Month 20YY, Pages 1–0??.

Contents
1 Introduction
2 Failure Detectors as Programming Building Blocks
  2.1 Failure Detection using Timeouts
    2.1.1 Example of a Distributed Problem: Non-Blocking Atomic Commit (NBAC)
    2.1.2 Example of a Distributed Protocol: Three-Phase Commit (3PC)
    2.1.3 Three-Phase Commit with Timeouts
    2.1.4 Difficulties of Determining Good Timeout Values
    2.1.5 Synchronous Systems
    2.1.6 Asynchronous Model and Timeouts
    2.1.7 Eventually Synchronous Systems
    2.1.8 Conclusions
  2.2 Failure Detectors as Useful Distributed Services
    2.2.1 Failure Detectors as Oracles
    2.2.2 Perfect Failure Detectors
    2.2.3 Asynchronous Models with Failure Detectors
    2.2.4 Non-Blocking Atomic Commit with a Perfect Failure Detector
    2.2.5 Solving Consensus using Failure Detectors
    2.2.6 Unreliable Failure Detectors
    2.2.7 Other Failure Detectors
    2.2.8 Justifying Unreliable Failure Detectors
    2.2.9 Solving Problems Other than Consensus using Failure Detectors
    2.2.10 Using and Combining Different Failure Detector Abstractions
  2.3 Limitations of Failure Detectors
    2.3.1 What is not a Failure Detector?
    2.3.2 Do Failure Detectors make sense outside of the crash model?
    2.3.3 Can Randomization be used to implement Failure Detectors?
    2.3.4 Can Failure Detectors be used to Reason about Real-Time?
  2.4 Summary
3 Failure Detectors as a Computability Benchmark
  3.1 The CHT Play
  3.2 The weakest failure detector for a register
    3.2.1 Read/write shared memory
    3.2.2 The sufficiency part
    3.2.3 The reduction algorithm
    3.2.4 Solving Consensus in All Environments
  3.3 Solving Non-Blocking Atomic Commit
    3.3.1 Failure detector Ψ
    3.3.2 Using (Ψ; FS) to solve NBAC
    3.3.3 The weakest failure detector to solve NBAC
  3.4 The Set Agreement Quest and the Hierarchy of Distributed Tasks
  3.5 Summary
4 Concluding Remarks
A Handling a Bivalent Critical Index
  A.1 Simulation Tree
  A.2 Determining a Correct Process: Hooks and Forks
    A.2.1 Forks
    A.2.2 Hooks
  A.3 Existence of Hooks and Forks
    A.3.1 Terminating the Infinite Simulation Tree
    A.3.2 Identifying a Fork or a Hook

1. INTRODUCTION

Advances in computing are typically achieved through the identification of abstractions that factor out specifics of an actual processor, machine or network. In the early days, abstractions like records, sets and arrays helped free programmers from assembly and machine languages. The art of traditional sequential and centralized computing was then orchestrated around such data structures.

Progress in computer architectures, however, then called for new abstractions. In the area of concurrent computing, for instance, abstractions like threads, semaphores and monitors were very helpful in understanding concurrent programs and reasoning about their correctness. In the area of distributed computation, the remote procedure call abstraction helped factor out the details of the network and was a key to the popularity of standard distributed middleware infrastructures. In short, the remote procedure call abstraction hides many differences between languages and operating systems on different machines, and encapsulates the serialization and deserialization mechanisms used to transfer data over the wire.

The remote procedure call does not, however, help capture another fundamental characteristic of distributed systems: partial failures.
If a process on one machine remotely invokes an operation on a process running on a different machine, and the latter machine fails, an exception is raised. The failure is usually detected using a timeout mechanism: a timeout delay is associated with the operation, and when it expires, the exception is raised. Programming with timeouts is, however, difficult, and it hampers portability. The appropriate duration of a timeout may vary from one system to another, and may even depend dynamically on the load of the system. In some cases it is more appropriate to ping processors, whereas in others it is better to require that they initiate heartbeat messages.

Failure detectors are abstract devices that offer information about the operational status of processes in a distributed system [Chandra and Toueg 1996]. We believe that the failure detector abstraction is a fundamental one and should be a first-class citizen of a distributed programming library. In fact, as we discuss in this paper, the failure detector abstraction can also help classify problems in distributed computing [Chandra et al. 1996].

This paper is structured into two parts. The first part (Section 2) looks at failure detectors from an engineering point of view and discusses the advantages of using failure detectors in the design, programming and analysis of distributed algorithms. It also discusses inherent limitations of the failure detector abstraction.

The second part (Section 3) takes a more theoretical perspective and discusses the role that failure detectors can play in comparing and distinguishing problem specifications in distributed systems. We describe how the hardness of a problem can be measured by determining the weakest failure detector needed to solve the problem, and we illustrate this approach with several examples of "weakest failure detector" proofs.
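The timeout mechanism described in this introduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an algorithm taken from the paper: the class name `HeartbeatFailureDetector`, its methods, the process names and the 2-second timeout are all hypothetical, and time is passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class HeartbeatFailureDetector:
    """Illustrative timeout-based detector; every name here is hypothetical."""

    timeout: float  # seconds of silence after which a process is suspected
    last_heard: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, process: str, now: float) -> None:
        # Record the local time at which a heartbeat from `process` arrived.
        self.last_heard[process] = now

    def suspects(self, now: float) -> Set[str]:
        # A process is suspected once it has been silent for longer than the timeout.
        return {p for p, t in self.last_heard.items() if now - t > self.timeout}


fd = HeartbeatFailureDetector(timeout=2.0)
fd.heartbeat("p1", now=0.0)
fd.heartbeat("p2", now=0.0)
fd.heartbeat("p1", now=1.5)          # p1 keeps sending; p2 is silent (crashed or merely slow)
print(sorted(fd.suspects(now=2.5)))  # prints ['p2']
fd.heartbeat("p2", now=3.0)          # a late heartbeat: the earlier suspicion was a mistake
print(sorted(fd.suspects(now=3.2)))  # prints []
```

The caller only asks `suspects()` and never handles the timeout value itself, which is one way to read the idea of factoring timing assumptions out of the algorithm. The late heartbeat from `p2` also shows why choosing the timeout is delicate: with too short a value, a slow process is indistinguishable from a crashed one.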
Several surveys about distributed programming with failure detectors have been published [Raynal 2002; Guerraoui et al. 1999; Raynal 2005]. Our survey is, however, more complete, as it also covers the theoretical aspect of failure detectors, which we believe is equally important but much less understood.

2. FAILURE DETECTORS AS PROGRAMMING BUILDING BLOCKS

Information about the operational status of remote processes is often necessary to implement reliable distributed services. In this section we argue that the failure detector abstraction is a sensible one from an engineering point of view. In Section 2.1 we first review the problems of implementing failure detection based on timeouts. In Section 2.2 we informally introduce the failure detector abstraction and argue that it has several advantages over the explicit use of timeouts:

(1) It separates the concerns of reasoning about failures and reasoning about time, and therefore makes programs simpler to write and analyze;
(2) It helps express information about failures in a way that is closer to the control logic of many applications, so it allows one to write simpler and more elegant programs;
(3) It allows independent implementation and service sharing, and therefore has the potential to yield more efficient applications.

In Section 2.3 we discuss some of the limitations of the failure detector approach.

2.1 Failure Detection using Timeouts

2.1.1 Example of a Distributed Problem: Non-Blocking Atomic Commit (NBAC). We introduce our subject through this seminal database problem [Bernstein et al. 1987, Chapter 7], in which data is distributed over multiple geographically separated processes. At the end of a transaction on that data, these processes must decide whether the actions should be committed (made permanent) or aborted (rolled back).
More precisely, at the end of the transaction each participating process votes yes ("I am willing to commit") or no ("we must abort"), and eventually processes must reach a common decision, commit or abort. A non-blocking atomic commit protocol ensures that the following properties hold:

(1) All processes that manage to reach a decision on the outcome of the transaction agree on the decision.
