Mastering Agreement Problems in Distributed Systems

focusfault tolerance Mastering Agreement Problems in Distributed Systems Michel Raynal, IRISA Mukesh Singhal, Ohio State University istributed systems are difficult to design and implement because of the unpredictability of message transfer delays and process speeds (asynchrony) and of failures. Consequently, systems de- D signers have to cope with these difficulties intrinsically associated with distributed systems. We present the nonblocking atomic commitment (NBAC) agreement problem as a case study in the context of both synchronous and asynchronous distributed systems where processes can fail by Overcoming agreement crashing. This problem lends itself to fur- lem, a concurrency control conflict, and so problems in ther exploration of fundamental mecha- on) a participant cannot locally commit the distributed nisms and concepts such as time-out, multi- transaction, it votes No. A Yes vote means systems is a cast, consensus, and failure detector. We the participant commits to make local up- will survey recent theoretical results of dates permanent if required. The second primary studies related to agreement problems and phase pronounces the order to commit the challenge to show how system engineers can use this in- transaction if all participants voted Yes or systems formation to gain deeper insight into the to abort it if some participants voted No. designers. These challenges they encounter. Of course, we must enrich this protocol sketch to take into account failures to cor- authors focus on Commitment protocols rectly implement the failure atomicity prop- practical In a distributed system, a transaction erty. The underlying idea is that a failed solutions for a usually involves several sites as participants. participant is considered to have voted No. well-known At a transaction’s end, its participants must enter a commitment protocol to commit it Atomic commitment protocols agreement (when everything goes well) or abort it Researchers have proposed several atomic problem—the (when something goes wrong). Usually, this commitment protocols and implemented nonblocking protocol obeys a two-phase pattern (the them in various systems.1,2 Unfortunately, atomic two-phase commit, or 2PC). In the first some of them (such as the 2PC with a main phase, each participant votes Yes or No. If coordinator) exhibit the blocking property in commitment. for any reason (deadlock, a storage prob- some failure scenarios.3,4 “Blocking” means 40 IEEE SOFTWARE July/August 2001 0740-7459/01/$10.00 © 2001 IEEE Authorized licensed use limited to: University of Minnesota. Downloaded on January 28, 2010 at 14:57 from IEEE Xplore. Restrictions apply. The Two-Phase Commit Protocol Can Block A coordinator-based two-phase commit protocol obeys the where noncrashed participants cannot progress because of following message exchanges. First, the coordinator requests a participant crash occurrences. vote from all transaction participants and waits for their answers. If all participants vote Yes, the coordinator returns the decision Commit; if even a single participant votes No or Reference failed, the coordinator returns the decision Abort. 1. Ö. Babao˘glu and S. Toueg, “Non-Blocking Atomic Commitment,” Distrib- 1 uted Systems, S. Mullender, ed., ACM Press, New York, 1993, pp. Consider the following failure scenario (see Figure A). Par- 147–166. ticipant p1 is the coordinator and p2, p3, p4, and p5 are the other participants. The coordinator sends a request to all par- Compute D = Commit/Abort ticipants to get their votes. Suppose p4 and p5 answer Yes. Ac- Vote request cording to the set of answers p1 receives, it determines the de- P1 Yes/No Crash cision value D (Commit or Abort) and sends it to p2 and p3, P 2 Crash but crashes before sending it to p4 and p5. Moreover, p2 and Yes/No p crash just after receiving D. So, participants p and p are P 3 4 5 3 Crash blocked: they cannot know the decision value because they Yes ? P4 voted Yes and because p1, p2, and p3 have crashed (p4 and p5 Yes ? cannot communicate with one of them to get D). Participants P5 p4 and p5 cannot force a decision value because the forced value could be different from D. So, p4 and p5 are blocked until one of the crashed participants recovers. That is why the ba- Figure A. A failure scenario for the two-phase commit sic two-phase commit protocol is blocking: situations exist protocol. that nonfailed participants must wait for the tually detected. Synchronous systems provide recovery of failed participants to terminate such a guarantee, thanks to the upper bounds their commit procedure. For instance, a com- on scheduling and message transfer delays mitment protocol is blocking if it admits ex- (protocols use time-outs to safely detect fail- ecutions in which nonfailed participants can- ures in a bounded time). Asynchronous sys- not decide. When such a situation occurs, tems do not have such bounded delays, but nonfailed participants cannot release re- we can compensate for this by equipping sources they acquired for exclusive use on the them with unreliable failure detectors, as transaction’s behalf (see the sidebar “The we’ll describe later in this article. (For more Two-Phase Commit Protocol Can Block”). on the difference between synchronous and This not only prevents the concerned trans- asynchronous distributed systems, see the re- action from terminating but also prevents lated sidebar.) So, time-outs in synchronous other transactions from accessing locked systems and unreliable failure detectors in data. So, it is highly desirable to devise asynchronous systems constitute building NBAC protocols that ensure transactions blocks offering the liveness guarantee. With will terminate (by committing or aborting) these blocks we can design solutions to the despite any failure scenario. NBAC agreement problem. Several NBAC protocols (called 3PC protocols) have been designed and implemented. Distributed systems and failures Basically, they add handling of failure scenar- A distributed system is composed of a fi- ios to a 2PC-like protocol and use complex nite set of sites interconnected through a subprotocols that make them difficult to un- communication network. Each site has a lo- derstand, program, prove, and test. More- cal memory and stable storage and executes over, these protocols assume the underlying one or more processes. To simplify, we as- distributed system is synchronous (that is, the sume that each site has only one process. process-scheduling delays and message trans- Processes communicate and synchronize by fer delays are upper bounded, and the proto- exchanging messages through the underly- cols know and use these bounds). ing network’s channels. A synchronous distributed system is Liveness guarantees characterized by upper bounds on message Nonblocking protocols necessitate that transfer delays, process-scheduling delays, the underlying system offers a liveness guar- and message-processing time. δ denotes an antee—a guarantee that failures will be even- upper time bound for “message transfer de- July/August 2001 IEEE SOFTWARE 41 Authorized licensed use limited to: University of Minnesota. Downloaded on January 28, 2010 at 14:57 from IEEE Xplore. Restrictions apply. A Fundamental Difference between Synchronous and Asynchronous Distributed Systems mous “Generals Paradox” shows that dis- We consider a synchronous distributed system in which a channel con- tributed systems with unreliable communi- nects each pair of processes and the only possible failure is the crash of cation do not admit solutions to the NBAC processes. To simplify, assume that only communication takes time and that problem.2 a process that receives an inquiry message responds to it immediately (in A site or transaction participant can fail ∆ zero time). Moreover, let be the upper bound of the round-trip communi- by crashing. Its state is correct until it cation delay. In this context, a process pi can easily determine the crash of crashes. A participant that does not crash ∆ another process pj. Process pi sets a timer’s time-out period to and sends during the transaction execution is also said an inquiry message to pj. If it receives an answer before the timer expires, it to be correct; otherwise, it is faulty. Because can safely conclude that pj had not crashed before receiving the inquiry we are interested only in commitment and message (see Figure B1). If pi has not received an answer from pj when the not in recovery, we assume that a faulty par- timer expires, it can safely conclude that pj ticipant does not recover. Moreover, what- ∆ has crashed (see Figure B2). ever the number of faulty participants and P In an asynchronous system, processes i the network configuration, we assume that can use timers but cannot rely on them, ir- Inquiry I am here each pair of correct participants can always P respective of the time-out period. This is i communicate. In synchronous systems, because message transfer delays in asyn- (1) crashes can be easily and safely detected by chronous distributed systems have no up- ∆ using time-out mechanisms. This is not the per bound. If pi uses a timer (with some P case in asynchronous distributed systems. ∆ i time-out period ) to detect the crash of pj, the two previous scenarios (Figures B1 and P The consensus problem B2) can occur. However, a third scenario i Crash (2) Consensus6 is a fundamental problem of can also occur (see Figure B3): the timer distributed systems (see the sidebar “Why ∆ expires while the answer is on its way to Consensus Is Important in Distributed Sys- ∆ P pi. This is because is not a correct upper i tems”). Consider a set of processes that can bound for a round-trip delay. fail by crashing. Each process has an input P Only the first and second scenarios can i value that it proposes to the other processes. occur in a synchronous distributed system.

Mastering Agreement Problems in Distributed Systems

ACM SIGACT News Distributed Computing Column 28

Edsger Dijkstra: the Man Who Carried Computer Science on His Shoulders

Mastering Concurrent Computing Through Sequential Thinking

Lecture Notes in Computer Science 2584 Edited by G

31 International Symposium on Distributed Computing Andréa W

Byzantine Consensus in Asynchronous Message-Passing Systems: a Survey

Money Transfer Made Simple Alex Auvolat, Davide Frey, Michel Raynal, François Taïani

Distributed Computing Column 36 Distributed Computing: 2009 Edition

Distributed Eventual Leader Election in the Crash-Recovery and General Omission Failure Models

Characterizing Consensus in the Heard-Of Model

Michel Raynal an Algorithmic Approach

Scalable Consensus in a Hybrid Communication Model Michel Raynal, Jiannong Cao