Fault Tolerance

Fault Tolerance 1/65 Fault Tolerance I Fault tolerance is the ability of a distributed system to provide its services even in the presence of faults. I A distributed system should be able to recover automatically from partial failures without seriously affecting availability and performance. 2/65 Fault Tolerance: Basic Concepts (1) I Being fault tolerant is strongly related to what are called dependable systems. Dependability implies the following: I Availability: The percentage of the time the system is operational. I Reliability: The time interval for which the system is operational. I Safety I Maintainability I In-class Exercise: How is is availability different from reliability? How about a system that goes down for 1 millisecond every hour? How about a system that never goes down but has to be shut down two weeks every year? 3/65 Fault Tolerance: Basic Concepts (2) I A system is said to fail when it cannot meet its promises. For example, if it cannot provide a service (or only provide it partially). I An error is a part of the system’s state that may lead to failure. The cause of an error is called a fault. I Types of faults: I Transient: Occurs once and then disappears. If a operation is repeated, the fault goes away. I Intermittent: Occurs, then vanishes of its own accord, then reappears and so on. I Permanent: Continues to exist until the faulty component is replaced. I In-class Exercise: Give examples of each type of fault for your project! 4/65 Failure Models (1) 5/65 Failure Models (2) I Fail-stop: failures that can be reliably detected I Fail-noisy: failures that are eventually detected I Fail-silent: we cannot detect between a crash failure or omission failure I Fail-safe: failures that do no harm I Fail-arbitrary or Byzantine Failures: I Arbitrary failures where a server is producing output that it should never have produced, but which cannot be detected as being incorrect. I A faulty server may even be working with other servers to produce intentionally wrong answers! 6/65 Failure Masking by Redundancy I Information Redundancy. For example, adding extra bits (like in Hamming Codes, see the book Coding and Information Theory) to allow recovery from garbled bits. I Time Redundancy. Repeat actions if need be. I Physical Redundancy. Extra equipment or processes are added to make the system tolerate loss of some components. 7/65 Process Resilience Achieved by replicating processes into groups. I How to design fault-tolerant groups? I How to reach an agreement within a group when some members cannot be trusted to give correct answers? 8/65 Flat Groups Versus Hierarchical Groups 9/65 Failure Masking Via Replication I Primary-backup protocol. A primary coordinates all write operations. If it fails, then the others hold an election to replace the primary. I Replicated-write protocols. Active replication as well as quorum based protocols. Corresponds to a flat group. I A system is said to be k fault tolerant if it can survive faults in k components and still meet its specifications. I For fail-silent components, k + 1 are enough to be k fault tolerant. I For Byzantine failures, at least 2k + 1 extra components are needed to achieve k fault tolerance. I Requires atomic multicasting: all requests arrive at all servers in same order. This can be relaxed to just be for write operations. 10/65 Consensus in Faulty Systems The model: a very large collection of clients send commands to a group of processes that jointly behave as a single, highly robust process. To make this work, we need to make an important assumption: In a fault-tolerant process group, each nonfaulty process executes the same commands, in the same order, as every other nonfaulty process. I Flooding Consensus (for fail-stop failures) I Paxos Consensus (for fail-noisy failures) I Byzantine Agreement (for arbitrary failures) 11/65 Flooding Consensus I A client contacts a group member (P1;:::;Pn) requesting it to execute a command. Every group member maintains a list of commands: some from clients and some from other group members. I In each round, a process Pi sends its list of proposed commands it has seen to every other process. At the end of a round, each process merges all received proposed commands into a new list, from which it deterministically selects a new command to execute (if possible) I A process will move on to the next round without having made a decision, only when it detects that another process has failed. P1 P2 decide P3 decide P4 decide 12/65 Consensus in Systems with Arbitrary Faults I Two Army Problem I Non-faulty generals with unreliable communication. 13/65 Byzantine Agreement Problem (1) Problem: I Red army in the valley, n blue generals each with their own army surrounding the red army. I Communication is pairwise, instantaneous and perfect. I However m of the blue generals are traitors (faulty processes) and are actively trying to prevent the loyal generals from reaching agreement. The generals know the value m. Goal: The generals need to exchange their troop strengths. At the end of the algorithm, each general has a vector of length n. If ith general is loyal, then the ith element has their correct troop strength otherwise it is undefined. Conditions for a solution: I All loyal generals decide upon the same plan of action I A small number of traitors cannot cause the loyal generals to adopt a bad plan 14/65 Byzantine Agreement Problem (2) Setup: I Assume that we have n processes, where each process i will provide a value vi to other processes. The goal is to let each process construct a vector of length n such that if process i is non-faulty, V [i] = vi . Otherwise, V [i] is undefined. I We assume that there are at most m faulty processes. Algorithm: Step 1 Every non-faulty process i sends vi to every other process using reliable unicasting. Faulty processes may send anything. I Moreover, since we are using unicasting, the faulty processes may send different values to different processes. Step 2 The results of the announcements of Step 1 are collected together in the form of a vector of length n. Step 3 Every process passes its vector from Step 2 to every other process. Step 4 Each process examines the ith element of each of the newly received vectors. If any value has a majority, that value is put into the result vector. If no value has a majority, the corresponding element of the result vector is set to UNKNOWN. 15/65 Byzantine Agreement Example (1) I We have 2 loyal generals and one traitor. I Note that process 1 sees a value of 2 from process 2 (process 2 vector) but sees a value of b for process 2 (process 3 vector). But process 1 has no way of knowing whether process 2 or process 3 is a traitor. So it cannot decide the right value for V [2]. I Similarly, process 2 cannot ascertain the right value for V [1]. Hence, they cannot reach agreement. I For m faulty processes, we need 2m + 1 non-faulty processes to overcome the traitors. In total, we need a total of 3m + 1 processes to reach agreement. 16/65 Byzantine Agreement Example (2) The Byzantine generals problem for 3 loyal generals and 1 traitor: (a) The generals announce their troop strengths (let’s say, in units of 1 kilo soldiers). (b) The vectors that each general assembles based on previous step. (c) The vectors that each general receives. (d) If a value has a majority, then we know it correctly, else it is unknown. In this case, we can reach agreement amongst the non-faulty processes. 17/65 Limitations on Reaching Consensus (1) The general goal of distributed consensus algorithms is to have all the non-faulty processes reach consensus on some issue, and to establish that consensus within a finite number of steps. Possible cases: I Synchronous versus asynchronous systems I Communication delay is bounded or not I Message delivery from the same sender is ordered or not I Message transmission is done through unicasting or multicasting 18/65 Limitations on Reaching Consensus (2) Message ordering Unordered Ordered Communication delay XX X X Bounded Synchronous X X Unbounded X Bounded Asynchronous X Unbounded Process behavior UnicastMulticast Unicast Multicast Message transmission I Most distributed systems in practice assume that processes behave asynchronously, message transmission is unicast, and communication delays are unbounded. 19/65 The CAP Theorem I CAP Theorem (aka Brewer’s Theorem): states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: I Consistency (all nodes see the same data at the same time) I Availability (a guarantee that every request receives a response about whether it was successful or failed) I Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system) This is a chord, this is another, this is a third. Now form a band! We know three chords but you can only pick two Brewer’s CAP Theorem (julianbrowne.com) 20/65 Dealing with CAP I Drop Partition Tolerance I Drop Availability I Drop Consistency I The BASE Jump: The notion of accepting eventual consistency is supported via an architectural approach known as BASE (Basically Available, Soft-state, Eventually consistent). BASE, as its name indicates, is the logical opposite of ACID. I Design around it! 21/65 Reliable Client-Server Communication RMI semantics in the presence of failures.

Load more