Fault Tolerance Via Idempotence

Fault Tolerance via Idempotence G. Ramalingam and Kapil Vaswani Microsoft Research, India grama,[email protected] Abstract ing comes with its own pitfalls, such as process failures, imperfect Building distributed services and applications is challenging due messaging, asynchrony and concurrency. to the pitfalls of distribution such as process and communication Consider the prototypical bank account transfer service in failures. A natural solution to these problems is to detect potential Fig. 1. The goal of the service is to transfer money between bank failures, and retry the failed computation and/or resend messages. accounts, potentially in different banks. If the accounts belong to Ensuring correctness in such an environment requires distributed different banks, ensuring that the transfer executes as an atomic services and applications to be idempotent. (distributed) transaction is usually not feasible, and the natural way In this paper, we study the inter-related aspects of process fail- of expressing this computation is as a workflow [10, 20] consisting ures, duplicate messages, and idempotence. We first introduce a of two steps, a debit followed by a credit. simple core language (based on λ-calculus) inspired by modern dis- What if the process executing the workflow fails in between the tributed computing platforms. This language formalizes the notions debit and credit steps? A natural solution is to detect this failure and ensure that a different process completes the remaining steps of a service, duplicate requests, process failures, data partitioning, 1 and local atomic transactions that are restricted to a single store. of the workflow. A challenging aspect of realizing this solution We then formalize a desired (generic) correctness criterion for is figuring out whether the original process failed before or after applications written in this language, consisting of idempotence completing a particular step (either debit or credit). If not done (which captures the desired safety properties) and failure-freedom carefully, the debit or credit step may be executed multiple times, (which captures the desired progress properties). leading to further correctness concerns. Services often rely on a We then propose language support in the form of a monad that central workflow manager to manage process failures during the automatically ensures failfree idempotence. A key characteristic of workflow (using distributed transactions). our implementation is that it is decentralized and does not require Now consider a (seemingly) different problem. Messages sent distributed coordination. We show that the language support can between the client initiating the transfer and the service may be lost. be enriched with other useful constructs, such as compensations, The only option for a client, when it does not receive a response while retaining the coordination-free decentralized nature of the within some reasonable time, is to resend its request. Yet the client implementation. does not want the transfer to occur twice! We have implemented the idempotence monad (and its variants) In this paper, we study process and communication failures in in F# and C# and used our implementation to build realistic appli- the context of workflows. The seemingly different problems caused cations on Windows Azure. We find that the monad has low runtime by process and communication failures are, in fact, inter-related. overheads and leads to more declarative applications. Idempotence, a correctness criterion that requires the system to tolerate duplicate requests, is the key to handling both communication Categories and Subject Descriptions D.4.5 [Operating Systems]: and process failures efficiently. Idempotence, when combined with Reliability—Fault-tolerance; C.2.4 [Computer-Communication Net- retry, gives us the essence of a workflow, a fault tolerant composi- works]: Distributed Systems—Client/server, Distributed application of atomic actions, for free without the need for distributed co- tions ordination. In the transfer example, a fault tolerant account transfer can be implemented without a central workflow manager if the General Terms Reliability, Languages, Design debit and credit steps can be designed to be idempotent, Keywords fault tolerance, idempotence, workflow, transaction, Formalizing Failfree Idempotence. In this paper, we introduce a monad simple core language λFAIL, inspired by contemporary cloud platforms. This language formalizes process failure, duplicate requests, 1. Introduction partitioned data, and local transactions. A local transaction pro- Distributed computing is becoming mainstream. Several modern vides ACID guarantees but is restricted to access data within a sin- platforms offer virtualized distributed systems at low entry cost gle partition (typically a single server). Computations in λFAIL are with the promise of scaling out on demand. But distributed comput- like workflows, but without any fault-tolerance guarantees for the composition (i.e., the computation may fail between transactions). We then formalize a generic correctness criterion for applications written in λFAIL. A simple, powerful and tempting criterion is that an application’s behavior in the presence of duplicate requests Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed and process failures should be indistinguishable from its behav- for profit or commercial advantage and that copies bear this notice and the full citation ior in the absence of duplicate requests and failures. We formal- on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 1 In general, detecting failures perfectly in an asynchronous, message pass- POPL’13, January 23–25, 2013, Rome, Italy. ing system is impossible [8]. Conservative failure detection can also lead to Copyright c 2013 ACM 978-1-4503-1832-7/13/01. $10.00 the same problem of duplicated computation. let process (request) = workflow monad [2, 4]. We find that the core business logic in these match request with applications can be declaratively expressed using the monad. Our j (“getBalance”,(branch, account)) ! evaluation shows that performance overheads of using the monad atomic branch flookup accountg over hand-coded implementations are statistically insignificant. j (“transfer”,(fromBranch, fromAccount, toBranch, toAccount, amt) ! atomic fromBranch f The rest of the paper is organized as follows. In Section 2, we update fromAccount ((lookup fromAccount) − amt) introduce a language λFAIL and formalize duplicate requests and g; process failures. We formalize what it means for a λFAIL application atomic toBranch f to correctly tolerate duplicate requests and failures. In Section 3, update toAccount ((lookup toAccount) + amt) we present the idempotence monad and show how it can be used to g; tolerate duplicate requests as well as process failures. In Section 4, “Transfer complete.” we describe extensions of the idempotence construct. In Section 5, we evaluate the idempotence monad and our implementation from Figure 1: A banking service example, in syntactically sugared the perspective of expressiveness, benefits and overheads. Section 6 discusses related work. λFAIL, that is neither idempotent nor fault-tolerant. 2. Failfree Idempotence ize a slightly weaker, but more appropriate, correctness criterion, In this section we present a language λFAIL that distils essential el- namely failure-freedom modulo message duplication. Informally, ements of distributed computing platforms such as Windows Azure this criterion permits the system to send duplicate responses. This and formalize the concept of failfree idempotence. weakening is appropriate from the perspective of composition: if the recipient of the responses can also tolerate duplicate messages, 2.1 The Language λFAIL then the sender is freed of the obligation to send the response ex- actly once. Informal Overview. A λFAIL program e represents a service that receives input requests and produces output responses. An input Automating Idempotence. Next, we address the problem of auto- request v is processed by creating an agent to evaluate e v. When matically ensuring idempotence for a service. We present our solu- the evaluation of e v terminates, producing a value v0, v0 is sent tion as a monad, the idempotence monad. We then show that idem- back as the response. Multiple input requests can be processed con- potence, when coupled with a simple retry mechanism, provides a currently, and their evaluation can be interleaved. Shared, mutable, “free” solution to the problem of tolerating process failures, guaran- persistent data is stored in tables. teeing failure-freedom modulo message duplication. We then pro- Agents. An agent has its own internal state (captured by local pose dedicated language support for idempotent computations. variables of the code). An agent may fail at any point in time. A failure models problems such as hardware failure, software crashes Decentralized Idempotence and Workflow. The idea underlying and reboots . Data stored in tables is persistent and is unaffected by the idempotence monad is conceptually simple, but tedious to im- agent failures. plement manually (i.e., without the monad). Given a unique iden- Tables. Tables are persistent maps. They provide primitives to tifier associated with a computation, the monad essentially adds update the value

Load more