Fault-Tolerant and Transactional Stateful Serverless Workflows (Extended Version)

Fault-tolerant and Transactional Stateful Serverless Workflows (extended version)? Haoran Zhang, Adney Cardozay, Peter Baile Chen, Sebastian Angel, and Vincent Liu University of Pennsylvania yRutgers University-Camden Abstract ing different functions to obtain nontrivial end-to-end applica- This paper introduces Beldi, a library and runtime system tions. This is fairly straightforward when functions are state- for writing and composing fault-tolerant and transactional less, but becomes involved when the functions maintain their stateful serverless functions. Beldi runs on existing providers own state (e.g., modify a data structure that persists across stateful serverless functions and lets developers write complex stateful applications that invocations). Composing such require fault tolerance and transactional semantics without the (SSFs) requires reasoning about consistency and isolation need to deal with tasks such as load balancing or maintaining semantics in the presence of concurrent requests and deal- virtual machines. Beldi’s contributions include extending the ing with component failures. While these requirements are log-based fault-tolerant approach in Olive (OSDI 2016) with common in distributed systems and are addressed by existing new data structures, transaction protocols, function invoca- proposals [8, 28, 33, 35, 46], SSFs have unique idiosyncrasies tions, and garbage collection. They also include adapting the that make existing approaches a poor fit. resulting framework to work over a federated environment The first peculiarity is that request routing is stateless. Ap- where each serverless function has sovereignty over its own proaches based on state machine replication are hard to im- data. We implement three applications on Beldi, including a plement because a follow-up message might be routed by the movie review service, a travel reservation system, and a social infrastructure to a different SSF instance from the one that media site. Our evaluation on 1,000 AWS Lambdas shows processed a prior message (e.g., an “accept” routed differently that Beldi’s approach is effective and affordable. than its “prepare”). A second characteristic is that SSFs can be independent and have sovereignty over their own data. For ex- 1 Introduction ample, different organizations may develop and deploy SSFs, and an application may stitch them together to achieve some Serverless computing is changing the way in which we end-to-end functionality. As a result, there is no component structure and deploy computations in Internet-scale systems. in the system that has full visibility (or access) to all the state. Enabled by platforms like AWS Lambda [2], Azure Func- Lastly, SSF workflows (directed graphs of SSFs) can be com- tions [3], and Google Cloud Functions [18], programmers plex and include cycles to express recursion and loops over can break their application into small functions that providers SSFs. If a developer wishes to define transactions over such then automatically distribute over their data centers. When a workflows (or its subgraphs), all transactions (including those user issues a request to a serverless-based system, this request that will abort) must observe consistent state to avoid infinite flows through the corresponding functions to achieve thede- loops and undefined behavior. This is a common requirement sired end-to-end semantics. For example, in an e-commerce in transactional memory systems [20, 23, 32, 37, 38], but is site, a user’s purchase might trigger a product order, a ship- seldom needed in distributed transaction protocols ping event, a credit card charge, and an inventory update, all To bring fault-tolerance and transactions to this challeng- of which could be handled by separate serverless functions. ing environment, this paper introduces Beldi, a library and During development, structuring an application as a set of runtime system for building workflows of SSFs. Beldi runs serverless functions brings forth the benefits of microservice on existing cloud providers without any modification to their arXiv:2010.06706v1 [cs.DC] 13 Oct 2020 architectures: it promotes modular design, quick iteration, infrastructure and without the need for servers. The SSFs and code reuse. During deployment, it frees programmers used in Beldi can come from either the app developer, other from the prosaic but difficult tasks associated with provision- developers in the same organization, third-party open-source ing, scaling, and maintaining the underlying computational, developers, or the cloud providers. Regardless, Beldi helps storage, and network resources of the system. In particular, to stitch together these components in a way that insulates app developers need not worry about setting up virtual ma- the developer from the details of concurrency control, fault chines or containers, starting or winding down instances to tolerance, and SSF composition. accommodate demand, or routing user requests to the right A well-known aspect of SSFs is that even though they set of functional units—all of this is automated once an app can persist state, this state is usually kept in low-latency developer describes the connectivity of the units. NoSQL databases (possibly different for each SSF) such as A key challenge in increasing the general applicability of DynamoDB, Bigtable, and Cosmos DB that are already fault serverless computing lies in correctly and efficiently compos- tolerant. Viewed in this light, SSFs are clients of scalable ?This is the full version of [45]. This version includes additional details fault-tolerant storage services rather than stateful services and evaluation results in the appendices. themselves. Beldi’s goal is therefore to guarantee exactly- 1 once semantics to workflows in the presence of clients (SSFs) of deploying complex serverless applications that incorporate that fail at any point in their execution and to offer synchro- state, and a list of requirements that Beldi aims to satisfy. nization primitives (in the form of locks and transactions) to prevent concurrent clients from unsafely handling state. 2.1 Serverless functions To realize this vision, Beldi extends Olive [36] and adapts Serverless computing aims to eliminate the need to manage its mechanisms to the SSF setting. Olive is a recent frame- machines, runtimes, and resources (i.e., everything except work that exposes an elegant abstraction based on logging and the app logic). It provides an abstraction where developers request re-execution to clients of cloud storage systems; oper- upload a simple function (or ‘lambda’) to the cloud provider ations that use Olive’s abstraction enjoy exactly-once seman- that is invoked on demand; an identifier is provided with tics. Beldi’s extensions include support for operations beyond which clients and other services can invoke the function. storage accesses such as synchronous and asynchronous invo- The cloud provider is then responsible for provisioning the cations (so that SSFs can invoke each other), a new data struc- VMs or containers, deploying the user code, and scaling the ture for unifying storage of application state and logs, and pro- allocated resources up and down based on current demand— tocols that operate efficiently on this data structure (§4). The all of this is transparent to users. In practice, this means that purpose of Beldi’s extensions is to smooth out the differences on every function invocation the provider will spawn a new between Olive’s original use case and ours. As one example, worker (VM or container) with the necessary runtime and Olive’s most critical optimization assumes that clients can dispatch the request to this worker (‘cold start’). The provider store a large number of log entries in a database’s atomicity may also use an existing worker, if one is free (‘warm start’). scope (the scope at which the database can atomically update Note that while workers can stay warm for a while, running objects). However, this assumption does not hold for many functions are limited by a timeout, after which they are killed. databases commonly used by SSFs. In DynamoDB, for ex- This time limit is configurable (up to 15 min in 1 s increments ample, the atomicity scope is a single row that can store at on AWS, up to 9 min in 1 ms increments on Google Cloud, most 400 KB of data [14]—the row would be full in less than and unbounded time in 1 s increments on Azure) and helps in a minute in our applications. budgeting and limiting the effect of bugs. Beldi also adapts existing concurrency control and dis- Serverless functions are often used individually, but they tributed commit protocols to support transactions over SSF can also be composed into workflows: directed graphs of func- workflows. A salient aspect of our setting is that there isnoen- tions that may contain cycles to express recursion or loops tity that can serve as a coordinator: a user issues its request to over one or more functions. Some ways to create workflows the first SSF in the workflow, and SSFs interact only withthe include AWS’s step functions [41] and driver functions.A SSFs in their outgoing edges in the workflow. Consequently, step function specifies how to stitch together different func- we design a protocol where SSFs work together (while re- tions (represented by their identifiers) and their inputs and specting the workflow’s underlying communication pattern) outputs; the step function takes care of all scheduling and to fulfill the duties of the coordinator and collectively decide data movement, and users get an identifier to invoke it. In whether to commit or abort a transaction (§6). contrast, a driver function is a single function specified by the To showcase the costs and the benefits of Beldi, we imple- developer that invokes other functions (similar to the main ment three applications as representative case studies: (1) a function of a traditional program). Control flow can form a travel reservation system, (2) a social media site, and (3) a graph because functions (including the driver function) can movie review service. These applications are based on Death- be multi-threaded or perform asynchronous invocations.

Fault-Tolerant and Transactional Stateful Serverless Workflows (Extended Version)

Amazon Dynamodb

Performance at Scale with Amazon Elasticache

Amazon Documentdb Deep Dive

A Motion Is Requested to Authorize the Execution of a Contract for Amazon Business Procurement Services Through the U.S. Communities Government Purchasing Alliance

Amazon Elasticache Deep Dive Powering Modern Applications with Low Latency and High Throughput

Amazon Mechanical Turk Developer Guide API Version 2017-01-17 Amazon Mechanical Turk Developer Guide

Analytics Lens AWS Well-Architected Framework Analytics Lens AWS Well-Architected Framework

October 24, 2013—Amazon.Com, Inc

Leveraging Cloud-Based Predictive Analytics to Strengthen Audience Engagement

Enter the Purpose-Built Database Era: Finding the Right Database Type for the Right Job

How Can Startups Make Use of Cloud Services

Nosql for Dummies®