Scalable State-Machine Replication (S-SMR)
Carlos Eduardo Bezerra∗‡, Fernando Pedone∗, Robbert van Renesse†
∗University of Lugano, Switzerland
†Cornell University, USA
‡Universidade Federal do Rio Grande do Sul, Brazil

Abstract—State machine replication (SMR) is a well-known technique able to provide fault tolerance. SMR consists of sequencing client requests and executing them against replicas in the same order; thanks to deterministic execution, every replica will reach the same state after the execution of each request. However, SMR is not scalable, since any replica added to the system will execute all requests, and so throughput does not increase with the number of replicas. Scalable SMR (S-SMR) addresses this issue in two ways: (i) by partitioning the application state, while allowing every command to access any combination of partitions, and (ii) by using a caching algorithm to reduce the communication across partitions. We describe Eyrie, a library in Java that implements S-SMR, and Volery, an application that implements Zookeeper's API. We assess the performance of Volery and compare the results against Zookeeper. Our experiments show that Volery scales throughput with the number of partitions.

I. INTRODUCTION

Many current online services have stringent availability and performance requirements. High availability entails tolerating component failures and is typically accomplished with replication. For many online services, "high performance" essentially means the ability to serve an arbitrarily large load, something that can be achieved if the service can scale throughput with the number of replicas. State machine replication (SMR) [1], [2] is a well-known technique to provide fault tolerance without sacrificing strong consistency (i.e., linearizability). SMR regulates how client commands are propagated to and executed by the replicas: every non-faulty replica must receive and execute every command in the same order. Moreover, command execution must be deterministic. SMR provides configurable availability, by setting the number of replicas, but limited scalability: every replica added to the system must execute all requests; hence, throughput does not increase as replicas join the system.

Distributed systems usually rely on state partitioning to scale (e.g., [3], [4]). If requests can be served simultaneously by multiple partitions, then augmenting the number of partitions results in an overall increase in system throughput. However, exploiting state partitioning in SMR is challenging. First, ensuring linearizability efficiently when state is partitioned is tricky. To see why, note that the system must be able to order multiple streams of commands simultaneously (e.g., one stream per partition), since totally ordering all commands cannot scale. But with independent streams of ordered commands, how to handle commands that address multiple partitions? Second, SMR hides from the service designer much of the complexity involved in replication; all the service designer must provide is a sequential implementation of each service command. If state is partitioned, then some commands may need data from multiple partitions. Should the service designer introduce additional logic in the implementation to handle such cases? Should the service be limited to commands that access a single partition?

This paper presents Scalable State Machine Replication (S-SMR), an approach that achieves scalable throughput and strong consistency (i.e., linearizability) without constraining service commands or adding additional complexity to their implementation. S-SMR partitions the service state and relies on an atomic multicast primitive to consistently order commands within and across partitions. We show in the paper that simply ordering commands consistently across partitions is not enough to ensure strong consistency in partitioned state machine replication. S-SMR implements execution atomicity, a property that prevents command interleavings that violate strong consistency. To assess the performance of S-SMR, we developed Eyrie, a Java library that allows developers to implement partitioned-state services transparently, abstracting partitioning details, and Volery, a service that implements Zookeeper's API [5]. All communication between partitions is handled internally by Eyrie, including remote object transfers. In the experiments we conducted with Volery, throughput scaled with the number of partitions, in some cases linearly. In some deployments, Volery reached over 250 thousand commands per second, largely outperforming Zookeeper, which served 45 thousand commands per second under the same workload.

The paper makes the following contributions: (1) It introduces S-SMR and discusses several performance optimizations, including caching techniques. (2) It details Eyrie, a library to simplify the design of services based on S-SMR. (3) It describes Volery to demonstrate how Eyrie can be used to implement a service that provides Zookeeper's API. (4) It presents a detailed experimental evaluation of Volery and compares its performance to Zookeeper.

The remainder of this paper is organized as follows. Section II describes our system model. Section III presents state-machine replication and the motivation for this work. Section IV introduces S-SMR; we explain the technique in detail and argue about its correctness. Section V details the implementation of Eyrie and Volery. Section VI reports on the performance of the Volery service. Section VII surveys related work and Section VIII concludes the paper.

II. MODEL AND DEFINITIONS

We consider a distributed system consisting of an unbounded set of client processes C = {c1, c2, ...} and a bounded set of server processes S = {s1, ..., sn}. Set S is divided into P disjoint groups of servers, S1, ..., SP. Processes are either correct, if they never fail, or faulty, otherwise. In either case, processes do not experience arbitrary behavior (i.e., no Byzantine failures). Processes communicate by message passing, using either one-to-one or one-to-many communication, as defined next. The system is asynchronous: there is no bound on message delay or on relative process speed.

One-to-one communication uses primitives send(p, m) and receive(m), where m is a message and p is the process m is addressed to. If sender and receiver are correct, then every message sent is eventually received. One-to-many communication relies on atomic multicast, defined by the primitives multicast(γ, m) and deliver(m), where γ is a set of server groups. Let relation ≺ be defined such that m ≺ m′ iff there is a server that delivers m before m′. Atomic multicast ensures that (i) if a server delivers m, then all correct servers in γ deliver m (agreement); (ii) if a correct process multicasts m to groups in γ, then all correct servers in every group in γ deliver m (validity); and (iii) relation ≺ is acyclic (order). The order property implies that if servers s and r deliver messages m and m′, then they deliver them in the same order. Atomic broadcast is a special case of atomic multicast in which there is a single group with all servers.
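To make these primitives concrete, the following Java sketch shows one way the one-to-one and one-to-many primitives of this section might be exposed to the rest of a system. It is only an illustration under the stated assumptions: the interface, type, and method names (Communication, ProcessId, GroupId, Message) are invented here and are not part of Eyrie's API; the agreement, validity, and order guarantees are assumed to come from the underlying atomic multicast protocol.

```java
import java.util.Set;

/** Illustrative message type: an opaque payload carried by the primitives below. */
interface Message {
    byte[] payload();
}

/** Opaque identifiers; their concrete representation is deployment-specific. */
interface ProcessId {}
interface GroupId {}

/**
 * Sketch of the communication primitives of Section II.
 * send/receive provide one-to-one channels; multicast/deliver provide atomic
 * multicast to a set of server groups (gamma). The properties (agreement,
 * validity, acyclic delivery order) are guaranteed by the protocol behind
 * this interface, not by the interface itself.
 */
interface Communication {
    // One-to-one: send m to process p; if both are correct, m is eventually received.
    void send(ProcessId p, Message m);
    Message receive() throws InterruptedException;

    // One-to-many: atomically multicast m to every server group in gamma.
    void multicast(Set<GroupId> gamma, Message m);
    Message deliver() throws InterruptedException;
}
```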
III. BACKGROUND AND MOTIVATION

State-machine replication is a fundamental approach to implementing a fault-tolerant service by replicating servers and coordinating the execution of client commands against server replicas [1], [2]. The service is defined by a state machine, which consists of a set of state variables V = {v1, ..., vm} and a set of commands that may read and modify state variables, and produce a response for the command. Each command is implemented by a deterministic program. State-machine replication can be implemented with atomic broadcast: commands are atomically broadcast to all servers, and all correct servers deliver and execute the same sequence of commands.

We are interested in implementations of state-machine replication that ensure linearizability. A sequence of operations is legal if it respects the semantics of the objects it accesses. For example, a sequence of operations for a read-write variable v is legal if every read command returns the value of the most recent write command that precedes the read, if there is one, or the initial value otherwise. An execution E is linearizable if there is some permutation of the commands executed in E that respects (i) the service's sequential specification and (ii) the real-time precedence of commands. Command C1 precedes command C2 if the response of C1 occurs before the invocation of C2.

In classical state-machine replication, throughput does not scale with the number of replicas: each command must be ordered among replicas and executed, and replied to, by every non-faulty replica. Some simple optimizations to the traditional scheme can provide improved performance, but not scalability. For example, although update commands must be ordered and executed by every replica, only one replica needs to respond to the client, saving resources at the other replicas. Commands that only read the state must be ordered with respect to other commands, but can be executed by a single replica, the replica that will respond to the client.

This is a fundamental limitation: while some optimizations may increase throughput by adding servers, the improvements are limited since, fundamentally, the technique does not scale. In the next section, we describe an extension to SMR that, under certain workloads, allows performance to grow proportionally to the number of replicas.
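The following Java sketch illustrates the classical scheme just described: the atomic broadcast layer delivers the same totally ordered stream of commands to every replica, and each replica applies them deterministically, so all replicas converge to the same state. The class, the toy key-value command, and the onDeliver hook are assumptions made for illustration; this is not Eyrie or Volery code.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of a classical (unpartitioned) SMR replica.
 * onDeliver is invoked once per command, in the single total order fixed by
 * atomic broadcast, so execution is sequential and deterministic.
 */
final class ClassicalSmrReplica {
    /** A command in a toy key-value service: a write or a read of one key. */
    record Command(String key, String value, boolean isWrite) {}

    private final Map<String, String> state = new HashMap<>();

    /** Apply the next delivered command and produce the response. */
    String onDeliver(Command cmd) {
        if (cmd.isWrite()) {
            state.put(cmd.key(), cmd.value());
            return "ok";
        }
        // Reads are ordered together with writes; in the optimized scheme
        // described above, only the replica that replies would execute them.
        return state.get(cmd.key());
    }
}
```

Because every replica executes every command, adding replicas in this scheme improves availability but not throughput, which is the limitation S-SMR targets.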
IV. SCALABLE STATE-MACHINE REPLICATION

In this section, we introduce S-SMR, discuss performance optimizations, and argue about S-SMR's correctness.

A. General idea

S-SMR divides the application state V (i.e., the state variables) into P partitions P1, ..., PP, where for each Pi, Pi ⊆ V. Moreover, we require each variable v in V to be assigned to at least one partition and define part(v) as the partitions that hold v. Each partition Pi is replicated by the servers in group Si. For brevity, we say that server s belongs to Pi with the meaning that s ∈ Si, and say that client c multicasts command C to partition Pi meaning that c multicasts C to group Si.

To execute command C, the client multicasts C to all partitions that hold a variable read or updated by C.
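The Java sketch below illustrates this client-side step: given the variables a command reads and writes, the client takes the union of part(v) over those variables and issues a single atomic multicast to the resulting set of groups. The mapping part(v) is represented here as a plain Map, and the Communication, Message, and GroupId types are the illustrative ones from the Section II sketch; none of these names are Eyrie's actual API.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of an S-SMR client: multicast a command to every partition that
 * holds a variable the command reads or updates.
 */
final class SsmrClient {
    private final Map<String, Set<GroupId>> part;  // part(v): the partitions holding variable v
    private final Communication comm;

    SsmrClient(Map<String, Set<GroupId>> part, Communication comm) {
        this.part = part;
        this.comm = comm;
    }

    /** Issue one atomic multicast of command m to all partitions touched by it. */
    void execute(Message m, Set<String> readSet, Set<String> writeSet) {
        Set<GroupId> destinations = new HashSet<>();
        for (String v : readSet) destinations.addAll(part.get(v));
        for (String v : writeSet) destinations.addAll(part.get(v));
        comm.multicast(destinations, m);  // single multicast to every involved group
    }
}
```

Using one multicast addressed to all involved groups, rather than one message per partition, is what lets the ordering layer keep commands consistently ordered across the partitions they span.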