Efficient Replication Via Timestamp Stability

Efficient Replication via Timestamp Stability Vitor Enes Carlos Baquero Alexey Gotsman Pierre Sutra INESC TEC and INESC TEC and IMDEA Software Institute Télécom SudParis University of Minho University of Minho Abstract most critical data. State-machine replication (SMR) [38] is an Modern web applications replicate their data across the globe approach for providing such guarantees used by a number and require strong consistency guarantees for their most of systems [8, 14, 21, 26, 39, 44]. In SMR, a desired service critical data. These guarantees are usually provided via state- is defined by a deterministic state machine, and each site machine replication (SMR). Recent advances in SMR have maintains its own local replica of the machine. An SMR pro- focused on leaderless protocols, which improve the avail- tocol coordinates the execution of commands at the sites to ability and performance of traditional Paxos-based solutions. ensure that the system is linearizable [18], i.e., behaves as if We propose Tempo – a leaderless SMR protocol that, in com- commands are executed sequentially by a single site. parison to prior solutions, achieves superior throughput and Traditional SMR protocols, such as Paxos [28] and offers predictable performance even in contended workloads. Raft [34], rely on a distinguished leader site that defines the To achieve these benefits, Tempo timestamps each applica- order in which client commands are executed at the replicas. tion command and executes it only after the timestamp be- Unfortunately, this site is a single point of failure and con- comes stable, i.e., all commands with a lower timestamp are tention, and a source of higher latency for clients located far known. Both the timestamping and stability detection mech- from it. Recent efforts to improve SMR have thus focused anisms are fully decentralized, thus obviating the need for on leaderless protocols, which distribute the task of ordering a leader replica. Our protocol furthermore generalizes to commands among replicas and thus allow a client to contact partial replication settings, enabling scalability in highly par- the closest replica instead of the leader [1, 5, 13, 30, 31, 42]. allel workloads. We evaluate the protocol in both real and Compared to centralized solutions, leaderless SMR offers simulated geo-distributed environments and demonstrate lower average latency, fairer latency distribution with re- that it outperforms state-of-the-art alternatives. spect to client locations, and higher availability. Leaderless SMR protocols also generalize to the setting of CCS Concepts: • Theory of computation ! Distributed partial replication, where the service state is split into a set algorithms. of partitions, each stored at a group of replicas. A client com- Keywords: Fault tolerance, Consensus, Geo-replication. mand can access multiple partitions, and the SMR protocol ensures that the system is still linearizable, i.e., behaves as ACM Reference Format: if the commands are executed by a single machine storing Vitor Enes, Carlos Baquero, Alexey Gotsman, and Pierre Sutra. a complete service state. This approach allows implement- 2021. Efficient Replication via Timestamp Stability. In Sixteenth ing services that are too big to fit onto a single machine. It European Conference on Computer Systems (EuroSys ’21), April 26–28, 2021, Online, United Kingdom. ACM, New York, NY, USA, 25 pages. also enables scalability, since commands accessing disjoint https://doi.org/10.1145/3447786.3456236 sets of partitions can be executed in parallel. This has been demonstrated by Janus [32] which adapted a leaderless SMR 1 Introduction protocol called Egalitarian Paxos (EPaxos) [31] to the setting of partial replication. The resulting protocol provided bet- Modern web applications are routinely accessed by clients ter performance than classical solutions such as two-phase arXiv:2104.01142v2 [cs.DC] 25 Apr 2021 all over the world. To support such applications, storage commit layered over Paxos. systems need to replicate data at different geographical loca- Unfortunately, all existing leaderless SMR protocols suffer tions while providing strong consistency guarantees for the from drawbacks in the way they order commands. Some pro- Permission to make digital or hard copies of all or part of this work for tocols [1, 5, 13, 31] maintain explicit dependencies between personal or classroom use is granted without fee provided that copies are not commands: a replica may execute a command only after all made or distributed for profit or commercial advantage and that copies bear its dependencies get executed. These dependencies may form this notice and the full citation on the first page. Copyrights for components arbitrary long chains. As a consequence, in theory the pro- of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to tocols do not guarantee progress even under a synchronous redistribute to lists, requires prior specific permission and/or a fee. Request network. In practice, their performance is unpredictable, and permissions from [email protected]. in particular, exhibits a high tail latency [5, 36]. Other proto- EuroSys ’21, April 26–28, 2021, Online, United Kingdom cols [11, 30] need to contact every replica on the critical path © 2021 Association for Computing Machinery. of each command. While these protocols guarantee progress ACM ISBN 978-1-4503-8334-9/21/04...$15.00 https://doi.org/10.1145/3447786.3456236 1 EuroSys ’21, April 26–28, 2021, Online, United Kingdom Vitor Enes, Carlos Baquero, Alexey Gotsman, and Pierre Sutra under synchrony, they make the system run at the speed of 2 Partial State-Machine Replication the slowest replica. We consider a geo-distributed system where processes All of these drawbacks carry over to the setting of par- may fail by crashing, but do not behave maliciously. State- tial replication where they are aggravated by the fact that machine replication (SMR) is a common way of implement- commands span multiple machines. ing fault-tolerant services in such a system [38]. In SMR, the In this paper we propose Tempo, a new leaderless SMR pro- service is defined as a deterministic state machine accepting tocol that lifts the above limitations while handling both full a set of commands C. Each process maintains a replica of the and partial replication settings. Tempo guarantees progress machine and receives commands from clients, external to under a synchronous network without the need to contact the system. An SMR protocol coordinates the execution of all replicas. It also exhibits low tail latency even in contended commands at the processes, ensuring that they stay in sync. workloads, thus ensuring predictable performance. Finally, We consider a general version of SMR where each pro- it delivers superior throughput than prior solutions, such as cess replicates only a part of the service state – partial SMR EPaxos and Janus. The protocol achieves all these benefits by (PSMR) [19, 32, 37]. We assume that the service state is di- assigning a scalar timestamp to each command and executing vided into partitions, so that each variable defining the state commands in the order of these timestamps. To determine belongs to a unique partition. Partitions are arbitrarily fine- when a command can be executed, each replica waits until grained: e.g., just a single state variable. Each command the command’s timestamp is stable, i.e., all commands with accesses one or more partitions. We assume that a process a lower timestamp are known. Ordering commands in this replicates a single partition, but multiple processes may be way is used in many protocols [1, 9, 11, 27]. A key novelty co-located at the same machine. Each partition is replicated of Tempo is that both timestamping and stability detection at A processes, of which at most 5 may fail. Following Flexible are fault-tolerant and fully decentralized, which preserves 5 5 A−1 Paxos [20], can be any value such that 1 ≤ ≤ b 2 c. This the key benefits of leaderless SMR. allows using small values of 5 regardless of the replication In more detail, each Tempo process maintains a local clock factor A, which is appropriate in geo-replication [8, 13]. We from which timestamps are generated. In the case of full write I? for the set of all the processes replicating a partition replication, to submit a command a client sends it to the ?, I2 for the set of processes that replicate the partitions closest process, which acts as its coordinator. The coordinator accessed by a command 2, and I for the set of all processes. computes a timestamp for the command by forwarding it A PSMR protocol allows a process 8 to submit a command 2 to a quorum of replicas, each of which makes a timestamp on behalf of a client. For simplicity, we assume that each com- proposal, and taking the maximum of these proposals. If mand is unique and the process submitting it replicates one enough replicas in the quorum make the same proposal, of the partitions it accesses: 8 2 I2 . For each partition ? ac- then the timestamp is decided immediately (fast path). If cessed by 2, the protocol then triggers an upcall execute? ¹2º not, the coordinator does an additional round trip to the at each process storing ?, asking it to apply 2 to the local replicas to persist the timestamp (slow path); this may happen state of partition ?. After 2 is executed by at least one pro- when commands are submitted concurrently. Thus, under cess in each partition it accesses, the process that submitted favorable conditions, the replica nearest to the client decides the command aggregates the return values of 2 from each the command’s timestamp in a single round trip. partition and returns them to the client. To execute a command, a replica then needs to determine PSMR ensures the highest standard of consistency of repli- when its timestamp is stable, i.e., it knows about all com- cated data – linearizability [18] – which provides an illusion mands with lower timestamps.

Efficient Replication Via Timestamp Stability

Dagrep-V007-I011-Complete.Pdf

Chicago Journal of Theoretical Computer Science the MIT Press

Combinatorial Topology and Distributed Computing Copyright 2010 Herlihy, Kozlov, and Rajsbaum All Rights Reserved

Tight Bounds for K-Set Agreement Soma Chaudhuri Maurice Herlihy CRL 98/4 Nancy A

A Complexity-Based Hierarchy for Multiprocessor Synchronization

A Topological Perspective on Distributed Network Algorithms∗

Expert Report of Maurice P. Herlihy in Securities and Exchange Commission V

Chicago Journal of Theoretical Computer Science the MIT Press

Compositional Competitiveness for Distributed Algorithms∗

Contents U U U

Distributed Computing

Maurice Peter Herlihy