Velos: One-Sided Paxos for RDMA Applications
Rachid Guerraoui (EPFL), Antoine Murat (EPFL), Athanasios Xygkis (EPFL)

Abstract

Modern data centers are becoming increasingly equipped with RDMA-capable NICs. These devices enable distributed systems to rely on algorithms designed for shared memory. RDMA allows consensus to terminate within a few microseconds in failure-free scenarios, yet RDMA-optimized algorithms still use expensive two-sided operations in case of failure. In this work, we present a new leader-based algorithm for consensus based on Paxos that relies solely on one-sided RDMA verbs. Our algorithm decides in a single one-sided RDMA operation in the common case, and also changes leader in a single one-sided RDMA operation in case of failure. We implement our algorithm in the form of an SMR system named Velos, and we evaluate it against its state-of-the-art competitor, Mu. Compared to Mu, our solution adds a small overhead of ≈ 0.6µs in failure-free executions and shines during failover periods, during which it is 13 times faster in changing leader.

1 Introduction

RDMA is becoming increasingly popular in data centers. This networking technology is implemented in modern NICs and allows for Remote Direct Memory Access, whereby a server can access the memory of another server without involving the CPU of the latter [24]. As a result, RDMA paves the way for distributed algorithms in the shared-memory model that are no longer confined to a single physical server.

Communication over RDMA takes two forms. One form is one-sided communication, which shares similar semantics with local memory accesses: a server directly performs READs and WRITEs on the memory of a remote server without involving the remote CPU. The other form is two-sided communication, which has semantics similar to message passing: a server SENDs a message to the remote server, which involves the remote CPU to process the RECEIVEd message.

A significant body of work on consensus [4] over RDMA has been conducted over the past decade [2, 14, 23, 25]. These solutions primarily focus on increasing the throughput and lowering the latency of common-case executions, thus achieving consensus in the order of a few microseconds. Nevertheless, outside of their common case, these systems suffer from orders-of-magnitude higher failover times, ranging from 1ms [2] to tens or hundreds of ms [23, 25].

In this work we introduce a leader-based consensus algorithm based on the well-known Paxos [16] that is efficient both during the common case and during failover. Our algorithm provides performance comparable to the state-of-the-art system Mu, i.e., it achieves consensus in under 1.9µs, making it only twice as slow as Mu. At the same time, our algorithm is 13 times faster than Mu in the event of a leader failure and manages to fail over in under 65µs. To achieve this, our algorithm relies on one-sided RDMA WRITEs as well as Compare & Swap (CAS), a capability that is present in RDMA NICs.

The basic idea behind our algorithm is that the original Paxos algorithm contains RPCs that are simple enough to be replaced with CAS operations. The CAS operations are initiated by the leader and are executed on the memory of the participating consensus nodes (leader and followers). We first describe a version of our algorithm for single-shot consensus, for which we provide proofs of correctness. Then, we extend this version to a fully-fledged system, in which our algorithm takes a form similar to multishot Paxos [17]. With a stable leader, it decides in a single CAS operation to a majority. In the event of a failure, the leader is changed in a single additional CAS operation to a majority.
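To convey the flavor of this transformation before the formal treatment in section 4, the sketch below replaces a prepare-style RPC with one remote Compare & Swap per acceptor. It is a minimal illustration under assumed names: rdma_cas is a stand-in for the NIC's one-sided CAS verb, and the single-word slot layout is ours, not Velos'.

    # Minimal sketch, not the actual Velos protocol. rdma_cas(node, addr,
    # expected, new) stands in for the one-sided RDMA Compare & Swap verb:
    # it atomically replaces the 64-bit word at `addr` on `node` if that
    # word still equals `expected`, and returns the word it found there.

    def cas_prepare(acceptors, slot_addr, old_proposal, new_proposal):
        """Try to move a majority of acceptors' proposal slots from
        old_proposal to new_proposal in a single round of remote CASes."""
        acks = 0
        for node in acceptors:
            found = rdma_cas(node, slot_addr,
                             expected=old_proposal, new=new_proposal)
            if found == old_proposal:
                acks += 1  # the CAS took effect on this acceptor
        return acks > len(acceptors) // 2  # prepared iff a majority switched

Because the acceptor-side logic of such an RPC reduces to comparing and updating a single word, the remote CPU never needs to run: the NIC executes the comparison itself.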
The rest of this document is organized as follows: In section 2, we present related work. In section 3, we introduce the consensus problem and present the well-known Paxos algorithm. In section 4, we explain how to transform the RPC-based algorithm into a CAS-based one and we prove the correctness of this transformation. In section 5, we discuss further practical considerations, which are relevant in converting our single-shot consensus algorithm into a multi-shot one. In section 6, we discuss our implementation. In section 7, we evaluate the performance of our solution against Mu, the most recent state-of-the-art system that implements consensus.

2 Related work

Mu

Mu [2] is an RDMA-powered State Machine Replication [3] system that operates at the microsecond scale. Similarly to our system, it decides in a single amortized RDMA RTT. This is achieved by relying on RDMA permissions [24]. Mu ensures that at any time at most one process can write to a majority and decide, which ensures consensus' safety. In case of leader failure, Mu requires permission changes that take ≈ 250µs. Mu thus fails to guarantee microsecond-scale decisions in case of failure.

APUS

APUS [25] is a Paxos-based SMR system tailored for RDMA. It does not use expensive permission changes, but relies heavily on two-sided communication schemes. While it provides short failovers, its consensus engine involves heavy CPU usage at the replicas and is significantly slower than Mu's.

Disk Paxos

Disk Paxos [10] observes that Paxos' acceptors can be replaced by moderately smart and failure-prone shared memories. Their work can be carried out by the proposers, as long as the proposers are able to run atomic operations on the shared resource. This work is purely theoretical.

3 Preliminaries

In this section, we state the process and communication model that we assume for the rest of this work. Then we formally introduce the problem of consensus and present the well-known Paxos algorithm.

3.1 Assumptions

We consider the message-and-memory (M&M) model [1], which allows processes to use both message passing and shared memory. Communication is assumed to be lossless and provides FIFO semantics. The system has n processes Π = {p1, ..., pn} that can attain the roles of proposer or acceptor. We assume that there are p proposers and n acceptors, where 1 < p < |Π|. Processes can fail by crashing. Up to p − 1 proposers and ⌊(n−1)/2⌋ acceptors may fail. As long as a process is alive, its memory is remotely accessible. When a process crashes, its memory also crashes; in this case, subsequent memory operations do not return a response. The system is asynchronous in that it can experience arbitrary delays.

3.2 Consensus

In the consensus problem, processes propose individual values and eventually irrevocably decide on one of them. Formally, Consensus has the following properties:

Uniform agreement: If processes i and j decide val and val′, respectively, then val = val′.

Validity: If some process decides val, then val is the input of some process.

Integrity: No process decides twice.

Termination: Every correct process that proposes eventually decides.

It is well known that consensus is impossible in the asynchronous model [9]. To circumvent this impossibility, an additional synchrony assumption has to be made. Our consensus algorithm provides safety in the asynchronous model and requires partial synchrony for liveness. For pedagogical reasons and in order to facilitate understanding, we implement our consensus algorithm by merging together the following abstractions, whose interfaces are sketched below:

• Abortable Consensus [4], an abstraction weaker than Consensus that is solvable in the asynchronous model,

• Eventually Perfect Leader Election [6], which corresponds to the weakest failure detector required to solve Consensus.
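As a fixed reference for the following subsections, here is a hypothetical rendering of the two abstractions' interfaces; the names and signatures are illustrative, not taken from the paper:

    # Illustrative interfaces only; names and signatures are our assumptions,
    # not the paper's API.
    from abc import ABC, abstractmethod
    from typing import Optional

    class AbortableConsensus(ABC):
        @abstractmethod
        def propose(self, value) -> Optional[object]:
            """Attempt to decide `value`. Returns the decided value, or None
            to signal an abort (e.g., obstruction by another proposer)."""

    class OmegaLeaderElection(ABC):
        @abstractmethod
        def leader(self) -> int:
            """Current leader estimate. Eventually, all correct processes
            report the same correct proposer forever."""

Section 3.4 shows how the two compose: the proposer that Ω eventually singles out keeps calling propose until it decides.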
3.3 Abortable Consensus

Abortable consensus has the following properties:

Uniform agreement: If processes i and j decide val and val′, respectively, then val = val′.

Validity: If some process decides val, then val is the input of some process.

Termination: Every correct process that proposes eventually decides or aborts.

Decision: If a single process proposes infinitely many times, it eventually decides.

Algorithm 1 solves Abortable Consensus and is based on Paxos. Processes are divided into two groups: proposers and acceptors. Proposers propose a value for decision and acceptors accept some proposed values. Once a value has been accepted by a majority of acceptors, it is decided by its proposer.

Algorithm 1: Abortable Consensus

    Proposers execute:
      upon ⟨Init⟩:
        decided = False
        proposal = id
        proposed_value = ⊥

      propose(value):
        proposed_value = value
        if not decided:
          if prepare():
            accept()

      prepare():
        proposal = proposal + |Π|
        broadcast ⟨Prepare | proposal⟩
        wait for a majority of ⟨Prepared | ack, ap, av⟩
        if any av was returned, replace proposed_value with the av that has the highest ap

Intuitively, the algorithm is split into two phases: the Prepare phase and the Accept phase. During these phases, messages from the proposer are identified by a unique proposal number. The Prepare phase serves two purposes. First, it informs the proposer of any value that a majority of acceptors may already have accepted, and the proposer adopts the such value with the highest proposal number in place of its own. Second, it prevents proposals with lower proposal numbers from being accepted afterwards.

3.4 From Abortable Consensus to Consensus

One can solve Consensus by combining Abortable Consensus with Eventually Perfect Leader Election (Ω). In Abortable Consensus, a proposer is guaranteed to decide, rather than abort, if it executes unobstructed. The role of Ω is to eventually ensure this condition. Informally, Ω guarantees that eventually all correct proposers will consider a single one of them to be the leader. As long as a proposer considers itself the leader, it keeps proposing its value using Abortable Consensus. Eventually, Ω will mark a single correct proposer as the leader, which will propose unobstructed and decide. The leader can then broadcast the decision to the rest of the proposers.
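To make this composition concrete, the following minimal sketch renders the proposer side of Algorithm 1, wrapped in the Ω-driven retry loop of this subsection, in Python. The helpers broadcast, collect_majority, and omega_leader are assumed stand-ins for the messaging layer, and the accept phase, which the excerpt of Algorithm 1 above cuts off, follows standard Paxos rather than the paper's exact pseudocode:

    # Minimal sketch under assumed helpers; not the paper's implementation.
    # broadcast(msg) sends msg to all acceptors; collect_majority(tag, proposal)
    # blocks until a majority of matching replies arrive and returns their
    # payloads; omega_leader() returns the current Omega leader estimate.

    class Proposer:
        def __init__(self, pid, n):
            self.id = pid        # unique id, doubles as the initial proposal number
            self.n = n           # |Pi|, the total number of processes
            self.decided = False
            self.proposal = pid
            self.proposed_value = None

        def prepare(self):
            # Distinct proposers never collide on proposal numbers: each
            # advances in steps of |Pi| starting from its unique id.
            self.proposal += self.n
            broadcast(("Prepare", self.proposal))
            replies = collect_majority("Prepared", self.proposal)  # [(ack, ap, av)]
            if not all(ack for ack, _, _ in replies):
                return False     # obstructed by a higher proposal: abort
            accepted = [(ap, av) for _, ap, av in replies if av is not None]
            if accepted:
                # Adopt the previously accepted value with the highest ap.
                self.proposed_value = max(accepted, key=lambda t: t[0])[1]
            return True

        def accept(self):
            # Standard-Paxos accept phase (cut off from the excerpt above):
            # ask the acceptors to accept (proposal, proposed_value).
            broadcast(("Accept", self.proposal, self.proposed_value))
            acks = collect_majority("Accepted", self.proposal)  # [ack, ...]
            return all(acks)

        def propose(self, value):
            # One abortable-consensus attempt: decides or aborts (returns None).
            self.proposed_value = value
            if not self.decided:
                if self.prepare() and self.accept():
                    self.decided = True
            return self.proposed_value if self.decided else None

    def solve_consensus(p, my_value):
        # The loop of section 3.4: keep proposing while Omega says we lead.
        # (A real system would pace this loop instead of spinning.)
        while True:
            if omega_leader() == p.id:
                decided = p.propose(my_value)
                if decided is not None:
                    broadcast(("Decision", decided))  # inform the other proposers
                    return decided

Note that propose overwrites proposed_value on every retry, exactly as in Algorithm 1; this is safe because prepare re-adopts any value that a majority may already have accepted before the accept phase runs.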