arXiv:2106.08676v1 [cs.DC] 16 Jun 2021

Velos: One-sided Paxos for RDMA applications

Rachid Guerraoui          Antoine Murat          Athanasios Xygkis
EPFL                      EPFL                   EPFL

Abstract

Modern data centers are becoming increasingly equipped with RDMA-capable NICs. These devices enable distributed systems to rely on RDMA, which allows algorithms for consensus designed for shared memory to terminate within a few microseconds in failure-free scenarios. Yet, RDMA-optimized algorithms still use expensive two-sided operations in case of failure. In this work, we present a new leader-based consensus algorithm that relies solely on one-sided RDMA verbs. Our algorithm decides in a single one-sided RDMA operation in the common case, and also changes leader in a single one-sided RDMA operation in case of failure. We implement our algorithm in the form of an SMR system named Velos, and we evaluated our system against the state-of-the-art competitor Mu. Compared to Mu, our solution adds a small overhead of ≈0.6µs in failure-free executions and shines during failover periods, during which it is 13 times faster in changing leader.

1 Introduction

RDMA is becoming increasingly popular in data centers. It allows for Remote Direct Memory Access, where a server can access the memory of another server without involving the CPU of the latter [24]. This networking technology is implemented in modern NICs. As a result, RDMA paves the way for distributed algorithms in the shared-memory model that are no longer confined within a single physical server.

Communication over RDMA takes two forms. One form is one-sided communication, which shares similar semantics to local memory accesses: a server directly performs READs and WRITEs to the memory of a remote server without involving the remote CPU. The other form is called two-sided communication and has similar semantics to message passing: a server SENDs a message to a remote server, which involves its CPU to process the RECEIVEd message.

A significant body of work on consensus over RDMA [2, 14, 23, 25] has been conducted over the past decade. These solutions primarily focus on increasing the throughput and lowering the latency, thus achieving consensus in the order of a few microseconds. Nevertheless, outside of their common case, executions of these systems suffer from orders of magnitude higher failover times, ranging from 1ms [2] to tens or hundreds of ms [17, 25].

In this work we introduce a leader-based consensus algorithm based on the well-known Paxos [16] that is efficient both during the common case and during failover. Our algorithm provides comparative performance to the state-of-the-art system Mu, i.e., it achieves consensus in under 1.9µs. In the case of a stable leader, it decides in a single CAS operation to a majority. In the event of a failure, the leader is changed in a single additional CAS operation to a majority.

The basic idea behind our algorithm is that the original Paxos algorithm contains RPCs that are simple enough to be replaced with CAS operations. The CAS operations are initiated by the leader and are executed on the memory of the participating consensus nodes (leader and followers). To achieve this, our algorithm relies on one-sided RDMA WRITEs as well as Compare & Swap (CAS), a capability that is present in RDMA NICs. We first describe a version of our algorithm for single-shot consensus, where we provide proofs of correctness. Then, we continue by extending this version to a fully-fledged system, in which our algorithm takes a form similar to multishot Paxos [17], making it only twice as slow as Mu in failure-free executions. At the same time, in the event of a leader failure, our algorithm manages to failover in under 65µs, 13 times faster than Mu.

The rest of this document is as follows: In section 2 we present related work. In section 3 we introduce the consensus problem and present the well-known Paxos algorithm. In section 4 we explain how to transform the RPC algorithm into a CAS-based one and we prove the correctness of this transformation. In section 5 we discuss further practical considerations, which are relevant in converting our single-shot consensus algorithm into a multi-shot one. In section 6 we discuss our implementation. In section 7 we evaluate the performance of our solution against Mu, the most recent state-of-the-art system that implements consensus.

2 Related work

Mu

Mu [2] is an RDMA-powered State Machine Replication [3] system that operates at the microsecond scale. Similarly to our system, it decides in a single amortized RDMA RTT. This is achieved by relying on RDMA permissions [24]. Mu ensures that at any time, at most one process can write to a majority and decide, which ensures consensus' safety. In case of leader failure, Mu requires permission changes that take ≈250µs. Mu thus fails at guaranteeing microsecond decisions in case of failure.

APUS

APUS [25] is a Paxos-based SMR system. It was tailored for RDMA. It does not use expensive permission changes but relies heavily on two-sided communication schemes. While it provides short failovers, its consensus engine involves heavy CPU usage at replicas and is significantly slower than Mu's.

Disk Paxos

Disk Paxos [10] observes that Paxos' acceptors can be replaced by moderately smart and failure-prone shared memories. Their work can be done by proposers as long as they are able to run atomic operations at the shared resource. This work is purely theoretical.

3 Preliminaries

In this section, we state the process and communication model that we assume for the rest of this work. Then we formally introduce the problem of consensus and we present the well-known Paxos algorithm.

3.1 Assumptions

We consider the message-and-memory (M&M) model [1], which allows processes to use both message-passing and shared-memory. Communication is assumed to be lossless and provides FIFO semantics. The system has n processes Π = {p1, ..., pn} that can attain the roles of proposer or acceptor. In the system, we assume that there are p proposers and n acceptors, where 1 < p < |Π|. Processes can fail by crashing. Up to p − 1 proposers and ⌊(n−1)/2⌋ acceptors may fail. As long as a process is alive, its memory is remotely accessible. When a process crashes, its memory also crashes. In this case, subsequent memory operations do not return a response. The system is asynchronous in that it can experience arbitrary delays.

3.2 Consensus

In the consensus problem, processes propose individual values and eventually irrevocably decide on one of them. Formally, Consensus has the following properties:

Uniform agreement If processes i and j decide val and val′, respectively, then val = val′.

Validity If some process decides val, then val is the input of some process.

Integrity No process decides twice.

Termination Every correct process that proposes eventually decides.

It is well known that consensus is impossible in the asynchronous model [9]. To circumvent this impossibility, an additional synchrony assumption has to be made. Our consensus algorithm provides safety in the asynchronous model and requires partial synchrony for liveness. For pedagogical reasons and in order to facilitate understanding, we implement our consensus algorithm by merging together the following abstractions:

• Abortable Consensus [4], an abstraction weaker than Consensus that is solvable in the asynchronous model,

• Eventually Perfect Leader Election [6], which relies on the weakest failure detector required to solve Consensus.

3.3 Abortable Consensus

Abortable consensus has the following properties:

Uniform agreement If processes i and j decide val and val′, respectively, then val = val′.

Validity If some process decides val, then val is the input of some process.

Termination Every correct process that proposes eventually decides or aborts.

Decision If a single process proposes infinitely many times, it eventually decides.

Algorithm 1 solves Abortable Consensus and is based on Paxos. Processes are divided into two groups: proposers and acceptors. Proposers propose a value for decision and acceptors accept some proposed values. Once a value has been accepted by a majority of acceptors, it is decided by its proposer.

Intuitively, the algorithm is split in two phases: the Prepare phase and the Accept phase. During these phases, messages from the proposer are identified by a unique proposal number. The Prepare phase serves two purposes.

First, the proposer gets a promise from a majority of acceptors that another proposer with a lower proposal number will fail to decide. Second, the proposer updates its proposed value using the accepted values stored in the acceptors. This way, if a value has been decided, the proposer will adopt it. The Prepare phase can also trigger Abort if any acceptor in the majority previously made a promise to a higher proposal number.

If the proposer manages to complete the Prepare phase without aborting, it proceeds to the Accept phase. In this phase, the proposer tries to store its proposed value in a majority of acceptors. If it succeeds, it decides on that value. Otherwise, it ran obstructed and triggers Abort.

A proof of correctness for algorithm 1 is given in [4].

Algorithm 1: Abortable Consensus

 1 Proposers execute:
 2 upon ⟨Init⟩:
 3   decided = False
 4   proposal = id
 5   proposed_value = ⊥

 7 propose(value):
 8   proposed_value = value
 9   if not decided:
10     if prepare():
11       accept()

13 prepare():
14   proposal = proposal + |Π|
15   broadcast ⟨Prepare | proposal⟩
16   wait for a majority of ⟨Prepared | ack, ap, av⟩
17   if any av returned, replace proposed_value with the av with the highest ap
18   if any not ack:
19     trigger ⟨Abort⟩
20     return False
21   return True

23 accept():
24   broadcast ⟨Accept | proposal, proposed_value⟩
25   wait for a majority of ⟨Accepted | mp⟩
26   if any mp > proposal:
27     trigger ⟨Abort⟩
28   else:
29     decided = true
30     trigger ⟨Decide | proposed_value⟩

32 Acceptors execute:
33 upon ⟨Init⟩:
34   min_proposal = 0
35   accepted_proposal = 0
36   accepted_value = ⊥

38 upon ⟨Prepare | proposal⟩:
39   if proposal > min_proposal:
40     min_proposal = proposal
41   reply ⟨Prepared | min_proposal == proposal, accepted_proposal, accepted_value⟩

43 upon ⟨Accept | proposal, value⟩:
44   if proposal ≥ min_proposal:
45     accepted_proposal = min_proposal = proposal
46     accepted_value = value
47   reply ⟨Accepted | min_proposal⟩

3.4 From Abortable Consensus to Consensus

One can solve Consensus by combining Abortable Consensus together with Eventually Perfect Leader Election (Ω). In Abortable Consensus a proposer is guaranteed to decide, rather than abort, if it executes unobstructed. The role of Ω is to ensure this condition. Informally, Ω guarantees that eventually all correct proposers will consider a single one of them to be the leader. As long as a proposer considers itself the leader, it keeps on proposing its value using Abortable Consensus. Eventually, Ω will mark a single correct proposer as the leader, which will try to propose unobstructed and decide. The leader can then broadcast the decision to the rest of the proposers. Algorithm 2 provides the implementation of this idea. A proof of its correctness is given in [4].

Algorithm 2: Consensus from Abortable Consensus

 1 Proposers execute:
 2 upon ⟨Init⟩:
 3   proposed_value = ⊥
 4   leader = proposed = decided = False

 6 upon ⟨Trust | pi⟩:
 7   if pi == self then leader = True
 8   else leader = False

10 propose(value):
11   proposed_value = value
12   while True:
13     if leader and not proposed:
14       proposed = True
15       trigger ⟨AbortableConsensus, Propose | proposed_value⟩

17 upon ⟨AbortableConsensus, Decide | value⟩:
18   broadcast ⟨Decided | value⟩

20 upon ⟨AbortableConsensus, Abort⟩:
21   proposed = False

23 upon ⟨Decided | value⟩:
24   if not decided:
25     decided = true
26     trigger ⟨Decide | value⟩

Notice that Abortable Consensus differs from Consensus only in its liveness property. The former is essentially an obstruction-free consensus implementation in which no proposer may decide under contention. However, Abortable Consensus retains the safety properties of Consensus. Thus, for the rest of this work we will concentrate on Abortable Consensus and we will transform it into a CAS-based algorithm.

4 One-sided Consensus

In this section, we explain how to transform the two-sided algorithm presented in section 3.3 into a one-sided one. To do so, we first establish the equivalence between the RPCs used in algorithm 1 and CAS. Then, we take advantage of this equivalence and replace RPCs with CAS in algorithm 1. The resulting CAS-based Abortable Consensus is given in algorithm 4.

4.1 One-sided obstruction-free RPC

Observe that algorithm 1 uses message passing (i.e., RPC) in a very specific form. The acceptors keep track of only three variables: min_proposal, accepted_proposal and accepted_value. In both the Accept and the Prepare phases, acceptors atomically update these values upon a very simple condition (i.e., a simple comparison) and return some of them. In this section, we propose and prove a simple obstruction-free transformation to turn such RPCs into purely one-sided conditional writes using CAS.

Algorithm 3: CAS-based RPC

 1 rpc(x):
 2   if compare(x, state):
 3     state = f(state, x)
 4   return projection(state)

 6 cas-rpc(x):
 7   expected = fetch_state()
 8   if not compare(x, expected):
 9     return projection(expected)

11   move_to = f(expected, x)
12   old = cas(state, expected, move_to)
13   if old == expected:
14     return projection(move_to)

16   abort()

In algorithm 3, we assume that the whole state of a process (i.e., all its variables) is stored in state. In the case of RPC (line 1), the caller sends x to the callee. The callee deterministically compares x with its state using compare. If the comparison succeeds, its state is deterministically mutated using the function f. In any case, the callee extracts part of its state using projection and returns it to the caller. By convention, rpc runs atomically.
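As an illustration only (not part of the original algorithm), the cas-rpc pattern can be written down in C++ for the shared-memory case, with the one-sided RDMA CAS stood in for by std::atomic::compare_exchange_strong over a hypothetical 64-bit state; compare, f and projection are supplied by the caller, exactly as assumed above.

#include <atomic>
#include <cstdint>
#include <optional>

// Hypothetical 64-bit process state; in the RDMA setting this word lives in
// the acceptor's memory and the CAS below becomes a one-sided RDMA CAS.
using State = uint64_t;

// Returns the projection on success, std::nullopt on abort (obstruction).
template <typename Compare, typename F, typename Projection>
std::optional<uint64_t> cas_rpc(std::atomic<State>& state, uint64_t x,
                                Compare compare, F f, Projection projection) {
  State expected = state.load();                  // fetch_state()
  if (!compare(x, expected))                      // comparison fails: no side effect
    return projection(expected);

  State move_to = f(expected, x);
  // A single compare-and-swap replaces the atomic body of rpc().
  if (state.compare_exchange_strong(expected, move_to))
    return projection(move_to);

  return std::nullopt;                            // abort: the state changed concurrently
}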

We prove below that if the callee's state is accessible by the caller via shared memory, and compare, f, projection are known to the caller, then rpc and cas-rpc are strictly equivalent except in the case where cas-rpc aborts.

Lemma 4.1. If cas-rpc does not abort, rpc and cas-rpc are equivalent.

Proof. An execution of rpc solely depends on the value of state and the input value x. We denote such an execution of rpc by ⟨state, x⟩_rpc. If an execution of cas-rpc does not abort, it solely depends on the value of expected fetched at line 7 and the input value x. We denote such an execution of cas-rpc by ⟨expected, x⟩_cas-rpc.

We show that any execution ⟨s, x⟩_rpc is equivalent to the execution ⟨s, x⟩_cas-rpc, in the sense that both rpc and cas-rpc will have the same value of state and return the same projection at the end of their execution.

If an execution ⟨s1, x⟩_rpc makes the comparison at line 2 fail, then state is not modified and projection(s1) is returned. In the execution ⟨s1, x⟩_cas-rpc, the comparison at line 8 will also fail (as the comparison is deterministic) and projection(s1) is also returned, without modifying the remote state. In this case, both executions are equivalent.

If an execution ⟨s2, x⟩_rpc makes the comparison at line 2 succeed, then state is modified to f(s2, x) and projection(f(s2, x)) is returned. In the execution ⟨s2, x⟩_cas-rpc, the comparison at line 8 will also succeed (since the comparison is deterministic). As the run is assumed to not abort, the CAS will succeed. Thus the remote state will atomically be updated from s2 to f(s2, x) and projection(f(s2, x)) is also returned. So, in this case, both executions are also equivalent.

In addition, this transformation is safe in case of obstruction.

Lemma 4.2. If cas-rpc aborts, it has no side effect.

Proof. If cas-rpc aborts, the comparison at line 13 failed. This in turn implies that the CAS at line 12 failed and thus that state is unaffected by the execution.

From lemmas 4.1 and 4.2, cas-rpc exhibits all-or-nothing atomicity. We now prove that such a transformation is obstruction-free.

Lemma 4.3. If cas-rpc runs alone (i.e., unobstructed), it does not abort.

Proof. Let us assume by contradiction that cas-rpc runs alone and aborts. For cas-rpc to abort, the comparison at line 13 must fail. This in turn implies that the CAS at line 12 failed, i.e., that the current value of state does not match the expected one. For this to happen, the state must have been updated between lines 7 and 12 by another process. This means that there was a concurrent execution, a contradiction.

4.2 A purely one-sided consensus algorithm

Algorithm 4 implements Abortable Consensus by replacing the RPCs in algorithm 1 with the one-sided obstruction-free RPCs introduced in section 4.1. As this new algorithm relies on CAS for comparisons, it aborts in situations where the original algorithm would have succeeded. For example, consider the following execution: Let proposers P1 and P2 concurrently initiate the Prepare phase with respective proposals 1 and 2. Both fetch the remote state and get ⟨0, 0, ⊥⟩. Then, P1 succeeds at writing its proposal to acceptor A1. Later on, the CAS of P2 fails at A1, as the value is now ⟨1, 0, ⊥⟩ instead of the expected ⟨0, 0, ⊥⟩. Thus, P2 aborts even though it had a larger proposal number than P1. The more relaxed comparison in the original algorithm would not have caused P2 to abort.
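In an RDMA deployment, the cas(p.state, expected, move_to) primitive used by Algorithm 4 below maps to an ibverbs atomic compare-and-swap work request. The following is a minimal sketch under our own assumptions (a connected RC queue pair qp, its completion queue cq, a registered 8-byte local buffer old_buf for the returned old value, and the remote state's address and rkey); it is an illustration, not Velos' actual code.

#include <infiniband/verbs.h>
#include <cstdint>

uint64_t rdma_cas(ibv_qp* qp, ibv_cq* cq, uint64_t* old_buf, ibv_mr* mr,
                  uint64_t remote_addr, uint32_t rkey,
                  uint64_t expected, uint64_t move_to) {
  ibv_sge sge{};
  sge.addr = reinterpret_cast<uintptr_t>(old_buf);  // NIC writes the old value here
  sge.length = sizeof(uint64_t);                    // RDMA atomics operate on 8 bytes
  sge.lkey = mr->lkey;

  ibv_send_wr wr{};
  wr.opcode = IBV_WR_ATOMIC_CMP_AND_SWP;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.wr.atomic.remote_addr = remote_addr;           // acceptor's packed state
  wr.wr.atomic.rkey = rkey;
  wr.wr.atomic.compare_add = expected;              // swap happens iff state == expected
  wr.wr.atomic.swap = move_to;

  ibv_send_wr* bad = nullptr;
  ibv_post_send(qp, &wr, &bad);

  ibv_wc wc{};
  while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin until completion */ }
  // A real implementation would also check wc.status here.
  return *old_buf;  // equals `expected` iff the CAS took effect (cf. Algorithm 4, lines 40 and 55)
}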

Algorithm 4: CAS-based Abortable Consensus

 1 Proposers execute:
 2 upon ⟨Init⟩:
 3   decided = False
 4   proposal = id
 5   proposed_value = ⊥

 7 propose(value):
 8   proposed_value = value
 9   if not decided:
10     if prepare():
11       accept()

13 prepare():
14   proposal = proposal + |Π|
15   execute in parallel cas_prepare(p, proposal) for p in Acceptors
16   wait for a majority to abort or return ⟨ack, ap, av⟩
17   if any returned, replace proposed_value with the av with the highest ap
18   if any aborted or not ack:
19     trigger ⟨Abort⟩
20     return False
21   return True

23 accept():
24   execute in parallel cas_accept(p, proposal, proposed_value) for p in Acceptors
25   wait for a majority to abort or return mp
26   if any aborted or returned mp > proposal:
27     trigger ⟨Abort⟩
28   else:
29     decided = true
30     trigger ⟨Decide | proposed_value⟩

32 cas_prepare(p, proposal):
33   expected = fetch_state(p)
34   if not proposal > expected.min_proposal:
35     return ⟨false, expected.accepted_proposal, expected.accepted_value⟩

37   move_to = expected
38   move_to.min_proposal = proposal
39   old = cas(p.state, expected, move_to)
40   if old == expected:
41     return ⟨true, expected.accepted_proposal, expected.accepted_value⟩

43   abort()

45 cas_accept(p, proposal, value):
46   expected = fetch_state(p)
47   if not proposal ≥ expected.min_proposal:
48     return expected.min_proposal

50   move_to = expected
51   move_to.min_proposal = proposal
52   move_to.accepted_proposal = proposal
53   move_to.accepted_value = value
54   old = cas(p.state, expected, move_to)
55   if old == expected:
56     return expected.min_proposal

58   abort()

60 Acceptors execute:
61 upon ⟨Init⟩:
62   state = {min_proposal : 0, accepted_proposal : 0, accepted_value : ⊥}

Lemma 4.4. Algorithm 4 preserves the decision property of Abortable Consensus.

Proof. If a single process proposes infinitely many times, it will eventually run the one-sided RPCs obstruction-free. By lemma 4.3, this guarantees that eventually the one-sided RPCs will terminate without aborting. In such a case, lemma 4.1 guarantees the execution to be equivalent to one of the original algorithm. Thus, the transformation preserves the decision property of the original algorithm.

Lemma 4.5. Algorithm 4 preserves the termination property of Abortable Consensus.

Proof. Assuming a majority of correct acceptors, all CASes will eventually return or abort. Due to the absence of loops or blocking operations (apart from waiting for a reply from a majority of acceptors), a proposer that invokes propose will either abort or decide.

The only execution difference between both algorithms is that some executions of the transformed algorithm may abort where the original one would not. Nevertheless, aborting does not violate safety.

Lemma 4.6. Algorithm 4 preserves the safety properties of Abortable Consensus.

Proof. Assume by contradiction that adding superfluous abortions to the original algorithm can violate safety. In a first execution E1, processes {P1, ..., Pn} deviate from the original algorithm and abort at times {t1, ..., tn}, after which the global state is {S1, ..., Sn}. At some point, safety is violated. In a second execution E2, processes {P1, ..., Pn} crash at times {t1, ..., tn}, after which the global state is {S1, ..., Sn}. As the original algorithm tolerates arbitrarily many proposer crash failures, safety is not violated. Proposers cannot distinguish both executions. Thus, safety cannot be violated, hence a contradiction. Thus, adding superfluous aborts preserves safety and algorithm 4 preserves safety.

Theorem 4.7. By lemmas 4.4, 4.5 and 4.6, algorithm 4 implements Abortable Consensus.

4.3 Streamlined one-sided algorithm

Section 4.2 introduced a simple one-sided consensus algorithm, built by replacing RPCs with weaker (i.e., abortable) one-sided RPCs, and proved its correctness. The resulting algorithm exhibits flagrant inefficiencies that can be fixed to reduce the number of network operations to 2 CASes in the common case.

First, it is not required to fetch the remote state at the start of each RPC. As it is safe to have stale expected states, it is safe to use optimistic predicted states deduced from previous CASes. Predicted states can thus be initialized to ⟨0, 0, ⊥⟩ and updated each time a CAS completes (whether it succeeds or not). Moreover, wrongly predicting states can only result in superfluous aborts, which have been proven to be safe by lemma 4.6. Thus, it is safe to optimistically assume that in-flight CASes will succeed.

Second, in the Prepare phase, proposal can be increased upfront to be higher than any predicted remote min_proposal, to reduce predictable abortions.

Algorithm 5 gives the algorithm obtained after applying these optimisations.

Algorithm 5: Streamlined One-sided Abortable Consensus

 1 Proposers execute:
 2 upon ⟨Init⟩:
 3   predicted = {min_proposal : 0, accepted_proposal : 0, accepted_value : ⊥}^|Acceptors|
 4   decided = False
 5   proposal = id
 6   proposed_value = ⊥

 8 propose(value):
 9   proposed_value = value
10   if not decided:
11     if prepare():
12       accept()

14 prepare():
15   for p in Acceptors:
16     while predicted[p].min_proposal ≥ proposal:
17       proposal = proposal + |Π|

19   reads = ⊥^|Acceptors|
20   move_to = ⊥^|Acceptors|

22   for p in Acceptors:
23     move_to[p] = (proposal, predicted[p].accepted_proposal, predicted[p].accepted_value)
24     async reads[p] = cas(slot_p, predicted[p], move_to[p])
25   wait until a majority of slots are read

27   for p in Acceptors:
28     if reads[p] ∈ {predicted[p], ⊥}:
29       predicted[p] = move_to[p]
30     else:
31       predicted[p] = reads[p]

33   if any CAS failed (predicted[p] ≠ reads[p]):
34     trigger ⟨Abort⟩
35     return false

37   proposed_value = the predicted[.].accepted_value with the highest accepted_proposal, or proposed_value if none
38   return true

40 accept():
41   move_to = (proposal, proposal, proposed_value)
42   for p in Acceptors:
43     async reads[p] = cas(slot_p, predicted[p], move_to)
44   wait until a majority of slots are read

46   if any CAS failed:
47     for p in Acceptors:
48       if reads[p] ∈ {predicted[p], ⊥}:
49         predicted[p] = move_to
50       else:
51         predicted[p] = reads[p]
52     trigger ⟨Abort⟩
53     return

55   decided = true
56   trigger ⟨Decide | proposed_value⟩

58 Acceptors execute:
59 upon ⟨Init⟩:
60   state = {min_proposal : 0, accepted_proposal : 0, accepted_value : ⊥}

Said optimisations also preserve liveness. Let us assume that a single proposer runs infinitely many times. Eventually, it will run obstruction-free. In the worst case, it will each time abort at line 34 or 52 because of a single wrongly guessed remote state, and update its prediction. The optimistic update of expected states at lines 29 and 49 and the FIFO property of links ensure that, once a remote state is correctly guessed, any later CAS will succeed. Thus, after at most n runs all CASes will succeed and the proposer will decide.

5 Practical Considerations

So far we have shown a new algorithm for doing RDMA-based consensus using CAS. Our algorithm solves a single instance of consensus; however, most practical systems need to run consensus over and over.

State Machine Replication (SMR) replicates a service (e.g., a key-value storage system) across multiple physical servers called replicas, such that the system remains available and consistent even if some servers fail. SMR provides strong consistency in the form of linearizability [11]. A common way to implement SMR, which we adopt in this paper, is as follows: each replica has a copy of the service software and a log. The log stores client requests. We consider non-durable SMR systems [13, 15, 18, 19, 21, 22], which keep state in memory only, without logging updates to stable storage. Such systems make an item of data reliable by keeping copies of it in the memory of several nodes. Thus, the data remains recoverable as long as there are fewer simultaneous node failures than data copies [23].

A consensus protocol ensures that all replicas agree on what request is stored in each slot of the log. Replicas then apply the requests in the log (i.e., execute the corresponding operations), in log order. Assuming that the service is deterministic, this ensures all replicas remain in sync. We adopt a leader-based approach, in which a dynamically elected replica called the leader communicates with the clients and sends back responses after requests reach a majority of replicas. As already stated, we assume a crash-failure model: servers may fail by crashing, after which they stop executing.

Velos is the implementation of Algorithm 5 in the form of SMR. Implementing Velos leads to practical considerations and additional hardware-specific optimizations that are not present in Algorithm 5.

5.1 Pre-preparation

Unfortunately, our CAS approach prevents us from using the multi-Paxos optimisation [5, 17, 20]. Thus, each consensus slot must be prepared individually. Nevertheless, a leader can prepare slots in advance so that it only runs the Accept phase on its critical path. In this case, the leader decides in a single CAS RTT. Doing so requires a stable leader. Switching to another leader requires re-preparing slots. Fortunately, this will most likely succeed in a single CAS RTT by optimistically predicting remote slots to have been prepared by the failed leader.
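As a rough illustration of the pre-preparation scheme above (our own sketch with assumed names, not Velos' code), a single background thread can keep a window of future slots prepared while the critical path of a proposal only issues the Accept-phase CAS on the next free slot:

#include <atomic>
#include <cstdint>

// Hypothetical per-slot engine exposing the two phases of Algorithm 5.
struct SlotEngine {
  bool prepare(uint64_t slot);                 // Prepare-phase CAS to a majority
  bool accept(uint64_t slot, uint64_t value);  // Accept-phase CAS to a majority
};

class Leader {
  SlotEngine engine;
  std::atomic<uint64_t> prepared{0};  // highest slot already prepared
  std::atomic<uint64_t> next{0};      // next slot to decide in

 public:
  // Background pre-preparation (assumed to run in a single dedicated thread):
  // keep `window` slots prepared ahead of the next slot to be decided.
  void preparer(uint64_t window) {
    for (;;) {
      uint64_t target = next.load() + window;
      while (prepared.load() < target && engine.prepare(prepared.load() + 1))
        prepared.fetch_add(1);
    }
  }

  // Critical path: with a stable leader, only the Accept-phase CAS runs (1 RTT).
  bool propose(uint64_t value) {
    uint64_t slot = next.fetch_add(1) + 1;
    if (slot > prepared.load())          // fall back if pre-preparation lagged behind
      if (!engine.prepare(slot)) return false;
    return engine.accept(slot, value);
  }
};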

5.2 Limited CAS size

Algorithm 5 assumes that a state can fit within a single CAS. Current RDMA NICs support CASes of up to 8 bytes. As both min_proposal and accepted_proposal must be the same size, both fields are limited to at most 31 bits, with the remaining 2 bits dedicated to storing the accepted_value.

One may be afraid of the proposal fields overflowing (either breaking safety or decision). In such an extremely unlikely case (and actually more for completeness), the abstraction can fall back to traditional RPC. Switching to RPC can be safely implemented as follows: once the RDMA-exposed min_proposal of an acceptor reaches 2^31 − |Π|, every proposer switches to RPC to communicate with this specific acceptor. At a regular time interval, acceptors check their state and, if above the threshold, execute algorithm 1 with the min_proposal, accepted_proposal and accepted_value variables initiated to match state.

Another concern is the limited accepted_value field size. Depending on the upper application, the value can be inlined within the CAS. Otherwise, a simple indirection can be put in place. Instead of deciding on the proposed value itself, one can decide on the proposer's id. This can be achieved by RDMA-writing the actual value to a majority of acceptors in a dedicated write-exclusive slot before running the Accept phase. RDMA RC QP FIFO semantics can be leveraged to do so at minimum cost. The write WQE can be prepended to the Accept WQE, made unsignaled, and posted with Doorbell batching. If the CAS completes, then the value was written at a majority of acceptors and will always be recoverable.
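As an illustration of the 31/31/2-bit layout described above (our own sketch; the field order and helper names are assumptions, not Velos' actual encoding), the acceptor state can be packed into the single 64-bit word that the RDMA CAS operates on:

#include <cstdint>

// Packed acceptor state: 31 bits min_proposal, 31 bits accepted_proposal,
// 2 bits accepted_value (e.g., a proposer id or ⊥), as discussed in section 5.2.
struct SlotState {
  uint32_t min_proposal;       // < 2^31
  uint32_t accepted_proposal;  // < 2^31
  uint8_t  accepted_value;     // < 4
};

inline uint64_t pack(SlotState s) {
  return (uint64_t(s.min_proposal) << 33) |
         (uint64_t(s.accepted_proposal) << 2) |
         uint64_t(s.accepted_value & 0x3);
}

inline SlotState unpack(uint64_t w) {
  SlotState s;
  s.min_proposal = uint32_t(w >> 33) & 0x7FFFFFFF;
  s.accepted_proposal = uint32_t(w >> 2) & 0x7FFFFFFF;
  s.accepted_value = uint8_t(w & 0x3);
  return s;
}

// Example: build the move_to word for cas_prepare (cf. Algorithm 4, line 38).
//   uint64_t expected = pack({0, 0, 0});
//   SlotState next = unpack(expected);
//   next.min_proposal = my_proposal;   // my_proposal is a hypothetical variable
//   uint64_t move_to = pack(next);
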
5.3 Device Memory

Modern RNICs allow RDMA exposure of their own internal memory. This feature is called Device Memory (DM). In Mellanox's ConnectX-6, the available memory is slightly larger than 100KiB. RDMAing this memory is faster than accessing the main memory, as it removes an extra PCIe communication from the critical path. All RDMA verbs benefit from a speedup. Moreover, DM reduces atomic verbs contention. As one-sided Paxos makes acceptors fundamentally passive, DM can be used without additional cost. DM can also be leveraged when RDMA-writing the actual value as described in the previous section, to PCIe-transfer the payload only once.

5.4 Piggybacking decisions

Consensus slots can be augmented with an additional previous_decision value. This way, if every node endorses both the proposer and the acceptor roles, it can learn that a value has been decided for slot s − 1 simply by reading its local slot s. With this strategy, values are decided in a single CAS and decisions are learned with no additional communication in the common case.

5.5 Unavailable features

Modern RNICs lack some features and performance characteristics that would make one-sided Paxos even more appealing. We believe that these features can reasonably be expected to be provided by future RNICs.

First are global CASes. ConnectX-6 atomics are only guaranteed to be atomic with respect to other operations executed by the same HCA. This prevents us from using CPU-side CASes to update the state of the proposer, which could save one MMIO (≈300ns).

Second are unaligned CASes. Currently, CASes cannot overlap 2 consecutive slots. Such a feature would allow a proposer to run the Accept phase for slot s and the Prepare phase for slot s + 1 in a single CAS, removing the need for batch pre-preparation from the critical path altogether.

Third, even using Device Memory, CAS exhibits a latency twice higher than RDMA WRITEs, which makes Mu twice as fast in the failure-free scenario.

6 Implementation

Our algorithm is implemented in 1046 lines of C++17 code (CLOC [7]). It uses the ibverbs library for RDMA over Infiniband and it relies on the reusable abstractions provided by the source code of Mu. Furthermore, we implemented the optimizations mentioned in section 5, apart from "Piggybacking decisions".

We implement a new leader election algorithm that departs from Mu's design. Our algorithm relies on modifications of the Linux kernel and is composed of two separate modules, namely the interceptor and the broadcaster module. In Linux, when a process crashes, control is transferred to the kernel, which takes care of cleaning up the process. Our kernel modifications allow processes to instruct the kernel (via a prctl system call) that upon failure the interceptor module should be invoked during the cleanup. The interceptor module is then responsible for invoking the broadcaster module. The broadcaster module broadcasts a message saying that the process has crashed. This message is picked up by correct processes, which subsequently stop trying to communicate with the crashed process. Our kernel modifications and kernel module implementation span 352 lines of C code.

7 Evaluation

Our goal is to evaluate whether our implementation can achieve consensus within a few microseconds and whether it handles failures with the least amount of delay. Concretely, with our evaluation we aim at answering the following questions:

• Does our implementation impose a small overhead compared to Mu in the common case?

• What is the total failover time during a leader crash?

We evaluate our system on a 3-node cluster, the details of which are given in Table 1. In the reported numbers we show 3-way replication, which accounts for most real deployments [12]. In all of our experiments we ignore the existence of a client issuing requests to our system.

Table 1: Hardware details of machines.

CPU          2x Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz
Memory       2x 128GiB
NIC          Mellanox ConnectX-6
Links        100 Gbps Infiniband
Switch       Mellanox MSB7700 EDR 100 Gbps
OS           Ubuntu 20.04.2 LTS
Kernel       5.4.0-74-custom
RDMA Driver  Mellanox OFED 5.3-1.0.0.1

We measure time using the POSIX clock_gettime function with the CLOCK_MONOTONIC parameter. In our deployment, the resolution and overhead of clock_gettime is around 16–20ns [8].
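For reference, a minimal sketch of the timing harness implied above (the propose call is a hypothetical stand-in for the measured operation; this is not the exact benchmark code):

#include <ctime>
#include <cstdint>

// Monotonic nanosecond timestamp via clock_gettime(CLOCK_MONOTONIC).
static inline uint64_t now_ns() {
  timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return uint64_t(ts.tv_sec) * 1000000000ull + uint64_t(ts.tv_nsec);
}

// Usage at the leader: measure a single replication.
//   uint64_t start = now_ns();
//   proposer.propose(request);            // hypothetical API
//   uint64_t latency_ns = now_ns() - start;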

7.1 Common-case Replication Latency

We first evaluate the latency of our system under no leader failure. We measure the latency at the leader. Latency refers to the time it takes for propose of Algorithm 5 to execute. We show the time it takes to replicate messages of different sizes in Figure 1.

[Figure 1: Median replication latency of Velos, Velos with Device Memory, and Mu for different message sizes (1B to 512B).]

For messages of 1 byte, Velos replicates a request using only an RDMA CAS. On the other side, Mu always uses a single RDMA WRITE to replicate. For messages up to 128 bytes, Mu manages to inline the message in the RDMA request, thus exhibiting almost constant replication latency. RDMA WRITE and RDMA CAS are both one-sided operations, but they have different latency. Sending 3 RDMA CASes and waiting for a majority of replies costs around 1.9µs, while the same communication pattern with RDMA WRITEs costs 1.25µs. This is seen in the time difference for 1B messages.

For larger payload sizes, Velos replicates a request using an RDMA CAS and an additional RDMA WRITE. In other words, Velos does exactly what Mu does, apart from having an extra RDMA CAS for every replicated message. Given that the latency of RDMA CAS is constant (in the absence of CAS contention, i.e., when having a stable leader), the impact of the overhead on the replication latency of Velos compared to Mu diminishes for larger message sizes.

Figure 1 also demonstrates the effect of using Device Memory. When Velos relies exclusively on Device Memory for its RDMA WRITEs and RDMA CASes, it gains 200ns in replication latency.

7.2 Fail-over Time

During a leader crash, Velos replicas receive the failure detection event and a new leader is elected. The new leader immediately starts replicating new messages among itself and the remaining replica R.

Figure 2 evaluates the time it takes for R to discover a new replicated value in its log. When a stable leader replicates requests, the throughput is at around 42 requests per 100µs. In other words, R discovers a new entry in its log approximately every 2.5µs. When the leader fails, the throughput drops to 0 and replica R discovers a new value after approximately 65µs. The subsequent few replication requests from the new leader take between 3µs and 3.6µs, which is evident in the throughput curve. When the new leader takes over, the throughput curve exhibits a not-so-steep trajectory before stabilizing again at 42 requests per 100µs. This is because the new leader's cache is cold and initial replication requests result in higher replication latency. Soon after, the new leader manages to replicate new requests in approximately 2.5µs.

Comparing Velos to Mu, the latter is faster than Velos in the common case but slower during leader change. Mu manages to replicate requests using a single WRITE, but it relies on permissions to handle concurrent leaders during leader failure.

[Figure 2: Throughput under leader failure (message size of 1 byte), measured from the remaining replica. Throughput in requests per 100µs over time in µs.]

The cost of changing permissions, as presented in Mu, is approximately 250µs just for changing the leader. Mu requires an additional 600µs to detect the leader failure. On the other side, Velos requires approximately 30µs to detect a leader failure and an additional 35µs for the new leader to successfully replicate the next request. In other words, Velos is approximately 1.2 or 1.5µs slower than Mu in the common case, depending on whether or not it relies on Device Memory. Velos is significantly faster than Mu during leader change, as it detects a leader failure 20 times faster than Mu and replicates the next request 7 times faster than Mu. Overall, Velos is 13 times faster than Mu during leader change.

8 Conclusion

Consensus is a classical distributed systems abstraction that is widely used in data centers. RDMA enables consensus to achieve lower decision latency not only due to its intrinsic latency characteristics as a network fabric, but also due to its semantics. RDMA semantics such as permission changes improve the latency of consensus in the common case, as they enable consensus to decide by using a single RDMA WRITE. However, the non-common case, in which failures occur, still exhibits latency in the order of a millisecond.

Velos is a state machine replication system that can replicate requests in a few microseconds. It relies on a modified Paxos algorithm that replaces RPCs (i.e., message passing) with RDMA Compare-and-Swap. As a result, Velos exhibits common-case latency that is competitive with Mu's latency and shines during failures. Velos manages to switch to a new leader and start replicating new requests in under 65µs, meaning that Velos has an extra 9 of availability for the same number of expected failures, compared to Mu.

References

[1] Marcos K. Aguilera, Naama Ben-David, Irina Calciu, Rachid Guerraoui, Erez Petrank, and Sam Toueg. Passing messages while sharing memory. In ACM Symposium on Principles of Distributed Computing (PODC), pages 51–60, July 2018.

[2] Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra J. Marathe, Athanasios Xygkis, and Igor Zablotchi. Microsecond consensus for microsecond applications. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 599–616, 2020.

[3] Romain Boichat, Partha Dutta, Svend Frølund, and Rachid Guerraoui. Deconstructing Paxos. ACM SIGACT News, 34(1):47–67, March 2003.

[4] Christian Cachin, Rachid Guerraoui, and Luís Rodrigues. Introduction to Reliable and Secure Distributed Programming. Springer Science & Business Media, 2011.

[5] Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone. Paxos made live: An engineering perspective. In ACM Symposium on Principles of Distributed Computing (PODC), pages 398–407, August 2007.

[6] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM (JACM), 43(4):685–722, July 1996.

[7] Al Danial. cloc: Count lines of code. https://github.com/AlDanial/cloc.

[8] Travis Downs. A benchmark for low-level CPU micro-architectural features. https://github.com/travisdowns/uarch-bench.

[9] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 32(2):374–382, 1985.

[10] Eli Gafni and Leslie Lamport. Disk Paxos. Distributed Computing (DIST), 16(1):1–20, February 2003.

[11] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 12(3):463–492, January 1990.

[12] Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, and Benjamin Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX Annual Technical Conference (ATC), June 2010.

[13] Zsolt István, David Sidler, Gustavo Alonso, and Marko Vukolic. Consensus in a box: Inexpensive coordination in hardware. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 425–438, March 2016.

[14] Sagar Jha, Jonathan Behrens, Theo Gkountouvas, Matthew Milano, Weijia Song, Edward Tremel, Robbert Van Renesse, Sydney Zink, and Kenneth P. Birman. Derecho: Fast state machine replication for cloud services. ACM Transactions on Computer Systems (TOCS), 36(2), April 2019.

[15] Xin Jin, Xiaozhou Li, Haoyu Zhang, Nate Foster, Jeongkeun Lee, Robert Soulé, Changhoon Kim, and Ion Stoica. NetChain: Scale-free sub-RTT coordination. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 35–49, April 2018.

[16] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16:133–169, May 1998.

[17] Leslie Lamport. Paxos made simple. ACM SIGACT News, 32(4):18–25, June 2001.

[18] Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, and Dan R. K. Ports. Just say NO to Paxos overhead: Replacing consensus with network ordering. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 467–483, November 2016.

[19] Ming Liu, Tianyi Cui, Henry Schuh, Arvind Krishnamurthy, Simon Peter, and Karan Gupta. Offloading distributed applications onto smartNICs using iPipe. In ACM Conference on SIGCOMM, pages 318–333, August 2019.

[20] David Mazieres. Paxos made practical. https://www.scs.stanford.edu/~dm/home/papers/paxos.pdf, 2007.

[21] Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. Fast crash recovery in RAMCloud. In ACM Symposium on Operating Systems Principles (SOSP), pages 29–41, October 2011.

[22] Seo Jin Park and John Ousterhout. Exploiting commutativity for practical fast replication. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 47–64, February 2019.

[23] Marius Poke and Torsten Hoefler. DARE: High-performance state machine replication on RDMA networks. In Symposium on High-Performance Parallel and Distributed Computing (HPDC), pages 107–118. ACM, June 2015.

[24] Mellanox Technologies. RDMA aware networks programming user manual, rev 1.7. https://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf. Accessed 2019-11-02.

[25] Cheng Wang, Jianyu Jiang, Xusheng Chen, Ning Yi, and Heming Cui. APUS: Fast and scalable Paxos on RDMA. In Symposium on Cloud Computing (SoCC), pages 94–107, September 2017.
