The Impact of RDMA on Agreement [Extended Version]

Marcos K. Aguilera, VMware, [email protected]
Naama Ben-David, CMU, [email protected]
Rachid Guerraoui, EPFL, [email protected]
Virendra Marathe, Oracle, [email protected]
Igor Zablotchi, EPFL, [email protected]

ABSTRACT

Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or another. With Byzantine failures, we give an algorithm that only requires n ≥ 2f_P + 1 processes (where f_P is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires n ≥ f_P + 1 processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions.

1 INTRODUCTION

In recent years, a technology known as Remote Direct Memory Access (RDMA) has made its way into data centers, earning a spotlight in distributed systems research. RDMA provides the traditional send/receive communication primitives, but also allows a process to directly read/write remote memory. Research work shows that RDMA leads to some new and exciting distributed algorithms [3, 9, 25, 31, 45, 49].

RDMA provides a different interface from previous communication mechanisms, as it combines message-passing with shared-memory [3]. Furthermore, to safeguard the remote memory, RDMA provides protection mechanisms to grant and revoke access for reading and writing data. This mechanism is fine grained: an application can choose subsets of remote memory called regions to protect; it can choose whether a region can be read, written, or both; and it can choose individual processes to be given access, where different processes can have different accesses. Furthermore, protections are dynamic: they can be changed by the application over time. In this paper, we lay the groundwork for a theoretical understanding of these RDMA capabilities, and we show that they lead to distributed algorithms that are inherently more powerful than before.

While RDMA brings additional power, it also introduces some challenges. With RDMA, the remote memories are subject to failures that cause them to become unresponsive. This behavior differs from traditional shared memory, which is often assumed to be reliable¹. In this paper, we show that the additional power of RDMA more than compensates for these challenges.

¹There are a few studies of failure-prone memory, as we discuss in related work.

Our main contribution is to show that RDMA improves on the fundamental trade-off in distributed systems between failure resilience and performance—specifically, we show how a consensus protocol can use RDMA to achieve both high resilience and high performance, while traditional algorithms had to choose one or another. We illustrate this on the fundamental problem of achieving consensus and capture the above RDMA capabilities as an M&M model [3], in which processes can use both message-passing and shared-memory. We consider asynchronous systems and require safety in all executions and liveness under standard additional assumptions (e.g., partial synchrony). We measure resiliency by the number of failures an algorithm tolerates, and performance by the number of (network) delays in common-case executions. Failure resilience and performance depend on whether processes fail by crashing or by being Byzantine, so we consider both.

With Byzantine failures, we consider the consensus problem called weak Byzantine agreement, defined by Lamport [37]. We give an algorithm that (a) requires only n ≥ 2f_P + 1 processes (where f_P is the maximum number of faulty processes) and (b) decides in two delays in the common case. With crash failures, we give the first algorithm for consensus that requires only n ≥ f_P + 1 processes and decides in two delays in the common case. With both Byzantine and crash failures, our algorithms can also tolerate crashes of memory—only m ≥ 2f_M + 1 memories are required, where f_M is the maximum number of faulty memories. Furthermore, with crash failures, we improve resilience further, to tolerate crashes of a minority of the combined set of memories and processes.

Our algorithms appear to violate known impossibility results: it is known that with message-passing, Byzantine agreement requires n ≥ 3f_P + 1 even if the system is synchronous [44], while consensus with crash failures requires n ≥ 2f_P + 1 if the system is partially synchronous [27]. There is no contradiction: our algorithms rely on the power of RDMA, not available in other systems.

RDMA's power comes from two features: (1) simultaneous access to message-passing and shared-memory, and (2) dynamic permissions. Intuitively, shared-memory helps resilience, message-passing helps performance, and dynamic permissions help both.

To see how shared-memory helps resilience, consider the Disk Paxos algorithm [29], which uses shared-memory (disks) but no messages. Disk Paxos requires only n ≥ f_P + 1 processes, matching the resilience of our algorithm. However, Disk Paxos is not as fast: it takes at least four delays. In fact, we show that no shared-memory consensus algorithm can decide in two delays (Section 6).

To see how message-passing helps performance, consider the Fast Paxos algorithm [39], which uses message-passing and no shared-memory. Fast Paxos decides in only two delays in common executions, but it requires n ≥ 2f_P + 1 processes.

Of course, the challenge is achieving both high resilience and good performance in a single algorithm. This is where RDMA's dynamic permissions shine. Clearly, dynamic permissions improve resilience against Byzantine failures, by preventing a Byzantine process from overwriting memory and making it useless. More surprising, perhaps, is that dynamic permissions help performance, by providing an uncontended instantaneous guarantee: if each process revokes the write permission of other processes before writing to a register, then a process that writes successfully knows that it executed uncontended, without having to take additional steps (e.g., to read the register). We use this technique in our algorithms for both Byzantine and crash failures.

In summary, our contributions are as follows:
• We consider distributed systems with RDMA, and we propose a model that captures some of its key properties while accounting for failures of processes and memories, with support of dynamic permissions.
• We show that the shared-memory part of our RDMA model improves resilience: our Byzantine agreement algorithm requires only n ≥ 2f_P + 1 processes.
• We show that shared-memory by itself does not permit consensus algorithms that decide in two steps in common executions.
• We show that with dynamic permissions, we can improve the performance of our Byzantine agreement algorithm, to decide in two steps in common executions.
• We give similar results for the case of crash failures: decision in two steps while requiring only n ≥ f_P + 1 processes.
• Our algorithms can tolerate the failure of memories, up to a minority of them.

The rest of the paper is organized as follows. Section 2 gives an overview of related work. In Section 3 we formally define the RDMA-compliant M&M model that we use in the rest of the paper, and specify the agreement problems that we solve. We then proceed to present the main contributions of the paper. Section 4 presents our fast and resilient Byzantine agreement algorithm. In Section 5 we consider the special case of crash-only failures, and show an improvement of the algorithm and tolerance bounds for this setting. In Section 6 we briefly outline a lower bound that shows that the dynamic permissions of RDMA are necessary for achieving our results. Finally, in Section 7 we discuss the semantics of RDMA in practice, and how our model reflects these features. To ease readability, most proofs have been deferred to the Appendices.
2 RELATED WORK

RDMA. Many high-performance systems were recently proposed using RDMA, such as distributed key-value stores [25, 31], communication primitives [25, 33], and shared address spaces across clusters [25]. Kalia et al. [32] provide guidelines for designing systems using RDMA. RDMA has also been applied to solve consensus [9, 45, 49]. Our model shares similarities with DARE [45] and APUS [49], which modify queue-pair state at run time to prevent or allow access to memory regions, similar to our dynamic permissions. These systems perform better than TCP/IP-based solutions, by exploiting the better raw performance of RDMA, without changing the fundamental communication complexity or failure-resilience of the consensus protocol. Similarly, Rüsch et al. [46] use RDMA as a replacement for TCP/IP in existing BFT protocols.

M&M. Message-and-memory (M&M) refers to a broad class of models that combine message-passing with shared-memory, introduced by Aguilera et al. in [3]. In that work, Aguilera et al. consider M&M models without memory permissions and failures, and show that such models lead to algorithms that are more robust to failures and asynchrony. In particular, they give a consensus algorithm that tolerates more crash failures than message-passing systems, but is more scalable than shared-memory systems, as well as a leader election algorithm that reduces the synchrony requirements. In this paper, our goal is to understand how memory permissions and failures in RDMA impact agreement.

Byzantine fault tolerance. Lamport, Shostak and Pease [40, 44] show that Byzantine agreement can be solved in synchronous systems iff n ≥ 3f_P + 1. With unforgeable signatures, Byzantine agreement can be solved iff n ≥ 2f_P + 1. In asynchronous systems subject to failures, consensus cannot be solved [28]. However, this result is circumvented by making additional assumptions for liveness, such as randomization [10] or partial synchrony [18, 27]. Many Byzantine agreement algorithms focus on safety and implicitly use the additional assumptions for liveness. Even with signatures, asynchronous Byzantine agreement can be solved only if n ≥ 3f_P + 1 [15].

It is well known that the resilience of Byzantine agreement varies depending on various model assumptions like synchrony, signatures, equivocation, and the exact variant of the problem to be solved. A system that has non-equivocation is one that can prevent a Byzantine process from sending different values to different processes. Table 1 summarizes some known results that are relevant to this paper.

Table 1: Known fault tolerance results for Byzantine agreement.

  Work               Synchrony  Signatures  Non-Equiv  Strong Validity  Resiliency
  [40]               ✓          ✓           ✗          ✓                2f + 1
  [40]               ✓          ✗           ✗          ✓                3f + 1
  [4, 41]            ✗          ✓           ✓          ✓                3f + 1
  [21]               ✗          ✓           ✗          ✗                3f + 1
  [21]               ✗          ✗           ✓          ✗                3f + 1
  [21]               ✗          ✓           ✓          ✗                2f + 1
  This paper (RDMA)  ✗          ✓           ✗          ✗                2f + 1

Our Byzantine agreement results share similarities with results for shared memory. Malkhi et al. [41] and Alon et al. [4] show bounds on the resilience of strong and weak consensus in a model with reliable memory but Byzantine processes. They also provide consensus protocols, using read-write registers enhanced with sticky bits (write-once memory) and access control lists not unlike our permissions. Bessani et al. [11] propose an alternative to sticky bits and access control lists through Policy-Enforced Augmented Tuple Spaces. All these works handle Byzantine failures with powerful objects rather than registers. Bouzid et al. [13] show that 3f_P + 1 processes are necessary for strong Byzantine agreement with read-write registers.

Some prior work solves Byzantine agreement with 2f_P + 1 processes using specialized trusted components that Byzantine processes cannot control [19, 20, 22, 23, 34, 48]. Some schemes decide in two delays but require a large trusted component: a coordinator [19], reliable broadcast [23], or message ordering [34]. For us, permission checking in RDMA is a trusted component of sorts, but it is small and readily available.

At a high level, our improved Byzantine fault tolerance is achieved by preventing equivocation by Byzantine processes, thereby effectively translating each Byzantine failure into a crash failure. Such translations from one type of failure into a less serious one have appeared extensively in the literature [8, 15, 21, 43]. Early work [8, 43] shows how to translate a crash-tolerant algorithm into a Byzantine-tolerant algorithm in the synchronous setting. Bracha [14] presents a similar translation for the asynchronous setting, in which n ≥ 3f_P + 1 processes are required to tolerate f_P Byzantine failures. Bracha's translation relies on the definition and implementation of a reliable broadcast primitive, very similar to the one in this paper. However, we show that using the capabilities of RDMA, we can implement it with higher fault tolerance.

Faulty memory. Afek et al. [2] and Jayanti et al. [30] study the problem of masking the benign failures of shared memory or objects. We use their ideas of replicating data across memories. Abraham et al. [1] consider honest processes but malicious memory.

Common-case executions. Many systems and algorithms tolerate adversarial scheduling but optimize for common-case executions without failures, asynchrony, contention, etc. (e.g., [12, 24, 26, 35, 36, 39, 42]). None of these match both the resilience and performance of our algorithms. Some algorithms decide in one delay but require n ≥ 5f_P + 1 for Byzantine failures [47] or n ≥ 3f_P + 1 for crash failures [16, 24].
3 MODEL AND PRELIMINARIES

We consider a message-and-memory (M&M) model, which allows processes to use both message-passing and shared-memory [3]. The system has n processes P = {p_1, ..., p_n} and m (shared) memories M = {μ_1, ..., μ_m}. Processes communicate by accessing memories or sending messages. Throughout the paper, memory refers to the shared memories, not the local state of processes.

Figure 1: Our model with processes and memories, which may both fail. Processes can send messages to each other or access registers in the memories. Registers in a memory are grouped into memory regions that may overlap, but in our algorithms they do not. Each region has a permission indicating what processes can read, write, and read-write the registers in the region (shown for two regions).

The system is asynchronous in that it can experience arbitrary delays. We expect algorithms to satisfy the safety properties of the problems we consider under this asynchronous system. For liveness, we require additional standard assumptions, such as partial synchrony, randomization, or failure detection.

Memory permissions. Each memory consists of a set of registers. To control access, an algorithm groups those registers into a set of (possibly overlapping) memory regions, and then defines permissions for those memory regions. Formally, a memory region mr of a memory μ is a subset of the registers of μ. We often refer to mr without specifying the memory μ explicitly. Each memory region mr has a permission, which consists of three disjoint sets of processes R_mr, W_mr, RW_mr indicating whether each process can read, write, or read-write the registers in the region. We say that p has read permission on mr if p ∈ R_mr or p ∈ RW_mr; we say that p has write permission on mr if p ∈ W_mr or p ∈ RW_mr. In the special case when R_mr = P \ {p}, W_mr = ∅, RW_mr = {p}, we say that mr is a Single-Writer Multi-Reader (SWMR) region—registers in mr correspond to the traditional notion of SWMR registers. Note that a register may belong to several regions, and a process may have access to the register on one region but not another—this models the existing RDMA behavior. Intuitively, when reading or writing data, a process specifies the region and the register, and the system uses the region to determine if access is allowed (we make this precise below).

Permission change. An algorithm indicates an initial permission for each memory region mr. Subsequently, the algorithm may wish to change the permission of mr during execution. For that, processes can invoke an operation changePermission(mr, new_perm), where new_perm is a triple (R, W, RW). This operation returns no results and it is intended to modify R_mr, W_mr, RW_mr to R, W, RW. To tolerate Byzantine processes, an algorithm can restrict processes from changing permissions. For that, the algorithm specifies a function legalChange(p, mr, old_perm, new_perm) which returns a boolean indicating whether process p can change the permission of mr to new_perm when the current permissions are old_perm. More precisely, when changePermission is invoked, the system evaluates legalChange to determine whether changePermission takes effect or becomes a no-op. When legalChange always returns false, we say that the permissions are static; otherwise, the permissions are dynamic.

Accessing memories. Processes access the memories via operations write(mr, r, v) and read(mr, r) for memory region mr, register r, and value v. A write(mr, r, v) by process p changes register r to v and returns ack if r ∈ mr and p has write permission on mr; otherwise, the operation returns nak. A read(mr, r) by process p returns the last value successfully written to r if r ∈ mr and p has read permission on mr; otherwise, the operation returns nak. In our algorithms, a register belongs to exactly one region, so we omit the mr parameter from write and read operations.
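The following is a minimal, single-process sketch (not part of the paper) of the interface just described, written in C: regions with disjoint R/W/RW sets, a changePermission operation guarded by an algorithm-supplied legalChange policy, and read/write operations that return ack or nak depending on the caller's permission. All names are illustrative.

    /* Sketch of Section 3's permission interface for one memory (illustrative). */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NPROC 8
    typedef uint32_t pset;                     /* set of processes as a bitmask */
    typedef struct { pset R, W, RW; } perm;

    typedef struct {
        perm p;
        int  nregs;
        long regs[16];                         /* registers in this region */
    } region;

    static bool in(pset s, int proc) { return (s >> proc) & 1u; }
    static bool can_read (const region *mr, int p) { return in(mr->p.R, p) || in(mr->p.RW, p); }
    static bool can_write(const region *mr, int p) { return in(mr->p.W, p) || in(mr->p.RW, p); }

    /* Algorithm-specific policy; always returning false == static permissions. */
    typedef bool (*legal_change_fn)(int p, const region *mr, perm oldp, perm newp);

    /* changePermission becomes a no-op if legalChange rejects it. */
    static void change_permission(region *mr, int p, perm newp, legal_change_fn legal) {
        if (legal && legal(p, mr, mr->p, newp)) mr->p = newp;
    }

    /* write/read return ack (true) or nak (false), as in the model. */
    static bool mem_write(region *mr, int p, int r, long v) {
        if (r >= mr->nregs || !can_write(mr, p)) return false;
        mr->regs[r] = v;
        return true;
    }
    static bool mem_read(const region *mr, int p, int r, long *out) {
        if (r >= mr->nregs || !can_read(mr, p)) return false;
        *out = mr->regs[r];
        return true;
    }

    /* Example policy: the only legal change is revoking p0's write access. */
    static bool only_revoke_p0(int p, const region *mr, perm oldp, perm newp) {
        (void)p; (void)mr;
        return newp.R == oldp.R && newp.W == oldp.W && newp.RW == (oldp.RW & ~1u);
    }

    int main(void) {
        region mr = { .p = { .R = 0xFE, .W = 0, .RW = 0x01 }, .nregs = 16 }; /* SWMR: p0 writes */
        printf("p0 write: %s\n", mem_write(&mr, 0, 3, 42) ? "ack" : "nak");
        perm revoked = { .R = 0xFE, .W = 0, .RW = 0 };
        change_permission(&mr, 2, revoked, only_revoke_p0);    /* p2 revokes p0's write access */
        printf("p0 write after revoke: %s\n", mem_write(&mr, 0, 3, 7) ? "ack" : "nak");
        return 0;
    }

This guarded-change structure is exactly what the Byzantine algorithms later rely on: a compromised process cannot grant itself permissions that legalChange forbids.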
Sending messages. Processes can also communicate by sending messages over a set of directed links. We assume messages are unique. If there is a link from process p to process q, then p can send messages to q. Links satisfy two properties: integrity and no-loss. Given two correct processes p and q, integrity requires that a message m be received by q from p at most once and only if m was previously sent by p to q. No-loss requires that a message m sent from p to q be eventually received by q. In our algorithms, we typically assume a fully connected network so that every pair of correct processes can communicate. We also consider the special case when there are no links (see below).

Executions and steps. An execution is a sequence of process steps. In each step, a process does the following, according to its local state: (1) sends a message or invokes an operation on a memory (read, write, or changePermission), (2) tries to receive a message or a response from an outstanding operation, and (3) changes local state. We require a process to have at most one outstanding operation on each memory.

Failures. A memory m may fail by crashing, which causes subsequent operations on its registers to hang without returning a response. Because the system is asynchronous, a process cannot differentiate a crashed memory from a slow one. We assume there is an upper bound f_M on the maximum number of memories that may crash. Processes may fail by crashing or becoming Byzantine. If a process crashes, it stops taking steps forever. If a process becomes Byzantine, it can deviate arbitrarily from the algorithm. However, that process cannot operate on memories without the required permission. We assume there is an upper bound f_P on the maximum number of processes that may be faulty. Where the context is clear, we omit the P and M subscripts from the number of failures, f.

Signatures. Our algorithms assume unforgeable signatures: there are primitives sign(v) and sValid(p, v) which, respectively, sign a value v and determine if v is signed by process p.

Messages and disks. The model defined above includes two common models as special cases. In the message-passing model, there are no memories (m = 0), so processes can communicate only by sending messages. In the disk model [29], there are no links, so processes can communicate only via memories; moreover, each memory has a single region which always permits all processes to read and write all registers.

Consensus. In the consensus problem, processes propose an initial value and must make an irrevocable decision on a value. With crash failures, we require the following properties:
• Uniform Agreement. If processes p and q decide v_p and v_q, then v_p = v_q.
• Validity. If some process decides v, then v is the initial value proposed by some process.
• Termination. Eventually all correct processes decide.
We expect Agreement and Validity to hold in an asynchronous system, while Termination requires standard additional assumptions (partial synchrony, randomization, failure detection, etc). With Byzantine failures, we change these definitions so the problem can be solved. We consider weak Byzantine agreement [37], with the following properties:
• Agreement. If correct processes p and q decide v_p and v_q, then v_p = v_q.
• Validity. With no faulty processes, if some process decides v, then v is the input of some process.
• Termination. Eventually all correct processes decide.

Complexity of algorithms. We are interested in the performance of algorithms in common-case executions, when the system is synchronous and there are no failures. In those cases, we measure performance using the notion of delays, which extends message-delays to our model. Under this metric, computations are instantaneous, each message takes one delay, and each memory operation takes two delays. Intuitively, a delay represents the time incurred by the network to transmit a message; a memory operation takes two delays because its hardware implementation requires a round trip. We say that a consensus protocol is k-deciding if, in common-case executions, some process decides in k delays.
4 BYZANTINE FAILURES

We now consider Byzantine failures and give a 2-deciding algorithm for weak Byzantine agreement with n ≥ 2f_P + 1 processes and m ≥ 2f_M + 1 memories. The algorithm consists of the composition of two sub-algorithms: a slow one that always works, and a fast one that gives up under hard conditions.

The first sub-algorithm, called Robust Backup, is developed in two steps. We first implement a reliable broadcast primitive, which prevents Byzantine processes from sending different values to different processes. Then, we use the framework of Clement et al. [21] combined with this primitive to convert a message-passing consensus algorithm that tolerates crash failures into a consensus algorithm that tolerates Byzantine failures. This yields Robust Backup.² It uses only static permissions and assumes memories are split into SWMR regions. Therefore, this sub-algorithm works in the traditional shared-memory model with SWMR registers, and it may be of independent interest.

²The attentive reader may wonder why at this point we have not achieved a 2-deciding algorithm already: if we apply Clement et al. [21] to a 2-deciding crash-tolerant algorithm (such as Fast Paxos [12]), will the result not be a 2-deciding Byzantine-tolerant algorithm? The answer is no, because Clement et al. needs reliable broadcast, which incurs at least 6 delays.

The second sub-algorithm is called Cheap Quorum. It uses dynamic permissions to decide in two delays using one signature in common executions. However, the sub-algorithm gives up if the system is not synchronous or there are Byzantine failures.

Finally, we combine both sub-algorithms using ideas from the Abstract framework of Aublin et al. [7]. More precisely, we start by running Cheap Quorum; if it aborts, we run Robust Backup. There is a subtlety: for this idea to work, Robust Backup must decide on a value v if Cheap Quorum decided v previously. To do that, Robust Backup decides on a preferred value if at least f + 1 processes have this value as input. To do so, we use the classic crash-tolerant Paxos algorithm (run under the Robust Backup algorithm to ensure Byzantine tolerance) but with an initial set-up phase that ensures this safe decision. We call the protocol Preferential Paxos.

4.1 The Robust Backup Sub-Algorithm

We develop Robust Backup using the construction by Clement et al. [21], which we now explain. Clement et al. show how to transform a message-passing algorithm A that tolerates f_P crash failures into a message-passing algorithm that tolerates f_P Byzantine failures in a system with n ≥ 2f_P + 1 processes, assuming unforgeable signatures and a non-equivocation mechanism. They do so by implementing trusted message-passing primitives, T-send and T-receive, using non-equivocation and signature verification on every message. Processes include their full history with each message, and then verify locally whether a received message is consistent with the protocol. This restricts Byzantine behavior to crash failures.

To apply this construction in our model, we show that our model can implement non-equivocation and message passing. We first show that shared-memory with SWMR registers (and no memory failures) can implement these primitives, and then show how our model can implement shared-memory with SWMR registers.

4.1.1 Reliable Broadcast. Consider a shared-memory system. We present a way to prevent equivocation through a solution to the reliable broadcast problem, which we recall below. Note that our definition of reliable broadcast includes the sequence number k (as opposed to being single-shot) so as to facilitate the integration with the Clement et al. construction, as we explain in Section 4.1.2.

DEFINITION 1. Reliable broadcast is defined in terms of two primitives, broadcast(k, m) and deliver(k, m, q). When a process p invokes broadcast(k, m) we say that p broadcasts (k, m). When a process p invokes deliver(k, m, q) we say that p delivers (k, m) from q. Each correct process p must invoke broadcast(k, *) with k one higher than p's previous invocation (and first invocation with k = 1). The following holds:
(1) If a correct process p broadcasts (k, m), then all correct processes eventually deliver (k, m) from p.
(2) If p and q are correct processes, p delivers (k, m) from r, and q delivers (k, m′) from r, then m = m′.
(3) If a correct process delivers (k, m) from a correct process p, then p must have broadcast (k, m).
(4) If a correct process delivers (k, m) from p, then all correct processes eventually deliver (k, m′) from p for some m′.

Algorithm 2 shows how to implement reliable broadcast that is tolerant to a minority of Byzantine failures in shared-memory using SWMR registers.

Algorithm 2: Reliable Broadcast

    SWMR Value[n,M,n];   initialized to ⊥. Value[p] is an array of SWMR(p) registers.
    SWMR L1Proof[n,M,n]; initialized to ⊥. L1Proof[p] is an array of SWMR(p) registers.
    SWMR L2Proof[n,M,n]; initialized to ⊥. L2Proof[p] is an array of SWMR(p) registers.

    Code for process p
    last[n]:  local array with last k delivered from each process. Initially, last[q] = 0
    state[n]: local array of registers. state[q] ∈ {WaitForSender, WaitForL1Proof, WaitForL2Proof}.
              Initially, state[q] = WaitForSender

    broadcast(k,m) {
        Value[p,k,p].write(sign((k,m))); }

    value checkL2proof(q,k) {
        for i ∈ Π {
            proof = L2Proof[i,k,q].read();
            if (proof != ⊥ && check(proof)) {
                L2Proof[p,k,q].write(proof);
                return proof.msg; } }
        return null; }

    for q in Π in parallel {
        while true {
            try_deliver(q); } }

    try_deliver(q) {
        k = last[q];
        val = checkL2proof(q,k);
        if (val != null) {
            deliver(k, val, q);
            last[q] += 1;
            state[q] = WaitForSender;
            return; }

        if state[q] == WaitForSender {
            val = Value[q,k,q].read();
            if (val == ⊥ || !sValid(q, val) || val.key != k)
                return;
            Value[p,k,q].write(sign(val));
            state[q] = WaitForL1Proof; }

        if state[q] == WaitForL1Proof {
            checkedVals = ∅;
            for i ∈ Π {
                val = Value[i,k,q].read();
                if (val != ⊥ && sValid(i, val) && val.key == k) {
                    add val to checkedVals; } }
            if (size(checkedVals) ≥ majority and checkedVals contains only one value) {
                l1prf = sign(checkedVals);
                L1Proof[p,k,q].write(l1prf);
                state[q] = WaitForL2Proof; } }

        if state[q] == WaitForL2Proof {
            checkedL1Proofs = ∅;
            for i in Π {
                proof = L1Proof[i,k,q].read();
                if (checkL1Proof(proof)) {
                    add proof to checkedL1Proofs; } }
            if (size(checkedL1Proofs) ≥ majority) {
                l2prf = sign(checkedL1Proofs);
                L2Proof[p,k,q].write(l2prf); } } }

To broadcast its k-th message m, p simply signs (k, m) and writes it in slot Value[p, k, p] of its memory³.

³The indexing of the slots is as follows: the first index is the writer of the SWMR register, the second index is the sequence number of the message, and the third index is the sender of the message.

Delivering a message from another process is more involved, requiring verification steps to ensure that all correct processes will eventually deliver the same message and no other. The high-level idea is that before delivering a message (k, m) from q, each process p checks that no other process saw a different value from q, and waits to hear that "enough" other processes also saw the same value. More specifically, each process p has 3 slots per process per sequence number, that only p can write to, but all processes can read from. These slots are initialized to ⊥, and p uses them to write the values that it has seen. The 3 slots represent 3 levels of 'proofs' that this value is correct; for each process q and sequence number k, p has a slot to write (1) the initial value v it read from q for k, (2) a proof that at least f + 1 processes saw the same value v from q for k, and (3) a proof that at least f + 1 processes wrote a proof of seeing value v from q for k in their second slot. We call these slots the Value slot, the L1Proof slot, and the L2Proof slot, respectively.

We note that each such valid proof has signed copies of only one value for the message. Any proof that shows copies of two different values or a value that is not signed is not considered valid. If a proof has copies of only value v, we say that this proof supports v.

To deliver a value v from process q with sequence number k, process p must successfully write a valid proof-of-proofs in its L2Proof slot supporting value v (we call this an L2 proof). It has two options of how to do this; firstly, if it sees a valid L2 proof in some other process i's L2Proof[i, k, q] slot, it copies this proof over to its own L2 proof slot, and can then deliver the value that this proof supports. If p does not find a valid L2 proof in some other process's slot, it must try to construct one itself. We now describe how this is done.

A correct process p goes through three stages when constructing a valid L2 proof for (k, m) from q. In the pseudocode, the three stages are denoted using states that p goes through: WaitForSender, WaitForL1Proof, and WaitForL2Proof.

In the first stage, WaitForSender, p reads q's Value[q, k, q] slot. If p finds a (k, m) pair, p signs and copies it to its Value[p, k, q] slot and enters the WaitForL1Proof state.

In the second stage, WaitForL1Proof, p reads all Value[i, k, q] slots, for i ∈ Π. If all the values p reads are correctly signed and equal to (k, m), and if there are at least f + 1 such values, then p compiles them into an L1 proof, which it signs and writes to L1Proof[p, k, q]; p then enters the WaitForL2Proof state.

In the third stage, WaitForL2Proof, p reads all L1Proof[i, k, q] slots, for i ∈ Π. If p finds at least f + 1 valid and signed L1 proofs for (k, m), then p compiles them into an L2 proof, which it signs and writes to L2Proof[p, k, q]. The next time that p scans the L2Proof[·, k, q] slots, p will see its own L2 proof (or some other valid proof for (k, m)) and deliver (k, m).

This three-stage validation process ensures the following crucial property: no two valid L2 proofs can support different values. Intuitively, this property is achieved because for both L1 and L2 proofs, at least f + 1 values of the previous stage must be copied, meaning that at least one correct process was involved in the quorum needed to construct each proof. Because correct processes read the slots of all others at each stage before constructing the next proof, and because they never overwrite or delete values that they already wrote, it is guaranteed that no two correct processes will create valid L1 proofs for different values, since one must see the Value slot of the other. Thus, no two processes, Byzantine or otherwise, can construct valid L2 proofs for different values.

Notably, a weaker version of broadcast, which does not require Property 4 (i.e., Byzantine consistent broadcast [17, Module 3.11]), can be solved with just the first stage of Algorithm 2, without the L1 and L2 proofs. The purpose of those proofs is to ensure the 4th property holds; that is, to enable all correct processes to deliver a value once some correct process delivered.

In the appendix, we formally prove the above intuition and arrive at the following lemma.

LEMMA 4.1. Reliable broadcast can be solved in shared-memory with SWMR regular registers with n ≥ 2f + 1 processes.
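Algorithm 2 leaves the proof checks (check and checkL1Proof) abstract. The following C sketch spells out, under our own naming assumptions rather than the paper's code, what "a proof supports v" could mean operationally: at least f + 1 copies of one and the same (k, m) pair, each correctly signed by a distinct process; a proof containing two different values, an unsigned copy, or too few copies is rejected. verify_sig() stands in for the model's sValid primitive.

    /* Sketch of L1/L2 proof validation for Algorithm 2 (illustrative). */
    #include <stdbool.h>
    #include <string.h>

    #define NPROC 7
    #define F     ((NPROC - 1) / 2)

    typedef struct {
        int  signer;            /* process that signed this copy            */
        int  k;                 /* sequence number                          */
        char msg[64];           /* broadcast value                          */
        bool sig_ok;            /* stands in for cryptographic verification */
    } signed_copy;

    static bool verify_sig(const signed_copy *c) { return c->sig_ok; }

    /* Returns true iff the copies form a valid proof supporting (k, out_msg). */
    static bool proof_supports(const signed_copy *copies, int ncopies,
                               int k, char *out_msg) {
        bool seen[NPROC] = { false };
        int  good = 0;
        const char *value = NULL;

        for (int i = 0; i < ncopies; i++) {
            const signed_copy *c = &copies[i];
            if (!verify_sig(c) || c->k != k) return false;       /* unsigned or stale copy */
            if (c->signer < 0 || c->signer >= NPROC || seen[c->signer])
                return false;                                     /* duplicate signer       */
            if (value == NULL) value = c->msg;
            else if (strcmp(value, c->msg) != 0) return false;    /* two values: invalid    */
            seen[c->signer] = true;
            good++;
        }
        if (good < F + 1) return false;                           /* quorum too small       */
        strcpy(out_msg, value);
        return true;
    }

Because any two (f + 1)-sized sets of signers intersect in a correct process, two proofs accepted by this check cannot support different values, which is the crucial property argued above.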
4.1.2 Applying Clement et al.'s Construction. Clement et al. show that given unforgeable transferable signatures and non-equivocation, one can reduce Byzantine failures to crash failures in message-passing systems [21]. They define non-equivocation as a predicate valid_p for each process p, which takes a sequence number and a value and evaluates to true for just one value per sequence number. All processes must be able to call the same valid_p predicate, which always terminates every time it is called.

We now show how to use reliable broadcast to implement messages with transferable signatures and non-equivocation as defined by Clement et al. [21]. Note that our reliable broadcast mechanism already involves the use of transferable signatures, so to send and receive signed messages, one can simply use broadcast and deliver those messages. However, simply using broadcast and deliver is not enough to satisfy the requirements of the valid_p predicate of Clement et al. The problem occurs when trying to validate nested messages recursively.

In particular, recall that in Clement et al.'s construction, whenever a message is sent, the entire history of that process, including all messages it has sent and received, is attached. Consider two Byzantine processes q1 and q2, and assume that q1 attempts to equivocate in its kth message, signing both (k, m) and (k, m′). Assume therefore that no correct process delivers any message from q1 in its kth round. However, since q2 is also Byzantine, it could claim to have delivered (k, m) from q1. If q2 then sends a message that includes (q1, k, m) as part of its history, a correct process p receiving q2's message must recursively verify the history q2 sent. To do so, p can call try_deliver on (q1, k). However, since no correct process delivered any message from (q1, k), it is possible that this call never returns.

To solve this issue, we introduce a validate operation that can be used along with broadcast and deliver to validate the correctness of a given message. The validate operation is very simple: it takes in a process id, a sequence number, and a message value m, and simply runs the checkL2proof helper function. If the function returns a proof supporting m, validate returns true. Otherwise it returns false. The pseudocode is shown in Algorithm 3.

In this way, Algorithms 2 and 3 together provide signed messages and a non-equivocation primitive. Thus, combined with the construction of Clement et al. [21], we immediately get the following result.
THEOREM 4.2. There exists an algorithm for weak Byzantine agreement in a shared-memory system with SWMR regular registers, signatures, and up to f_P Byzantine processes, where n ≥ 2f_P + 1.

Algorithm 3: Validate Operation for Reliable Broadcast

    bool validate(q,k,m) {
        val = checkL2proof(q,k);
        if (val == m) {
            return true; }
        return false; }

Non-equivocation in our model. To convert the above algorithm to our model, where memory may fail, we use the ideas in [2, 6, 30] to implement failure-free SWMR regular registers from the fail-prone memory, and then run weak Byzantine agreement using those regular registers. To implement an SWMR register, a process writes or reads all memories, and waits for a majority to respond. When reading, if p sees exactly one distinct non-⊥ value v across the memories, it returns v; otherwise, it returns ⊥.
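The following single-threaded C sketch (our own illustration, not the paper's code) shows the register emulation just described: a write completes once a majority of memories have stored the value, and a read returns v only if exactly one distinct non-⊥ value is observed across the responding memories, and ⊥ otherwise. In a real deployment these would be parallel RDMA operations; here a crashed memory simply never responds.

    /* Sketch of a majority-replicated SWMR register over fail-prone memories. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define M 5                          /* number of memories, m >= 2*f_M + 1 */
    #define MAJORITY (M / 2 + 1)

    typedef struct {
        bool crashed;
        const char *slot;                /* one SWMR register; NULL means ⊥ */
    } memory;

    static memory mems[M];

    /* Returns true once a majority of memories acknowledged the write. */
    static bool swmr_write(const char *v) {
        int acks = 0;
        for (int i = 0; i < M; i++) {
            if (mems[i].crashed) continue;        /* crashed memory never responds */
            mems[i].slot = v;
            acks++;
        }
        return acks >= MAJORITY;
    }

    /* Returns the unique non-⊥ value seen across responding memories, else NULL (⊥). */
    static const char *swmr_read(void) {
        const char *seen = NULL;
        int replies = 0;
        for (int i = 0; i < M; i++) {
            if (mems[i].crashed) continue;
            replies++;
            if (mems[i].slot == NULL) continue;
            if (seen == NULL) seen = mems[i].slot;
            else if (strcmp(seen, mems[i].slot) != 0) return NULL;  /* two distinct values */
        }
        return (replies >= MAJORITY) ? seen : NULL;
    }

    int main(void) {
        mems[1].crashed = true;          /* a minority of memory crashes is tolerated */
        mems[4].crashed = true;
        if (swmr_write("v")) printf("write acked by a majority\n");
        const char *v = swmr_read();
        printf("read: %s\n", v ? v : "bottom");
        return 0;
    }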

DEFINITION 2. Let A be a message-passing algorithm. Robust Backup(A) is the algorithm A in which all send and receive operations are replaced by T-send and T-receive operations (respectively) implemented with reliable broadcast.

Thus we get the following lemma, from the result of Clement et al. [21], Lemma 4.1, and the above handling of memory failures.

LEMMA 4.3. If A is a consensus algorithm that is tolerant to f_P process crash failures, then Robust Backup(A) is a weak Byzantine agreement algorithm that is tolerant to up to f_P Byzantine processes and f_M memory crashes, where n ≥ 2f_P + 1 and m ≥ 2f_M + 1 in the message-and-memory model.

The following theorem is an immediate corollary of the lemma.

THEOREM 4.4. There exists an algorithm for Weak Byzantine Agreement in a message-and-memory model with up to f_P Byzantine processes and f_M memory crashes, where n ≥ 2f_P + 1 and m ≥ 2f_M + 1.

4.2 The Cheap Quorum Sub-Algorithm

We now give an algorithm that decides in two delays in common executions in which the system is synchronous and there are no failures. It requires only one signature for a fast decision, whereas the best prior algorithm requires 6f_P + 2 signatures and n ≥ 3f_P + 1 [7]. Our algorithm, called Cheap Quorum, is not in itself a complete consensus algorithm; it may abort in some executions. If Cheap Quorum aborts, it outputs an abort value, which is used to initialize the Robust Backup so that their composition preserves weak Byzantine agreement. This composition is inspired by the Abstract framework of Aublin et al. [7].

The algorithm has a special process ℓ, say ℓ = p1, which serves both as a leader and a follower. Other processes act only as followers. The memory is partitioned into n + 1 regions denoted Region[p] for each p ∈ Π, plus an extra one for p1, Region[ℓ], in which it proposes a value. Initially, Region[p] is a regular SWMR region where p is the writer. Unlike in Algorithm 2, some of the permissions are dynamic; processes may remove p1's write permission to Region[ℓ] (i.e., the legalChange function returns false to any permission change requests, except for ones revoking p1's permission to write on Region[ℓ]).

Algorithm 4: Cheap Quorum normal operation—code for process p

    Leader code
    propose(v) {
        sign(v);
        status = Value[ℓ].write(v);
        if (status == nak) Panic_mode();
        else decide(v); }

    Follower code
    propose(w) {
        do { v = read(Value[ℓ]);
             for all q ∈ Π do pan[q] = read(Panic[q]);
        } until (v ≠ ⊥ || pan[q] == true for some q || timeout);
        if (v ≠ ⊥ && sValid(p1, v)) {
            sign(v);
            write(Value[p], v);
            do { for all q ∈ Π do val[q] = read(Value[q]);
                 if |{q : val[q] == v}| ≥ n then {
                     Proof[p].write(sign(val[1..n]));
                     for all q ∈ Π do prf[q] = read(Proof[q]);
                     if |{q : verifyProof(prf[q]) == true}| ≥ n { decide(v); exit; } }
                 for all q ∈ Π do pan[q] = read(Panic[q]);
            } until (pan[q] == true for some q || timeout); }
        Panic_mode(); }

Algorithm 5: Cheap Quorum panic mode—code for process p

    panic_mode() {
        Panic[p] = true;
        changePermission(Region[ℓ], R: Π, W: {}, RW: {});   // remove write permission
        v = read(Value[p]);
        prf = read(Proof[p]);
        if (v ≠ ⊥) { Abort with ⟨v, prf⟩; return; }
        LVal = read(Value[ℓ]);
        if (LVal ≠ ⊥) { Abort with ⟨LVal, ⊥⟩; return; }
        Abort with ⟨myInput, ⊥⟩; }

Processes initially execute under a normal mode in common-case executions, but may switch to panic mode if they intend to abort, as in [7]. The pseudo-code of the normal mode is in Algorithm 4. Region[p] contains three registers Value[p], Panic[p], Proof[p], initially set to ⊥, false, ⊥. To propose v, the leader p1 signs v and writes it to Value[ℓ]. If the write is successful (it may fail because its write permission was removed), then p1 decides v; otherwise p1 calls Panic_mode(). Note that all processes, including p1, continue their execution after deciding. However, p1 never decides again if it decided as the leader. A follower q checks if p1 wrote to Value[ℓ] and, if so, whether the value is properly signed. If so, q signs v, writes it to Value[q], and waits for other processes to write the same value to Value[∗]. If q sees 2f + 1 copies of v signed by different processes, q assembles these copies in a unanimity proof, which it signs and writes to Proof[q]. q then waits for 2f + 1 unanimity proofs for v to appear in Proof[∗], and checks that they are valid, in which case q decides v. This waiting continues until a timeout expires⁴, at which time q calls Panic_mode(). In Panic_mode(), a process p sets Panic[p] to true to tell other processes it is panicking; other processes periodically check to see if they should panic too. p then removes write permission from Region[ℓ], and decides on a value to abort: either Value[p] if it is non-⊥, Value[ℓ] if it is non-⊥, or p's input value. If p has a unanimity proof in Proof[p], it adds it to the abort value.

⁴The timeout is chosen to be an upper bound on the communication, processing and computation delays in the common case.

In Appendix B, we prove the correctness of Cheap Quorum, and in particular we show the following two important agreement properties:

LEMMA 4.5 (CHEAP QUORUM DECISION AGREEMENT). Let p and q be correct processes. If p decides v1 while q decides v2, then v1 = v2.

LEMMA 4.6 (CHEAP QUORUM ABORT AGREEMENT). Let p and q be correct processes (possibly identical). If p decides v in Cheap Quorum while q aborts from Cheap Quorum, then v will be q's abort value. Furthermore, if p is a follower, q's abort proof is a correct unanimity proof.

The above construction assumes a fail-free memory with regular registers, but we can extend it to tolerate memory failures using the approach of Section 4.1, noting that each register has a single writer process.
4.3 Putting it Together: the Fast & Robust Algorithm

The final algorithm, called Fast & Robust, combines Cheap Quorum (§4.2) and Robust Backup (§4.1), as we now explain. Recall that Robust Backup is parameterized by a message-passing consensus algorithm A that tolerates crash failures. A can be any such algorithm (e.g., Paxos).

Roughly, in Fast & Robust, we run Cheap Quorum and, if it aborts, we use a process's abort value as its input value to Robust Backup. However, we must carefully glue the two algorithms together to ensure that if some correct process decided v in Cheap Quorum, then v is the only value that can be decided in Robust Backup.

For this purpose, we propose a simple wrapper for Robust Backup, called Preferential Paxos. Preferential Paxos first runs a set-up phase, in which processes may adopt new values, and then runs Robust Backup with the new values. More specifically, there are some preferred input values v1 ... vk, ordered by priority. We guarantee that every process adopts one of the top f + 1 priority inputs. In particular, this means that if a majority of processes get the highest priority value, v1, as input, then v1 is guaranteed to be the decision value. The set-up phase is simple; all processes send each other their input values. Each process p waits to receive n − f such messages, and adopts the value with the highest priority that it sees. This is the value that p uses as its input to Paxos. The pseudocode for Preferential Paxos is given in Algorithm 8 in Appendix C, where we also prove the following lemma about Preferential Paxos:

LEMMA 4.7 (PREFERENTIAL PAXOS PRIORITY DECISION). Preferential Paxos implements weak Byzantine agreement with n ≥ 2f_P + 1 processes. Furthermore, let v1, ..., vn be the input values of an instance C of Preferential Paxos, ordered by priority. The decision value of correct processes is always one of v1, ..., v_{f+1}.

We can now describe Fast & Robust in detail. We start executing Cheap Quorum. If Cheap Quorum aborts, we execute Preferential Paxos, with each process receiving its abort value from Cheap Quorum as its input value to Preferential Paxos. We define the priorities of inputs to Preferential Paxos as follows.

DEFINITION 3 (INPUT PRIORITIES FOR PREFERENTIAL PAXOS). The input values for Preferential Paxos as it is used in Fast & Robust are split into three sets (here, p1 is the leader of Cheap Quorum):
• T = {v | v contains a correct unanimity proof}
• M = {v | v ∉ T ∧ v contains the signature of p1}
• B = {v | v ∉ T ∧ v ∉ M}
The priority order of the input values is such that for all values vT ∈ T, vM ∈ M, and vB ∈ B, priority(vT) > priority(vM) > priority(vB).
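As a concrete illustration of the glue just described, the following C sketch (hypothetical helper names, not the paper's code) classifies a Cheap Quorum abort value into Definition 3's classes T, M, B and applies Preferential Paxos' set-up rule of adopting the highest-priority value among the first n − f inputs received. has_unanimity_proof() and signed_by_leader() stand in for the checks on the abort certificate.

    /* Sketch of Definition 3's priorities and the Preferential Paxos set-up phase. */
    #include <stdbool.h>
    #include <stddef.h>

    #define N  5
    #define F  ((N - 1) / 2)

    typedef struct {
        const char *value;
        bool has_unanimity_proof;    /* valid proof of 2f+1 matching signed copies */
        bool signed_by_leader;       /* carries p1's signature                     */
    } abort_value;

    enum prio_class { CLASS_B = 0, CLASS_M = 1, CLASS_T = 2 };

    /* Definition 3: T beats M beats B. */
    static enum prio_class classify(const abort_value *v) {
        if (v->has_unanimity_proof) return CLASS_T;
        if (v->signed_by_leader)    return CLASS_M;
        return CLASS_B;
    }

    /* Set-up phase: among the first n-f abort values received, adopt the one with
     * the highest priority; this becomes the process's input to Robust Backup. */
    static const abort_value *adopt(const abort_value *received, size_t count) {
        const abort_value *best = NULL;
        for (size_t i = 0; i < count && i < (size_t)(N - F); i++) {
            if (best == NULL || classify(&received[i]) > classify(best))
                best = &received[i];
        }
        return best;
    }

Because a decision in Cheap Quorum implies that 2f + 1 processes hold matching signed copies, at least f + 1 correct processes enter Preferential Paxos with a class-T input, which (by Lemma 4.7) forces that value to be the only possible decision.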
Figure 6: Interactions of the components of the Fast & Robust Algorithm. (Diagram omitted; it shows Cheap Quorum and Preferential Paxos, the latter built on Robust Backup and reliable broadcast over RDMA, with their commit-value and abort-value paths.)

Figure 6 shows how the various algorithms presented in this section come together to form the Fast & Robust algorithm. In Appendices B and C, we show that Fast & Robust is correct, with the following key lemma:

LEMMA 4.8 (COMPOSITION LEMMA). If some correct process decides a value v in Cheap Quorum before an abort, then v is the only value that can be decided in Preferential Paxos with priorities as defined in Definition 3.

THEOREM 4.9. There exists a 2-deciding algorithm for Weak Byzantine Agreement in a message-and-memory model with up to f_P Byzantine processes and f_M memory crashes, where n ≥ 2f_P + 1 and m ≥ 2f_M + 1.

5 CRASH FAILURES

We now restrict ourselves to crash failures of processes and memories. Clearly, we can use the algorithms of Section 4 in this setting, to obtain a 2-deciding consensus algorithm with n ≥ 2f_P + 1 and m ≥ 2f_M + 1. However, this is overkill since those algorithms use sophisticated mechanisms (signatures, non-equivocation) to guard against Byzantine behavior. With only crash failures, we now show it is possible to retain the efficiency of a 2-deciding algorithm while improving resiliency. In Section 5.1, we first give a 2-deciding algorithm that allows the crash of all but one process (n ≥ f_P + 1) and a minority of memories (m ≥ 2f_M + 1). In Section 5.2, we improve resilience further by giving a 2-deciding algorithm that tolerates crashes of a minority of the combined set of memories and processes.
↩→ CurrentVal)); If 푝 does not abort, it adopts the value with highest proposal 36 if (not success) { abort(); } 37 } until this has been done at a majority of the number of all those it read in the memories. To make it clear that ↩→ memories, or until 'abort' has been races should be avoided among parallel threads in the pseudocode, ↩→ called we wrap this part in an atomic environment. 38 if (loop completed without abort) { 39 decide CurrentVal; } } } In the next phase, each of 푝’s threads writes its value to its slot on its memory. If a write fails, 푝 gives up. If 푝 succeeds, this is where we optimize time: 푝 can simply decide, whereas Disk Paxos must read the memories again. tolerates crashes of a minority of the combined set of memories and Note that it is possible that some of the memories that made up processes. the majority that passed the initial barrier may crash later on. To prevent 푝 from stalling forever in such a situation, it is important 5.1 Protected Memory Paxos that straggler threads that complete phase 1 later on be allowed to Our starting point is the Disk Paxos algorithm [29], which works in participate in phase 2. However, if such a straggler thread observes a system with processes and memories where 푛 ≥ 푓푃 + 1 and 푚 ≥ a more up-to-date value in its memory than the one adopted by 푝 2푓푀 + 1. This is our resiliency goal, but Disk Paxos takes four delays for phase 2, this must be taken into account. In this case, to avoid in common executions. Our new algorithm, called Protected Memory inconsistencies, 푝 must abort its current attempt and restart the loop Paxos, removes two delays; it retains the structure of Disk Paxos from scratch. but uses permissions to skip steps. Initially some fixed leader ℓ = 푝1 The code ensures that some correct process eventually decides, has exclusive write permission to all memories; if another process but it is easy to extend it so all correct processes decide [18], by becomes leader, it takes the exclusive permission. Having exclusive having a decided process broadcast its decision. Also, the code permission permits a leader ℓ to optimize execution, because ℓ can shows one instance of consensus, with 푝1 as initial leader. With do two things simultaneously: (1) write its consensus proposal and many consensus instances, the leader terminates one instance and (2) determine whether another leader took over. Specifically, if ℓ becomes the default leader in the next. succeeds in (1), it knows no leader ℓ ′ took over because ℓ ′ would have taken the permission. Thus ℓ avoids the last read in Disk Paxos, THEOREM 5.1. Consider a message-and-memory model with up saving two delays. Of course, care must be taken to implement this to 푓푃 process crashes and 푓푀 memory crashes, where 푛 ≥ 푓푃 + 1 and without violating safety. 푚 ≥ 2푓푀 + 1. There exists a 2-deciding algorithm for consensus. 5.2 Aligned Paxos eventually decide value 푣 ′. Let 푡 ′ be the time at which 푝′ decides. All ′ We now further enhance the failure resilience. We show that mem- of 푝’s write operations terminate and are linearized in 퐸 after time ′ ′ ories and processes are equivalent agents, in that it suffices for a 푡 . Execution 퐸 is indistinguishable to 푝 from execution 퐸, in which ′ majority of the agents (processes and memories together) to remain it ran alone. Therefore, 푝 decides 푣 ≠ 푣 , violating agreement. □ alive to solve consensus. Our new algorithm, Aligned Paxos, achieves this resiliency. 
To do so, the algorithm relies on the ability to use both the messages and the memories in our model; permissions are Theorem 6.1, together with the Fast Paxos algorithm of Lam- not needed. The key idea is to align a message-passing algorithm and port [39], shows that an atomic read-write shared memory model a memory-based algorithm to use any majority of agents. We align is strictly weaker than the message passing model in its ability to Paxos [38] and Protected Memory Paxos so that their decisions are solve consensus quickly. This result may be of independent interest, coordinated. More specifically, Protected Memory Paxos and Paxos since often the classic shared memory and message passing models have two phases. To align these algorithms, we factor out their differ- are seen as equivalent, because of the seminal computational equiv- ences and replace their steps with an abstraction that is implemented alence result of Attiya, Bar-Noy, and Dolev [6]. Interestingly, it is differently for each algorithm. The result is our Aligned Paxos algo- known that shared memory can tolerante more failures when solving rithm, which has two phases, each with three steps: communicate, consensus (with randomization or partial synchrony) [5, 15], and hear back, and analyze. Each step treats processes and memories therefore it seems that perhaps shared memory is strictly stronger separately, and translates the results of operations on different agents than message passing for solving consensus. However, our result to a common language. We implement the steps using their ana- shows that there are aspects in which message passing is stronger logues in Paxos and Protected Memory Paxos5. The pseudocode of than shared memory. In particular, message passing can solve con- Aligned Paxos is shown in Appendix E. sensus faster than shared memory in well-behaved executions.

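The resilience claim can be illustrated with a toy C sketch (not the paper's Aligned Paxos): processes and memories are treated as interchangeable agents, and a phase's "hear back" step completes as soon as any majority of the combined agent set has answered, regardless of how the answers split between message replies and memory acknowledgments.

    /* Toy sketch of counting a quorum over the combined agent set (illustrative). */
    #include <stdbool.h>
    #include <stdio.h>

    #define NPROC 3
    #define NMEM  2
    #define NAGENTS (NPROC + NMEM)

    typedef struct {
        bool replied[NAGENTS];       /* [0..NPROC-1]: processes, rest: memories */
    } phase_state;

    /* "hear back" step: true once a majority of ALL agents answered. */
    static bool majority_of_agents(const phase_state *st) {
        int count = 0;
        for (int a = 0; a < NAGENTS; a++)
            if (st->replied[a]) count++;
        return count >= NAGENTS / 2 + 1;
    }

    int main(void) {
        /* One process reply plus two memory acks = 3 of 5 agents: the phase
           completes even though a majority of processes did not answer. */
        phase_state st = { .replied = { true, false, false, true, true } };
        printf("phase complete: %s\n", majority_of_agents(&st) ? "yes" : "no");
        return 0;
    }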
6 DYNAMIC PERMISSIONS ARE 7 RDMA IN PRACTICE NECESSARY FOR EFFICIENT CONSENSUS Our model is meant to reflect capabilities of RDMA, while providing In §5.1, we showed how dynamic permissions can improve the a clean abstraction to reason about. We now give an overview of how performance of Disk Paxos. Are dynamic permissions necessary? RDMA works, and how features of our model can be implemented We prove that with shared memory (or disks) alone, one cannot using RDMA. achieve 2-deciding consensus, even if the memory never fails, it RDMA enables a remote process to access local memory directly has static permissions, processes may only fail by crashing, and the through the network interface card (NIC), without involving the system is partially synchronous in the sense that eventually there is CPU. For a piece of local memory to be accessible to a remote a known upper bound on the time it takes a correct process to take a process 푝, the CPU has to register that memory region and associate step [27]. This result applies a fortiori to the Disk Paxos model [29]. it with the appropriate connection (called Queue Pair) for 푝. The THEOREM 6.1. Consider a partially synchronous shared-memory association of a registered memory region and a queue pair is done model with registers, where registers can have arbitrary static per- indirectly through a protection domain: both memory regions and missions, memory never fails, and at most one processes may fail by queue pairs are associated with a protection domain, and a queue crashing. No consensus algorithm is 2-deciding. pair 푞 can be used to access a memory region 푟 if 푞 and 푟 and in the same protection domain. The CPU must also specify what access PROOF. Assume by contradiction that 퐴 is an algorithm in the level (read, write, read-write) is allowed to the memory region in stated model that is 2-deciding. That is, there is some execution 퐸 each protection domain. A local memory area can thus be registered of 퐴 in which some process 푝 decides a value 푣 with 2 delays. We and associated with several queue pairs, with the same or different denote by 푅 and 푊 the set of objects which 푝 reads and writes in access levels, by associating it with one or more protection domains. 퐸 respectively. Note that since 푝 decides in 2 delays in 퐸, 푅 and Each RDMA connection can be used by the remote server to access 푊 must be disjoint, by the definition of operation delay and the registered memory regions using a unique region-specific key created fact that a process has at most one outstanding operation per object. as a part of the registration process. Furthermore, 푝 must issue all of its read and writes without waiting As highlighted by previous work [45], failures of the CPU, NIC for the response of any operation. and DRAM can be seen as independent (e.g., arbitrary delays, too Consider an execution 퐸′ in which 푝 reads from the same set 푅 many bit errors, failed ECC checks, respectively). For instance, zom- of objects and writes the same values as in 퐸 to the same set 푊 of bie servers in which the CPU is blocked but RDMA requests can objects. All of the read operations that 푝 issues return by some time still be served account for roughly half of all failures [45]. This 푡0, but the write operations of 푝 are delayed for a long time. Another motivates our choice to treat processes and memory separately in ′ ′ process 푝 begins its proposal of a value 푣 ≠ 푣 after 푡0. Since no our model. 
As highlighted by previous work [45], failures of the CPU, NIC and DRAM can be seen as independent (e.g., arbitrary delays, too many bit errors, and failed ECC checks, respectively). For instance, zombie servers, in which the CPU is blocked but RDMA requests can still be served, account for roughly half of all failures [45]. This motivates our choice to treat processes and memory separately in our model.

In practice, if a CPU fails permanently, the memory will also become unreachable through RDMA eventually; however, in such cases memory may remain available long enough for ongoing operations to complete. Also, in practical settings it is possible for full-system crashes to occur (e.g., machine restarts), which correspond to a process and a memory failing at the same time; this is allowed by our model.

Memory regions in our model correspond to RDMA memory regions. Static permissions can be implemented by making the appropriate memory region registration before the execution of the algorithm; these permissions then persist during execution without CPU involvement. Dynamic permissions require the host CPU to change the access levels; this should be done in the OS kernel: the kernel creates regions and controls their permissions, and then shares memory with user-space processes. In this way, Byzantine processes cannot change permissions illegally. The assumption is that the kernel is not Byzantine. Alternatively, future hardware support similar to SGX could even allow parts of the kernel to be Byzantine.

Using RDMA, a process 푝 can grant permissions to a remote process 푞 by registering memory regions with the appropriate access permissions (read, write, or read/write) and sending the corresponding key to 푞. 푝 can revoke permissions dynamically by simply deregistering the memory region.
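A minimal sketch of this grant/revoke pattern, under the same assumptions as above (libibverbs, an existing protection domain, and an out-of-band channel for shipping keys), could look as follows; the function names are ours.

#include <infiniband/verbs.h>
#include <stddef.h>

/* Grant read/write access to a region: register it with remote-write
 * permission and send mr->rkey to the remote process q out of band.
 * (REMOTE_WRITE requires LOCAL_WRITE to be set as well.) */
static struct ibv_mr *grant_read_write(struct ibv_pd *pd,
                                       void *region, size_t len)
{
    return ibv_reg_mr(pd, region, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}

/* Revoke the permission dynamically: once the region is deregistered,
 * remote accesses that use its key fail at the NIC. */
static int revoke(struct ibv_mr *mr)
{
    return ibv_dereg_mr(mr);
}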
For our reliable broadcast algorithm, each process can register the two-dimensional array of values in read-only mode with a protection domain. All the queue pairs used by that process are also created in the context of the same protection domain. Additionally, the process can preserve write access permission to its row via another registration of just that row with the protection domain, thus enabling single-writer multiple-reader access. Thereafter the reliable broadcast algorithm can be implemented trivially by using RDMA reads and writes by all processes. Reliable broadcast with unreliable memories is similarly straightforward, since failure of the memory ensures that no process will be able to access the memory.

For Cheap Quorum, the static memory region registrations are straightforward as above. To revoke the leader's write permission, it suffices for a region's host process to deregister the memory region. Panic messages can be relayed using RDMA message sends.

In our crash-only consensus algorithm, we leverage the capability of registering overlapping memory regions in a protection domain. As in the above algorithms, each process uses one protection domain for RDMA accesses. Queue pairs for connections to all other processes are associated with this protection domain. The process's entire slot array is registered with the protection domain in read-only mode. In addition, the same slot array can be dynamically registered (and deregistered) in write mode based on incoming write permission requests: a proposer requests write permission using an RDMA message send. In response, the acceptor first deregisters write permission for the immediately previous proposer. The acceptor thereafter registers the slot array in write mode and responds to the proposer with the new key associated with the newly registered slot array. Reads of the slot array are performed by the proposer using RDMA reads. The subsequent second-phase RDMA write of the value can be performed on the slot array as long as the proposer continues to have write permission to the slot array. The RDMA write fails if the acceptor granted write permission to another proposer in the meantime.
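The acceptor-side permission hand-off described above can be sketched as follows (again an illustration under the same libibverbs assumptions; the struct and function names are ours, and the reply carrying the new key is sent over the ordinary message channel).

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

struct acceptor_state {
    struct ibv_pd *pd;         /* protection domain shared by all queue pairs */
    void          *slots;      /* slot array, also registered read-only       */
    size_t         slots_len;
    struct ibv_mr *writable;   /* current proposer's writable registration    */
};

/* Handle an incoming write-permission request from a proposer.
 * Returns the new rkey to send back to the proposer, or 0 on failure. */
uint32_t grant_write_to_proposer(struct acceptor_state *a)
{
    /* Deregister the previous proposer's writable region: its key becomes
     * invalid, so that proposer's later phase 2 RDMA writes will fail. */
    if (a->writable != NULL) {
        ibv_dereg_mr(a->writable);
        a->writable = NULL;
    }

    /* Register the same slot array again in write mode (overlapping the
     * read-only registration) and hand the fresh key to the proposer. */
    a->writable = ibv_reg_mr(a->pd, a->slots, a->slots_len,
                             IBV_ACCESS_LOCAL_WRITE |
                             IBV_ACCESS_REMOTE_READ |
                             IBV_ACCESS_REMOTE_WRITE);
    return (a->writable != NULL) ? a->writable->rkey : 0;
}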
Acknowledgements. We wish to thank the anonymous reviewers for their helpful comments on improving the paper. This work has been supported in part by the European Research Council (ERC) Grant 339539 (AOC), and a Microsoft PhD Fellowship.

REFERENCES
[1] Ittai Abraham, Gregory Chockler, Idit Keidar, and Dahlia Malkhi. Byzantine disk paxos: optimal resilience with byzantine shared memory. Distributed Computing (DIST), 18(5):387–408, 2006.
[2] Yehuda Afek, David S. Greenberg, Michael Merritt, and Gadi Taubenfeld. Computing with faulty shared memory. In ACM Symposium on Principles of Distributed Computing (PODC), pages 47–58, August 1992.
[3] Marcos K. Aguilera, Naama Ben-David, Irina Calciu, Rachid Guerraoui, Erez Petrank, and Sam Toueg. Passing messages while sharing memory. In ACM Symposium on Principles of Distributed Computing (PODC), pages 51–60. ACM, 2018.
[4] Noga Alon, Michael Merritt, Omer Reingold, Gadi Taubenfeld, and Rebecca N. Wright. Tight bounds for shared memory systems accessed by byzantine processes. Distributed Computing (DIST), 18(2):99–109, 2005.
[5] James Aspnes and Maurice Herlihy. Fast randomized consensus using shared memory. Journal of Algorithms, 11(3):441–461, 1990.
[6] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing systems. Journal of the ACM (JACM), 42(1):124–142, 1995.
[7] Pierre-Louis Aublin, Rachid Guerraoui, Nikola Knežević, Vivien Quéma, and Marko Vukolić. The next 700 BFT protocols. ACM Transactions on Computer Systems (TOCS), 32(4):12, 2015.
[8] Rida Bazzi and Gil Neiger. Optimally simulating crash failures in a byzantine environment. In International Workshop on Distributed Algorithms (WDAG), pages 108–128. Springer, 1991.
[9] Jonathan Behrens, Ken Birman, Sagar Jha, Matthew Milano, Edward Tremel, Eugene Bagdasaryan, Theo Gkountouvas, Weijia Song, and Robbert Van Renesse. Derecho: Group communication at the speed of light. Technical report, Cornell University, 2016.
[10] Michael Ben-Or. Another advantage of free choice (extended abstract): Completely asynchronous agreement protocols. In ACM Symposium on Principles of Distributed Computing (PODC), pages 27–30. ACM, 1983.
[11] Alysson Neves Bessani, Miguel Correia, Joni da Silva Fraga, and Lau Cheuk Lung. Sharing memory between byzantine processes using policy-enforced tuple spaces. IEEE Transactions on Parallel and Distributed Systems, 20(3):419–432, 2009.
[12] Romain Boichat, Partha Dutta, Svend Frolund, and Rachid Guerraoui. Reconstructing paxos. SIGACT News, 34(2):42–57, March 2003.
[13] Zohir Bouzid, Damien Imbs, and Michel Raynal. A necessary condition for byzantine k-set agreement. Information Processing Letters, 116(12):757–759, 2016.
[14] Gabriel Bracha. Asynchronous byzantine agreement protocols. Information and Computation, 75(2):130–143, 1987.
[15] Gabriel Bracha and Sam Toueg. Asynchronous consensus and broadcast protocols. Journal of the ACM (JACM), 32(4):824–840, 1985.
[16] Francisco Brasileiro, Fabíola Greve, Achour Mostéfaoui, and Michel Raynal. Consensus in one communication step. In International Conference on Parallel Computing Technologies, pages 42–50. Springer, 2001.
[17] Christian Cachin, Rachid Guerraoui, and Luís E. T. Rodrigues. Introduction to Reliable and Secure Distributed Programming (2. ed.). Springer, 2011.
[18] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM (JACM), 43(2):225–267, 1996.
[19] Byung-Gon Chun, Petros Maniatis, and Scott Shenker. Diverse replication for single-machine byzantine-fault tolerance. In USENIX Annual Technical Conference (ATC), pages 287–292, 2008.
[20] Byung-Gon Chun, Petros Maniatis, Scott Shenker, and John Kubiatowicz. Attested append-only memory: making adversaries stick to their word. In ACM Symposium on Operating Systems Principles (SOSP), pages 189–204, 2007.
[21] Allen Clement, Flavio Junqueira, Aniket Kate, and Rodrigo Rodrigues. On the (limited) power of non-equivocation. In ACM Symposium on Principles of Distributed Computing (PODC), pages 301–308. ACM, 2012.
[22] Miguel Correia, Nuno Ferreira Neves, and Paulo Veríssimo. How to tolerate half less one byzantine nodes in practical distributed systems. In International Symposium on Reliable Distributed Systems (SRDS), pages 174–183, 2004.
[23] Miguel Correia, Giuliana S. Veronese, and Lau Cheuk Lung. Asynchronous byzantine consensus with 2f+1 processes. In ACM Symposium on Applied Computing (SAC), pages 475–480. ACM, 2010.
[24] Dan Dobre and Neeraj Suri. One-step consensus with zero-degradation. In International Conference on Dependable Systems and Networks (DSN), pages 137–146. IEEE Computer Society, 2006.
[25] Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 401–414, 2014.
[26] Partha Dutta, Rachid Guerraoui, and Idit Keidar. How fast can eventual synchrony lead to consensus? In International Conference on Dependable Systems and Networks (DSN), pages 22–27. IEEE, 2005.
[27] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM (JACM), 35(2):288–323, 1988.
[28] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 32(2):374–382, 1985.
[29] Eli Gafni and Leslie Lamport. Disk paxos. Distributed Computing (DIST), 16(1):1–20, 2003.
[30] Prasad Jayanti, Tushar Deepak Chandra, and Sam Toueg. Fault-tolerant wait-free shared objects. Journal of the ACM (JACM), 45(3):451–500, May 1998.
[31] Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using RDMA efficiently for key-value services. ACM SIGCOMM Computer Communication Review, 44(4):295–306, 2015.
[32] Anuj Kalia, Michael Kaminsky, and David G. Andersen. Design guidelines for high performance RDMA systems. In USENIX Annual Technical Conference (ATC), page 437, 2016.
[33] Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), volume 16, pages 185–201, 2016.
[34] Rüdiger Kapitza, Johannes Behl, Christian Cachin, Tobias Distler, Simon Kuhnle, Seyed Vahid Mohammadi, Wolfgang Schröder-Preikschat, and Klaus Stengel. CheapBFT: resource-efficient byzantine fault tolerance. In European Conference on Computer Systems (EuroSys), pages 295–308, 2012.
[35] Idit Keidar and Sergio Rajsbaum. On the cost of fault-tolerant consensus when there are no faults: preliminary version. ACM SIGACT News, 32(2):45–63, 2001.
[36] Klaus Kursawe. Optimistic byzantine agreement. In International Symposium on Reliable Distributed Systems (SRDS), pages 262–267, October 2002.
[37] Leslie Lamport. The weak byzantine generals problem. Journal of the ACM (JACM), 30(3):668–676, July 1983.
[38] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, 1998.
[39] Leslie Lamport. Fast paxos. Distributed Computing (DIST), 19(2):79–103, 2006.
[40] Leslie Lamport, Robert Shostak, and Marshall Pease. The byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982.
[41] Dahlia Malkhi, Michael Merritt, Michael K. Reiter, and Gadi Taubenfeld. Objects shared by byzantine processes. Distributed Computing (DIST), 16(1):37–48, 2003.
[42] Jean-Philippe Martin and Lorenzo Alvisi. Fast byzantine consensus. IEEE Transactions on Dependable and Secure Computing (TDSC), 3(3):202–215, July 2006.
[43] Gil Neiger and Sam Toueg. Automatically increasing the fault-tolerance of distributed algorithms. Journal of Algorithms, 11(3):374–419, 1990.
[44] Marshall Pease, Robert Shostak, and Leslie Lamport. Reaching agreement in the presence of faults. Journal of the ACM (JACM), 27(2):228–234, 1980.
[45] Marius Poke and Torsten Hoefler. DARE: High-performance state machine replication on RDMA networks. In Symposium on High-Performance Parallel and Distributed Computing (HPDC), pages 107–118. ACM, 2015.
[46] Signe Rüsch, Ines Messadi, and Rüdiger Kapitza. Towards low-latency byzantine agreement protocols using RDMA. In IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), pages 146–151. IEEE, 2018.
[47] Yee Jiun Song and Robbert van Renesse. Bosco: One-step byzantine asynchronous consensus. In International Symposium on Distributed Computing (DISC), pages 438–450, September 2008.
[48] Giuliana Santos Veronese, Miguel Correia, Alysson Neves Bessani, Lau Cheuk Lung, and Paulo Veríssimo. Efficient byzantine fault-tolerance. IEEE Transactions on Computers, 62(1):16–30, 2013.
[49] Cheng Wang, Jianyu Jiang, Xusheng Chen, Ning Yi, and Heming Cui. APUS: Fast and scalable paxos on RDMA. In Symposium on Cloud Computing (SoCC), pages 94–107. ACM, 2017.
A CORRECTNESS OF RELIABLE BROADCAST

OBSERVATION 1. In Algorithm 2, if 푝 is a correct process, then no slot that belongs to 푝 is written to more than once.

PROOF. Since 푝 is correct, 푝 never writes on any slot more than once. Furthermore, since all slots are single-writer registers, no other process can write on these slots. □

OBSERVATION 2. In Algorithm 2, correct processes invoke and return from try_deliver(푞) infinitely often, for all 푞 ∈ Π.

PROOF. The try_deliver() function does not contain any blocking steps, loops or goto statements. Thus, if a correct process invokes try_deliver(), it will eventually return. Therefore, for a fixed 푞, the infinite loop at line 24 will invoke and return from try_deliver(푞) infinitely often. Since the parallel for loop at line 22 performs the infinite loop in parallel for each 푞 ∈ Π, the Observation holds. □

PROOF OF LEMMA 4.1. We prove the lemma by showing that Algorithm 2 correctly implements reliable broadcast. That is, we need to show that Algorithm 2 satisfies the four properties of reliable broadcast.

Property 1. Let 푝 be a correct process that broadcasts (푘,푚). We show that all correct processes eventually deliver (푘,푚) from 푝. Assume by contradiction that there exists some correct process 푞 which does not deliver (푘,푚). Furthermore, assume without loss of generality that 푘 is the smallest key for which Property 1 is broken. That is, all correct processes must eventually deliver all messages (푘′,푚′) from 푝, for 푘′ < 푘. Thus, all correct processes must eventually increment 푙푎푠푡[푝] to 푘.

We consider two cases, depending on whether or not some process eventually writes an L2 proof for some (푘,푚′) message from 푝 in its L2Proof slot.

First consider the case where no process ever writes an L2 proof of any value (푘,푚′) from 푝. Since 푝 is correct, upon broadcasting (푘,푚), 푝 must sign and write (푘,푚) into 푉푎푙푢푒푠[푝,푘,푝] at line 12. By Observation 1, (푘,푚) will remain in that slot forever. Because of this, and because there is no L2 proof, all correct processes, after reaching 푙푎푠푡[푝] = 푘, will eventually read (푘,푚) in line 37, write it into their own Value slot in line 40 and change their state to WaitForL1Proof.

Furthermore, since 푝 is correct and we assume signatures are unforgeable, no process 푞 can write any other valid value (푘′,푚′) ≠ (푘,푚) into its 푉푎푙푢푒푠[푞,푘,푝] slot. Thus, eventually each correct process 푞 will add at least 푓 + 1 copies of (푘,푚) to its checkedVals, write an L1 proof consisting of these values into 퐿1푃푟표표푓[푞,푘,푝] in line 52, and change its state to WaitForL2Proof.

Therefore, all correct processes will eventually read at least 푓 + 1 valid L1 proofs for (푘,푚) in line 58 and construct and write valid L2 proofs for (푘,푚). This contradicts the assumption that no L2 proof ever gets written.

In the case where there is some L2 proof, by the argument above, the only value it can prove is (푘,푚). Therefore, all correct processes will see at least one valid L2 proof at deliver. This contradicts our assumption that 푞 is correct but does not deliver (푘,푚) from 푝.
Property 2. We now prove the second property of reliable broadcast. Let 푝 and 푝′ be any two correct processes, and 푞 be some process, such that 푝 delivers (푘,푚) from 푞 and 푝′ delivers (푘,푚′) from 푞. Assume by contradiction that 푚 ≠ 푚′.

Since 푝 and 푝′ are correct, they must have seen valid L2 proofs at line 16 before delivering (푘,푚) and (푘,푚′) respectively. Let P and P′ be those valid proofs for (푘,푚) and (푘,푚′) respectively. P (resp. P′) consists of at least 푓 + 1 valid L1 proofs; therefore, at least one of those proofs was created by some correct process 푟 (resp. 푟′). Since 푟 (resp. 푟′) is correct, it must have written (푘,푚) (resp. (푘,푚′)) to its Values slot in line 40. Note that after copying a value to their slot, in the WaitForL1Proof state, correct processes read all Value slots in line 46. Thus, both 푟 and 푟′ read all Value slots before compiling their L1 proof for (푘,푚) (resp. (푘,푚′)).

Assume without loss of generality that 푟 wrote (푘,푚) before 푟′ wrote (푘,푚′); by Observation 1, it must then be the case that 푟′ later saw both (푘,푚) and (푘,푚′) when it read all Values slots (line 46). Since 푟′ is correct, it cannot have then compiled an L1 proof for (푘,푚′) (the check at line 50 failed). We have reached a contradiction.
Property 3. We show that if a correct process 푝 delivers (푘,푚) from a correct process 푝′, then 푝′ broadcast (푘,푚). Correct processes only deliver values for which a valid L2 proof exists (lines 16–30). Therefore, 푝 must have seen a valid L2 proof P for (푘,푚). P consists of at least 푓 + 1 L1 proofs for (푘,푚) and each L1 proof consists of at least 푓 + 1 matching copies of (푘,푚), signed by 푝′. Since 푝′ is correct and we assume signatures are unforgeable, 푝′ must have broadcast (푘,푚) (otherwise 푝′ would not have attached its signature to (푘,푚)).

Property 4. Let 푝 be a correct process such that 푝 delivers (푘,푚) from 푞. We show that all correct processes must deliver (푘,푚′) from 푞, for some 푚′.

By construction of the algorithm, if 푝 delivers (푘,푚) from 푞, then for all 푖 < 푘 there exists 푚푖 such that 푝 delivered (푖,푚푖) from 푞 before delivering (푘,푚) (this is because 푝 can only deliver (푘,푚) if 푙푎푠푡[푞] = 푘, and 푙푎푠푡[푞] is only incremented to 푘 after 푝 delivers (푘−1,푚푘−1)).

Assume by contradiction that there exists some correct process 푟 which does not deliver (푘,푚′) from 푞, for any 푚′. Further assume without loss of generality that 푘 is the smallest key for which 푟 does not deliver any message from 푞. Thus, 푟 must have delivered (푖,푚′푖) from 푞 for all 푖 < 푘; thus, 푟 must have incremented 푙푎푠푡[푞] to 푘. Since 푟 never delivers any message from 푞 for key 푘, 푟's 푙푎푠푡[푞] will never increase past 푘.

Since 푝 delivers (푘,푚) from 푞, 푝 must have written a valid L2 proof P of (푘,푚) in its L2Proof slot in line 18. By Observation 1, P will remain in 푝's L2Proof[푝,푘,푞] slot forever. Thus, at least one of the slots 퐿2푃푟표표푓[·,푘,푞] will forever contain a valid L2 proof. Since 푟's 푙푎푠푡[푞] eventually reaches 푘 and never increases past 푘, 푟 will eventually (by Observation 2) see a valid L2 proof in line 16 and deliver a message for key 푘 from 푞. We have reached a contradiction. □
B CORRECTNESS OF CHEAP QUORUM

We prove that Cheap Quorum satisfies certain useful properties that will help us show that it composes with Preferential Paxos to form a correct weak Byzantine agreement protocol. For the proofs, we first formalize some terminology. We say that a process 푝 proposed a value 푣 by time 푡 if it successfully executes line 4; that is, 푝 receives the response 푎푐푘 in line 4 by 푡. When a process aborts, note that it outputs a tuple. We say that the first element of its tuple is its abort value, and the second is its abort proof. We sometimes say that a process 푝 aborts with value 푣 and proof 푝푟, meaning that 푝 outputs (푣,푝푟) in its abort. Furthermore, the value in a process 푝's Proof region is called a correct unanimity proof if it contains 푛 copies of the same value, each correctly signed by a different process.

OBSERVATION 3. In Cheap Quorum, no value written by a correct process is ever overwritten.

PROOF. By inspecting the code, we can see that the correct behavior is for processes to never overwrite any values. Furthermore, since all regions are initially single-writer, and the legalChange function never allows another process to acquire write permission on a region that it cannot write to initially, no other process can overwrite these values. □

LEMMA B.1 (CHEAP QUORUM VALIDITY). In Cheap Quorum, if there are no faulty processes and some process decides 푣, then 푣 is the input of some process.

PROOF. If 푝 = 푝1, the lemma is trivially true, because 푝1 can only decide on its input value. If 푝 ≠ 푝1, 푝 can only decide on a value 푣 if it read that value from the leader's region. Since only the leader can write to its region, it follows that 푝 can only decide on a value that was proposed by the leader (푝1). □

LEMMA B.2 (CHEAP QUORUM TERMINATION). If a correct process 푝 proposes some value, every correct process 푞 will decide a value or abort.

PROOF. Clearly, if 푞 = 푝1 proposes a value, then 푞 decides. Now let 푞 ≠ 푝1 be a correct follower and assume 푝1 is a correct leader that proposes 푣. Since 푝1 proposed 푣, 푝1 was able to write 푣 in the leader region, where 푣 remains forever by Observation 3. Clearly, if 푞 eventually enters panic mode, then it eventually aborts; there is no waiting done in panic mode. If 푞 never enters panic mode, then 푞 eventually sees 푣 on the leader region and eventually finds 2푓 + 1 copies of 푣 on the regions of other followers (otherwise 푞 would enter panic mode). Thus 푞 eventually decides 푣. □

LEMMA B.3 (CHEAP QUORUM PROGRESS). If the system is synchronous and all processes are correct, then no correct process aborts in Cheap Quorum.

PROOF. Assume the contrary: there exists an execution in which the system is synchronous and all processes are correct, yet some process aborts. Processes can only abort after entering panic mode, so let 푡 be the first time when a process enters panic mode and let 푝 be that process. Since 푝 cannot have seen any other process declare panic, 푝 must have either timed out at line 12 or 22, or its checks failed on line 13. However, since the entire system is synchronous and 푝 is correct, 푝 could not have panicked because of a time-out at line 12. So, 푝1 must have written its value 푣, correctly signed, to 푝1's region at a time 푡′ < 푡. Therefore, 푝 also could not have panicked by failing its checks on line 13. Finally, since all processes are correct and the system is synchronous, all processes must have seen 푝1's value and copied it to their slot. Thus, 푝 must have seen these values and decided on 푣 at line 20, contradicting the assumption that 푝 entered panic mode. □

LEMMA B.4 (LEMMA 4.5: CHEAP QUORUM DECISION AGREEMENT). Let 푝 and 푞 be correct processes. If 푝 decides 푣1 while 푞 decides 푣2, then 푣1 = 푣2.

PROOF. Assume the property does not hold: 푝 decided some value 푣1 and 푞 decided some different value 푣2. Since 푝 decided 푣1, 푝 must have seen a copy of 푣1 at 2푓푃 + 1 replicas, including 푞. But then 푞 cannot have decided 푣2, because by Observation 3, 푣1 never gets overwritten from 푞's region, and by the code, 푞 can only decide a value written in its region. □
LEMMA B.5 (LEMMA 4.6: CHEAP QUORUM ABORT AGREEMENT). Let 푝 and 푞 be correct processes (possibly identical). If 푝 decides 푣 in Cheap Quorum while 푞 aborts from Cheap Quorum, then 푣 will be 푞's abort value. Furthermore, if 푝 is a follower, 푞's abort proof is a correct unanimity proof.

PROOF. If 푝 = 푞, the property follows immediately, because of lines 4 through 6 of panic mode. If 푝 ≠ 푞, we consider two cases:

• If 푝 is a follower, then for 푝 to decide, all processes, and in particular 푞, must have replicated both 푣 and a correct proof of unanimity before 푝 decided. Therefore, by Observation 3, 푣 and the unanimity proof are still there when 푞 executes the panic code in lines 4 through 6. Therefore 푞 will abort with 푣 as its value and a correct unanimity proof as its abort proof.

• If 푝 is the leader, then first note that since 푝 is correct, by Observation 3, 푣 remains the value written in the leader's Value region. There are two cases. If 푞 has replicated a value into its Value region, then it must have read it from 푉푎푙푢푒[푝1], and therefore it must be 푣. Again by Observation 3, 푣 must still be the value written in 푞's Value region when 푞 executes the panic code. Therefore 푞 aborts with value 푣. Otherwise, if 푞 has not replicated a value, then 푞's Value region must be empty at the time of the panic, since the legalChange function disallows other processes from writing on that region. Therefore 푞 reads 푣 from 푉푎푙푢푒[푝1] and aborts with 푣. □

LEMMA B.6. Cheap Quorum is 2-deciding.

PROOF. Consider an execution in which every process is correct and the system is synchronous. Then no process will enter panic mode (by Lemma B.3) and thus 푝1 will not have its permission revoked. 푝1 will therefore be able to write its input value to 푝1's region and decide after this single write (2 delays). □

C CORRECTNESS OF THE FAST & ROBUST

The following is the pseudocode of Preferential Paxos. Recall that T-send and T-receive are the trusted message passing primitives that are implemented in [21] using non-equivocation and signatures.
Algorithm 8: Preferential Paxos—code for process 푝
1  propose((v, priorityTag)){
2    T-send(v, priorityTag) to all;
3    Wait to T-receive (val, priorityTag) from 푛 − 푓푃 processes;
4    best = value with highest priority out of messages received;
5    RobustBackup(Paxos).propose(best); }

LEMMA C.1 (LEMMA 4.7: PREFERENTIAL PAXOS PRIORITY DECISION). Preferential Paxos implements weak Byzantine agreement with 푛 ≥ 2푓푃 + 1 processes. Furthermore, let 푣1, . . . , 푣푛 be the input values of an instance 퐶 of Preferential Paxos, ordered by priority. The decision value of correct processes is always one of 푣1, . . . , 푣푓푃+1.

PROOF. By Lemma 4.3, Robust Backup(Paxos) solves weak Byzantine agreement with 푛 ≥ 2푓푃 + 1 processes. Note that before calling Robust Backup(Paxos), each process may change its input, but only to the input of another process. Thus, by the correctness and fault tolerance of Paxos, Preferential Paxos clearly solves weak Byzantine agreement with 푛 ≥ 2푓푃 + 1 processes. Thus we only need to show that Preferential Paxos satisfies the priority decision property with 2푓푃 + 1 processes that may only fail by crashing.

Since Robust Backup(Paxos) satisfies validity, if all processes call Robust Backup(Paxos) in line 5 with a value 푣 that is one of the 푓푃 + 1 top priority values (that is, 푣 ∈ {푣1, . . . , 푣푓푃+1}), then the decision of correct processes will also be in {푣1, . . . , 푣푓푃+1}. So we just need to show that every process indeed adopts one of the top 푓푃 + 1 values. Note that each process 푝 waits to see 푛 − 푓푃 values, and then picks the highest priority value that it saw. No process can lie or pick a different value, since we use T-send and T-receive throughout. Thus, 푝 can miss at most 푓푃 values that are higher priority than the one that it adopts. □

We now prove the following key composition property that shows that the composition of Cheap Quorum and Preferential Paxos is safe.

LEMMA C.2 (LEMMA 4.8: COMPOSITION LEMMA). If some correct process decides a value 푣 in Cheap Quorum before an abort, then 푣 is the only value that can be decided in Preferential Paxos with priorities as defined in Definition 3.

PROOF. To prove this lemma, we mainly rely on two properties: the Cheap Quorum Abort Agreement (Lemma 4.6) and Preferential Paxos Priority Decision (Lemma 4.7). We consider two cases.

Case 1. Some correct follower process 푝 ≠ 푝1 decided 푣 in Cheap Quorum. Then note that by Lemma 4.6, all correct processes aborted with value 푣 and a correct unanimity proof. Since 푛 ≥ 2푓 + 1, there are at least 푓 + 1 correct processes. Note that by the way we assign priorities to inputs of Preferential Paxos in the composition of the two algorithms, all correct processes have inputs with the highest priority. Therefore, by Lemma 4.7, the only decision value possible in Preferential Paxos is 푣. Furthermore, note that by Lemma 4.5, if any other correct process decided in Cheap Quorum, that process's decision value was also 푣.

Case 2. Only the leader, 푝1, decides in Cheap Quorum, and 푝1 is correct. Then by Lemma 4.6, all correct processes aborted with value 푣. Since 푝1 is correct, 푣 is signed by 푝1. It is possible that some of the processes also had a correct unanimity proof as their abort proof. However, note that in this scenario, all correct processes (at least 푓 + 1 processes) had inputs with either the highest or second highest priorities, all with the same abort value. Therefore, by Lemma 4.7, the decision value must have been the value of one of these inputs. Since all these inputs had the same value 푣, 푣 must be the decision value of Preferential Paxos. □

THEOREM C.3 (END-TO-END VALIDITY). In the Fast & Robust algorithm, if there are no faulty processes and some process decides 푣, then 푣 is the input of some process.

PROOF. Note that by Lemmas B.1 and 4.7, this holds for each of the algorithms individually. Furthermore, recall that the abort values of Cheap Quorum become the input values of Preferential Paxos, and the set-up phase does not invent new values. Therefore, we just have to show that if Cheap Quorum aborts, then all abort values are inputs of some process. Note that by the code in panic mode, if Cheap Quorum aborts, a process 푝 can output an abort value from one of three sources: its own Value region, the leader's Value region, or its own input value. Clearly, if its abort value is its input, then we are done. Furthermore, note that a correct leader only writes its input in the Value region, and correct followers only write a copy of the leader's Value region in their own region. Since there are no faults, this means that only the input of the leader may be written in any Value region, and therefore all processes always abort with some process's input as their abort value. □

THEOREM C.4 (END-TO-END AGREEMENT). In the Fast & Robust algorithm, if 푝 and 푞 are correct processes such that 푝 decides 푣1 and 푞 decides 푣2, then 푣1 = 푣2.

PROOF. First note that by Lemmas 4.5 and 4.7, each of the algorithms satisfies this individually. Thus Lemma 4.8 implies the theorem. □

THEOREM C.5 (END-TO-END TERMINATION). In the Fast & Robust algorithm, if some correct process is eventually the sole leader forever, then every correct process eventually decides.

PROOF. Assume that some correct process 푝 is eventually the sole leader forever, and let 푡 be the time when 푝 last becomes leader. Now consider some process 푞 that has not decided before 푡. We consider several cases:

(1) If 푞 is executing Preferential Paxos at time 푡, then 푞 will eventually decide, by termination of Preferential Paxos (Lemma 4.7).

(2) If 푞 is executing Cheap Quorum at time 푡, we distinguish two sub-cases:

(a) 푝 is also executing as the leader of Cheap Quorum at time 푡. Then 푝 will eventually propose a value, so 푞 will either decide in Cheap Quorum or abort from Cheap Quorum (by Lemma B.2) and decide in Preferential Paxos by Lemma 4.7.

(b) 푝 is executing in Preferential Paxos. Then 푝 must have panicked and aborted from Cheap Quorum. Thus, 푞 will also abort from Cheap Quorum and decide in Preferential Paxos by Lemma 4.7. □

Note that to strengthen C.5 to general termination as stated in our model, we require the additional standard assumption [38] that some correct process 푝 is eventually the sole leader forever. In practice, however, 푝 does not need to be the sole leader forever, but rather long enough so that all correct processes decide.
D CORRECTNESS OF PROTECTED MEMORY PAXOS

In this section, we present the proof of Theorem 5.1. We do so by showing that Algorithm 7 satisfies all of the properties in the theorem.

We first show that Algorithm 7 correctly implements consensus, starting with validity. Intuitively, validity is preserved because each process that writes any value in a slot either writes its own value, or adopts a value that was previously written in a slot. We show that every value written in any slot must have been the input of some process.

THEOREM D.1 (VALIDITY). In Algorithm 7, if a process 푝 decides a value 푣, then 푣 was the input to some process.

PROOF. Assume by contradiction that some process 푝 decides a value 푣 and 푣 is not the input of any process. Since 푣 is not the input value of 푝, 푝 must have adopted 푣 by reading it from some process 푝′ at line 23. Note also that a process cannot adopt the initial value ⊥, and thus 푣 must have been written in 푝′'s memory by some other process. Thus we can define a sequence of processes 푠1, 푠2, . . . , 푠푘, where 푠푖 adopts 푣 from the location where it was written by 푠푖+1 and 푠1 = 푝. This sequence is necessarily finite since there have been a finite number of steps taken up to the point when 푝 decided 푣. Therefore, there must be a last element of the sequence, 푠푘, who wrote 푣 in line 35 without having adopted 푣. This implies 푣 was 푠푘's input value, a contradiction. □

We now focus on agreement.

THEOREM D.2 (AGREEMENT). In Algorithm 7, for any processes 푝 and 푞, if 푝 and 푞 decide values 푣푝 and 푣푞 respectively, then 푣푝 = 푣푞.

Before showing the proof of the theorem, we first introduce the following useful lemmas.

LEMMA D.3. The values a leader accesses on remote memory cannot change between when it reads them and when it writes them.

PROOF. Recall that each memory only allows write-access to the most recent process that acquired it. In particular, that means that each memory only gives access to one process at a time. Note that the only place at which a process acquires write-permissions on a memory is at the very beginning of its run, before reading the values written on the memory. In particular, for each memory 푚, a process does not issue a read on 푚 before its permission request on 푚 successfully completes. Therefore, if a process 푝 succeeds in writing on memory 푚, then no other process could have acquired 푚's write-permission after 푝 did, and therefore, no other process could have changed the values written on 푚 after 푝's read of 푚. □

LEMMA D.4. If a leader writes values 푣푖 and 푣푗 at line 35 with the same proposal number to memories 푖 and 푗, respectively, then 푣푖 = 푣푗.

PROOF. Assume by contradiction that a leader 푝 writes different values 푣1 ≠ 푣2 with the same proposal number. Since each thread of 푝 executes the phase 2 write (line 35) at most once per proposal number, it must be that different threads 푇1 and 푇2 of 푝 wrote 푣1 and 푣2, respectively. If 푝 does not perform phase 1 (i.e., if 푝 = 푝1 and this is 푝's first attempt), then it is impossible for 푇1 and 푇2 to write different values, since CurrentVal was set to 푣 at line 14 and was not changed afterwards. Otherwise (if 푝 performs phase 1), let 푡1 and 푡2 be the times when 푇1 and 푇2 executed line 8, respectively (푇1 and 푇2 must have done so, since we assume that they both reached the phase 2 write at line 35). Assume wlog that 푡1 ≤ 푡2. Due to the check and abort at line 29, CurrentVal cannot change after 푡1 while keeping the same proposal number. Thus, when 푇1 and 푇2 perform their phase 2 writes (after 푡1), CurrentVal has the same value as it did at 푡1; it is therefore impossible for 푇1 and 푇2 to write different values. We have reached a contradiction. □
LEMMA D.5. If a process 푝 performs phase 1 and then writes to some memory 푚 with proposal number 푏 at line 35, then 푝 must have written 푏 to 푚 at line 21 and read from 푚 at line 23.

PROOF. Let 푇 be the thread of 푝 which writes to 푚 at line 35. If phase 1 is performed (i.e., the condition at line 19 is satisfied), then by construction 푇 cannot reach line 35 without first performing lines 21 and 23. Since 푇 only communicates with 푚, it must be that lines 21 and 23 are performed on 푚. □

PROOF OF THEOREM D.2. Assume by contradiction that 푣푝 ≠ 푣푞. Let 푏푝 and 푏푞 be the proposal numbers with which 푣푝 and 푣푞 are decided, respectively. Let 푊푝 (resp. 푊푞) be the set of memories to which 푝 (resp. 푞) successfully wrote in the phase 2 write of line 35 before deciding 푣푝 (resp. 푣푞). Since 푊푝 and 푊푞 are both majorities, their intersection must be non-empty. Let 푚 be any memory in 푊푝 ∩ 푊푞.

We first consider the case in which one of the processes did not perform phase 1 before deciding (i.e., one of the processes is 푝1 and it decided on its first attempt). Let that process be 푝 wlog. Further assume wlog that 푞 is the first process to enter phase 2 with a value different from 푣푝. 푝's phase 2 write on 푚 must have preceded 푞 obtaining permissions from 푚 (otherwise, 푝's write would have failed due to lack of permissions). Thus, 푞 must have seen 푝's value during its read on 푚 at line 23, and thus 푞 cannot have adopted its own value. Since 푞 is the first process to enter phase 2 with a value different from 푣푝, 푞 cannot have adopted any other value than 푣푝, so 푞 must have adopted 푣푝. Contradiction.

We now consider the remaining case: both 푝 and 푞 performed phase 1 before deciding. We assume wlog that 푏푝 < 푏푞 and that 푏푞 is the smallest proposal number larger than 푏푝 for which some process enters phase 2 with CurrentVal ≠ 푣푝.

Since 푏푝 < 푏푞, 푝's read at 푚 must have preceded 푞's phase 1 write at 푚 (otherwise 푝 would have seen 푞's larger proposal number and aborted). This implies that 푝's phase 2 write at 푚 must have preceded 푞's phase 1 write at 푚 (by Lemma D.3). Thus 푞 must have seen 푣푝 during its read and cannot have adopted its own input value. However, 푞 cannot have adopted 푣푝, so 푞 must have adopted 푣푞 from some other slot 푠푙 that 푞 saw during its read. It must be that 푠푙.푚푖푛푃푟표푝표푠푎푙 < 푏푞, otherwise 푞 would have aborted. Since 푠푙.푚푖푛푃푟표푝표푠푎푙 ≥ 푠푙.푎푐푐푃푟표푝표푠푎푙 for any slot, it follows that 푠푙.푎푐푐푃푟표푝표푠푎푙 < 푏푞. If 푠푙.푎푐푐푃푟표푝표푠푎푙 < 푏푝, 푞 cannot have adopted 푠푙.푣푎푙푢푒 in line 30 (it would have adopted 푣푝 instead). Thus it must be that 푏푝 ≤ 푠푙.푎푐푐푃푟표푝표푠푎푙 < 푏푞; however, this is impossible because we assumed that 푏푞 is the smallest proposal number larger than 푏푝 for which some process enters phase 2 with CurrentVal ≠ 푣푝. We have reached a contradiction. □
Finally, we prove that the termination property holds.

THEOREM D.6 (TERMINATION). Eventually, all correct processes decide.

LEMMA D.7. If a correct process 푝 is executing the for loop at lines 18–37, then 푝 will eventually exit from the loop.

PROOF. The threads of the for loop perform the following potentially blocking steps: obtaining permission (line 20), writing (lines 21 and 35), reading (line 23), and waiting for other threads (the barrier at line 7 and the exit condition at line 37). By our assumption that a majority of memories are correct, a majority of the threads of the for loop must eventually obtain permission in line 20 and invoke the write at line 21. If one of these writes fails due to lack of permission, the loop aborts and we are done. Otherwise, a majority of threads will perform the read at line 23. If some thread aborts at lines 22 or 26, then the loop aborts and we are done. Otherwise, a majority of threads must add themselves to ListOfReady, pass the barrier at line 7 and invoke the write at line 35. If some such write fails, the loop aborts; otherwise, a majority of threads will reach the check at line 37 and thus the loop will exit. □

PROOF OF THEOREM D.6. The Ω failure detector guarantees that eventually, all processes trust the same correct process 푝. Let 푡 be the time after which all correct processes trust 푝 forever. By Lemma D.7, at some time 푡′ ≥ 푡, all correct processes except 푝 will be blocked at line 12. Therefore, the minProposal values on all memories, in all slots except those of 푝, stop increasing. Thus, eventually, 푝 picks a propNr that is larger than all others written on any memory, and stops restarting at line 26. Furthermore, since no process other than 푝 is executing any steps of the algorithm, and in particular, no process other than 푝 ever acquires any memory after time 푡′, 푝 never loses its permission on any of the memories. So, all writes executed by 푝 on any correct memory must return 푎푐푘. Therefore, 푝 will decide and broadcast its decision to all. All correct processes will receive 푝's decision and decide as well. □

To complete the proof of Theorem 5.1, we now show that Algorithm 7 is 2-deciding.

THEOREM D.8. Algorithm 7 is 2-deciding.

PROOF. Consider an execution in which 푝1 is timely, and no process's failure detector ever suspects 푝1. Then, since no process other than 푝1 considers itself the leader, and processes do not deviate from their protocols, no process calls changePermission on any memory. Furthermore, 푝1 does not perform phase 1 (lines 19–33), since it is 푝1's first attempt. Thus, since 푝1 initially has write permission on all memories, all of 푝1's phase 2 writes succeed. Therefore, 푝1 terminates, deciding its own proposed value 푣, after one write per memory. □

E PSEUDOCODE FOR THE ALIGNED PAXOS

Algorithm 9: Aligned Paxos
1   A = (P, M) is the set of acceptors
2   propose(v){
3     resps = [] // prepare empty responses list
4     choose propNr bigger than any seen before
5     for all a in A{
6       cflag = communicate1(a, propNr)
7       resp = hearback1(a)
8       if (cflag){ resps.append((a, resp)) } }
9     wait until resps has responses from a majority of A
10    next = analyze1(resps)
11    if (next == RESTART) restart;
12    resps = []
13    for all a in A{
14      cflag = communicate2(a, next)
15      resp = hearback2(a)
16      if (cflag){ resps.append((a, resp)) } }
17    wait until resps has responses from a majority of A
18    next = analyze2(next, resps)
19    if (next == RESTART) restart;
20    decide next; }
Algorithm 10: Communicate Phase 1—code for process 푝
1   bool communicate1(agent a, value propNr){
2     if (a is memory){
3       changePermission(a, {R: Π−{p}, W: ∅, RW: {p}}); // acquire write permission
4       return write(a[p], {propNr, -, -}) }
5     else{
6       send prepare(propNr) to a
7       return true } }

Algorithm 11: Hear Back Phase 1—code for process 푝
1   value hearback1(agent a){
2     if (a is memory){
3       for all processes q{
4         localInfo[q] = read(a[q]) }
5       return localInfo }
6     else{
7       return value received from a } }

Algorithm 12: Analyze Phase 1—code for process 푝
1   // responses is a list of (agent, response) pairs
2   value analyze1(responses resps){
3     for resp in resps {
4       if (resp.agent is memory){
5         for all slots s in resp.response {
6           if (s.minProposal > propNr) return RESTART }
7         (v, accProposal) = value and accProposal of slot with highest accProposal that had a value } }
8     return v where (v, accProposal) is the highest accProposal seen in resps.response of all agents }

Algorithm 13: Communicate Phase 2—code for process 푝
1   bool communicate2(agent a, value msg){
2     if (a is memory){
3       return write(a[p], {msg.propNr, msg.propNr, msg.val}) }
4     else{
5       send accepted(msg.propNr, msg.val) to a
6       return true } }

Algorithm 14: Hear Back Phase 2—code for process 푝
1   value hearback2(agent a){
2     if (a is memory){
3       return ack }
4     else{
5       return value received from a } }

Algorithm 15: Analyze Phase 2—code for process 푝
1   value analyze2(value v, responses resps){
2     if there are at least |A|/2 + 1 resps such that resp.response == ack{
3       return v }
4     return RESTART }