Efficient Replication via Timestamp Stability

Vitor Enes (INESC TEC and University of Minho), Carlos Baquero (INESC TEC and University of Minho), Alexey Gotsman (IMDEA Software Institute), Pierre Sutra (Télécom SudParis)

Abstract

Modern web applications replicate their data across the globe and require strong consistency guarantees for their most critical data. These guarantees are usually provided via state-machine replication (SMR). Recent advances in SMR have focused on leaderless protocols, which improve the availability and performance of traditional Paxos-based solutions. We propose Tempo – a leaderless SMR protocol that, in comparison to prior solutions, achieves superior throughput and offers predictable performance even in contended workloads. To achieve these benefits, Tempo timestamps each application command and executes it only after the timestamp becomes stable, i.e., all commands with a lower timestamp are known. Both the timestamping and stability detection mechanisms are fully decentralized, thus obviating the need for a leader replica. Our protocol furthermore generalizes to partial replication settings, enabling scalability in highly parallel workloads. We evaluate the protocol in both real and simulated geo-distributed environments and demonstrate that it outperforms state-of-the-art alternatives.

CCS Concepts: • Theory of computation → Distributed algorithms.

Keywords: Fault tolerance, Consensus, Geo-replication.

ACM Reference Format: Vitor Enes, Carlos Baquero, Alexey Gotsman, and Pierre Sutra. 2021. Efficient Replication via Timestamp Stability. In Sixteenth European Conference on Computer Systems (EuroSys '21), April 26–28, 2021, Online, United Kingdom. ACM, New York, NY, USA, 25 pages. https://doi.org/10.1145/3447786.3456236

1 Introduction

Modern web applications are routinely accessed by clients all over the world. To support such applications, storage systems need to replicate data at different geographical locations while providing strong consistency guarantees for their most critical data. State-machine replication (SMR) [38] is an approach for providing such guarantees used by a number of systems [8, 14, 21, 26, 39, 44]. In SMR, a desired service is defined by a deterministic state machine, and each site maintains its own local replica of the machine. An SMR protocol coordinates the execution of commands at the sites to ensure that the system is linearizable [18], i.e., behaves as if commands are executed sequentially by a single site.

Traditional SMR protocols, such as Paxos [28] and Raft [34], rely on a distinguished leader site that defines the order in which client commands are executed at the replicas. Unfortunately, this site is a single point of failure and contention, and a source of higher latency for clients located far from it. Recent efforts to improve SMR have thus focused on leaderless protocols, which distribute the task of ordering commands among replicas and thus allow a client to contact the closest replica instead of the leader [1, 5, 13, 30, 31, 42]. Compared to centralized solutions, leaderless SMR offers lower average latency, fairer latency distribution with respect to client locations, and higher availability.

Leaderless SMR protocols also generalize to the setting of partial replication, where the service state is split into a set of partitions, each stored at a group of replicas. A client command can access multiple partitions, and the SMR protocol ensures that the system is still linearizable, i.e., behaves as if the commands are executed by a single machine storing a complete service state. This approach allows implementing services that are too big to fit onto a single machine. It also enables scalability, since commands accessing disjoint sets of partitions can be executed in parallel. This has been demonstrated by Janus [32], which adapted a leaderless SMR protocol called Egalitarian Paxos (EPaxos) [31] to the setting of partial replication. The resulting protocol provided better performance than classical solutions such as two-phase commit layered over Paxos.

Unfortunately, all existing leaderless SMR protocols suffer from drawbacks in the way they order commands. Some protocols [1, 5, 13, 31] maintain explicit dependencies between commands: a replica may execute a command only after all its dependencies get executed. These dependencies may form arbitrarily long chains. As a consequence, in theory the protocols do not guarantee progress even under a synchronous network. In practice, their performance is unpredictable and, in particular, exhibits a high tail latency [5, 36]. Other protocols [11, 30] need to contact every replica on the critical path of each command. While these protocols guarantee progress under synchrony, they make the system run at the speed of the slowest replica. All of these drawbacks carry over to the setting of partial replication, where they are aggravated by the fact that commands span multiple machines.

In this paper we propose Tempo, a new leaderless SMR protocol that lifts the above limitations while handling both full and partial replication settings. Tempo guarantees progress under a synchronous network without the need to contact all replicas. It also exhibits low tail latency even in contended workloads, thus ensuring predictable performance. Finally, it delivers higher throughput than prior solutions, such as EPaxos and Janus. The protocol achieves all these benefits by assigning a scalar timestamp to each command and executing commands in the order of these timestamps. To determine when a command can be executed, each replica waits until the command's timestamp is stable, i.e., all commands with a lower timestamp are known. Ordering commands in this way is used in many protocols [1, 9, 11, 27]. A key novelty of Tempo is that both timestamping and stability detection are fault-tolerant and fully decentralized, which preserves the key benefits of leaderless SMR.

In more detail, each Tempo process maintains a local clock from which timestamps are generated. In the case of full replication, to submit a command a client sends it to the closest process, which acts as its coordinator. The coordinator computes a timestamp for the command by forwarding it to a quorum of replicas, each of which makes a timestamp proposal, and taking the maximum of these proposals. If enough replicas in the quorum make the same proposal, then the timestamp is decided immediately (fast path). If not, the coordinator does an additional round trip to the replicas to persist the timestamp (slow path); this may happen when commands are submitted concurrently. Thus, under favorable conditions, the replica nearest to the client decides the command's timestamp in a single round trip.

To execute a command, a replica then needs to determine when its timestamp is stable, i.e., it knows about all commands with lower timestamps. The replica does this by gathering information about which timestamp ranges have been used up by each replica, so that no more commands will get proposals in these ranges. This information is piggy-backed on replicas' messages, which often allows a timestamp of a command to become stable immediately after it is decided.

The above protocol easily extends to partial replication: in this case a command's timestamp is the maximum over the timestamps computed for each of the partitions it accesses.

We evaluate Tempo in three environments: a simulator, a controlled cluster environment and using multiple regions in Amazon EC2. We show that Tempo improves throughput over existing SMR protocols by 1.8-5.1x, while lowering tail latency with respect to prior leaderless protocols by an order of magnitude. This advantage is maintained in partial replication, where Tempo outperforms Janus by 1.2-16x.

2 Partial State-Machine Replication

We consider a geo-distributed system where processes may fail by crashing, but do not behave maliciously. State-machine replication (SMR) is a common way of implementing fault-tolerant services in such a system [38]. In SMR, the service is defined as a deterministic state machine accepting a set of commands C. Each process maintains a replica of the machine and receives commands from clients, external to the system. An SMR protocol coordinates the execution of commands at the processes, ensuring that they stay in sync.

We consider a general version of SMR where each process replicates only a part of the service state – partial SMR (PSMR) [19, 32, 37]. We assume that the service state is divided into partitions, so that each variable defining the state belongs to a unique partition. Partitions are arbitrarily fine-grained: e.g., just a single state variable. Each command accesses one or more partitions. We assume that a process replicates a single partition, but multiple processes may be co-located at the same machine. Each partition is replicated at 푟 processes, of which at most 푓 may fail. Following Flexible Paxos [20], 푓 can be any value such that 1 ≤ 푓 ≤ ⌊(푟−1)/2⌋. This allows using small values of 푓 regardless of the replication factor 푟, which is appropriate in geo-replication [8, 13]. We write I푝 for the set of all the processes replicating a partition 푝, I푐 for the set of processes that replicate the partitions accessed by a command 푐, and I for the set of all processes.

A PSMR protocol allows a process 푖 to submit a command 푐 on behalf of a client. For simplicity, we assume that each command is unique and the process submitting it replicates one of the partitions it accesses: 푖 ∈ I푐. For each partition 푝 accessed by 푐, the protocol then triggers an upcall execute푝(푐) at each process storing 푝, asking it to apply 푐 to the local state of partition 푝. After 푐 is executed by at least one process in each partition it accesses, the process that submitted the command aggregates the return values of 푐 from each partition and returns them to the client.

PSMR ensures the highest standard of consistency of replicated data – linearizability [18] – which provides an illusion that commands are executed sequentially by a single machine storing a complete service state. To this end, a PSMR protocol has to satisfy the following specification. Given two commands 푐 and 푑, we write 푐 ↦→푖 푑 when they access a common partition and 푐 is executed before 푑 at some process 푖 ∈ I푐 ∩ I푑. We also define the following real-time order: 푐 ⇝ 푑 when the command 푐 returns before the command 푑 was submitted. Let ↦→ = (⋃푖∈I ↦→푖) ∪ ⇝. A PSMR protocol ensures the following properties:

Validity. If a process executes some command 푐, then it executes 푐 at most once and only if 푐 was submitted before.

Ordering. The relation ↦→ is acyclic.

Liveness. If a command 푐 is submitted by a non-faulty process or executed at some process, then it is executed at all non-faulty processes in I푐.
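To make the PSMR interface concrete, here is a minimal Rust sketch of the per-partition execute upcall and of the client-side aggregation of return values. This is our own illustration (the paper's framework is in Rust, but none of these names come from it; all types are hypothetical).

```rust
use std::collections::HashMap;

/// A command accesses one or more partitions (hypothetical representation).
#[derive(Clone)]
struct Command {
    id: u64,
    partitions: Vec<u32>,
    payload: Vec<u8>,
}

/// One replica of a single partition: the execute_p upcall applies a command
/// to the local state of partition p and yields that partition's return value.
trait PartitionReplica {
    fn execute(&mut self, cmd: &Command) -> Vec<u8>;
}

/// The submitting process aggregates the return values collected from each
/// partition accessed by the command before replying to the client.
fn aggregate(mut per_partition: HashMap<u32, Vec<u8>>) -> Vec<Vec<u8>> {
    let mut partitions: Vec<u32> = per_partition.keys().copied().collect();
    partitions.sort(); // deterministic order of the partial results
    partitions
        .into_iter()
        .map(|p| per_partition.remove(&p).unwrap())
        .collect()
}
```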

The Ordering property ensures that commands are executed in a consistent manner throughout the system [17]. For example, it implies that two commands, both accessing the same two partitions, cannot be executed at these partitions in contradictory orders. As usual, to ensure Liveness we assume that the network is eventually synchronous, and in particular, that message delays between non-failed processes are eventually bounded [12].

PSMR is expressive enough to implement a wide spectrum of distributed applications. In particular, it directly allows implementing one-shot transactions, which consist of independent pieces of code (such as stored procedures), each accessing a different partition [22, 29, 32]. It can also be used to construct general-purpose transactions [32, 40].

3 Single-Partition Protocol

For simplicity, we first present the protocol in the case when there is only a single partition, and cover the general case in §4. We start with an overview of the single-partition protocol.

To ensure the Ordering property of PSMR, Tempo assigns a scalar timestamp to each command. Processes execute commands in the order of these timestamps, thus ensuring that processes execute commands in the same order. To submit a command, a client sends it to a nearby process which acts as the coordinator for the command. The coordinator is in charge of assigning a timestamp to the command and communicating this timestamp to all processes. When a process finds out about the command's timestamp, we say that the process commits the command. If the coordinator is suspected to have failed, another process takes over its role through a recovery mechanism (§5). Tempo ensures that, even in case of failures, processes agree on the timestamp assigned to the command, as stated by the following property.

Property 1 (Timestamp agreement). Two processes cannot commit the same command with different timestamps.

A coordinator computes a timestamp for a command as follows (§3.1). It first forwards the command to a fast quorum of ⌊푟/2⌋ + 푓 processes, including the coordinator itself. Each process maintains a Clock variable. When the process receives a command from the coordinator, it increments Clock and replies to the coordinator with the new Clock value as a timestamp proposal. The coordinator then takes the highest proposal as the command's timestamp. If enough processes have made such a proposal, the coordinator considers the timestamp decided and takes the fast path: it just communicates the timestamp to the processes, which commit the command. The protocol ensures that the timestamp can be recovered even if the coordinator fails, thus maintaining Property 1. Otherwise, the coordinator takes the slow path, where it stores the timestamp at a slow quorum of 푓 + 1 processes using a variant of Flexible Paxos [20]. This ensures that the timestamp survives any allowed number of failures. The slow path may have to be taken in cases when commands are submitted concurrently to the same partition (however, recall that partitions may be arbitrarily fine-grained).

Since processes execute committed commands in the timestamp order, before executing a command a process must know all the commands that precede it.

Property 2 (Timestamp stability). Consider a command 푐 committed at 푖 with timestamp 푡. Process 푖 can only execute 푐 after its timestamp is stable, i.e., every command with a timestamp lower or equal to 푡 is also committed at 푖.

To check the stability of a timestamp 푡 (§3.2), each process 푖 tracks timestamp proposals issued by other processes. Once the Clocks at any majority of the processes pass 푡, process 푖 can be sure that new commands will get higher timestamps: these are computed as the maximal proposal from at least a majority, and any two majorities intersect. Process 푖 can then use the information gathered about the timestamp proposals from other processes to find out about all the commands that have got a timestamp lower than 푡.

3.1 Commit Protocol

Algorithm 1 specifies the single-partition commit protocol at a process 푖 replicating a partition 푝. We assume that self-addressed messages are delivered immediately. A command 푐 ∈ C is submitted by a client by calling submit(푐) at a process 푖 that replicates a partition accessed by the command (line 1). Process 푖 then creates a unique identifier id ∈ D and a mapping Q from a partition accessed by the command to the fast quorum to be used at that partition. Because we consider a single partition for now, in what follows Q contains only one fast quorum, Q[푝]. Finally, process 푖 sends MSubmit(id, 푐, Q) to a set of processes I^푖_푐, which in the single-partition case simply denotes {푖}.

A command goes through several phases at each process: from the initial phase start, to a commit phase once the command is committed, and an execute phase once it is executed. We summarize these phases and allowed phase transitions in Figure 1. A mapping phase at a process tracks the progress of a command with a given identifier through phases. For brevity, the name of the phase written in lower case denotes all the commands in that phase, e.g., start = {id ∈ D | phase[id] = start}. We also define pending as follows: pending = payload ∪ propose ∪ recover-p ∪ recover-r.

Start phase. When a process receives an MSubmit message, it starts serving as the command coordinator (line 5). The coordinator first computes its timestamp proposal for the command as Clock + 1. After computing the proposal, the coordinator sends an MPropose message to the fast quorum Q[푝] and an MPayload message to the remaining processes. Since the fast quorum contains the coordinator, the coordinator also sends the MPropose message to itself. As mentioned earlier, self-addressed messages are delivered immediately.
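As an illustration of the Start phase just described, the Rust sketch below shows a coordinator handling MSubmit: it proposes Clock + 1, sends MPropose to the fast quorum and MPayload to the rest of the partition. This is our own sketch with hypothetical types, not the paper's pseudocode or its implementation.

```rust
#[derive(Clone)]
enum Msg {
    MPropose { id: u64, cmd: Vec<u8>, quorum: Vec<usize>, t: u64 },
    MPayload { id: u64, cmd: Vec<u8>, quorum: Vec<usize> },
}

struct Coordinator {
    clock: u64,                      // the Clock variable
    partition_processes: Vec<usize>, // I_p
}

impl Coordinator {
    /// Handle MSubmit: propose Clock + 1 to the fast quorum and ship the
    /// payload to the remaining processes of the partition. The clock itself
    /// is only advanced later, when the coordinator handles its own MPropose.
    fn on_submit(
        &mut self,
        id: u64,
        cmd: Vec<u8>,
        fast_quorum: Vec<usize>,
    ) -> Vec<(usize, Msg)> {
        let t = self.clock + 1;
        let mut out = Vec::new();
        for &p in &self.partition_processes {
            let msg = if fast_quorum.contains(&p) {
                Msg::MPropose { id, cmd: cmd.clone(), quorum: fast_quorum.clone(), t }
            } else {
                Msg::MPayload { id, cmd: cmd.clone(), quorum: fast_quorum.clone() }
            };
            out.push((p, msg));
        }
        out
    }
}
```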

Algorithm 1: Commit protocol at process 푖 ∈ I푝.

  1  submit(푐)
  2    pre: 푖 ∈ I푐
  3    id ← next_id(); Q ← fast_quorums(푖, I푐)
  4    send MSubmit(id, 푐, Q) to I^푖_푐
  5  receive MSubmit(id, 푐, Q)
  6    푡 ← Clock + 1
  7    send MPropose(id, 푐, Q, 푡) to Q[푝]
  8    send MPayload(id, 푐, Q) to I푝 \ Q[푝]
  9  receive MPayload(id, 푐, Q)
 10    pre: id ∈ start
 11    cmd[id] ← 푐; quorums[id] ← Q; phase[id] ← payload
 12  receive MPropose(id, 푐, Q, 푡) from 푗
 13    pre: id ∈ start
 14    cmd[id] ← 푐; quorums[id] ← Q; phase[id] ← propose
 15    ts[id] ← proposal(id, 푡)
 16    send MProposeAck(id, ts[id]) to 푗
 17  receive MProposeAck(id, 푡푗) from ∀푗 ∈ 푄
 18    pre: id ∈ propose ∧ 푄 = quorums[id][푝]
 19    푡 ← max{푡푗 | 푗 ∈ 푄}
 20    if count(푡) ≥ 푓 then send MCommit(id, 푡) to Icmd[id]
 21    else send MConsensus(id, 푡, 푖) to I푝
 22  receive MCommit(id, 푡)
 23    pre: id ∈ pending
 24    ts[id] ← 푡; phase[id] ← commit
 25    bump(ts[id])
 26  receive MConsensus(id, 푡, 푏) from 푗
 27    pre: bal[id] ≤ 푏
 28    ts[id] ← 푡; bal[id] ← 푏; abal[id] ← 푏
 29    bump(푡)
 30    send MConsensusAck(id, 푏) to 푗
 31  receive MConsensusAck(id, 푏) from 푄
 32    pre: bal[id] = 푏 ∧ |푄| = 푓 + 1
 33    send MCommit(id, ts[id]) to Icmd[id]
 34  proposal(id, 푚)
 35    푡 ← max(푚, Clock + 1)
 36    Detached ← Detached ∪ {⟨푖, 푢⟩ | Clock + 1 ≤ 푢 ≤ 푡 − 1}
 37    Attached[id] ← {⟨푖, 푡⟩}
 38    Clock ← 푡
 39    return 푡
 40  bump(푡)
 41    푡 ← max(푡, Clock)
 42    Detached ← Detached ∪ {⟨푖, 푢⟩ | Clock + 1 ≤ 푢 ≤ 푡}
 43    Clock ← 푡

Figure 1. Command journey through phases in Tempo.

Payload phase. Upon receiving an MPayload message (line 9), a process simply saves the command payload in a mapping cmd and sets the command's phase to payload. It also saves Q in a mapping quorums. This is necessary for the recovery mechanism to know the fast quorum used for the command (§5).

Propose phase. Upon receiving an MPropose message (line 12), a fast-quorum process also saves the command payload and fast quorums, but sets its phase to propose. Then the process computes its own timestamp proposal using the function proposal and stores it in a mapping ts. Finally, the process replies to the coordinator with an MProposeAck message, carrying the computed timestamp proposal.

The function proposal takes as input an identifier id and a timestamp 푚 and computes a timestamp proposal as 푡 = max(푚, Clock + 1), so that 푡 ≥ 푚 (line 35). The function bumps the Clock to the computed timestamp 푡 and returns 푡 (lines 38-39); we explain lines 36-37 later. As we have already noted, the coordinator computes the command's timestamp as the highest of the proposals from fast-quorum processes. Proactively taking the max between the coordinator's proposal 푚 and Clock + 1 in proposal ensures that a process's proposal is at least as high as the coordinator's; as we explain shortly, this helps recovering timestamps in case of coordinator failure.

Commit phase. Once the coordinator receives an MProposeAck message from all the processes in the fast quorum 푄 = Q[푝] (line 17), it computes the command's timestamp as the highest of all timestamp proposals: 푡 = max{푡푗 | 푗 ∈ 푄}. Then the coordinator decides to either take the fast path (line 20) or the slow path (line 21). Both paths end with the coordinator sending an MCommit message containing the command's timestamp. Since |푄| = ⌊푟/2⌋ + 푓 and 푓 ≥ 1, we have the following property which ensures that a committed timestamp is computed over (at least) a majority of processes.

Property 3. For any message MCommit(id, 푡), there is a set of processes 푄 such that |푄| ≥ ⌊푟/2⌋ + 1 and 푡 = max{푡푗 | 푗 ∈ 푄}, where 푡푗 is the output of function proposal(id, _) previously called at process 푗 ∈ 푄.

This property is also preserved if 푡 is computed by a process performing recovery in case of coordinator failure (§5).

Once a process receives an MCommit message (line 22), it saves the command's timestamp in ts[id] and moves the command to the commit phase. It then bumps the Clock to the committed timestamp using a function bump (line 40). We next explain the fast and slow paths, as well as the conditions under which they are taken.
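The clock bookkeeping performed by proposal and bump (lines 34-43 of Algorithm 1) can be sketched in Rust as follows. This is our own simplified illustration, not the paper's code; the promise sets it fills are explained in §3.2.

```rust
use std::collections::{BTreeSet, HashMap};

struct ClockState {
    id: usize,                            // this process
    clock: u64,                           // the Clock variable
    detached: BTreeSet<(usize, u64)>,     // promises not tied to any command
    attached: HashMap<u64, (usize, u64)>, // command id -> attached promise
}

impl ClockState {
    /// proposal(id, m): propose max(m, Clock + 1), promising away every
    /// timestamp skipped in between (they become detached promises).
    fn proposal(&mut self, cmd_id: u64, m: u64) -> u64 {
        let t = m.max(self.clock + 1);
        for u in self.clock + 1..t {
            self.detached.insert((self.id, u));
        }
        self.attached.insert(cmd_id, (self.id, t));
        self.clock = t;
        t
    }

    /// bump(t): advance the clock to at least t, emitting detached promises
    /// for every timestamp that is skipped along the way.
    fn bump(&mut self, t: u64) {
        let t = t.max(self.clock);
        for u in self.clock + 1..=t {
            self.detached.insert((self.id, u));
        }
        self.clock = t;
    }
}
```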

Fast path. The fast path can be taken if the highest proposal 푡 is made by at least 푓 processes. This condition is expressed by count(푡) ≥ 푓 in line 20, where count(푡) = |{푗 ∈ 푄 | 푡푗 = 푡}|. If the condition holds, the coordinator immediately sends an MCommit message with the computed timestamp¹. The protocol ensures that, if the coordinator fails before sending all the MCommit messages, 푡 can be recovered as follows. First, the condition count(푡) ≥ 푓 ensures that the timestamp 푡 can be obtained without 푓 − 1 fast-quorum processes (e.g., if they fail) by selecting the highest proposal made by the remaining quorum members. Moreover, the proposal by the coordinator is also not necessary to obtain 푡. This is because fast-quorum processes only propose timestamps no lower than the coordinator's proposal (line 15). As a consequence, the coordinator's proposal is only the highest proposal 푡 when all processes propose the same timestamp, in which case a single process suffices to recover 푡. It follows that 푡 can be obtained without 푓 fast-quorum processes including the initial coordinator by selecting the highest proposal sent by the remaining (⌊푟/2⌋ + 푓) − 푓 = ⌊푟/2⌋ quorum members. This observation is captured by the following property.

¹In line 20 we send the message to I푐 even though this set is equal to I푝 in the single-partition case. We do this to reuse the pseudocode when presenting the multi-partition protocol in §4.

Property 4. Any timestamp committed on the fast path can be obtained by selecting the highest proposal sent in MPropose by at least ⌊푟/2⌋ fast-quorum processes distinct from the initial coordinator.

Table 1. Tempo examples with 푟 = 5 processes while tolerating 푓 faults. Only 4 processes are depicted, A, B, C and D, with A always acting as the coordinator.

              A    B        C          D          match   fast path
  a) 푓 = 2    6    6 → 7    10 → 11    10 → 11    ✗       ✓
  b) 푓 = 2    6    6 → 7    10 → 11    5 → 6      ✗       ✗
  c) 푓 = 1    6    6 → 7    10 → 11               ✗       ✓
  d) 푓 = 1    6    5 → 6    1 → 6                 ✓       ✓

Fast path examples. Table 1 contains several examples that illustrate the fast-path condition of Tempo and Property 4. All examples consider 푟 = 5 processes. We highlight timestamp proposals in bold. Process A acts as the coordinator and sends 6 in its MPropose message. The fast quorum 푄 is {A, B, C} when 푓 = 1 and {A, B, C, D} when 푓 = 2. The example in Table 1 a) considers Tempo 푓 = 2. Once process B receives the MPropose with timestamp 6, it bumps its Clock from 6 to 7 and sends a proposal 7 in the MProposeAck. Similarly, processes C and D bump their Clock from 10 to 11 and propose 11. Thus, A receives proposals 푡A = 6, 푡B = 7, 푡C = 11 and 푡D = 11, and computes the command's timestamp as 푡 = max{6, 7, 11} = 11. Since count(11) = 2 ≥ 푓, the coordinator takes the fast path, even though the proposals did not match. In order to understand why this is safe, assume that the coordinator fails (before sending all the MCommit messages) along with another fast-quorum process. Independently of which ⌊푟/2⌋ = 2 fast-quorum processes survive ({B, C} or {B, D} or {C, D}), timestamp 11 is always present and can be recovered as stated by Property 4. This is not the case for the example in Table 1 b). Here A receives 푡A = 6, 푡B = 7, 푡C = 11 and 푡D = 6, and again computes 푡 = max{6, 7, 11} = 11. Since count(11) = 1 < 푓, the coordinator cannot take the fast path: timestamp 11 was proposed solely by C and would be lost if both this process and the coordinator fail. The examples in Table 1 c) and d) consider 푓 = 1, and the fast path is taken in both, independently of the timestamps proposed. This is because the Tempo fast-path condition count(max{푡푗 | 푗 ∈ 푄}) ≥ 푓 trivially holds with 푓 = 1, and thus Tempo 푓 = 1 always takes the fast path.

Note that when the Clock at a fast-quorum process is below the proposal 푚 sent by the coordinator, i.e., Clock < 푚, the process makes the same proposal as the coordinator. This is not the case when Clock ≥ 푚, which can happen when commands are submitted concurrently to the partition. Nonetheless, Tempo is able to take the fast path in some of these situations, as illustrated in Table 1.

Slow path. When the fast-path condition does not hold, the timestamp computed by the coordinator is not yet guaranteed to be persistent: if the coordinator fails before sending all the MCommit messages, a process taking over its job may compute a different timestamp. To maintain Property 1 in this case, the coordinator first reaches an agreement on the computed timestamp with other processes replicating the same partition. This is implemented using single-decree Flexible Paxos [20]. For each identifier we allocate ballot numbers to processes round-robin, with ballot 푖 reserved for the initial coordinator 푖 and ballots higher than 푟 for processes performing recovery. Every process stores for each identifier id the ballot bal[id] it is currently participating in and the last ballot abal[id] in which it accepted a consensus proposal (if any). When the initial coordinator 푖 decides to go onto the slow path, it performs an analog of Paxos Phase 2: it sends an MConsensus message with its consensus proposal and ballot 푖 to a slow quorum that includes itself. Following Flexible Paxos, the size of the slow quorum is only 푓 + 1, rather than a majority like in classical Paxos. As usual in Paxos, a process accepts an MConsensus message only if its bal[id] is not greater than the ballot in the message (line 27). Then it stores the consensus proposal, sets bal[id] and abal[id] to the ballot in the message, and replies to the coordinator with MConsensusAck. Once the coordinator gathers 푓 + 1 such replies (line 31), it is sure that its consensus proposal will survive the allowed number of failures 푓, and it thus broadcasts the proposal in an MCommit message.

3.2 Execution Protocol

A process executes committed commands in the timestamp order. To this end, as required by Property 2, a process executes a command only after its timestamp becomes stable, i.e., all commands with a lower timestamp are known. To detect stability, Tempo tracks which timestamp ranges have been used up by each process using the following mechanism.

Algorithm 2: Execution protocol at process 푖 ∈ I푝.

 44  periodically
 45    send MPromises(Detached, Attached) to I푝
 46  receive MPromises(퐷, 퐴)
 47    퐶 ← {푎 | ⟨id, 푎⟩ ∈ 퐴 ∧ id ∈ commit ∪ execute}
 48    Promises ← Promises ∪ 퐷 ∪ 퐶
 49  periodically
 50    ℎ ← sort{highest_contiguous_promise(푗) | 푗 ∈ I푝}
 51    ids ← {id ∈ commit | ts[id] ≤ ℎ[⌊푟/2⌋]}
 52    for id ∈ ids ordered by ⟨ts[id], id⟩
 53      execute푝(cmd[id]); phase[id] ← execute
 54  highest_contiguous_promise(푗)
 55    max{푐 ∈ N | ∀푢 ∈ {1 . . . 푐} · ⟨푗, 푢⟩ ∈ Promises}

Promise collection. A promise is a pair ⟨푗, 푢⟩ ∈ I푝 × N where 푗 is a process and 푢 a timestamp. Promises can be attached to some command or detached. A promise ⟨푗, 푢⟩ attached to command 푐 means that process 푗 proposed timestamp 푢 for command 푐, and thus will not use this timestamp again. A detached promise ⟨푗, 푢⟩ means that process 푗 will never propose timestamp 푢 for any command.

The function proposal is responsible for collecting the promises issued when computing a timestamp proposal 푡 (line 34). This function generates a single attached promise for the proposal 푡, stored in a mapping Attached (line 37). The function also generates detached promises for the timestamps ranging from Clock + 1 up to 푡 − 1 (line 36): since the process bumps the Clock to 푡 (line 38), it will never assign a timestamp in this range. Detached promises are accumulated in the Detached set. In Table 1 d), process B generates an attached promise ⟨B, 6⟩, while C generates ⟨C, 6⟩. Process B does not issue detached promises, since its Clock is bumped only by 1, from 5 to 6. However, process C bumps its Clock by 5, from 1 to 6, generating four detached promises: ⟨C, 2⟩, ⟨C, 3⟩, ⟨C, 4⟩, ⟨C, 5⟩.

Algorithm 2 specifies the Tempo execution protocol at a process replicating a partition 푝. Periodically, each process broadcasts its detached and attached promises to the other processes replicating the same partition by sending them in an MPromises message (line 45)². When a process receives the promises (line 46), it adds them to a set Promises. Detached promises are added immediately. An attached promise associated with a command identifier id is only added once id is committed or executed (line 47).

²To minimize the size of these messages, a promise is sent only once in the absence of failures. Promises can be garbage-collected as soon as they are received by all the processes within the partition.

Stability detection. Tempo determines when a timestamp is stable (Property 2) according to the following theorem.

Theorem 1. A timestamp 푠 is stable at a process 푖 if the variable Promises contains all the promises up to 푠 by some set of processes 푄 with |푄| ≥ ⌊푟/2⌋ + 1.

Proof. Assume that at some time 휏 the variable Promises at a process 푖 contains all the promises up to 푠 by some set of processes 푄 with |푄| ≥ ⌊푟/2⌋ + 1. Assume further that a command 푐 with identifier id is eventually committed with timestamp 푡 ≤ 푠 at some process 푗, i.e., 푗 receives an MCommit(id, 푡). We need to show that command 푐 is committed at 푖 at time 휏. By Property 3 we have 푡 = max{푡푘 | 푘 ∈ 푄′}, where |푄′| ≥ ⌊푟/2⌋ + 1 and 푡푘 is the output of function proposal(id, _) at a process 푘. As 푄 and 푄′ are majorities, there exists some process 푙 ∈ 푄 ∩ 푄′. Then this process attaches a promise ⟨푙, 푡푙⟩ to 푐 (line 37) and 푡푙 ≤ 푡 ≤ 푠. Since the variable Promises at process 푖 contains all the promises up to 푠 by process 푙, it also contains the promise ⟨푙, 푡푙⟩. According to line 47, when this promise is incorporated into Promises, command 푐 has been already committed at 푖, as required. □

A process periodically computes the highest contiguous promise for each process replicating the same partition, and stores these promises in a sorted array ℎ (line 50). It determines the highest stable timestamp according to Theorem 1 as the one at index ⌊푟/2⌋ in ℎ. The process then selects all the committed commands with a timestamp no higher than the stable one and executes them in the timestamp order, breaking ties using their identifiers. After a command is executed, it is moved to the execute phase, which ends its journey.

To gain more intuition about the above mechanism, consider Figure 2, where 푟 = 3. There we represent the variable Promises of some process as a table, with processes as columns and timestamps as rows. For example, a promise ⟨A, 2⟩ is in Promises if it is present in column A, row 2. There are three sets of promises, 푋, 푌 and 푍, to be added to Promises. For each combination of these sets, the right hand side of Figure 2 shows the highest stable timestamp if all the promises in the combination are in Promises. For instance, assume that Promises = 푌 ∪ 푍, so that the set contains promise 2 by A, all promises up to 3 by B, and all promises up to 2 by C. As Promises contains all promises up to 2 by the majority {B, C}, timestamp 2 is stable: any uncommitted command 푐 must be committed with a timestamp higher than 2. Indeed, since 푐 is not yet committed, Promises does not contain any promise attached to 푐 (line 47). Moreover, to get committed, 푐 must generate attached promises at a majority of processes (Property 3), and thus, at either B or C. If 푐 generates an attached promise at B, its coordinator will receive at least proposal 4 from B; if at C, its coordinator will receive at least proposal 3. In either case, and since the committed timestamp is the highest timestamp proposal, the committed timestamp of 푐 must be at least 3 > 2, as required.
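The stability rule of Theorem 1 (lines 49-55 of Algorithm 2) can be sketched in Rust as follows: compute each process's highest contiguous promise and take the entry at index ⌊푟/2⌋ of the sorted array. This is our own illustration with simplified types, not the paper's code; the example reuses the Y ∪ Z case of Figure 2.

```rust
use std::collections::HashSet;

/// Highest c such that process j promised every timestamp 1..=c.
fn highest_contiguous_promise(promises: &HashSet<(usize, u64)>, j: usize) -> u64 {
    let mut c = 0;
    while promises.contains(&(j, c + 1)) {
        c += 1;
    }
    c
}

/// Timestamp at index ⌊r/2⌋ of the sorted array of contiguous promises:
/// a majority of processes has promised everything up to this value.
fn highest_stable_timestamp(promises: &HashSet<(usize, u64)>, r: usize) -> u64 {
    let mut h: Vec<u64> = (0..r)
        .map(|j| highest_contiguous_promise(promises, j))
        .collect();
    h.sort();
    h[r / 2]
}

fn main() {
    // Figure 2 with r = 3 (processes 0 = A, 1 = B, 2 = C) and Promises = Y ∪ Z.
    let promises: HashSet<(usize, u64)> =
        [(1, 1), (1, 2), (1, 3), (0, 2), (2, 1), (2, 2)].into_iter().collect();
    assert_eq!(highest_stable_timestamp(&promises, 3), 2);
}
```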

Figure 2. Stable timestamps for different sets of promises. (From the figure: 푋 = {⟨A, 1⟩, ⟨C, 3⟩}, 푌 = {⟨B, 1⟩, ⟨B, 2⟩, ⟨B, 3⟩}, 푍 = {⟨A, 2⟩, ⟨C, 1⟩, ⟨C, 2⟩}; 푋, 푌 or 푍 alone → 0; 푋 ∪ 푌 → 1; 푋 ∪ 푍 → 2; 푌 ∪ 푍 → 2; 푋 ∪ 푌 ∪ 푍 → 3.)

Figure 3. Comparison between timestamp stability (left) and two approaches using explicit dependencies (right). (From the figure: a "depends on" graph and a "blocked on" graph over the commands 푤, 푥, 푦, 푧.)

In our implementation, promises generated by fast-quorum processes when computing their proposal for a command (line 34) are piggybacked on the MProposeAck message, and then broadcast by the coordinator in the MCommit message (omitted from the pseudocode). This speeds up stability detection and often allows a timestamp of a command to become stable immediately after it is decided. Notice that when committing a command, Tempo generates detached promises up to the timestamp of that command (line 25). This helps ensuring the liveness of the execution mechanism, since the propagation of these promises contributes to advancing the highest stable timestamp.

3.3 Timestamp Stability vs Explicit Dependencies

Prior leaderless protocols [1, 5, 13, 31, 43] commit each command 푐 with a set of explicit dependencies dep[푐]. In contrast, Tempo does not track explicit dependencies, but uses timestamp stability to decide when to execute a command. This allows Tempo to ensure progress under synchrony. Protocols using explicit dependencies do not offer such a guarantee, as they can arbitrarily delay the execution of a command. In practice, this translates into a high tail latency.

Figure 3 illustrates this issue using four commands 푤, 푥, 푦, 푧 and 푟 = 3 processes. Process A submits 푤 and 푥, B submits 푦, and C submits 푧. Commands arrive at the processes in the following order: 푤, 푥, 푧 at A; 푦, 푤 at B; and 푧, 푦 at C. Because in this example only process A has seen command 푥, this command is not yet committed. In Tempo, the above command arrival order generates the following attached promises: {⟨A, 1⟩, ⟨B, 2⟩} for 푤, {⟨A, 2⟩} for 푥, {⟨B, 1⟩, ⟨C, 2⟩} for 푦, and {⟨C, 1⟩, ⟨A, 3⟩} for 푧. Commands 푤, 푦 and 푧 are then committed with the following timestamps: ts[푤] = 2, ts[푦] = 2, and ts[푧] = 3. On the left of Figure 3 we present the Promises variable of some process once it receives the promises attached to the three committed commands. Given these promises, timestamp 2 is stable at the process. Even though command 푥 is not committed, timestamp stability ensures that its timestamp must be greater than 2. Thus, commands 푤 and 푦, committed with timestamp 2, can be safely executed. We now show how two approaches that use explicit dependencies behave in the above example.

Dependency-based ordering. EPaxos [31] and follow-ups [5, 13, 43] order commands based on their committed dependencies. For example, in EPaxos, the above command arrival order results in commands 푤, 푦 and 푧 committed with the following dependencies: dep[푤] = {푦}, dep[푦] = {푧}, dep[푧] = {푤, 푥}. These form the graph shown on the top right of Figure 3. Since the dependency graph may be cyclic (as in Figure 3), commands cannot be simply executed in the order dictated by the graph. Instead, the protocol waits until it forms strongly connected components of the graph and then executes these components one at a time. As we show in §D, the size of such components is a priori unbounded. This can lead to pathological scenarios where the protocol continuously commits commands but can never execute them, even under a synchronous network [31, 36]. It may also significantly delay the execution of committed commands, as illustrated by our example: since command 푥 has not yet been committed, and the strongly connected component formed by the committed commands 푤, 푦 and 푧 depends on 푥, no command can be executed – unlike in Tempo. As we demonstrate in our experiments (§6), execution delays in such situations lead to high tail latencies.
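The blocking just described for dependency-based ordering can be checked with a small Rust sketch of the example's dependency graph: executing 푤 requires every command transitively reachable from it to be committed, and the cycle 푤 → 푦 → 푧 → {푤, 푥} drags in the uncommitted 푥. This is our own illustration (the function and names are hypothetical), not EPaxos code.

```rust
use std::collections::{HashMap, HashSet};

/// True if some command transitively reachable from `start` through the
/// dependency graph is not yet committed, i.e. execution must wait.
fn blocked_on_uncommitted(
    start: char,
    deps: &HashMap<char, Vec<char>>,
    committed: &HashSet<char>,
) -> bool {
    let mut stack = vec![start];
    let mut seen = HashSet::new();
    while let Some(c) = stack.pop() {
        if !seen.insert(c) {
            continue;
        }
        if !committed.contains(&c) {
            return true; // a transitive dependency is not committed yet
        }
        if let Some(ds) = deps.get(&c) {
            stack.extend(ds.iter().copied());
        }
    }
    false
}

fn main() {
    // dep[w] = {y}, dep[y] = {z}, dep[z] = {w, x}; only w, y, z are committed.
    let deps: HashMap<char, Vec<char>> =
        [('w', vec!['y']), ('y', vec!['z']), ('z', vec!['w', 'x'])].into_iter().collect();
    let committed: HashSet<char> = ['w', 'y', 'z'].into_iter().collect();
    assert!(blocked_on_uncommitted('w', &deps, &committed));
}
```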

Dependency-based stability. Caesar [1] associates each command 푐 not only with a set of dependencies dep[푐], but also with a unique timestamp ts[푐]. Commands are executed in timestamp order, and dependencies are used to determine when a timestamp is stable, and thus when the command can be executed. For this, dependencies have to be consistent with timestamps in the following sense: for any two commands 푐 and 푐′, if ts[푐] < ts[푐′], then 푐 ∈ dep[푐′]. Then the timestamp of a command can be considered stable when the transitive dependencies of the command are committed.

Caesar determines the predecessors of a command while agreeing on its timestamp. To this end, the coordinator of a command sends the command to a quorum together with a timestamp proposal. The proposal is committed when enough processes vote for it. Assume that in our example A proposes 푤 and 푥 with timestamps 1 and 4, respectively, B proposes 푦 with 2, and C proposes 푧 with 3. When B receives command 푤 with timestamp proposal 1, it has already proposed 푦 with timestamp 2. If these proposals succeed and are committed, the above invariant is maintained only if 푤 is a dependency of 푦. However, because 푦 has not yet been committed, its dependencies are unknown and thus B cannot yet ensure that 푤 is a dependency of 푦. For this reason, B must block its response about 푤 until 푦 is committed. Similarly, command 푦 is blocked at C waiting for 푧, and 푧 is blocked at A waiting for 푥. This situation, depicted in the bottom right of Figure 3, results in no command being committed – again, unlike in Tempo. In fact, as we show in §D, the blocking mechanism of Caesar allows pathological scenarios where commands are never committed at all. Similarly to EPaxos, in practice this leads to high tail latencies (§6). In contrast to Caesar, Tempo computes the predecessors of a command separately from agreeing on its timestamp, via background stability detection. This obviates the need for artificial delays in agreement, allowing Tempo to offer low tail latency (§6).

Limitations of timestamp stability. Protocols that track explicit dependencies are able to distinguish between read and write commands. In these protocols writes depend on both reads and writes, but reads only have to depend on writes. The latter feature improves the performance in read-dominated workloads. In contrast, Tempo does not distinguish between read and write commands, so that its performance is not affected by the ratio of reads in the workload. We show in §6 that this limitation does not prevent Tempo from providing similar throughput as the best-case scenario (i.e., a read-only workload) of protocols such as EPaxos and Janus. Adapting techniques that exploit the distinction between reads and writes is left as future work.

4 Multi-Partition Protocol

Algorithm 3: Multi-partition protocol at process 푖 ∈ I푝.

 56  receive MCommit(id, 푡푗) from ∀푗 ∈ 푃 = I^푖_cmd[id]
 57    pre: id ∈ pending
 58    ts[id] ← max{푡푗 | 푗 ∈ 푃}; phase[id] ← commit
 59    bump(ts[id])
 60  periodically
 61    ℎ ← sort{highest_contiguous_promise(푗) | 푗 ∈ I푝}
 62    ids ← {id ∈ commit | ts[id] ≤ ℎ[⌊푟/2⌋]}
 63    for id ∈ ids ordered by ⟨ts[id], id⟩
 64      send MStable(id) to Icmd[id]
 65      wait receive MStable(id) from ∀푗 ∈ I^푖_cmd[id]
 66      execute푝(cmd[id]); phase[id] ← execute
 67  receive MPropose(id, 푐, Q, 푡) from 푗
       ...
 68    send MBump(id, ts[id]) to I^푖_푐
 69  receive MBump(id, 푡) from 푗
 70    pre: id ∈ propose
 71    bump(푡)

Algorithm 3 extends the Tempo commit and execution protocols to handle commands that access multiple partitions. This is achieved by submitting a multi-partition command at each of the partitions it accesses using Algorithm 1. Once committed with some timestamp at each of these partitions, the command's final timestamp is computed as the maximum of the committed timestamps. A command is executed once it is stable at all the partitions it accesses. As previously, commands are executed in the timestamp order.

In more detail, when a process 푖 submits a multi-partition command 푐 on behalf of a client (line 1), it sends an MSubmit message to a set I^푖_푐. For each partition 푝 accessed by 푐, the set I^푖_푐 contains a responsive replica of 푝 close to 푖 (e.g., located in the same data center). The processes in I^푖_푐 then serve as coordinators of 푐 in the respective partitions, following the steps in Algorithm 1. This algorithm ends with the coordinator in each partition sending an MCommit message to I푐, i.e., all processes that replicate a partition accessed by 푐 (lines 20 and 33; note that I푐 ≠ I푝 because 푐 accesses multiple partitions). Hence, each process in I푐 receives as many MCommits as the number of partitions accessed by 푐. Once this happens, the process executes the handler at line 56 in Algorithm 3, which replaces the previous MCommit handler in Algorithm 1. The process computes the final timestamp of the multi-partition command as the highest of the timestamps committed at each partition, moves the command to the commit phase and bumps the Clock to the computed timestamp, generating detached promises.

Commands are executed using the handler at line 60, which replaces that at line 49. This detects command stability using Theorem 1, which also holds in the multi-partition case. The handler signals that a command is stable at a partition by sending an MStable message (line 64). Once such a message is received from all the partitions accessed by 푐, the command is executed. The exchange of MStable messages follows the approach in [4] and ensures the real-time order constraint in the Ordering property of PSMR (§2).

Example. Figure 4 shows an example of Tempo 푓 = 1 with 푟 = 5 and 2 partitions. Only 3 processes per partition are depicted. Partition 0 is replicated at A, B and C, and partition 1 at F, G and H. Processes with the same color (e.g., B and G) are located nearby each other (e.g., in the same machine or data center). Processes A and F are the coordinators for some command that accesses the two partitions. At partition 0, A computes 6 as its timestamp proposal and sends it in an MPropose message to the fast quorum {A, B, C} (the downward arrows in Figure 4). These processes make the same proposal, and thus the command is committed at partition 0 with timestamp 6. Similarly, at partition 1, F computes 10 as its proposal and sends it to {F, G, H}, all of which propose the same. The command is thus committed at partition 1 with timestamp 10. The final timestamp of the command is then computed as max{6, 10} = 10.
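A minimal Rust sketch of the multi-partition MCommit handler (lines 56-59 of Algorithm 3): once one commit timestamp per accessed partition is known, the final timestamp is their maximum. This is our own illustration with hypothetical types, not the paper's code; the usage example reproduces the Figure 4 values.

```rust
use std::collections::HashMap;

struct MultiPartitionCommit {
    expected_partitions: usize,
    per_partition_ts: HashMap<u32, u64>, // partition -> committed timestamp
}

impl MultiPartitionCommit {
    /// Record the timestamp committed at one partition; once every accessed
    /// partition has reported, return the command's final timestamp.
    fn on_mcommit(&mut self, partition: u32, t: u64) -> Option<u64> {
        self.per_partition_ts.insert(partition, t);
        if self.per_partition_ts.len() == self.expected_partitions {
            self.per_partition_ts.values().copied().max()
        } else {
            None
        }
    }
}

fn main() {
    // The example of Figure 4: committed with 6 at partition 0 and 10 at partition 1.
    let mut c = MultiPartitionCommit {
        expected_partitions: 2,
        per_partition_ts: HashMap::new(),
    };
    assert_eq!(c.on_mcommit(0, 6), None);
    assert_eq!(c.on_mcommit(1, 10), Some(10));
}
```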

Assume that the stable timestamp at A is 5 and at F is 9 when they compute the final timestamp for the command. Once F receives the attached promises by the majority {F, G, H}, timestamp 10 becomes stable at F. This is not the case at A, as the attached promises by the majority {A, B, C} only make timestamp 6 stable. However, processes A, B and C also generate detached promises up to timestamp 10 when receiving the MCommit messages for the command (line 59). When A receives these promises, it declares timestamp 10 stable. This occurs after two extra message delays: an MCommit from A and F to B and C, and then MPromises from B and C back to A. Since the command's timestamp is stable at both A and F, once these processes exchange MStable messages, the command can finally be executed at each.

Figure 4. Example of Tempo with 2 partitions. Next to each process we show the clock updates upon receiving MPropose messages and, in dashed boxes, the updates upon receiving MCommit or MBump messages (whichever occurs first).

Faster stability. Tempo avoids the above extra delays by generating the detached promises needed for stability earlier than in the MCommit handler. For this we amend the MPropose handler as shown in Algorithm 3. When a process receives an MPropose message, it follows the same steps as in Algorithm 1. It then additionally sends an MBump message containing its proposal to the nearby processes that replicate a partition accessed by the command (line 68). Upon receiving this message (line 69), a process bumps its Clock to the timestamp in the message, generating detached promises. In Figure 4, MBump messages are depicted by horizontal dashed arrows. When G computes its proposal 10, it sends an MBump message containing 10 to process B. Upon reception, B bumps its Clock to 10, generating detached promises up to that value. The same happens at A and C. Once the detached promises by the majority {A, B, C} are known at A, the process again declares 10 stable. In this case, A receives the required detached promises two message delays earlier than when these promises are generated via MCommit. This strategy often reduces the number of message delays necessary to execute a multi-partition command. However, it is not always sufficient (e.g., imagine that H proposed 11 instead of 10), and thus, the promises issued in the MCommit handler (line 59) are still necessary for multi-partition commands.

Genuineness and parallelism. The above protocol is genuine: for every command 푐, only the processes in I푐 take steps to order and execute 푐 [16]. This is not the case for existing leaderless protocols for partial replication, such as Janus [32]. With a genuine protocol, partitioning the application state brings scalability in parallel workloads: an increase in the number of partitions (and thereby of available machines) leads to an increase in throughput. When partitions are colocated in the same machine, the message passing in Algorithm 3 can be optimized and replaced by shared-memory operations. Since Tempo runs an independent instance of the protocol for each partition replicated at the process, the resulting protocol is highly parallel.

5 Recovery Protocol

The initial coordinator of a command at some partition 푝 may fail or be slow to respond, in which case Tempo allows a process to take over its role and recover the command's timestamp. We now describe the protocol Tempo follows in this case, which is inspired by that of Atlas [13]. This protocol at a process 푖 ∈ I푝 is given in Algorithm 4. We use initial푝(id) to denote a function that extracts from the command identifier id its initial coordinator at partition 푝.

A process takes over as the coordinator for some command with identifier id by calling function recover(id) at line 72. Only a process with id ∈ pending can take over as a coordinator (line 73): this ensures that the process knows the command payload and fast quorums. In order to find out if a decision on the timestamp of id has been reached in consensus, the new coordinator first performs an analog of Paxos Phase 1. It picks a ballot number it owns higher than any it participated in so far (line 74) and sends an MRec message with this ballot to all processes.

As is standard in Paxos, a process accepts an MRec message only if the ballot in the message is greater than its bal[id] (line 77). If bal[id] is still 0 (line 78), the process checks the command's phase to decide if it should compute its timestamp proposal for the command. If phase[id] = payload (line 79), the process has not yet computed a timestamp proposal, and thus it does so at line 80. It also sets the command's phase to recover-r, which records that the timestamp proposal was computed in the MRec handler. Otherwise, if phase[id] = propose (line 82), the process has already computed a timestamp proposal at line 15. In this case, the process simply sets the command's phase to recover-p, which records that the timestamp proposal was computed in the MPropose handler. Finally, the process sets bal[id] to the new ballot and replies with an MRecAck message containing the timestamp (ts), the command's phase (phase) and the ballot at which the timestamp was previously accepted in consensus (abal). Note that abal[id] = 0 if the process has not yet accepted any consensus proposal. Also note that lines 79 and 82 are exhaustive: these are the only possible phases when id ∈ pending (line 77) and bal[id] = 0 (line 78), as recovery phases have non-zero ballots (line 84).
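The phase bookkeeping of the MRec handler (lines 76-85 of Algorithm 4) can be sketched in Rust as follows. This is our own illustration with hypothetical types, not the paper's code; the promise bookkeeping done by proposal is omitted here.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Phase {
    Payload,
    Propose,
    RecoverR, // proposal computed while handling MRec
    RecoverP, // proposal already computed in the MPropose handler
    Commit,
    Execute,
}

struct CmdState {
    phase: Phase,
    bal: u64,
    abal: u64, // 0 if no consensus proposal was ever accepted
    ts: u64,
    clock: u64,
}

impl CmdState {
    /// Handle MRec(id, b): record the recovery phase, adopt the ballot and
    /// return the (ts, phase, abal, b) payload of the MRecAck reply, or None
    /// if the precondition (id pending, bal[id] < b) does not hold.
    fn on_mrec(&mut self, b: u64) -> Option<(u64, Phase, u64, u64)> {
        let pending = matches!(
            self.phase,
            Phase::Payload | Phase::Propose | Phase::RecoverP | Phase::RecoverR
        );
        if !pending || self.bal >= b {
            return None;
        }
        if self.bal == 0 {
            match self.phase {
                Phase::Payload => {
                    // no proposal yet: compute one now, as proposal(id, 0) would
                    self.ts = self.clock + 1;
                    self.clock = self.ts;
                    self.phase = Phase::RecoverR;
                }
                Phase::Propose => self.phase = Phase::RecoverP,
                _ => {}
            }
        }
        self.bal = b;
        Some((self.ts, self.phase, self.abal, b))
    }
}
```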

Algorithm 4: Recovery protocol at process 푖 ∈ I푝.

 72  recover(id)
 73    pre: id ∈ pending
 74    푏 ← 푖 + 푟 · (⌊(bal[id] − 1)/푟⌋ + 1)
 75    send MRec(id, 푏) to I푝
 76  receive MRec(id, 푏) from 푗
 77    pre: id ∈ pending ∧ bal[id] < 푏
 78    if bal[id] = 0 then
 79      if phase[id] = payload then
 80        ts[id] ← proposal(id, 0)
 81        phase[id] ← recover-r
 82      else if phase[id] = propose then
 83        phase[id] ← recover-p
 84    bal[id] ← 푏
 85    send MRecAck(id, ts[id], phase[id], abal[id], 푏) to 푗
 86  receive MRecAck(id, 푡푗, ph푗, ab푗, 푏) from ∀푗 ∈ 푄
 87    pre: bal[id] = 푏 ∧ |푄| = 푟 − 푓
 88    if ∃푘 ∈ 푄 · ab푘 ≠ 0 then
 89      let 푘 be such that ab푘 is maximal
 90      send MConsensus(id, 푡푘, 푏) to I푝
 91    else
 92      퐼 ← 푄 ∩ quorums[id][푝]
 93      푠 ← initial푝(id) ∈ 퐼 ∨ ∃푘 ∈ 퐼 · ph푘 = recover-r
 94      푄′ ← if 푠 then 푄 else 퐼
 95      푡 ← max{푡푗 | 푗 ∈ 푄′}
 96      send MConsensus(id, 푡, 푏) to I푝

In the MRecAck handler (line 86), the new coordinator computes the command's timestamp given the information in the MRecAck messages and sends it in an MConsensus message to all processes. As in Flexible Paxos, the new coordinator waits for 푟 − 푓 such messages. This guarantees that, if a quorum of 푓 + 1 processes accepted an MConsensus message with a timestamp (which could have thus been sent in an MCommit message), the new coordinator will find out about this timestamp. To maintain Property 1, if any process previously accepted a consensus proposal (line 88), by the standard Paxos rules [20, 28], the coordinator selects the proposal accepted at the highest ballot (line 89).

If no consensus proposal has been accepted before, the new coordinator first computes at line 92 the set of processes 퐼 that belong both to the recovery quorum 푄 and the fast quorum quorums[id][푝]. Then, depending on whether the initial coordinator replied and in which handler the processes in 퐼 have computed their timestamp proposal, there are two possible cases that we describe next.

1) The initial coordinator replies or some process in 퐼 has computed its timestamp proposal in the MRec handler (푠 = true, line 93). In either of these two cases the initial coordinator could not have taken the fast path. If the initial coordinator replies (initial푝(id) ∈ 퐼), then it has not taken the fast path before receiving the MRec message from the new one, as it would have id ∈ commit ∪ execute and the MRec precondition requires id ∈ pending (line 77). It will also not take the fast path in the future, since when processing the MRec message it sets the command's phase to recover-p (line 83), which invalidates the MProposeAck precondition (line 18). On the other hand, even if the initial coordinator replies but some fast-quorum process in 퐼 has computed its timestamp proposal in the MRec handler, the fast path will not be taken either. This is because the command's phase at such a process is set to recover-r (line 81), which invalidates the MPropose precondition (line 13). Then, since the MProposeAck precondition requires a reply from all fast-quorum processes, the initial coordinator will not take the fast path. Thus, in either case, the initial coordinator never takes the fast path. For this reason, the new coordinator can choose the command's timestamp in any way, as long as it maintains Property 3. Since |푄| = 푟 − 푓 ≥ 푟 − ⌊(푟−1)/2⌋ ≥ ⌊푟/2⌋ + 1, the new coordinator has the output of proposal by a majority of processes, and thus it computes the command's timestamp with max (line 95), respecting Property 3.

2) The initial coordinator does not reply and all processes in 퐼 have computed their timestamp proposal in the MPropose handler (푠 = false, line 93). In this case the initial coordinator could have taken the fast path with some timestamp 푡 = max{푡푗 | 푗 ∈ quorums[id][푝]} and, if it did, the new coordinator must choose that same timestamp 푡. Given that the recovery quorum 푄 has size 푟 − 푓 and the fast quorum quorums[id][푝] has size ⌊푟/2⌋ + 푓, the set of processes 퐼 = 푄 ∩ quorums[id][푝] contains at least ⌊푟/2⌋ processes (distinct from the initial coordinator, as it did not reply). Furthermore, recall that the processes from 퐼 have the command's phase set to recover-p (line 83), which invalidates the MPropose precondition (line 13). Hence, if the initial coordinator took the fast path, then each process in 퐼 must have processed its MPropose before the MRec of the new coordinator, and reported in the latter the timestamp from the former. Then using Property 4, the new coordinator recovers 푡 by selecting the highest timestamp reported in 퐼 (line 95).
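The timestamp choice made by the new coordinator in the MRecAck handler (lines 86-96) can be sketched in Rust as follows. This is our own illustration with hypothetical types, not the paper's code.

```rust
#[derive(Clone, Copy, PartialEq)]
enum RecPhase {
    RecoverR, // proposal was computed in the MRec handler
    Other,    // e.g. recover-p: proposal came from the MPropose handler
}

struct RecAck {
    from: usize,
    ts: u64,
    phase: RecPhase,
    abal: u64, // 0 if no consensus proposal was ever accepted
}

fn choose_timestamp(acks: &[RecAck], fast_quorum: &[usize], initial_coord: usize) -> u64 {
    // If some process accepted a consensus proposal, pick the one accepted at
    // the highest ballot (standard Paxos rule).
    if let Some(a) = acks.iter().filter(|a| a.abal != 0).max_by_key(|a| a.abal) {
        return a.ts;
    }
    // Otherwise look at I = recovery quorum ∩ fast quorum.
    let i: Vec<&RecAck> = acks.iter().filter(|a| fast_quorum.contains(&a.from)).collect();
    let s = i.iter().any(|a| a.from == initial_coord || a.phase == RecPhase::RecoverR);
    // s = true: the fast path was impossible, any majority max is safe (use Q);
    // s = false: the fast path may have been taken, recover its timestamp from I.
    let candidates: Vec<&RecAck> = if s { acks.iter().collect() } else { i };
    candidates.iter().map(|a| a.ts).max().expect("quorum is non-empty")
}
```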
Additional liveness mechanisms. As is standard, to ensure the progress of recovery, Tempo nominates a single process to call recover using a partition-wide failure detector [6], and ensures that this process picks a high enough ballot. Tempo additionally includes a mechanism to ensure that, if a correct process receives an MPayload or an MCommit message, then all correct processes do; this is also necessary for recovery to make progress. For brevity, we defer a detailed description of these mechanisms to §B.

Correctness. We have rigorously proved that Tempo satisfies the PSMR specification (§2), even in case of failures. Due to space constraints, we defer the proof to §C.

6 Performance Evaluation

In this section we experimentally evaluate Tempo in deployments with full replication (i.e., each partition is replicated at all processes) and partial replication. We compare Tempo with Flexible Paxos (FPaxos) [20], EPaxos [31], Atlas [13], Caesar [1] and Janus [32]. FPaxos is a variant of Paxos that, like Tempo, allows selecting the allowed number of failures 푓 separately from the replication factor 푟: it uses quorums of size 푓 + 1 during normal operation and quorums of size 푟 − 푓 during recovery. EPaxos, Atlas and Caesar are leaderless protocols that track explicit dependencies (§3.3). EPaxos and Caesar use fast quorums of size ⌊3푟/4⌋ and ⌈3푟/4⌉, respectively. Atlas uses fast quorums of the same size as Tempo, i.e., ⌊푟/2⌋ + 푓. Atlas also improves the condition EPaxos uses for taking the fast path: e.g., when 푟 = 5 and 푓 = 1, Atlas always processes commands via the fast path, unlike EPaxos. To avoid clutter, we exclude the results for EPaxos from most of our plots since its performance is similar to (but never better than) Atlas 푓 = 1. Janus is a leaderless protocol that generalizes EPaxos to the setting of partial replication. It is based on an unoptimized version of EPaxos whose fast quorums contain all replicas in a given partition. Our implementation of Janus is instead based on Atlas, which yields quorums of the same size as Tempo and a more permissive fast-path condition. We call this improved version Janus*. This protocol is representative of the state-of-the-art for partial replication, and the authors of Janus have already compared it extensively to prior approaches (including MDCC [25], Tapir [45] and 2PC over Paxos [8]).
6.1 Implementation
To improve the fairness of our comparison, all protocols are implemented in the same framework, which consists of 33K lines of Rust and contains common functionality necessary to implement and evaluate the protocols. This includes a networking layer, an in-memory key-value store, dstat monitoring, and a set of benchmarks (e.g. YCSB [7]). The source code of the framework is available at github.com/vitorenesduarte/fantoch.
The framework provides three execution modes: cloud, cluster and simulator. In the cloud mode, the protocols run in the wide area on Amazon EC2. In the cluster mode, the protocols run in a local-area network, with delays injected between the machines to emulate wide-area latencies. Finally, the simulator runs on a single machine and computes the observed client latency in a given wide-area configuration when CPU and network bottlenecks are disregarded. Thus, the output of the simulator represents the best-case latency for a given scenario. Together with dstat measurements, the simulator allows us to determine if the latencies obtained in the cloud or cluster modes represent the best-case scenario for a given protocol or are the effect of some bottleneck.

6.2 Experimental Setup
Testbeds. As our first testbed we use Amazon EC2 with c5.2xlarge instances (machines with 8 virtual CPUs and 16GB of RAM). Experiments span up to 5 EC2 regions, which we call sites: Ireland (eu-west-1), Northern California (us-west-1), Singapore (ap-southeast-1), Canada (ca-central-1), and São Paulo (sa-east-1). The average ping latencies between these sites range from 72ms to 338ms; we defer precise numbers to §A. Our second testbed is a local cluster where we inject wide-area delays similar to those observed in EC2. The cluster contains machines with 6 physical cores and 32GB of RAM connected by a 10GBit network.
Benchmarks. We first evaluate full replication deployments (§6.3) using a microbenchmark where each command carries a key of 8 bytes and (unless specified otherwise) a payload of 100 bytes. Commands access the same partition when they carry the same key, in which case we say that they conflict. To measure performance under a conflict rate of 휌, a client chooses key 0 with probability 휌, and some unique key otherwise. We next evaluate partial replication deployments (§6.4) using YCSB+T [10], a transactional version of the YCSB benchmark [7]. Clients are closed-loop and always deployed in separate machines located in the same regions as servers. Machines are connected via 16 TCP sockets, each with a 16MB buffer. Sockets are flushed every 5ms or when the buffer is filled, whichever is earlier.
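As an illustration of the workload just described, the sketch below picks the key for the next command: key 0 with probability 휌 and an otherwise unique key. It is a hypothetical stand-in for the framework's workload generator and assumes the rand crate (0.8 API).

use rand::Rng;

/// Pick the key for the next command: key 0 with probability `conflict_rate`
/// (so two such commands conflict), and a client-unique key otherwise.
fn next_key<R: Rng>(rng: &mut R, conflict_rate: f64, client_id: u64, op: u64) -> String {
    if rng.gen_bool(conflict_rate) {
        "0".to_string()
    } else {
        // Unique per client and operation, so it conflicts with nobody.
        format!("{}-{}", client_id, op)
    }
}

fn main() {
    let mut rng = rand::thread_rng();
    // 2% conflict rate, as in the experiments of §6.3.
    let key = next_key(&mut rng, 0.02, 42, 7);
    println!("next key: {key}");
}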
Larger c5.2xlarge instances (machines with 8 virtual CPUs and 16GB components result in higher average latencies, as reported of RAM). Experiments span up to 5 EC2 regions, which we in Figure 5. Caesar provides the average latency of 195ms, call sites: Ireland (eu-west-1), Northern California (us-west- which is 17ms higher than Tempo 푓 = 2. Although Caesar 1), Singapore (ap-southeast-1), Canada (ca-central-1), and and Tempo 푓 = 2 have the same quorum size with 푟 = São Paulo (sa-east-1). The average ping latencies between 5, the blocking mechanism of Caesar delays commands in these sites range from 72ms to 338ms; we defer precise num- the critical path (§3.3), resulting in slightly higher average bers to (§A). Our second testbed is a local cluster where we latencies. As we now demonstrate, both Caesar and Atlas inject wide-area delays similar to those observed in EC2. The have much higher tail latencies than Tempo. 11 EuroSys ’21, April 26–28, 2021, Online, United Kingdom Vitor Enes, Carlos Baquero, Alexey Gotsman, and Pierre Sutra

Figure 5. Per-site latency with 5 sites and 512 clients per site under a low conflict rate (2%).
Figure 6. Latency percentiles with 5 sites and 256 (top) and 512 clients (bottom) per site under a low conflict rate (2%).

Tail latency. Figure 6 shows the latency distribution of various protocols from the 95th to the 99.99th percentiles. At the top we give results with 256 clients per site, and at the bottom with 512, i.e., the same load as in Figure 5.
The tail of the latency distribution in Atlas, EPaxos and Caesar is very long. It also sharply deteriorates when the load increases from 256 to 512 clients per site. For Atlas 푓 = 1, the 99th percentile increases from 385ms to 586ms, while the 99.9th percentile increases from 1.3s to 2.4s. The trend is similar for Atlas 푓 = 2, making the 99.9th percentile increase from 4.5s to 8s. The performance of EPaxos lies in between Atlas 푓 = 1 and Atlas 푓 = 2. This is because with 5 sites EPaxos has the same fast quorum size as Atlas 푓 = 1, but takes the slow path with a similar frequency to Atlas 푓 = 2. For Caesar, increasing the number of clients also increases the 99th percentile from 893ms to 991ms and the 99.9th percentile from 1.6s to 2.4s. Overall, the tail latency of Atlas, EPaxos and Caesar reaches several seconds, making them impractical in these settings. These high tail latencies are caused by ordering commands using explicit dependencies, which can arbitrarily delay command execution (§3.3).
In contrast, Tempo provides low tail latency and predictable performance in both scenarios. When 푓 = 1, the 99th, 99.9th and 99.99th percentiles are respectively 280ms, 361ms and 386ms (averaged over the two scenarios). When 푓 = 2, these values are 449ms, 552ms and 562ms. This represents an improvement of 1.4-8x over Atlas, EPaxos and Caesar with 256 clients per site, and an improvement of 4.3-14x with 512. The tail of the distribution is much shorter with Tempo due to its efficient execution mechanism, which uses timestamp stability instead of explicit dependencies.
We have also run the above scenarios in our wide-area simulator. In this case the latencies for Atlas, EPaxos and Caesar are up to 30% lower, since CPU time is not accounted for. The trend, however, is similar. This confirms that the latencies reported in Figure 6 accurately capture the effect of long dependency chains and are not due to a bottleneck in the execution mechanism of the protocols.
Increasing load and contention. We now evaluate the performance of the protocols when both the client load and contention increase. This experiment, reported in Figure 7, runs over 5 sites. It employs a growing number of clients per site (from 32 to 20K), where each client submits commands with a payload of 4KB. The top scenario of Figure 7 uses the same conflict rate as in the previous experiments (2%), while the bottom one uses a moderate conflict rate of 10%. The heatmap shows the hardware utilization (CPU, inbound and outbound network bandwidth) for the case when the conflict rate is 2%. For leaderless protocols, we measure the hardware utilization averaged across all sites, whereas for FPaxos, we only show this measure at the leader site. The experiment runs on a local cluster with emulated wide-area latencies, to have full control over the hardware.
As seen in Figure 7, the leader in FPaxos quickly becomes a bottleneck when the load increases, since it has to broadcast each command to all the processes. For this reason, FPaxos provides the maximum throughput of only 53K ops/s with 푓 = 1 and of 45K ops/s with 푓 = 2. The protocol saturates at around 4K clients per site, when the outgoing network bandwidth at the leader reaches 95% usage. The fact that the leader can be a bottleneck in leader-based protocols has been reported by several prior works [13, 23, 24, 31, 43].
FPaxos is not affected by contention and the protocol has identical behavior for the two conflict rates. On the contrary, Atlas performance degrades when contention increases. With a low conflict rate (2%), the protocol provides the maximum throughput of 129K ops/s with 푓 = 1 and of 127K ops/s with 푓 = 2. As observed in the heatmap (bottom of Figure 7), Atlas cannot fully leverage the available hardware. CPU usage reaches at most 59%, while network utilization reaches 41%. This low value is due to a bottleneck in the execution mechanism: its implementation, which follows the one by the authors of EPaxos, is single-threaded. Increasing the conflict rate to 10% further decreases hardware utilization: the maximum CPU usage decreases to 40% and network to 27% (omitted from Figure 7). This sharp decrease is due to the dependency chains, whose sizes increase with higher contention, thus requiring fewer clients to bottleneck execution. As a consequence, the throughput of Atlas decreases by 36% with 푓 = 1 (83K ops/s) and by 48% with 푓 = 2 (67K ops/s).

Figure 7. Throughput and latency with 5 sites as the load increases from 32 to 20480 clients per site under a low (2% – top) and moderate (10% – bottom) conflict rate. The heatmap shows the hardware utilization when the conflict rate is 2%.
Figure 8. Maximum throughput with batching disabled (OFF) and enabled (ON) for 256, 1024 and 4096 bytes.
Figure 9. Maximum throughput with 3 sites per shard under low (zipf = 0.5) and moderate contention (zipf = 0.7). Three workloads are considered for Janus*: 0% writes as the best-case scenario, 5% writes and 50% writes.

As before, EPaxos performance (omitted from Figure 7) lies between Atlas 푓 = 1 and 푓 = 2.
As we mentioned in §3.3, Caesar exhibits inefficiencies even in its commit protocol. For this reason, in Figure 7 we study the performance of Caesar in an ideal scenario where commands are executed as soon as they are committed. Caesar's performance is capped respectively at 104K ops/s with 2% conflicts and 32K ops/s with 10% conflicts. This performance decrease is due to Caesar's blocking mechanism (§3.3) and is in line with the results reported in [1].
Tempo delivers the maximum throughput of 230K ops/s. This value is independent of the conflict rate and fault-tolerance level (i.e., 푓 ∈ {1, 2}). Moreover, it is 4.3-5.1x better than FPaxos and 1.8-3.4x better than Atlas. Saturation occurs with 16K clients per site, when the CPU usage reaches 95%. At this point, network utilization is roughly equal to 80%. Latency in the protocol is almost unaffected until saturation.
Batching. We now compare the effects of batching in leader-based and leaderless protocols. Figure 8 depicts the maximum throughput of FPaxos and Tempo with batching disabled and enabled. In this experiment, a batch is created at a site after 5ms or once 10⁵ commands are buffered, whichever is earlier. Thus, each batch consists of several single-partition commands aggregated into one multi-partition command. We consider 3 payload sizes: 256B, 1KB and 4KB. The numbers for 4KB with batching disabled correspond to the ones in Figure 7. Because with 4KB and 1KB FPaxos bottlenecks in the network (Figure 7), enabling batching does not help. When the payload size is reduced further to 256B, the bottleneck shifts to the leader thread. In this case, enabling batching allows FPaxos to increase its performance by 4x. Since Tempo performs heavier computations than FPaxos, the use of batches in Tempo only brings a moderate improvement: 1.6x with 256B and 1.3x with 1KB. In the worst case, with 4KB, the protocol can even perform less efficiently. While batching can boost leader-based SMR protocols, the benefits are limited for leaderless ones. However, because leaderless protocols already efficiently balance resource usage across replicas, they can match or even outperform the performance of leader-based protocols, as seen in Figure 8.
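The flush policy described above (5ms or 10⁵ buffered commands, whichever comes first) can be sketched as a small time-or-size batcher. This is an illustrative stand-alone version, not the framework's implementation.

use std::time::{Duration, Instant};

/// Time-or-size batcher: a batch is flushed after `max_delay` or once
/// `max_size` commands are buffered, whichever happens first.
struct Batcher<Cmd> {
    buffer: Vec<Cmd>,
    max_size: usize,
    max_delay: Duration,
    opened_at: Option<Instant>,
}

impl<Cmd> Batcher<Cmd> {
    fn new(max_size: usize, max_delay: Duration) -> Self {
        Self { buffer: Vec::new(), max_size, max_delay, opened_at: None }
    }

    /// Buffer a command; return a full batch if the size limit was reached.
    fn add(&mut self, cmd: Cmd) -> Option<Vec<Cmd>> {
        if self.buffer.is_empty() {
            self.opened_at = Some(Instant::now());
        }
        self.buffer.push(cmd);
        if self.buffer.len() >= self.max_size { self.flush() } else { None }
    }

    /// Called periodically: flush if the oldest buffered command is too old.
    fn tick(&mut self) -> Option<Vec<Cmd>> {
        match self.opened_at {
            Some(t) if t.elapsed() >= self.max_delay => self.flush(),
            _ => None,
        }
    }

    fn flush(&mut self) -> Option<Vec<Cmd>> {
        if self.buffer.is_empty() {
            return None;
        }
        self.opened_at = None;
        Some(std::mem::take(&mut self.buffer))
    }
}

fn main() {
    // The experiment's settings: 5ms or 100_000 commands, whichever is earlier.
    let mut batcher = Batcher::new(100_000, Duration::from_millis(5));
    for op in 0..10u32 {
        if let Some(batch) = batcher.add(op) {
            println!("flushed {} commands (size limit)", batch.len());
        }
    }
    if let Some(batch) = batcher.tick().or_else(|| batcher.flush()) {
        println!("flushed {} commands", batch.len());
    }
}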

6.4 Partial Replication Deployment
We now compare Tempo with Janus* using the YCSB+T benchmark. We define a shard as a set of several partitions co-located in the same machine. Each partition contains a single YCSB key. Each shard holds 1M keys and is replicated at 3 sites (Ireland, N. California and Singapore) emulated in our cluster. Clients submit commands that access two keys picked at random following the YCSB access pattern (a zipfian distribution). In Figure 9 we show the maximum throughput for both Tempo and Janus* under low (zipf = 0.5) and moderate contention (zipf = 0.7). For Janus*, we consider 3 YCSB workloads that vary the percentage of write commands (denoted by w): read-only (w = 0%, YCSB workload C), read-heavy (w = 5%, YCSB workload B), and update-heavy (w = 50%, YCSB workload A). The read-only workload is a rare workload in SMR deployments. It represents the best-case scenario for Janus*, which we use as a baseline. Since Tempo does not distinguish between reads and writes (§3.3), we have a single workload for this protocol.
Janus* performance is greatly affected by the ratio of writes and by contention. More writes and higher contention translate into larger dependency sets, which bottleneck execution faster. This is aggravated by the fact that Janus* is non-genuine, and thus requires cross-shard messages to order commands. With zipf = 0.5, increasing w from 0% to 5% reduces throughput by 25-26%. Increasing w from 0% to 50% reduces throughput by 49-56%. When contention increases (zipf = 0.7), the above reductions in throughput are larger, reaching 36-60% and 87%-94%, respectively.
Tempo provides nearly the same throughput as the best-case scenario for Janus* (w = 0%). Moreover, its performance is virtually unaffected by the increased contention. This comes from the parallel and genuine execution brought by the use of timestamp stability (§4). Overall, Tempo provides 385K ops/s with 2 shards, 606K ops/s with 4 shards, and 784K ops/s with 6 shards (averaged over the two zipf values). Compared to Janus* w = 5% and Janus* w = 50%, this represents respectively a speedup of 1.2-2.5x and 2-16x.
The tail latency issues demonstrated in Figure 6 also carry over to partial replication. For example, with 6 shards, zipf = 0.7 and w = 5%, the 99.99th percentile for Janus* reaches 1.3s, while Tempo provides 421ms. We also ran the same set of workloads for the full replication case, and the speedup of Tempo with respect to EPaxos and Atlas is similar.

7 Related Work
Timestamping (aka sequencing) is widely used in distributed systems. In particular, many storage systems orchestrate data access using a fault-tolerant timestamping service [2, 3, 35, 41, 46], usually implemented by a leader-based SMR protocol [28, 34]. As reported in prior works, the leader is a potential bottleneck and is unfair with respect to client locations [13, 23, 24, 31, 43]. To sidestep these problems, leaderless protocols order commands in a fully decentralized manner. Early protocols in this category, such as Mencius [30], rotated the role of leader among processes. However, this made the system run at the speed of the slowest replica. More recent ones, such as EPaxos [31] and its follow-ups [5, 13, 43], order commands by agreeing on a graph of dependencies (§3.3). Tempo builds on one of these follow-ups, Atlas [13], which leverages the observation that correlated failures in geo-distributed systems are rare [8] to reduce the quorum size in leaderless SMR. As demonstrated by our evaluation (§6.3), dependency-based leaderless protocols exhibit high tail latency and suffer from bottlenecks due to their expensive execution mechanism.
Timestamping has been used in two previous leaderless SMR protocols. Caesar [1], which we discussed in §3.3 and §6, suffers from similar problems to EPaxos. Clock-RSM [11] timestamps each newly submitted command with the coordinator's clock, and then records the association at 푓 + 1 processes using consensus. Stability occurs when all the processes indicate that their clocks have passed the command's timestamp. As a consequence, the protocol cannot transparently mask failures, like Tempo; these have to be handled via reconfiguration. Its performance is also capped by the speed of the slowest replica, similarly to Mencius [30].
Partial replication is a common way of scaling services that do not fit on a single machine. Some partially replicated systems use a central node to manage access to data, made fault-tolerant via standard SMR techniques [15]. Spanner [8] replaces the central node by a distributed protocol that layers two-phase commit on top of Paxos. Granola [9] follows a similar schema using Viewstamped Replication [33]. Other approaches rely on atomic multicast, a primitive ensuring the consistent delivery of messages across arbitrary groups of processes [16, 37]. Atomic multicast can be seen as a special case of PSMR as defined in §2.
Janus [32] generalizes EPaxos to the setting of partial replication. Its authors show that for a large class of applications that require only one-shot transactions, Janus improves upon prior techniques, including MDCC [25], Tapir [45] and 2PC over Paxos [8]. Our experiments demonstrate that Tempo significantly outperforms Janus due to its use of timestamps instead of explicit dependencies. Unlike Janus, Tempo is also genuine, which translates into better performance.

8 Conclusion
We have presented Tempo – a new SMR protocol for geo-distributed systems. Tempo follows a leaderless approach, ordering commands in a fully decentralized manner and thus offering similar quality of service to all clients. In contrast to previous leaderless protocols, Tempo determines the order of command execution solely based on scalar timestamps, and cleanly separates timestamp assignment from detecting timestamp stability. Moreover, this mechanism easily extends to partial replication. As shown in our evaluation, Tempo's approach enables the protocol to offer low tail latency and high throughput even under contended workloads.
Acknowledgments. We thank our shepherd, Natacha Crooks, as well as Antonios Katsarakis, Ricardo Macedo, and Georges Younes for comments and suggestions. We also thank Balaji Arun, Roberto Palmieri and Sebastiano Peluso for discussions about Caesar. Vitor Enes was supported by an FCT PhD Fellowship (PD/BD/142927/2018). Pierre Sutra was supported by EU H2020 grant No 825184 and ANR grant 16-CE25-0013-04. Alexey Gotsman was supported by an ERC Starting Grant RACCOON. This work was partially supported by the AWS Cloud Credit for Research program.

References [19] JoAnne Holliday, Divyakant Agrawal, and Amr El Abbadi. 2002. Partial [1] Balaji Arun, Sebastiano Peluso, Roberto Palmieri, Giuliano Losa, and Database Replication using Epidemic Communication. In International Binoy Ravindran. 2017. Speeding up Consensus by Chasing Fast Deci- Conference on Systems (ICDCS). sions. In International Conference on Dependable Systems and Networks [20] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. 2016. Flexi- (DSN). ble Paxos: Quorum Intersection Revisited. In International Conference [2] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wob- on Principles of Distributed Systems (OPODIS). ber, Michael Wei, and John D. Davis. 2012. CORFU: A Shared Log [21] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Design for Flash Clusters. In Symposium on Networked Systems Design Parikshit Gopalan, Jin Li, and Sergey Yekhanin. 2012. Erasure Coding and Implementation (NSDI). in Windows Azure Storage. In USENIX Annual Technical Conference [3] Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan (USENIX ATC). Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and [22] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Aviad Zuck. 2013. Tango: Distributed Data Structures over a Shared Alex Rasin, Stanley B. Zdonik, Evan P. C. Jones, Samuel Madden, Log. In Symposium on Operating Systems Principles (SOSP). Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. [4] Carlos Eduardo Benevides Bezerra, Fernando Pedone, and Robbert van 2008. H-Store: A High-Performance, Distributed Main Memory Trans- Renesse. 2014. Scalable State-Machine Replication. In International action Processing System. Proc. VLDB Endow. (2008). Conference on Dependable Systems and Networks (DSN). [23] Antonios Katsarakis, Vasilis Gavrielatos, M. R. Siavash Katebzadeh, [5] Matthew Burke, Audrey Cheng, and Wyatt Lloyd. 2020. Gryff: Unifying Arpit Joshi, Aleksandar Dragojevic, Boris Grot, and Vijay Nagarajan. Consensus and Shared Registers. In Symposium on Networked Systems 2020. Hermes: A Fast, Fault-Tolerant and Linearizable Replication Design and Implementation (NSDI). Protocol. In Architectural Support for Programming Languages and [6] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. 1996. Operating Systems (ASPLOS). The Weakest Failure Detector for Solving Consensus. J. ACM (1996). [24] Marios Kogias and Edouard Bugnion. 2020. HovercRaft: Achieving [7] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Scalability and Fault-tolerance for microsecond-scale Datacenter Ser- and Russell Sears. 2010. Benchmarking Cloud Serving Systems with vices . In European Conference on Computer Systems (EuroSys). YCSB. In Symposium on Cloud Computing (SoCC). [25] Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, and [8] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christo- Alan D. Fekete. 2013. MDCC: Multi-Data Center Consistency. In pher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christo- European Conference on Computer Systems (EuroSys). pher Heiser, Peter Hochschild, Wilson C. Hsieh, Sebastian Kanthak, [26] Avinash Lakshman and Prashant Malik. 2010. Cassandra: A Decentral- Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David ized Structured Storage System. ACM SIGOPS Oper. Syst. Rev. (2010). Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Ya- [27] Leslie Lamport. 1978. 
Time, Clocks, and the Ordering of Events in a sushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Distributed System. Commun. ACM (1978). Dale Woodford. 2012. Spanner: Google’s Globally-Distributed Data- [28] Leslie Lamport. 1998. The Part-Time Parliament. ACM Trans. Comput. base. In Symposium on Operating Systems Design and Implementation Syst. (1998). (OSDI). [29] Haonan Lu, Christopher Hodsdon, Khiem Ngo, Shuai Mu, and Wyatt [9] James A. Cowling and . 2012. Granola: Low-Overhead Lloyd. 2016. The SNOW Theorem and Latency-Optimal Read-Only Distributed Transaction Coordination. In USENIX Annual Technical Transactions. In Symposium on Operating Systems Design and Imple- Conference (USENIX ATC). mentation (OSDI). [10] Akon Dey, Alan D. Fekete, Raghunath Nambiar, and Uwe Röhm. 2014. [30] Yanhua Mao, Flavio Paiva Junqueira, and Keith Marzullo. 2008. Men- YCSB+T: Benchmarking Web-scale Transactional Databases. In Inter- cius: Building Efficient Replicated State Machine for WANs. In Sympo- national Conference on Data Engineering Workshops (ICDEW). sium on Operating Systems Design and Implementation (OSDI). [11] Jiaqing Du, Daniele Sciascia, Sameh Elnikety, Willy Zwaenepoel, and [31] Iulian Moraru, David G. Andersen, and Michael Kaminsky. 2013. There Fernando Pedone. 2014. Clock-RSM: Low-Latency Inter-datacenter Is More Consensus in Egalitarian Parliaments. In Symposium on Oper- State Machine Replication Using Loosely Synchronized Physical ating Systems Principles (SOSP). Clocks. In International Conference on Dependable Systems and Net- [32] Shuai Mu, Lamont Nelson, Wyatt Lloyd, and Jinyang Li. 2016. Con- works (DSN). solidating Concurrency Control and Consensus for Commits under [12] , Nancy A. Lynch, and Larry J. Stockmeyer. 1988. Con- Conflicts. In Symposium on Operating Systems Design and Implementa- sensus in the Presence of Partial Synchrony. J. ACM (1988). tion (OSDI). [13] Vitor Enes, Carlos Baquero, Tuanir França Rezende, Alexey Gotsman, [33] Brian M. Oki and Barbara Liskov. 1988. Viewstamped Replication: Matthieu Perrin, and Pierre Sutra. 2020. State-Machine Replication for A General Primary Copy. In Symposium on Principles of Distributed Planet-Scale Systems. In European Conference on Computer Systems Computing (PODC). (EuroSys). [34] Diego Ongaro and John K. Ousterhout. 2014. In Search of an Under- [14] FaunaDB. [n.d.]. What is FaunaDB? https://docs.fauna.com/fauna/ standable Consensus Algorithm. In USENIX Annual Technical Confer- current/introduction.html ence (USENIX ATC). [15] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The [35] Daniel Peng and Frank Dabek. 2010. Large-scale Incremental Process- Google File System. In Symposium on Operating Systems Principles ing Using Distributed Transactions and Notifications. In Symposium (SOSP). on Operating Systems Design and Implementation (OSDI). [16] Rachid Guerraoui and André Schiper. 2001. Genuine Atomic Multicast [36] Tuanir França Rezende and Pierre Sutra. 2020. Leaderless State- in Asynchronous Distributed Systems. Theor. Comput. Sci. (2001). Machine Replication: Specification, Properties, Limits. In International [17] Raluca Halalai, Pierre Sutra, Etienne Riviere, and Pascal Felber. 2014. Symposium on Distributed Computing (DISC). ZooFence: Principled Service Partitioning and Application to the [37] Nicolas Schiper, Pierre Sutra, and Fernando Pedone. 2010. P-Store: ZooKeeper Coordination Service. In Symposium on Reliable Distributed Genuine Partial Replication in Wide Area Networks. 
In Symposium on Systems (SRDS). Reliable Distributed Systems (SRDS). [18] Maurice Herlihy and Jeannette M. Wing. 1990. Linearizability: A [38] Fred B. Schneider. 1990. Implementing Fault-Tolerant Services Using Correctness Condition for Concurrent Objects. ACM Trans. Program. the State Machine Approach: A Tutorial. ACM Comput. Surv. (1990). Lang. Syst. (1990). 15 EuroSys ’21, April 26–28, 2021, Online, United Kingdom Vitor Enes, Carlos Baquero, Alexey Gotsman, and Pierre Sutra

[39] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jor- dan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. 2020. CockroachDB: The Resilient Geo-Distributed SQL Database. In International Conference on Management of Data (SIGMOD). [40] Alexander Thomson and Daniel J. Abadi. 2010. The Case for Deter- minism in Database Systems. Proc. VLDB Endow. (2010). [41] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast Distributed Trans- actions for Partitioned Database Systems. In International Conference on Management of Data (SIGMOD). [42] Alexandru Turcu, Sebastiano Peluso, Roberto Palmieri, and Binoy Ravindran. 2014. Be General and Don’t Give Up Consistency in Geo- Replicated Transactional Systems. In International Conference on Prin- ciples of Distributed Systems (OPODIS). [43] Michael Whittaker, Neil Giridharan, Adriana Szekeres, Joseph M. Hellerstein, and Ion Stoica. 2020. Bipartisan Paxos: A Modular State Machine Replication Protocol. arXiv CoRR abs/2003.00331 (2020). https://arxiv.org/abs/2003.00331 [44] YugabyteDB. [n.d.]. Architecture > DocDB replication layer > Replication. https://docs.yugabyte.com/latest/architecture/docdb- replication/replication/ [45] Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishna- murthy, and Dan R. K. Ports. 2015. Building Consistent Transactions with Inconsistent Replication. In Symposium on Operating Systems Principles (SOSP). [46] Jianjun Zheng, Qian Lin, Jiatao Xu, Cheng Wei, Chuwei Zeng, Pingan Yang, and Yunfan Zhang. 2017. PaxosStore: High-availability Storage Made Practical in WeChat. Proc. VLDB Endow. (2017).


A Latency Data
Table 2 shows the average ping latencies between the 5 EC2 sites used in our evaluation (§6): Ireland (eu-west-1), Northern California (us-west-1), Singapore (ap-southeast-1), Canada (ca-central-1), and São Paulo (sa-east-1).

Table 2. Ping latency (milliseconds) between sites.

                 N. California   Singapore   Canada   S. Paulo
Ireland               141           186        72        183
N. California                       181        78        190
Singapore                                     221        338
Canada                                                   123

B The Full Tempo Protocol
Algorithms 5 and 6 specify the full multi-partition Tempo protocol, including the liveness mechanisms omitted from §5. The latter are described in §B.1. Table 3 summarizes the data maintained by each process, where O denotes the set of all partitions. In the following, line numbers refer to the pseudocode in Algorithms 5 and 6.
An additional optimization. Notice that if a process in the fast quorum is slow, Tempo needs to execute recovery. We can avoid this by having the processes receiving an MPayload message (line 9) also generate timestamp proposals and send them to the coordinator. If the fast-path quorum is slow to answer, these additional replies can be used to take the slow path, as long as a majority of processes replies (Property 3).

Table 3. Tempo variables at a process from partition 푝.

cmd[푖푑] ← ⊥ ∈ C                     Command
ts[푖푑] ← 0 ∈ N                      Timestamp
phase[푖푑] ← start                   Phase
quorums[푖푑] ← ∅ ∈ O ↩→ P(I푝)        Fast quorum used per partition
bal[푖푑] ← 0 ∈ N                     Current ballot
abal[푖푑] ← 0 ∈ N                    Last accepted ballot
Clock ← 0 ∈ N                       Current clock
Detached ← ∅ ∈ P(I푝 × N)            Detached promises
Attached[푖푑] ← ∅ ∈ P(I푝 × N)        Attached promises
Promises ← ∅ ∈ P(I푝 × N)            Known promises
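For readers following along in code, the state of Table 3 can be sketched as a Rust struct; field names and types are illustrative and need not match the framework's actual representation.

use std::collections::{HashMap, HashSet};

type ProcessId = u64;
type Dot = u64; // command identifier `id`
type PartitionId = u64;

#[derive(Clone, Copy, PartialEq, Eq)]
enum Phase { Start, Payload, Propose, RecoverR, RecoverP, Commit, Execute }

/// Per-process state of Table 3 (for a process of partition p).
struct TempoState<Cmd> {
    cmd: HashMap<Dot, Cmd>,                                          // cmd[id]
    ts: HashMap<Dot, u64>,                                           // ts[id], default 0
    phase: HashMap<Dot, Phase>,                                      // phase[id], default start
    quorums: HashMap<Dot, HashMap<PartitionId, HashSet<ProcessId>>>, // fast quorum per partition
    bal: HashMap<Dot, u64>,                                          // current ballot, default 0
    abal: HashMap<Dot, u64>,                                         // last accepted ballot, default 0
    clock: u64,                                                      // Clock
    detached: HashSet<(ProcessId, u64)>,                             // Detached promises
    attached: HashMap<Dot, HashSet<(ProcessId, u64)>>,               // Attached promises
    promises: HashSet<(ProcessId, u64)>,                             // Known promises
}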

B.1 Liveness Protocol
Tempo uses Ω, the leader election failure detector [6], which ensures that from some point on, all correct processes nominate the same correct process as the leader. Tempo runs an instance of Ω per partition 푝. In Algorithm 6, leader푝 denotes the current leader nominated for partition 푝 at process 푖. We say that leader푝 stabilizes when it stops changing at all correct processes in 푝.
Tempo also uses I^푖_푐, which we call the partition covering failure detector (§4). At each process 푖 and for every command 푐, I^푖_푐 returns a set of processes 퐽 such that, for every partition 푝 accessed by 푐, 퐽 contains one process close to 푖 that replicates 푝. Eventually, I^푖_푐 only returns correct processes. In the definition above, the closeness between replicas is measured in terms of latency. Returning close replicas is needed for performance but not necessary for the liveness of the protocol. Both I^푖_푐 and Ω are easily implementable under our assumption of eventual synchrony (§2).
For every command id with id ∈ pending (line 75), process 푖 is allowed to invoke recover(id) at line 78 only if it is the leader of partition 푝 according to leader푝. Furthermore, it only invokes recover(id) at line 78 if it has not yet participated in consensus (i.e., bal[id] = 0) or, if it did, the consensus was led by another process (i.e., bal_leader(bal[id]) ≠ 푖). In particular, process 푖 does not invoke recover(id) at line 78 if it is the leader of bal[id] (i.e., if bal_leader(bal[id]) = 푖). This ensures that process 푖 does not disrupt a recovery led by itself.
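The ballot arithmetic referred to above (Algorithm 5, lines 40 and 73-74) assigns ballots round-robin to processes, so that a leader can always pick a fresh ballot it owns. A minimal sketch, assuming 1-based process identifiers 1..r:

/// Ballot numbering used by Tempo's recovery: with r processes per partition,
/// process i owns ballots i, i + r, i + 2r, ... (illustrative sketch).
fn next_ballot(i: i64, r: i64, current: i64) -> i64 {
    // Line 40: b ← i + r(⌊(bal[id] − 1)/r⌋ + 1), the next ballot owned by i.
    i + r * ((current - 1).div_euclid(r) + 1)
}

fn bal_leader(b: i64, r: i64) -> i64 {
    // Lines 73-74: the owner of ballot b.
    b - r * (b - 1).div_euclid(r)
}

fn main() {
    let r = 5;
    for i in 1..=r {
        let b = next_ballot(i, r, 0); // first recovery ballot when bal[id] = 0
        assert_eq!(b, i);
        assert_eq!(bal_leader(b, r), i);
        assert_eq!(next_ballot(i, r, b), i + r); // joining a higher ballot later
        assert_eq!(bal_leader(i + r, r), i);
    }
}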


Algorithm 5: Tempo commit and recovery protocol at process 푖 ∈ I푝 .

1 submit(푐)
2   pre: 푖 ∈ I푐
3   id ← next_id(); Q ← fast_quorums(푖, I푐)
4   send MSubmit(id, 푐, Q) to I^푖_푐
5 receive MSubmit(id, 푐, Q)
6   푡 ← Clock + 1
7   send MPropose(id, 푐, Q, 푡) to Q[푝]
8   send MPayload(id, 푐, Q) to I푝 \ Q[푝]
9 receive MPayload(id, 푐, Q)
10   pre: id ∈ start
11   cmd[id] ← 푐; quorums[id] ← Q; phase[id] ← payload
12 receive MPropose(id, 푐, Q, 푡) from 푗
13   pre: id ∈ start
14   cmd[id] ← 푐; quorums[id] ← Q; phase[id] ← propose
15   ts[id] ← proposal(id, 푡)
16   send MProposeAck(id, ts[id]) to 푗
17   send MBump(id, ts[id]) to I^푖_푐
18 receive MBump(id, 푡)
19   pre: id ∈ propose
20   bump(푡)
21 receive MProposeAck(id, 푡푗) from ∀푗 ∈ 푄
22   pre: id ∈ propose ∧ 푄 = quorums[id][푝]
23   푡 ← max{푡푗 | 푗 ∈ 푄}
24   if count(푡) ≥ 푓 then send MCommit(id, 푡) to Icmd[id]
25   else send MConsensus(id, 푡, 푖) to I푝
26 receive MCommit(id, 푡푗) from 푗 ∈ I^푖_cmd[id]
27   pre: id ∈ pending
28   ts[id] ← max{푡푗 | 푗 ∈ 푃}; phase[id] ← commit
29   bump(ts[id])
30 receive MConsensus(id, 푡, 푏) from 푗
31   pre: bal[id] ≤ 푏
32   ts[id] ← 푡; bal[id] ← 푏; abal[id] ← 푏
33   bump(푡)
34   send MConsensusAck(id, 푏) to 푗
35 receive MConsensusAck(id, 푏) from 푄
36   pre: bal[id] = 푏 ∧ |푄| = 푓 + 1
37   send MCommit(id, ts[id]) to Icmd[id]
38 recover(id)
39   pre: id ∈ pending
40   푏 ← 푖 + 푟(⌊(bal[id] − 1)/푟⌋ + 1)
41   send MRec(id, 푏) to I푝
42 receive MRec(id, 푏) from 푗
43   pre: id ∈ pending ∧ bal[id] < 푏
44   if bal[id] = 0 then
45     if phase[id] = payload then
46       ts[id] ← proposal(id, 0)
47       phase[id] ← recover-r
48     else if phase[id] = propose then
49       phase[id] ← recover-p
50   bal[id] ← 푏
51   send MRecAck(id, ts[id], phase[id], abal[id], 푏) to 푗
52 receive MRecAck(id, 푡푗, ph푗, ab푗, 푏) from ∀푗 ∈ 푄
53   pre: bal[id] = 푏 ∧ |푄| = 푟 − 푓
54   if ∃푘 ∈ 푄 · ab푘 ≠ 0 then
55     let 푘 be such that ab푘 is maximal
56     send MConsensus(id, 푡푘, 푏) to I푝
57   else
58     퐼 ← 푄 ∩ quorums[id][푝]
59     푠 ← initial푝(id) ∈ 퐼 ∨ ∃푘 ∈ 퐼 · ph푘 = recover-r
60     푄′ ← if 푠 then 푄 else 퐼
61     푡 ← max{푡푗 | 푗 ∈ 푄′}
62     send MConsensus(id, 푡, 푏) to I푝
63 proposal(id, 푚)
64   푡 ← max(푚, Clock + 1)
65   Detached ← Detached ∪ {⟨푖, 푢⟩ | Clock + 1 ≤ 푢 ≤ 푡 − 1}
66   Attached[id] ← {⟨푖, 푡⟩}
67   Clock ← 푡
68   return 푡
69 bump(푡)
70   푡 ← max(푡, Clock)
71   Detached ← Detached ∪ {⟨푖, 푢⟩ | Clock + 1 ≤ 푢 ≤ 푡}
72   Clock ← 푡
73 bal_leader(푏)
74   return 푏 − 푟 ∗ ⌊(푏 − 1)/푟⌋
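The clock and promise bookkeeping of proposal and bump (lines 63-72) can be restated in Rust as follows; this is a standalone sketch with illustrative names, not the framework's code.

use std::collections::{HashMap, HashSet};

type Dot = u64; // command identifier

/// Minimal state touched by proposal() and bump() (Algorithm 5, lines 63-72).
struct ClockState {
    me: u64,                                     // this process, i
    clock: u64,                                  // Clock
    detached: HashSet<(u64, u64)>,               // Detached promises ⟨process, value⟩
    attached: HashMap<Dot, HashSet<(u64, u64)>>, // Attached promises per command
}

impl ClockState {
    /// Lines 63-68: return a timestamp at least m and above Clock, recording the
    /// skipped clock values as detached promises and ⟨i, t⟩ as attached to id.
    fn proposal(&mut self, id: Dot, m: u64) -> u64 {
        let t = m.max(self.clock + 1);
        for u in (self.clock + 1)..t {
            self.detached.insert((self.me, u));
        }
        self.attached.insert(id, [(self.me, t)].into_iter().collect());
        self.clock = t;
        t
    }

    /// Lines 69-72: move Clock up to t, turning every skipped value (including t)
    /// into a detached promise.
    fn bump(&mut self, t: u64) {
        let t = t.max(self.clock);
        for u in (self.clock + 1)..=t {
            self.detached.insert((self.me, u));
        }
        self.clock = t;
    }
}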


Algorithm 6: Tempo liveness and execution protocol at process 푖 ∈ I푝 .

75 periodically
76   for id ∈ pending
77     send MPayload(id, cmd[id], quorums[id]) to Icmd[id]
78     if leader푝 = 푖 ∧ (bal[id] = 0 ∨ bal_leader(bal[id]) ≠ 푖) then recover(id)
79 receive MConsensus(id, _, 푏) or MRec(id, 푏) from 푗
80   pre: bal[id] > 푏
81   send MRecNAck(id, bal[id]) to 푗
82 receive MRecNAck(id, 푏)
83   pre: leader푝 = 푖 ∧ bal[id] < 푏
84   bal[id] ← 푏
85   recover(id)
86 receive MCommitRequest(id) from 푗
87   pre: id ∈ commit ∪ execute
88   send MPayload(id, cmd[id], quorums[id]) to 푗
89   send MCommit(id, ts[id]) to 푗
90 periodically
91   send MPromises(Detached, Attached) to I푝
92 receive MPromises(퐷, 퐴) from 푗
93   퐶 ← ⋃{푎 | ⟨id, 푎⟩ ∈ 퐴 ∧ id ∈ commit ∪ execute}
94   Promises ← Promises ∪ 퐷 ∪ 퐶
95   for ⟨id, _⟩ ∈ 퐴 · id ∉ commit ∪ execute
96     send MCommitRequest(id) to Icmd[id]
97 periodically
98   ℎ ← sort{highest_contiguous_promise(푗) | 푗 ∈ I푝}
99   ids ← {id ∈ commit | ts[id] ≤ ℎ[⌊푟/2⌋]}
100   for id ∈ ids ordered by ⟨ts[id], id⟩
101     send MStable(id) to Icmd[id]
102     wait receive MStable(id) from ∀푗 ∈ I^푖_cmd[id]
103     execute푝(cmd[id]); phase[id] ← execute
104 highest_contiguous_promise(푗)
105   max{푐 ∈ N | ∀푢 ∈ {1 . . . 푐} · ⟨푗, 푢⟩ ∈ Promises}
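The stability check of lines 97-105 can be sketched as follows, assuming promises are kept in a set of ⟨process, value⟩ pairs; names such as stable_commands are illustrative, and the MStable exchange (lines 101-102) is omitted.

use std::collections::{HashMap, HashSet};

type ProcessId = u64;
type Dot = u64;

/// Lines 104-105: the highest c such that promises 1..=c from process j are all known.
fn highest_contiguous_promise(promises: &HashSet<(ProcessId, u64)>, j: ProcessId) -> u64 {
    let mut c = 0;
    while promises.contains(&(j, c + 1)) {
        c += 1;
    }
    c
}

/// Lines 98-100: committed commands whose timestamp is at most h[⌊r/2⌋]
/// (h sorted ascending) are stable; return them in ⟨timestamp, id⟩ order.
fn stable_commands(
    promises: &HashSet<(ProcessId, u64)>,
    partition: &[ProcessId],       // the r processes of partition p
    committed: &HashMap<Dot, u64>, // id -> ts[id], for id ∈ commit
) -> Vec<Dot> {
    let r = partition.len();
    let mut h: Vec<u64> = partition
        .iter()
        .map(|&j| highest_contiguous_promise(promises, j))
        .collect();
    h.sort_unstable();
    let threshold = h[r / 2]; // the threshold h[⌊r/2⌋] used at line 99
    let mut ids: Vec<(u64, Dot)> = committed
        .iter()
        .filter(|&(_, &ts)| ts <= threshold)
        .map(|(&id, &ts)| (ts, id))
        .collect();
    ids.sort_unstable();
    ids.into_iter().map(|(_, id)| id).collect()
}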

For a leader to make progress with some MRec(id,푏) message, it is required that 푛 − 푓 processes (line 53) have their bal[id] < 푏 (line 43). This may not always be the case because before the variable leader푝 stabilizes, any process can invoke recover(id) at line 78 if it thinks it is the leader. To help the leader select a high enough ballot, and thus ensure it will make progress, we introduce a new message type, MRecNAck. A process sends an MRecNAck(id, bal[id]) at line 81 when it receives an MConsensus(id, _,푏) or MRec(id,푏) (line 79) with some ballot number 푏 lower than its bal[id] (line 80). When process 푖 receives an MRecNAck(id,푏) with some ballot number 푏 higher than its bal[id], if it is still the leader (line 83), it joins ballot 푏 (line 84) and invokes recover(id) again. This results in process 푖 sending a new MRec with some ballot higher than 푏 lead by itself. As only leader푝 is allowed to invoke recover(id) at line 78 and line 83, and since leader푝 eventually stabilizes, this mechanism ensures that eventually the stable leader will start a high enough ballot in which enough processes will participate. Given a command 푐 with identifier id submitted by a correct process, a correct process in I푐 can only commit 푐 if it has id ∈ pending and receives an MCommit(id, _) from every partition accessed by 푐 (line 26). Next, we let 푝 be one such partition, and we explain how every correct process in I푐 eventually has id ∈ pending and receives such an MCommit(id, _) sent by some correct process in I푝 . We consider two distinct scenarios. In the first scenario, some correct process 푖 ∈ I푝 already has id ∈ commit ∪ execute. This scenario is addressed by adding two new lines to the MPromises handler: line 95 and line 96. Since id ∈ commit ∪ execute at 푖, by Property 3 there is a majority of processes in each partition 푞 accessed by 푐 that have called proposal(id, _), and have thus generated a promise attached to id. Moreover, given that at most 푓 processes can fail, at least one process from such a majority is correct. Let one of these processes be 푗 ∈ I푞. Due to line 91, process 푗 periodically sends an MPromises message that contains a promise attached to id. When a process 푘 ∈ I푞 receives such a message, if id is neither committed nor executed locally (line 95), 푘 sends an MCommitRequest(id) to I푐 . In particular, it sends such message to process 푖 ∈ I푝 . As id ∈ commit ∪ execute at process 푖, when 푖 receives the MCommitRequest(id) (line 86), it replies with an MPayload(id, _, _) and an MCommit(id, _). In this way, any process 푘 ∈ I푞 will eventually have id ∈ pending and receive an MCommit(id, _) sent by some correct process in I푝 , as required. In the second scenario, no correct process in I푝 has id ∈ commit ∪ execute but some correct process in I푐 has id ∈ pending. Due to line 77, such process sends an MPayload message to I푐 . Thus, every correct process in I푝 eventually has id ∈ pending. In particular, this allows leader푝 to invoke recover(id) and the processes in I푝 to react to the MRec(id, _) message by leader푝 . Hence, after picking a high enough ballot using the mechanism described earlier, leader푝 will eventually send an MCommit(id, _) to I푐 , as required. Finally, note that calling recover(id) for any command id such that id ∈ pending (line 75) can disrupt the fast path. This does not affect liveness but may degrade performance. Thus, to attain good performance, any implementation of Tempo should only start calling recover(id) after some reasonable timeout on id. 
In order to save bandwidth, the sending of MCommitRequest messages can also be delayed in the hope that such information will be received anyway.

C Correctness In this section we prove that the Tempo protocol satisfies the PSMR specification (§2). We omit the trivial proof of Validity, and prove Ordering in §C.1 and Liveness in §C.2.

C.1 Proof of Ordering Consider the auxiliary invariants below: 1. Assume MConsensus(_, _,푏) has been sent by process 푖. Then 푏 = 푖 or 푏 > 푟. 2. Assume MConsensus(id, 푡,푏) and MConsensus(id, 푡 ′,푏′) have been sent. If 푏 = 푏′, then 푡 = 푡 ′. 3. Assume MRecAck(_, _, _, 푎푏,푏) has been sent by some process. Then 푎푏 < 푏. 4. Assume MConsensusAck(id,푏) and MRecAck(id, _, _, 푎푏,푏′) have been sent by some process. If 푏′ > 푏, then 푏 ≤ 푎푏 < 푏′ and 푎푏 ≠ 0. 5. If id ∉ start at some process then the process knows cmd[id] and quorums[id]. 6. If a process executes the MConsensusAck(id, _) or MRecAck(id, _, _, _, _) handlers then id ∉ start. 7. Assume a slow quorum has received MConsensus(id, 푡,푏) and responded to it with MConsensusAck(id,푏). For any MConsensus(id, 푡 ′,푏′) sent, if 푏′ > 푏, then 푡 ′ = 푡. 8. Assume MCommit(id, 푡) has been sent at line 24. Then for any MConsensus(id, 푡 ′, _) sent, 푡 ′ = 푡. Invariants 1-6 easily follow from the structure of the protocol. Next we prove the rest of the invariants. Then we prove Property 1 and Property 3 (the latter is used in the proof of Theorem 1). Finally, we introduce and prove two lemmas that are then used to prove the Ordering property of the PSMR specification. Proof of Invariant 7. Assume that at some point (*) a slow quorum has received MConsensus(id, 푡,푏) and responded to it with MConsensusAck(id,푏). We prove by induction on 푏′ that, if a process 푖 sends MConsensus(id, 푡 ′,푏′) with 푏′ > 푏, then 푡 ′ = 푡. Given some 푏∗, assume this property holds for all 푏′ < 푏∗. We now show that it holds for 푏′ = 푏∗. We make a case split depending on the transition of process 푖 that sends the MConsensus message. First, assume that process 푖 sends MConsensus at line 25. In this case, 푏′ = 푖. Since 푏′ > 푏, we have 푏 < 푖. But this contradicts Invariant 1. Hence, this case is impossible. The remaining case is when process 푖 sends MConsensus during the transition at line 52. In this case, 푖 has received ′ MRecAck(id, 푡푗, _, ab푗,푏 ) 푅 푅 ′ from all processes 푗 in a recovery quorum 푄 . Let abmax = max{ab푗 | 푗 ∈ 푄 }; then by Invariant 3 we have abmax < 푏 . Since the recovery quorum 푄푅 has size 푟 − 푓 and the slow quorum from (*) has size 푓 + 1, we get that at least one process in 푄푅 must have received the MConsensus(id, 푡,푏) message and responded to it with MConsensusAck(id,푏). Let one of these ′ processes be 푙. Since 푏 > 푏, by Invariant 4 we have ab푙 ≠ 0, and thus process 푖 executes line 56. By Invariant 4 we also have 푏 ≤ ab푙 and thus 푏 ≤ abmax. 푅 Consider an arbitrary process 푘 ∈ 푄 , selected at line 55, such that ab푘 = abmax. We now prove that 푡푘 = 푡. If abmax > 푏, then ′ since abmax < 푏 , by induction hypothesis we have 푡푘 = 푡, as required. If abmax = 푏, then since abmax ≠ 0, process 푘 has received some MConsensus(id, _, abmax) message. By Invariant 2, process 푘 must have received the same MConsensus(id, 푡, abmax) received by process 푙. Upon receiving this message, process 푘 stores 푡 in ts and does not change this value at line 46: abmax ≠ 0 ′ and thus bal[푖푑] cannot be 0 at line 44. Thus process 푘 must have sent MRecAck(id, 푡푘, _, abmax,푏 ) with 푡푘 = 푡, which concludes the proof. □ Proof of Invariant 8. Assume MCommit(id, 푡) has been sent at line 24. Then the process that sent this MCommit message 퐹 퐹 must be process initial푝 (id). Moreover, we have that for some fast quorum mapping 푄 such that initial푝 (id) ∈ 푄 [푝]: 퐹 퐹 (*) every process 푗 ∈ 푄 [푝] has received MPropose(id, 푐, 푄 ,푚) and responded with MProposeAck(id, 푡푗 ) such 퐹 that 푡 = max{푡푗 | 푗 ∈ 푄 [푝]}. 
We prove by induction on 푏 that, if a process 푖 sends MConsensus(id, 푡 ′,푏), then 푡 ′ = 푡. Given some 푏∗, assume this property holds for all 푏 < 푏∗. We now show that it holds for 푏 = 푏∗. First note that process 푖 cannot send MConsensus at line 25, since in this case we would have 푖 = initial푝 (id), and initial푝 (id) took the fast path at line 24. Hence, process 푖 must have sent MConsensus during the transition at line 52. In this case, 푖 has received , 푡 , , ,푏 MRecAck(id 푗 ph푗 ab푗 ) from all processes 푗 in a recovery quorum 푄푅. 20 Efficient Replication via Timestamp Stability EuroSys ’21, April 26–28, 2021, Online, United Kingdom

If MConsensus is sent at line 56, then we have ab푘 > 0 for the process 푘 ∈ 푄^푅 selected at line 55. In this case, before sending MRecAck, process 푘 must have received

MConsensus(id, 푡푘, ab푘 ) ′ with ab푘 < 푏. Then by induction hypothesis we have 푡 = 푡푘 = 푡. This establishes the required. 푅 If MConsensus is not sent in line 56, then we have ab푘 = 0 for all processes 푘 ∈ 푄 . In this case, process 푖 sends MConsensus 푄푅 푟 푓 푄퐹 푝 푟 푓 in line 62. Since the recovery quorum has size − and the fast quorum [ ] from (*) has size ⌊ 2 ⌋ + , we have that 푟 푄푅 푄퐹 푝 , 푐, 푄퐹 ,푚 (**) at least ⌊ 2 ⌋ processes in are part of [ ] and thus must have received MPropose(id ) and responded to it with MProposeAck. 푅 퐹 Let 퐼 be the set of processes 푄 ∩ 푄 [푝] (line 58). By our assumption, process initial푝 (id) sent an MCommit(id, 푡) at line 24, and thus id ∈ commit ∪execute at this process. Then due to the check id ∈ pending at line 43, this process did not reply to MRec. Hence, initial푝 (id) is not part of the recovery quorum, i.e., initial푝 (id) ∉ 퐼 at line 59. Moreover, since the initial coordinator takes the fast path at line 24, all fast quorum processes have set phase[푖푑] to propose when processing the MPropose from the coordinator (line 14). Due to this and to the check at line 48, their phase[id] value is set to recover-p in line 49, and thus, 푘 퐼 we have that all fast quorum processes that replied report recover-p in their MRecAck message, i.e., ∀ ∈ · ph푘 = recover-p at line 59. It follows that the condition at line 59 does not hold and thus the quorum selected is 퐼 (line 60). By Property 4, 퐹 the fast path proposal 푡 = max{푡푗 | 푗 ∈ 푄 [푝]} can be obtained by selecting the highest proposal sent in MPropose by any 푟 푘 퐼 ⌊ 2 ⌋ fast-quorum processes (excluding the initial coordinator). By (**), and since all processes ∈ have ab푘 = 0, then all processes in 퐼 replied with the timestamp proposal that was sent to the initial coordinator. Thus, by Property 4 we have 퐹 ′ 푡 = max{푡푗 | 푗 ∈ 푄 [푝]} = max{푡푗 | 푗 ∈ 퐼 } = 푡 , which concludes the proof. □ Proof of Property 1. Assume that MCommit(id, 푡) and MCommit(id, 푡 ′) have been sent. We prove that 푡 = 푡 ′. Note that, if an MCommit(id, 푡) was sent at line 89, then some process sent an MCommit(id, 푡) at line 24 or line 37. Hence, without loss of generality, we can assume that the two MCommit under consideration were sent at line 24 or at line 37. We can also assume that the two MCommit have been sent by different processes. Only one process can send an MCommit at line 24 and only once. Hence, it is sufficient to only consider the following two cases. Assume first that both MCommit messages are sent at line 37. Then for some 푏, a slow quorum has received MConsensus(id, 푡,푏) and responded to it with MConsensusAck(id,푏). Likewise, for some 푏′, a slow quorum has received MConsensus(id, 푡 ′,푏′) and responded to it with MConsensusAck(id,푏′). Assume without loss of generality that 푏 ≤ 푏′. If 푏 < 푏′, then 푡 ′ = 푡 by Invariant 7. If 푏 = 푏′, then 푡 ′ = 푡 by Invariant 2. Hence, in this case 푡 ′ = 푡, as required. Assume now that MCommit(id, 푡) was sent at line 24 and MCommit(id, 푡 ′) at line 37. Then for some 푏, a slow quorum has received MConsensus(id, 푡 ′,푏) and responded to it with MConsensusAck(id,푏). Then by Invariant 8, we must have 푡 ′ = 푡, as required. □ , 푡 푄 푄 푟 Proof of Property 3. Assume that MCommit(id ) has been sent. We prove that there exists a quorum b with | b| ≥ ⌊ 2 ⌋ + 1 푡 푡 푡 푗 푄 푗 푄 푡 and bsuch that = max{b푗 | ∈ b}, where each process ∈ b computes its b푗 in either line 15 or line 46 using function , 푡 , 푡 , , , proposal and sends it in either MProposeAck(id b푗 ) or MRecAck(id b푗 _ 0 _). 
The computation of 푡 occurs either in the transition at line 23 or at line 61. If the computation of 푡 occurs at line 23, then 푄 푟 푓 푄 푄 푗 푄 푡 푡 the quorum defined at line 22 is a fast quorum with size ⌊ 2 ⌋ + . In this case, we let b = and ∀ ∈ · b푗 = 푗 , where 푡 푄 푟 푓 푓 푄 푟 푡 푗 is defined at line 21. Since | | = ⌊ 2 ⌋ + and ≥ 1, we have | b| ≥ ⌊ 2 ⌋ + 1, as required. If the computation of occurs at line 61, we have two situations depending on the condition at line 59. Let 퐼 be the set computed at line 58, i.e., the intersection between the recovery quorum 푄 (defined at line 52) and the fast quorum quorums[id][푝] (quorums[id][푝] is known by Invariant 5 and Invariant 6). First, consider the case in which the condition at line 59 holds. In this case, we let 푄b = 푄 and 푗 푄 푡 푡 푡 푄 푟 푓 푓 푟−1 푄 푟 ∀ ∈ · b푗 = 푗 , where 푗 is defined at line 52. Since | | = − and ≤ ⌊ 2 ⌋, we have | b| ≥ ⌊ 2 ⌋ + 1, as required. Now 푟 푓 consider the case in which the condition at line 59 does not hold. Given that the fast quorum size is ⌊ 2 ⌋ + and the size of the 푄 푟 푓 퐼 푟 푓 푓 푟 푡 recovery quorum is − , we have that contains at least (⌊ 2 ⌋ + ) − = ⌊ 2 ⌋ fast quorum processes. Note that, since is computed at line 61, we have that ∀푘 ∈ 푄 · ab푘 = 0 (line 54). For this reason, each process in 푄 had bal[id] = 0 (line 44) when , 퐼 푘 퐼 it received the first MRec(id _) message. Moreover, as each process in reported recover-p (i.e., ∀ ∈ · ph푘 = recover-p at line 59), the phase[id] was propose (line 48) when the process received the first MRec(id, _) message. It follows that each of these processes computed their timestamp proposal in the MPropose handler at line 15 (not in the MRec handler at line 46). Thus, these processes have proposed a timestamp at least as high as the one from the initial coordinator. In this case, we let 푄 = 퐼 ∪ { ( )} ∀푗 ∈ 퐼 · 푡 = 푡 푡 푡 ( ) b initial푝 id , b푗 푗 where 푗 is defined at line 52 and binitial푝 (id) be the timestamp sent by initial푝 id in its , 퐼 푟 퐼 푄 푟 MProposeAck(id _) message. Since | | ≥ ⌊ 2 ⌋ and initial푝 (id) ∉ , we have | b| ≥ ⌊ 2 ⌋ + 1, as required. □ 21 EuroSys ’21, April 26–28, 2021, Online, United Kingdom Vitor Enes, Carlos Baquero, Alexey Gotsman, and Pierre Sutra

We now prove that Tempo ensures Ordering. Consider two commands 푐 and 푐 ′ submitted during a run of Tempo with identifiers id and id′. By Property 1, all the processes agree on the final timestamp of a command. Let 푡 and 푡 ′ be the final timestamp of 푐 and 푐 ′, respectively. ′ ′ ′ Lemma 1. If 푐 ↦→푖 푐 then ⟨푡, 푖푑⟩ < ⟨푡 , 푖푑 ⟩. ′ ′ ′ Proof. By contradiction, assume that 푐 ↦→푖 푐 , but ⟨푡 , 푖푑 ⟩ < ⟨푡, 푖푑⟩. Consider the point in time when 푖 executes 푐 (line 103). ′ ′ By Validity, this point in time is unique. Since 푐 ↦→푖 푐 , process 푖 cannot have executed 푐 before this time. Process 푖 may only 푐 푡 ℎ 푟 푡 푖 푡 ′, 푖푑 ′ 푡, 푖푑 execute once it is in ids (line 100). Hence, ≤ [⌊ 2 ⌋] (line 99). From Theorem 1, is stable at . As ⟨ ⟩ < ⟨ ⟩, we get 푡 ′ 푡 ′ 푖 푡 ′ 푡 ℎ 푟 푐 ′ 푐 ≤ , and by Property 2, id ∈ commit ∪ execute at . Then, since ≤ ≤ [⌊ 2 ⌋] and cannot have been executed before at process 푖, we have id′ ∈ ids. But then as ⟨푡 ′, 푖푑 ′⟩ < ⟨푡, 푖푑⟩, 푐 ′ is executed before 푐 at process 푖 (line 100). This contradicts ′ 푐 ↦→푖 푐 . □ Lemma 2. If 푐 ↦→ 푐 ′ then whenever a process 푖 executes 푐 ′, some process has already executed 푐. ′ ′ ′ Proof. Assume that a process 푖 executed 푐 . By the definition of ↦→, either 푐 { 푐 , or 푐 ↦→푗 푐 at some process 푗. Assume first that 푐 { 푐 ′. By definition of {, 푐 returns before 푐 ′ is submitted. This requires that command 푐 is executed at least at one replica for each of the partition it accesses. Hence, at the time 푐 ′ is executed at 푖, command 푐 has already executed elsewhere, as required. ′ ′ Assume now that 푐 ↦→푗 푐 for some process 푗. Then 푐 and 푐 access a common partition, say 푝, and by Lemma 1, ⟨푡, 푖푑⟩ < ⟨푡 ′, 푖푑 ′⟩. Consider the point in time 휏 when 푖 executes 푐 ′. Before this, according to line 102, process 푖 receives an MStable(id′) ′ message from some process 푘 ∈ I푝 . Let 휏 < 휏 be the moment when 푘 sends this message by executing line 101. According to ′ 휏 ′ 푡 ′ ℎ 푟 푡 ′ 푘 푡, 푖푑 푡 ′, 푖푑 ′ line 100, id must belong to ids at time . Hence, ≤ [⌊ 2 ⌋] (line 99). From Theorem 1, is stable at . As ⟨ ⟩ < ⟨ ⟩, 푡 푡 ′ 휏 ′ 푘 푡 푡 ′ ℎ 푟 푘 we have ≤ , and by Property 2, id ∈ commit ∪ execute holds at time at process . Since ≤ ≤ [⌊ 2 ⌋], either already executed 푐, or id ∈ ids. In the latter case, because ⟨푡, 푖푑⟩ < ⟨푡 ′, 푖푑 ′⟩, command 푐 is executed before 푘 sends the MStable message for id′. Hence, in both cases, 푐 is executed no later than 휏 ′ < 휏, as required. □ Proof of Ordering. By Validity, cycles of size one are prohibited. By Lemma 2, so are cycles of size two or greater. □

C.2 Proof of Liveness We now prove the Liveness property of the PSMR specification. For simplicity, we assume that links are reliable, i.e.,ifa message is sent between two correct processes then it is eventually delivered. In the following, we use dom(푚) to denote the domain of mapping 푚 and img(푚) to denote its image. We let id be the identifier of some command 푐, so that I푐 denotes the set of processes replicating the partitions accessed by 푐. Lemma 3. Assume that id ∈ commit ∪ execute at some process from a partition 푝. Then id ∈ dom(Attached) and id ∉ start at some correct process from each partition 푞 accessed by command 푐. Proof. Since id ∈ commit ∪ execute at a correct process from 푝, an MCommit(id, _) has been sent by some process from each partition accessed by 푐 (line 26). In particular, it has been sent by some process from partition 푞. By Property 3, there is a majority of processes 푄 from partition 푞 that called proposal(id, _), generating a promise attached to id (line 66), and thus, have id ∈ dom(Attached). Since at most 푓 of these processes can fail, at least some process 푗 ∈ 푄 is correct. Moreover, since 푗 has generated a promise attached to id, it is impossible to have id ∈ start at 푗 (see the MPropose and MRec handlers where proposal(id) is called). Thus id ∈ dom(Attached) and id ∉ start at 푗, as required. □ Lemma 4. Assume that id ∉ start at some correct process 푖 from partition 푝. Then eventually every correct process 푗 from some partition 푞 accessed by command 푐 has id ∉ start and receives an MCommit(id, _) sent by some process from partition 푝. Proof. We consider the case where id ∉ commit ∪ execute never holds at 푗 (if it does hold, then 푗 has id ∉ start and received an MCommit(id, _) sent by some process from partition 푝, as required). First, assume that id ∈ commit ∪ execute eventually holds at process 푖. By Lemma 3, there exists a correct process 푘 from 푞 that has id ∈ dom(Attached). Due to line 91, 푘 continuously sends an MPromises message to all the processes from 푞, including 푗. Note that, since id ∈ dom(Attached) at 푘, this MPromises message contains a promise attached to id. Once 푗 receives such a message, since it has id ∉ commit ∪ execute (line 95), it sends an MCommitRequest(id) to I푐 (line 96), and in particular to process 푖. Since id ∈ commit ∪ execute at 푖 (line 87), process 푖 replies with an MPayload(id, _, _) and MCommit(id, _). Once 푗 processes such messages, it has id ∉ start and received an MCommit(id, _) by process 푖, a correct process from partition 푝, as required. Now, assume that id ∈ pending holds forever at process 푖. Consider the moment 휏0 when the variable leader푝 stabilizes. Let process 푙 be leader푝 . Assume that id ∉ commit ∪ execute at processes 푖 and 푙 forever (the case where it changes to 22 Efficient Replication via Timestamp Stability EuroSys ’21, April 26–28, 2021, Online, United Kingdom id ∈ commit ∪ execute at process 푖 is covered above; the case for process 푙 can be shown analogously). Due to line 76, process 푖 sends an MPayload(id, _, _) message to I푐 (line 77), in particular to 푙 and 푗. Once 푙 processes this message, it has id ∈ pending forever (since we have assumed that id ∉ commit ∪ execute at 푙 forever). Once 푗 processes this message, it has id ∉ start, as required. We now prove that eventually process 푙 sends an MCommit(id, _) to 푗. First, we show by contradiction that the number of MRec(id, _) messages sent by 푙 is finite. Assume the converse. 
After 휏0, due to the check leader푝 = 푙 at line 78 and at line 83, only process 푙 sends MRec(id, _) messages. Since MRec(id, _) messages by processes other than 푙 are all sent before 휏0, their number is finite. For the same reason, the number of MConsensus(id, _, _) messages sent by processes other than 푙 is also finite. Thus, each correct process joins only a finite number of ballots that are not owned by 푙. It follows that the number of MRec(id, _) messages sent by 푙 at line 78 because it joined a ballot owned by other processes (i.e., when bal_leader(bal[id]) ≠ 푙) is finite. (Note that a single MRec(id, _) can be sent here due to bal[id] = 0, as process 푙 sets bal[id] to some non-zero ballot when processing its first MRec(id, _) message.) For an MRec(id, _) to be sent at line 85, process 푙 has to receive an MRecNAck(id, 푏) with bal[id] < 푏 (line 83). If bal_leader(푏) ≠ 푙, the number of such MRecNAck messages is finite, as the number of MRec(id, _) and MConsensus(id, _, _) messages sent by processes other than 푙 is finite. If bal_leader(푏) = 푙, the MRecNAck(id, 푏) must be in response to an MRec(id, 푏) or MConsensus(id, _, 푏) sent by 푙. Note that when 푙 sends such a message, it sets bal[id] to 푏. For this reason, process 푙 cannot have bal[id] < 푏, and hence this case is impossible. Thus, there is a point in time 휏1 ≥ 휏0 after which the condition at line 83 does not hold at process 푙, and consequently, process 푙 stops sending new MRec(id, _) messages at line 85, which yields a contradiction.

We have established above that id ∈ pending at process 푙 forever. We now show that process 푙 sends at least one MRec(id, _) message. Since id ∈ pending forever, process 푙 executes line 78 for this id infinitely many times. If 푙 does not send at least one MRec(id, _) at this line, it is because bal[id] > 0 and bal_leader(bal[id]) = 푙 forever. If so, we have two cases to consider, depending on the value of bal[id]. In the first case, bal[id] = 푙 at 푙 forever. In this case, process 푙 took the slow path by sending to I푝 an MConsensus message with ballot 푙 (line 25). We now have two sub-cases. If any process sends an MRecNAck to process 푙 (line 81), this will make process 푙 send an MRec(id, _) (line 85), as required. Otherwise, process 푙 will eventually gather 푓 + 1 MConsensusAck messages and commit the command. As we have established that id ∈ pending at process 푙 forever, this sub-case is impossible. In the second case, we eventually have bal[id] > 푙. Since from some point on bal_leader(bal[id]) = 푙 at 푙, this means that this process sends an MRec(id, bal[id]) (line 50), as required.

We have now established that process 푙 sends a finite non-zero number of MRec(id, _) messages. Let 푏 be the highest ballot for which process 푙 sends an MRec(id, 푏) message. We now prove that 푙 eventually sends an MCommit(id, _) to I푐, in particular to 푗. Given that at most 푓 processes can fail, there are enough correct processes to eventually satisfy the preconditions of the MRecAck and MConsensusAck handlers. First we consider the case where these preconditions eventually hold at process 푙. Since the precondition of MRecAck eventually holds, process 푙 eventually sends an MConsensus(id, _, 푏). Since the precondition of MConsensusAck eventually holds, process 푙 eventually sends an MCommit(id, _) to 푗, as required. Consider now the opposite case where the preconditions of MRecAck or MConsensusAck never hold at process 푙.
Since there are enough correct processes to eventually satisfy these preconditions, the fact that they never hold at process 푙 implies that there is some correct process 푗 with bal[id] > 푏 (otherwise 푗 would eventually reply to process 푙). Thus, the precondition of MRecNAck holds at 푗 (line 80), which causes 푗 to send an MRecNAck(id, 푏′) with 푏′ = bal[id] > 푏 to 푙. When 푙 receives such a message, it sends a new MRec(id, 푏′′) with some ballot 푏′′ > 푏′ (line 85). It follows that 푏′′ > 푏′ > 푏, which contradicts the fact that 푏 is the highest ballot for which 푙 sends an MRec message, and hence this case is impossible. □

Lemma 5. Assume that id ∉ start at some correct process 푖 from some partition 푝 accessed by command 푐. Then eventually id ∈ commit ∪ execute at every correct process 푗 ∈ I푐.

Proof. For 푐 to be committed at process 푗, 푗 has to have id ∉ start and to receive an MCommit(id, _) from each of the partitions accessed by 푐 (line 26). We prove that process 푗 eventually receives such a message from each of these partitions. To this end, fix one such partition 푞. By Lemma 4, it is enough to prove that some correct process from partition 푞 eventually has id ∉ start. By contradiction, assume that all the correct processes from partition 푞 have id ∈ start forever. We have two scenarios to consider.

First, consider the scenario where 푝 = 푞. Then the assumption immediately contradicts the fact that id ∉ start at the correct process 푖.

Now, consider the scenario where 푝 ≠ 푞. In this scenario we consider two sub-cases. In the first case, eventually id ∈ commit ∪ execute at process 푖. By Lemma 3, there is a correct process from each partition accessed by 푐, in particular from partition 푞, that has id ∉ start, which contradicts our assumption. In the second case, id ∈ pending at process 푖 forever (the case where it changes to id ∈ commit ∪ execute is covered above). Due to line 77, process 푖 periodically sends an MPayload message to the processes in partition 푞. Once this message is processed, the correct processes in partition 푞 will have id ∈ payload, and hence id ∉ start. Since partition 푞 contains at least one correct process, this also contradicts our assumption. □
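The commit condition that the proof of Lemma 5 appeals to (line 26) can be illustrated with the following hedged Python sketch (hypothetical names, not the paper's pseudocode): an identifier moves to the commit phase only once an MCommit has been received from every partition the command accesses.

```python
# Sketch of the commit condition from the proof of Lemma 5 (hypothetical names).

def can_commit(cmd_id, accessed_partitions, mcommit_senders):
    """accessed_partitions: dict cmd_id -> set of partitions the command accesses.
       mcommit_senders: dict cmd_id -> set of partitions from which an
       MCommit(cmd_id, _) message has already been received."""
    return accessed_partitions[cmd_id] <= mcommit_senders.get(cmd_id, set())

# The command below still waits for an MCommit from partition "q".
print(can_commit("id1", {"id1": {"p", "q"}}, {"id1": {"p"}}))  # False
```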

Definition 1. We define the set of promises issued by some process 푖 ∈ I푝 as LocalPromises푖 = img(Attached) ∪ Detached.


Lemma 6. For each process 푖 ∈ I푝 we have that ⟨푖, 푡⟩ ∈ LocalPromises푖 ⇒ (∀푢 ∈ {1, . . . , 푡}·⟨푖,푢⟩ ∈ LocalPromises푖).

Proof. Follows trivially from Algorithm 5. □
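Lemma 6 states that the promises a process issues for its own clock form a gap-free prefix {1, . . . , 푡}. The check below is a minimal Python sketch of this invariant, with hypothetical names standing in for LocalPromises of Definition 1.

```python
# Sketch of the downward-closure property of Lemma 6 (hypothetical names).

def is_downward_closed(local_promises, proc):
    """local_promises: set of (process, clock) pairs issued by `proc`
       (img(Attached) U Detached in Definition 1)."""
    clocks = {u for (i, u) in local_promises if i == proc}
    highest = max(clocks, default=0)
    return clocks == set(range(1, highest + 1))

print(is_downward_closed({("i", 1), ("i", 2), ("i", 3)}, "i"))  # True
print(is_downward_closed({("i", 1), ("i", 3)}, "i"))            # False: promise 2 missing
```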

Lemma 7. Consider a command 푐 with an identifier id and the final timestamp 푡, and assume that id ∉ start at some correct process in I푐 . Then at every correct process in I푐 , eventually variable Promises contains all the promises up to 푡 by some set of processes 퐶 with |퐶| ≥ ⌊푟/2⌋ + 1.

Proof. By Lemma 5, eventually id ∈ commit ∪ execute at all the correct processes in I푐. Consider a point in time 휏0 when this happens and fix a process 푖 ∈ I푐 from a partition 푝 that has id ∈ commit. Let 퐶 be the set of correct processes from partition 푝, let MPromises(퐷⁰ⱼ, 퐴⁰ⱼ) be the MPromises message sent by each process 푗 ∈ 퐶 in the next invocation of line 91 after 휏0, and let ids = ⋃{dom(퐴⁰ⱼ) | 푗 ∈ 퐶}. Due to line 29, by Lemma 6 we have that

(*) ∀푗 ∈ 퐶, 푢 ∈ {1, . . . , 푡}·⟨푗,푢⟩ ∈ img(퐴⁰ⱼ) ∪ 퐷⁰ⱼ.

Note that, for each id′ ∈ ids, we have id′ ∉ start at some correct process (in particular, at the processes 푗 that sent such id′ in their MPromises(퐷⁰ⱼ, 퐴⁰ⱼ) messages). Thus, by Lemma 5, there exists 휏1 ≥ 휏0 at which ids ⊆ commit ∪ execute at process 푖. Let MPromises(퐷¹ⱼ, 퐴¹ⱼ) be the MPromises message sent by each process 푗 ∈ 퐶 in the next invocation of line 91 after 휏1. Since at process 푖 we have that ids ⊆ commit ∪ execute, once all these MPromises messages are processed by process 푖, for each 푗 ∈ 퐶 we have that ⋃{퐴¹ⱼ[id′] | id′ ∈ ids} ⊆ Promises and 퐷¹ⱼ ⊆ Promises at process 푖. Given the definition of ids and since 퐴⁰ⱼ ⊆ 퐴¹ⱼ and 퐷⁰ⱼ ⊆ 퐷¹ⱼ, we also have that img(퐴⁰ⱼ) ⊆ Promises and 퐷⁰ⱼ ⊆ Promises at process 푖. From (*), it follows that ∀푗 ∈ 퐶, 푢 ∈ {1, . . . , 푡}·⟨푗,푢⟩ ∈ Promises at process 푖. □
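The role Lemma 7 plays in the liveness argument can be seen from the stability threshold that the proofs refer to as ℎ[⌊푟/2⌋]: for each of the 푟 processes of a partition, take the highest gap-free promise present in Promises, sort these values in descending order, and pick the entry at index ⌊푟/2⌋, a value therefore reached by at least ⌊푟/2⌋ + 1 processes. The Python code below is an illustrative reconstruction under these assumptions, with hypothetical names, not the paper's pseudocode.

```python
# Illustrative reconstruction of the stability threshold h[r//2] (hypothetical names).

def highest_gap_free(promises, proc):
    """Largest t such that (proc, u) is in `promises` for every u in 1..t."""
    t = 0
    while (proc, t + 1) in promises:
        t += 1
    return t

def stability_threshold(promises, processes):
    """Timestamps at or below the returned value are stable: at least
    floor(r/2) + 1 processes have promised every clock value up to it."""
    h = sorted((highest_gap_free(promises, p) for p in processes), reverse=True)
    return h[len(processes) // 2]

procs = ["p1", "p2", "p3"]
promises = {("p1", 1), ("p1", 2), ("p1", 3),
            ("p2", 1), ("p2", 2),
            ("p3", 1)}
print(stability_threshold(promises, procs))  # 2: p1 and p2 both cover clocks 1..2
```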

Lemma 8. Consider a command 푐 with an identifier id, and assume that id ∉ start at some correct process in I푐 . Then every correct process 푖 ∈ I푐 eventually executes 푐.

Proof. Consider a command 푐 with an identifier id, and assume that id ∉ start at some correct process in I푐 . By contradiction, assume further that some correct process 푖 ∈ I푐 never executes 푐. Let 푐0 = 푐, id0 = id, and 푡0 be the timestamp assigned to 푐. Then by Lemma 7, eventually in every invocation of the periodic handler at line 97, we have 푡0 ≤ ℎ[⌊푟/2⌋]. By Lemma 5 and since 푖 never executes 푐0, eventually 푖 has id0 ∈ commit. Hence, either 푖 never executes another command 푐1 preceding 푐0 in ids, or 푖 never receives an MStable(id0) message from some correct process 푗 eventually indicated by Iⁱ푐0. The latter case can only be due to the loop in the handler at line 97 being stuck at an earlier command 푐1 preceding 푐0 in ids at 푗. Hence, in both cases there exists a command 푐1 with an identifier id1 and a final timestamp 푡1 such that ⟨푡1, id1⟩ < ⟨푡0, id0⟩ and some correct process 푖1 ∈ I푐1 never executes 푐1 despite eventually having id1 ∈ commit. Continuing the above reasoning, we obtain an infinite sequence of commands 푐0, 푐1, 푐2, . . . with decreasing timestamp-identifier pairs such that each of these commands is never executed by some correct process. But such a sequence cannot exist because the set of timestamp-identifier pairs is well-founded. This contradiction shows the required. □

Proof of Liveness. Assume that some command 푐 with identifier id is submitted by a correct process or executed at some process. We now prove that it is eventually executed at all correct processes in I푐 . By Lemma 7 and Lemma 8, it is enough to prove that eventually some correct process in I푐 has id ∉ start. First, assume that id is submitted by a correct process 푖. Due to the precondition at line 2, 푖 ∈ I푐 . Then phase[id] is set to propose at process 푖 when 푖 sends the initial MPropose message, and thus id ∉ start, as required. Assume now that id is executed at some (potentially faulty) process. By Lemma 3, there exists some correct process in I푐 with id ∉ start, as required. □
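The final step of Lemma 8 appeals to the well-foundedness of the order on timestamp-identifier pairs. For completeness, this standard fact can be spelled out as follows; the statement is a sketch assuming timestamps range over the naturals and identifiers over a well-ordered set, which is how we read the protocol's assignment of both.

```latex
% Well-foundedness of the lexicographic order used in Lemma 8.
% Assumes timestamps are natural numbers and identifiers are drawn
% from a well-ordered set Id (an assumption made explicit here).
\begin{lemma}
  The lexicographic order on pairs
  $\langle t, \mathit{id} \rangle \in \mathbb{N} \times \mathit{Id}$, given by
  $\langle t, \mathit{id} \rangle < \langle t', \mathit{id}' \rangle \iff
   t < t' \lor (t = t' \land \mathit{id} < \mathit{id}')$,
  is well-founded: there is no infinite strictly decreasing sequence
  $\langle t_0, \mathit{id}_0 \rangle > \langle t_1, \mathit{id}_1 \rangle > \cdots$.
  Such a sequence would yield a non-increasing sequence of naturals
  $t_0 \ge t_1 \ge \cdots$ that is eventually constant, and past that point an
  infinite strictly decreasing sequence of identifiers, contradicting the
  well-ordering of $\mathit{Id}$.
\end{lemma}
```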

D Pathological Scenarios with Caesar and EPaxos

Let us consider 3 processes: A, B, C. Assume that all the commands are conflicting. Assume further that if a process proposes a command 푥, then this command is proposed with timestamp 푥. In Caesar, proposed timestamps must be unique. In the example below, this is ensured by assigning timestamps to processes round-robin. Further details about Caesar are provided in §3.3. The (infinite) example we consider is as follows:
• A proposes 1, 4, 7, ...
• B proposes 2, 5, 8, ...
• C proposes 3, 6, 9, ...
First, A proposes 1. When command 1 reaches B, B has already proposed 2, and thus command 1 is blocked on it. When 2 reaches C, C has already proposed 3, and thus command 2 is blocked on it. This keeps going forever, as depicted in the diagram below (where ← denotes "blocked on").

A : 1        4 ← 3    7 ← 6    ...
B : 2 ← 1    5 ← 4    8 ← 7    ...
C : 3 ← 2    6 ← 5    9 ← 8    ...

As a result, no command is ever committed. This liveness issue is due to Caesar's wait condition [1].

In EPaxos, the command arrival order from above results in the following dependency sets being committed:
• dep[1] = {2}
• dep[2] = {3}
• dep[3] = {1, 4}
• dep[4] = {1, 2, 5}
• dep[5] = {2, 3, 6}
• dep[6] = {1, 3, 4, 7}
• ...
These committed dependencies form a strongly connected component of unbounded size [31, 36]. As a result, commands are never executed.
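To check that the committed dependencies listed above indeed collapse into a single strongly connected component, the following Python sketch builds the dependency graph for the prefix of commands 1–6 (edges restricted to commands within that prefix) and verifies that every command reaches every other; the same pattern holds for any longer prefix, so the component grows without bound.

```python
# Sketch: the committed EPaxos dependencies above form one strongly
# connected component over the prefix of commands 1..6.

deps = {1: {2}, 2: {3}, 3: {1, 4}, 4: {1, 2, 5},
        5: {2, 3, 6}, 6: {1, 3, 4, 7}}

def reachable(start, graph):
    """Set of commands reachable from `start` by following dependency edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(d for d in graph.get(node, ()) if d in graph)
    return seen

# Every command reaches every other one, so 1..6 form a single SCC;
# extending the arrival pattern only enlarges this component.
print(all(reachable(c, deps) == set(deps) for c in deps))  # True
```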
