UNISTORE: A fault-tolerant marriage of causal and strong consistency

Manuel Bravo, Alexey Gotsman, Borja de Régil (IMDEA Software Institute); Hengfeng Wei∗ (Nanjing University)

arXiv:2106.00344v2 [cs.DC] 29 Jul 2021

∗Also with the State Key Laboratory for Novel Software Technology, Software Institute.

Abstract

Modern online services rely on data stores that replicate their data across geographically distributed data centers. Providing strong consistency in such data stores results in high latencies and makes the system vulnerable to network partitions. The alternative of relaxing consistency violates crucial correctness properties. A compromise is to allow multiple consistency levels to coexist in the data store. In this paper we present UNISTORE, the first fault-tolerant and scalable data store that combines causal and strong consistency. The key challenge we address in UNISTORE is to maintain liveness despite data center failures: this could be compromised if a strong transaction takes a dependency on a causal transaction that is later lost because of a failure. UNISTORE ensures that such situations do not arise while paying the cost of durability for causal transactions only when necessary. We evaluate UNISTORE on Amazon EC2 using both microbenchmarks and a sample application. Our results show that UNISTORE effectively and scalably combines causal and strong consistency.

1 Introduction

Many of today’s Internet services rely on geo-distributed data stores, which replicate data in different geographical locations. This improves user experience by allowing accesses to the closest site and ensures disaster-tolerance. However, geo-distribution also makes it more challenging to keep the data consistent. The classical approach is to make replication transparent to clients by providing strong consistency models, such as linearizability [32] or serializability [69]. The downside is that this approach requires synchronization between data centers in the critical path. This significantly increases latency [1] and makes the system unavailable during network partitionings [27]. Thus, even though several commercial geo-distributed systems follow this approach [17, 19, 25, 62, 70], the associated cost has prevented it from being adopted more widely.

An alternative approach is to relax synchronization: the data store executes an operation at a single data center, without any communication with others, and propagates updates to other data centers in the background [20, 65]. This minimizes the latency and makes the system highly available, i.e., operational even during network partitionings. But on the downside, the systems following this approach provide weaker consistency models: e.g., eventual consistency [65, 68] or causal consistency [2]. The latter is particularly appealing: it guarantees that clients see updates in an order that respects the potential causality between them. For example, assume that in a banking application Alice deposits $100 into Bob’s account (u1) and then posts a notification about it into Bob’s inbox (u2). Under causal consistency, if Bob sees the notification (u3), and then checks his account balance (u4), he will see the deposit. This is not guaranteed under eventual consistency, which does not respect causality relationships, such as those between u1 and u2. In some settings, causal consistency has been shown to be the strongest model that allows availability during network partitionings [7, 44]. It has been a subject of active research in recent years, with scalable implementations [3, 42, 47] and some industrial deployments [54, 66].

However, even causal consistency is often too weak to preserve critical application invariants. For example, consider a banking application that disallows overdrafts and thus maintains an invariant that an account balance is always non-negative. Assume that the balance of an account stored at two replicas is 100, and clients concurrently issue two withdraw(100) operations (u5 and u6) at different replicas. Since causal consistency executes operations without synchronization, both withdrawals will succeed, and once the replicas exchange the updates, the balance will go negative. To ensure integrity invariants in examples such as this, the programmer has to introduce synchronization between replicas, and, since synchronization is expensive, it pays off to do this sparingly. To this end, several research [9, 40, 41, 64] and commercial [5, 6, 28, 48, 57] data stores allow the programmer to choose whether to execute a particular operation under weak or strong consistency. For example, to preserve the integrity

invariant in our banking application, only withdrawals need to use strong consistency, and hence, synchronize; deposits may use weaker consistency and proceed without synchronization.

Given the benefits of causal consistency, it is particularly appealing to marry it with strong consistency in a geo-distributed data store. But like real-life marriages, to be successful this one needs to hold together both in good times and in bad: when data centers fail due to catastrophic events or power outages. Unfortunately, none of the existing data stores meant for geo-replication combine causal and strong consistency while providing such fault tolerance [9, 40, 41]. In this paper we present UNISTORE, the first fault-tolerant and scalable data store that combines causal and strong consistency. More precisely, UNISTORE implements a transactional variant of Partial Order-Restrictions consistency (PoR consistency) [30, 41]. This guarantees transactional causal consistency by default [3] and allows the programmer to additionally specify which pairs of transactions conflict, i.e., have to synchronize. For instance, to preserve the integrity invariant in our previous example, the programmer should declare that withdrawals from the same account conflict. Then one of the withdrawals u5 and u6 will observe the other and will fail.

The key challenge we have to address in UNISTORE is to maintain liveness despite data center failures. Just adding a Paxos-based commit protocol for strong transactions [18, 19, 34] to an existing causally consistent protocol does not yield a fault-tolerant data store. In such a data store, a committed strong transaction t2 may never become visible to clients if a causal transaction t1 on which it depends is lost due to a failure of its origin data center. This compromises the liveness of the system, because no transaction t3 conflicting with t2 can commit from now on: according to the PoR model, one of the transactions t2 and t3 has to observe the other, but t2 will never be visible and t2 did not observe t3.

UNISTORE addresses this problem by ensuring that, before a strong transaction commits, all its causal dependencies are uniform, i.e., will eventually become visible at all correct data centers. This adapts the classical notion of uniformity in distributed computing to causal consistency [15]. UNISTORE does so without defeating the benefits of causal consistency. Causal transactions remain highly available at the cost of increasing the latency of strong transactions: a strong transaction may have to wait for some of its dependencies to become uniform before committing. To minimize this cost, UNISTORE executes causal transactions on a snapshot that is slightly in the past, such that a strong transaction will mostly depend on causal transactions that are already uniform before committing. Furthermore, UNISTORE reuses the mechanism for tracking uniformity to let clients make causal transactions durable on demand and to enable consistent client migration.

In addition to being fault tolerant, UNISTORE scales horizontally, i.e., with the number of machines in each data center; this also goes beyond previous proposals [9, 40, 41]. To this end, UNISTORE builds on Cure [3], a scalable implementation of transactional causal consistency. Our protocol extends Cure with a novel mechanism that distributes the task of tracking the set of uniform transactions among the machines of a data center. We also add the ability for data centers to forward transactions they receive from others, so that a transaction can propagate through the system even if its origin data center fails. Finally, we carefully integrate an existing fault-tolerant atomic commit for strong transactions [18] into the protocol for causal consistency.

We have rigorously proved the correctness of the UNISTORE protocol (§7 and §D). We have also evaluated it on Amazon EC2 using both microbenchmarks and a more realistic RUBiS benchmark. Our evaluation demonstrates that UNISTORE scalably combines causal and strong transactions, with the latter not affecting the performance of the former. Under the RUBiS mix workload, causal transactions exhibit a low latency (1.2 ms on average), and the overall average latency is 3.7× lower than that of a strongly consistent system.

2 System Model

We consider a geo-distributed system consisting of a set of data centers 𝒟 = {1, ..., D} that manage a large set of data items. A data item is uniquely identified by its key. For scalability, the key space is split into a set of logical partitions 𝒫 = {1, ..., N}. Each data center stores replicas of all partitions, scattered among its servers. We let p_d^m be the replica of partition m at data center d, and we refer to replicas of the same partition as sibling replicas. As is standard, we assume that D = 2f + 1 and at most f data centers may fail. We call a data center that does not fail correct. If a data center fails, all partition replicas it stores become unavailable. For simplicity, we do not consider the failures of individual replicas within a data center: these can be masked using standard state-machine replication protocols executing within a data center [37, 55].

Replicas have physical clocks, which are loosely synchronized by a protocol such as NTP. The correctness of UNISTORE does not depend on the precision of clock synchronization, but large drifts may negatively impact its performance. Any two replicas are connected by a reliable FIFO channel, so that messages between correct data centers are guaranteed to be delivered. As is standard, to implement strong consistency we require the network to be eventually synchronous, so that message delays between sibling replicas in correct data centers are eventually bounded by some constant [23].

3 Consistency Model

A client interacts with UNISTORE by executing a stream of transactions at the data center it is connected to. A transaction consists of a sequence of operations, each on a single data item, and can be interactive: the data items it accesses are not known a priori. A transaction that modifies at least one data item is an update transaction; otherwise it is read-only.
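The overdraft example from §1 can be made concrete with a small simulation. The following is only an illustrative sketch, not part of UNISTORE: the class `Replica` and the functions `run_concurrent_withdrawals` and `run_ordered_withdrawals` are hypothetical names introduced here, and replication is modelled as a direct exchange of balance deltas between two in-memory objects.

```python
class Replica:
    """Hypothetical model of one replica of a single account."""

    def __init__(self, balance):
        self.balance = balance
        self.log = []  # deltas produced locally, to be sent to the sibling

    def withdraw(self, amount):
        # The overdraft check consults local state only (no synchronization).
        if self.balance < amount:
            return False
        self.balance -= amount
        self.log.append(-amount)
        return True

    def receive(self, deltas):
        # Apply updates that originated at the sibling replica.
        for delta in deltas:
            self.balance += delta


def run_concurrent_withdrawals():
    """u5 and u6 run at different replicas before any update is exchanged."""
    r1, r2 = Replica(100), Replica(100)
    ok5 = r1.withdraw(100)
    ok6 = r2.withdraw(100)
    log1, log2 = list(r1.log), list(r2.log)
    r1.receive(log2)
    r2.receive(log1)
    return ok5, ok6, r1.balance, r2.balance


def run_ordered_withdrawals():
    """The second withdrawal observes the first, as conflicting ones must."""
    r1, r2 = Replica(100), Replica(100)
    ok5 = r1.withdraw(100)
    r2.receive(r1.log)      # u5 becomes visible at r2 before u6 runs
    ok6 = r2.withdraw(100)  # fails: the observed balance is already 0
    r1.receive(r2.log)
    return ok5, ok6, r1.balance, r2.balance
```

In the concurrent run both withdrawals succeed and both replicas converge to a negative balance; in the ordered run the second withdrawal fails and the invariant holds. How the ordering is actually enforced is the job of the strong-transaction protocol and is not modelled here.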

A consistency model defines a contract between the data store and its clients that specifies which values the data store is allowed to return in response to client operations. UNISTORE implements a transactional variant of Partial Order-Restrictions consistency (PoR consistency) [30, 41], which we now define informally; we give a formal definition in §B. The PoR model enables the programmer to classify transactions as either causal or strong. Causal transactions satisfy transactional causal consistency, which guarantees that clients see transactions in an order that respects the potential causality between them [2, 3]. However, clients can observe causally independent transactions in arbitrary order. Strong transactions give the programmer more control over their visibility. To this end, the programmer provides a symmetric conflict relation ⋈ on operations that is lifted to strong transactions as follows: two transactions conflict if they perform conflicting operations on the same data item. Then the PoR model guarantees that, out of two conflicting strong transactions, one has to observe the other.

More precisely, a transaction t1 precedes a transaction t2 in the session order if they are executed by the same client and t1 is executed before t2. A set of transactions T committed by the data store satisfies PoR consistency if there exists a causal order relation ≺ on T such that the following properties hold:

Causality Preservation. The relation ≺ is transitive, irreflexive, and includes the session order.

Return Value Consistency. Consider an operation u on a data item k in a transaction t ∈ T. The return value of u can be computed from the state of k obtained as follows: first execute all operations on k by transactions preceding t in ≺ in an order consistent with ≺; then execute all operations on k that precede u in t.

Conflict Ordering. For any distinct strong transactions t1, t2 ∈ T, if t1 ⋈ t2, then either t1 ≺ t2 or t2 ≺ t1.

Eventual Visibility. A transaction t ∈ T that is either strong or originates at a correct data center eventually becomes visible at all correct data centers: from some point on, t precedes in ≺ all transactions issued at correct data centers.

If all transactions are causal, then the above definition specializes to transactional causal consistency [3, 16]. If all transactions are strong and all pairs of operations conflict, then we obtain (non-strict) serializability.

When t1 ≺ t2, we say that t1 is a causal dependency of t2. Return Value Consistency ensures that all operations in a transaction t execute on a snapshot consisting of its causal dependencies (as well as prior operations by t). Transactions are atomic, so that either all of their operations are included into the snapshot or none at all. The transitivity of ≺, mandated by Causality Preservation, ensures that the snapshot a transaction executes on is causally consistent: if a transaction t1 is included into the snapshot, then so is any other transaction t2 on which t1 depends (i.e., t2 ≺ t1). The inclusion of the session order into ≺, also mandated by Causality Preservation, ensures session guarantees such as read your writes [63]. The consistency model disallows the causality violation anomaly from §1. Indeed, since ≺ includes the session order, we have u1 ≺ u2 and u3 ≺ u4. Moreover, Bob sees Alice’s message, and by Return Value Consistency this can only happen if u2 ≺ u3. Then since ≺ is transitive, u1 ≺ u4, and by Return Value Consistency, Bob has to see Alice’s deposit.

Causal consistency nevertheless allows the overdraft anomaly from §1: the withdrawals u5 and u6 may not be related by ≺, and thus may both execute on the balance 100 and succeed. The Conflict Ordering property can be used to disallow this anomaly by declaring that withdraw operations on the same account conflict and labeling transactions containing these as strong. Then one of the withdrawal transactions will be guaranteed to causally precede the other. The latter will be executed on the account balance 0 and will fail.

Finally, Eventual Visibility ensures that strong transactions and those causal ones that originate at correct data centers are durable, i.e., will eventually propagate through the system.

To facilitate the use of causal transactions, UNISTORE includes replicated data types (aka CRDTs), which implement policies for merging concurrent updates to the same data item [56]. Each data item in the store is associated with a type (e.g., counter, set), which is backed by a CRDT implementation managing updates to it. For example, the programmer can use a counter CRDT to represent an account balance. Then if two clients concurrently deposit 100 and 200 into an empty account using causal transactions, eventually the balance at all replicas will be 300. Using ordinary writes here would yield 100 or 200, depending on the order in which the writes are applied. More generally, CRDTs ensure that two replicas receiving the same set of updates are in the same state, regardless of the receipt order. Together with Eventual Visibility, this implies the expected guarantee of eventual consistency [65]. Due to space constraints, we omit details about the use of CRDTs from our protocol descriptions.

4 Key Design Decisions in UNISTORE

Baseline causal consistency. A causal transaction in UNISTORE first executes at a single data center on a causally consistent snapshot. After this it immediately commits, and its updates are replicated to all other data centers in the background. This minimizes the latency of causal transactions and makes them highly available, i.e., they can be executed even when the network connections between data centers fail.

As is common in causally consistent data stores [3, 22, 42], to ensure that causal transactions execute on consistent snapshots, a data center exposes a remote transaction to clients only after exposing all its dependencies. Then to satisfy the Eventual Visibility property under failures, a data center receiving a remote causal transaction may need to forward it to other data centers, as in reliable broadcast [11] and anti-entropy protocols for replica reconciliation [53].
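The two rules just stated, exposing a remote transaction only after its dependencies and forwarding received transactions to other data centers, can be sketched as follows. This is a simplified model under stated assumptions rather than the UNISTORE protocol: dependencies are tracked as explicit sets of transaction identifiers instead of vectors, and `DataCenter`, `receive`, `forward_to` and `forwarding_scenario` are hypothetical names. The scenario has a transaction t1 that reached only d2 before its origin failed, and a transaction t2 that depends on t1.

```python
class DataCenter:
    """Hypothetical model: stores received transactions and exposes them
    to clients only once all of their dependencies are exposed."""

    def __init__(self, name):
        self.name = name
        self.received = {}    # tid -> set of tids it depends on
        self.exposed = set()  # tids currently visible to clients

    def receive(self, tid, deps):
        self.received[tid] = set(deps)
        self._expose_ready()

    def _expose_ready(self):
        # Keep exposing transactions until no further progress is possible.
        progress = True
        while progress:
            progress = False
            for tid, deps in self.received.items():
                if tid not in self.exposed and deps <= self.exposed:
                    self.exposed.add(tid)
                    progress = True

    def forward_to(self, other):
        # Forward everything stored here, whatever its origin data center.
        for tid, deps in self.received.items():
            other.receive(tid, deps)


def forwarding_scenario(forwarding):
    """Return the set of transactions exposed at d3."""
    d2, d3 = DataCenter("d2"), DataCenter("d3")
    d2.receive("t1", [])      # t1 reached d2 before its origin d1 failed
    d2.receive("t2", ["t1"])  # t2 is submitted at d2 and depends on t1
    d3.receive("t2", ["t1"])  # d2 replicates its own transaction t2 to d3
    if forwarding:
        d2.forward_to(d3)     # d2 also forwards the remote transaction t1
    return d3.exposed
```

Without forwarding, d3 stores t2 but can never expose it, because t1 is missing; with forwarding, both transactions become visible at d3.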

Figure 1: Why UNISTORE may need to forward remote causal transactions.

Figure 2: Why UNISTORE needs to ensure that the dependencies of a strong transaction are uniform before committing it.

Figure 1 depicts a scenario that demonstrates how Eventual Visibility could be violated in the absence of this mechanism. Let t1 be a causal transaction submitted at a data center d1 (event 1). Assume that d1 replicates t1 to a correct data center d2 (event 2) and then fails (event 3), so that t1 does not get replicated anywhere else. Let t2 be a transaction submitted at d2 after t1 becomes visible there, so that t2 depends on t1 (event 4). Transaction t2 will eventually be replicated to all correct data centers (event 5). But it will never be exposed at any of them, because its dependency t1 is missing. If data centers can forward remote causal transactions, then d2 can eventually replicate t1 to all correct data centers, preventing this problem.

On-demand strong consistency. UNISTORE uses optimistic concurrency control for strong transactions: they are first executed speculatively and the results are then certified to determine whether the transaction can commit, or must abort due to a conflict with a concurrent strong transaction [69]. Certifying a strong transaction requires synchronization between the replicas of partitions it accessed, located in different data centers. UNISTORE implements this using an existing fault-tolerant protocol that combines two-phase commit and Paxos [18] while minimizing commit latency. However, just using such a protocol is not enough to make the overall system fault tolerant: for this, before a strong transaction commits, all its causal dependencies must be uniform in the following sense.

DEFINITION 1. A transaction is uniform if both the transaction and its causal dependencies are guaranteed to be eventually replicated at all correct data centers.

This adapts the classical notion of uniformity in distributed computing to causal consistency [15]. UNISTORE considers a transaction to be uniform once it is visible at f + 1 data centers, because at least one of these must be correct, and data centers can forward causal transactions to others.

The following scenario, depicted in Figure 2, demonstrates why committing a strong transaction before its dependencies become uniform can compromise the liveness of the system. Assume that a causal transaction t1 and a strong transaction t2 are submitted at a data center d1 in such a way that t1 becomes a dependency of t2 (events 1 and 2). Assume also that t2 is certified, committed and delivered to all relevant replicas (events 3 and 4) before t1 is replicated to any data center, and thus before it is uniform. Now if d1 fails before replicating t1 (event 5), no remote data center will be able to expose t2, because its dependency t1 is missing. This violates the Eventual Visibility property, and even worse, no strong transaction conflicting with t2 can commit from now on. For instance, let t3 be such a transaction, submitted at d3 (event 6). Because d3 cannot expose t2, transaction t3 executes on a snapshot excluding t2. Hence, t3 will abort during certification (events 7 and 8): committing it would violate the Conflict Ordering property, since transactions t2 and t3 conflict, but neither of them is visible to the other. Ensuring that t1 is uniform before committing t2 prevents this problem, because it guarantees that t1 will eventually be replicated at d3. After this t2 will be exposed to conflicting transactions at this data center, which will allow them to commit.

Minimizing the latency of strong transactions. Ensuring that all the causal dependencies of a strong transaction are uniform before committing it may significantly increase its latency, since this requires additional communication between data centers. UNISTORE mitigates this problem by executing causal transactions on a snapshot that is slightly in the past, which is allowed by causal consistency. Namely, UNISTORE makes a remote causal transaction visible to the clients only after it is uniform. This minimizes the latency of a strong transaction, since to commit it only needs to wait for causal transactions originating at the local data center to become uniform. We cannot delay the visibility of the latter transactions due to the need to guarantee read your writes to local clients.

On-demand durability of causal transactions. Client applications interacting with the external world require hard durability guarantees: e.g., a banking application has to ensure that a withdrawal is durably recorded before authorizing the operation. UNISTORE guarantees that, once a strong transaction commits, the transaction and its dependencies are durable. However, UNISTORE returns from a causal transaction before it is replicated, and thus the transaction may be lost if its origin data center fails. Ensuring the durability of every single causal transaction would require synchronization between data centers on its critical path, defeating the benefits of causal consistency. Instead, UNISTORE reuses the mechanism for tracking uniformity to let the clients pay the cost of durability only when necessary. Even though UNISTORE replicates causal transactions asynchronously, it allows clients to execute a uniform barrier, which ensures that the transactions they have observed so far are uniform, and thus durable.
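The f + 1 threshold behind uniformity can be checked by brute force for small systems. The sketch below is an illustration under the fault model of §2 (D = 2f + 1 data centers, at most f failures); the helper names `max_failures`, `is_uniform` and `survives_any_failure_pattern` are assumptions made for this example and do not appear in the protocol.

```python
from itertools import combinations


def max_failures(num_dcs):
    """Return f for a system with D = 2f + 1 data centers."""
    if num_dcs < 1 or num_dcs % 2 == 0:
        raise ValueError("the fault model assumes D = 2f + 1")
    return (num_dcs - 1) // 2


def is_uniform(visible_at, num_dcs):
    """A transaction counts as uniform once visible at f + 1 data centers."""
    return len(set(visible_at)) >= max_failures(num_dcs) + 1


def survives_any_failure_pattern(visible_at, num_dcs):
    """Check that, whichever set of at most f data centers fails, at least
    one data center holding the transaction stays correct."""
    f = max_failures(num_dcs)
    holders = set(visible_at)
    for k in range(f + 1):
        for failed in combinations(range(1, num_dcs + 1), k):
            if holders <= set(failed):
                return False  # every holder failed: the transaction is lost
    return True
```

With D = 5 (so f = 2), a transaction visible at data centers {1, 2, 3} survives every admissible failure pattern, whereas one visible only at {1, 2} is lost if exactly those two fail. A surviving holder can then forward the transaction to the remaining correct data centers.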

Client migration. Clients may need to migrate between data centers, e.g., because of roaming or for load balancing. UNISTORE also uses the uniformity mechanism to preserve session guarantees during migration. A client wishing to migrate from its local data center d to another data center i first invokes a uniform barrier at d. This guarantees that the transactions the client has observed or issued at d are durable and will eventually become visible at i, even if d fails. The client then makes an attach call at the destination data center i that waits until i stores all the above transactions. After this, the client can operate at i knowing that the state of the data center is consistent with the client’s previous actions.

Currently UNISTORE does not support consistent client migration in response to a data center failure: if the data center a client is connected to fails, the client will have to restart its session when connecting to a different data center. As shown in [71], this limitation can be lifted without defeating the benefits of causal consistency. We leave integrating the corresponding mechanisms into UNISTORE for future work.

5 Fault-Tolerant Causal Consistency Protocol

We first describe the UNISTORE protocol for the case when all transactions are causal. We give its pseudocode in Algorithms 1 and 2; for now the reader should ignore highlighted lines, which are needed for strong transactions. For simplicity, we assume that each handler in the algorithms executes atomically (although our implementation is parallelized). We reference pseudocode lines using the format algorithm#:line#.

5.1 Metadata

Most metadata in our protocol are represented by vectors with an entry per each data center, where each entry stores a scalar timestamp. However, different pieces of metadata use the vectors in different ways, which we now describe.

Tracking causality. The first use of the vectors is as vector clocks [26, 46], to track causality between transactions. Given vectors V1 and V2, we write V1 < V2 if each entry of V1 is no greater than the corresponding entry of V2, and at least one is strictly smaller. Each update transaction is tagged with a commit vector commitVec. The order on these vectors is consistent with the causal order ≺ from §3: if commitVec1 and commitVec2 are the commit vectors of two update transactions t1 and t2 such that t1 ≺ t2, then commitVec1 < commitVec2. For a transaction originating at a data center d with a commit vector commitVec, we call commitVec[d] its local timestamp.

Each replica p_d^m maintains a log opLog[k] of update operations performed on each data item k stored at the replica. Each log entry stores, together with the operation, the commit vector of the transaction that performed it. This allows reconstructing different versions of a data item from its log.

Representing causally consistent snapshots. The protocol also uses a vector to represent a snapshot of the data store on which a transaction operates: a snapshot vector V represents all transactions with a commit vector ≤ V. This snapshot is causally consistent. Indeed, consider a transaction t1 included into it, i.e., commitVec1 ≤ V. Since any causal dependency t0 of t1 is such that commitVec0 < commitVec1, we have commitVec0 < V, so that t0 is also included into the snapshot. A client also maintains a vector pastVec that represents its causal past: a causally consistent snapshot including the update transactions the client has previously observed.

Tracking what is replicated where. Each replica p_d^m maintains three vectors that are used to compute which transactions are uniform. These respectively track the sets of transactions replicated at p_d^m, the local data center d, and f + 1 data centers. Each of these vectors V represents the set of update transactions originating at a data center i with a local timestamp ≤ V[i]. Note that this set may not form a causally consistent snapshot. The first vector maintained by p_d^m is knownVec. For each data center i, it defines the prefix of update transactions originating at i (in the order of local timestamps) that p_d^m knows about.

PROPERTY 1. For each data center i, the replica p_d^m stores the updates to partition m by all transactions originating at i with local timestamps ≤ knownVec[i].

Our protocol ensures that knownVec[d] ≤ clock at any replica in data center d. The vector knownVec at p_d^m records whether the updates to partition m by a given transaction are stored at this replica. In contrast, the next vector stableVec records whether the updates to all partitions by a transaction are stored at the local data center d.

PROPERTY 2. For each data center i, the data center d stores the updates by all transactions originating at i with local timestamps ≤ stableVec[i]. More precisely, we are guaranteed that knownVec[i] at any replica of d ≥ stableVec[i] at any p_d^m.

Finally, the last vector uniformVec defines the set of update transactions that p_d^m knows to have been replicated at f + 1 data centers, including d.

PROPERTY 3. Consider uniformVec[i] at p_d^m. All update transactions originating at i with local timestamps ≤ uniformVec[i] are replicated at f + 1 data centers including d. More precisely: knownVec[i] at any replica of these data centers ≥ uniformVec[i] at p_d^m.

When uniformVec is reinterpreted as a causally consistent snapshot, it defines transactions that p_d^m knows to be uniform according to Definition 1:

PROPERTY 4. Consider uniformVec at p_d^m. All update transactions with commit vectors ≤ uniformVec are uniform.

Proof sketch. Consider a transaction t1 that originates at a data center i with a commit vector commitVec1 ≤ uniformVec at p_d^m. In particular, commitVec1[i] ≤ uniformVec[i], and by Property 3, t1 is replicated at f + 1 data centers. We assume at most f failures. Then the transaction forwarding mechanism of our protocol (§4) guarantees that t1 will eventually be replicated at all correct data centers. Consider now any causal dependency t2 of t1 with a commit vector commitVec2. Since commit vectors are consistent with causality, commitVec2 < commitVec1 ≤ uniformVec. Then as above, we can again establish that t2 will be replicated at all correct data centers, as required by Definition 1.

Algorithm 1 Transaction execution at p_d^m.
 1: function START_TX(V)
 2:   for i ∈ 𝒟 \ {d} do
 3:     uniformVec[i] ← max{V[i], uniformVec[i]}
 4:   var tid ← generate_tid()
 5:   snapVec[tid] ← uniformVec
 6:   snapVec[tid][d] ← max{V[d], uniformVec[d]}
 7:   snapVec[tid][strong] ← max{V[strong], stableVec[strong]}
 8:   return tid
 9: function DO_OP(tid, k, op)
10:   var l ← partition(k)
11:   send GET_VERSION(snapVec[tid], k) to p_d^l
12:   wait receive VERSION(state) from p_d^l
13:   for all ⟨k, op′⟩ ∈ wbuff[tid][l] do state ← apply(op′, state)
14:   rset[tid] ← rset[tid] ∪ {⟨k, op⟩}
15:   if op is an update then
16:     wbuff[tid][l] ← wbuff[tid][l] · ⟨k, op⟩
17:   return retval(op, state)
18: when received GET_VERSION(snapVec, k) from p
19:   for i ∈ 𝒟 \ {d} do
20:     uniformVec[i] ← max{snapVec[i], uniformVec[i]}
21:   wait until knownVec[d] ≥ snapVec[d] ∧ knownVec[strong] ≥ snapVec[strong]
22:   var state ← ⊥
23:   for all ⟨op′, commitVec⟩ ∈ opLog[k] such that commitVec ≤ snapVec do
24:     state ← apply(op′, state)
25:   send VERSION(state) to p
26: function COMMIT_CAUSAL(tid)
27:   var L ← {l | wbuff[tid][l] ≠ ∅}
28:   if L = ∅ then return snapVec[tid]
29:   send PREPARE(tid, wbuff[tid][l], snapVec[tid]) to p_d^l, l ∈ L
30:   var commitVec ← snapVec[tid]
31:   for all l ∈ L do
32:     wait receive PREPARE_ACK(tid, ts) from p_d^l
33:     commitVec[d] ← max{commitVec[d], ts}
34:   send COMMIT(tid, commitVec) to p_d^l, l ∈ L
35:   return commitVec
36: when received PREPARE(tid, wbuff, snapVec) from p
37:   for i ∈ 𝒟 \ {d} do
38:     uniformVec[i] ← max{snapVec[i], uniformVec[i]}
39:   var ts ← clock
40:   preparedCausal ← preparedCausal ∪ {⟨tid, wbuff, ts⟩}
41:   send PREPARE_ACK(tid, ts) to p
42: when received COMMIT(tid, commitVec)
43:   wait until clock ≥ commitVec[d]
44:   ⟨tid, wbuff, _⟩ ← find(tid, preparedCausal)
45:   preparedCausal ← preparedCausal \ {⟨tid, _, _⟩}
46:   for all ⟨k, op⟩ ∈ wbuff do
47:     opLog[k] ← opLog[k] · ⟨op, commitVec⟩
48:   committedCausal[d] ← committedCausal[d] ∪ {⟨tid, wbuff, commitVec⟩}
49: function UNIFORM_BARRIER(V)
50:   wait until uniformVec[d] ≥ V[d]
51: function ATTACH(V)
52:   wait until ∀i ∈ 𝒟 \ {d}. uniformVec[i] ≥ V[i]

5.2 Causal Transaction Execution

Starting a transaction. A client can submit a transaction to any replica in its local data center by calling START_TX(V), where V is the client’s causal past pastVec (line 1:1; for brevity, we omit the pseudocode of the client). A replica p_d^m receiving such a request acts as the transaction coordinator. It generates a unique transaction identifier tid, computes a snapshot snapVec[tid] on which the transaction will execute, and returns tid to the client (we explain lines 1:2-3 and similar ones later). The snapshot is computed by combining uniform transactions from uniformVec (line 1:5) with the transactions from the client’s causal past originating at d (line 1:6). The former is crucial to minimize the latency of strong transactions (§4), while the latter ensures read your writes.

Transaction execution. The client proceeds to execute the transaction tid by issuing a sequence of operations at its coordinator via DO_OP (line 1:9). When the coordinator receives an operation op on a data item k, it sends a GET_VERSION message with the transaction’s snapshot snapVec[tid] to the local replica responsible for k (line 1:11). Upon receiving the message (line 1:18), the replica first ensures that it is as up-to-date as required by the snapshot (line 1:21). It then computes the latest version of k within the snapshot by applying the operations from opLog[k] by all transactions with commit vectors ≤ snapVec[tid]. The result is sent to the coordinator in a VERSION message. After receiving it (line 1:12), the coordinator further applies the operations on k previously executed by the transaction, which are stored in a buffer wbuff[tid]; this ensures read your writes within the transaction. If the operation is an update, the coordinator then appends it to wbuff[tid]. Finally, the coordinator executes the desired operation op and forwards its return value to the client.

Commit. A client commits a causal transaction by calling COMMIT_CAUSAL (line 1:26). This returns immediately if the transaction is read-only, since it already read a consistent snapshot (line 1:28). To commit an update transaction, UNISTORE uses a variant of two-phase commit protocol (recall that for simplicity we only consider whole-data center failures, not those of individual replicas, §2). The coordinator first sends a PREPARE message to the replicas in the local data center storing the data items updated by the transaction (line 1:29). The message to each replica contains the part of the write buffer relevant to that replica. When a replica receives the message (line 1:36), it computes the transaction’s

prepare time ts from its local clock and adds the transaction to preparedCausal, which stores the set of causal transactions that are prepared to commit at the replica. The replica then returns ts to the coordinator in a PREPARE_ACK message. When the coordinator receives replies from all replicas updated by the transaction, it computes the transaction's commit vector commitVec: it sets the local timestamp commitVec[d] to the maximum among the prepare times proposed by the replicas (line 1:33), and it copies the other entries of commitVec from the snapshot vector snapVec[tid] (line 1:30). The latter reflects the fact that the transactions in the snapshot become causal dependencies of tid.

After computing commitVec, the coordinator sends it in a COMMIT message to the relevant replicas at the local data center (line 1:34) and returns it to the client (line 1:35). The client then sets its causal past pastVec to the commit vector. When a replica receives the COMMIT message (line 1:42), it removes the transaction from preparedCausal, adds the transaction's updates to opLog, and adds the transaction to a set committedCausal[d], which stores transactions waiting to be replicated to sibling replicas at other data centers.

5.3 Transaction Replication

Each replica p_d^m periodically replicates locally committed update transactions to sibling replicas in other data centers by executing PROPAGATE_LOCAL_TXS (line 2:1). Transactions are replicated in the order of their local timestamps. The prefix of transactions that is ready to be replicated is determined by knownVec[d]: according to Property 1, p_d^m stores updates to m by all transactions originating at d with local timestamps ≤ knownVec[d]. Thus, the replica first updates knownVec[d] while preserving Property 1.

Algorithm 2 Transaction replication at p_d^m.
 1: function PROPAGATE_LOCAL_TXS()   ▷ Run periodically
 2:   if preparedCausal = ∅ then knownVec[d] ← clock
 3:   else knownVec[d] ← min{ts | ⟨_, _, ts⟩ ∈ preparedCausal} − 1
 4:   var txs ← {⟨_, _, commitVec⟩ ∈ committedCausal[d] | commitVec[d] ≤ knownVec[d]}
 5:   if txs ≠ ∅ then
 6:     send REPLICATE(d, txs) to p_i^m, i ∈ D \ {d}
 7:     committedCausal[d] ← committedCausal[d] \ txs
 8:   else send HEARTBEAT(d, knownVec[d]) to p_i^m, i ∈ D \ {d}
 9: when received REPLICATE(i, txs)
10:   for all ⟨tid, wbuff, commitVec⟩ ∈ txs in commitVec[i] order do
11:     if commitVec[i] > knownVec[i] then
12:       for all ⟨k, op⟩ ∈ wbuff do
13:         opLog[k] ← opLog[k] · ⟨op, commitVec⟩
14:       committedCausal[i] ← committedCausal[i] ∪ {⟨tid, wbuff, commitVec⟩}
15:       knownVec[i] ← commitVec[i]
16: when received HEARTBEAT(i, ts)
17:   pre: ts > knownVec[i]
18:   knownVec[i] ← ts
19: function FORWARD_REMOTE_TXS(i, j)
20:   var txs ← {⟨_, _, commitVec⟩ ∈ committedCausal[j] | commitVec[j] > globalMatrix[i][j]}
21:   if txs ≠ ∅ then send REPLICATE(j, txs) to p_i^m
22:   else send HEARTBEAT(j, knownVec[j]) to p_i^m
23: function BROADCAST_VECS()   ▷ Run periodically
24:   send KNOWNVEC_LOCAL(m, knownVec) to p_d^l, l ∈ P
25:   send STABLEVEC(d, stableVec) to p_i^m, i ∈ D
26:   send KNOWNVEC_GLOBAL(d, knownVec) to p_i^m, i ∈ D
27: when received KNOWNVEC_LOCAL(l, knownVec)
28:   localMatrix[l] ← knownVec
29:   for i ∈ D do stableVec[i] ← min{localMatrix[n][i] | n ∈ P}
30:   stableVec[strong] ← min{localMatrix[n][strong] | n ∈ P}
31: when received STABLEVEC(i, stableVec)
32:   stableMatrix[i] ← stableVec
33:   G ← all groups with f + 1 replicas that include p_d^m
34:   for j ∈ D do
35:     var ts ← max{min{stableMatrix[h][j] | h ∈ g} | g ∈ G}
36:     uniformVec[j] ← max{uniformVec[j], ts}
37: when received KNOWNVEC_GLOBAL(l, knownVec)
38:   globalMatrix[l] ← knownVec

There are two cases of this update. If the replica does not have any prepared transactions (preparedCausal = ∅), it sets knownVec[d] to the current value of the clock (line 2:2). This preserves Property 1 because in this case a new transaction originating at d and updating m will get a prepare time at p_d^m higher than the current clock (line 1:39), and thus also a higher local timestamp (line 1:33). If the replica has some prepared transactions, then they may end up getting local timestamps lower than the current clock. In this case, the replica sets knownVec[d] to just below the smallest prepare time (line 2:3). This preserves Property 1 because: (i) currently prepared transactions will get a local timestamp no lower than their prepare time; and (ii) as we argued above, new transactions will get a prepare time higher than the current clock and, hence, than the smallest prepare time.

After updating knownVec[d], the replica sends a REPLICATE message to its siblings with the transactions in committedCausal[d] such that commitVec[d] ≤ knownVec[d], and then removes them from committedCausal[d]. In other words, the replica sends all transactions from the prefix determined by knownVec[d] that it has not yet replicated.

When a replica p_d^m receives a REPLICATE message with a set of transactions txs originating at a sibling replica p_i^m (line 2:9), it iterates over txs in commitVec[i] order. For each new transaction in txs with commit vector commitVec, the replica adds the transaction's operations to its log and sets knownVec[i] = commitVec[i]. Since communication channels are FIFO, p_d^m processes all transactions from p_i^m in their local timestamp order. Hence, the above update to knownVec[i] preserves Property 1: p_d^m stores updates originating at p_i^m by all transactions with commitVec[i] ≤ knownVec[i]. Finally, the replica adds the transactions to committedCausal[i], which is used to implement transaction forwarding (§4). Due to
the forwarding, p_d^m may receive the same transaction from different data centers. Thus, when processing transactions in the REPLICATE message, it checks for duplicates (line 2:11).

5.4 Advancing the Uniform Snapshot

Replicas run a background protocol that refreshes the information about uniform transactions. This proceeds in two stages. First, a replica keeps track of which transactions have been replicated at the replicas of other partitions in the same data center. To this end, replicas in the same data center periodically exchange KNOWNVEC_LOCAL messages with their knownVec vectors, which they store in a matrix localMatrix (lines 2:24 and 2:27); in our implementation this is done via a dissemination tree. This matrix is then used to compute the vector stableVec, which represents the set of transactions that have been fully replicated at the local data center as per Property 2. To ensure this, a replica computes an entry stableVec[i] as the minimum of knownVec[i] it received from the replicas of other partitions in the same data center (line 2:29).

In the second stage of the background protocol, sibling replicas periodically exchange STABLEVEC messages with their stableVec vectors, which they store in a matrix stableMatrix (lines 2:25 and 2:31). This matrix is then used by a replica to compute uniformVec, which characterizes the update transactions that are replicated at f + 1 data centers as per Property 3. To this end, a replica first enumerates all groups G of f + 1 data centers that include its local data center (line 2:33). For each data center j the replica performs the following computation. First, for each group g ∈ G, it computes the minimum j-th entry in the stable vectors of all data centers h ∈ g: min{stableMatrix[h][j] | h ∈ g}. By Property 2 all update transactions originating at j with local timestamp ≤ the minimum have been replicated at all data centers in g. The replica then sets uniformVec[j] to the maximum of the resulting values computed for all groups g ∈ G, to cover transactions that are replicated at any such group. According to Property 4, the transactions with commit vectors ≤ uniformVec are uniform, and now become visible to transactions coordinated by p_d^m (§5.2).

Replicas also update uniformVec in lines 1:2-3, 1:19-20 and 1:37-38 by incorporating snapVec[i] for remote data centers i. This is safe because a transaction executes on a snapshot that only includes uniform remote transactions.

Finally, if a replica does not receive new transactions for a long time, it sends the value of its knownVec[d] as a heartbeat (lines 2:8 and 2:16). This allows advancing stableVec and uniformVec even under skewed load distributions.

5.5 Transaction Forwarding

As we explained in §4, to guarantee that a transaction originating at a correct data center eventually becomes exposed at all correct data centers despite failures (Eventual Visibility), replicas may have to forward remote update transactions. To determine which transactions to forward, each replica keeps track of the update transactions that have been replicated at sibling replicas in other data centers. To this end, sibling replicas periodically exchange KNOWNVEC_GLOBAL messages with their knownVec vectors, which they store in a matrix globalMatrix (lines 2:26 and 2:37). Thus, p_i^m has received all update transactions from p_j^m with commitVec[j] ≤ globalMatrix[i][j].

A replica p_d^m only forwards transactions when it suspects that a data center j may have failed before replicating all the update transactions originating at it to a data center i (this information is provided by a separate module). In this case, p_d^m executes FORWARD_REMOTE_TXS(i, j) (line 2:19). The function forwards the set of transactions txs received from p_j^m that have not been replicated at p_i^m according to globalMatrix[i][j]. For example, in Figure 1, UNISTORE will eventually invoke FORWARD_REMOTE_TXS(d1, d3) at replicas in d2 to forward t1. The replica p_d^m sends the transactions in txs to p_i^m in a REPLICATE message. If there are no update transactions to forward, p_d^m sends a heartbeat to p_i^m with knownVec[j].

UNISTORE periodically deletes from committedCausal transactions that have been replicated at every data center (omitted from the pseudocode for brevity).

5.6 On-Demand Durability and Client Migration

A client may wish to ensure that the transactions it has observed so far are durable. To this end, the client can call UNIFORM_BARRIER(V) at any replica in its local data center d, where V is the client's causal past pastVec (line 1:49). The replica returns to the client only when all the transactions from pastVec that originate at d are uniform, and thus durable. Then the same holds for all transactions from pastVec, because the protocol only exposes remote transactions to clients when they are already uniform (§5.2).

A client wishing to migrate from its local data center d to another data center i first calls UNIFORM_BARRIER(V) at any replica in d with V = pastVec, to ensure that the transactions the client has observed or issued at d will eventually become visible at i. The client then calls ATTACH(V) at any replica in i (line 1:51). The replica returns when its uniformVec includes all remote transactions from V (line 1:52). The client can then be sure that its transactions at i will operate on snapshots including all the transactions it has observed before.

6 Adding Strong Transactions

We now describe the full UNISTORE protocol with both causal and strong transactions. It is obtained by adding the highlighted lines to Algorithms 1-2 and a new Algorithm 3.

6.1 Metadata

The Conflict Ordering property of our consistency model requires any two conflicting strong transactions to be related one way or another by the causal order ≺ (§3). To ensure this, the protocol assigns to each strong transaction a scalar
strong timestamp, analogous to those used in optimistic concurrency control for serializability [69]. Several vectors used as metadata in the causal consistency protocol (§5.1) are then extended with an extra strong entry.

First, we extend commit vectors and those representing causally consistent snapshots. Commit vectors are compared using the previous order <, but considering all entries; as before, this order is consistent with the causal order ≺. Furthermore, conflicting strong transactions are causally ordered according to their strong timestamps.

PROPERTY 5. For any conflicting strong transactions t1 and t2 with commit vectors commitVec1 and commitVec2, we have: t1 ≺ t2 ⇐⇒ commitVec1[strong] < commitVec2[strong].

A consistent snapshot vector V now defines the set of transactions with a commit vector ≤ V, according to the new <. The vectors knownVec and stableVec maintained by a replica p_d^m are also extended with a strong entry. The entries knownVec[strong] and stableVec[strong] define the prefix of strong transactions that have been replicated at p_d^m and the local data center d, respectively:

PROPERTY 6. Replica p_d^m stores the updates to m by all strong transactions with commitVec[strong] ≤ knownVec[strong].

PROPERTY 7. Data center d stores the updates by all strong transactions with commitVec[strong] ≤ stableVec[strong].

To ensure Property 7, the strong entry of stableVec is updated at line 2:30 similarly to its other entries. We do not extend uniformVec, because our commit protocol for strong transactions automatically guarantees their uniformity.

6.2 Transaction Execution

UNISTORE uses optimistic concurrency control for strong transactions, with the same protocol for executing causal and speculatively executing strong transactions. To this end, Algorithm 1 is modified as follows. First, the computation of the snapshot vector snapVec[tid] is extended to compute the strong entry (line 1:7), which is now taken into account when checking that a replica state is up to date (line 1:21). The strong entry of the snapshot vector is computed so as to include all strong transactions known to be fully replicated in the local data center, as defined by stableVec[strong]. To ensure read your writes, the snapshot additionally includes strong transactions from the client's causal past, as defined by V[strong]. Finally, the coordinator of a transaction now maintains not only its write set, but also its read set rset that records all operations by the transaction, including read-only ones (line 1:14). The latter is used to certify strong transactions.

After speculatively executing a strong transaction, the client tries to commit it by calling COMMIT_STRONG at its coordinator (line 3:1). The coordinator first waits until the snapshot on which the transaction operated becomes uniform by calling UNIFORM_BARRIER (line 3:2): as we argued in §4, this is crucial for liveness. The coordinator next submits the transaction to a certification service, which determines whether the transaction commits or aborts (line 3:3, see §6.3). In the former case, the service also determines its commit vector, which the coordinator returns to the client. If the transaction commits, the client sets its causal past pastVec to the commit vector; otherwise, it re-executes the transaction.

The certification service also notifies replicas in all data centers about updates by strong transactions affecting them via DELIVER_UPDATES upcalls, invoked in an order consistent with strong timestamps of the transactions (line 3:4). A replica receiving an upcall adds the new operations to its log and refreshes knownVec[strong] to preserve Property 6.

Finally, a replica p_d^m that has not seen any strong transactions updating its partition m for a long time submits a dummy strong transaction that acts as a heartbeat (line 3:9). Similarly to heartbeats for causal transactions, this allows coping with skewed load distributions.

Algorithm 3 Committing strong transactions at p_d^m.
 1: function COMMIT_STRONG(tid)
 2:   UNIFORM_BARRIER(snapVec[tid])
 3:   return CERTIFY(tid, wbuff[tid], rset[tid], snapVec[tid])
 4: upon DELIVER_UPDATES(W)
 5:   for all ⟨wbuff, commitVec⟩ ∈ W in commitVec[strong] order do
 6:     for all ⟨k, op⟩ ∈ wbuff do
 7:       opLog[k] ← opLog[k] · ⟨op, commitVec⟩
 8:     knownVec[strong] ← commitVec[strong]
 9: function HEARTBEAT_STRONG()   ▷ Run periodically
10:   return CERTIFY(⊥, ∅, ∅, 0⃗)

6.3 Certification Service

We implement the certification service using an existing fault-tolerant protocol from [18], with transaction commit vectors computed using the techniques from [29]. The protocol integrates two-phase commit across partitions accessed by the transaction and Paxos among the replicas of each partition. It furthermore uses white-box optimizations between the two protocols to minimize the commit latency. The use of Paxos ensures that a committed strong transaction is durable and its updates will eventually be delivered at all correct data centers (line 3:4). For each partition, a single replica functions as the Paxos leader. The protocol is described and formally specified elsewhere [18], and here we discuss it only briefly. Its pseudocode and formal specification are given in §A and §C, respectively.

The certification service accepts the read and write sets of a transaction and its snapshot vector (line 3:3). Even though the service is distributed, it guarantees that commit/abort decisions are computed like in a centralized database with optimistic concurrency control, that is, in a total certification order. To ensure Conflict Ordering, the decisions are computed using a concurrency-control policy similar to that for serializability [69]: a transaction commits if its snapshot includes all
conflicting transactions that precede it in the certification order. The certification service also computes a commit vector for each committed transaction by copying its per-data center entries from the transaction's snapshot vector and assigning a strong timestamp consistent with the certification order.

7 Proof of Correctness

We have rigorously proved that UNISTORE correctly implements the specification of PoR consistency for the case when the data store manages last-writer-wins registers. The proof uses the formal framework from [13, 14, 16] and establishes Properties 1-7 stated earlier. Due to space constraints, we defer the proof to §D.

8 Evaluation

We have implemented UNISTORE and several other protocols (listed in the following) in the same codebase, consisting of 10.3K SLOC of Erlang. We evaluate the protocols on Amazon EC2 using m4.2xlarge VMs from 5 different regions. Each VM has 8 virtual cores and 32GB of RAM. The RTT between regions ranges from 26ms to 202ms. Unless otherwise stated, our experiments deploy 3 data centers, thus tolerating a single data center failure: Virginia (US-East), California (US-West) and Frankfurt (EU-FRA). All Paxos leaders are located in Virginia. By default we use 4 replica machines per data center. Each machine stores replicas of 8 partitions, matching the number of cores. Clients are hosted on separate machines in each data center. We run each experiment for at least 5 minutes, with the first and the last minute ignored. Replicas propagate local update transactions (line 2:1) and broadcast vectors (line 2:23) every 5ms.

8.1 Does UNISTORE combine causal and strong consistency effectively?

We start by analyzing the performance of UNISTORE using RUBiS, a popular benchmark that emulates an online auction website such as eBay [40, 41]. It defines 11 read-only transactions and 5 update transactions, e.g., selling items, bidding on items, and consulting outstanding auctions. As in previous work [41], to make the benchmark more challenging, we add an extra update transaction closeAuction to declare the winner of an auction. We also borrow from [41] a conflict relation between RUBiS transactions that preserves key integrity invariants in the PoR consistency model. This marks four transactions as strong (registerUser, storeBuyNow, storeBid and closeAuction) and declares three conflicts between them. For example, storeBid, which places a bid on an item, conflicts with closeAuction if both act on the same item: this is needed to preserve the invariant that the winner of an auction is the highest bidder. Our RUBiS database is configured according to the benchmark specification: it is populated with 33,000 items for sale and 1 million users; client think times are 500ms. We run the bidding mix workload of RUBiS with 15% of update transactions, which yields 10% of strong transactions.

We compare UNISTORE with STRONG, REDBLUE and CAUSAL. STRONG implements serializability [69] as a special case of UNISTORE where all transactions are strong and all pairs of operations conflict. REDBLUE implements red-blue consistency [40], which, like PoR, combines causal and strong consistency. However, it declares conflicts between all strong transactions. REDBLUE certifies strong transactions at a centralized replicated service, with a replica at each data center. CAUSAL implements causal consistency as a special case of UNISTORE where all transactions are causal. It cannot preserve the integrity invariants of RUBiS, but gives an upper bound on the expected performance.

Figure 3: RUBiS benchmark: throughput vs. average latency. (x-axis: throughput in Ktxs/s; y-axis: average latency in ms; series: UniStore, Strong, RedBlue, Causal.)

Throughput and average latency. Figure 3 evaluates average transaction latency and throughput. As the figure shows, UNISTORE exhibits a high throughput: 72% and 183% higher than REDBLUE and STRONG respectively at their saturation point. This is expected, as UNISTORE implements the consistency model that enables the most concurrency. STRONG classifies all transactions as strong. This impacts performance because executing a strong transaction is significantly more expensive than executing a causal one. REDBLUE uses a centralized certification service that saturates before UNISTORE's distributed service, creating a bottleneck. UNISTORE exhibits an average latency of 16.5ms, lower than the 80.4ms of STRONG. The latency of REDBLUE is comparable to that of UNISTORE. This is because both systems mark the same set of transactions as strong. Still, REDBLUE declares conflicts between all strong transactions and thus aborts more transactions than UNISTORE: 0.12% vs 0.027%. The clients whose transactions abort have to retry them, thus increasing latency. Since the abort rate remains low in both cases, the difference in latency is negligible in our experiment. We expect a more significant difference in workloads with higher contention. Finally, in comparison to CAUSAL, UNISTORE penalizes throughput by 45%. This is the unavoidable price to pay to preserve application-specific invariants.

Latency of each transaction type. In UNISTORE, the latency of strong transactions is dominated by the RTT between Virginia (the leader's region) and California (Virginia's closest data center), which is 61ms. Strong transactions exhibit a latency of 73.9ms on average. The latency varies depending on the
client's location: from 65.4ms on average at the leader's site to 93.2ms at the site furthest from the leader (Frankfurt). Since causal transactions do not require coordination between data centers, they exhibit a very low latency of 1.2ms on average, which is comparable to that of CAUSAL. This demonstrates that UNISTORE is able to mix causal and strong consistency effectively, as the latency of causal transactions remains low regardless of concurrently executing strong transactions.

8.2 How does UNISTORE scale with the number of machines?

We evaluate the peak throughput of UNISTORE as we increase the number of machines per each data center from 2 to 8, i.e., the number of partitions from 16 to 64. We use a microbenchmark with 100% of update transactions, where each transaction accesses three data items. We vary the ratio of strong transactions from 0% to 100% to understand their impact on scalability.

Figure 4: Scalability when varying the ratio of strong transactions with uniform data access (top) and under contention (bottom). (x-axis: ratio of strong transactions, 0% to 100%; y-axis: throughput in Ktxs/sec; series: UniStore-16, UniStore-32, UniStore-64.)

Scalability under low contention. For this set of experiments, the data items accessed by each transaction are picked uniformly at random. This yields a very low contention: e.g., with 16 partitions, the probability of two transactions accessing the same partition is 0.031. As shown by the top plot of Figure 4, UNISTORE is able to scale almost linearly even when the workload includes strong transactions: a 9.76% throughput drop compared to the optimal scalability. This is because, with uniform accesses, the task of committing transactions is balanced among partitions. Thus, when the number of partitions increases, so does the system's capacity. The scalability is not perfect due to the cost of the background protocol that computes stableVec, which grows logarithmically with the number of partitions. The plot also shows that strong transactions are expensive: 25.72% of throughput drop on average with 10% of strong transactions. The performance is dominated by the number of strong transactions that a partition can certify per second.

Impact of contention. For this set of experiments, we set the ratio of strong transactions that access a designated partition to 20% to create contention. As shown by the bottom plot of Figure 4, UNISTORE is still able to scale fairly well under contention. But, as expected, contention has an impact on scalability: a 17.15% throughput drop from the optimal scalability compared to the 9.76% throughput drop in the experiments without contention.

8.3 What is the cost of uniformity?

We compare CUREFT to UNIFORM. CUREFT implements Cure [3], a causally consistent data store, and makes it fault tolerant by adding transaction forwarding (§4). UNIFORM is a simplified version of UNISTORE that removes all the mechanisms related to strong transactions. UNIFORM tracks uniformity and makes remote transactions visible only when these are uniform; CUREFT does not. We use a microbenchmark with only causal transactions and 15% of update transactions. Each transaction accesses three data items.

Figure 5: Throughput penalty of tracking uniformity. (x-axis: 3, 4 and 5 data centers; y-axis: throughput in Ktxs/sec; series: Uniform, CureFT.)

Throughput penalty. Figure 5 evaluates the cost of tracking uniformity. It shows the peak throughput when the number of data centers increases from 3 to 5. We first add Ireland and then Brazil. As we do this, the throughput remains almost constant. This is because each data center stores replicas of all partitions and the computational power gained when adding a data center is offset by the cost of replicating update transactions. As the figure shows, the cost of tracking uniformity is small: a 7.97% drop on average. The gap grows as we increase the number of data centers: a 10.61% drop on average with 5 data centers. This is because, to track uniform transactions, sibling replicas exchange messages: the more data centers, the more messages exchanged. The penalty can be reduced by decreasing the frequency at which sibling replicas exchange their stableVec (line 2:25), at the expense of an extra delay in the visibility of remote transactions.

Reading from a uniform snapshot. Figure 6 evaluates the delay on the visibility of remote transactions when reading from a uniform snapshot. We deploy four data centers: Virginia, California, Frankfurt and Brazil. We set f = 2 to tolerate 2 data center failures (when f = 1, UNIFORM shows no delay). Under such a configuration, a data center makes a transaction visible when it knows that 3 data centers store the transaction and its dependencies (§5.4). The figure shows the cumulative distribution of the delay before updates from
California are visible in Brazil and Virginia.

Figure 6: Left: California to Brazil (best case). Right: California to Virginia (worst case). (x-axis: update visibility delay in ms; y-axis: CDF; series: Uniform, CureFT.)

The extra delay at Brazil is only 5ms at the 90th percentile. This is the best case scenario for UNIFORM because Brazil learns that Virginia stores a transaction originating at California only 2ms after receiving it. The worst case scenario for UNIFORM is when the origin and the destination data center are the closest ones. This is why the extra delay at Virginia is 92ms at the 90th percentile: Brazil learns that Frankfurt stores a transaction originating at California 88ms after receiving it. Note that when clients communicate only via the data store, the delay is unnoticeable. Even if clients communicate out of band, as the maximum extra delay is less than 100ms, it is unlikely that a client will miss an update.

9 Related Work

Systems with multiple consistency levels. A number of data stores have combined weak and strong consistency, including several commercial and academic systems that combine eventual and strong consistency [5, 6, 28, 48, 57, 64, 72]. Several academic data stores combined causal and strong consistency [9, 36, 40, 41, 58, 64]. Pileus [64] funnels all updates through a single data center. In the fault-tolerant version of lazy replication [36], causal operations require synchronization between replicas on their critical path. In both cases, causal operations are not highly available, defeating the benefits of causal consistency. Walter [58] restricts causal operations to a specific type and lacks fault tolerance due to the use of two-phase commit across data centers. The remaining works [9, 40, 41] support highly available causal operations, but are not fault tolerant. First, they do not make causal operations uniform on demand to guarantee the liveness of strong operations. Thus, they suffer from the liveness issue we explained in §4 (Figure 2). In addition, these systems do not use fault-tolerant mechanisms even for strong transactions. They guard the use of strong transactions using mechanisms similar to locks; if the lock holder fails before releasing it, no other data center can execute a strong transaction requiring the same lock. This occurs even when the service handing out locks is fault-tolerant, as in [41]. Finally, the above systems either do not include mechanisms for partitioning the key space among different machines in a data center or include per-data center centralized services, which limits their scalability (§8.2).

Some group communication systems mix causal and atomic broadcast [10, 67]. However, these systems do not provide mechanisms for maintaining transactional data consistency.

Several papers have proposed tools that use formal verification technology to ensure that consistency choices do not violate application invariants [9, 30, 33, 39, 50, 51]. Such tools can make it easier for programmers to use our system.

Causal consistency implementations. Our subprotocol for causal consistency belongs to a family of highly scalable protocols that avoid using any centralized components or dependency check messages [3, 22, 59–61]; other alternatives are less scalable [4, 8, 12, 21, 31, 42, 43, 47, 71]. While we base our causal consistency subprotocol on an existing one, Cure [3], we have extended it in nontrivial ways, by integrating mechanisms for tracking uniformity (§5.4) and for transaction forwarding (§5.5). Some of the above protocols [31, 60] use hybrid clocks instead of real time [35] to improve performance with large clock skews; this technique can also be integrated into UNISTORE.

SwiftCloud [71] implements k-stability [45], a notion similar to uniformity, to enable client migration. SwiftCloud relies on a single per-data center sequencer, which makes tracking k-stability easy, but the data store less scalable. Our protocol is more sophisticated, since we distribute the responsibility of tracking uniformity among the replicas in a data center.

Paxos variants. Several Paxos variants [24, 38, 49, 52] lower the latency by allowing commutative operations to execute at replicas in arbitrary orders. In contrast to them, UNISTORE implements PoR consistency, which allows causal transactions to execute without any synchronization at all.

10 Conclusion

This paper presented UNISTORE, the first fault-tolerant and scalable data store that combines causal and strong consistency. UNISTORE carefully integrates state-of-the-art scalable protocols and extends them in nontrivial ways. To maintain liveness despite data center failures, unlike previous work, UNISTORE commits a strong transaction only when all its causal dependencies are uniform. Our results show that UNISTORE combines causal and strong consistency effectively: 3.7× lower latency on average than a strongly consistent system with 1.2ms latency on average for causal transactions. We expect that the key ideas in UNISTORE will pave the way for practical systems that combine causal and strong consistency.

Acknowledgements. We thank our shepherd, Heming Cui, as well as Gregory Chockler, Vitor Enes, Luís Rodrigues and Marc Shapiro for comments and suggestions. This work was partially supported by an ERC Starting Grant RACCOON, the Juan de la Cierva Formación funding scheme (FJC2018-036528-I), the CCF-Tencent Open Fund (CCF-Tencent RAGR20200124) and the AWS Cloud Credit for Research program.

References

[1] D. Abadi. Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. IEEE Computer, 45(2), 2012.
[2] M. Ahamad, G. Neiger, J. E. Burns, P. Kohli, and P. W. Hutto. Causal memory: Definitions, implementation, and programming. Distributed Comput., 9(1), 1995.
[3] D. D. Akkoorath, A. Z. Tomsic, M. Bravo, Z. Li, T. Crain, A. Bieniusa, N. Preguiça, and M. Shapiro. Cure: Strong semantics meets high availability and low latency. In International Conference on Distributed Computing Systems (ICDCS), 2016.
[4] S. Almeida, J. Leitão, and L. Rodrigues. ChainReaction: A causal+ consistent datastore based on chain replication. In European Conference on Computer Systems (EuroSys), 2013.
[5] Amazon. Read consistency. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html, 2020.
[6] . Read repair. https://cassandra.apache.org/doc/latest/operating/read_repair.html, 2020.
[7] H. Attiya, F. Ellen, and A. Morrison. Limitations of highly-available eventually-consistent data stores. IEEE Trans. Parallel Distributed Syst., 28(1), 2017.
[8] P. Bailis, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Bolt-on causal consistency. In International Conference on Management of Data (SIGMOD), 2013.
[9] V. Balegas, N. Preguiça, R. Rodrigues, S. Duarte, C. Ferreira, M. Najafzadeh, and M. Shapiro. Putting the consistency back into eventual consistency. In European Conference on Computer Systems (EuroSys), 2015.
[10] K. Birman, A. Schiper, and P. Stephenson. Lightweight causal and atomic group multicast. ACM Trans. Comput. Syst., 9(3), 1991.
[11] K. P. Birman and T. A. Joseph. Reliable communication in the presence of failures. ACM Trans. Comput. Syst., 5(1), 1987.
[12] M. Bravo, L. Rodrigues, and P. Van Roy. Saturn: A distributed metadata service for causal consistency. In European Conference on Computer Systems (EuroSys), 2017.
[13] S. Burckhardt. Principles of Eventual Consistency. Now Publishers, 2014.
[14] S. Burckhardt, A. Gotsman, H. Yang, and M. Zawirski. Replicated data types: specification, verification, optimality. In Symposium on Principles of Programming Languages (POPL), 2014.
[15] C. Cachin, R. Guerraoui, and L. E. T. Rodrigues. Introduction to Reliable and Secure Distributed Programming (2nd ed.). Springer, 2011.
[16] A. Cerone, G. Bernardi, and A. Gotsman. A framework for transactional consistency models with atomic visibility. In International Conference on Concurrency Theory (CONCUR), 2015.
[17] Y. L. Chen, S. Mu, J. Li, C. Huang, J. Li, A. Ogus, and D. Phillips. Giza: Erasure coding objects across global data centers. In USENIX Annual Technical Conference (USENIX ATC), 2017.
[18] G. Chockler and A. Gotsman. Multi-shot distributed transaction commit. In Symposium on Distributed Computing (DISC), 2018.
[19] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. C. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google’s Globally-Distributed Database. In Symposium on Operating Systems Design and Implementation (OSDI), 2012.
[20] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value store. In Symposium on Operating Systems Principles (SOSP), 2007.
[21] J. Du, S. Elnikety, A. Roy, and W. Zwaenepoel. Orbe: Scalable causal consistency using dependency matrices and physical clocks. In Symposium on Cloud Computing (SoCC), 2013.
[22] J. Du, C. Iorgulescu, A. Roy, and W. Zwaenepoel. Gentlerain: Cheap and scalable causal consistency with physical clocks. In Symposium on Cloud Computing (SoCC), 2014.
[23] C. Dwork, N. Lynch, and L. Stockmeyer. Consensus in the Presence of Partial Synchrony. Journal of the ACM, 35(2), 1988.
[24] V. Enes, C. Baquero, T. F. Rezende, A. Gotsman, M. Perrin, and P. Sutra. State-machine replication for planet-scale systems. In European Conference on Computer Systems (EuroSys), 2020.

[25] FaunaDB. What is FaunaDB? https://docs.fauna.com/fauna/current/introduction.html, 2020.
[26] C. Fidge. Timestamps in message-passing systems that preserve the partial ordering. In Australian Computer Science Conference (ASCS), 1988.
[27] S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2), 2002.
[28] Google. Balancing strong and eventual consistency with datastore. https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore, 2020.
[29] A. Gotsman, A. Lefort, and G. Chockler. White-box atomic multicast. In International Conference on Dependable Systems and Networks (DSN), 2019.
[30] A. Gotsman, H. Yang, C. Ferreira, M. Najafzadeh, and M. Shapiro. ’Cause I’m strong enough: reasoning about consistency choices in distributed systems. In Symposium on Principles of Programming Languages (POPL), 2016.
[31] C. Gunawardhana, M. Bravo, and L. Rodrigues. Unobtrusive deferred update stabilization for efficient geo-replication. In USENIX Annual Technical Conference (USENIX ATC), 2017.
[32] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3), 1990.
[33] G. Kaki, K. Earanky, K. C. Sivaramakrishnan, and S. Jagannathan. Safe replication through bounded concurrency verification. Proc. ACM Program. Lang., 2(OOPSLA), 2018.
[34] T. Kraska, G. Pang, M. J. Franklin, S. Madden, and A. Fekete. MDCC: Multi-data center consistency. In European Conference on Computer Systems (EuroSys), 2013.
[35] S. S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, and M. Leone. Logical physical clocks. In International Conference on Principles of Distributed Systems (OPODIS), 2014.
[36] R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing high availability using lazy replication. ACM Trans. Comput. Syst., 10(4), 1992.
[37] L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2), 1998.
[38] L. Lamport. Generalized consensus and Paxos. Technical Report MSR-TR-2005-33, Microsoft Research, 2005.
[39] C. Li, J. Leitão, A. Clement, N. M. Preguiça, R. Rodrigues, and V. Vafeiadis. Automating the choice of consistency levels in replicated systems. In USENIX Annual Technical Conference (USENIX ATC), 2014.
[40] C. Li, D. Porto, A. Clement, R. Rodrigues, N. Preguiça, and J. Gehrke. Making geo-replicated systems fast if possible, consistent when necessary. In Symposium on Operating Systems Design and Implementation (OSDI), 2012.
[41] C. Li, N. Preguiça, and R. Rodrigues. Fine-grained consistency for geo-replicated systems. In USENIX Annual Technical Conference (USENIX ATC), 2018.
[42] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don’t settle for eventual: Scalable causal consistency for wide-area storage with cops. In Symposium on Operating Systems Principles (SOSP), 2011.
[43] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Stronger semantics for low-latency geo-replicated storage. In Conference on Networked Systems Design and Implementation (NSDI), 2013.
[44] P. Mahajan, L. Alvisi, and M. Dahlin. Consistency, availability, and convergence. Technical Report TR-11-22, University of Texas at Austin, 2011.
[45] P. Mahajan, S. Setty, S. Lee, A. Clement, L. Alvisi, M. Dahlin, and M. Walfish. Depot: Cloud storage with minimal trust. ACM Trans. Comput. Syst., 29(4), 2011.
[46] F. Mattern. Virtual time and global clocks in distributed systems. In International Workshop on Parallel and Distributed Algorithms, 1988.
[47] S. A. Mehdi, C. Littley, N. Crooks, L. Alvisi, N. Bronson, and W. Lloyd. I can’t believe it’s not causal! Scalable causal consistency with no slowdown cascades. In Conference on Networked Systems Design and Implementation (NSDI), 2017.
[48] Microsoft. Consistency levels in Azure Cosmos DB. https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels, 2020.
[49] I. Moraru, D. G. Andersen, and M. Kaminsky. There is more consensus in egalitarian parliaments. In Symposium on Operating Systems Principles (SOSP), 2013.
[50] S. S. Nair, G. Petri, and M. Shapiro. Proving the safety of highly-available distributed objects. In European Symposium on Programming (ESOP), 2020.

[51] M. Najafzadeh, A. Gotsman, H. Yang, C. Ferreira, and M. Shapiro. The CISE tool: proving weakly-consistent applications correct. In Workshop on the Principles and Practice of Consistency for Distributed Data (PaPoC), 2016.
[52] F. Pedone and A. Schiper. Generic broadcast. In International Symposium on Distributed Computing (DISC), 1999.
[53] K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers. Flexible update propagation for weakly consistent replication. In Symposium on Operating Systems Principles (SOSP), 1997.
[54] Redis Labs. Causal consistency in an active-active database. https://docs.redislabs.com/latest/rs/administering/database-operations/causal-consistency-crdb/, 2020.
[55] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv., 22(4), 1990.
[56] M. Shapiro, N. M. Preguiça, C. Baquero, and M. Zawirski. Conflict-free replicated data types. In International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), 2011.
[57] D. Shukla, S. Thota, K. Raman, M. Gajendran, A. Shah, S. Ziuzin, K. Sundaram, M. G. Guajardo, A. Wawrzyniak, S. Boshra, R. Ferreira, M. Nassar, M. Koltachev, J. Huang, S. Sengupta, J. J. Levandoski, and D. B. Lomet. Schema-agnostic indexing with azure documentdb. Proc. VLDB Endow., 8(12), 2015.
[58] Y. Sovran, R. Power, M. K. Aguilera, and J. Li. Transactional storage for geo-replicated systems. In Symposium on Operating Systems Principles (SOSP), 2011.
[59] K. Spirovska, D. Didona, and W. Zwaenepoel. Wren: Nonblocking reads in a partitioned transactional causally consistent data store. In International Conference on Dependable Systems and Networks (DSN), 2018.
[60] K. Spirovska, D. Didona, and W. Zwaenepoel. Paris: Causally consistent transactions with non-blocking reads and partial replication. In International Conference on Distributed Computing Systems (ICDCS), 2019.
[61] K. Spirovska, D. Didona, and W. Zwaenepoel. Optimistic causal consistency for geo-replicated key-value stores. IEEE Transactions on Parallel and Distributed Systems, 32(3), 2020.
[62] R. Taft, I. Sharif, A. Matei, N. VanBenschoten, J. Lewis, T. Grieger, K. Niemi, A. Woods, A. Birzin, R. Poss, P. Bardea, A. Ranade, B. Darnell, B. Gruneir, J. Jaffray, L. Zhang, and P. Mattis. CockroachDB: The Resilient Geo-Distributed SQL Database. In International Conference on Management of Data (SIGMOD), 2020.
[63] D. B. Terry, A. J. Demers, K. Petersen, M. Spreitzer, M. Theimer, and B. W. Welch. Session guarantees for weakly consistent replicated data. In International Conference on Parallel and Distributed Information Systems (PDIS), 1994.
[64] D. B. Terry, V. Prabhakaran, R. Kotla, M. Balakrishnan, M. K. Aguilera, and H. Abu-Libdeh. Consistency-based service level agreements for cloud storage. In Symposium on Operating Systems Principles (SOSP), 2013.
[65] D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer, and C. H. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Symposium on Operating Systems Principles (SOSP), 1995.
[66] M. Tyulenev, A. Schwerin, A. Kamsky, R. Tan, A. Cabral, and J. Mulrow. Implementation of cluster-wide logical clock and causal consistency in MongoDB. In International Conference on Management of Data (SIGMOD), 2019.
[67] R. van Renesse, K. P. Birman, and S. Maffeis. Horus: A flexible group communication system. Commun. ACM, 39(4), 1996.
[68] W. Vogels. Eventually consistent. CACM, 52(1), 2009.
[69] G. Weikum and G. Vossen. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann Publishers Inc., 2001.
[70] YugabyteDB. Replication. https://docs.yugabyte.com/latest/architecture/docdb-replication/replication/, 2020.
[71] M. Zawirski, N. Preguiça, S. Duarte, A. Bieniusa, V. Balegas, and M. Shapiro. Write fast, read in the past: Causal consistency for client-side applications. In International Middleware Conference (Middleware), 2015.
[72] I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. K. Ports. Building consistent transactions with inconsistent replication. ACM Trans. Comput. Syst., 35(4), 2018.

A The Full UNISTORE Protocol for LWW Registers

Algorithms A1–A10 given in this section define the full UNISTORE protocol, including parts omitted from the main text. This version of the protocol is specialized to the case when the data store manages last-writer-wins (LWW) registers.

Algorithm A1 shows the pseudocode of clients. We assume that each client is associated with a unique client identifier. Each client maintains the following variables:
• lc: The Lamport clock at this client.
• d: The data center to which this client is currently connected.
• p: The coordinator partition of the current ongoing transaction.
• tid: The identifier of the current ongoing transaction.
• pastVec: The client’s causal past.

A client interacts with UNISTORE via the following procedures:
• tid ← START(): Start a transaction and obtain an identifier tid.
• v ← READ(k): Invoke a read operation on key k in the ongoing transaction and obtain a return value v.
• ok ← UPDATE(k,v): Invoke an update operation on key k and value v in the ongoing transaction.
• ok ← COMMIT_CAUSAL_TX(): Commit a causal transaction.
• dec ← COMMIT_STRONG_TX(): Try to commit a strong transaction and obtain a decision dec ∈ {COMMIT, ABORT}.
• ok ← CL_UNIFORM_BARRIER(): Execute a uniform barrier.
• ok ← CL_ATTACH(j): Attach to data center j.

Algorithms A2–A6 show the pseudocode of replicas. The code needed for strong transactions in Algorithms A2, A3, and A5 is highlighted in red. Each replica $p_d^m$ maintains a set of variables as follows.
• clock: The current time at $p_d^m$.
• rset: The read sets of transactions coordinated by $p_d^m$, indexed by transaction identifier tid.
• wbuff: The write buffers of transactions coordinated by $p_d^m$, indexed by transaction identifier tid, partition l, and key k.
• snapVec: The snapshot vectors of transactions coordinated by $p_d^m$, indexed by transaction identifier tid.
• opLog: The log of updates performed on keys managed by $p_d^m$, indexed by key k.
• knownVec, stableVec, uniformVec: The vectors used by $p_d^m$ to track what is replicated where.
• preparedCausal: The set of causal transactions local to $p_d^m$ that are prepared to commit.
• committedCausal: For each data center i, committedCausal[i] stores transactions waiting to be replicated by $p_d^m$ to sibling replicas at other data centers than i.
• localMatrix: The set of knownVec received by $p_d^m$ from other partitions in data center d. It is used to compute stableVec.
• stableMatrix: The set of stableVec received by $p_d^m$ from sibling replicas. It is used to compute uniformVec.
• globalMatrix: The set of knownVec received by $p_d^m$ from sibling replicas. It is used to track what has been replicated at sibling replicas.

We specialize the UNISTORE protocol to LWW registers in several ways. First, we add code for managing Lamport clocks, highlighted in blue. In particular, each (committed) transaction is associated with a Lamport timestamp, equal to the value of lc at its client when the transaction completes (lines A1:14 and A1:23). Lamport timestamps are totally ordered, with client identifiers used for tie-breaking (see also Definition 54).

Second, in Algorithm A2 we replace DO_OP by the following two procedures:
• v ← DO_READ(tid,k): Execute a read operation on key k in a transaction with identifier tid and obtain a value v.
• DO_UPDATE(tid,k,v): Execute an update operation on key k and value v in a transaction with identifier tid.

Finally, in Algorithm A3 we modify the handler of message GET_VERSION:
• GET_VERSION(snapVec,k): Read the latest value from key k based on the snapshot vector snapVec. Specifically, it returns the last update to key k of the transaction with the latest commitVec in terms of their Lamport clock order such that commitVec ≤ snapVec (line A3:5).

Algorithms A7–A10 contain the implementation of the transaction certification service (see also §C). The certification service uses an instance of the leader election failure detector $\Omega_m$ for each partition m [footnote 1]. This primitive ensures that from some point on, all correct processes nominate the same correct process as the leader. For the case when the data store manages only registers, we assume that any two writes to the same object conflict. Other data types and conflict relations can be easily supported by modifying the certification check in Algorithm A8.

Footnote 1: Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. J. ACM, 1996.

Algorithm A1 Client operations at client cl
 1: function START()
 2:     p ← an arbitrary replica in data center d
 3:     tid ← remote call START_TX(pastVec) at p        ▷ ts(START) ← snapVec_d^p[tid]
 4:     return tid

 5: function READ(k)
 6:     ⟨v, lc′⟩ ← remote call DO_READ(tid, k) at p
 7:     if lc′ ≠ ⊥ then
 8:         lc ← max{lc, lc′}
 9:     return v

10: function UPDATE(k, v)
11:     remote call DO_UPDATE(tid, k, v) at p
12:     return ok

13: function COMMIT_CAUSAL_TX()
14:     lc ← lc + 1                                      ▷ lclock(COMMIT_CAUSAL_TX) ← lc
15:     vc ← remote call COMMIT_CAUSAL(tid, lc) at p
16:     pastVec ← vc                                     ▷ ts(COMMIT_CAUSAL_TX) ← pastVec
17:     return ok

18: function COMMIT_STRONG_TX()
19:     lc ← lc + 1
20:     ⟨dec, vc, lc′⟩ ← remote call COMMIT_STRONG(tid, lc) at p
21:     if dec = COMMIT then
22:         pastVec ← vc                                 ▷ ts(COMMIT_STRONG_TX) ← pastVec
23:         lc ← lc′                                     ▷ lclock(COMMIT_STRONG_TX) ← lc
24:     return dec

25: function CL_UNIFORM_BARRIER()
26:     var p ← an arbitrary replica in data center d
27:     remote call UNIFORM_BARRIER(pastVec) at p        ▷ ts(CL_UNIFORM_BARRIER) ← pastVec
28:     lc ← lc + 1                                      ▷ lclock(CL_UNIFORM_BARRIER) ← lc
29:     return ok

30: function CL_ATTACH(j)
31:     var p ← an arbitrary replica in data center j
32:     remote call ATTACH(pastVec) at p                 ▷ ts(CL_ATTACH) ← pastVec
33:     lc ← lc + 1                                      ▷ lclock(CL_ATTACH) ← lc
34:     d ← j
35:     return ok
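
The Lamport-clock bookkeeping of Algorithm A1 (READ and COMMIT_CAUSAL_TX) can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the `Client` and `FakeCoordinator` classes, their method names, and the single-entry commit vector are hypothetical stand-ins for the remote calls to the coordinator replica p.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


class FakeCoordinator:
    """Hypothetical in-memory stand-in for the coordinator replica p."""

    def __init__(self) -> None:
        self.store: Dict[str, Tuple[str, int]] = {}  # key -> (value, Lamport ts)
        self.pending: Dict[str, str] = {}            # buffered writes

    def do_read(self, key: str) -> Tuple[Optional[str], Optional[int]]:
        if key in self.pending:
            # Reading a buffered write returns no Lamport timestamp (⊥).
            return self.pending[key], None
        return self.store.get(key, (None, 0))

    def do_update(self, key: str, value: str) -> None:
        self.pending[key] = value

    def commit_causal(self, lc: int) -> Dict[str, int]:
        for key, value in self.pending.items():
            self.store[key] = (value, lc)
        self.pending.clear()
        return {"d": lc}  # assumed placeholder for the commit vector


@dataclass
class Client:
    coordinator: FakeCoordinator
    lc: int = 0                                   # Lamport clock
    past_vec: Dict[str, int] = field(default_factory=dict)

    def read(self, key: str) -> Optional[str]:
        value, remote_lc = self.coordinator.do_read(key)
        if remote_lc is not None:                 # lines 7-8 of Algorithm A1
            self.lc = max(self.lc, remote_lc)
        return value

    def update(self, key: str, value: str) -> None:
        self.coordinator.do_update(key, value)

    def commit_causal_tx(self) -> None:
        self.lc += 1                              # line 14 of Algorithm A1
        self.past_vec = self.coordinator.commit_causal(self.lc)
```

In this sketch, a client that reads a value written with Lamport timestamp 1 advances its own clock to at least 1, so its next commit is stamped with a strictly larger timestamp.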

Algorithm A2 Transaction coordinator at p_d^m: causal commit
 1: function START_TX(V)
 2:     for i ∈ D \ {d} do
 3:         uniformVec[i] ← max{V[i], uniformVec[i]}
 4:     var tid ← generate_tid()
 5:     snapVec[tid] ← uniformVec
 6:     snapVec[tid][d] ← max{V[d], uniformVec[d]}
 7:     snapVec[tid][strong] ← max{V[strong], stableVec[strong]}
 8:     return tid

 9: function DO_READ(tid, k)
10:     var l ← partition(k)
11:     if wbuff[tid][l][k] ≠ ⊥ then
12:         return ⟨wbuff[tid][l][k], ⊥⟩
13:     send GET_VERSION(snapVec[tid], k) to p_d^l
14:     wait receive VERSION(v, lc) from p_d^l
15:     rset[tid] ← rset[tid] ∪ {k}
16:     return ⟨v, lc⟩

17: function DO_UPDATE(tid, k, v)
18:     var l ← partition(k)
19:     wbuff[tid][l][k] ← v
20:     rset[tid] ← rset[tid] ∪ {k}

21: function COMMIT_CAUSAL(tid, lc)
22:     var L ← {l | wbuff[tid][l] ≠ ∅}
23:     if L = ∅ then
24:         return snapVec[tid]
25:     send PREPARE(tid, wbuff[tid][l], snapVec[tid]) to p_d^l, l ∈ L
26:     var commitVec ← snapVec[tid]
27:     for all l ∈ L do
28:         wait receive PREPARE_ACK(tid, ts) from p_d^l
29:         commitVec[d] ← max{commitVec[d], ts}
30:     send COMMIT(tid, commitVec, lc) to p_d^l, l ∈ L
31:     return commitVec
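
The way COMMIT_CAUSAL derives a commit vector (lines 22–31 of Algorithm A2) can be illustrated with a small Python sketch. The function name `causal_commit_vec` and the dictionary encoding of vectors are assumptions made for this example; message passing is replaced by a plain list of prepare timestamps.

```python
from typing import Dict, Iterable


def causal_commit_vec(snap_vec: Dict[str, int],
                      local_dc: str,
                      prepare_ts: Iterable[int]) -> Dict[str, int]:
    """Return the commit vector of a causal transaction.

    The vector starts as a copy of the snapshot vector; only the local
    data center entry is raised to the largest prepare timestamp
    reported by the involved partitions.
    """
    commit_vec = dict(snap_vec)          # do not mutate the snapshot
    for ts in prepare_ts:
        commit_vec[local_dc] = max(commit_vec[local_dc], ts)
    return commit_vec
```

With no updated partitions (an empty `prepare_ts`), the result equals the snapshot vector, which mirrors the early return for read-only transactions on lines 23–24.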

Algorithm A3 Transaction execution at p_d^m
 1: when received GET_VERSION(snapVec, k) from p
 2:     for i ∈ D \ {d} do
 3:         uniformVec[i] ← max{snapVec[i], uniformVec[i]}
 4:     wait until knownVec[d] ≥ snapVec[d] ∧ knownVec[strong] ≥ snapVec[strong]
 5:     ⟨v, commitVec, lc⟩ ← snapshot(opLog[k], snapVec)    ▷ returns the last update to key k by a transaction
 6:                                                          ▷ with the highest Lamport timestamp such that commitVec ≤ snapVec
 7:     send VERSION(v, lc) to p

 8: when received PREPARE(tid, wbuff, snapVec) from p
 9:     for i ∈ D \ {d} do
10:         uniformVec[i] ← max{snapVec[i], uniformVec[i]}
11:     var ts ← clock
12:     preparedCausal ← preparedCausal ∪ {⟨tid, wbuff, ts⟩}
13:     send PREPARE_ACK(tid, ts) to p

14: when received COMMIT(tid, commitVec, lc)
15:     wait until clock ≥ commitVec[d]
16:     ⟨tid, wbuff, _⟩ ← find(tid, preparedCausal)
17:     preparedCausal ← preparedCausal \ {⟨tid, _, _⟩}
18:     for all ⟨k, v⟩ ∈ wbuff do
19:         opLog[k] ← opLog[k] · ⟨v, commitVec, lc⟩
20:     committedCausal[d] ← committedCausal[d] ∪ {⟨tid, wbuff, commitVec, lc⟩}

21: function UNIFORM_BARRIER(V)
22:     wait until uniformVec[d] ≥ V[d]

23: function ATTACH(V)
24:     wait until ∀i ∈ D \ {d}. uniformVec[i] ≥ V[i]
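
The last-writer-wins selection performed by snapshot(opLog[k], snapVec) on line 5 of Algorithm A3 can be sketched as below. The `Version` record, the `(lc, client_id)` pair used for tie-breaking, and the helper `leq` are assumptions for this example; the pseudocode itself stores only ⟨v, commitVec, lc⟩ in the log.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Version(NamedTuple):
    value: str
    commit_vec: Dict[str, int]
    lamport: Tuple[int, int]   # (Lamport clock, client id), assumed encoding


def leq(a: Dict[str, int], b: Dict[str, int]) -> bool:
    """Entry-wise comparison of two vectors over the same entries."""
    return all(a[i] <= b[i] for i in a)


def snapshot(op_log: List[Version],
             snap_vec: Dict[str, int]) -> Optional[Version]:
    """Pick the visible version with the highest Lamport timestamp."""
    visible = [v for v in op_log if leq(v.commit_vec, snap_vec)]
    return max(visible, key=lambda v: v.lamport, default=None)
```

Note that the winner is chosen by Lamport timestamp among the versions included in the snapshot, not by position in the log.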

Algorithm A4 Transaction replication at p_d^m
 1: function PROPAGATE_LOCAL_TXS()                        ▷ Run periodically
 2:     if preparedCausal = ∅ then
 3:         knownVec[d] ← clock
 4:     else
 5:         knownVec[d] ← min{ts | ⟨_, _, ts⟩ ∈ preparedCausal} − 1
 6:     var txs ← {⟨_, _, commitVec, _⟩ ∈ committedCausal[d] | commitVec[d] ≤ knownVec[d]}
 7:     if txs ≠ ∅ then
 8:         send REPLICATE(d, txs) to p_i^m, i ∈ D \ {d}
 9:         committedCausal[d] ← committedCausal[d] \ txs
10:     else
11:         send HEARTBEAT(d, knownVec[d]) to p_i^m, i ∈ D \ {d}

12: when received REPLICATE(i, txs)
13:     for all ⟨tid, wbuff, commitVec, lc⟩ ∈ txs in commitVec[i] order do
14:         if commitVec[i] > knownVec[i] then
15:             for all ⟨k, v⟩ ∈ wbuff do
16:                 opLog[k] ← opLog[k] · ⟨v, commitVec, lc⟩
17:             committedCausal[i] ← committedCausal[i] ∪ {⟨tid, wbuff, commitVec, lc⟩}
18:             knownVec[i] ← commitVec[i]

19: when received HEARTBEAT(i, ts)
20:     pre: ts > knownVec[i]
21:     knownVec[i] ← ts

22: function FORWARD_REMOTE_TXS(i, j)                     ▷ forward transactions received from data center j ≠ d to data center i ∉ {d, j}
23:     var txs ← {⟨tid, _, commitVec, _⟩ ∈ committedCausal[j] | commitVec[j] > globalMatrix[i][j]}
24:     if txs ≠ ∅ then
25:         send REPLICATE(j, txs) to p_i^m
26:     else
27:         send HEARTBEAT(j, knownVec[j]) to p_i^m
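
Lines 2–6 of PROPAGATE_LOCAL_TXS can be made concrete with the following Python sketch. The function names and the dictionary representation of transactions are hypothetical; the sketch only shows how the local knownVec entry is bounded by the oldest prepared transaction and which committed transactions become eligible for replication.

```python
from typing import Dict, List


def local_known_ts(clock: int, prepared_ts: List[int]) -> int:
    """Timestamp up to which all local transactions are known.

    With no prepared transactions, everything up to the current clock is
    known; otherwise the bound stays just below the smallest prepare
    timestamp, since that transaction may still commit at it.
    """
    if not prepared_ts:
        return clock
    return min(prepared_ts) - 1


def ready_to_replicate(committed: List[Dict],
                       local_dc: str,
                       known_ts: int) -> List[Dict]:
    """Select committed transactions whose local commit timestamp is covered."""
    return [tx for tx in committed if tx["commit_vec"][local_dc] <= known_ts]
```

Holding the bound below the smallest prepare timestamp ensures that a remote replica never advances knownVec past a transaction it has not yet received.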

Algorithm A5 Updating metadata at p_d^m
 1: function BROADCAST_VECS()                             ▷ Run periodically
 2:     send KNOWNVEC_LOCAL(m, knownVec) to p_d^l, l ∈ P
 3:     send STABLEVEC(d, stableVec) to p_i^m, i ∈ D
 4:     send KNOWNVEC_GLOBAL(d, knownVec) to p_i^m, i ∈ D

 5: when received KNOWNVEC_LOCAL(l, knownVec)
 6:     localMatrix[l] ← knownVec
 7:     for i ∈ D do
 8:         stableVec[i] ← min{localMatrix[n][i] | n ∈ P}
 9:     stableVec[strong] ← min{localMatrix[n][strong] | n ∈ P}

10: when received STABLEVEC(i, stableVec)
11:     stableMatrix[i] ← stableVec
12:     G ← all groups with f + 1 replicas that include p_d^m
13:     for j ∈ D do
14:         var ts ← max{min{stableMatrix[h][j] | h ∈ g} | g ∈ G}
15:         uniformVec[j] ← max{uniformVec[j], ts}

16: when received KNOWNVEC_GLOBAL(l, knownVec)
17:     globalMatrix[l] ← knownVec
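
The two aggregation steps of Algorithm A5 (lines 7–8 and 12–14) can be sketched in Python as follows. The function names and the nested-dictionary layout of localMatrix and stableMatrix are assumptions for this example, and the sketch presumes that at least f other data centers have reported a stable vector.

```python
from itertools import combinations
from typing import Dict


def stable_vec(local_matrix: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Entry-wise minimum of the knownVec of all partitions in a data center."""
    entries = next(iter(local_matrix.values())).keys()
    return {i: min(row[i] for row in local_matrix.values()) for i in entries}


def uniform_ts(stable_matrix: Dict[str, Dict[str, int]],
               self_dc: str, j: str, f: int) -> int:
    """Largest timestamp for data center j stored by some group of f + 1
    data centers that includes the local one (line 14 of Algorithm A5)."""
    others = [h for h in stable_matrix if h != self_dc]
    groups = [(self_dc,) + g for g in combinations(others, f)]
    return max(min(stable_matrix[h][j] for h in g) for g in groups)
```

For instance, with three data centers and f = 1, a timestamp counts as uniform once the local data center and at least one other data center both report it as stable.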

Algorithm A6 Committing strong transactions at p_d^m
 1: function COMMIT_STRONG(tid, lc)
 2:     UNIFORM_BARRIER(snapVec[tid])
 3:     ⟨dec, vc, lc⟩ ← CERTIFY(NORMAL, tid, wbuff[tid], rset[tid], snapVec[tid], lc)
 4:     return ⟨dec, vc, lc⟩

 5: upon DELIVER_UPDATES(txs)
 6:     for ⟨_, wbuff, commitVec, lc⟩ ∈ txs in commitVec[strong] order do
 7:         for ⟨k, v⟩ ∈ wbuff do
 8:             opLog[k] ← opLog[k] · ⟨v, commitVec, lc⟩
 9:         knownVec[strong] ← commitVec[strong]

10: function HEARTBEAT_STRONG()                           ▷ Run periodically
11:     return CERTIFY(NORMAL, ⊥, ∅, ∅, $\vec{0}$, ⊥)

Algorithm A7 Certification service at coordinator p_d^m
 1: function CERTIFY(callerMode, tid, wbuff, rset, snapVec, lc)
 2:     var rid ← generate_req_id()
 3:     var L ← {l | wbuff[l] ≠ ∅} ∪ {partition(k) | k ∈ rset}
 4:     repeat
 5:         send PREPARE_STRONG(rid, callerMode, tid, wbuff, rset, snapVec, lc) to Ω_l, l ∈ L
 6:         async wait receive ALREADY_DECIDED(tid, decision, commitVec, lc)
                ∨ receive UNKNOWN_TX_ACK(l, rid, tid) from a quorum from some l ∈ L
                ∨ receive ACCEPT_ACK(l, b_l, tid, vote_l, ts_l, lc_l) from a quorum for all l ∈ L
 7:     until not timeout
 8:     if received a quorum of UNKNOWN_TX_ACK(l, rid, tid) then
 9:         return UNKNOWN
10:     else if received ALREADY_DECIDED(tid, decision, commitVec, lc) then
11:         send DECISION(ballot, tid, decision, commitVec, lc) to Ω_p
12:         return ⟨decision, commitVec, lc⟩
13:     else
14:         commitVec ← snapVec
15:         commitVec[strong] ← max{ts_l | l ∈ L}
16:         if ∃l ∈ L. vote_l = ABORT then decision ← ABORT else decision ← COMMIT
17:         lc ← max{lc_l | l ∈ L}
18:         send DECISION(b_l, tid, decision, commitVec, lc) to Ω_l, l ∈ L
19:         return ⟨decision, commitVec, lc⟩

Algorithm A8 Strong transaction certification at p_d^m
 1: function CERTIFICATION_CHECK(W, rset, snapVec, lc)
 2:     for all ⟨_, wbuff′, rset′, _, COMMIT, _, _⟩ ∈ preparedStrong do
 3:         if (∃⟨k, _⟩ ∈ wbuff′[m]. k ∈ rset) ∨ (∃k ∈ rset′. ⟨k, _⟩ ∈ W) then
 4:             return ⟨ABORT, ⊥⟩
 5:     for all ⟨_, wbuff′, COMMIT, commitVec, lc′⟩ ∈ decidedStrong do
 6:         if (∃⟨k, _⟩ ∈ wbuff′[m]. k ∈ rset) ∧ ¬(commitVec ≤ snapVec) then
 7:             return ⟨ABORT, ⊥⟩
 8:         if lc ≤ lc′ then
 9:             lc ← lc′ + 1
10:     return ⟨COMMIT, lc⟩
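
The certification check of Algorithm A8 can be sketched in Python as follows. The record layout (dictionaries with the keys `vote`, `decision`, `writes`, `rset`, `commit_vec`, `lc`) is an assumption for this example and flattens the per-partition write buffer wbuff′[m] into a single mapping for the local partition.

```python
from typing import Dict, List, Optional, Set, Tuple

COMMIT, ABORT = "COMMIT", "ABORT"


def vec_leq(a: Dict[str, int], b: Dict[str, int]) -> bool:
    """Entry-wise comparison of two vectors over the same entries."""
    return all(a[i] <= b[i] for i in a)


def certification_check(writes: Dict[str, str], rset: Set[str],
                        snap_vec: Dict[str, int], lc: int,
                        prepared: List[dict],
                        decided: List[dict]) -> Tuple[str, Optional[int]]:
    # Conflicts with transactions prepared to commit (lines 2-4).
    for tx in prepared:
        if tx["vote"] != COMMIT:
            continue
        if (set(tx["writes"]) & rset) or (tx["rset"] & set(writes)):
            return ABORT, None
    # Conflicts with committed transactions outside the snapshot (lines 5-7),
    # plus the Lamport clock adjustment (lines 8-9).
    for tx in decided:
        if tx["decision"] != COMMIT:
            continue
        if (set(tx["writes"]) & rset) and not vec_leq(tx["commit_vec"], snap_vec):
            return ABORT, None
        if lc <= tx["lc"]:
            lc = tx["lc"] + 1
    return COMMIT, lc
```

The returned Lamport clock exceeds that of every committed strong transaction examined, so the new transaction is ordered after them under last-writer-wins.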

Algorithm A9 Atomic transaction commit protocol at p_d^m
 1: when received PREPARE_STRONG(rid, senderMode, tid, wbuff, rset, snapVec, lc) from p
 2:     pre: status ∈ {LEADER, RESTORING}
 3:     if ∃⟨tid, _, decision, commitVec, lc⟩ ∈ decidedStrong then
 4:         send ALREADY_DECIDED(tid, decision, commitVec, lc) to p
 5:     else if ∃⟨tid, _, _, snapVec, vote, ts, lc⟩ ∈ preparedStrong then
 6:         send ACCEPT(ballot, tid, wbuff, rset, snapVec, vote, ts, p, lc) to REPLICAS(m)
 7:     else if senderMode = RESTORING then
 8:         send UNKNOWN_TX(ballot, rid, tid, p) to REPLICAS(m)
 9:     else if status = LEADER then
10:         wait until clock > snapVec[strong]
11:         ts ← clock
12:         ⟨vote, lc⟩ ← CERTIFICATION_CHECK(wbuff[m], rset, snapVec, lc)
13:         send ACCEPT(ballot, tid, wbuff, rset, snapVec, vote, ts, p, lc) to REPLICAS(m)

14: when received ACCEPT(b, tid, wbuff, rset, snapVec, vote, ts, p, lc)
15:     pre: status ∈ {LEADER, FOLLOWER, RESTORING} ∧ ballot = b
16:     preparedStrong ← preparedStrong ∪ {⟨tid, wbuff, rset, snapVec, vote, ts, lc⟩}
17:     send ACCEPT_ACK(m, b, tid, vote, ts, lc) to p

18: when received DECISION(b, tid, decision, commitVec, lc)
19:     pre: status ∈ {LEADER, RESTORING} ∧ ballot = b
20:     wait until clock ≥ commitVec[strong]
21:     send LEARN_DECISION(b, tid, decision, commitVec, lc) to REPLICAS(m)

22: when received LEARN_DECISION(b, tid, decision, commitVec, lc)
23:     pre: status ∈ {LEADER, FOLLOWER, RESTORING} ∧ ballot = b ∧ ∃⟨tid, wbuff, _, _, _, _, _⟩ ∈ preparedStrong
24:     preparedStrong ← preparedStrong \ {⟨tid, _, _, _, _, _, _⟩}
25:     decidedStrong ← decidedStrong ∪ {⟨tid, wbuff, decision, commitVec, lc⟩}

26: upon ∃⟨_, _, COMMIT, commitVec, _⟩ ∈ decidedStrong. commitVec[strong] > lastDelivered
        ∧ (¬∃⟨_, _, _, _, COMMIT, ts, _⟩ ∈ preparedStrong. lastDelivered < ts ≤ commitVec[strong])
        ∧ (¬∃⟨_, _, COMMIT, commitVec′, _⟩ ∈ decidedStrong. lastDelivered < commitVec′[strong] < commitVec[strong])
27:     pre: status = LEADER
28:     send DELIVER(ballot, commitVec[strong]) to REPLICAS(m)

29: when received DELIVER(b, ts)
30:     pre: status ∈ {LEADER, FOLLOWER} ∧ ballot = b ∧ lastDelivered < ts
31:     lastDelivered ← ts
32:     var W ← {⟨tid, wbuff[m], commitVec, lc⟩ | ∃⟨tid, wbuff, COMMIT, commitVec, lc⟩ ∈ decidedStrong. commitVec[strong] = ts}
33:     upcall DELIVER_UPDATES(W) to p_d^m

34: when received UNKNOWN_TX(b, rid, tid, p)
35:     pre: status ∈ {LEADER, FOLLOWER, RESTORING} ∧ ballot = b
36:     send UNKNOWN_TX_ACK(m, rid, tid) to p

37: function RETRY(tid)                                   ▷ Run periodically
38:     pre: CERTIFY(_, tid, _, _, _, _) is not executing ∧ status = LEADER ∧ ∃⟨tid, wbuff, rs, snapVec, _, _, lc⟩ ∈ preparedStrong
39:     CERTIFY(NORMAL, tid, wbuff, rs, snapVec, lc)

Algorithm A10 Atomic transaction commit protocol at p_d^m: recovery
 1: upon Ω_m ≠ trusted
 2:     trusted ← Ω_m
 3:     if trusted = p_d^m then RECOVER()
 4:     else send NACK(ballot) to trusted

 5: when received NACK(b)
 6:     pre: trusted = p_d^m ∧ b > ballot
 7:     ballot ← b
 8:     RECOVER()

 9: function RECOVER()
10:     send NEW_LEADER(any ballot b such that b > ballot ∧ leader(b) = p_d^m) to REPLICAS(m)

11: when received NEW_LEADER(b) from p
12:     if trusted = p ∧ ballot < b then
13:         status ← RECOVERING
14:         ballot ← b
15:         doNotWaitFor ← ∅
16:         send NEW_LEADER_ACK(ballot, cballot, preparedStrong, decidedStrong) to p
17:     else send NACK(ballot) to p

18: when received {NEW_LEADER_ACK(b, cballot_j, preparedStrong_j, decidedStrong_j) | p_j ∈ Q} from a quorum Q
19:     pre: status = RECOVERING ∧ ballot = b
20:     var J ← the set of j with maximal cballot_j
21:     decidedStrong ← ⋃_{j∈J} decidedStrong_j
22:     preparedStrong ← ⋃_{j∈J} {⟨tid, _, _, _, _, _, _⟩ ∈ preparedStrong_j | ⟨tid, _, _, _, _⟩ ∉ decidedStrong}
23:     var maxPrep ← max{ts | ⟨_, _, _, _, _, ts, _⟩ ∈ preparedStrong}
24:     var maxDec ← max{commitVec[strong] | ⟨_, _, _, commitVec, _⟩ ∈ decidedStrong}
25:     wait until clock ≥ max{maxPrep, maxDec}
26:     cballot ← b
27:     send NEW_STATE(ballot, preparedStrong, decidedStrong) to REPLICAS(m) \ {p_d^m}

28: when received NEW_STATE(b, preparedStrong, decidedStrong) from p
29:     pre: status = RECOVERING ∧ b ≥ ballot
30:     cballot ← b
31:     preparedStrong ← preparedStrong
32:     decidedStrong ← decidedStrong
33:     status ← FOLLOWER
34:     send NEW_STATE_ACK(b) to p

35: when received NEW_STATE_ACK(b) from a set of processes that together with p_d^m form a quorum
36:     pre: status = RECOVERING ∧ ballot = b
37:     status ← RESTORING
38:     for all t = ⟨tid, wbuff, rs, snapVec, _, _, lc⟩ ∈ preparedStrong do
39:         if CERTIFY(RESTORING, tid, wbuff, rs, snapVec, lc) = UNKNOWN then
40:             doNotWaitFor ← doNotWaitFor ∪ {t}

41: upon preparedStrong ⊆ doNotWaitFor ∧ status = RESTORING
42:     status ← LEADER
43:     doNotWaitFor ← ∅

B Consistency Model Specification

B.1 Relations

For a binary relation $R \subseteq A \times A$ and an element $a \in A$, we define $R^{-1}(a) = \{b \mid (b,a) \in R\}$. For a non-empty set $A$ and a total order $R \subseteq A \times A$, we let $\max_R(A)$ be the maximum element in $A$ according to $R$. Formally,

\[ \max_R(A) = a \iff A \neq \emptyset \land \forall b \in A.\ a = b \lor (b,a) \in R. \]

If $A$ is empty, then $\max_R(A)$ is undefined. We implicitly assume that $\max_R(A)$ is defined whenever it is used.

We call a binary relation a (strict) partial order if it is irreflexive and transitive. We call it a total order if it additionally relates every two distinct elements one way or another.

B.2 Operations and Events

Transactions in UNISTORE can start, read and write keys, and commit. We assume that each transaction is associated with a unique transaction identifier tid from a set TID (corresponding to line A2:8). Besides, clients can issue on-demand barriers and migrate between data centers.

Let Key and Val be the set of keys and values, respectively. We define $O$ as the set of all possible operations

\[
\begin{aligned}
O = {} & \{\mathrm{START}(tid) \mid tid \in TID\} \\
{} \cup {} & \{\mathrm{COMMIT\_CAUSAL\_TX}(tid) \mid tid \in TID\} \\
{} \cup {} & \{\mathrm{COMMIT\_STRONG\_TX}(tid, dec) \mid tid \in TID, dec \in \{\mathrm{COMMIT}, \mathrm{ABORT}\}\} \\
{} \cup {} & \{\mathrm{CL\_UNIFORM\_BARRIER}\} \\
{} \cup {} & \{\mathrm{CL\_ATTACH}(j) \mid j \in D\} \\
{} \cup {} & \{\mathrm{READ}(tid,k,v), \mathrm{UPDATE}(tid,k,v) \mid tid \in TID, k \in Key, v \in Val\}.
\end{aligned}
\]

We denote each invocation of such an operation by an event from a set $E$, usually ranged over by $e$. A function $op : E \to O$ determines the operation a given event denotes. Formally, we use the following notation to denote different types of events.
• $E$: The set of all events.
• $S$: The set of START events. That is, $S = \{e \in E \mid \exists tid \in TID.\ op(e) = \mathrm{START}(tid)\}$.
• $R$: The set of READ (read) events. That is, $R = \{e \in E \mid \exists tid \in TID, k \in Key, v \in Val.\ op(e) = \mathrm{READ}(tid,k,v)\}$.
• $U$: The set of UPDATE (update) events. That is, $U = \{e \in E \mid \exists tid \in TID, k \in Key, v \in Val.\ op(e) = \mathrm{UPDATE}(tid,k,v)\}$.
• $C_{causal}$: The set of COMMIT_CAUSAL_TX events. That is, $C_{causal} = \{e \in E \mid \exists tid \in TID.\ op(e) = \mathrm{COMMIT\_CAUSAL\_TX}(tid)\}$.
• $C_{strong}$: The set of COMMIT_STRONG_TX events with decision dec = COMMIT. That is, $C_{strong} = \{e \in E \mid \exists tid \in TID.\ op(e) = \mathrm{COMMIT\_STRONG\_TX}(tid, \mathrm{COMMIT})\}$.
• $C \triangleq C_{causal} \uplus C_{strong}$: The set of all commit events.
• $Q$: The set of CL_UNIFORM_BARRIER events. That is, $Q = \{e \in E \mid op(e) = \mathrm{CL\_UNIFORM\_BARRIER}\}$.
• $A$: The set of CL_ATTACH events. That is, $A = \{e \in E \mid \exists j \in D.\ op(e) = \mathrm{CL\_ATTACH}(j)\}$.
• $R_k$: The set of read events on key $k$. That is, $R_k = \{e \in E \mid \exists tid \in TID, v \in Val.\ op(e) = \mathrm{READ}(tid,k,v)\}$.
• $U_k$: The set of update events on key $k$. That is, $U_k = \{e \in E \mid \exists tid \in TID, v \in Val.\ op(e) = \mathrm{UPDATE}(tid,k,v)\}$.

For different types of events, we define
• $key(e)$: The key that the read or update event $e \in R \cup U$ accesses.
• $rval(e)$: The return value of the read event $e \in R$.
• $uval(e)$: The value written by the update event $e \in U$.

B.3 Transactions

Definition 1 (Transactions). A transaction $t$ is a triple $(tid, Y, po)$, where
• $tid \in TID$ is a unique transaction identifier;
• $Y \subseteq E \setminus (Q \cup A)$ is a finite, non-empty set of events;
• $po \subseteq Y \times Y$ is the program order, which is total.

We only consider well-formed transactions: according to the po order, $t$ starts with a START event, then performs some number of READ/UPDATE events, and ends with a commit event (COMMIT_CAUSAL_TX or COMMIT_STRONG_TX).

In the following, we denote components of $t$ as in $t.tid$. For simplicity, we assume a dedicated initial transaction $t_0$ which installs initial values to all possible keys before the system launches.

We use the following notations to denote different types of transactions.
• $T$: The set of all committed transactions.

• $T_k$: The set of committed transactions that update key $k$. We also use $C_k$ to denote the set of commit events of transactions in $T_k$.
• $T_{\text{causal}}$: The set of transactions that end with COMMIT_CAUSAL_TX events. We call them causal transactions. Causal transactions will always be committed.
• $T_{\text{all-strong}}$: The set of transactions that end with COMMIT_STRONG_TX events. We call them strong transactions. Strong transactions can be committed or aborted.
• $T_{\text{strong}}$: The set of committed strong transactions.

We have $T = T_{\text{causal}} \uplus T_{\text{strong}}$. For each transaction $t \in T$, we define
• $\mathit{tid}(t) \in \mathit{TID}$: The transaction identifier $t.\mathit{tid}$ of $t$.
• $\mathit{events}(t) \subseteq S \cup R \cup U \cup C$: The set $t.Y$ of events in $t$.
• $\mathit{ws}(t) \subseteq \mathit{Key} \times \mathit{Val}$: The write set of $t$. It is the set of keys with their values that $t$ updates, which contains at most one value per key. Formally, $\mathit{ws}(t) \triangleq \{(\mathit{key}(e), \mathit{uval}(e)) \mid e \in t.Y \cap U\}$.
• $\mathit{rs}(t) \subseteq \mathit{Key}$: The read set of $t$. It is the set of keys that $t$ reads. Formally, $\mathit{rs}(t) \triangleq \{\mathit{key}(e) \mid e \in t.Y \cap R\}$.
• $\mathit{st}(t) \in S$: The START event of $t$. Formally, it is the unique event in the set $t.Y \cap S$.
• $\mathit{ct}(t) \in C$: The commit event of $t$. Formally, it is the unique event in the set $t.Y \cap C$.
• $\mathit{ud}(t,k) \in U_k$: The last update event on key $k$, if any, in transaction $t$. Formally, $\mathit{ud}(t,k) \triangleq \max_{\mathit{po}}(\mathit{events}(t) \cap U_k)$.

Besides, we define

\[ W(t) \triangleq \{k \in \mathit{Key} \mid \langle k,\_ \rangle \in \mathit{ws}(t)\}, \tag{1} \]
\[ R(t) \triangleq \mathit{rs}(t) \cup W(t). \tag{2} \]

For a read event $e$ on key $k$ in transaction $t$, if there exist update events on $k$ preceding $e$ in $t$, then $e$ is called an internal read event. Otherwise, $e$ is called an external read event. We denote the sets of internal reads and external reads by $R_{\text{INT}}$ and $R_{\text{EXT}}$, respectively. That is, $R = R_{\text{INT}} \uplus R_{\text{EXT}}$.

We also distinguish commit events for read-only transactions from those for update transactions, and denote their sets by $C_{\text{RO}}$ and $C_{\text{RW}}$, respectively. That is, $C = C_{\text{RO}} \uplus C_{\text{RW}}$.

For notational convenience, for an event $e \in E \setminus (Q \cup A)$, we also define $\mathit{tx}(e)$ to be the transaction containing $e$ and

\[ \mathit{st}(e) \triangleq \mathit{st}(\mathit{tx}(e)), \qquad \mathit{ct}(e) \triangleq \mathit{ct}(\mathit{tx}(e)). \]

B.4 Abstract Executions

Clients interact with UNISTORE by issuing transactions and CL_UNIFORM_BARRIER and CL_ATTACH events. We use histories to record such interactions in a single computation. Note that histories only record committed transactions.

Definition 2 (Histories). A history is a tuple

\[ H = (X, \mathit{client}, \mathit{dc}, \mathit{so}) \]

such that
• $X \subseteq T \cup Q \cup A$ is a set of committed transactions and CL_UNIFORM_BARRIER and CL_ATTACH events;
• $\mathit{client} : X \to C$ is a function that returns
  – the client $\mathit{client}(t)$ which issues the transaction $t \in (X \cap T)$,
  – the client $\mathit{client}(q)$ which issues the CL_UNIFORM_BARRIER event $q \in (X \cap Q)$, or
  – the client $\mathit{client}(a)$ which issues the CL_ATTACH event $a \in (X \cap A)$;
• $\mathit{dc} : X \to \mathcal{D}$ is a function that returns the original data center $\mathit{dc}(t)$ of transaction $t \in (X \cap T)$, $\mathit{dc}(q)$ of CL_UNIFORM_BARRIER event $q \in (X \cap Q)$, or $\mathit{dc}(a)$ of CL_ATTACH event $a \in (X \cap A)$;
• $\mathit{so} \subseteq X \times X$ is the session order on $X$. Consider $s_1, s_2 \in X$. We say that $s_1$ precedes $s_2$ in the session order, denoted $s_1 \xrightarrow{\mathit{so}} s_2$, if they are executed by the same client and $s_1$ is executed before $s_2$.

In the following, we denote components of $H$ as in $H.X$ and often shorten $H.X$ by $X$ when it is clear. Let $V_H \triangleq \bigcup_{t \in H.X \cap T} t.Y$ be the set of transactional events in history $H$.

A consistency model is specified by a set of histories. To define this set, we extend histories with two relations, declaratively describing how the system processes transactions and CL_UNIFORM_BARRIER events.

Definition 3 (Abstract Executions). An abstract execution is a triple

\[ A = ((X, \mathit{client}, \mathit{dc}, \mathit{so}), \mathit{vis}, \mathit{ar}) \]

such that
• $(X, \mathit{client}, \mathit{dc}, \mathit{so})$ is a history;
• Visibility $\mathit{vis} \subseteq X \times X$ is a partial order;
• Arbitration $\mathit{ar} \subseteq X \times X$ is a total order.

For $H = (X, \mathit{client}, \mathit{dc}, \mathit{so})$, we often shorten $((X, \mathit{client}, \mathit{dc}, \mathit{so}), \mathit{vis}, \mathit{ar})$ by $(H, \mathit{vis}, \mathit{ar})$.

B.5 Partial Order-Restrictions Consistency

We aim to show that UNISTORE implements a transactional variant of Partial Order-Restrictions consistency (POR consistency) [40,41] for LWW registers. A history $H$ of UNISTORE satisfies POR, denoted $H \models \text{POR}$, if it can be extended to an

abstract execution that satisfies several axioms, defined in the following:

\[ H \models \text{POR} \iff \exists \mathit{vis}, \mathit{ar}.\ (H, \mathit{vis}, \mathit{ar}) \models \text{RVAL} \land \text{CAUSALCONSISTENCY} \land \text{CONFLICTORDERING} \land \text{EVENTUALVISIBILITY}. \]

UNISTORE satisfies POR, denoted $\text{UNISTORE} \models \text{POR}$, if all its histories do.

Given an abstract execution $A = (H, \mathit{vis}, \mathit{ar})$, the axioms are defined as follows.

Definition 4 (RVAL, [16]). The Return Value Consistency (RVAL) specifies the return value of each read event.

\[ \text{RVAL} \triangleq \text{INTRVAL} \land \text{EXTRVAL}. \]

Here INTRVAL requires an internal read event $e$ on key $k$ to read from the last update event on $k$ preceding $e$ in the same transaction. Formally,

\[ \text{INTRVAL} \triangleq \forall e \in R_{\text{INT}} \cap R_k \cap V_H.\ \mathit{rval}(e) = \mathit{uval}\big(\max_{\mathit{po}}(\mathit{po}^{-1}(e) \cap U_k)\big). \]

EXTRVAL requires an external read event $e$ on key $k$ to read from the last update event on $k$ in the last transaction preceding $\mathit{tx}(e)$ in $\mathit{ar}$, among the set of transactions visible to $\mathit{tx}(e)$. Formally,

\[ \text{EXTRVAL} \triangleq \forall e \in R_{\text{EXT}} \cap R_k \cap V_H.\ \mathit{rval}(e) = \mathit{uval}\Big(\mathit{ud}\big(\max_{\mathit{ar}}(\mathit{vis}^{-1}(\mathit{tx}(e)) \cap T_k),\, k\big)\Big). \]

Definition 5 (CAUSALCONSISTENCY, [13]).

\[ \text{CAUSALCONSISTENCY} \triangleq \text{CAUSALVISIBILITY} \land \text{CAUSALARBITRATION}, \]

where

\[ \text{CAUSALVISIBILITY} \triangleq (\mathit{so} \cup \mathit{vis})^{+} \subseteq \mathit{vis}; \qquad \text{CAUSALARBITRATION} \triangleq \mathit{vis} \subseteq \mathit{ar}. \]

The Conflict Ordering property requires that out of any two conflicting strong transactions, one must be visible to the other. Formally,

Definition 6 (Conflict Relation). The conflict relation, denoted by $\bowtie$, between strong transactions is a symmetric relation defined as follows:

\[ \forall t, t' \in T_{\text{strong}}.\ t \bowtie t' \iff (R(t) \cap W(t') \neq \emptyset) \lor (W(t) \cap R(t') \neq \emptyset). \]

Definition 7 (CONFLICTORDERING).

\[ \text{CONFLICTORDERING} \triangleq \forall t_1, t_2 \in X \cap T_{\text{strong}}.\ t_1 \bowtie t_2 \implies t_1 \xrightarrow{\mathit{vis}} t_2 \lor t_2 \xrightarrow{\mathit{vis}} t_1. \]

The Eventual Visibility property requires that a transaction that originates at a correct data center, that is visible to some CL_UNIFORM_BARRIER event, or that is a strong transaction eventually becomes visible at all correct data centers. Let $\mathcal{C} \subseteq \mathcal{D}$ be the set of correct data centers. Formally,

Definition 8 (EVENTUALVISIBILITY).

\[ \text{EVENTUALVISIBILITY} \triangleq \forall t \in X \cap T.\ \big(\mathit{dc}(t) \in \mathcal{C} \lor (\exists q \in Q.\ t \xrightarrow{\mathit{vis}} q) \lor t \in T_{\text{strong}}\big) \implies \big|\{t' \in T \mid \lnot(t \xrightarrow{\mathit{vis}} t')\}\big| < \infty. \]

C Transaction Certification Service Specification

C.1 Interface

The Transaction Certification Service (TCS) [18] is responsible for certifying strong transactions issued by transaction coordinators, computing commit vectors and Lamport clocks for committed transactions, and (asynchronously) delivering committed transactions to replicas.

Each strong transaction $t \in T_{\text{all-strong}}$ submitted to TCS may be associated with its read set $\mathit{rs}(t)$, write set $\mathit{ws}(t)$, snapshot vector $\mathit{snapshotVec}(t)$ (Definition 10), commit vector $\mathit{commitVec}(t)$ (Definition 11), and Lamport clock $\mathit{lclock}(t)$ (Definition 52). From $\mathit{rs}(t)$ and $\mathit{ws}(t)$ we can then define $W(t)$ and $R(t)$ according to (1) and (2), respectively. Note that we have $W(t) \subseteq R(t)$.

Transaction coordinators for strong transactions interact with TCS using two types of actions. Coordinators can make certification requests (corresponding to procedure CERTIFY of Algorithm A7) of the form

\[ \mathit{certify}(\mathit{tid}(t), \mathit{ws}(t), R(t), \mathit{snapshotVec}(t), C(t)), \]

where $t \in T_{\text{all-strong}}$ and $C(t) \in \mathbb{N}^{+}$ denotes the contribution of $\mathit{client}(t)$ to the Lamport clock of $t$. The TCS responses are of the form

\[ \mathit{decide}(\mathit{tid}(t), \mathit{dec}, \mathit{vc}, \mathit{lc}), \]

containing a decision $\mathit{dec}$ from $D = \{\text{COMMIT}, \text{ABORT}\}$ for $t$, a commit vector $\mathit{vc}$ from $V$ for $t$ if $\mathit{dec} = \text{COMMIT}$, and a Lamport clock $\mathit{lc}$ from $\mathbb{N}$ for $t$ if $\mathit{dec} = \text{COMMIT}$. If $\mathit{dec} = \text{ABORT}$, then $\mathit{vc}$ and $\mathit{lc}$ are irrelevant.

Besides, TCS can deliver some payload $W$ to a replica via upcall actions $\mathit{deliver}(W)$ (corresponding to procedure DELIVER_UPDATES of Algorithm A6). We denote by $\mathit{deliver}^m_d(W)$ the delivery of the payload $W$ to a replica $p^m_d$, when the latter is relevant. The payload $W$ in $\mathit{deliver}^m_d(W)$

is a set of tuples of the form $\langle \mathit{tid}, \mathit{wbuff}, \mathit{commitVec}, \mathit{lc} \rangle$, each of which corresponds to the updates $\mathit{wbuff} \subseteq \mathit{Key} \times \mathit{Val}$ performed at a particular partition $m$ by a particular committed strong transaction with transaction identifier $\mathit{tid}$, commit vector $\mathit{commitVec}$, and Lamport clock $\mathit{lc}$.

C.2 Certification Functions

TCS is specified using a certification function

\[ F : 2^{T_{\text{strong}}} \times T_{\text{all-strong}} \to D \times V \times \mathbb{N}. \tag{3} \]

For a strong transaction $t \in T_{\text{all-strong}}$ and the set $T_c \subseteq T_{\text{strong}}$ of previously committed strong transactions, $F(T_c, t)$ returns not only the decision $\mathit{dec} \in D$, but also the commit vector $\mathit{vc} \in V$ and Lamport clock $\mathit{lc} \in \mathbb{N}$ for $t$. We use $F_{\text{dec}}(T_c,t)$, $F_{\text{vec}}(T_c,t)$, and $F_{\text{lc}}(T_c,t)$ to select the first, second, and third component of $F(T_c,t)$, respectively.

The decision $F_{\text{dec}}(T_c,t)$ should satisfy

\[ F_{\text{dec}}(T_c,t) = \text{COMMIT} \iff \forall k \in R(t).\ \forall t' \in T_c.\ k \in W(t') \implies \mathit{commitVec}(t') \le \mathit{snapshotVec}(t). \tag{4} \]

The commit vector $F_{\text{vec}}(T_c,t)$ should satisfy

\[
\begin{aligned}
& (\forall i \in \mathcal{D}.\ F_{\text{vec}}(T_c,t)[i] = \mathit{snapshotVec}(t)[i]) \\
& \land F_{\text{vec}}(T_c,t)[\mathit{strong}] > \mathit{snapshotVec}(t)[\mathit{strong}] \\
& \land \forall t' \in T_c.\ t \bowtie t' \implies F_{\text{vec}}(T_c,t) \ge \mathit{commitVec}(t').
\end{aligned}
\tag{5}
\]

The Lamport clock $F_{\text{lc}}(T_c,t)$ should satisfy

\[ F_{\text{lc}}(T_c,t) \ge C(t) \land \big(\forall t' \in T_c.\ t \bowtie t' \implies F_{\text{lc}}(T_c,t) > \mathit{lclock}(t')\big). \tag{6} \]

C.3 Histories of TCS

TCS executions are represented by histories, which are (possibly infinite) sequences of certify, decide, and deliver actions. For a TCS history $h$, we use $\mathit{act}(h)$ to denote the set of actions in $h$. For actions $\mathit{act}, \mathit{act}' \in \mathit{act}(h)$, we write $\mathit{act} \prec_h \mathit{act}'$ when $\mathit{act}$ occurs before $\mathit{act}'$ in $h$. A strong transaction $t \in T_{\text{all-strong}}$ commits in a history $h$ if $h$ contains $\mathit{decide}(\mathit{tid}(t), \text{COMMIT}, \_, \_)$. We denote by $\mathit{committed}(h)$ the projection of $h$ to actions corresponding to the strong transactions that are committed in $h$.

Each history $h$ needs to meet the following requirements.
(R1) For each strong transaction $t \in T_{\text{all-strong}}$, there is at most one $\mathit{certify}(\mathit{tid}(t),\_,\_,\_,\_)$ action in $h$.
(R2) For each action $\mathit{decide}(\mathit{tid},\_,\_,\_) \in \mathit{act}(h)$, there is exactly one $\mathit{certify}(\mathit{tid},\_,\_,\_,\_)$ action in $h$ such that
\[ \mathit{certify}(\mathit{tid},\_,\_,\_,\_) \prec_h \mathit{decide}(\mathit{tid},\_,\_,\_). \]
(R3) For each action $\mathit{deliver}(W) \in \mathit{act}(h)$ and each $\langle \mathit{tid},\_,\_,\_ \rangle \in W$, there is no $\mathit{decide}(\mathit{tid}, \text{ABORT}, \_, \_)$ action in $h$.
(R4) Every committed strong transaction is delivered at most once to each replica.
(R5) For each action $\mathit{deliver}(W) \in \mathit{act}(h)$ and each $\langle \mathit{tid},\_,\_,\_ \rangle \in W$, there is a $\mathit{certify}(\mathit{tid},\_,\_,\_,\_)$ action such that
\[ \mathit{certify}(\mathit{tid},\_,\_,\_,\_) \prec_h \mathit{deliver}(W). \]
(R6) At each replica $p^m_d$, committed strong transactions are delivered in the increasing order of their strong timestamps. Formally, for any two distinct actions $\mathit{deliver}^m_d(W_1)$ and $\mathit{deliver}^m_d(W_2)$ with payloads $W_1$ and $W_2$, respectively,
\[ \mathit{deliver}^m_d(W_1) \prec_h \mathit{deliver}^m_d(W_2) \implies \forall \langle \_,\_,\mathit{commitVec}_1,\_ \rangle \in W_1.\ \forall \langle \_,\_,\mathit{commitVec}_2,\_ \rangle \in W_2.\ \mathit{commitVec}_1[\mathit{strong}] < \mathit{commitVec}_2[\mathit{strong}]. \]

A history is complete if every certify action in it has a matching decide action. A complete history $h$ is sequential if it consists of consecutive pairs of certify and matching decide actions. For a complete history $h$, a permutation $h'$ of $h$ is a sequential history such that
• $h$ and $h'$ contain the same actions, i.e., $\mathit{act}(h) = \mathit{act}(h')$.
• Transactions are certified in $h'$ according to their session order:
\[ \forall t, t' \in T_{\text{all-strong}}.\ t \xrightarrow{\mathit{so}} t' \implies \mathit{decide}(\mathit{tid}(t),\_,\_,\_) \prec_{h'} \mathit{certify}(\mathit{tid}(t'),\_,\_,\_,\_). \]

C.4 TCS Correctness: Safety and Liveness

C.4.1 Safety of TCS

A complete sequential history $h$ is legal with respect to a certification function $F$ if its results are computed so as to satisfy (4)–(6) according to $F$:

\[ \forall \mathit{act} = \mathit{decide}(\mathit{tid}(t), \mathit{dec}, \mathit{vc}, \mathit{lc}) \in \mathit{act}(h).\ (\mathit{dec}, \mathit{vc}, \mathit{lc}) = F(\{t' \mid \mathit{decide}(\mathit{tid}(t'), \text{COMMIT}, \_, \_) \prec_h \mathit{act}\}, t). \]

A history $h$ is correct with respect to $F$ if $h \mid \mathit{committed}(h)$ has a legal permutation. A TCS implementation is correct with respect to $F$ if so are all its histories.

C.4.2 Liveness of TCS

TCS guarantees that every committed strong transaction will eventually be delivered by every correct data center. Formally,

\[ \forall \mathit{act} = \mathit{decide}(\mathit{tid}, \text{COMMIT}, \_, \_) \in \mathit{act}(h).\ \forall m \in \mathit{partitions}(\mathit{tid}).\ \forall c \in \mathcal{C}.\ \exists \mathit{act}' = \mathit{deliver}^m_c(W) \in \mathit{act}(h).\ \langle \mathit{tid},\_,\_,\_ \rangle \in W \land \mathit{act} \prec_h \mathit{act}'. \tag{7} \]
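
To make conditions (4)–(6) of Section C.2 concrete, the following Python sketch evaluates them on small in-memory transactions. It is a minimal illustration rather than the TCS implementation: the `Tx` record, the dictionary encoding of vectors (one entry per data center plus a `"strong"` entry), and the function names are assumptions introduced here. The sketch computes the decision of (4) and only checks candidate values against (5) and (6), since the specification constrains the commit vector and Lamport clock without fixing them.

```python
from dataclasses import dataclass

STRONG = "strong"  # assumed key for the strong entry of a vector


@dataclass
class Tx:
    """Hypothetical record of a strong transaction."""
    tid: str
    reads: frozenset          # R(t), which includes W(t)
    writes: frozenset         # W(t)
    snapshot_vec: dict        # snapshotVec(t)
    commit_vec: dict = None   # commitVec(t), set once committed
    lclock: int = 0           # lclock(t), set once committed


def leq(u, v):
    """Pointwise order on vectors with the same index set."""
    return all(u[i] <= v[i] for i in u)


def conflict(t1, t2):
    """Definition 6: read-write overlap in either direction."""
    return bool(t1.reads & t2.writes) or bool(t1.writes & t2.reads)


def decision(committed, t):
    """Condition (4): commit iff every committed writer of a key in R(t)
    is covered by t's snapshot."""
    ok = all(leq(c.commit_vec, t.snapshot_vec)
             for c in committed if c.writes & t.reads)
    return "COMMIT" if ok else "ABORT"


def satisfies_commit_vec(committed, t, vc):
    """Condition (5) for a candidate commit vector vc."""
    same_dc = all(vc[i] == t.snapshot_vec[i] for i in vc if i != STRONG)
    strong_up = vc[STRONG] > t.snapshot_vec[STRONG]
    dominates = all(leq(c.commit_vec, vc)
                    for c in committed if conflict(c, t))
    return same_dc and strong_up and dominates


def satisfies_lclock(committed, t, contribution, lc):
    """Condition (6) for a candidate Lamport clock lc, where
    contribution plays the role of C(t)."""
    return lc >= contribution and all(
        lc > c.lclock for c in committed if conflict(c, t))
```

For example, a transaction whose snapshot already includes a committed conflicting writer is committed by `decision`, whereas the same transaction with an older snapshot is aborted.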

In (7), $\mathit{partitions}(\mathit{tid})$ denotes the set of partitions that a particular transaction with transaction identifier $\mathit{tid}$ accesses.

A TCS implementation meets the liveness requirement if every history produced by its maximal execution satisfies (7).

C.5 TCS Correctness

The proof of TCS correctness is an adjustment of the ones in [18, 29].

Theorem 9. The TCS implementation in UNISTORE (Algorithms A7–A10) is correct with respect to the certification function $F$ in (3) and meets the liveness requirement in (7).

D The Proof of UNISTORE Correctness

Consider an execution of UNISTORE with a history $H = (X, \mathit{client}, \mathit{dc}, \mathit{so})$. We prove that $H$ satisfies POR by constructing an abstract execution $A$ (Theorem 84). We also establish the liveness guarantees of UNISTORE (Theorem 86).

D.1 Assumptions

We take the following assumptions about UNISTORE.

ASSUMPTION 1. For any replica $p^m_d$ in data center $d$, the clock at $p^m_d$ is strictly increasing until $d$ (may) crash.

ASSUMPTION 2. Replicas are connected by reliable FIFO channels: messages are delivered in FIFO order, and messages between correct data centers are guaranteed to be eventually delivered.

ASSUMPTION 3. We assume that in an execution of UNISTORE, any clients and up to $f$ data centers may crash and that $D > 2f$.

ASSUMPTION 4. We assume fairness of procedures of UNISTORE: in an execution, if a procedure is enabled infinitely often, then it will be executed infinitely often.

ASSUMPTION 5. We consider only well-formed executions, in which for each client:
• transactions are issued in sequence; and
• both CL_UNIFORM_BARRIER and CL_ATTACH events can be issued only outside of transactions.

ASSUMPTION 6. We consider only executions where every causal commit event (i.e., COMMIT_CAUSAL_TX) completes and every strong commit event (i.e., COMMIT_STRONG_TX) that calls the TCS completes.

We make the last assumption to simplify the technical development. The other assumptions come from the system model.

D.2 Notations

We use $\mathit{cl}$ to range over clients from a finite set $C$. We also use the following notations to refer to different types of variables and their values (below are some typical examples).
• $\mathit{snapVec}^m_d$: The variable $\mathit{snapVec}$ at replica $p^m_d$.
• $(\mathit{snapVec}^m_d)_e$: The value of variable $\mathit{snapVec}^m_d$ after the event $e$ is performed at replica $p^m_d$.
• $\mathit{snapVec}^m_d(\tau)$: The value of $\mathit{snapVec}^m_d$ at some specific time $\tau$.
• $\mathit{pastVec}_{\mathit{cl}}$: The variable $\mathit{pastVec}$ at client $\mathit{cl}$.
• $(\mathit{pastVec}_{\mathit{cl}})_e$: The value of variable $\mathit{pastVec}_{\mathit{cl}}$ after the event $e$ is performed at client $\mathit{cl}$.
• $\mathit{snapVec}(\text{GET\_VERSION}, e)$: The actual value of parameter $\mathit{snapVec}$ of handler GET_VERSION for event $e$.
• $\mathit{commitVec}(\text{COMMIT\_CAUSAL}, e)$: The value of the local variable $\mathit{commitVec}$ in procedure COMMIT_CAUSAL after event $e$ is performed.

Besides, we use $\mathit{coord}(t)$ to denote the coordinator partition of transaction $t$.

Each transaction is associated with a snapshot vector and a commit vector.

Definition 10 (Snapshot Vector). Let $t \in T$ be a transaction. Let $d \triangleq \mathit{dc}(t)$ and $m \triangleq \mathit{coord}(t)$. We define its snapshot vector $\mathit{snapshotVec}(t)$ as

\[ \mathit{snapshotVec}(t) \triangleq (\mathit{snapVec}^m_d)_{\mathit{st}(t)}[t]. \]

Definition 11 (Commit Vector). Let $t \in T$ be a transaction. Let $d \triangleq \mathit{dc}(t)$ and $m \triangleq \mathit{coord}(t)$. We define its commit vector $\mathit{commitVec}(t)$ as follows.
• If $t$ is a read-only causal transaction, then $\mathit{commitVec}(t) \triangleq (\mathit{snapVec}^m_d)_{\mathit{ct}(t)}[t]$.
• If $t$ is an update causal transaction, then $\mathit{commitVec}(t) \triangleq \mathit{commitVec}(\text{COMMIT\_CAUSAL}, \mathit{ct}(t))$.
• If $t$ is a committed strong transaction, then $\mathit{commitVec}(t) \triangleq \mathit{vc}(\text{COMMIT\_STRONG}, \mathit{ct}(t))$.

Lemma 12.

\[ \forall t \in T.\ \mathit{commitVec}(t) \ge \mathit{snapshotVec}(t). \]

Proof. We perform a case analysis according to the type of $t$.
• CASE I: $t$ is a read-only causal transaction. By Definition 10 of $\mathit{snapshotVec}(t)$, Definition 11 of $\mathit{commitVec}(t)$, and Assumption 5, $\mathit{commitVec}(t) = \mathit{snapshotVec}(t)$.
• CASE II: $t$ is an update causal transaction. By lines A2:26 and A2:29, $\mathit{commitVec}(t) \ge \mathit{snapshotVec}(t)$.

• CASE III: $t$ is a strong transaction. By line A6:3 and (5), $\mathit{commitVec}(t) \ge \mathit{snapshotVec}(t)$.

For client $\mathit{cl}$, we use $\mathit{cur\_dc}(\mathit{cl})$ to denote the data center to which $\mathit{cl}$ is currently attached. We also use $T|_{\mathit{cl}}$ to denote the set of transactions issued by $\mathit{cl}$. Formally,

\[ T|_{\mathit{cl}} \triangleq \{t \in T \mid \mathit{client}(t) = \mathit{cl}\}. \]

For a transaction $t$ and a partition $m$, we use $\mathit{ws}(t)[m]$ to denote the subset of $\mathit{ws}(t)$ restricted to partition $m$. Formally,

\[ \mathit{ws}(t)[m] \triangleq \{\langle k,v \rangle \in \mathit{ws}(t) \mid \mathit{partition}(k) = m\}. \]

For notational convenience, we also define

\[ \mathit{log}(t) \triangleq \{\langle k, v, \mathit{commitVec}(t), \mathit{lclock}(t) \rangle \mid \langle k,v \rangle \in \mathit{ws}(t)\}, \]

and

\[ \mathit{log}(t)[m] \triangleq \{\langle k, v, \mathit{commitVec}(t), \mathit{lclock}(t) \rangle \mid \langle k,v \rangle \in \mathit{ws}(t)[m]\}. \]

For a key $k \in \mathit{Key}$ and a transaction $t \in T_k$, let $\mathit{log}(t)[k]$ be the unique tuple $\langle k, v, \mathit{commitVec}(t), \mathit{lclock}(t) \rangle$ in $\mathit{log}(t)$.

D.3 Metadata for Causal Transactions

A causal transaction is committed when COMMIT_CAUSAL for it returns. A causal transaction is committed at replica $p^m_d$ when COMMIT for it at $p^m_d$ returns.

D.3.1 Properties of knownVec

Lemma 13. For any replica $p^m_d$ in data center $d$, $\mathit{knownVec}^m_d[d]$ is non-decreasing.

Proof. Consider two points of time $\tau_1$ and $\tau_2$ such that $\tau_1 < \tau_2$. We need to show that

\[ \mathit{knownVec}^m_d(\tau_1)[d] \le \mathit{knownVec}^m_d(\tau_2)[d]. \]

Note that $\mathit{knownVec}^m_d[d]$ is updated only at lines A4:3 or A4:5. We distinguish between the following four cases.
• CASE I: Both $\mathit{knownVec}^m_d(\tau_1)[d]$ and $\mathit{knownVec}^m_d(\tau_2)[d]$ are set at line A4:3. By line A4:3 and Assumption 1,
\[ \mathit{knownVec}^m_d(\tau_1)[d] = \mathit{clock}^m_d(\tau_1) < \mathit{clock}^m_d(\tau_2) = \mathit{knownVec}^m_d(\tau_2)[d]. \]
• CASE II: $\mathit{knownVec}^m_d(\tau_1)[d]$ is set at line A4:3 and $\mathit{knownVec}^m_d(\tau_2)[d]$ is set at line A4:5. By line A4:3,
\[ \mathit{knownVec}^m_d(\tau_1)[d] = \mathit{clock}^m_d(\tau_1). \]
By the fact that $\mathit{preparedCausal}^m_d(\tau_1) = \emptyset$, $\tau_2 > \tau_1$, and line A3:11,
\[ \forall \langle \_,\_,\mathit{ts} \rangle \in \mathit{preparedCausal}^m_d(\tau_2).\ \mathit{ts} > \mathit{clock}^m_d(\tau_1) = \mathit{knownVec}^m_d(\tau_1)[d]. \]
Therefore, by line A4:5,
\[ \mathit{knownVec}^m_d(\tau_1)[d] \le \min\{\mathit{ts} \mid \langle \_,\_,\mathit{ts} \rangle \in \mathit{preparedCausal}^m_d(\tau_2)\} - 1 = \mathit{knownVec}^m_d(\tau_2)[d]. \]
• CASE III: $\mathit{knownVec}^m_d(\tau_1)[d]$ is set at line A4:5 and $\mathit{knownVec}^m_d(\tau_2)[d]$ is set at line A4:3. Let $t_1$ be the transaction in $\mathit{preparedCausal}^m_d(\tau_1)$ that has the minimum $\mathit{ts}$. Formally,
\[ t_1 \triangleq \operatorname*{argmin}_{t} \{\mathit{ts} \mid \langle \mathit{tid}(t),\_,\mathit{ts} \rangle \in \mathit{preparedCausal}^m_d(\tau_1)\}. \]
By lines A4:5, A2:29, A3:15, and A4:3,
\[ \mathit{knownVec}^m_d(\tau_1)[d] < \mathit{commitVec}(t_1)[d] \le \mathit{clock}^m_d(\tau_2) = \mathit{knownVec}^m_d(\tau_2)[d]. \]
• CASE IV: Both $\mathit{knownVec}^m_d(\tau_1)[d]$ and $\mathit{knownVec}^m_d(\tau_2)[d]$ are set at line A4:5. By lines A4:5 and A3:11,
\[ \mathit{knownVec}^m_d(\tau_1)[d] = \min\{\mathit{ts} \mid \langle \_,\_,\mathit{ts} \rangle \in \mathit{preparedCausal}^m_d(\tau_1)\} - 1 \le \min\{\mathit{ts} \mid \langle \_,\_,\mathit{ts} \rangle \in \mathit{preparedCausal}^m_d(\tau_2)\} - 1 = \mathit{knownVec}^m_d(\tau_2)[d]. \]

Lemma 14. For $i \in \mathcal{D} \setminus \{d\}$, $\mathit{knownVec}^m_d[i]$ at any replica $p^m_d$ in data center $d$ is non-decreasing.

Proof. Note that $\mathit{knownVec}^m_d[i]$ ($i \in \mathcal{D} \setminus \{d\}$) can be updated only at lines A4:18 and A4:21. Therefore, this lemma holds due to lines A4:14 and A4:20.

Lemma 15. For $i \in \mathcal{D}$, $\mathit{knownVec}^m_d[i]$ at any replica $p^m_d$ in data center $d$ is non-decreasing.

Proof. By Lemmas 13 and 14.

Lemma 16. For any replica $p^m_d$ in data center $d$,

\[ \mathit{knownVec}^m_d[d] \le \mathit{clock}^m_d. \]

Proof. Note that $\mathit{knownVec}^m_d[d]$ is updated only at lines A4:3 or A4:5.
• CASE I: $\mathit{knownVec}^m_d[d]$ is updated at line A4:3. By Assumption 1, $\mathit{knownVec}^m_d[d] \le \mathit{clock}^m_d$.
• CASE II: $\mathit{knownVec}^m_d[d]$ is updated at line A4:5. By line A3:11, immediately after this update, $\mathit{knownVec}^m_d[d] < \mathit{clock}^m_d$.

Lemma 17. Let $p^m_d$ be a replica in data center $d$. Consider $\mathit{knownVec}^m_d(\tau)[d]$ at time $\tau$ and transaction $t \in T_{\text{causal}}$ committed at $p^m_d$ after time $\tau$. Then

\[ \mathit{commitVec}(t)[d] > \mathit{knownVec}^m_d(\tau)[d]. \]

Proof. Suppose that before time $\tau$, $\mathit{knownVec}^m_d[d]$ is last updated at time $\tau' < \tau$. Therefore,

\[ \mathit{knownVec}^m_d(\tau)[d] = \mathit{knownVec}^m_d(\tau')[d]. \]

We distinguish between two cases according to whether $\mathit{preparedCausal}^m_d(\tau') = \emptyset$ when $\mathit{knownVec}^m_d[d]$ is updated at time $\tau'$.
• CASE I: $\mathit{preparedCausal}^m_d(\tau') = \emptyset$. By line A4:3,
\[ \mathit{knownVec}^m_d(\tau')[d] = \mathit{clock}^m_d(\tau'). \]
By line A3:11, line A2:29, and Assumption 1,
\[ \mathit{commitVec}(t)[d] > \mathit{clock}^m_d(\tau'). \]
Therefore,
\[ \mathit{commitVec}(t)[d] > \mathit{knownVec}^m_d(\tau')[d] = \mathit{knownVec}^m_d(\tau)[d]. \]
• CASE II: $\mathit{preparedCausal}^m_d(\tau') \neq \emptyset$. We further distinguish between two cases according to whether $\langle \mathit{tid}(t),\_,\_ \rangle \in \mathit{preparedCausal}^m_d(\tau')$.
  – CASE II-1: $\langle \mathit{tid}(t),\_,\mathit{ts} \rangle \in \mathit{preparedCausal}^m_d(\tau')$. By lines A4:5 and A2:29,
  \[ \mathit{commitVec}(t)[d] \ge \mathit{ts} > \mathit{knownVec}^m_d(\tau')[d] = \mathit{knownVec}^m_d(\tau)[d]. \]
  – CASE II-2: $\langle \mathit{tid}(t),\_,\_ \rangle \notin \mathit{preparedCausal}^m_d(\tau')$. By Lemma 16, Assumption 1, line A3:11, and line A2:29,
  \[ \mathit{commitVec}(t)[d] > \mathit{knownVec}^m_d(\tau')[d] = \mathit{knownVec}^m_d(\tau)[d]. \]

Lemma 18. Let $t \in T_{\text{causal}}$ be a causal transaction that originates at data center $d$ and accesses partition $m$. If

\[ \mathit{commitVec}(t)[d] \le \mathit{knownVec}^m_d[d], \]

then

\[ \mathit{log}(t)[m] \subseteq \mathit{opLog}^m_d. \]

Proof. Suppose that the value $\mathit{knownVec}^m_d[d]$ is set at time $\tau$. By Lemma 17, $t$ is committed at $p^m_d$ before time $\tau$. Therefore, by line A3:19, $\mathit{log}(t)[m] \subseteq \mathit{opLog}^m_d$.

The following lemmas consider the replication and forwarding of causal transactions.

Lemma 19. Let $p^m_d$ be a replica in data center $d$. Let $t_1$ and $t_2$ be two transactions replicated by $p^m_d$ to sibling replicas at time $\tau_1$ and $\tau_2$ (line A4:8), respectively. Then

\[ \tau_1 < \tau_2 \implies \mathit{commitVec}(t_1)[d] < \mathit{commitVec}(t_2)[d]. \]

Proof. Since $t_1$ is replicated at time $\tau_1$, by line A4:6,

\[ \mathit{commitVec}(t_1)[d] \le \mathit{knownVec}^m_d(\tau_1)[d]. \]

Assume that $\tau_1 < \tau_2$. We distinguish between two cases according to whether $\langle \mathit{tid}(t_2),\_,\_,\_ \rangle \in \mathit{committedCausal}^m_d(\tau_1)[d]$.
• CASE I: $\langle \mathit{tid}(t_2),\_,\_,\_ \rangle \in \mathit{committedCausal}^m_d(\tau_1)[d]$. Since $t_2$ is not replicated at time $\tau_1$, by line A4:6,
\[ \mathit{commitVec}(t_2)[d] > \mathit{knownVec}^m_d(\tau_1)[d]. \]
• CASE II: $\langle \mathit{tid}(t_2),\_,\_,\_ \rangle \notin \mathit{committedCausal}^m_d(\tau_1)[d]$. Thus, $t_2$ is committed at $p^m_d$ after time $\tau_1$. By Lemma 17,
\[ \mathit{commitVec}(t_2)[d] > \mathit{knownVec}^m_d(\tau_1)[d]. \]
Therefore, in either case,
\[ \mathit{commitVec}(t_1)[d] < \mathit{commitVec}(t_2)[d]. \]

Lemma 20. Let $p^m_d$ be a replica in data center $d$. Consider a heartbeat $\mathit{knownVec}^m_d(\tau_1)[d]$ sent by $p^m_d$ at time $\tau_1$ (line A4:11). Let $t$ be a transaction replicated by $p^m_d$ at time $\tau_2$ (line A4:8). Then

\[ \tau_1 < \tau_2 \iff \mathit{knownVec}^m_d(\tau_1)[d] < \mathit{commitVec}(t)[d]. \]

Proof. We first show that

\[ \tau_1 < \tau_2 \implies \mathit{knownVec}^m_d(\tau_1)[d] < \mathit{commitVec}(t)[d]. \]

Assume that $\tau_1 < \tau_2$. We distinguish between two cases according to whether $\langle \mathit{tid}(t),\_,\_,\_ \rangle \in \mathit{committedCausal}^m_d(\tau_1)[d]$.
• CASE I: $\langle \mathit{tid}(t),\_,\_,\_ \rangle \in \mathit{committedCausal}^m_d(\tau_1)[d]$. By line A4:6,
\[ \mathit{commitVec}(t)[d] > \mathit{knownVec}^m_d(\tau_1)[d]. \]
• CASE II: $\langle \mathit{tid}(t),\_,\_,\_ \rangle \notin \mathit{committedCausal}^m_d(\tau_1)[d]$. Thus, $t$ is committed at $p^m_d$ after time $\tau_1$. By Lemma 17,
\[ \mathit{commitVec}(t)[d] > \mathit{knownVec}^m_d(\tau_1)[d]. \]

Next we show that (note that $\tau_1 \neq \tau_2$)

\[ \tau_2 < \tau_1 \implies \mathit{commitVec}(t)[d] \le \mathit{knownVec}^m_d(\tau_1)[d]. \]

Since $t$ is replicated by $p^m_d$ at time $\tau_2$, by line A4:6,

\[ \mathit{commitVec}(t)[d] \le \mathit{knownVec}^m_d(\tau_2)[d]. \]

Assume that $\tau_2 < \tau_1$. By Lemma 13,

\[ \mathit{knownVec}^m_d(\tau_2)[d] \le \mathit{knownVec}^m_d(\tau_1)[d]. \]

Putting it together yields

\[ \mathit{commitVec}(t)[d] \le \mathit{knownVec}^m_d(\tau_1)[d]. \]

Lemma 21. Let $p^m_d$ be a replica in data center $d$. Then

\[ \forall i \neq d.\ \forall \langle \mathit{tid}(t),\_,\_,\_ \rangle \in \mathit{committedCausal}^m_d[i].\ \mathit{commitVec}(t)[i] \le \mathit{knownVec}^m_d[i]. \]

Proof. By lines A4:17 and A4:18 and Lemma 14.

Lemma 22. For $j \neq d$ and $i \notin \{d, j\}$, $\mathit{globalMatrix}^m_d[i][j]$ at any replica $p^m_d$ in data center $d$ is non-decreasing.

Proof. Note that $\mathit{globalMatrix}^m_d[i][j]$ can be updated only at line A5:17. Therefore, by Lemma 14, it is non-decreasing.

Lemma 23. Let $p^m_d$ be a replica in data center $d$. Let $t_1$ and $t_2$ be two transactions that originate at data center $j \neq d$ and are forwarded by $p^m_d$ to sibling replica $p^m_i$ in data center $i \notin \{d, j\}$ at time $\tau_1$ and $\tau_2$ (line A4:25), respectively. Then

\[ \tau_1 < \tau_2 \implies \mathit{commitVec}(t_1)[j] < \mathit{commitVec}(t_2)[j]. \]

Proof. Since $t_1$ is forwarded by $p^m_d$ at time $\tau_1$, by line A4:23,

\[ \langle \mathit{tid}(t_1),\_,\_,\_ \rangle \in \mathit{committedCausal}^m_d(\tau_1)[j]. \]

By Lemmas 21 and 14,

\[ \mathit{commitVec}(t_1)[j] \le \mathit{knownVec}^m_d(\tau_1)[j]. \tag{8} \]

Assume that $\tau_1 < \tau_2$. We first argue that

\[ \langle \mathit{tid}(t_2),\_,\_,\_ \rangle \notin \mathit{committedCausal}^m_d(\tau_1)[j]. \tag{9} \]

Otherwise, by line A4:23,

\[ \mathit{commitVec}(t_2)[j] \le \mathit{globalMatrix}^m_d(\tau_1)[i][j]. \]

By Lemma 22,

\[ \mathit{commitVec}(t_2)[j] \le \mathit{globalMatrix}^m_d(\tau_2)[i][j]. \]

Therefore, by line A4:23, $t_2$ would not be forwarded by $p^m_d$ to $p^m_i$ at time $\tau_2$. Thus, (9) holds. Since $t_2$ is forwarded by $p^m_d$ to $p^m_i$ at time $\tau_2$,

\[ \langle \mathit{tid}(t_2),\_,\_,\_ \rangle \in \mathit{committedCausal}^m_d(\tau_2)[j]. \]

By Lemma 14 and line A4:14,

\[ \mathit{commitVec}(t_2)[j] > \mathit{knownVec}^m_d(\tau_1)[j]. \tag{10} \]

Putting (8) and (10) together yields

\[ \mathit{commitVec}(t_1)[j] < \mathit{commitVec}(t_2)[j]. \]

Lemma 24. Let $p^m_d$ be a replica in data center $d$. Consider a heartbeat $\mathit{knownVec}^m_d(\tau_1)[j]$ ($j \neq d$) sent by $p^m_d$ to sibling replica $p^m_i$ in data center $i \notin \{d, j\}$ at time $\tau_1$ (line A4:27). Let $t$ be a transaction that originates at data center $j$ and is forwarded by $p^m_d$ to $p^m_i$ at time $\tau_2$ (line A4:25). Then

\[ \tau_1 < \tau_2 \iff \mathit{knownVec}^m_d(\tau_1)[j] < \mathit{commitVec}(t)[j]. \]

Proof. We first show that

\[ \tau_1 < \tau_2 \implies \mathit{knownVec}^m_d(\tau_1)[j] < \mathit{commitVec}(t)[j]. \]

Assume that $\tau_1 < \tau_2$. We first argue that

\[ \langle \mathit{tid}(t),\_,\_,\_ \rangle \notin \mathit{committedCausal}^m_d(\tau_1)[j]. \tag{11} \]

Otherwise, since $t$ is not forwarded at time $\tau_1$, by line A4:23,

\[ \mathit{commitVec}(t)[j] \le \mathit{globalMatrix}^m_d(\tau_1)[i][j]. \]

By Lemma 22,

\[ \mathit{commitVec}(t)[j] \le \mathit{globalMatrix}^m_d(\tau_2)[i][j]. \]

Therefore, by line A4:23, $t$ would not be forwarded by $p^m_d$ to $p^m_i$ at time $\tau_2$. Thus, (11) holds. Since $t$ is forwarded by $p^m_d$ to $p^m_i$ at time $\tau_2$,

\[ \langle \mathit{tid}(t),\_,\_,\_ \rangle \in \mathit{committedCausal}^m_d(\tau_2)[j]. \]

By Lemma 14 and line A4:14,

\[ \mathit{knownVec}^m_d(\tau_1)[j] < \mathit{commitVec}(t)[j]. \]

Next we show that (note that $\tau_1 \neq \tau_2$)

\[ \tau_2 < \tau_1 \implies \mathit{commitVec}(t)[j] \le \mathit{knownVec}^m_d(\tau_1)[j]. \]

Since $t$ is forwarded by $p^m_d$ to $p^m_i$ at time $\tau_2$, by line A4:23,

\[ \langle \mathit{tid}(t),\_,\_,\_ \rangle \in \mathit{committedCausal}^m_d(\tau_2)[j]. \]

By Lemmas 21 and 14,

\[ \mathit{commitVec}(t)[j] \le \mathit{knownVec}^m_d(\tau_2)[j]. \]

Assume that $\tau_2 < \tau_1$. By Lemma 14,

\[ \mathit{knownVec}^m_d(\tau_2)[j] \le \mathit{knownVec}^m_d(\tau_1)[j]. \]

Putting it together yields

\[ \mathit{commitVec}(t)[j] \le \mathit{knownVec}^m_d(\tau_1)[j]. \]

Lemma 25. Let $t \in T_{\text{causal}}$ be a causal transaction that originates at data center $i$ and accesses partition $m$. If

\[ \mathit{commitVec}(t)[i] \le \mathit{knownVec}^m_d[i] \]

for replica $p^m_d$ in data center $d \neq i$, then

\[ \mathit{log}(t)[m] \subseteq \mathit{opLog}^m_d. \]

Proof. Note that for $i \in \mathcal{D} \setminus \{d\}$, $\mathit{knownVec}^m_d[i]$ can be updated only at lines A4:18 or A4:21 due to replication of transactions or heartbeats respectively, either directly from data center $i$ (line A4:1) or indirectly from a third data center $j \neq i$ (line A4:22).

We proceed by induction on the length of the execution. In the following, for replica $p^m_d$ in data center $d \in \mathcal{D}$, we denote the value of $\mathit{knownVec}^m_d$ (resp. $\mathit{opLog}^m_d$) after $k$ steps in an execution by $\mathit{knownVec}^m_d(k)$ (resp. $\mathit{opLog}^m_d(k)$).
• Base Case. $k = 0$. It holds trivially, since for replica $p^m_d$ in any data center $d \neq i$, $\mathit{knownVec}^m_d(0)[i] = 0$.
• Induction Hypothesis. Suppose that for any execution of length $k$, we have
\[ \forall d \in \mathcal{D} \setminus \{i\}.\ \forall t \in T_{\text{causal}}.\ \big(\mathit{commitVec}(t)[i] \le \mathit{knownVec}^m_d(k)[i] \implies \mathit{log}(t)[m] \subseteq \mathit{opLog}^m_d(k)\big). \]
• Induction Step. Consider an execution of length $k+1$. If the $(k+1)$-st step of this execution does not update $\mathit{knownVec}^m_d[i]$ for replica $p^m_d$ in any data center $d \neq i$, then by the induction hypothesis,
\[ \forall d \in \mathcal{D} \setminus \{i\}.\ \forall t \in T_{\text{causal}}.\ \big(\mathit{commitVec}(t)[i] \le \mathit{knownVec}^m_d(k+1)[i] \implies \mathit{log}(t)[m] \subseteq \mathit{opLog}^m_d(k+1)\big). \]
Otherwise, we perform a case analysis according to how $\mathit{knownVec}^m_d[i]$ of replica $p^m_d$ in data center $d \neq i$ is updated in the $(k+1)$-st step.
  – CASE I: $\mathit{knownVec}^m_d[i]$ is updated due to delivery of a message from data center $i$. By Lemmas 13, 19, and 20, local transactions and heartbeats are propagated by $p^m_i$ to sibling replicas in increasing order of their local timestamps $\mathit{commitVec}[i]$ and $\mathit{knownVec}^m_i[i]$ values. Therefore, by Assumption 2 and the induction hypothesis,
  \[ \forall t \in T_{\text{causal}}.\ \big(\mathit{commitVec}(t)[i] \le \mathit{knownVec}^m_d(k+1)[i] \implies \mathit{log}(t)[m] \subseteq \mathit{opLog}^m_d(k+1)\big). \]
  – CASE II: $\mathit{knownVec}^m_d[i]$ is updated due to delivery of a message from a third data center $j \neq i$. By Lemmas 14, 23, and 24, transactions originating at data center $i$ and heartbeats are forwarded by some replica, say $p^m_j$ ($j \neq i$), to sibling replicas in increasing order of their local timestamps $\mathit{commitVec}[i]$ and $\mathit{knownVec}^m_j[i]$ values. Therefore, by Assumption 2 and the induction hypothesis,
  \[ \forall t \in T_{\text{causal}}.\ \big(\mathit{commitVec}(t)[i] \le \mathit{knownVec}^m_d(k+1)[i] \implies \mathit{log}(t)[m] \subseteq \mathit{opLog}^m_d(k+1)\big). \]

Lemma 26 (PROPERTY 1). Let $t \in T_{\text{causal}}$ be a causal transaction that originates at data center $i$ and accesses partition $m$. If

\[ \mathit{commitVec}(t)[i] \le \mathit{knownVec}^m_d[i] \]

for replica $p^m_d$ in data center $d$, then

\[ \mathit{log}(t)[m] \subseteq \mathit{opLog}^m_d. \]

Proof. By Lemmas 18 and 25.

D.3.2 Properties of stableVec

Lemma 27. For $i \in \mathcal{D}$, $\mathit{stableVec}^m_d[i]$ at any replica $p^m_d$ in data center $d$ is non-decreasing.

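Proof. Note that $\mathit{stableVec}^m_d[i]$ ($i \in \mathcal{D}$) can be updated only at line A5:8. By Lemma 15 and Assumption 2, $\mathit{stableVec}^m_d[i]$ is non-decreasing.

Lemma 28 (PROPERTY 2). For any replica $p^m_d$ in data center $d$,

\[ \forall i \in \mathcal{D}.\ \forall n \in \mathcal{P}.\ \mathit{stableVec}^m_d[i] \le \mathit{knownVec}^n_d[i]. \]

Proof. Note that $\mathit{stableVec}^m_d[i]$ ($i \in \mathcal{D}$) can be updated only at line A5:8. By the way $\mathit{stableVec}^m_d[i]$ is updated and Lemmas 13 and 14,

\[ \forall n \in \mathcal{P}.\ \mathit{stableVec}^m_d[i] \le \mathit{knownVec}^n_d[i]. \]

Lemma 29. Let $t \in T_{\text{causal}}$ be a causal transaction that originates at data center $i$ and accesses partition $n$. If

\[ \mathit{commitVec}(t)[i] \le \mathit{stableVec}^m_d[i] \]

for some replica $p^m_d$ in data center $d$, then

\[ \mathit{log}(t)[n] \subseteq \mathit{opLog}^n_d. \]

Proof. By Lemma 28,

\[ \mathit{stableVec}^m_d[i] \le \mathit{knownVec}^n_d[i]. \]

Therefore,

\[ \mathit{commitVec}(t)[i] \le \mathit{knownVec}^n_d[i]. \]

By Lemma 25, $\mathit{log}(t)[n] \subseteq \mathit{opLog}^n_d$.

D.3.3 Properties of uniformVec

Lemma 30. For $i \in \mathcal{D}$, $\mathit{uniformVec}^m_d[i]$ at any replica $p^m_d$ in data center $d$ is non-decreasing.

Proof. Note that whenever $\mathit{uniformVec}^m_d[i]$ is updated at lines A2:3, A3:3, A3:10, or A5:15, we take the maximum of it and some scalar value.

Lemma 31. Let $e \in E$ be an event issued by client $\mathit{cl}$ to replica $p^m_d$ in data center $d$. Then

\[ e \in E \setminus Q \implies \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{pastVec}_{\mathit{cl}})_e[i] \le (\mathit{uniformVec}^m_d)_e[i], \]

and

\[ e \in Q \implies (\mathit{pastVec}_{\mathit{cl}})_e[d] \le (\mathit{uniformVec}^m_d)_e[d]. \]

Proof. We perform a case analysis according to the type of event $e$.
• CASE I: $e \in S$. By line A2:3,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{pastVec}_{\mathit{cl}})_e[i] \le (\mathit{uniformVec}^m_d)_e[i]. \]
• CASE II: $e \in R \cup U$. In this case,
\[ (\mathit{pastVec}_{\mathit{cl}})_e = (\mathit{pastVec}_{\mathit{cl}})_{\mathit{st}(e)}. \]
By CASE I,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{pastVec}_{\mathit{cl}})_{\mathit{st}(e)}[i] \le (\mathit{uniformVec}^m_d)_{\mathit{st}(e)}[i]. \]
By Lemma 30,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{uniformVec}^m_d)_{\mathit{st}(e)}[i] \le (\mathit{uniformVec}^m_d)_e[i]. \]
Putting it together yields
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{pastVec}_{\mathit{cl}})_e[i] \le (\mathit{uniformVec}^m_d)_e[i]. \]
• CASE III: $e \in C_{\text{causal}}$. By line A1:16,
\[ (\mathit{pastVec}_{\mathit{cl}})_e = \mathit{vc}(\text{COMMIT\_CAUSAL\_TX}, e). \]
By lines A2:24, A2:26, and A2:31,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ \mathit{vc}(\text{COMMIT\_CAUSAL\_TX}, e)[i] = (\mathit{snapVec}^m_d)_e[\mathit{tx}(e)][i]. \]
By line A2:5,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{snapVec}^m_d)_e[\mathit{tx}(e)][i] = (\mathit{uniformVec}^m_d)_{\mathit{st}(e)}[i]. \]
By Lemma 30,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{uniformVec}^m_d)_{\mathit{st}(e)}[i] \le (\mathit{uniformVec}^m_d)_e[i]. \]
Putting it together yields
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{pastVec}_{\mathit{cl}})_e[i] \le (\mathit{uniformVec}^m_d)_e[i]. \]
• CASE IV: $e \in C_{\text{strong}}$. By line A1:22,
\[ (\mathit{pastVec}_{\mathit{cl}})_e = \mathit{vc}(\text{COMMIT\_STRONG\_TX}, e). \]
By (5),
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ \mathit{vc}(\text{COMMIT\_STRONG\_TX}, e)[i] = (\mathit{snapVec}^m_d)_e[\mathit{tx}(e)][i]. \]
Therefore, similar to CASE III, we have
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{pastVec}_{\mathit{cl}})_e[i] \le (\mathit{uniformVec}^m_d)_e[i]. \]
• CASE V: $e \in Q$. By line A3:22,
\[ (\mathit{pastVec}_{\mathit{cl}})_e[d] \le (\mathit{uniformVec}^m_d)_e[d]. \]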

• CASE VI: $e \in A$. By line A3:24,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{pastVec}_{cl})_e[i] \le (\mathit{uniformVec}_d^m)_e[i]. \]

Lemma 32. Let $cl$ be a client and $d \triangleq \mathit{cur\_dc}(cl)$. At any time,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ \mathit{pastVec}_{cl}[i] \le \mathit{uniformVec}_d^m[i] \]
for some replica $p_d^m$ in data center $d$.

Proof. By a simple induction on the number of events that $cl$ issues and Lemmas 31 and 30.

Lemma 33. Let $t$ be a transaction that originates at data center $d$. At any time,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ \mathit{snapshotVec}(t)[i] \le \mathit{uniformVec}_d^m[i] \]
for some replica $p_d^m$ in data center $d$.

Proof. By line A2:5 and Lemma 30.

Lemma 34 (PROPERTY 3). For any replica $p_d^m$ in data center $d$,
\[ \forall i \in \mathcal{D}.\ \exists g \subseteq \mathcal{D}.\ |g| \ge f + 1 \wedge d \in g \wedge \big(\forall j \in g.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m[i] \le \mathit{knownVec}_j^n[i]\big). \]

Proof. Fix $i \in \mathcal{D}$. We proceed by induction on the length of the execution. In the following, we denote the value of $\mathit{knownVec}_d^m$, $\mathit{stableVec}_d^m$, $\mathit{uniformVec}_d^m$, $\mathit{stableMatrix}_d^m$, and $\mathit{pastVec}_{cl}$ (for some client $cl$) after $k$ steps of an execution by $\mathit{knownVec}_d^m(k)$, $\mathit{stableVec}_d^m(k)$, $\mathit{uniformVec}_d^m(k)$, $\mathit{stableMatrix}_d^m(k)$, and $\mathit{pastVec}_{cl}(k)$, respectively.

• Base Case. $k = 0$. It holds trivially since $\mathit{uniformVec}_d^m(0)[i] = 0$.

• Induction Hypothesis. Suppose that for any execution of length $k$, for any replica $p_d^m$ in data center $d$,
\[ \exists g \subseteq \mathcal{D}.\ |g| \ge f + 1 \wedge d \in g \wedge \big(\forall j \in g.\ \forall 1 \le n \le N.\ \mathit{uniformVec}_d^m(k)[i] \le \mathit{knownVec}_j^n(k)[i]\big). \]

• Induction Step. Consider an execution of length $k + 1$. If the $(k+1)$-st step of this execution does not update $\mathit{uniformVec}_d^m[i]$ for any replica $p_d^m$ in data center $d$, then by the induction hypothesis and Lemma 15,
\[ \exists g \subseteq \mathcal{D}.\ |g| \ge f + 1 \wedge d \in g \wedge \big(\forall j \in g.\ \forall 1 \le n \le N.\ \mathit{uniformVec}_d^m(k+1)[i] = \mathit{uniformVec}_d^m(k)[i] \le \mathit{knownVec}_j^n(k)[i] \le \mathit{knownVec}_j^n(k+1)[i]\big). \]
Otherwise, we perform a case analysis according to how $\mathit{uniformVec}_d^m[i]$ is updated.

– CASE I: $\mathit{uniformVec}_d^m[i]$ is updated at line A5:15. By line A5:12,
\[ \exists g' \subseteq \mathcal{D}.\ |g'| \ge f + 1 \wedge d \in g' \wedge \mathit{uniformVec}_d^m(k+1)[i] = \max\Big\{\mathit{uniformVec}_d^m(k)[i],\ \min_{j \in g'} \mathit{stableMatrix}_d^m(k+1)[j][i]\Big\}. \tag{12} \]
By the induction hypothesis and Lemma 14,
\[ \exists g'' \subseteq \mathcal{D}.\ |g''| \ge f + 1 \wedge d \in g'' \wedge \big(\forall j \in g''.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(k)[i] \le \mathit{knownVec}_j^n(k)[i] \le \mathit{knownVec}_j^n(k+1)[i]\big). \tag{13} \]
By Lemma 27, for the particular $g' \subseteq \mathcal{D}$ in (12),
\[ \forall j \in g'.\ \mathit{stableMatrix}_d^m(k+1)[j][i] \le \mathit{stableVec}_j^m(k+1)[i]. \tag{14} \]
By Lemmas 28 and 14, for any replica $p_j^m$ in data center $j$,
\[ \forall n \in \mathcal{P}.\ \mathit{stableVec}_j^m(k+1)[i] \le \mathit{knownVec}_j^n(k+1)[i]. \tag{15} \]
Therefore, for the particular $g' \subseteq \mathcal{D}$ in (12),
\[ \forall j' \in g'.\ \forall n \in \mathcal{P}.\ \min_{j \in g'} \mathit{stableMatrix}_d^m(k+1)[j][i] \le \mathit{knownVec}_{j'}^n(k+1)[i]. \tag{16} \]
By (12), (13), and (16), we can either take $g = g'$ in (12) or $g = g''$ in (13) such that
\[ \forall j \in g.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(k+1)[i] \le \mathit{knownVec}_j^n(k+1)[i]. \]
Therefore,
\[ \exists g \subseteq \mathcal{D}.\ |g| \ge f + 1 \wedge d \in g \wedge \big(\forall j \in g.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(k+1)[i] \le \mathit{knownVec}_j^n(k+1)[i]\big). \]

– CASE II: $\mathit{uniformVec}_d^m[i]$ ($i \in \mathcal{D} \setminus \{d\}$) is updated at line A2:3. Then there exists some client $cl$ with $d = \mathit{cur\_dc}(cl)$ such that
\[ \mathit{uniformVec}_d^m(k+1)[i] = \max\big\{\mathit{pastVec}_{cl}(k)[i],\ \mathit{uniformVec}_d^m(k)[i]\big\}. \tag{17} \]
By the induction hypothesis and Lemma 14,
\[ \exists g' \subseteq \mathcal{D}.\ |g'| \ge f + 1 \wedge d \in g' \wedge \big(\forall j \in g'.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(k)[i] \le \mathit{knownVec}_j^n(k)[i] \le \mathit{knownVec}_j^n(k+1)[i]\big). \tag{18} \]

By Lemma 32, the induction hypothesis, and Lemma 15,
\[ \exists g'' \subseteq \mathcal{D}.\ |g''| \ge f + 1 \wedge d \in g'' \wedge \big(\forall j \in g''.\ \forall n \in \mathcal{P}.\ \mathit{pastVec}_{cl}(k)[i] \le \mathit{knownVec}_j^n(k+1)[i]\big). \tag{19} \]
By (17), (18), and (19), we can take $g = g'$ in (18) or $g = g''$ in (19) such that
\[ \forall j \in g.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(k+1)[i] \le \mathit{knownVec}_j^n(k+1)[i]. \]
Therefore,
\[ \exists g \subseteq \mathcal{D}.\ |g| \ge f + 1 \wedge d \in g \wedge \big(\forall j \in g.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(k+1)[i] \le \mathit{knownVec}_j^n(k+1)[i]\big). \]

– CASE III: $\mathit{uniformVec}_d^m[i]$ ($i \in \mathcal{D} \setminus \{d\}$) is updated at lines A3:3 or A3:10. Therefore, there exists some transaction $t$ originating at data center $d$ such that
\[ \mathit{uniformVec}_d^m(k+1)[i] = \max\big\{\mathit{snapshotVec}(t)[i],\ \mathit{uniformVec}_d^m(k)[i]\big\}. \tag{20} \]
By the induction hypothesis and Lemma 15,
\[ \exists g' \subseteq \mathcal{D}.\ |g'| \ge f + 1 \wedge d \in g' \wedge \big(\forall j \in g'.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(k)[i] \le \mathit{knownVec}_j^n(k)[i] \le \mathit{knownVec}_j^n(k+1)[i]\big). \tag{21} \]
By Lemma 33, the induction hypothesis, and Lemma 15,
\[ \exists g'' \subseteq \mathcal{D}.\ |g''| \ge f + 1 \wedge d \in g'' \wedge \big(\forall j \in g''.\ \forall n \in \mathcal{P}.\ \mathit{snapshotVec}(t)[i] \le \mathit{knownVec}_j^n(k+1)[i]\big). \tag{22} \]
By (20), (21), and (22), we can take $g = g'$ in (21) or $g = g''$ in (22) such that
\[ \forall j \in g.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(k+1)[i] \le \mathit{knownVec}_j^n(k+1)[i]. \]
Therefore,
\[ \exists g \subseteq \mathcal{D}.\ |g| \ge f + 1 \wedge d \in g \wedge \big(\forall j \in g.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(k+1)[i] \le \mathit{knownVec}_j^n(k+1)[i]\big). \]

Lemma 35. For any replica $p_d^m$ in data center $d$,
\[ \forall i \in \mathcal{D}.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m[i] \le \mathit{knownVec}_d^n[i]. \]

Proof. By Lemma 34.

Lemma 36. Let $t \in T_{\mathit{causal}}$ be a causal transaction that originates at data center $i$ and accesses partition $n$. If
\[ \mathit{commitVec}(t)[i] \le \mathit{uniformVec}_d^m[i] \]
for some replica $p_d^m$ in data center $d$, then
\[ \mathit{log}(t)[n] \subseteq \mathit{opLog}_d^n. \]

Proof. By Lemma 35,
\[ \mathit{uniformVec}_d^m[i] \le \mathit{knownVec}_d^n[i]. \]
Therefore,
\[ \mathit{commitVec}(t)[i] \le \mathit{knownVec}_d^n[i]. \]
By Lemma 25,
\[ \mathit{log}(t)[n] \subseteq \mathit{opLog}_d^n. \]

Lemma 37. For any replica $p_d^m$ in data center $d$,
\[ \mathit{uniformVec}_d^m[d] \le \mathit{clock}_d^m. \]

Proof. By Lemma 35,
\[ \mathit{uniformVec}_d^m[d] \le \mathit{knownVec}_d^m[d]. \]
By Lemma 16,
\[ \mathit{knownVec}_d^m[d] \le \mathit{clock}_d^m. \]
Putting it together yields
\[ \mathit{uniformVec}_d^m[d] \le \mathit{clock}_d^m. \]

D.3.4 Properties of pastVec

Lemma 38. Let $e \in S$ be a START event of transaction $t$ issued by client $cl$. Then
\[ (\mathit{pastVec}_{cl})_e \le \mathit{snapshotVec}(t). \]

Proof. By Definition 10 of $\mathit{snapshotVec}(t)$ and lines A2:2–A2:7.

Lemma 39. For $i \in \mathcal{D}$, $\mathit{pastVec}_{cl}[i]$ at any client $cl$ is non-decreasing.

Proof. Note that $\mathit{pastVec}_{cl}[i]$ ($i \in \mathcal{D}$) is updated only at lines A1:16 or A1:22 when some transaction is committed. Therefore, the lemma holds due to Lemmas 12 and 38.
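The update analyzed in CASE I of the proof of Lemma 34 (equation (12)) can be illustrated with a minimal Python sketch. It assumes that vectors are dictionaries keyed by data center and that the replica tries every group of exactly f + 1 data centers containing its own; the function name `update_uniform_vec` and this data layout are hypothetical and not taken from the paper's algorithms.

```python
from itertools import combinations


def update_uniform_vec(uniform_vec, stable_matrix, d, f):
    """Return the new uniformVec of a replica in data center d.

    uniform_vec:   dict, data center i -> scalar timestamp
    stable_matrix: dict, data center j -> (dict i -> scalar), the replica's
                   view of stableVec at data center j
    f:             assumed maximum number of failed data centers
    """
    others = [j for j in stable_matrix if j != d]
    new_vec = dict(uniform_vec)
    # Each group has f + 1 members and always contains d itself.
    for rest in combinations(others, f):
        group = (d, *rest)
        for i in new_vec:
            # Timestamp known to be stable at every member of the group.
            ts = min(stable_matrix[j][i] for j in group)
            # Taking the maximum keeps every entry non-decreasing (Lemma 30).
            new_vec[i] = max(new_vec[i], ts)
    return new_vec


if __name__ == "__main__":
    matrix = {
        0: {0: 5, 1: 4, 2: 3},
        1: {0: 4, 1: 6, 2: 1},
        2: {0: 2, 1: 5, 2: 7},
    }
    print(update_uniform_vec({0: 0, 1: 0, 2: 2}, matrix, d=0, f=1))
    # {0: 4, 1: 4, 2: 3}
```

Because each new entry is either the old value or a minimum over a group of f + 1 data centers, the group witnessing the bound in Lemma 34 is either the previous one or the group that produced the new minimum.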

D.4 Metadata for Strong Transactions

Lemma 40. For any replica $p_d^m$ in any data center $d$, $\mathit{knownVec}_d^m[\mathit{strong}]$ is non-decreasing.

Proof. By (R6) and line A6:6.

Lemma 41. For any replica $p_d^m$ in any data center $d$, $\mathit{stableVec}_d^m[\mathit{strong}]$ is non-decreasing.

Proof. By Lemma 40, Assumption 2, and line A5:9.

Lemma 42 (PROPERTY 6). Let $t \in T_{\mathit{strong}}$ be a strong transaction that originates at data center $i$ and accesses partition $m$. If
\[ \mathit{commitVec}(t)[\mathit{strong}] \le \mathit{knownVec}_d^m[\mathit{strong}] \]
for some replica $p_d^m$ in data center $d$, then
\[ \mathit{log}(t)[m] \subseteq \mathit{opLog}_d^m. \]

Proof. Note that $\mathit{knownVec}_d^m[\mathit{strong}]$ can be updated only at line A6:9. By (R6), all committed strong transactions with strong timestamps less than or equal to $\mathit{knownVec}_d^m[\mathit{strong}]$ have been delivered to $p_d^m$. By line A6:8,
\[ \mathit{log}(t)[m] \subseteq \mathit{opLog}_d^m. \]

D.5 Timestamps

Definition 43 (Timestamps of Events). Let $e \in C_{\mathit{causal}} \cup C_{\mathit{strong}} \cup Q \cup A$ be an event issued by client $cl$. We define its timestamp $ts(e)$ as
\[ ts(e) \triangleq (\mathit{pastVec}_{cl})_e. \]
Let $e \in S$ be a START event of transaction $t$. Let $d \triangleq dc(t)$ and $m \triangleq \mathit{coord}(t)$. We define its timestamp $ts(e)$ as
\[ ts(e) \triangleq (\mathit{snapVec}_d^m)_e[t]. \]
See lines A1:3, A1:16, A1:22, A1:27, and A1:32 for START, COMMIT_CAUSAL_TX, COMMIT_STRONG_TX, CL_UNIFORM_BARRIER, and CL_ATTACH events, respectively.

Definition 44 (Timestamps of Transactions). The timestamp $ts(t)$ of a transaction $t$ is that of its commit event, i.e.,
\[ \forall t \in T.\ ts(t) \triangleq ts(ct(t)). \]

Lemma 45. Let $e \in S$ be a START event. Let $d \triangleq dc(tx(e))$ and $m \triangleq \mathit{coord}(tx(e))$. Then
\[ \big(\forall i \in \mathcal{D}.\ ts(e)[i] \ge (\mathit{uniformVec}_d^m)_e[i]\big) \wedge ts(e)[\mathit{strong}] \ge (\mathit{stableVec}_d^m)_e[\mathit{strong}]. \]

Proof. By lines A2:5, A2:6, and A2:7.

Lemma 46. Let $e \in R_{\mathrm{EXT}}$ be an external read event. Let $d \triangleq dc(tx(e))$ and $m \triangleq \mathit{coord}(tx(e))$. Then
\[ ts(st(e)) = \mathit{snapshotVec}(tx(e)) = \mathit{snapVec}(\text{GET\_VERSION}, e) = (\mathit{snapVec}_d^m)_{st(e)}[tx(e)]. \]

Proof. By Definition 43 of timestamps, line A2:13, and line A2:6.

Lemma 47. Let $e \in R_{\mathrm{EXT}}$ be an external read event which reads from transaction $t$. Then
\[ ts(t) \le ts(st(e)). \]

Proof. By line A3:5,
\[ \mathit{commitVec}(\text{GET\_VERSION}, e) \le \mathit{snapVec}(\text{GET\_VERSION}, e). \]
Since $e$ reads from $t$,
\[ ts(t) = \mathit{commitVec}(\text{GET\_VERSION}, e). \]
By Lemma 46,
\[ ts(st(e)) = \mathit{snapVec}(\text{GET\_VERSION}, e). \]
Therefore,
\[ ts(t) \le ts(st(e)). \]

Lemma 48.
\[ \forall e \in (C_{\mathrm{RW}} \cap C_{\mathit{causal}}) \cup C_{\mathit{strong}}.\ ts(e) = ts(tx(e)) = \mathit{commitVec}(tx(e)). \]

Proof. By Definition 43 of timestamps and Definition 11 of $\mathit{commitVec}(tx(e))$.

Lemma 49.
\[ \forall t \in T.\ ts(t) \ge ts(st(t)). \]

Proof. By Lemmas 12, 46, and 48.

D.6 Session Order

Lemma 50.
\[ \forall s_1, s_2 \in X.\ s_1 \xrightarrow{\mathit{so}} s_2 \Rightarrow ts(s_1) \le ts(s_2) \wedge \big(s_2 \in T \Rightarrow ts(s_1) \le ts(st(s_2)) \le ts(s_2)\big). \]

Proof. By Definitions 43 and 44 of timestamps and Lemma 39,
\[ ts(s_1) \le ts(s_2), \]
and
\[ s_2 \in T \Rightarrow ts(s_1) \le ts(st(s_2)). \]
Besides, by Lemma 12,
\[ s_2 \in T \Rightarrow ts(st(s_2)) \le ts(s_2). \]
Therefore,
\[ s_2 \in T \Rightarrow ts(s_1) \le ts(st(s_2)) \le ts(s_2). \]
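The inequalities in Lemma 50 compare vector timestamps entry by entry. A minimal Python sketch of that pointwise order and of a check of the session-order property on a recorded session follows; the representation of a session as a list of `(start_ts, ts)` pairs and the names `vec_leq` and `session_is_monotone` are assumptions made for illustration only.

```python
def vec_leq(a, b):
    """Pointwise order on vector timestamps that share the same keys."""
    return all(a[k] <= b[k] for k in a)


def session_is_monotone(session):
    """Check the property of Lemma 50 on one client's session.

    session: list of (start_ts, ts) pairs in session order, where ts is the
    timestamp of the event and start_ts is ts(st(s)) for a transaction or
    None for a barrier or attach event. Checking consecutive pairs is
    enough because the pointwise order is transitive.
    """
    for (_, ts1), (start2, ts2) in zip(session, session[1:]):
        if not vec_leq(ts1, ts2):
            return False
        if start2 is not None and not (vec_leq(ts1, start2) and vec_leq(start2, ts2)):
            return False
    return True
```

A session in which a later transaction starts from a snapshot that misses an entry of the previous commit timestamp would make `session_is_monotone` return `False`.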

D.7 Lamport Clocks

Definition 51 (Lamport Clocks of Events). Let $e \in C_{\mathit{causal}} \cup C_{\mathit{strong}} \cup Q \cup A$ be an event issued by client $cl$. We define its Lamport clock $\mathit{lclock}(e)$ as
\[ \mathit{lclock}(e) \triangleq (\mathit{lc}_{cl})_e. \]
See lines A1:14, A1:23, A1:28, and A1:33 for COMMIT_CAUSAL_TX, COMMIT_STRONG_TX, CL_UNIFORM_BARRIER, and CL_ATTACH events, respectively.

Definition 52 (Lamport Clocks of Transactions). The Lamport clock $\mathit{lclock}(t)$ of a transaction $t$ is that of its commit event, i.e.,
\[ \forall t \in T.\ \mathit{lclock}(t) \triangleq \mathit{lclock}(ct(t)). \]

Lemma 53. Let $e \in R_{\mathrm{EXT}}$ be an external read event issued by client $cl$. Then
\[ (\mathit{lc}_{cl})_e < \mathit{lclock}(tx(e)). \]

Proof. If $tx(e)$ is a causal transaction, by line A1:14,
\[ \mathit{lclock}(e) < \mathit{lclock}(ct(e)) = \mathit{lclock}(tx(e)). \]
If $tx(e)$ is a strong transaction, by line A1:19 and (6),
\[ \mathit{lclock}(e) < \mathit{lclock}(ct(e)) = \mathit{lclock}(tx(e)). \]

Definition 54 (Lamport Clock Order). The Lamport clock order $\mathit{lc}$ on $X$ is the total order defined by their Lamport clocks, with their client identifiers for tie-breaking.

Lemma 55.
\[ \mathit{so} \subseteq \mathit{lc}. \]

Proof. By lines A1:14, A1:19, (6), A1:23, A1:28, and A1:33.

Lemma 56. Let $e \in R_{\mathrm{EXT}}$ be an external read event which reads from transaction $t$. Then
\[ t \xrightarrow{\mathit{lc}} tx(e). \]

Proof. Suppose that $e$ is issued by client $cl$. By line A1:8,
\[ \mathit{lclock}(t) \le (\mathit{lc}_{cl})_e. \]
By Lemma 53,
\[ (\mathit{lc}_{cl})_e < \mathit{lclock}(tx(e)). \]
Therefore,
\[ \mathit{lclock}(t) < \mathit{lclock}(tx(e)). \]
By Definition 54 of $\mathit{lc}$,
\[ t \xrightarrow{\mathit{lc}} tx(e). \]

D.8 Visibility Relation

Definition 57 (Visibility Relation).
\[ \forall s_1, s_2 \in X.\ s_1 \xrightarrow{\mathit{vis}} s_2 \Leftrightarrow \Big(\big(s_2 \in T \Rightarrow ts(s_1) \le ts(st(s_2))\big) \wedge \big(s_2 \in Q \cup A \Rightarrow ts(s_1) \le ts(s_2)\big) \wedge s_1 \xrightarrow{\mathit{lc}} s_2\Big). \]

Theorem 58.
\[ A \models \text{CONFLICTORDERING}. \]

Proof. We need to show that
\[ \forall t_1, t_2 \in T_{\mathit{strong}}.\ t_1 \bowtie t_2 \Rightarrow t_1 \xrightarrow{\mathit{vis}} t_2 \vee t_2 \xrightarrow{\mathit{vis}} t_1. \]
Consider the history $h$ of TCS. By Theorem 9, $h \mid \mathit{committed}(h)$ has a legal permutation $\pi$. Suppose that
\[ \mathit{certify}(\mathit{tid}(t_1), \_, \_, \_, \_) \prec_\pi \mathit{certify}(\mathit{tid}(t_2), \_, \_, \_, \_). \]
Since $t_1 \bowtie t_2$ and $t_2$ is committed, by (4),
\[ \mathit{commitVec}(t_1) \le \mathit{snapshotVec}(t_2). \]
By Lemmas 46 and 48,
\[ ts(t_1) \le ts(st(t_2)). \]
On the other hand, by (6),
\[ \mathit{lclock}(t_1) < \mathit{lclock}(t_2). \]
By Definition 54 of $\mathit{lc}$,
\[ t_1 \xrightarrow{\mathit{lc}} t_2. \]
Therefore, by Definition 57 of $\mathit{vis}$,
\[ t_1 \xrightarrow{\mathit{vis}} t_2. \]

Lemma 59.
\[ \mathit{so} \subseteq \mathit{vis}. \]

Proof. By Lemmas 50 and 55.

Lemma 60. The visibility relation $\mathit{vis}$ is a partial order.

Proof. We need to show that

• $\mathit{vis}$ is irreflexive. This holds because $\mathit{lc}$ is irreflexive.

• $\mathit{vis}$ is transitive. Suppose that $s_1 \xrightarrow{\mathit{vis}} s_2 \xrightarrow{\mathit{vis}} s_3$. By Definition 57 of $\mathit{vis}$,
\[ s_1 \xrightarrow{\mathit{lc}} s_2 \xrightarrow{\mathit{lc}} s_3. \]
By Definition 54 of $\mathit{lc}$,
\[ s_1 \xrightarrow{\mathit{lc}} s_3. \]
Regarding timestamps, we distinguish between the following four cases and use Lemma 49.

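– $s_2 \in Q \cup A \wedge s_3 \in Q \cup A$.
\[ ts(s_1) \le ts(s_2) \le ts(s_3). \]

– $s_2 \in Q \cup A \wedge s_3 \in T$.
\[ ts(s_1) \le ts(s_2) \le ts(st(s_3)). \]

– $s_2 \in T \wedge s_3 \in Q \cup A$.
\[ ts(s_1) \le ts(st(s_2)) \le ts(s_2) \le ts(s_3). \]

– $s_2 \in T \wedge s_3 \in T$.
\[ ts(s_1) \le ts(st(s_2)) \le ts(s_2) \le ts(st(s_3)). \]

By Definition 57 of $\mathit{vis}$,
\[ s_1 \xrightarrow{\mathit{vis}} s_3. \]

Theorem 61.
\[ A \models \text{CAUSALVISIBILITY}. \]

Proof. By Lemmas 59 and 60,
\[ (\mathit{so} \cup \mathit{vis})^+ = \mathit{vis}^+ = \mathit{vis}. \]

Lemma 62 (PROPERTY 5). For any two conflicting transactions $t_1$ and $t_2$,
\[ t_1 \xrightarrow{\mathit{vis}} t_2 \Leftrightarrow \mathit{commitVec}(t_1)[\mathit{strong}] < \mathit{commitVec}(t_2)[\mathit{strong}]. \]

Proof. We first show that
\[ t_1 \xrightarrow{\mathit{vis}} t_2 \Rightarrow \mathit{commitVec}(t_1)[\mathit{strong}] < \mathit{commitVec}(t_2)[\mathit{strong}]. \tag{23} \]
Assume that $t_1 \xrightarrow{\mathit{vis}} t_2$. By Definition 57 of $\mathit{vis}$,
\[ ts(t_1) \le ts(st(t_2)). \]
By Lemmas 46 and 48,
\[ \mathit{commitVec}(t_1) \le \mathit{snapshotVec}(t_2). \]
Therefore,
\[ \mathit{commitVec}(t_1)[\mathit{strong}] \le \mathit{snapshotVec}(t_2)[\mathit{strong}]. \]
By (5),
\[ \mathit{commitVec}(t_2)[\mathit{strong}] > \mathit{snapshotVec}(t_2)[\mathit{strong}]. \]
Putting it together yields
\[ \mathit{commitVec}(t_1)[\mathit{strong}] < \mathit{commitVec}(t_2)[\mathit{strong}]. \]
Next we show that
\[ t_1 \xrightarrow{\mathit{vis}} t_2 \Leftarrow \mathit{commitVec}(t_1)[\mathit{strong}] < \mathit{commitVec}(t_2)[\mathit{strong}]. \]
Assume that
\[ \mathit{commitVec}(t_1)[\mathit{strong}] < \mathit{commitVec}(t_2)[\mathit{strong}]. \tag{24} \]
Since $t_1 \bowtie t_2$, by Theorem 58,
\[ t_1 \xrightarrow{\mathit{vis}} t_2 \vee t_2 \xrightarrow{\mathit{vis}} t_1. \]
By (23) and (24),
\[ \neg(t_2 \xrightarrow{\mathit{vis}} t_1). \]
Therefore,
\[ t_1 \xrightarrow{\mathit{vis}} t_2. \]

D.9 Execution Order

Definition 63 (Execution Points). Let $k$ be a key. The “execution point” $ep(e, k)$ of event $e \in (R_{\mathrm{EXT}} \cap R_k) \cup C_k$ is defined as follows:

• If $e \in R_{\mathrm{EXT}} \cap R_k$, then $ep(e, k)$ is at line A3:5;
• If $e \in C_k \cap C_{\mathit{causal}}$, then $ep(e, k)$ is at line A3:19 for this particular key $k$;
• If $e \in C_k \cap C_{\mathit{strong}}$, then $ep(e, k)$ is at line A6:8 for delivery of the update of $tx(e)$ on this particular key $k$. Note that DELIVER is asynchronous with the commit event $e$.

Definition 64 (Per-key Execution Order). Let $k$ be a key. Suppose that $\{e_1, e_2\} \subseteq (R_{\mathrm{EXT}} \cap R_k) \cup C_k$. Event $e_1$ is executed before event $e_2$, denoted $e_1 \xrightarrow{\mathit{eo}_k} e_2$, if $ep(e_1, k)$ is executed before $ep(e_2, k)$ in real time.

Lemma 65. Let $k \in \mathit{Key}$ be a key, $t \in T_k$ be a transaction, and $e \in R_{\mathrm{EXT}} \cap R_k$ be an external read event. Suppose that $d \triangleq dc(t) = dc(tx(e))$. Then
\[ t \xrightarrow{\mathit{vis}} tx(e) \Rightarrow ct(t) \xrightarrow{\mathit{eo}_k} e. \]

Proof. By Definition 57 of $\mathit{vis}$,
\[ ts(t) \le ts(st(e)). \]
Since $e \in R_{\mathrm{EXT}}$, by Lemma 46,
\[ ts(t) \le \mathit{snapVec}(\text{GET\_VERSION}, e). \]
In the following, we distinguish between two cases according to whether $t \in T_{\mathit{causal}}$ or $t \in T_{\mathit{strong}}$. Let $m \triangleq \mathit{partition}(k)$.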

• CASE I: $t \in T_{\mathit{causal}}$. By Lemma 48,
\[ ts(t) = \mathit{commitVec}(t) \le \mathit{snapVec}(\text{GET\_VERSION}, e). \]
Therefore, after line A3:4 for $e$,
\[ (\mathit{knownVec}_d^m)_e[d] \ge \mathit{snapVec}(\text{GET\_VERSION}, e)[d] \ge \mathit{commitVec}(t)[d]. \tag{25} \]
By Lemma 18, COMMIT of Algorithm A3 for $ws(t)[m] \ni \langle k, \_ \rangle$ finishes before $e$ starts at replica $p_d^m$. By Definition 64 of $\mathit{eo}_k$,
\[ ct(t) \xrightarrow{\mathit{eo}_k} e. \]

• CASE II: $t \in T_{\mathit{strong}}$. By Lemma 48,
\[ ts(t) = \mathit{commitVec}(t) \le \mathit{snapVec}(\text{GET\_VERSION}, e). \]
Therefore, after line A3:4 for $e$,
\[ (\mathit{knownVec}_d^m)_e[\mathit{strong}] \ge \mathit{snapVec}(\text{GET\_VERSION}, e)[\mathit{strong}] \ge \mathit{commitVec}(t)[\mathit{strong}]. \tag{26} \]
By Lemma 42, DELIVER of Algorithm A6 for $ws(t)[m] \ni \langle k, \_ \rangle$ finishes before $e$ starts at replica $p_d^m$. By Definition 64 of $\mathit{eo}_k$,
\[ ct(t) \xrightarrow{\mathit{eo}_k} e. \]

D.10 Arbitration Relation

Definition 66 (Arbitration Relation). We define the arbitration relation $\mathit{ar}$ on $X$ as the Lamport clock order between them, i.e.,
\[ \mathit{ar} = \mathit{lc}. \]

Theorem 67.
\[ A \models \text{CAUSALARBITRATION}. \]

Proof. By Definition 57 of $\mathit{vis}$ and Definition 66 of $\mathit{ar}$,
\[ \mathit{vis} \subseteq \mathit{lc} = \mathit{ar}. \]

D.11 Return Values

It is straightforward to show that INTRVAL holds for internal read events.

Theorem 68.
\[ A \models \text{INTRVAL}. \]

Proof. Let $e \in R_{\mathrm{INT}} \cap R_k$ be an internal read event. The transaction $tx(e)$ contains update events on $k$. By line A2:12, $e$ reads from the last update event on $k$ preceding $e$ in $tx(e)$.

Now let $e$ be an external read event. For notational convenience, we define $V_e$ to be the set of update transactions on $k$ that are visible to $tx(e)$, and $S_e$ the set of update transactions on $k$ that are safe to read at line A3:5. By Assumption 6, (R5), and (R3), all transactions in $S_e$ are committed. Formally,

Definition 69 (Visibility Set). Let $e \in R_{\mathrm{EXT}} \cap R_k$ be an external read event on key $k$.
\[ V_e \triangleq \mathit{vis}^{-1}(tx(e)) \cap T_k. \]

Definition 70 (Safe Set). Let $e \in R_{\mathrm{EXT}} \cap R_k$ be an external read event on key $k$. Suppose that $e$ is issued to replica $p_d^m$ in data center $d$.
\[ S_e \triangleq \{t \in T_k : ts(t) \le \mathit{snapVec}(\text{GET\_VERSION}, e) \wedge \mathit{log}[t][k] \in (\mathit{opLog}_d^m)_e[k]\}. \]

Lemma 71. Let $e \in R_{\mathrm{EXT}} \cap R_k$ be an external read event on key $k$. Suppose that $e$ is issued to replica $p_d^m$ in data center $d$. When $e$ returns at $p_d^m$ (line A3:5), we have
\[ V_e \subseteq S_e. \]

Proof. For each $t \in V_e$, we need to show that $t \in S_e$. That is,
\[ ts(t) \le \mathit{snapVec}(\text{GET\_VERSION}, e) \tag{27} \]
and
\[ \mathit{log}[t][k] \in (\mathit{opLog}_d^m)_e[k]. \tag{28} \]
We first show that (27) holds. Since $t \in V_e$,
\[ t \xrightarrow{\mathit{vis}} tx(e). \]
By Definition 57 of $\mathit{vis}$,
\[ ts(t) \le ts(st(e)). \]
By Lemma 46,
\[ ts(t) \le \mathit{snapVec}(\text{GET\_VERSION}, e). \]
To show that (28) holds, we perform a case analysis according to whether $t$ is a local transaction in data center $d$ or a remote one in data center $i \ne d$.

• CASE I: $t$ is a local transaction in data center $d$. Since $t \xrightarrow{\mathit{vis}} tx(e)$, by Lemma 65,
\[ ct(t) \xrightarrow{\mathit{eo}_k} e. \]
Therefore,
\[ \mathit{log}[t][k] \in (\mathit{opLog}_d^m)_e[k]. \]

• CASE II: $t$ is a remote transaction in data center $i \ne d$. We distinguish between two cases according to whether $t \in T_{\mathit{causal}}$ or $t \in T_{\mathit{strong}}$.

– CASE I: $t \in T_{\mathit{causal}}$. Since $i \ne d$, by line A3:3,
\[ \mathit{snapVec}(\text{GET\_VERSION}, e)[i] \le (\mathit{uniformVec}_d^m)_e[i]. \]
By (27),
\[ ts(t)[i] \le (\mathit{uniformVec}_d^m)_e[i]. \]
By Lemma 36,
\[ \mathit{log}[t][k] \in (\mathit{opLog}_d^m)_e[k]. \]

– CASE II: $t \in T_{\mathit{strong}}$. By line A3:4,
\[ \mathit{snapVec}(\text{GET\_VERSION}, e)[\mathit{strong}] \le (\mathit{knownVec}_d^m)_e[\mathit{strong}]. \]
By (27),
\[ ts(t)[\mathit{strong}] \le (\mathit{knownVec}_d^m)_e[\mathit{strong}]. \]
By Lemma 42,
\[ \mathit{log}[t][k] \in (\mathit{opLog}_d^m)_e[k]. \]

Theorem 72.
\[ A \models \text{EXTRVAL}. \]

Proof. Let $e \in R_{\mathrm{EXT}} \cap R_k$ be an external read event on key $k$. Suppose that $e$ reads from transaction $t$ in $S_e$. Since all transactions in $S_e$ are committed, $t$ is committed. By Lemma 47,
\[ ts(t) \le ts(st(e)). \]
By Lemma 56,
\[ t \xrightarrow{\mathit{lc}} tx(e). \]
By Definition 57 of $\mathit{vis}$,
\[ t \xrightarrow{\mathit{vis}} tx(e). \]
By Definition 69 of $V_e$,
\[ t \in V_e. \]
Both $V_e$ and $S_e$ are totally ordered by $\mathit{lc}$. Since $t$ is the latest one in $S_e$ and $V_e \subseteq S_e$ (Lemma 71), $t$ is also the latest one in $V_e$. Thus, $e$ reads from $t$ in $V_e$. That is, $e$ reads from the update event $ud(t, k)$ of $V_e$.

Theorem 73.
\[ A \models \text{RVAL}. \]

Proof. By Theorems 68 and 72.

D.12 Uniformity

D.12.1 Uniformity of Causal Transactions Originating at Correct Data Centers

Lemma 74. For any replica $p_d^m$ in any correct data center $d \in \mathcal{C}$, $\mathit{knownVec}_d^m[d]$ grows without bound.

Proof. Since data center $d$ is correct, by Assumption 4, PROPAGATE_LOCAL_TXS of Algorithm A4 will be executed infinitely often.

• CASE I: Line A4:3 is executed infinitely often. By Assumption 1, $\mathit{knownVec}_d^m[d]$ grows without bound.

• CASE II: Line A4:5 is executed infinitely often. That is, it is infinitely often that
\[ \mathit{preparedCausal}_d^m \ne \emptyset. \]
By Assumption 4, causal transactions in $\mathit{preparedCausal}_d^m$ will eventually be committed and removed from $\mathit{preparedCausal}_d^m$ (line A3:17). Thus, it is infinitely often that new causal transactions are prepared and added into $\mathit{preparedCausal}_d^m$ (line A3:12) with larger and larger prepare timestamps (line A3:11). Therefore,
\[ \min\{ts \mid \langle \_, \_, ts \rangle \in \mathit{preparedCausal}_d^m\} \]
and
\[ \mathit{knownVec}_d^m[d] = \min\{ts \mid \langle \_, \_, ts \rangle \in \mathit{preparedCausal}_d^m\} - 1 \]
grow without bound.

Lemma 75. Let $p_d^m$ be a replica in a correct data center $d \in \mathcal{C}$. If for some $j \in \mathcal{D}$ and some value $x \in \mathbb{N}$
\[ \mathit{knownVec}_d^m[j] \ge x, \]
then eventually
\[ \forall c \in \mathcal{C}.\ \mathit{knownVec}_c^m[j] \ge x. \]

Proof. Since data center $d$ is correct, by Assumption 4, for each other data center $i \ne d$, replica $p_d^m$ will keep

• replicating to data center $i$ the write sets
\[ \langle \_, \mathit{wbuff}, \mathit{commitVec}, \_ \rangle \in \mathit{committedCausal}_d^m[j] \]
that have not been received by $i$ from the perspective of $d$ ($\mathit{commitVec}[d] \le \mathit{knownVec}_d^m[d]$ at line A4:6 and $\mathit{commitVec}[j] > \mathit{globalMatrix}_d^m[i][j]$ at line A4:23);
• or sending heartbeats with up-to-date $\mathit{knownVec}_d^m[j]$ to data center $i$ (lines A4:11 and A4:27).

By Assumption 2, $\mathit{knownVec}_c^m[j]$ at replica $p_c^m$ of each correct data center $c \in \mathcal{C}$ will eventually be updated (lines A4:18 and A4:21) such that
\[ \mathit{knownVec}_c^m[j] \ge \mathit{knownVec}_d^m[j] \ge x. \]

Lemma 76. Let $d \in \mathcal{C}$ be a correct data center. For any replica $p_c^m$ in any correct data center $c \in \mathcal{C}$, $\mathit{uniformVec}_c^m[d]$ grows without bound.

Proof. By Lemmas 74 and 75, for any replica $p_c^m$ in any correct data center $c \in \mathcal{C}$, $\mathit{knownVec}_c^m[d]$ grows without bound. By lines A5:2 and A5:8, for any replica $p_c^m$ in any correct data center $c \in \mathcal{C}$, $\mathit{stableVec}_c^m[d]$ grows without bound. By line A5:3, Assumptions 2 and 3, and lines A5:12–A5:15, for any replica $p_c^m$ in any correct data center $c \in \mathcal{C}$, $\mathit{uniformVec}_c^m[d]$ grows without bound.

Lemma 77 (PROPERTY 4). Let $p_d^m$ be any replica in any data center $d$. For any time $\tau$, there exists some time $\tau'$ such that
\[ \forall i \in \mathcal{D}.\ \forall c \in \mathcal{C}.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_c^n(\tau')[i] \ge \mathit{uniformVec}_d^m(\tau)[i]. \]

Proof. By Lemma 34, Assumption 3, and the fact that at most $f$ data centers may fail,
\[ \forall i \in \mathcal{D}.\ \exists c \in \mathcal{C}.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(\tau)[i] \le \mathit{knownVec}_c^n(\tau)[i]. \]
By Lemma 75, there exists some time $\tau''$ such that
\[ \forall i \in \mathcal{D}.\ \forall c \in \mathcal{C}.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_d^m(\tau)[i] \le \mathit{knownVec}_c^n(\tau'')[i]. \]
By Algorithm A5 and Assumption 3, there exists some time $\tau'$ such that
\[ \forall i \in \mathcal{D}.\ \forall c \in \mathcal{C}.\ \forall n \in \mathcal{P}.\ \mathit{uniformVec}_c^n(\tau')[i] \ge \mathit{uniformVec}_d^m(\tau)[i]. \]

Lemma 78. Let $d \in \mathcal{C}$ be a correct data center and $t \in T_{\mathit{causal}}$ be a causal transaction that originates at $d$. Then for any replica $p_c^m$ in any correct data center $c \in \mathcal{C}$, eventually
\[ \forall i \in \mathcal{D}.\ ts(t)[i] \le \mathit{uniformVec}_c^m[i]. \]

Proof. Since $d$ is correct, by Lemma 76, there exists some time $\tau'$ such that
\[ ts(t)[d] \le \mathit{uniformVec}_c^m(\tau')[d]. \]
On the other hand, by Definition 43 of timestamps and Lemma 31 (let $n \triangleq \mathit{coord}(t)$ and $cl \triangleq \mathit{client}(t)$),
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ ts(t)[i] = (\mathit{pastVec}_{cl})_{ct(t)}[i] \le (\mathit{uniformVec}_d^n)_{ct(t)}[i]. \]
By Lemma 77, there exists some time $\tau''$ such that
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ ts(t)[i] \le (\mathit{uniformVec}_c^m)(\tau'')[i]. \]
Let
\[ \tau \triangleq \max\{\tau', \tau''\}. \]
By Lemma 30,
\[ \forall i \in \mathcal{D}.\ ts(t)[i] \le \mathit{uniformVec}_c^m(\tau)[i]. \]

D.12.2 Uniformity of Causal Transactions Visible to CL_UNIFORM_BARRIER events

Lemma 79. Let $t \in T_{\mathit{causal}}$ be a causal transaction and $q \in Q$ be a CL_UNIFORM_BARRIER event. If $t \xrightarrow{\mathit{vis}} q$, then for any replica $p_c^m$ in any correct data center $c \in \mathcal{C}$, eventually
\[ \forall i \in \mathcal{D}.\ ts(t)[i] \le \mathit{uniformVec}_c^m[i]. \]

Proof. Since $t \xrightarrow{\mathit{vis}} q$, by Definition 57 of $\mathit{vis}$,
\[ ts(t) \le ts(q). \]
Suppose that $q$ is issued by client $cl$ to replica $p_d^n$ in data center $d$ and is returned at time $\tau_q$. By Definition 43 of timestamps and Lemma 31,
\[ ts(t)[d] \le ts(q)[d] \le (\mathit{uniformVec}_d^n)_q[d]. \]
By Lemma 77, there exists some time $\tau'$ such that
\[ ts(t)[d] \le (\mathit{uniformVec}_c^m)(\tau')[d]. \]
On the other hand, by Definition 43 of timestamps and Lemma 32,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ ts(t)[i] \le ts(q)[i] = (\mathit{pastVec}_{cl})_q[i] \le \mathit{uniformVec}_d^l(\tau_q)[i] \]
for some replica $p_d^l$ in data center $d$. By Lemma 77, there exists some time $\tau''$ such that
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ ts(t)[i] \le (\mathit{uniformVec}_c^m)(\tau'')[i]. \]
Let
\[ \tau \triangleq \max\{\tau', \tau''\}. \]
By Lemma 30,
\[ \forall i \in \mathcal{D}.\ ts(t)[i] \le \mathit{uniformVec}_c^m(\tau)[i]. \]
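The first step in the proof of Lemma 77 relies on a counting argument: a group of at least f + 1 data centers cannot consist only of failed ones when at most f fail, so the group given by Lemma 34 contains some correct data center. The brute-force check below is a minimal sketch of that argument; the function name and the `group_size` parameter are hypothetical and serve only to contrast groups of size f + 1 with groups of size f.

```python
from itertools import combinations


def every_group_has_correct_member(num_dcs, f, group_size):
    """Check that every group of `group_size` data centers contains a
    correct one, for every way in which exactly f data centers can fail.

    Exactly f failures is the worst case: with fewer failures the set of
    correct data centers only grows.
    """
    dcs = range(num_dcs)
    for failed in combinations(dcs, f):
        correct = set(dcs) - set(failed)
        for group in combinations(dcs, group_size):
            if not correct & set(group):
                return False
    return True
```

With `group_size = f + 1` the check succeeds for any number of data centers, whereas with `group_size = f` the failed data centers themselves form a group without a correct member.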

D.12.3 Uniformity of Strong Transactions

Lemma 80. For any replica $p_c^m$ in any correct data center $c \in \mathcal{C}$, $\mathit{knownVec}_c^m[\mathit{strong}]$ grows without bound.

Proof. By Assumption 4 and Theorem 9, replica $p_c^m$ will either deliver committed strong transactions infinitely often (DELIVER of Algorithm A6) or submit dummy strong transactions infinitely often (HEARTBEAT_STRONG of Algorithm A6). Thus, $\mathit{knownVec}_c^m[\mathit{strong}]$ grows without bound.

Lemma 81. Let $t \in T$ be a transaction. Then for any replica $p_c^m$ in any correct data center $c \in \mathcal{C}$, eventually
\[ ts(t)[\mathit{strong}] \le \mathit{stableVec}_c^m[\mathit{strong}]. \]

Proof. By Lemma 80 and lines A5:2 and A5:9, $\mathit{stableVec}_c^m[\mathit{strong}]$ grows without bound. Therefore, there exists some time $\tau$ such that
\[ ts(t)[\mathit{strong}] \le \mathit{stableVec}_c^m(\tau)[\mathit{strong}]. \]

Lemma 82. Let $t \in T_{\mathit{strong}}$ be a strong transaction. Then for any replica $p_c^m$ in any correct data center $c \in \mathcal{C}$, eventually
\[ \forall i \in \mathcal{D}.\ ts(t)[i] \le \mathit{uniformVec}_c^m[i]. \]

Proof. Let $d \triangleq dc(t)$, $n \triangleq \mathit{coord}(t)$, and $cl \triangleq \mathit{client}(t)$. By Definition 43 of timestamps,
\[ ts(t) = (\mathit{pastVec}_{cl})_{ct(t)}. \]
On the one hand, by Lemma 31,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ ts(t)[i] = (\mathit{pastVec}_{cl})_{ct(t)}[i] \le (\mathit{uniformVec}_d^n)_{ct(t)}[i]. \]
By Lemma 77, there exists some time $\tau'$ such that
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ ts(t)[i] \le (\mathit{uniformVec}_c^m)(\tau')[i]. \]
On the other hand, by Lemma 48,
\[ ts(t)[d] = \mathit{commitVec}(t)[d]. \]
By (5),
\[ \mathit{commitVec}(t)[d] = \mathit{snapshotVec}(t)[d]. \]
By lines A6:2 and A3:22,
\[ \mathit{snapshotVec}(t)[d] \le (\mathit{uniformVec}_d^n)_{ct(t)}[d]. \]
Putting it together yields
\[ ts(t)[d] \le (\mathit{uniformVec}_d^n)_{ct(t)}[d]. \]
By Lemma 77, there exists some time $\tau''$ such that
\[ ts(t)[d] \le (\mathit{uniformVec}_c^m)(\tau'')[d]. \]
Let
\[ \tau \triangleq \max\{\tau', \tau''\}. \]
By Lemma 30,
\[ \forall i \in \mathcal{D}.\ ts(t)[i] \le \mathit{uniformVec}_c^m(\tau)[i]. \]

D.13 Eventual Visibility

Theorem 83.
\[ A \models \text{EVENTUALVISIBILITY}. \]

Proof. Consider a transaction $t \in T$ such that
\[ dc(t) \in \mathcal{C} \vee (\exists q \in Q.\ t \xrightarrow{\mathit{vis}} q) \vee t \in T_{\mathit{strong}}. \]
By Lemma 59, it suffices to show that for any client $cl$,
\[ T|_{cl} = \infty \Rightarrow \exists t' \in T|_{cl}.\ t \xrightarrow{\mathit{vis}} t'. \]
By Lemmas 78, 79, 81, and 82, there exists some time $\tau$ such that
\[ \forall c \in \mathcal{C}.\ \forall n \in \mathcal{P}.\ \big(\forall i \in \mathcal{D}.\ ts(t)[i] \le \mathit{uniformVec}_c^n(\tau)[i]\big) \wedge ts(t)[\mathit{strong}] \le \mathit{stableVec}_c^n(\tau)[\mathit{strong}]. \tag{29} \]
Since $T|_{cl} = \infty$, there exists some correct data center $d \in \mathcal{C}$ to which $cl$ issues an infinite number of transactions. Let $t' \in T$ be the first transaction issued by client $cl$ to data center $d$ which starts after time $\tau$ such that
\[ \mathit{lclock}(t) < \mathit{lclock}(t'). \]
Thus, by Definition 54 of $\mathit{lc}$,
\[ t \xrightarrow{\mathit{lc}} t'. \]
Let $m \triangleq \mathit{coord}(t')$. Since $d$ is correct, by (29),
\[ \big(\forall i \in \mathcal{D}.\ ts(t)[i] \le \mathit{uniformVec}_d^m(\tau)[i]\big) \wedge ts(t)[\mathit{strong}] \le \mathit{stableVec}_d^m(\tau)[\mathit{strong}]. \]
By Lemma 30,
\[ \forall i \in \mathcal{D}.\ \mathit{uniformVec}_d^m(\tau)[i] \le (\mathit{uniformVec}_d^m)_{st(t')}[i]. \]
By Lemma 41,
\[ \mathit{stableVec}_d^m(\tau)[\mathit{strong}] \le (\mathit{stableVec}_d^m)_{st(t')}[\mathit{strong}]. \]

By Lemma 45,
\[ \big(\forall i \in \mathcal{D}.\ (\mathit{uniformVec}_d^m)_{st(t')}[i] \le ts(st(t'))[i]\big) \wedge (\mathit{stableVec}_d^m)_{st(t')}[\mathit{strong}] \le ts(st(t'))[\mathit{strong}]. \]
Putting it together yields
\[ ts(t) \le ts(st(t')). \]
By Definition 57 of $\mathit{vis}$,
\[ t \xrightarrow{\mathit{vis}} t'. \]

D.14 UNISTORE Correctness

Theorem 84.
\[ \text{UNISTORE} \models \text{POR}. \]

Proof. By Theorems 58, 61, 67, 73, and 83.

D.15 UNISTORE Liveness

Lemma 85. Each client migration CL_ATTACH to a correct data center will eventually terminate, provided that the client managed to complete its CL_UNIFORM_BARRIER call at its original data center just before CL_ATTACH.

Proof. Suppose that client $cl$ in its current data center $d \triangleq \mathit{cur\_dc}(cl)$ issues a CL_ATTACH event $a$ to replica $p_c^m$ in correct data center $c \in \mathcal{C}$.

Let $e \in C$ be the last commit event issued by client $cl$ before $a$. If $e$ does not exist, then
\[ \forall i \in \mathcal{D}.\ \mathit{pastVec}_{cl}[i] = 0. \]
Therefore, the wait condition
\[ \forall i \in \mathcal{D} \setminus \{c\}.\ \mathit{uniformVec}_c^m[i] \ge 0 \]
at line A3:24 at replica $p_c^m$ will eventually hold. Thus, the CL_ATTACH event $a$ eventually terminates.

Otherwise, suppose that $e$ is issued to replica $p_{d'}^{n'}$ in data center $d'$. By Lemma 31,
\[ \forall i \in \mathcal{D} \setminus \{d'\}.\ (\mathit{uniformVec}_{d'}^{n'})_e[i] \ge (\mathit{pastVec}_{cl})_e[i]. \tag{30} \]
Note that it is possible that $d' \ne d$, since there may exist other CL_ATTACH events between $e$ and $a$. Therefore, we distinguish between the following two cases:

• CASE I: $d' = d$. By (30),
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{uniformVec}_d^{n'})_e[i] \ge (\mathit{pastVec}_{cl})_e[i]. \tag{31} \]

• CASE II: $d' \ne d$. Consider the last CL_ATTACH event, denoted $a'$, before $a$. Suppose that $a'$ is issued to replica $p_d^n$ in data center $d$. By Lemma 31, when $a'$ terminates,
\[ \forall i \in \mathcal{D} \setminus \{d\}.\ (\mathit{uniformVec}_d^n)_{a'}[i] \ge (\mathit{pastVec}_{cl})_{a'}[i] = (\mathit{pastVec}_{cl})_e[i]. \tag{32} \]

Let $q \in Q$ be the CL_UNIFORM_BARRIER event issued by client $cl$ just before $a$. Suppose that $q$ is issued to replica $p_d^l$ in data center $d$. By Lemma 31,
\[ (\mathit{uniformVec}_d^l)_q[d] \ge (\mathit{pastVec}_{cl})_q[d] = (\mathit{pastVec}_{cl})_e[d]. \tag{33} \]
By (31), (32), (33), and Lemma 77, eventually for the correct data center $c$,
\[ \forall i \in \mathcal{D}.\ \mathit{uniformVec}_c^m[i] \ge (\mathit{pastVec}_{cl})_e[i]. \]
Therefore, the wait condition
\[ \forall i \in \mathcal{D} \setminus \{c\}.\ \mathit{uniformVec}_c^m[i] \ge (\mathit{pastVec}_{cl})_e[i] \]
at line A3:24 at replica $p_c^m$ will eventually hold. Thus, the CL_ATTACH event $a$ eventually terminates.

Theorem 86. Any client event issued at a correct data center will eventually terminate.

Proof. Consider any event $e$ issued by client $cl$ at a correct data center $c \in \mathcal{C}$. It suffices to show that each wait condition in the execution of $e$, if any, will eventually hold. In the following, we perform a case analysis according to the type of event $e$.

• CASE I: $e \in S$. The theorem holds trivially.
• CASE II: $e \in R$. If $e \in R_{\mathrm{INT}}$, the theorem holds trivially. Otherwise, $e \in R_{\mathrm{EXT}}$. By Lemmas 74 and 80, the wait condition at line A3:4 for $e$ will eventually hold.
• CASE III: $e \in U$. The theorem holds trivially.
• CASE IV: $e \in C_{\mathit{causal}}$. Since data center $c$ is correct, the wait condition at line A2:28 will eventually hold.
• CASE V: $e \in C_{\mathit{strong}}$. By Lemma 76, the wait condition at line A3:22 will eventually hold. Thus, line A6:2 will eventually terminate. Then, by Assumption 6, the procedure CERTIFY and thus the event $e$ will eventually terminate.
• CASE VI: $e \in$ CL_UNIFORM_BARRIER. By Lemma 76, the wait condition at line A3:22 will eventually hold.
• CASE VII: $e \in$ CL_ATTACH. The theorem holds due to CASE VI and Lemma 85.
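The two wait conditions that the liveness argument depends on can be written as simple predicates. The sketch below assumes vectors are dictionaries keyed by data center and takes the form of the conditions from the statements above (line A3:24 for CL_ATTACH as quoted in the proof of Lemma 85, and line A3:22 for CL_UNIFORM_BARRIER as used in CASE V of Lemma 31); the function names are hypothetical.

```python
def attach_may_proceed(uniform_vec, past_vec, data_centers, c):
    """Wait condition of CL_ATTACH at a replica in target data center c:
    every remote entry of the client's pastVec is already uniform at c."""
    return all(uniform_vec[i] >= past_vec[i] for i in data_centers if i != c)


def barrier_may_return(uniform_vec, past_vec, d):
    """Wait condition of CL_UNIFORM_BARRIER at a replica in data center d:
    the client's local entry of pastVec is uniform at d."""
    return uniform_vec[d] >= past_vec[d]
```

The attach condition skips the entry for the target data center `c`. This matches the wait condition in the proof of Lemma 85, which quantifies over $\mathcal{D} \setminus \{c\}$ only, even though the proof derives the bound for all of $\mathcal{D}$.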
