
Consolidating Concurrency Control and Consensus for Commits under Conflicts
Shuai Mu and Lamont Nelson, New York University; Wyatt Lloyd, University of Southern California; Jinyang Li, New York University
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/mu

This paper is included in the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16). November 2–4, 2016 • Savannah, GA, USA ISBN 978-1-931971-33-1

Open access to the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.

Consolidating Concurrency Control and Consensus for Commits under Conflicts

Shuai Mu*, Lamont Nelson*, Wyatt Lloyd†, and Jinyang Li*
*New York University, †University of Southern California

Abstract

Conventional fault-tolerant distributed transactions layer a traditional concurrency control protocol on top of the Paxos consensus protocol. This approach provides scalability, availability, and strong consistency. When used for wide-area storage, however, this approach incurs cross-data-center coordination twice, in serial: once for concurrency control, and then once for consensus. In this paper, we make the key observation that the coordination required for concurrency control and consensus is highly similar. Specifically, each tries to ensure the serialization graph of transactions is acyclic. We exploit this insight in the design of Janus, a unified concurrency control and consensus protocol. Janus targets one-shot transactions written as stored procedures, a common, but restricted, class of transactions. Like MDCC [16] and TAPIR [51], Janus can commit unconflicted transactions in this class in one round-trip. Unlike MDCC and TAPIR, Janus avoids aborts due to contention: it commits conflicted transactions in this class in at most two round-trips as long as the network is well behaved and a majority of each server replica is alive.

We compare Janus with layered designs and TAPIR under a variety of workloads in this class. Our evaluation shows that Janus achieves ∼5× the throughput of a layered system and 90% of the throughput of TAPIR under a contention-free microbenchmark. When the workloads become contended, Janus provides much lower latency and higher throughput (up to 6.8×) than the baselines.

1 Introduction

Scalable, available, and strongly consistent distributed transactions are typically achieved through layering a traditional concurrency control protocol on top of shards of a data store that are replicated by the Paxos [19, 33] consensus protocol. Sharding the data into many small subsets that can be stored and served by different servers provides scalability. Replicating each shard with consensus provides availability despite server failure. Coordinating transactions across the replicated shards with concurrency control provides strong consistency despite conflicting data accesses by different transactions. This layered approach is commonly used in practice. For example, Spanner [9] implements two-phase locking (2PL) and two-phase commit (2PC) in which both the data and locks are replicated by Paxos. As another example, Percolator [35] implements a variant of optimistic concurrency control (OCC) with 2PC on top of BigTable [7] which relies on primary-backup replication and Paxos.

When used for wide-area storage, the layering approach incurs cross-data-center coordination twice, in serial: once by the concurrency control protocol to ensure transaction consistency (strict serializability [34, 45]), and another time by the consensus protocol to ensure replica consistency (linearizability [14]). Such double coordination is not necessary and can be eliminated by consolidating concurrency control and consensus into a unified protocol. MDCC [16] and TAPIR [51] showed how unified approaches can provide the read-committed and strict serializability isolation levels, respectively. Both protocols optimistically attempt to commit and replicate a transaction in one wide-area roundtrip. If there are conflicts among concurrent transactions, however, they abort and retry, which can significantly degrade performance under contention. This concern is more pronounced for wide-area storage: as transactions take much longer to complete, the amount of contention rises accordingly.

This paper proposes the Janus protocol for building fault-tolerant distributed transactions. Like TAPIR and MDCC, Janus can commit and replicate transactions in one cross-data-center roundtrip when there is no contention. In the face of interference by concurrent conflicting transactions, Janus takes at most one additional cross-data-center roundtrip to commit.

The key insight of Janus is to realize that strict serializability for transaction consistency and linearizability for replication consistency can both be mapped to the same underlying abstraction. In particular, both require the execution history of transactions be equivalent to some linear total order. This equivalence can be checked by constructing a serialization graph based on the execution history. Each vertex in the graph corresponds to a transaction. Each edge represents a dependency between two transactions due to replication or conflicting data access. Specifically, if transactions T1 and T2 make conflicting access to some data item, then there exists an edge T1 → T2 if a shard responsible for the conflicted data item executes or replicates T1 before T2. An equivalent linear total order for both strict serializability and linearizability can be found if there are no cycles in this graph.

Janus ensures an acyclic serialization graph in two ways. First, Janus captures a preliminary graph by tracking the dependencies of conflicting transactions as they arrive at the servers that replicate each shard. The coordinator of a transaction T then collects and propagates T's dependencies from sufficiently large quorums of the servers that replicate each shard. The actual execution of T is deferred until the coordinator is certain that all of T's participating shards have obtained all dependencies for T. Second, although the dependency graph based on the arrival orders of transactions may contain cycles, T's participating shards re-order transactions in cycles in the same deterministic order to ensure an acyclic serialization graph. This provides strict serializability because all servers execute conflicting transactions in the same order.

In the absence of contention, all participating shards can obtain all dependencies for T in a single cross-data-center roundtrip. In the presence of contention, two types of conflicts can appear: one among different transactions and the other during replication of the same transaction. Conflicts between different transactions can still be handled in a single cross-data-center roundtrip. They manifest as cycles in the dependency graph and are handled by ensuring deterministic execution re-ordering. Conflicts that appear during replication of the same transaction, however, require a second cross-data-center roundtrip to reach consensus among different server replicas with regard to the set of dependencies for the transaction. Neither scenario requires Janus to abort a transaction. Thus, Janus always commits with at most two cross-data-center roundtrips in the face of contention as long as network and machine failures do not conspire to prevent a majority of each server replica from communicating.

Janus's basic protocol and its ability to always commit assume a common class of transactional workloads: one-shot transactions written as stored procedures. Stored procedures are frequently used [12, 13, 15, 22, 31, 40, 43, 46] and send transaction logic to servers for execution as pieces instead of having clients execute the logic while simply reading and writing data to servers. One-shot transactions preclude the execution output of one piece being used as input for a piece that executes on a different server. One-shot transactions are sufficient for many workloads, including the popular and canonical TPC-C benchmark. While the design of Janus can be extended to support general transactions, doing so comes at the cost of additional inter-server messages and the potential for aborts.

This approach of dependency tracking and execution re-ordering has been separately applied for concurrency control [31] and consensus [20, 30]. The insight of Janus is that both concurrency control and consensus can use a common dependency graph to achieve consolidation without aborts.

We have implemented Janus in a transactional key-value storage system. To make apples-to-apples comparisons with existing protocols, we also implemented 2PL+MultiPaxos, OCC+MultiPaxos and TAPIR [51] in the same framework. We ran experiments on Amazon EC2 across multiple availability regions using microbenchmarks and the popular TPC-C benchmark [1]. Our evaluation shows that Janus achieves ∼5× the throughput of layered systems and 90% of the throughput of TAPIR under a contention-free microbenchmark. When the workloads become contended due to skew or wide-area latency, Janus provides much lower latency and more throughput (up to 6.8×) than existing systems by eliminating aborts.

2 Overview

This section describes the system setup we target, reviews background and motivation, and then provides a high level overview of how Janus unifies concurrency control and consensus.

2.1 System Setup

We target the wide-area setup where data is sharded across servers and each data shard is replicated in several geographically separate data centers to ensure availability. We assume each data center maintains a full replica of data, which is a common setup [4, 26]. This assumption is not strictly needed by existing protocols [9, 51] or Janus. But, it simplifies the performance discussion in terms of the number of cross-data-center roundtrips needed to commit a transaction.

We adopt the execution model where transactions are represented as stored procedures, which is also commonly used [12, 13, 15, 22, 31, 40, 43, 46]. As data is sharded across machines, a transaction is made up of a series of stored procedures, referred to as pieces, each of which accesses data items belonging to a single shard. The execution of each piece is atomic with regard to other concurrent pieces through local locking. The stored-procedure execution model is crucial for Janus to achieve good performance under contention. As Janus must serialize the execution of conflicting pieces, stored procedures allow the execution to occur at the servers, which is much more efficient than if the execution was carried out by the clients over the network.
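To make the shared abstraction concrete, the following is a minimal C++ sketch, not taken from the Janus codebase, of a serialization graph in which vertices are transactions and directed edges are dependencies. Checking both consistency properties then reduces to checking that the graph is acyclic; the TxnId alias and the DFS-based cycle check are illustrative assumptions.

```cpp
#include <cstdint>
#include <map>
#include <set>

using TxnId = uint64_t;  // globally unique transaction identifier (assumed)

// A minimal serialization graph: vertex = transaction, edge = dependency
// created by conflicting data access or by replication order.
class SerializationGraph {
 public:
  // Record that 'before' was executed/replicated ahead of 'after' at some
  // shard, i.e., an edge before -> after.
  void AddDependency(TxnId before, TxnId after) {
    edges_[before].insert(after);
    edges_.try_emplace(after);  // make sure the vertex exists
  }

  // Both strict serializability and linearizability reduce to this check:
  // an equivalent linear total order exists iff the graph has no cycle.
  bool IsAcyclic() const {
    std::map<TxnId, int> color;  // 0 = unvisited, 1 = on stack, 2 = done
    for (const auto& [v, neighbors] : edges_) {
      (void)neighbors;
      if (color[v] == 0 && HasCycleFrom(v, color)) return false;
    }
    return true;
  }

 private:
  bool HasCycleFrom(TxnId v, std::map<TxnId, int>& color) const {
    color[v] = 1;
    auto it = edges_.find(v);
    if (it != edges_.end()) {
      for (TxnId w : it->second) {
        if (color[w] == 1) return true;  // back edge: a dependency cycle
        if (color[w] == 0 && HasCycleFrom(w, color)) return true;
      }
    }
    color[v] = 2;
    return false;
  }

  std::map<TxnId, std::set<TxnId>> edges_;
};
```

A layered design effectively enforces acyclicity twice, once per layer; the consolidation argument is that a single graph of this shape can carry both kinds of ordering constraints.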

System                   Commit latency in   How to resolve conflicts   How to resolve conflicts
                         wide-area RTTs      during replication         during execution/commit
2PL+MultiPaxos [9]       2∗                  Leader assigns ordering    Locking†
OCC+MultiPaxos [35]      2∗                  Leader assigns ordering    Abort and retry
Replicated Commit [28]   2                   Leader assigns ordering    Abort and retry
TAPIR [51]               1                   Abort and retry commit     Abort and retry
MDCC [16]                1                   Retry via leader           Abort
Janus                    1                   Reorder                    Reorder

Table 1: Wide-area commit latency and conflict resolution strategies for Janus and existing systems. MDCC provides read-committed isolation; all other protocols provide strict serializability.
∗ Additional roundtrips are required if the coordinator logs the commit status durably before returning to clients.
† 2PL also aborts and retries due to false positives in distributed deadlock detection.

2.2 Background and Motivation

The standard approach to building fault-tolerant distributed transactions is to layer a concurrency control protocol on top of a consensus protocol used for replication. The layered design addresses three paramount challenges facing distributed transactional storage: consistency, scalability and availability.

• Consistency refers to the ability to preclude “bad” interleaving of concurrent conflicting operations. There are two kinds of conflicts: 1) transactions containing multiple pieces performing non-serializable data access at different data shards. 2) transactions replicating their writes to the same data shard in different orders at different replica servers. Concurrency control handles the first kind of conflict to achieve strict serializability [45]; consensus handles the second kind to achieve linearizability [14].
• Scalability refers to the ability to improve performance by adding more machines. Concurrency control protocols naturally achieve scalability because they operate on individual data items that can be partitioned across machines. By contrast, consensus protocols do not address scalability as their state-machine approach duplicates all operations across replica servers.
• Availability refers to the ability to survive and recover from machine and network failures including crashes and arbitrary message delays. Solving the availability challenge is at the heart of consensus protocols while traditional concurrency control does not address it.

Because concurrency control and consensus each solves only two out of the above three challenges, it is natural to layer one on top of the other to cover all bases. For example, Spanner [9] layers 2PL/2PC over Paxos. Percolator [35] layers a variant of OCC over BigTable which relies on primary-backup replication and Paxos.

The layered design is conceptually simple, but does not provide the best performance. When one protocol is layered on top of the other, the coordination required for consistency occurs twice, serially: once at the concurrency control layer, and then once again at the consensus layer. This results in extra latency that is especially costly for wide-area storage. This observation is not new and has been made by TAPIR and MDCC [16, 51], which point out that layering suffers from over-coordination.

A unification of concurrency control and consensus can reduce the overhead of coordination. Such unification is possible because the consistency models for concurrency control (strict serializability [34, 45]) and consensus (linearizability [14]) share great similarity. Specifically, both try to satisfy the ordering constraints among conflicting operations to ensure equivalence to a linear total ordering. Table 1 classifies the techniques used by existing systems to handle conflicts that occur during transaction execution/commit and during replication. As shown, optimistic schemes rely on abort and retry, while conservative schemes aim to avoid costly aborts. Among the conservative schemes, consensus protocols rely on the (Paxos) leader to assign the ordering of replication operations while transaction protocols use per-item locking.

In order to unify concurrency control and consensus, one must use a common coordination scheme to handle both types of conflicts in the same way. This design requirement precludes leader-based coordination since distributed transactions that rely on a single leader to assign ordering [41] do not scale well. MDCC [16] and TAPIR [51] achieve unification by relying on aborts and retries for both concurrency control and consensus. Such an optimistic approach is best suited for workloads with low contention. The goal of our work is to develop a unified approach that can work well under contention by avoiding aborts.
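As a concrete illustration of the "Reorder" entries in Table 1, the sketch below (with made-up types, not the paper's implementation) shows the tie-breaking rule: when transactions conflict in a cycle, every server executes them in the same deterministic order, for example sorted by their globally unique ids, so no replica has to abort.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical transaction record; only the id matters for ordering.
struct Txn {
  uint64_t id;          // globally unique identifier (assumption)
  std::string payload;  // stored-procedure pieces, elided in this sketch
};

// Given transactions that form a dependency cycle, return them in the order
// every replica should execute them. Any total order works as long as it is
// the same everywhere; sorting by id is the simplest choice.
std::vector<Txn> DeterministicOrder(std::vector<Txn> cycle) {
  std::sort(cycle.begin(), cycle.end(),
            [](const Txn& a, const Txn& b) { return a.id < b.id; });
  return cycle;
}
```

Because the tie-break depends only on transaction ids, replicas that agree on the same set of dependencies always arrive at the same execution order.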

[Figure 1 diagram. Both panels show servers Sx1, Sx2 (replicating item X) and Sy1, Sy2 (replicating item Y) in Data Center-1 and Data Center-2, plus a coordinator for T1. (a) OCC over Multi-Paxos: Execute, 2PC Prepare (replication via Paxos accept), and 2PC Commit or Abort; any interleaving during Execute or Prepare causes an abort. (b) Janus: Pre-accept (pieces are dispatched, but execution is deferred), Accept (skipped if no contention during replication), and Commit (reorder and execute).]

Figure 1: Workflows for committing a transaction that increments x and y that are stored on different shards. The execute and 2PC-prepare messages in OCC over Multi-Paxos can be combined for stored procedures.
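The transaction in Figure 1 is a one-shot transaction made up of two pieces, one per shard. The sketch below models that shape in C++; the types and names are illustrative assumptions rather than the Janus framework's API.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// A one-shot transaction is a set of pieces, each a stored procedure that
// touches a single shard and does not consume another piece's output.
struct Piece {
  std::string shard;  // shard holding the item this piece accesses
  std::function<void(std::map<std::string, int64_t>&)> body;
};

struct OneShotTxn {
  uint64_t id;
  std::vector<Piece> pieces;
};

// The Figure 1 example: increment item-x and item-y on different shards.
OneShotTxn MakeIncrementTxn(uint64_t id) {
  OneShotTxn t{id, {}};
  t.pieces.push_back({"shard-x", [](std::map<std::string, int64_t>& kv) {
                        kv["x"] += 1;  // piece 1: increment item-x
                      }});
  t.pieces.push_back({"shard-y", [](std::map<std::string, int64_t>& kv) {
                        kv["y"] += 1;  // piece 2: increment item-y
                      }});
  return t;
}
```

Because neither piece needs the other's output, a coordinator can dispatch both in parallel and defer their execution, which is what the pre-accept phase in Figure 1b relies on.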

2.3 A Unified Approach to Concurrency on item-x while server Sy receives the writes on item- Control and Consensus y in the opposite order. In order to ensure a cycle-free serialization graph and abort-free execution, Janus de- Can we design unify concurrency control and consen- terministically re-orders transactions involved in a cycle sus with a common coordination strategy that minimizes before executing them. costly aborts? A key observation is that the ordering To understand the advantage and challenges of unifi- constraints desired by concurrency control and replica- cation, we contrast the workflow of Janus with that of tion can both be captured by the same serialization graph OCC+MultiPaxos using a concrete example. The exam- abstraction. In the serialization graph, vertexes represent ple transaction, T : x++; y++;, consists of two pieces, each transactions and directed edges represent the dependen- 1 is a stored procedure incrementing item-x or item-y. cies among transactions due to the two types of conflicts. Both strict serializability and linearizability can be re- Figure 1a shows how OCC+MultiPaxos executes and duced to checking that this graph is free of cycles [20, 45]. commits T1. First, the coordinator of T1 dispatches the two pieces to item-x and item-y’s respective Paxos lead- Sx: T1: Write(X), T2:Write(X) Sx: T1: Write(X), T2: Write(X) ers, which happen to reside in different data centers. The leaders execute the pieces and buffer the writes locally. To commit T1, the coordinator sends 2PC-prepare re- quests to Paxos leaders, which perform OCC validation T1 T2 T1 T2 and replicate the new values of x and y to others. Be-

S'x: T2: Write(X), T1: Write (X) Sy: T2: Write(Y), T1: Write(Y) cause the pieces are stored procedures, we can combine the dispatch and 2PC-prepare messages so that the coordi- (a) A cycle representing (b) A cycle representing T replication conflicts. transaction conflicts. nator can execute and commit 1 in two cross-data-center roundtrips in the absence of conflicts. Suppose a concur- Figure 2: Replication and transaction conflicts can be rent transaction T2 also tries to increment the same two represented as cycles in a common serialization graph. counters. The dispatch messages (or 2PC-prepares) of T1 and T2 might be processed in different orders by dif- Janus attempts to explicitly capture a preliminary se- ferent Paxos leaders, causing T1 and/or T2 to be aborted rialization graph by tracking the dependencies of con- and retried. On the other hand, because Paxos leaders flicting operations without immediately executing them. impose an ordering on replication, all replicas for item-x Potential replication or transaction conflicts both man- (or item-y) process the writes of T1 or T2 consistently, but ifest as cycles in the preliminary graph. The cycle in at the cost of additional wide-area roundtrip. Figure 2a is due to conflicts during replication: T1’s write Figure 1b shows the workflow of Janus. To commit and of data item X arrives before T2’s write on X at server execute T1, the coordinator moves through three phases: Sx, resulting in T1 → T2. The opposite arrival order hap- PreAccept, Accept and Commit, of which the Accept phase 0 pens at a different server replica Sx, resulting in T1 ← T2. may be skipped. In PreAccept, the coordinator dispatches The cycle in Figure 2b is due to transaction conflicts: the T1’s pieces to their corresponding replica servers. Unlike server Sx receives T1’s write on item-x before T2’s write OCC+MultiPaxos, the server does not execute the pieces

520 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association immediately but simply tracks their arrival orders in its Algorithm 1: Coordinator::DoTxn(T=[α1, ..., αN ]) local dependency graph. Servers reply to the coordina- 1 T’s metadata is a globally unique identifier, a tor with T ’s dependency information. The coordinator 1 partition list, and an abandon flag. can skip the Accept phase if enough replica servers reply 2 PreAccept Phase: with identical dependencies, which happens when there is no contention among server replicas. Otherwise, the 3 send PreAccept(T, ballot) to all participating coordinator engages in another round of communication servers, ballot default as 0 4 ∀ ∈ with servers to ensure that they obtain identical depen- if piece αi α1...αN , αi has a fast quorum of PreAcceptOKs with the identical dependency depi dencies for T1. In the Commit phase, each server executes transactions according to their dependencies. If there is then 5 dep = Union(dep1, dep2, ..., depN ) a dependency cycle involving T1, the server adheres to a deterministic order in executing transactions in the cycle. 6 goto commit phase As soon as the coordinator hears back from the nearest 7 else if ∀ αi ∈ α1...αN , αi has a majority quorum of server replica which typically resides in the local data PreAcceptOKs then center, it can return results to the user. 8 foreach αi ∈ α1...αN do As shown in Figure 1b, Janus attempts to coordi- 9 depi ←Union dependencies returned by the nate both transaction commit and replication in one shot majority quorum for piece αi. PreAccept with the phase. In the absence of contention, 10 dep = Union(dep1, dep2, ..., depN ) Janus completes a transaction in one cross-data-center 11 goto accept phase roundtrip, which is incurred by the PreAccept phase. (The 12 else commit phase incurs only local data-center latency.) In 13 return FailureRecovery(T) the presence of contention, Janus incurs two wide-area roundtrips from the PreAccept and Accept phases. These 14 Accept Phase: two wide-area roundtrips are sufficient for Janus to be 15 foreach αi in α1...αN in parallel do able to resolve replication and transaction commit con- 16 send Accept(T, dep, ballot) to αi’s participating flicts. The resolution is achieved by tracking the depen- servers. dencies of transactions and then re-ordering the execution 17 if ∀ αi has a majority quorum of Accept-OKs then of pieces to avoid inconsistent interleavings. For exam- 18 goto commit phase ple in Figure 1b, due to contention between T and T , 1 2 19 else the coordinator propagates T ↔ T to all server replicas 1 2 20 return FailureRecovery(T) during commit. All servers detect the cycle T1 ↔ T2 and 21 Commit Phase: deterministically choose to execute T1 before T2. 22 send Commit(T, dep) to all participating servers 23 return to client after receiving execution results. 3 Design

This section describes the design for the basic Janus pro- ordinator is responsible for communicating with servers tocol. Janus is designed to handle a common class of storing the desired data shard to commit the transaction transactions with two constraints: 1) the transactions do and then proxy the result to the client. Coordinators have not have any user-level aborts. 2) the transactions do not no global nor persistent state and they do not communi- contain any piece whose input is dependent on the exe- cate with each other. In our evaluation setup, coordinators cution of another piece, i.e., they are one-shot transac- are co-located with clients. tions [15]. Both the example in Section2 and the popular We require that the shards information of a transaction TPC-C workload belong to this class of transactions. We T is known before running. Suppose a transaction T in- discuss how to extend Janus to handle general transac- volves n pieces, α1,...,αn. Let r be the number of servers tions and the limitations of the extension in Section4. replicating each shard. Thus, there are r servers process- Additional optimizations and a full TLA+ specification ing each piece αi. We refer to them as αi’s participating are included in a technical report [32]. servers. We refer to the set of r ∗ n servers for all pieces as the transaction T’s participating servers, or the servers 3.1 Basic Protocol that T involves. There is a fast path and standard path of execution in The Janus system has three roles: clients, coordinators, Janus. When there is no contention, the coordinator takes and servers. A client sends a transaction request as a set the fast path which consists of two phases: pre-accept and of stored procedures to a nearest coordinator. The co- commit. When there is contention, the coordinator takes

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 521 Algorithm 2: Server S::PreAccept(T, ballot) Algorithm 4: Server S::Commit(T, dep)

24 if highest_ballotS[T] < ballot then 39 GS[T].dep ←dep 25 return PreAccept-NotOK 40 GS[T].status ←committing 41 Wait & Inquire Phase: 26 highest_ballotS[T] ←ballot 27 //add T to GS if not already 42 repeat 0 0 28 AddNewTx(GS, T) 43 choose T {T: GS[T ].status < committing 0 29 GS[T].status ←pre-accepted#ballot 44 if T does not involve S then 0 0 30 dep ←GS[T].dep 45 send Inquire(T ) to a server that T involves 31 0 return PreAccept-OK, dep 46 wait until GS[T ].status ≥ committing 0 0 47 until ∀T {T in GS : GS[T ].status ≥ committing 48 Execute Phase: Algorithm 3: Server S::Accept(T, dep, ballot) 49 repeat 0 0 32 if GS[T].status ≥ committing or 50 choose T ∈ GS: ReadyToProcess(T ){ 0 33 highest_ballotS[T] > ballot then 51 scc←StronglyConnectedComponent(GS, T ) 00 34 return Accept-NotOK, highest_ballotS[T] 52 for each T in DeterministicSort(scc) do 00 00 0 53 if T S and not T .abandon then 35 highest_ballotS[T ] ←ballot involves 0 00 00 36 GS[T ].dep ←dep 54 T .result ←execute T 0 37 G T .status ←accepted ballot 00 S[ ] # 55 processedS[T ] ←true 38 return Accept-OK 56 until processedS[T] is true 57 reply CommitOK, T.abandon, T.result
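The execute step in Algorithm 4 boils down to: wait until every dependency is at least committing, find the strongly connected component containing the transaction, and run the component's members in one deterministic order after the components they depend on. The C++ sketch below derives such an order with an ordinary Tarjan SCC pass; the types are assumptions and this is an illustration, not the code used in the paper.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using TxnId = uint64_t;
// deps[t] = transactions that must be ordered before t (its dependencies).
using DepGraph = std::map<TxnId, std::set<TxnId>>;

class ExecutionPlanner {
 public:
  explicit ExecutionPlanner(const DepGraph& g) : g_(g) {}

  // Returns transactions in an order every replica can follow: dependency
  // components first, and a deterministic id order inside each component.
  std::vector<TxnId> Order() {
    for (const auto& [v, deps] : g_) {
      (void)deps;
      if (!index_.count(v)) Visit(v);
    }
    std::vector<TxnId> order;
    for (auto& scc : sccs_) {             // Tarjan emits dependency SCCs first
      std::sort(scc.begin(), scc.end());  // deterministic order inside a cycle
      order.insert(order.end(), scc.begin(), scc.end());
    }
    return order;
  }

 private:
  void Visit(TxnId v) {
    index_[v] = low_[v] = next_index_++;
    stack_.push_back(v);
    on_stack_.insert(v);
    auto it = g_.find(v);
    if (it != g_.end()) {
      for (TxnId w : it->second) {  // edges point at dependencies
        if (!index_.count(w)) {
          Visit(w);
          low_[v] = std::min(low_[v], low_[w]);
        } else if (on_stack_.count(w)) {
          low_[v] = std::min(low_[v], index_[w]);
        }
      }
    }
    if (low_[v] == index_[v]) {  // v is the root of a strongly connected component
      std::vector<TxnId> scc;
      TxnId w;
      do {
        w = stack_.back();
        stack_.pop_back();
        on_stack_.erase(w);
        scc.push_back(w);
      } while (w != v);
      sccs_.push_back(scc);
    }
  }

  const DepGraph& g_;
  std::map<TxnId, int> index_, low_;
  std::set<TxnId> on_stack_;
  std::vector<TxnId> stack_;
  std::vector<std::vector<TxnId>> sccs_;
  int next_index_ = 0;
};
```

Sorting within a component is the deterministic re-ordering described in Section 2.3; emitting components in Tarjan's completion order ensures a transaction never runs before the transactions it depends on outside its own cycle.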

the standard path of three phases: pre-accept, accept and commit, as shown in Figure 1b. Algorithm 5: Server S::ReadyToProcess(T)

Dependency graphs. At a high level, the goal of pre- 58 if processedS[T] or GS[T].status < committing then accept and accept phase is to propagate the necessary 59 return false dependency information among participating servers. To 60 scc ← StronglyConnectedComponent(GS, T) track dependencies, each server S maintains a local de- 0 0 61 for each T < scc and T {T do pendency graph, GS. This is a directed graph where each 0 62 if processedS[T ] , true then vertex represents a transaction and each edge represents 63 return false the chronological order between two conflicting transac- tions. The incoming edges for a vertex is stored in its 64 return true deps value. Additionally, each vertex contains the status of the transaction, which captures the stage of the pro- tocol the transaction is currently in. The basic protocol participating servers. As shown in Algorithm2, upon requires three vertex status types with the ranked order receiving the pre-accept for T, server S inserts a new pre-accepted < accepted < committing . Servers and vertex for T if one does not already exist in the local coordinators exchange and merge these graphs to ensure dependency graph Gs. The newly inserted vertex for T that the required dependencies reach the relevant servers has the status pre-accepted. An edge T 0 → T is inserted before they execute a transaction. if an existing vertex T 0 in the graph and T both intend to In Janus, the key invariant for correctness is that all make conflicting access to a common data item at server participating servers must obtain the exact same set of S. The server then replies PreAccept-OK with T’s direct dependencies for a transaction T before executing T. This dependencies GS[T].dep in the graph. The coordinator is challenging due to the interference from other concur- waits to collect a sufficient number of pre-accept replies rent conflicting transactions and from the failure recovery from each involved shard. There are several scenarios mechanism. (Algorithm1, lines 4-11). In order to take the fast path, Next, we describe the fast path of the protocol, which is the coordinator must receive a fast quorum of replies taken by the coordinator in the absence of interference by containing the same dependency list for each shard. A concurrent transactions. Psuedocode for the coordinator fast quorum F in Janus contains all r replica servers. is shown in Algorithm1. The PreAccept phase. The coordinator sends PreAc- The fast quorum concept is due to Lamport who origi- cept messages both to replicate transaction T and nally applied it in the FastPaxos consensus protocol [21]. to establish a preliminary ordering for T among its In consensus, fast quorum lets one skip the Paxos leader

522 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association and directly send a proposed value to replica servers. The Accept phase. If some shard does not return a fast In Fast Paxos, a fast quorum must contain at least three quorum of identical dependency list during pre-accept, quarter replicas. By contrast, when applying the idea to then consensus on the complete set of dependencies for unify consensus and concurrency control, the size of the T has not yet been reached. In particular, additional fast quorum is increased to include all replica servers. dependencies for T may be inserted later, or existing de- Futhermore, the fast path of Janus has the additional re- pendencies may be replaced (due to the failure recovery quirement that the dependency list returned by the fast mechanism, Section 3.3). quorum must be the same. This ensures that the ancestor The accept phase tries to reach consensus via the ballot graph of transaction T on the standard path and during accept/reject mechanism similar to the one used by Paxos. failure recovery always agrees with what is determined Because a higher ballot supersedes a lower one, the ac- on the fast path at the time of T’s execution. cepted status of a transaction includes a ballot number such that accepted#b < accepted#b0, if ballot b

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 523 3.3 Handling Coordinator Failure Algorithm 7: Server S::Prepare(T, ballot)

84 dep ←GS[T].dep 85 if GS[T].status ≥ commiting then Algorithm 6: Coordinator C::FailureRecovery(T) 86 return Tx-Done, dep 65 Prepare Phase: 87 else if highest_ballotS[T] > ballot then 66 ballot ← highest ballot number seen + 1 88 return Prepare-NotOK, highest_ballotS[T] 67 send Prepare(T, ballot) to T’s participating servers 89 highest_ballot [T] ←ballot 68 if ∃ Tx-Done,dep among replies then S 90 reply T, Prepare-OK, G [T].ballot, dep 69 //T’s direct dependencies dep has been S determined 70 goto commit phase

71 else if ∃ αi without a majority quorum of included in the prepare message. If so, it rejects. Other- Prepare-OKs then wise, it replies Prepare-OK with T’s direct dependencies. 72 goto prepare phase If the coordinator fails to receive a majority quorum of Prepare-OKs for some shard, it retries the prepare phase 73 let R be the set of replies w/ the highest ballot 0 0 with a higher ballot after a back-off. 74 if ∃dep ∈ R: dep [T].status = accepted then 0 75 goto accept phase w/ dep = dep The coordinator continues if it receives a majority quorum (M) of Prepare-OKs for each shard. Next, it 76 else if R contains at least F ∩ M identical must distinguish against several cases in order to decide dependencies dep for each shard then i whether to proceed from the pre-accept or the accept 77 dep = Union(dep , dep , .. dep ) 1 2 N phase (lines 73-83). We point out a specific case in 78 goto accept phase w/ dep which the recovery coordinator has received a majority 0 0 79 else if ∃dep ∈ R : dep [T].status = pre-accepted Prepare-OKs with the same dependencies for each shard then and the status of T on the servers is pre-accepted. In 80 goto preaccept phase, avoid fast path this case, tranaction T could have succeeded on the fast 81 else path. Thus, the recovery coordinator merge the result- 82 T.abandon = true ing dependencies and proceed to the accept phase. This 83 goto accept phase w/ dep = nil case also illustrates why Janus must use a fast quorum F containing all server replicas. For any two conflict- ing transactions that both require recovery, a recovery The coordinator may crash at anytime during protocol coordinator is guaranteed to observe their dependency if execution. Consequently, a participating server for trans- (F ∩M)∩(F ∩M) , ∅, where M is a majority quorum. action T will notice that T has failed to progress to the The pre-accept and accept phase used for recovery is committing status for a threshold amount of time. This the same as in the normal execution (Algorithm1) except triggers the failure recovery process where some server that both phases use the ballot number chosen in the takes up the role of the coordinator to try to commit T. prepare phase. If the coordinator receives a majority The recovery coordinator progresses through: prepare, quorum of Accept-OK replies, it proceeds to the commit accept and commit phases, as shown in Algorithm6. Un- phase. Otherwise, it restarts in the prepare phase using a like in the normal execution (Algorithm1), the recovery higher ballot number. pre-accept prepare ac- coordinator may repeat the , and The recovery coordinator attempts to commit the trans- cept phases many times using progressively higher ballot action if possible, but may abandon it if the coordinator numbers, to cope with contention that arises when multi- cannot recover the inputs for the transaction. This is pos- ple servers try to recover the same transaction or when the sible when some servers pass on dependencies for one original coordinator has been falsely suspected of failure. transaction to another and then fail simultaneously with prepare In the phase, the recovery coordinator picks the first transaction’s coordinator. A recovery coordina- b > T a unique ballot number 0 and sends it to all of ’s tor abandons a transaction T using the normal protocol participating servers.‡ Algorithm7 shows how servers except it sets the abandon flag for T during the accept prepare T handle s. If the status of in the local graph is phase and the commit phase. 
Abandoning a transaction committing Tx-Done or beyond, the server replies with this way ensures that servers reach consensus on aban- T ’s dependencies. Otherwise, the server checks whether doning T even if the the original coordinator is falsely T it has seen a ballot number for that is higher than that suspected of failure. During transaction execution, if a ‡Unique server ids can be appended as low order bits to ensure server encounters a transaction T with the abandon flag unique ballot numbers. set, it simply skips the execution of T.
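The server-side half of recovery is the Prepare handler (Algorithm 7): report the decided dependencies if the transaction is already committing, reject lower ballots, and otherwise promise the new ballot. The sketch below shows that shape in C++; all types, and the encoding of ballots, are assumptions rather than the paper's code.

```cpp
#include <cstdint>
#include <map>
#include <set>

using TxnId = uint64_t;
using Ballot = uint64_t;  // unique ballots, e.g. counter << 16 | server_id (assumed)

enum class Status { kUnknown, kPreAccepted, kAccepted, kCommitting };

struct TxnState {
  Status status = Status::kUnknown;
  Ballot highest_ballot = 0;
  std::set<TxnId> dep;  // direct dependencies recorded so far
};

struct PrepareReply {
  enum { kTxDone, kPrepareOk, kPrepareNotOk } kind;
  Ballot highest_ballot;
  std::set<TxnId> dep;
};

PrepareReply HandlePrepare(std::map<TxnId, TxnState>& local, TxnId t,
                           Ballot ballot) {
  TxnState& s = local[t];
  if (s.status == Status::kCommitting) {
    // Dependencies are already decided; the recovery coordinator can jump
    // straight to the commit phase.
    return {PrepareReply::kTxDone, s.highest_ballot, s.dep};
  }
  if (s.highest_ballot > ballot) {
    // A competing recovery attempt with a higher ballot supersedes this one.
    return {PrepareReply::kPrepareNotOk, s.highest_ballot, {}};
  }
  s.highest_ballot = ballot;  // promise: ignore lower ballots from now on
  return {PrepareReply::kPrepareOk, ballot, s.dep};
}
```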

524 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association 4 General Transactions volve key-dependencies on frequently updated data and thus would incur few user-level aborts due to conflicts. This section discusses how to extend Janus to handle dependent pieces and limited forms of user aborts. Al- though Janus avoids aborts completely for one-shot trans- 5 Implementation actions, our proposed extensions incur aborts when trans- actions conflict. Let a transaction’s keys be the set of data items that the In order to evaluate Janus together with existing systems transaction needs to read or write. We classify general and enable an apples-to-apples comparison, we built a transactions into two categories: 1) transactions whose modular framework that facilitates the construc- keys are known prior to execution, but the inputs of some tion and debugging of a variety of concurrency control pieces are dependent on the execution of other pieces. 2) and consensus protocols efficiently. transactions whose keys are not known beforehand, but Our software framework consists of 36,161 lines are dependent on the execution of certain pieces. of C++ code, excluding comments and blank lines. The framework includes a custom RPC library, an in- Transactions with pre-determined keys. For any memory , as well as test suits and several bench- transaction T in this category, T’s set of participating marks. The RPC library uses asynchronous socket I/O servers are known apriori. This allows the coordinator (epoll/kqueue). It can passively batch RPC mes- to go through the pre-accept, accept phases to fix T’s sages to the same machine by reading or writing multiple position in the serialization graph while deferring piece RPC messages with a single system call whenever possi- execution. Suppose servers Si, Sj are responsible for the ble. The framework also provides common library func- data shards accessed by piece αi and αj (i < j) respec- tions shared by most concurrency control and consen- tively and αj ’s inputs depend on the outputs of αi. In sus protocols, such as a multi-phase coordinator, quorum this case, we extend the basic protocol to allow servers to computation, logical timestamps, epochs, database locks, communicate with each other during transaction execu- etc. By providing this common functionality, the library tion. Specifically, once server Sj is ready to execute αj , simplifies the task of implementing a new concurrency it sends a message to server Si to wait for the execution control or consensus protocol. For example, a TAPIR of αi and fetch the corresponding outputs. We can use implementation required only 1,209 lines of new code, the same technique to handle transactions with a certain and the Janus implementation required only 2,433 lines type of user-aborts. Specifically, the piece αi containing of new code. In addition to TAPIR and Janus we also user-aborts is not dependent on the execution of any other implemented 2PL+MultiPaxos and OCC+MultiPaxos. α piece that performs writes. To execute any piece j of Our OCC implementation is the standard OCC with S S the transaction, server j communicates with server i to 2PC. It does not include the optimization to combine the α α find out i’s execution status. If i is aborted, then server execution phase with the 2PC-prepare phase. Our 2PL S α j skips the execution of j . is different from conventional implementations in that it Transactions with dependent keys. 
As an exam- dispatches pieces in parallel in order to minimize execu- ple transaction of this type, consider transaction T(x): tion latency in the wide area. This increases the chance y←Read(x); Write(y, ...) The key to be accessed by T’s sec- for deadlocks significantly. We use the wound-wait proto- ond piece is dependent on the value read by the first piece. col [37] (also used by Spanner [9]) to prevent deadlocks. Thus, the coordinator cannot go through the usual phases With a contended workload and parallel dispatch, the to fix T’s position in the serialization graph without exe- wound-wait mechanism results in many false positives cution. We support such a transaction by requiring users for deadlocks. In both OCC and 2PL, the coordinator to transform it to transactions in the first category using does not make a unilateral decision to abort transactions. a known technique [41]. In particular, any transaction The MultiPaxos implementation inherits the common op- with dependent keys can be re-written into a combina- timization of batching many messages to and from leaders tion of several read-only transactions that determine the from the passive batching mechanism in the RPC library. unknown keys and a conditional-write transaction that Apart from the design described in Section3, our performs writes if the read values match the input keys. implementation of Janus also includes the garbage col- For example, the previous transaction T can be trans- lection mechanism to truncate the dependency graph as formed into two transactions, T1(x): y← Read(x); return transactions finish. We have not implemented Janus’s y; followed by T2(x,y): y0← Read(x); if y0,y {abort;} else coordinator failure recovery mechanism nor the exten- {Write(y, ...);} This technique is optimistic in that it turns sions to handle dependent pieces. Our implementations concurrency conflicts into user-level aborts. However, as of 2PL/OCC+MultiPaxos and TAPIR also do not include observed in [41], real-life OLTP workloads seldom in- failure recovery.
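The rewrite for dependent keys can be pictured locally as follows. This is a single-process sketch of the shape of the transformation only; in Janus each phase would itself run as a one-shot distributed transaction, and the names are illustrative.

```cpp
#include <map>
#include <string>

// T(x): y <- Read(x); Write(y, v) becomes a read-only transaction that learns
// y, followed by a conditional-write transaction that aborts at the user level
// if x no longer maps to the same y.
using KeyValueStore = std::map<std::string, std::string>;

// T1(x): read-only, returns y = Read(x).
std::string ReadOnlyPhase(const KeyValueStore& kv, const std::string& x) {
  auto it = kv.find(x);
  return it == kv.end() ? std::string() : it->second;
}

// T2(x, y): re-read x; if it still equals y, perform the write, otherwise
// signal a user-level abort so the caller re-runs the read-only phase.
bool ConditionalWritePhase(KeyValueStore& kv, const std::string& x,
                           const std::string& y, const std::string& value) {
  auto it = kv.find(x);
  const std::string current = (it == kv.end()) ? std::string() : it->second;
  if (current != y) return false;  // conflict: retry from ReadOnlyPhase
  kv[y] = value;  // now a write whose key was known before execution
  return true;
}
```

As the text notes, this trades concurrency conflicts for user-level aborts, which is acceptable when the read keys are rarely overwritten between the two phases.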

         Oregon   Ireland   Seoul
Oregon   0.9      140       122
Ireland            0.7      243
Seoul                        1.6

Table 2: Ping latency between EC2 datacenters (ms).

[Figure 3 plots: cluster throughput (10^3 txn/s) and commit rate for 2PL, OCC, Janus, and Tapir as the Zipf coefficient varies from 0.5 to 1.0.]

6 Evaluation

Our evaluation aims to understand the strength and limi- (a) Cluster Throughput 2PL OCC tations of the consolidated approach of Janus. How does Janus Tapir it compare against conventional layered systems? How 1 does it compare with TAPIR’s unified approach, which 0.8 aborts under contention? The highlights include: 0.6 • In a single data center with no contention, Janus 0.4 achieves 5× the throughput of 2PL/OCC+MultiPaxos, Commit Rate 0.2 and 90% of TAPIR’s throughput in microbenchmarks. 0 0.5 0.6 0.7 0.8 0.9 1 • All systems’ throughput decrease as contention rises. Zipf Coefficient However, as Janus avoids aborts, its performance un- der moderate or high contention is better than existing (b) Commit Rates systems. Figure 3: Single datacenter performance with in- • Wide-area latency leads to higher contention. Thus, creasing contention as controlled by the Zipf coeffi- the relative performance difference between Janus and cient. the other systems is higher in multi-data-center exper- iments than in single-data-center ones. and Tapir are spread evenly across the three datacenters. 6.1 Experimental Setup We use enough EC2 machines such that clients are not bottlenecked. Testbed. We run all experiments on Amazon EC2 using m4.large instance types. Each node has 2 virtual CPU Other measurement details. Each of our experiment cores, 8GB RAM. For geo-replicated experiments, we lasts 30 seconds, with the first 7.5 and last 7.5 seconds use 3 EC2 availability regions, us-west-2 (Oregon), ap- excluded from results to avoid start-up, cool-down, and northeast-2 (Seoul) and eu-west-1 (Ireland). The ping potential client synchronization artifacts. For each set latencies among these data centers are shown in Table2. of experiments, we report the throughput, 90th percentile latency, and commit rate. The system throughput is cal- Experimental parameters. We adopt the configura- culated as the number of committed transactions divided tion used by TAPIR [51] where a separate server replica by the measurement duration and is expressed in trans- group handles each data shard. Each microbenchmark actions per second (tps). We calculate the latency of a uses 3 shards with a replication level of 3, resulting in transaction as the amount of time taken for it to commit, a total of 9 server machines being used. In this set- including retries. A transaction that is given up after 20 ting, a majority quorum contains at least 2 servers and a retries is considered to have infinite latency. We calculate fast quorum must contain all 3 servers. When running the commit rate as the total number of committed trans- geo-replicated experiments, each server replica resides actions divided by the total number of commit attempts in a different data center. The TPC-C benchmark uses (including retries). 6 shards with a replication level of 3, for a total of 18 processes running on 9 server machines. Calibrating TAPIR’s performance. Because we re- We use closed-loop clients: each client thread issues implemented the TAPIR protocol using a different code one transaction at a time back-to-back. Aborted trans- base, we calibrated our results by running the most basic actions are retried for up to 20 attempts. We vary the experiment (one-key read-modify-write transactions), as injected load by changing the number of clients. We run presented in Figure 7 of TAPIR’s technical report [49]. client processes on a separate set of EC2 instances than The experiments run within a single data center with only servers. 
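The measurement methodology above can be summarized in a few lines of code. The sketch below is a stand-in for the harness, not its actual code: a closed-loop client issues one transaction at a time, retries aborted transactions up to 20 times, and reports throughput and commit rate exactly as defined in the text. IssueOnce is a placeholder stub.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

bool IssueOnce() { return true; }  // placeholder client stub: always commits here

struct Metrics {
  uint64_t commits = 0;
  uint64_t attempts = 0;
  std::vector<double> latency_ms;  // per committed transaction, including retries
};

void RunClosedLoopClient(Metrics& m, int num_txns) {
  using Clock = std::chrono::steady_clock;
  for (int i = 0; i < num_txns; ++i) {
    auto start = Clock::now();
    bool committed = false;
    for (int attempt = 0; attempt < 20 && !committed; ++attempt) {
      ++m.attempts;
      committed = IssueOnce();
    }
    if (committed) {
      ++m.commits;
      m.latency_ms.push_back(
          std::chrono::duration<double, std::milli>(Clock::now() - start)
              .count());
    }  // a transaction given up after 20 retries counts as infinite latency
  }
}

// Throughput = committed transactions / measured seconds (tps).
double ThroughputTps(const Metrics& m, double seconds) {
  return seconds > 0 ? m.commits / seconds : 0.0;
}
// Commit rate = committed transactions / commit attempts (including retries).
double CommitRate(const Metrics& m) {
  return m.attempts > 0 ? static_cast<double>(m.commits) / m.attempts : 0.0;
}
```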
In the geo-replicated setting, experiments of 2PL one shard. The keys are chosen uniformly randomly. Our and OCC co-locate the clients in the datacenter contain- TAPIR implementation (running on EC2) achieves a peak ing the underlying MultiPaxos leader. Clients of Janus throughput of ∼165.83K tps, which is much higher than

526 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association 2PL OCC the amount (∼8K tps) reported for Google VMs [49]. Janus Tapir We also did the experiment corresponding to Figure 13 60 of [51] and the results show a similar abort rate per- txn/s) formance for TAPIR and OCC (TAPIR’s abort rate is 3 40 an order of magnitude lower than OCC with lower zipf coefficients). Therefore, we believe our TAPIR imple- 20 mentation is representative. Throughput (10 0 0.5 0.6 0.7 0.8 0.9 1 6.2 Microbenchmark Zipf Coefficient (a) Cluster Throughput Workload. In the microbenchmark, each transaction 2PL OCC performs 3 read-write access on different shards by in- Janus Tapir crementing 3 randomly chosen key-value pairs. We pre- 1 populate each shard with 1 million key-value pairs before 0.8 starting the experiments. We vary the amount of con- 0.6 tention in the system by choosing keys according to a zipf 0.4 distribution with a varying zipf coefficient. The larger Commit Rate 0.2 the zipf coefficient, the higher the contention. This type 0 of microbenchmark is commonly used in evaluations of 0.5 0.6 0.7 0.8 0.9 1 transactional systems [9, 51]. Zipf Coefficient Single data center (low contention). Figure3 shows (b) Commit Rates the single data center performance of different systems Figure 4: Geo-replicated performance with increas- when the zipf coefficient increases from 0.5 to 1.0. Zipf ing contention as controlled by the Zipf coefficient. coefficients of 0.0∼0.5 are excluded because they have negligble contention and similar performance to zipf=0.5. In these experiments, we use 900 clients to inject load. does not have aborts, its throughput also drops because The number of clients is chosen to be on the “knee” the size of the dependency graph that is being maintained of the latency-throughput curve of TAPIR for a specific and exchanged grows with the amount of the contention. zipf value (0.5). In other words, using more than 900 Nevertheless, the overhead of graph computation is much clients results in significantly increased latency with only less than the cost of aborts/retries, which allows Janus to small throughput improvements. With zipf coefficients outperform existing systems. At zipf=0.9, the through- of 0.5∼0.6, the system experiences negligible amounts of put of Janus (55.41K tps) is 3.7× that of TAPIR (14.91K contention. As Figure 3b shows, the commit rates across tps). The corresponding 90-percentile latency for Janus all systems are almost 1 with zipf coefficient 0.5. and TAPIR is 24.65ms and 28.90ms, respectively. The Both TAPIR and Janus achieve much higher through- latter is higher due to repeated retries. put than 2PL/OCC+MultiPaxos. As a Paxos leader needs Multiple data centers (moderate to high contention). to handle much more communication load (>3×) than We move on to experiments that replicate data across non-leader server replicas, 2PL/OCC’s performance is multiple data centers. In this set of experiments, we use bottlenecked by Paxos leaders, which are only one-third 10800 clients to inject load, compared to 900 for in the of all 9 server machines. single data center setting. As the wide-area communica- Single data center (moderate to high contention). As tion latency is more than an order of magnitude larger, the zipf value varies from 0.6 to 1.0, the amount of con- we have to use many more concurrent requests in order to tention in the workload rises. As seen in Figure 3b, the achieve high throughput. 
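The microbenchmark above chooses 3 keys per transaction from a Zipf distribution over 1 million keys per shard. As a rough illustration of how such a skewed key chooser can be built (an assumption about the harness, not its actual code), an inverse-CDF sampler looks like this; a larger coefficient concentrates more load on a few hot keys, which is what drives the contention in these experiments.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

class ZipfSampler {
 public:
  ZipfSampler(uint64_t num_keys, double coefficient, uint64_t seed = 42)
      : rng_(seed), uniform_(0.0, 1.0) {
    cdf_.reserve(num_keys);
    double sum = 0;
    for (uint64_t k = 1; k <= num_keys; ++k) {
      sum += 1.0 / std::pow(static_cast<double>(k), coefficient);
      cdf_.push_back(sum);
    }
    for (double& c : cdf_) c /= sum;  // normalize into a proper CDF
  }

  // Returns a key index in [0, num_keys); index 0 is the hottest key.
  uint64_t Next() {
    const double u = uniform_(rng_);
    auto it = std::lower_bound(cdf_.begin(), cdf_.end(), u);
    const uint64_t idx = static_cast<uint64_t>(it - cdf_.begin());
    return std::min<uint64_t>(idx, cdf_.size() - 1);  // guard rounding at the tail
  }

 private:
  std::mt19937_64 rng_;
  std::uniform_real_distribution<double> uniform_;
  std::vector<double> cdf_;
};
```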
Thus, the amount of contention commit rates of TAPIR, 2PL and OCC decrease quickly for a given zipf value is much higher in multi-datacenter as the zipf value increases from 0.6 to 1.0. At first glance, experiments than that of single-data-center experiments. it is surprising that 2PL’s commit rate is no higher than As we can see in Figure 4b, at zipf=0.5, the commit rate OCC’s. This is because our 2PL implementation dis- is only 0.37 for TAPIR. This causes the throughput of patches pieces in parallel. This combined with the large TAPIR to be lower than Janus at zipf=0.5 (Figure 4a). amounts of false positives induced by deadlock detection, At larger zipf values, e.g., zipf=0.9, the throughput of makes locking ineffective. By contrast, Janus does not Janus drops to 43.51K tps, compared to 3.95K tps for incur any aborts and maintains a 100% commit rate. In- TAPIR, and ∼1085 tps for 2PL and OCC. As Figure 4b creasing amounts of aborts result in significantly lower shows, TAPIR’s commit rate is slightly lower than that throughput for TAPIR, 2PL and OCC. Although Janus of 2PL/OCC. Interference during replication, apart from

[Figure 5 plots: (a) Cluster Throughput (txn/s), (b) 90% Latency (ms), and (c) Commit Rate for 2PL, OCC, Janus, and Tapir as the number of clients grows from 1 to 1000.]

Figure 5: Performance in the single datacenter setting for the TPC-C benchmark with increasing load and contention as controlled by the number of clients per partition.

[Figure 6 plots: (a) Cluster Throughput (txn/s), (b) 90% Latency (ms), and (c) Commit Rate for 2PL, OCC, Janus, and Tapir as the number of clients grows from 1 to 10000.]

Figure 6: Performance in the geo-replicated setting for the TPC-C benchmark with increasing load and contention as controlled by the number of clients per partition. aborts due to transaction conflicts, may also cause TAPIR Single datacenter setting. Figure5 shows the single to retry the commit. data center performance of different systems when the number of clients is increased. As the number of clients 6.3 TPC-C increases, the throughput of Janus climbs until it saturates the servers’ CPU, and then stabilizes at 5.78K tps. On the other hand, the throughput of other systems will first new-order payment order-status delivery stock-level climb, and then drop due to high abort rates incurred as ratio 43.65% 44.05% 4.13% 3.99% 4.18% contention increases. TAPIR’s the throughput peaks at 560 tps; The peak throughput for for 2PL/OCC is lower than 595/324 tps. Because TAPIR can abort and retry Table 3: TPC-C mix ratio in a Janus trial. more quickly than OCC, its commit rate becomes lower than OCC as the contention increases. Because of the Workload. As the most commonly accepted bench- massive number of aborts, the 90-percentile latency for mark for testing OLTP systems, TPC-C has 9 tables and 2PL/OCC/TAPIR are more than several seconds when 92 columns in total. It has five transaction types, three of the number of clients increases; by contrast, the latency which are read-write transactions and two are read-only of Janus remains relatively low (<400ms). The latency of transactions. Table3 shows a commit transaction ratio Janus increases with the number of clients due to as more in one of our trials. In this test we use 6 shards each outstanding transactions result in increased queueing. replicated in 3 data centers. The workload is sharded by warehouse; each shard contains 1 warehouse; each ware- Multi-data center setting. When replicating data house contains 10 districts, following the specification. across multiple data centers, more clients are needed to Because each new-order transaction needs to do a read- saturate the servers. As shown in Figure6, the throughput modify-write operation on the next-order-id that is unique of Janus (5.7K tps) is significantly higher than the other in a district, this workload exposes very high contention protocols. In this setup, the other systems show the same with increasing numbers of clients. trend as in single data-center but the peak throughput be-

comes lower because the amount of contention increases dramatically with the increased number of clients. TAPIR's commit rate is lower than OCC's when there are ≥ 100 clients because the wide-area setting leads to more aborts for its inconsistent replication.

Figure 6b shows the 90-percentile latency. When the number of clients is ≤ 100, the experiments use only a single client machine located in the Oregon data center. When the number of clients is > 100, the clients are spread across all three data centers. As seen in Figure 6b, when contention is low (<10 clients), the latency of TAPIR and Janus is low (less than 150ms) because both protocols can commit in one cross-data-center roundtrip from Oregon. By contrast, the latency of 2PL/OCC is much higher (∼230ms) as their commits require two cross-data-center roundtrips. In our experiments, all leaders of the underlying MultiPaxos happen to be co-located in the same data center as the clients. Thus, 2PL/OCC require only one cross-data-center roundtrip to complete the 2PC prepare phase and another roundtrip for the commit phase, as our implementation of 2PL/OCC only returns to clients after the commit to reduce contention. As contention rises with the increased number of clients, the 90-percentile latency of 2PL/OCC/TAPIR increases quickly to tens of seconds. The latency of Janus also rises due to the increased overhead of graph exchange and computation at higher levels of contention. Nevertheless, Janus achieves a decent 90-percentile latency (<900ms) as the number of clients increases to 10,000.

7 Related Work

We review related work in fault-tolerant replication, geo-replication, and concurrency control.

Fault-tolerant replication. The replicated state machine (RSM) approach [17, 38] enables multiple machines to maintain the same state through deterministic execution of the same commands and can be used to mask the failure of individual machines. Paxos [18] and viewstamped replication [24, 33] showed it was possible to implement RSMs that are always safe and remain live when f or fewer out of 2f + 1 total machines fail. Paxos and viewstamped replication solve consensus for the series of commands the RSM executes. Fast Paxos [21] reduced the latency of consensus by having clients send commands directly to replicas instead of through a distinguished proposer. Generalized Paxos [20] builds on Fast Paxos by further enabling commands to commit out of order when they do not interfere, i.e., conflict. Janus uses this insight from Generalized Paxos to avoid specifying an order for non-conflicting transactions.

Mencius [29] showed how to reach consensus with low latency under low load and high throughput under high load in a geo-replicated setting by efficiently round-robining leader duties between replicas. Speculative Paxos [36] can achieve consensus in a single round trip by exploiting a co-designed datacenter network for consistent ordering. EPaxos [30] is the consensus protocol most related to our work. EPaxos builds on Mencius, Fast Paxos, and Generalized Paxos to achieve (near-)optimal commit latency in the wide area, high throughput, and tolerance to slow nodes. EPaxos's use of dependency graphs to dynamically order commands based on their arrival order at different replicas inspired Janus's dependency tracking for replication.

Janus addresses a different problem than RSMs because it provides fault tolerance and scalability to many shards. RSMs are designed to replicate a single shard, and when used across shards they are still limited to the throughput of a single machine.

Geo-replication. Many recent systems have been designed for the geo-replicated setting with varying degrees of consistency guarantees and transaction support. Dynamo [11], PNUTS [8], and TAO [6] provide eventual consistency and avoid wide-area messages for most operations. COPS [26] and Eiger [27] provide causal consistency and read-only transactions, and Eiger additionally provides write-only transactions, while always avoiding wide-area messages. Walter [39] and Lynx [52] often avoid wide-area messages and provide general transactions with parallel snapshot isolation and serializability, respectively. All of these systems will typically provide lower latency than Janus because they made a different choice in the fundamental trade-off between latency and consistency [5, 23]. Janus is on the other side of that divide and provides strict serializability and general transactions.

Concurrency control. Fault-tolerant, scalable systems typically layer a concurrency control protocol on top of a replication protocol. Sinfonia [3] piggybacks OCC onto 2PC over primary-backup replication. Percolator [35] also uses OCC over primary-backup replication. Spanner [9] uses 2PL with wound-wait over Multi-Paxos and inspired our 2PL experimental baseline. CLOCC [2, 25] layers fine-grained optimistic concurrency control using loosely synchronized clocks over viewstamped replication. Granola [10] is optimized for single-shard transactions, but also includes a global transaction protocol that uses a custom 2PC over viewstamped replication. Calvin [42] uses a sequencing layer and 2PL over Paxos. Salt [47] is a concurrency control protocol for mixing ACID and BASE transactions that inherits MySQL Cluster's chain replication [44]. Salt's successor, Callas, introduces modular concurrency control [48] that enables different types of concurrency control for different parts of a workload.
Replicated Commit [28] executes Paxos over 2PC instead of the typical 2PC over Paxos to reduce wide-area messages. Rococo [31] is a dependency-graph-based concurrency control protocol over Paxos that avoids aborts by reordering conflicting transactions. Rococo's protocol inspired our use of dependency tracking for concurrency control. All of these systems layer a concurrency control protocol over a separate replication protocol, which incurs coordination twice in serial.

MDCC [16] and TAPIR [49, 50, 51] are the most related systems, and we have discussed them extensively throughout the paper. Their fast fault-tolerant transaction commits under low contention inspired us to work on fast commits under all levels of contention, which is our biggest distinction from them.

8 Conclusion

We presented the design, implementation, and evaluation of Janus, a new protocol for fault-tolerant distributed transactions that are one-shot and written as stored procedures. The key insight behind Janus is that the coordination required for concurrency control and consensus is highly similar. We exploit this insight by tracking conflicting arrival orders of both different transactions across shards and the same transaction within a shard using a single dependency graph. This enables Janus to commit in a single round trip when there is no contention. When there is contention, Janus is able to commit by having all shards of a transaction reach consensus on its dependencies and then breaking cycles through deterministic reordering before execution. This enables Janus to provide higher throughput and lower latency than the state of the art when workloads have moderate to high skew, are geo-replicated, and are realistically complex.
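To make the deterministic reordering concrete, the sketch below shows one way a server could turn an agreed-upon dependency graph into an execution order: compute strongly connected components (SCCs), execute SCCs in dependency order, and break each cycle by sorting its members deterministically. This is an illustrative sketch under our own assumptions, not the paper's implementation; the graph representation, the function names, and the use of transaction ids as the tie-breaker are hypothetical.

# Minimal sketch (assumptions: the graph encoding and names below are ours, not
# the paper's). Once all shards agree on each transaction's dependencies, every
# server can independently compute the same execution order by (1) finding the
# strongly connected components (SCCs) of the dependency graph, (2) executing
# SCCs with their dependencies first, and (3) breaking each cycle by a
# deterministic sort on transaction id.

def scc_condensation(deps):
    """deps: {txn_id: set of txn_ids it depends on}.
    Returns SCCs with dependencies ordered before dependents."""
    index, low, on_stack, stack = {}, {}, set(), []
    sccs, counter = [], [0]

    def strongconnect(root):
        # Iterative Tarjan's algorithm to avoid recursion limits.
        work = [(root, iter(deps.get(root, ())))]
        index[root] = low[root] = counter[0]; counter[0] += 1
        stack.append(root); on_stack.add(root)
        while work:
            v, it = work[-1]
            advanced = False
            for w in it:
                if w not in index:
                    index[w] = low[w] = counter[0]; counter[0] += 1
                    stack.append(w); on_stack.add(w)
                    work.append((w, iter(deps.get(w, ()))))
                    advanced = True
                    break
                elif w in on_stack:
                    low[v] = min(low[v], index[w])
            if advanced:
                continue
            work.pop()
            if work:
                low[work[-1][0]] = min(low[work[-1][0]], low[v])
            if low[v] == index[v]:
                comp = []
                while True:
                    w = stack.pop(); on_stack.discard(w); comp.append(w)
                    if w == v:
                        break
                sccs.append(comp)

    for v in deps:
        if v not in index:
            strongconnect(v)
    # Tarjan emits SCCs so that, for edges "txn -> its dependency", the
    # dependency's SCC appears before the dependent's SCC.
    return sccs

def execution_order(deps):
    order = []
    for comp in scc_condensation(deps):
        # Deterministic reordering: every replica sorts a cycle the same way.
        order.extend(sorted(comp))
    return order

# Example: t1 and t2 conflict on two shards and are observed in opposite
# orders, producing the cycle t1 -> t2 -> t1; t3 depends on t1 only.
deps = {"t1": {"t2"}, "t2": {"t1"}, "t3": {"t1"}}
print(execution_order(deps))   # ['t1', 't2', 't3']

Any tie-breaker works as long as every replica applies the same one; the essential property is that once the shards agree on the dependencies, every server derives the same execution order without further coordination.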
Acknowledgments

The work is supported by NSF grants CNS-1514422 and CNS-1218117 as well as AFOSR grant FA9550-15-1-0302. Lamont Nelson is also supported by an NSF graduate fellowship. We thank Michael Walfish for helping us clarify the design of Janus and Frank Dabek for reading an early draft. We thank our shepherd, Dan Ports, and the anonymous reviewers of the OSDI program committee for their helpful comments.

References

[1] TPC-C Benchmark. http://www.tpc.org/tpcc/.
[2] A. Adya, R. Gruber, B. Liskov, and U. Maheshwari. Efficient optimistic concurrency control using loosely synchronized clocks. In Proceedings of ACM International Conference on Management of Data (SIGMOD), 1995.
[3] M. K. Aguilera, A. Merchant, M. Shah, A. Veitch, and C. Karamanolis. Sinfonia: a new paradigm for building scalable distributed systems. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2007.
[4] Amazon. Cross-Region Replication Using DynamoDB Streams. http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.CrossRegionRepl.html, 2016.
[5] H. Attiya and J. L. Welch. Sequential consistency versus linearizability. ACM Transactions on Computer Systems (TOCS), 12(2), 1994.
[6] N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, H. Li, M. Marchukov, D. Petrov, L. Puzar, Y. J. Song, and V. Venkataramani. TAO: Facebook's distributed data store for the social graph. In Proceedings of USENIX Annual Technical Conference (ATC), 2013.
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.
[8] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. In Proceedings of International Conference on Very Large Data Bases (VLDB), 2008.
[9] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al. Spanner: Google's globally distributed database. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012.
[10] J. Cowling and B. Liskov. Granola: low-overhead distributed transaction coordination. In Proceedings of USENIX Annual Technical Conference (ATC), 2012.
[11] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2007.
[12] H. Garcia-Molina and K. Salem. Main memory database systems: An overview. IEEE Transactions on Knowledge and Data Engineering, 4(6):509–516, 1992.
[13] H. Garcia-Molina, R. J. Lipton, and J. Valdes. A massive memory machine. IEEE Transactions on Computers, 100(5):391–399, 1984.
[14] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 12(3):463–492, 1990.
[15] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. Jones, S. Madden, M. Stonebraker, Y. Zhang, et al. H-store: a high-performance, distributed main memory transaction processing system. In Proceedings of International Conference on Very Large Data Bases (VLDB), 2008.
[16] T. Kraska, G. Pang, M. J. Franklin, S. Madden, and A. Fekete. MDCC: Multi-data center consistency. In Proceedings of ACM European Conference on Computer Systems (EuroSys), 2013.
[17] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 1978.
[18] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, 1998.
[19] L. Lamport. Paxos made simple. ACM SIGACT News, 32(4):18–25, 2001.
[20] L. Lamport. Generalized consensus and Paxos. Technical Report MSR-TR-2005-33, Microsoft Research, 2005.
[21] L. Lamport. Fast Paxos. Distributed Computing, 19(2):79–103, October 2006.
[22] K. Li and J. F. Naughton. Multiprocessor main memory transaction processing. In Proceedings of the First International Symposium on Databases in Parallel and Distributed Systems, 2000.
[23] R. J. Lipton and J. S. Sandberg. PRAM: A scalable shared memory. Technical Report TR-180-88, Princeton Univ., Dept. Comp. Sci., 1988.
[24] B. Liskov and J. Cowling. Viewstamped replication revisited. Technical Report MIT-CSAIL-TR-2012-021, MIT, 2012.
[25] B. Liskov, M. Castro, L. Shrira, and A. Adya. Providing persistent objects in distributed systems. In ECOOP, 1999.
[26] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don't settle for eventual: scalable causal consistency for wide-area storage with COPS. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2011.
[27] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Stronger semantics for low-latency geo-replicated storage. In Proceedings of USENIX Conference on Networked Systems Design and Implementation (NSDI), 2013.
[28] H. Mahmoud, F. Nawab, A. Pucher, D. Agrawal, and A. El Abbadi. Low-latency multi-datacenter databases using replicated commit. In Proceedings of International Conference on Very Large Data Bases (VLDB), 2013.
[29] Y. Mao, F. P. Junqueira, and K. Marzullo. Mencius: building efficient replicated state machines for WANs. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2008.
[30] I. Moraru, D. G. Andersen, and M. Kaminsky. There is more consensus in egalitarian parliaments. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2013.
[31] S. Mu, Y. Cui, Y. Zhang, W. Lloyd, and J. Li. Extracting more concurrency from distributed transactions. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[32] S. Mu, L. Nelson, W. Lloyd, and J. Li. Consolidating concurrency control and consensus for commits under conflicts. Technical Report TR2016-983, New York University, Courant Institute of Mathematical Sciences, 2016.
[33] B. M. Oki and B. H. Liskov. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Proceedings of ACM Symposium on Principles of Distributed Computing (PODC), 1988.
[34] C. H. Papadimitriou. The serializability of concurrent database updates. Journal of the ACM, 26(4), 1979.
[35] D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.
[36] D. R. Ports, J. Li, V. Liu, N. K. Sharma, and A. Krishnamurthy. Designing distributed systems using approximate synchrony in data center networks. In Proceedings of USENIX Conference on Networked Systems Design and Implementation (NSDI), 2015.
[37] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis II. System level concurrency control for distributed database systems. ACM Transactions on Database Systems (TODS), 3(2):178–198, 1978.
[38] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 22(4), Dec. 1990.
[39] Y. Sovran, R. Power, M. K. Aguilera, and J. Li. Transactional storage for geo-replicated systems. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2011.
[40] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In Proceedings of International Conference on Very Large Data Bases (VLDB), 2007.
[41] A. Thomson and D. J. Abadi. The case for determinism in database systems. In Proceedings of International Conference on Very Large Data Bases (VLDB), 2010.
[42] A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: fast distributed transactions for partitioned database systems. In Proceedings of ACM International Conference on Management of Data (SIGMOD), 2012.
[43] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactions in multicore in-memory databases. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2013.
[44] R. van Renesse and F. B. Schneider. Chain replication for supporting high throughput and availability. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[45] G. Weikum and G. Vossen. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, 2001.
[46] A. Whitney, D. Shasha, and S. Apter. High volume transaction processing without concurrency control, two phase commit, or C++. In Seventh International Workshop on High Performance Transaction Systems, Asilomar, 1997.
[47] C. Xie, C. Su, M. Kapritsos, Y. Wang, N. Yaghmazadeh, L. Alvisi, and P. Mahajan. Salt: Combining ACID and BASE in a distributed database. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[48] C. Xie, C. Su, C. Littley, L. Alvisi, M. Kapritsos, and Y. Wang. High-performance ACID via modular concurrency control. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2015.
[49] I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. Ports. Building consistent transactions with inconsistent replication. Technical Report UW-CSE-2014-12-01, University of Washington, 2014.
[50] I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. Ports. Building consistent transactions with inconsistent replication (extended version). Technical Report UW-CSE-2014-12-01 v2, University of Washington, 2015.
[51] I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. K. Ports. Building consistent transactions with inconsistent replication. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2015.
[52] Y. Zhang, R. Power, S. Zhou, Y. Sovran, M. K. Aguilera, and J. Li. Transaction chains: achieving serializability with low latency in geo-distributed storage systems. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2013.
