Causal Consistency and Latency Optimality: Friend or Foe?

Diego Didona, Rachid Guerraoui, Jingjing Wang, Willy Zwaenepoel
EPFL
first.last@epfl.ch

arXiv:1803.04237v1 [cs.DC] 12 Mar 2018

ABSTRACT

Causal consistency is an attractive consistency model for replicated data stores. It is provably the strongest model that tolerates partitions, it avoids the long latencies associated with strong consistency, and, especially when using read-only transactions, it prevents many of the anomalies of weaker consistency models. Recent work has shown that causal consistency allows "latency-optimal" read-only transactions, that are nonblocking, single-version and single-round in terms of communication. On the surface, this latency optimality is very appealing, as the vast majority of applications are assumed to have read-dominated workloads.

In this paper, we show that such "latency-optimal" read-only transactions induce an extra overhead on writes; the extra overhead is so high that performance is actually jeopardized, even in read-dominated workloads. We show this result from a practical and a theoretical angle.

First, we present a protocol that implements "almost latency-optimal" ROTs but does not impose on the writes any of the overhead of latency-optimal protocols. In this protocol, ROTs are nonblocking, one-version, and can be configured to use either two or one and a half rounds of client-server communication. We experimentally show that this protocol not only provides better throughput, as expected, but also surprisingly better latencies for all but the lowest loads and most read-heavy workloads.

Then, we prove that the extra overhead imposed on writes by latency-optimal read-only transactions is inherent, i.e., it is not an artifact of the design we consider, and cannot be avoided by any implementation of latency-optimal read-only transactions. We show in particular that this overhead grows linearly with the number of clients.

1. INTRODUCTION

Geo-replication is gaining momentum in industry [47, 40, 48, 21, 9, 19, 16, 24, 61] and academia [32, 23, 55, 67, 21, 44, 66, 65, 46] as a design choice for large-scale data platforms to meet the strict latency and availability requirements of online applications [52, 58, 5]. Geo-replication aims to reduce operation latencies, by storing a copy of the data closer to the clients, and to increase availability, by keeping multiple copies of the data at different data centers (DCs).

Causal consistency. Causal consistency (CC) [2] is an attractive consistency model for building geo-replicated data stores. On the one hand, it has an intuitive semantics and avoids many anomalies that are allowed under weaker consistency properties [63, 24]. On the other hand, it avoids the long latencies incurred by strong consistency (e.g., linearizability and strict serializability) [30, 21] and tolerates network partitions [37]. In fact, CC is provably the strongest consistency that can be achieved in an always-available system [41, 7]. CC is the target consistency level of many systems [37, 38, 25, 26, 4, 18, 29], it is used in platforms that support multiple levels of consistency [13, 36], and it is a building block for strong consistency systems [12] and formal checkers of distributed protocols [28].

Read-only transactions in CC. High-level operations such as producing a web page often translate to multiple reads from the underlying data store [47]. Ensuring that all these reads are served from the same consistent snapshot avoids undesirable anomalies, in particular the following well-known one: Alice removes Bob from the access list of a photo album and adds a photo to it, but Bob reads the original permissions and the new version of the album [37, 39]. Therefore, the vast majority of CC systems provide read-only transactions (ROTs) to read multiple items at once from a causally consistent snapshot [37, 38, 26, 4, 3, 64]. Large-scale applications are often read-heavy [6, 47, 48, 40], and achieving low-latency ROTs becomes a first-class concern for CC systems.

Earlier CC ROT designs were either blocking [25, 26, 3, 4] or required multiple rounds of communication to complete [37, 38, 4]. Recent work on the COPS-SNOW system [39] has, however, demonstrated that it is possible to perform causally consistent ROTs in a single round of communication, sending only one version of the keys involved, and in a nonblocking fashion. Because it exhibits these three properties, the COPS-SNOW ROT protocol was termed latency-optimal (LO). The protocol achieves LO by imposing additional processing costs on writes. One could argue that this is a correct tradeoff for the common case of read-heavy workloads, because the overhead affects the minority of operations and is thus to the advantage of the majority of them. This paper sheds a different light on this tradeoff.

Contributions. In this paper we show that the extra cost on writes is so high that so-called latency-optimal ROTs in practice exhibit higher latencies than alternative designs, even in read-heavy workloads. This extra cost not only reduces the available processing power, leading to lower throughput, but it also leads to higher resource contention, which results in higher queueing times, and, ultimately, in higher latencies. We demonstrate this counterintuitive result from two angles, a practical and a theoretical one.

1) From a practical standpoint, we show how an existing and widely used design of CC can be improved to achieve almost all the properties of a latency-optimal design, without incurring the overhead on writes that latency optimality implies. We implement this improved design in a system that we call Contrarian. Measurements in a variety of scenarios demonstrate that, for all but the lowest loads and the most read-heavy workloads, Contrarian provides better latencies and throughput than an LO protocol.

2) From a theoretical standpoint, we show that the extra cost imposed on writes to achieve LO ROTs is inherent to CC, i.e., it cannot be avoided by any CC system that implements LO ROTs. We also provide a lower bound on this extra cost in terms of communication overhead. Specifically, we show that the amount of extra information exchanged on writes potentially grows linearly with the number of clients.

The relevance of our theoretical results goes beyond the scope of CC. In fact, they apply to any consistency model strictly stronger than causal consistency, e.g., linearizability [30]. Moreover, our result is relevant also for systems that implement hybrid consistency models that include CC [13] or that implement strong consistency on top of CC [12].

Roadmap. The remainder of this paper is organized as follows. Section 2 provides introductory concepts and definitions. Section 3 surveys the complexities involved in the implementation of ROTs. Section 4 presents our Contrarian protocol. Section 5 compares Contrarian and an LO design. Section 6 presents our theoretical results. Section 7 discusses related work. Section 8 concludes the paper.

2. SYSTEM MODEL

2.1 API

We consider a multi-version key value data store. We denote keys by lower-case letters, e.g., x, and the versions of their corresponding values by capital letters, e.g., X. The key value store provides the following operations:

X ← GET(x): A GET operation returns the value of the item identified by x. GET may return ⊥ to show that there is no item yet identified by x.

PUT(x, X): A PUT operation creates a new version X of an item identified by x. If item x does not yet exist, the system creates a new item x with value X.

(X, Y, ...) ← ROT(x, y, ...): A ROT returns a vector (X, Y, ...) of versions of the requested keys (x, y, ...). A ROT may return ⊥ to show that some item does not exist.

In the remainder of this paper we focus on PUT and ROT operations.
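To make this API concrete, the following is a minimal C++ sketch of the interface just described. It is illustrative only: the type and class names (Key, Value, KeyValueStore) are ours and do not correspond to any system discussed in this paper.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

using Key = std::string;
using Value = std::string;

// Hypothetical interface mirroring the GET / PUT / ROT operations of Section 2.1.
class KeyValueStore {
public:
    virtual ~KeyValueStore() = default;

    // Returns the value of item x, or std::nullopt if no item x exists yet (the "⊥" case).
    virtual std::optional<Value> Get(const Key& x) = 0;

    // Creates a new version X of item x, creating the item if it does not exist.
    virtual void Put(const Key& x, const Value& X) = 0;

    // Returns one version per requested key, all taken from the same causally
    // consistent snapshot; keys with no item map to std::nullopt.
    virtual std::map<Key, std::optional<Value>> Rot(const std::vector<Key>& keys) = 0;
};
```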
2.2 Causal Consistency

The causality order is a happens-before relationship between any two operations in a given execution [35, 2]. For any two operations α and β, we say that β causally depends on α, and we write α ⤳ β, if and only if at least one of the following conditions holds: i) α and β are operations in a single thread of execution, and α happens before β; ii) ∃x, X such that α creates version X of key x, and β reads X; iii) ∃γ such that α ⤳ γ and γ ⤳ β. If α and β are two PUTs with values X and Y respectively, then (with a slight abuse of notation) we also say Y causally depends on X, and we write X ⤳ Y.

A causally consistent key value store respects the causality order. Intuitively, if a client reads Y and X ⤳ Y, then any subsequent read performed by the client on x returns either X or a newer version. I.e., the client cannot read a version X′ such that X′ ⤳ X. A ROT operation returns item versions from a causally consistent snapshot [42, 37]: if a ROT returns X and Y such that X ⤳ Y, then there is no X′ such that X ⤳ X′ ⤳ Y.

To circumvent trivial implementations of causal consistency, we require that a value, once written, becomes eventually visible, meaning that it is available to be read by all clients after some finite time [11].

Causal consistency does not establish an order among concurrent (i.e., not causally related) updates on the same key. Hence, different replicas of the same key might diverge and expose different values [63]. We consider a system that eventually converges: if there are no further updates, eventually all replicas of any key take the same value, for instance using the last-writer-wins rule [60].

Hereafter, when we use the term causal consistency, eventual visibility and convergence are implied.

2.3 Partitioning and Replication

We target key value stores where the data set is split into N > 1 partitions. Each key is deterministically assigned to a partition by a hash function. A PUT(x, X) is sent to the partition that stores x. For a ROT(x, ...), a read request is sent to all partitions that store keys in the specified key set.

Each partition is replicated at M ≥ 1 DCs. Our theoretical and practical results hold for both single and replicated DCs. In the case of replication, we consider a multi-master design, i.e., all replicas of a key accept PUT operations.

3. CHALLENGES IN IMPLEMENTING READ-ONLY TRANSACTIONS

Single DC case. Even in a single DC, partitions involved in a ROT cannot simply return the most recent version of a requested key if one wants to ensure that a ROT observes a causally consistent snapshot. Consider the scenario of Figure 1, with two keys x and y, with initial values X0 and Y0, residing on partitions px and py, respectively. Client C1 performs a ROT on keys x and y, and client C2 performs a PUT on x with value X1 and later a PUT on y with value Y1. By asynchrony, the read on x by C1 arrives at px before the PUT by C2 on x, and the read by C1 on y arrives at py after the PUT by C2 on y. Clearly, py cannot return Y1 to C1, because a snapshot consisting of X0 and Y1, with X0 ⤳ X1 ⤳ Y1, violates the causal consistency property for snapshots (see Section 2.2).

Figure 1: Challenges in implementing CC ROTs. C1 issues ROT(x, y). If T1 returns X0 to C1, then T1 cannot return Y1 because there is X1 such that X0 ⤳ X1 ⤳ Y1.

COPS [37] presented the first solution to this problem. It encodes causality as direct dependencies of the form "version Y of y depends on version X of x", stored with Y, and "client C has established a dependency on version X of x", stored with C. These dependencies are passed around as necessary to maintain causality. COPS solves the aforementioned challenge as follows. When C1 performs its ROT, in the first round of the protocol, px and py return the most recent versions of x and y, X0 and Y1. Partition py also returns to C1 the dependency "Y1 depends on X1". From this piece of information C1 can determine that X0 and Y1 do not form a causally consistent snapshot. Thus, in the second round of the protocol, C1 requests from px a more recent version of x to obtain a causally consistent snapshot, in this case X1. This protocol is nonblocking, but requires (potentially) two rounds of communication and two versions of the key(s) being communicated. Eiger [38] improves on this design by using less meta-data, but maintains the potentially two-round, two-version implementation of ROTs.

In later designs for CC systems [25, 26, 3], direct dependencies were abandoned in favor of timestamps, which provide a more compact and efficient encoding of causality. To maintain causality, a timestamp is associated with every version of every data item. Each client and each partition also maintain the highest timestamp they have observed. When performing a PUT, a client sends along its timestamp. The timestamp of the newly created version is then one plus the maximum of the client's timestamp and the partition's timestamp, thus encoding causality. After completing a PUT, the partition replies to the client with this new version's timestamp. To implement ROTs it then suffices to pick a timestamp for the snapshot, and send it with the ROTs to the partitions. A partition first makes sure that its timestamp has caught up to the snapshot timestamp [3]. This ensures that no version can later be created with a timestamp lower than the snapshot timestamp. Then, the partition returns the most recent key values with a timestamp smaller than or equal to the snapshot timestamp.

The snapshot timestamp is picked by a transaction coordinator [26, 3]. Any server can be the coordinator of a ROT. Thus, the client contacts the coordinator, the coordinator picks the timestamp, and the client or the coordinator then sends this timestamp along with the keys to be read to the partitions. The client provides the coordinator with the last timestamp it has observed, and the coordinator picks the transaction timestamp as the maximum of the client's timestamp and its own. Observe that, in general, the client cannot pick the snapshot timestamp itself, because its timestamp may be arbitrarily far behind, compromising eventual visibility.

Timestamps may be generated by logical or by physical clocks. Returning to our example of Figure 1, assume that the logical clocks at C1 and C2 are initially 0, the logical clocks at px and py are initially 90, the timestamps of X0 and Y0 are 70, and that a transaction coordinator chooses a snapshot timestamp of 100. When receiving the read of C1 with snapshot timestamp 100, px advances its logical clock to 100 and returns X0. When px receives PUT(x, X1), it creates X1 with timestamp 101, and returns that value to C2. C2 then sends PUT(y, Y1) to py with timestamp 101, and Y1 is created with timestamp 102. When the read of C1 on y arrives with snapshot timestamp 100, py uses the timestamps of Y0 and Y1 to conclude that it needs to return Y0, the most recent version with timestamp smaller than or equal to 100. As with COPS, this protocol is nonblocking; unlike COPS, it requires only a single version of each key, but it always requires two rounds of communication [3].

A further complication arises when (loosely synchronized) physical clocks are used for timestamping [26, 3], since physical clocks, unlike logical clocks, can only move forward with the passage of time. As a result, in our example, when the read on px arrives with snapshot timestamp 100, px has to wait until its physical clock advances to 100 before it can return X0. This makes the protocol blocking, in addition to being one-version and two-round.

The question then becomes: does there exist a single-round, single-version, nonblocking protocol for CC ROTs? This question was answered in the affirmative by a follow-up to the COPS and Eiger systems, called COPS-SNOW [39]. Using again the previous example, we depict in Figure 2 how the COPS-SNOW protocol works at a high level. Each ROT is given a unique identifier. When a ROT T1 reads X0, px records T1 as a reader of x. It also records the (logical) time at which the read occurred. On a later PUT on x, T1 is added to the "old readers of x", the set of transactions that have read a version of x that is no longer the most recent version, again together with the logical time at which the read occurred.

Figure 2: COPS-SNOW design. C2 declares that Y1 depends on X0. Before making Y1 visible, py runs a "readers check" with px and is informed that T1 has observed a snapshot that does not include Y1.

When C2 later sends its PUT on y to py, it includes (as in COPS) that this PUT is dependent on X1. Partition py now interrogates px as to whether there are old readers of x, and, if so, records the old readers of x into the old reader record of y, together with their logical time. When the read of T1 on y later arrives, py finds T1 in the old reader record of y. py therefore knows that it cannot return Y1. Using the logical time in the old reader record, it returns the most recent version of y before that time, in this case Y0. In the rest of the paper, we refer to this procedure as the readers check. This protocol is one-round, one-version and nonblocking, and therefore termed latency-optimal.
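To illustrate the bookkeeping that the readers check implies, here is a simplified, single-partition C++ sketch of the reader and old-reader records described above. It is our own reconstruction, not COPS-SNOW code: garbage collection, replication and the actual message exchange between partitions are omitted, and all names are ours.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

using TxId = uint64_t;        // unique ROT identifier
using LogicalTime = uint64_t;
using Key = std::string;

// Per-key record of which ROTs read the current version (readers) and which
// ROTs read a version that has since been overwritten (old readers).
struct ReaderRecord {
    std::unordered_map<TxId, LogicalTime> readers;
    std::unordered_map<TxId, LogicalTime> old_readers;
};

class Partition {
public:
    // ROT path: remember that transaction `rot` read key x at logical time t.
    void OnRead(const Key& x, TxId rot, LogicalTime t) { records_[x].readers[rot] = t; }

    // PUT path: the current readers of x become old readers, since the version
    // they observed is no longer the most recent one.
    void OnPut(const Key& x) {
        auto& r = records_[x];
        r.old_readers.insert(r.readers.begin(), r.readers.end());
        r.readers.clear();
    }

    // Readers check: invoked (via a message, in the real protocol) by the
    // partition of a dependent key before it makes its new version visible.
    const std::unordered_map<TxId, LogicalTime>& OldReaders(const Key& x) {
        return records_[x].old_readers;
    }

private:
    std::unordered_map<Key, ReaderRecord> records_;
};
```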

This protocol, however, incurs a very high cost on PUTs. We demonstrate this cost by slightly modifying our example. Let us assume that hundreds of ROTs read X0 before the PUT(x, X1) (as might well occur with a skewed workload in which x is a hot key). Then all these transactions must be stored as readers and then as old readers of x, communicated to py, and examined by py on each incoming read from a ROT. Let us further modify the example by assuming that C2 reads other keys from partitions pi different from px and py before writing Y1. Because C2 has established a dependency on all the versions it has read, in order to compute the old readers for y, py needs to interrogate not only px, but all the other partitions pi.

Challenges of geo-replication. Further complications arise in a geo-replicated setting with multiple DCs. We assume that keys are replicated asynchronously, so a new key version may arrive at a DC before its causal dependencies. COPS and COPS-SNOW deal with this situation through a technique called dependency checking. When a new key version is replicated, its causal dependencies are sent along. Before the new version is installed, the system checks by means of dependency check messages to other partitions that its causal dependencies are present. When its dependencies have been installed in the DC, the new key version can be installed as well. In COPS-SNOW, in addition, the readers check for the new key version proceeds in a remote data center as it does in the data center where the PUT originated. To amortize the overhead, the dependency check and the readers check are performed as a single protocol.

An alternative technique, commonly used with timestamp-based methods, is to use a stabilization protocol [8, 26, 3]. Variations exist, but in general each data center establishes a cutoff timestamp below which it has received all remote updates. Updates with a timestamp lower than this cutoff can then be installed. Stabilization protocols are cheaper to implement than dependency checking [26], but they lead to a complication in making ROTs nonblocking, in that one needs to make sure that the snapshot timestamp assigned to a ROT is below the cutoff timestamp, so that there is no blocking upon reading.
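The sketch below illustrates, in a few lines of C++, the cutoff computation at the heart of such a stabilization protocol. It is a deliberately simplified view under the assumption of a single scalar cutoff per partition; the periodic exchange of timestamps between nodes and the handling of heartbeats are not shown, and the names are ours.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using Timestamp = uint64_t;

// Tracks, per remote replica, the highest update (or heartbeat) timestamp
// received so far. Everything below the minimum of these values has certainly
// been received, so remote updates below that cutoff can be installed.
struct StabilizationState {
    std::vector<Timestamp> highest_received;  // one entry per remote replica

    Timestamp Cutoff() const {
        assert(!highest_received.empty());
        return *std::min_element(highest_received.begin(), highest_received.end());
    }

    // A remote update is safe to install once it falls below the cutoff.
    bool CanInstall(Timestamp update_ts) const { return update_ts < Cutoff(); }

    // A ROT snapshot is capped at the cutoff, so reads never have to block
    // waiting for remote updates that have not yet arrived.
    Timestamp ClampSnapshot(Timestamp proposed) const { return std::min(proposed, Cutoff()); }
};
```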
4. CONTRARIAN: AN EFFICIENT BUT NOT LATENCY-OPTIMAL DESIGN

We now present Contrarian, a protocol that implements almost all the properties of latency-optimal ROTs, without incurring the overhead that stems from latency-optimal ROTs, thereby providing low latency, resource efficiency and high throughput.

Our goal is not to propose a radically new design of CC. Rather, we aim to show how an existing and widely employed non-latency-optimal design can be improved to achieve almost all the desirable properties of latency optimality without incurring the overhead that inherently results from achieving all of them (as we demonstrate in Section 6).

Contrarian builds on the aforementioned coordinator-based design of ROTs and on the stabilization-protocol-based approach (to determine the visibility of remote items) in the geo-replicated setting. These characteristics, all or in part, lie at the core of many state-of-the-art systems, like Orbe [25], GentleRain [26], Cure [3] and CausalSpartan [54]. The improvements we propose in Contrarian can thus be employed to improve the design of these and similar systems.

Properties of ROTs. Contrarian's ROT protocol runs in 1 1/2 rounds, is one-version, and is nonblocking. While Contrarian sacrifices a half round in latency compared to the theoretically LO protocol, it retains the low cost of PUTs of other non-LO designs.

Contrarian implements ROTs in 1 1/2 rounds of communication: one round trip between the client and the partitions (one of which is chosen as the coordinator), with an extra hop from the coordinator to the partitions. As shown in Figure 3, this design requires only three communication steps instead of the four of the classical coordinator-based approach described in Section 3. Contrarian reduces the communication hops to improve latency, at the expense of generating more messages to serve a ROT with respect to a 2-round approach. As we shall see in Section 5.3, this leads to a slight throughput loss. Contrarian can be configured to run ROTs with 2 rounds (even on a per-ROT basis) to maximize throughput.

Figure 3: ROT implementation in Contrarian: (a) 1 1/2 rounds (3 communication steps); (b) 2 rounds (4 communication steps). Numbered circles depict the order of operations. The client always piggybacks on its requests the last snapshot it has seen (not shown), so as to observe monotonically increasing snapshots. Any node involved in a ROT can act as the coordinator of the ROT. Using 1 1/2 rounds reduces the number of communication hops with respect to 2 rounds, at the expense of more messages exchanged to run a ROT.

Contrarian achieves the one-version property because partitions read the freshest version within the snapshot proposed by the coordinator.

Contrarian implements nonblocking ROTs by using logical clocks. In the single-DC case, logical clocks allow a partition to move its local clock's value to the snapshot timestamp of an incoming ROT, if needed. Hence, ROTs can be served without blocking (as described in Section 3).
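The following C++ sketch shows the single-DC version of this protocol with scalar logical timestamps, corresponding to the 1 1/2-round flow of Figure 3(a): the client sends its reads to every involved partition, one of which acts as coordinator (step 1); the coordinator picks the snapshot and forwards it to the other partitions (step 2); each partition then answers the client directly (step 3). This is a schematic reconstruction, not Contrarian's actual code: message exchanges are collapsed into direct function calls, and all names are ours.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Timestamp = uint64_t;
using Key = std::string;
using Value = std::string;

struct Version { Value value; Timestamp ts; };

class Partition {
public:
    // Step 3 of Figure 3(a): serve a read within the proposed snapshot.
    // Logical clocks make this nonblocking: the clock jumps forward to the
    // snapshot timestamp, so no later PUT can obtain a timestamp inside it.
    Version ReadAt(const Key& k, Timestamp snapshot) {
        clock_ = std::max(clock_, snapshot);
        Version result{"", 0};                      // stands in for "⊥" if nothing qualifies
        for (const Version& v : store_[k])          // versions assumed sorted by timestamp
            if (v.ts <= snapshot) result = v;       // freshest version <= snapshot
        return result;
    }

    Timestamp clock() const { return clock_; }

    Timestamp clock_ = 0;
    std::map<Key, std::vector<Version>> store_;
};

// 1 1/2-round ROT, with the message flow collapsed into direct calls.
std::map<Key, Version> Rot(Partition& coordinator,
                           const std::vector<std::pair<Partition*, Key>>& reads,
                           Timestamp highest_seen_by_client) {
    const Timestamp snapshot = std::max(coordinator.clock(), highest_seen_by_client);
    std::map<Key, Version> result;
    for (const auto& [partition, key] : reads)
        result[key] = partition->ReadAt(key, snapshot);
    return result;
}
```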

We now describe how Contrarian implements geo-replication and retains the nonblocking property in that setting.

Geo-replication. Similarly to Cure [3], Contrarian uses dependency vectors to track causality, and employs a stabilization protocol to determine a cutoff vector CV in a DC (rather than a cutoff timestamp as discussed earlier). Every partition maintains a version vector VV with one entry per DC. VV[m] is the timestamp of the latest version created by the partition, where m is the index of the local DC. VV[i], i ≠ m, is the timestamp of the latest update received from the replica in the i-th DC. A partition sends a heartbeat message with its current clock value to its replicas if it does not process a PUT for a given amount of time.

Periodically, the partitions within DCm exchange their VVs and compute the aggregate minimum vector, called the Global Stable Snapshot (GSS). The GSS represents a lower bound on the snapshot of remote items that have been installed by every node in DCm. The GSS is exchanged between clients and partitions upon each operation, to update their views of the snapshot installed in DCm.

Items track causal dependencies by means of dependency vectors DV, with one entry per DC. If X.DV[i] = t, then X (potentially) causally depends on all the items originally written in DCi with a timestamp up to t. DV[s], where s is the source replica, is the timestamp of X, and it is enforced to be higher than any other entry in DV upon creation of X, to reflect causality. The remote entries of the GSS are used to build the remote entries of the DV of newly created items. X can be made visible to clients in a remote DCr if X.DV is entry-wise smaller than or equal to the GSS on the server that handles x in DCr. This condition implies that all of X's dependencies have already been received in DCr.

The ROT protocol uses a vector SV to encode a snapshot. The local entry of SV is the maximum of the clock at the coordinator and the highest local timestamp seen by the client. The remote entries of SV are given by the maximum of the GSS at the coordinator and the highest GSS seen by the client. An item Y belongs to the snapshot encoded by SV if Y.DV ≤ SV. This protocol is nonblocking because i) partitions can move the value of their local clock forward to match the local entry of SV, and ii) the remote entries of SV correspond to a causally consistent snapshot of remote items that have already been received in the DC.

Freshness of the snapshots. The GSS is computed by means of the minimum operator. Because logical clocks on different nodes may advance at different paces, a single laggard node in one DC can keep entries in the GSS from progressing, thus increasing the staleness of the snapshot. A solution to this problem is to use loosely synchronized physical clocks [25, 26, 3]. However, physical clocks cannot be moved forward to match the timestamp of an incoming ROT, which can jeopardize the nonblocking property [3].

To achieve fresh snapshots and preserve nonblocking ROTs, Contrarian uses Hybrid Logical Physical Clocks (HLCs) [33]. In brief, an HLC is a logical clock that generates timestamps by taking the maximum of the local physical clock and the highest timestamp seen by the node plus one. On the one hand, HLCs behave like logical clocks, so a server can move its clock forward to match the timestamp of an incoming ROT request, thereby preserving the nonblocking behavior of ROTs. On the other hand, HLCs behave like physical clocks, because they advance even in the absence of events and inherit the (loose) synchronicity of the underlying physical clock. Hence, the stabilization protocol identifies fresh snapshots. Importantly, the correctness of Contrarian does not depend on the synchronization of the clocks, and Contrarian preserves its properties even when using plain logical clocks.

Contrarian is not the first CC system that proposes the use of HLCs to generate event timestamps. However, existing systems use HLCs either to avoid blocking PUT operations [54], to reduce replication delays [29], or to improve the clock synchronization among servers [43]. Here, we show how HLCs can be used to implement nonblocking ROTs.
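The two core rules just described, HLC timestamp generation and the entry-wise vector comparison that governs visibility, can be sketched as follows. This is a simplification under our own naming; in particular, real HLCs [33] keep a separate logical component next to the physical part, which we collapse into a single integer here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Timestamp = uint64_t;
using VectorClock = std::vector<Timestamp>;  // one entry per DC

// Hybrid logical clock, simplified: it advances at least as fast as the physical
// clock, but can also jump forward to the highest timestamp seen, which is what
// lets a partition serve a ROT snapshot without blocking.
struct HybridClock {
    Timestamp last = 0;

    Timestamp NewTimestamp(Timestamp physical_now) {
        last = std::max(last + 1, physical_now);
        return last;
    }
    void Observe(Timestamp seen) { last = std::max(last, seen); }
};

// Entry-wise comparison used both for replication and for ROTs:
// - a remote version with dependency vector dv can be made visible once dv <= GSS,
//   i.e., all its potential dependencies have been received in the local DC;
// - a version belongs to the snapshot encoded by SV if its dv is entry-wise <= SV.
bool EntrywiseLeq(const VectorClock& dv, const VectorClock& bound) {
    for (std::size_t i = 0; i < dv.size(); ++i)
        if (dv[i] > bound[i]) return false;
    return true;
}
```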
5. EXPERIMENTAL STUDY

5.1 Summary of the results

Main findings. We show that the resource demands to perform PUT operations in the latency-optimal design are in practice so high that they not only affect the performance of PUTs, but also the performance of ROTs, even with read-heavy workloads. In particular, with the exception of scenarios corresponding to extremely read-heavy workloads and modest loads, where the two designs are comparable, Contrarian achieves ROT latencies that are lower than those of a latency-optimal design. In addition, Contrarian achieves higher throughput for almost all workloads.

Lessons learnt. In light of our experimental findings, we draw three main conclusions.

i) Overall system efficiency is key to both low latency and high throughput. It is fundamental to understand the cost that optimizing an operation imposes on the system, even when the optimized operation dominates the workload.

ii) The high-level theoretical model of a design may not capture the resource utilization dynamics incurred by the design. While a theoretical model represents a powerful lens to compare and qualitatively analyze designs, the choice of a target design for a system should also rely on a more quantitative analysis, e.g., by means of analytical modeling [57].

iii) Ultimately, the optimality of a design is closely related to the target workload, as well as to the target architecture and the available computational resources.

5.2 Experimental environment.

Implementation and optimizations. We implement Contrarian, Cure (see footnote 1) and the COPS-SNOW design in the same C++ code-base. Clients and servers use Protocol Buffers [27] for communication. We call CC-LO the system that implements the design of COPS-SNOW. We improve its performance over the original design by more aggressive eviction of transactions from the old reader record. Specifically, we garbage-collect a ROT id after 500 msec from its insertion in the readers list of a key (vs the 5 seconds of the original implementation), and we enforce that each readers-check message response contains at most one ROT id per client, i.e., the one corresponding to the most recent ROT of that client. These two optimizations reduce by one order of magnitude the amount of ROT ids exchanged, leading it to approach the lower bound we describe in Section 6. We use NTP [49] to synchronize clocks in Contrarian and Cure, and the stabilization protocol is run every 5 msec.

Footnote 1: Cure supports an API that is different from Contrarian's [3]. We modify Cure to comply with the model described in Section 2.

Parameter                  | Definition                                | Values            | Motivation
Write/read ratio (w)       | #PUTs/(#PUTs + #individual reads)         | 0.01              | Extremely read-heavy workload
                           |                                           | 0.05 (default)    | Default read-heavy parameter in YCSB [20]
                           |                                           | 0.1               | Default parameter in COPS-SNOW [39]
Size of a ROT (p)          | # partitions involved in a ROT            | 4 (default), 8, 24 | Application operations span multiple partitions [47]
Size of values (b)         | Value size (in bytes). Keys take 8 bytes. | 8 (default)       | Representative of many production workloads [6, 47, 53]
                           |                                           | 128               | Default parameter in COPS-SNOW [39]
                           |                                           | 2048              | Representative of workloads with large items
Skew in key popularity (z) | Parameter of the zipfian distribution     | 0.99 (default)    | Strong skew typical of many production workloads [6, 14]
                           |                                           | 0.8               | Moderate skew and default in COPS-SNOW [39]
                           |                                           | 0                 | No skew (uniform distribution) [14]

Table 1: Workload parameters considered in the evaluation. The default values are marked as such.

Platform. We use 64 machines equipped with 2x4 AMD Opteron 6212 (16 hardware threads) and 130 GB of RAM, running Ubuntu 16.04 with a 4.4.0-89-generic kernel. We consider a data set sharded across 32 partitions. Each partition is assigned to a different server. We consider a single-DC scenario and a replicated scenario with two replicas. Machines communicate over a 10 Gbps network.

Using only two replicas is a favorable condition for CC-LO: since the readers check has to be performed also for replicated updates in the remote DCs, the corresponding overhead grows linearly with the number of DCs. We also note that the overheads of the designs we consider are largely unaffected by the communication latency between replicas, because update replication is asynchronous and happens in the background. Thus, emulating a multi-DC scenario over a local area network suffices to capture the most relevant performance dynamics that depend on (geo-)replication [39].

Methodology. Experiments run for 90 seconds, and clients issue operations in closed loop. We generate different loads for the system by spawning different numbers of client threads (starting from one thread per client machine). We have run each experiment up to 5 times, with minimal variations between runs, and we report the median result.

Workloads. Table 1 summarizes the workload parameters we consider. We use read-heavy workloads, in which clients issue ROTs and PUTs according to a given w/r ratio (w), defined as #PUTs/(#PUTs + #READs). A ROT reading k keys counts as k READs. ROTs span a target number of partitions (p), chosen uniformly at random, and read one key per partition. Keys in a partition are chosen according to a zipfian distribution with a given parameter (z). Every partition stores 1M keys, and items have a constant size (b).

The default workload we consider uses w = 0.05, i.e., the default value for the read-heavy workload in YCSB [20]; z = 0.99, which is representative of skewed workloads [6]; p = 4, which corresponds to small ROTs (which exacerbate the extra communication cost in Contrarian); and b = 8, as many production workloads are dominated by tiny items [6].

Performance metrics. We focus our study on the latencies of ROTs because, by design, CC-LO favors ROT latencies over PUT latencies. As an aside, in our experiments CC-LO incurs up to one order of magnitude higher PUT latencies than Contrarian. For space constraints, we focus on average latencies. We report the 99-th percentile of latencies for a subset of the experiments. We measure the throughput of the systems as the number of PUTs and ROTs per second.

5.3 Contrarian design

We first evaluate the design of Contrarian, by assessing its improvement over Cure, and by analyzing the behavior of the system when implementing ROTs in 1 1/2 or 2 rounds of communication. Figure 4 compares the three designs given the default workload in 2 DCs.

Figure 4: Evaluation of Contrarian's design (2-DC, default workload). Throughput vs average ROT latency (y axis in log scale). Contrarian achieves lower latencies than Cure by means of nonblocking ROTs. Using 1 1/2 rounds of communication reduces latency at low load, but it leads to exchanging more messages than using 2 rounds, and hence to a lower maximum throughput (Section 4).

Contrarian achieves lower latencies than Cure, up to a factor of ≈ 3x (0.35 vs 1.0 msec), by implementing nonblocking ROTs. In Cure, the latency of ROTs is affected by clock skew. At low load, the 1 1/2-round version of Contrarian completes ROTs in 0.35 msec vs the 0.45 msec of the 2-round version. The two variants achieve comparable latencies at medium/high load (from 150 to 350 Kops/s). The 2-round version achieves a higher throughput than the 1 1/2-round version (by 8% in this case) because it is more resource efficient, requiring fewer messages to run ROTs. Because we focus on latency more than throughput, hereafter we report results corresponding to the 1 1/2-round version of Contrarian.

5.4 Default workload.

Figure 5 reports the performance of Contrarian and CC-LO with the default workload, in the 1-DC and 2-DC scenarios. Figure 5(a) reports average latencies, and Figure 5(b) reports 99-th percentiles. Figure 6 reports information on the readers check overhead in CC-LO in the single-DC case.

(a) Throughput vs Avg. ROT latency. (b) Throughput vs 99-th percentile of ROT latencies.

Figure 5: ROT latencies (average and 99-th percentile) in Contrarian and CC-LO as a function of the throughput (default workload). The resource contention induced by the extra overhead posed by PUTs in CC-LO affects especially tail latency.

Latency. Figure 5(a) shows that Contrarian achieves higher latencies than CC-LO only at very moderate load. Under trivial load conditions ROTs in CC-LO take 0.3 msec on average vs the 0.35 msec of Contrarian. For throughputs higher than 60 Kops/s in the 1-DC case and 120 Kops/s in the 2-DC case, Contrarian achieves lower latencies than CC-LO. These load conditions correspond to roughly 25% of Contrarian's peak throughput. That is, CC-LO achieves slightly better latencies than Contrarian only for load conditions under which the available resources are severely under-utilized.

CC-LO achieves worse latencies than Contrarian for non-trivial load conditions because of the overhead caused by the readers check, needed to achieve latency optimality. This overhead induces higher resource utilization, and hence higher contention on physical resources. Ultimately, this leads to higher latencies, even for ROTs.

Tail latency. The effect of contention on physical resources is especially visible at the tail of the ROT latency distribution, as shown in Figure 5(b). CC-LO achieves lower 99-th percentile latencies only at the lowest load condition (0.35 vs 0.45 msec).

Throughput. Contrarian consistently achieves a higher throughput than CC-LO. Contrarian's maximum throughput is 1.45x CC-LO's in the 1-DC case, and 1.6x in the 2-DC case. In addition, Contrarian achieves a 1.9x throughput improvement when scaling from 1 to 2 DCs. By contrast, CC-LO improves its throughput only by 1.6x. This result is due to the higher replication costs in CC-LO, which has to communicate the dependency list of a replicated update and perform the readers check in the remote DC.

Overhead analysis. To provide a sense of the overhead of the readers check, we present some data collected on the single-DC platform at the load value at which CC-LO achieves its peak throughput (corresponding to 256 client threads). A readers check targets on average 20 keys, causing the checking partition to contact on average 12 other partitions. A readers check collects on average 252 distinct ROT ids, which almost matches the number of clients for this experiment. However, the same ROT id can appear in the readers set of multiple keys that have to be checked at different partitions. This increases the cumulative number of ROT ids exchanged during the readers-check phase, to on average 855 ROT ids for each readers check (71 per contacted node), corresponding roughly to 7 KB of data (using 8 bytes per ROT id). Figure 6 shows that the average overhead of a readers check grows linearly with the number of clients in the system. This result matches our theoretical analysis (see Section 6) and highlights the inherent scalability limitations of latency-optimal ROTs.

Figure 6: ROT ids collected on average during a readers check in CC-LO (1-DC, default workload). The amount of information exchanged grows linearly with the number of clients, matching the bound stated in Section 6. The average number of servers contacted during a readers check is 12.

5.5 Effect of write intensity.

Figure 7 shows how the performance of the systems is affected by varying the write intensity of the workload.

Latency. Similarly to what is seen previously, for non-trivial load conditions Contrarian achieves lower ROT latencies than CC-LO in both the 1-DC and 2-DC scenarios and with almost all of the write intensity parameters. The only exception occurs in the case of the lowest write intensity, and even in this case the differences remain small, especially for the replicated environment. For w = 0.01 in the single-DC case (Figure 7a), at the lowest load CC-LO achieves an average ROT latency of 0.3 msec vs 0.35 msec of Contrarian; at high load (200 Kops/s), ROTs in CC-LO complete in 1.11 msec vs 1.33 msec in Contrarian. In the 2-DC deployment, however, the latencies achieved by the two systems are practically the same, except for trivial load conditions (Figure 7b). This change in the relative performance of the two systems is due to the higher replication cost of CC-LO during the readers check, which has to be performed for each update in each DC.

Throughput. Contrarian achieves a higher throughput than CC-LO in almost all scenarios (up to 2.35x for w = 0.1 and 2 DCs). The only exception is the w = 0.01 case in the single-DC deployment (where CC-LO achieves a throughput that is 10% higher). The throughput of Contrarian grows with the write intensity, because PUTs only touch one partition and are faster than ROTs. Instead, higher write intensities hinder the performance of CC-LO, because they cause more frequent execution of the expensive readers check.

Overhead analysis. Surprisingly, the latency benefits of CC-LO are not very pronounced, even at the lowest write intensities. This is due to the inherent tension between the frequency of writes and their costs. A low write intensity leads to a low frequency at which readers checks are performed. However, it also means that every write is dependent on many reads, resulting in more costly readers checks.

(a) Throughput vs Avg. ROT latency (1 DC). (b) Throughput vs Avg. ROT latency (2 DCs).

Figure 7: Performance with different w/r ratios. Contrarian achieves lower ROT latencies than CC-LO, except at very moderate load and for the most read-heavy workload. Contrarian also consistently achieves higher throughput. Higher write intensities hinder the performance of CC-LO because the readers check is triggered more frequently.

5.6 Effect of skew in data popularity.

Figure 8 depicts how performance varies with the skew in data popularity, in the single-DC platform. We focus on this deployment scenario to factor out the replication dynamics of CC-LO and focus on the inherent costs of latency optimality.

Latency. Similarly to the previous cases, Contrarian achieves ROT latencies that are lower than CC-LO's for non-trivial load conditions (> 70 Kops/s, i.e., 30% of Contrarian's maximum throughput).

Throughput. The data popularity skew does not sensibly affect Contrarian, whereas it hampers the throughput of CC-LO. The performance of CC-LO degrades because a higher skew causes longer causal dependency chains among operations [11, 26], leading to a higher overhead incurred by the readers checks.

Overhead analysis. With low skew, a key x is infrequently accessed, so it is likely that many entries in the readers of x can be garbage-collected by the time x is involved in a readers check. With higher skew levels, a few hot keys are accessed most of the time, which leads to old reader records with many fresh entries. High skew also leads to more duplicates in the ROT ids retrieved from different partitions, because the same ROT id is likely to be present in many old reader records. Our experimental results (not reported for space constraints) confirm this analysis. They also show that, at any skew level, the number of ROT ids exchanged during a readers check grows linearly with the number of clients (which matches our later theoretical analysis).

5.7 Effect of size of transactions.

Figure 9 shows the performance of the systems while varying the number of partitions involved in a ROT. We again report results corresponding to the single-DC platform.

Latency. Contrarian achieves ROT latencies that are lower than or comparable to CC-LO's for any number of partitions involved in a ROT. The latency benefits of CC-LO over Contrarian at low load decrease as p grows, because contacting more partitions amortizes the cost of the extra communication needed by Contrarian to execute a ROT.

Throughput. Contrarian achieves higher throughput than CC-LO (up to 1.45x higher, with p = 4) for any value of p. The throughput gap between the two systems shrinks with p, because of the extra messages that are sent in Contrarian from the coordinator to the other partitions involved in a ROT. The fact that only one key per partition is read in our experiment is an adversarial setting for Contrarian, because it exacerbates the cost of the extra communication hop used to implement ROTs. Such communication cost would be amortized if ROTs read multiple items per partition. Contrarian can be configured to resort to the 2-round ROT implementation when contacting a large number of partitions, to increase resource efficiency. We are currently testing this optimization.


Figure 8: Effect of the skew in data popularity (single-DC). Skew hampers the performance of CC-LO, because it leads to long causal dependency chains among operations and thus to much information exchanged during the readers check.

Figure 9: Effect of ROT sizes (single-DC). The latency advantage of CC-LO at low load decreases as p grows, because contacting more partitions amortizes the cost of the extra communication needed by Contrarian to execute a ROT.

5.8 Effect of size of values.

Larger items naturally result in higher CPU and network costs for marshalling, unmarshalling and transmission operations. As a result, the performance gap between the systems shrinks as the size of the item values increases. Even in the case corresponding to large items, however, Contrarian achieves ROT latencies lower than or comparable to the ones achieved by CC-LO, and a 43% higher throughput (in the single-DC scenario). We omit plots and additional details for space constraints.

6. THEORETICAL RESULTS

Our experimental study shows that the state-of-the-art CC design for LO ROTs delivers sub-optimal performance, caused by the overhead (imposed on PUTs) for dealing with old readers. One can, however, conceive of alternative implementations. For instance, rather than storing old readers with the data items in the partitions, one could contemplate an implementation which stores old readers at the client which does a PUT, and forwards this piece of information to other partitions when doing subsequent PUTs. Albeit in a different manner, this implementation still communicates the old readers to the partition where a PUT is performed. One may then wonder: is there an implementation that avoids this overhead altogether, in order not to exhibit the performance issues we have seen with CC-LO in Section 5?

We now address this question. We show that the extra overhead on PUTs is inherent to LO (Theorem 1). Furthermore, we show that the extra overhead grows with the number of clients, implying growth with the number of ROTs and echoing the measurement results we have reported in Section 5. Our proof is by contradiction and consists of three steps. First, we construct a set E of at least two executions in each of which different clients issue the same ROT on keys x, y and then causally related PUTs on x, y occur. Our assumption for contradiction is as follows: in our construction, although different clients issue the same ROT, the communication between servers remains the same. (In other words, roughly speaking, servers do not notice all clients that issue the ROT.) Then, based on our assumption, we are able to construct another execution E∗ in which some clients issue the ROT while causally related PUTs on x, y occur. Finally, still based on our assumption, we show that in E∗, although the ROT is in parallel with the causally related PUTs, no server is able to tell so, and the ROT then returns a causally inconsistent snapshot. This completes our proof: (roughly speaking) servers must communicate all clients that issue a ROT, and the worst-case communication is then linear in the number of clients.

Our theorem applies to the system model described in Section 2. Below we start with an elaboration of our system model (Section 6.1) and the definition of LO (Section 6.2). Then we present and prove our theorem (Section 6.3).

6.1 System Model

For ease of definitions (as well as proofs), we assume the existence of an accurate real-time clock to which no partition or client has access. When we mention time, we refer to this clock. Furthermore, when we say that two client operations are concurrent, we mean that the durations of the two operations overlap according to this clock.

Among other things, this clock allows us to give a precise definition of eventual visibility. If PUT(x, X) starts at time T (and eventually ends), then there exists a finite time τX ≥ T such that any ROT that reads x and is issued at time t ≥ τX returns either X or some X′ of which PUT(x, X′) starts no earlier than T; we say X is visible since τX.

We assume the same APIs as described in Section 2.1. Clients and partitions exchange messages whose delays are finite, but can be unbounded. Clients and partitions can use their local clocks; however, clock drift can be arbitrarily large and infinite (so for some time moment T, some clock may never reach T). To capture the design of CC-LO, we also assume that an idle client sends no message to any partition; when performing an operation on some keys, a client sends messages only to the partitions which store values for these keys; a partition sends messages to client c only when responding to some operation issued by c; and clients do not communicate with each other. For simplicity, we consider any client to issue a new operation only after its previous operation returns. We assume at least two partitions and a potentially growing number of clients.

6.2 Properties of LO ROTs

We adopt the definition of LO ROTs from [39], which refers to three properties: one-round, one-version, and nonblocking. The one-round property states that for every client c's ROT α, c sends one message to and receives one message from each partition involved in α. The nonblocking property states that for any partition p to which c sends a message, p eventually sends one message (the one defined in the one-round property) to c, even if p receives no message from a server during α. A formal definition of the one-version property is more involved. Basically, for every client c's ROT α, we consider the maximum amount of information that may be calculated by any implementation algorithm of c based on the messages which c receives during α (see footnote 2). The one-version property specifies that given the messages which c receives from any (non-empty) subset Par of partitions during α, the maximum amount of information contains only one version per key for the keys stored in Par.

Footnote 2: We consider the amount of information instead of the plaintext, as values can be encoded in different ways. For example, if a message contains X1 and X1 ⊕ X2 for two values X1, X2 of the same key, then in the plaintext there is only one version, yet some implementation can calculate two versions from the plaintext. The definition of the one-version property excludes such a message as well as such an implementation.

6.3 The cost of LO

We say a PUT operation α completes if i) α returns to the client that issued α; and ii) the value written by α becomes visible. Our theoretical result (Theorem 1) highlights that the cost of LO may occur before any dangerous PUT completes. (We say a PUT operation α is dangerous if α causally depends on some PUT that causally depends on, and overwrites, a non-⊥ value.)

Theorem 1 (Cost of LO ROTs). Achieving LO ROTs requires communication, potentially growing linearly with the number of clients, before every dangerous PUT completes.

Figure 10: Two (in)distinguishable executions in the proof of Theorem 1. (a) Execution E2. (b) Execution E∗ (with r_x^2 omitted).

The intuition behind the cost of LO is as follows. A (dangerous) PUT operation, PUT(y, Y1), eventually completes; however, due to the asynchronous network, a request resulting from a ROT operation α which reads keys x, y may arrive after PUT(y, Y1) completes, regardless of the other request(s) resulting from α. Suppose that α has returned value X0 with respect to value X1 such that X0 ⤳ X1 ⤳ Y1; then α can be at risk of breaking causal consistency. As a result, the partition which provides X0 should notify others of the risk, and hence the communication.

Inspired by this intuition, we assume that keys x and y belong to different partitions px and py, respectively. We call client c an old reader of x, with respect to PUT(y, Y1) (see footnote 3), if c issues a ROT operation which (1) is concurrent with PUT(x, X1) and PUT(y, Y1) and (2) returns X0. In general, if c issues a ROT operation that reads x, then we say c is a reader of x. Thus the risk lies in the fact that, due to the asynchronous network, any reader can potentially be an old reader.

Footnote 3: The definition of an old reader of x here specifies a certain PUT on y and is thus more specific than the definition in CC-LO, an old reader of x in general. The reason to specify a certain PUT is to emphasize the causal relation X1 ⤳ Y1. The proof hereafter takes the more specific definition when mentioning old readers.

To have X0 ⤳ X1 ⤳ Y1, for simplicity, we consider a scenario where some client cw does four PUT operations in the following order: PUT(x, X0), PUT(y, Y0), PUT(x, X1) and PUT(y, Y1), and cw issues each PUT (except for the first one) after the previous PUT completes. To prove Theorem 1, we consider the worst case: all clients except cw can be readers. We identify similar executions where a different subset of clients are readers. Let D be the set of all clients except cw. We construct the set E such that each execution has one subset of D as readers. Hence E contains 2^|D| executions in total. We later show one execution in E in which the communication carrying readers grows linearly with |D|, and thus prove Theorem 1.

2^|D| executions E. Each execution E ∈ E is based on a subset R of D as readers. Every client c in R issues ROT({x, y}) at the same time t1. By the one-round property, c sends two messages, mx,req and my,req, to px and py respectively at t1. We denote the event that px receives mx,req by rx, and the event that py receives my,req by ry. By the nonblocking property, px and py can be considered to receive messages from c and send messages to c at the same time t2 (see footnote 4). Finally, c receives messages from px and py at the same time t3. We order events as follows: X0 and Y0 are visible, t1, rx = ry = t2, PUT(x, X1) is issued, t3, PUT(y, Y1) is issued. Let τY1 be the time when PUT(y, Y1) completes. For every execution in E, t1, t2, t3 take the same values, while τY1 actually denotes the maximum value.

Footnote 4: Clearly, px and py may receive messages at different times, and the proof still holds. The same time t2 is assumed for simplicity of presentation.

To emphasize the burden on py, we consider communication that precedes a message that py receives: we say message a precedes message b if (1) some process p sends b after p receives a, or (2) ∃ message c such that a precedes c and c precedes b. Clearly, the executions in E are the same until time t1. From t1 on, these executions, and especially the communication between px and py, may change. We construct all executions in E altogether: if at some time point, in one execution, some server sends a message, then we construct all other executions such that the same server sends the same message, except when the server is px, py or contaminated by px or py. By contamination, we mean that at some point, px or py sends a message m but we are unable to construct all other executions to do the same; then the message m and the server s which receives m are contaminated, and s can further contaminate other servers. In our construction, we focus on the non-contaminated messages, which are received at the same time across all executions in E. For other messages, if in at least two executions the same contaminated message m can be sent, then we let m be received at the same time across these executions; otherwise, we do not restrict the schedule.

We show that the worst-case execution exists in our construction of E. To do so, we first show a property of E: for any two executions E1, E2 in E (with different readers), the communication of px and py must be different, as formalized in Lemma 1 (see footnote 5).

Lemma 1 (Different readers, different messages). Consider any two executions E1, E2 ∈ E. In Ei, i ∈ {1, 2}, denote by Mi the messages which px or py sends to a process other than D and which precede some message that py receives during [t1, τY1] in Ei, and denote by stri the concatenation of the messages in Mi, ordered by the time at which each message is sent. Then str1 ≠ str2.

Footnote 5: Lemma 1 abstracts the ways px and py may communicate so that it is independent of particular implementations, and covers the following example implementations of communication for old readers: the one in CC-LO, the example introduced at the beginning of this section, as well as the following: py keeps asking px whether a reader of Y0 is a reader of X0, to determine whether all readers of X0 have arrived at py (so that there is no old reader with respect to Y1).

The main intuition behind the proof is that if communication were the same regardless of readers, py would be unable to distinguish readers from old readers. Suppose now by contradiction that str1 = str2. Then our construction of E allows us to construct a special execution E∗ based on E2 (as well as E1). Let the subset of D for Ei be Ri, for i ∈ {1, 2}. W.l.o.g., R1\R2 ≠ ∅. We construct E∗ such that clients in R1\R2 are old readers (and show that E∗ breaks causal consistency due to old readers).

Execution E∗ with old readers. In E∗, both R1 and R2 issue ROT({x, y}) at t1. To distinguish between events (as well as messages) resulting from R1 and R2, we use superscripts 1 and 2 to denote the events, respectively. For simplicity of notation, in E2 we also call the two events at the server side (in which px and py receive messages from R2, respectively) r_x^2 and r_y^2, as illustrated in Figure 10a. In E∗, we now have four events at the server side: r_x^1, r_y^1, r_x^2, r_y^2. We construct E∗ based on E2 by scheduling r_x^1 and r_y^2 in E∗ at t2 (the same time as r_x^2 and r_y^2 in E2), and postponing r_y^1 (as well as r_x^2), as illustrated in Figure 10b. The ordering of events in E∗ is thus different from E2. More specifically, the order is: X0 and Y0 are visible, t1, r_x^1 = r_y^2 = t2, PUT(x, X1) is issued, PUT(y, Y1) is issued, τY1, r_y^1 (for every client in R1\R2, as r_y^2 has occurred), r_x^2 (for every client in R2\R1, not shown in Figure 10b), R1\R2 returns ROT. By asynchrony, the order is legitimate, which results in old readers R1\R2.

Proof of Lemma 1. Our proof is by contradiction. As str1 = str2, according to our construction, py does not receive any message preceded by some different contaminated message in E1 and E2. Therefore, even if we replace r_x^2 in E2 by r_x^1 in E∗ (as in E1), then by τY1, py is unable to distinguish between E2 and E∗.

So far, our construction of E2 goes until τY1. Let us now extend E2 so that E2 and E∗ are the same after τY1. Namely, in E2, after τY1, every client c1 ∈ R1\R2 issues ROT({x, y}); and as illustrated in Figure 10, r_y^1 is scheduled at the same time in E2 and in E∗. Let ~v be the return value of c1's ROT in either execution.

By eventual visibility, in E2, vy = Y1. We now examine E∗. By eventual visibility, as t1 is after X0 and Y0 are visible, vx, vy ≠ ⊥. As r_x^1 is before PUT(x, X1) is issued, vx ≠ X1. By py's indistinguishability between E2 and E∗, and according to the one-version property, vy = Y1 as in E2. Thus in E∗, vx = X0 and vy = Y1, a snapshot that is not causally consistent. A contradiction.

Lemma 1 demonstrates a property of any two executions in E, which implies another property of E: if for any two executions the communication has to be different, then over all executions the number of possibilities of what is communicated grows with the number of elements in E. Recall that |E| is a function of |D|. Hence, we connect the communication and |D| in Lemma 2.

Lemma 2 (Lower bound on the cost). Before PUT(y, Y1) completes, in at least one execution in E, the communication of px and py takes at least L(|D|) bits, where L is a linear function.

Proof of Lemma 2. We index each execution E by the set R of clients which issue ROT({x, y}) at time t1. We therefore have 2^|D| executions: E = {E(R) | R ⊆ D}. Let b(R) be the messages which px and py send in E(R), as defined in Lemma 1, and let B = {b(R) | R ⊆ D}. By Lemma 1, we can show that ∀b1, b2 ∈ B, b1 ≠ b2. Then |B| = |E| = 2^|D|. Therefore, it is impossible that every element in B has fewer than |D| bits. In other words, in E we have at least one execution E = E(R) where b(R) takes at least log2(2^|D|) = |D| bits, a linear function in |D|.

Recall that |D| is a variable that grows linearly with the number of clients. Thus, following Lemma 2, we find that E contains a worst-case execution that supports Theorem 1, and we thus complete the proof of Theorem 1.

Remark on implementations. The proof shows the necessary communication of readers when each client issues one operation. Here we want to make the link back to the implementation of LO ROTs in CC-LO. The reader may wonder in particular about the relationship between the transaction identifiers that are sent as old readers in CC-LO and the worst-case communication linear in the number of clients derived in the theorem. In fact, the CC-LO implementation considers that clients may issue multiple transactions at the same time, and then different ROTs of a single client should be considered as different readers, hence the use of transaction identifiers to distinguish one from another.

A final comment is on a straw-man implementation where each operation is attached to the output of a Lamport clock [35] (called logical time below) alone. Such an implementation (without communication of potentially old readers) still fails. The problem is that the number of increments in logical time after ROTs is at most the number of all ROTs, i.e., |D|. Then for some E1 and E2, Lemma 1 does not hold, i.e., the communication is the same. Although, when issuing the ROT, a client c in R1\R2 can send logical time to servers, the logical time sent in E2 and E∗ is the same and thus does not help py to distinguish between E2 and E∗, resulting in the violation of causal consistency again. Hence communication of readers, as Theorem 1 indicates, is still required for this straw-man implementation.
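To recap the counting step in the proof of Lemma 2 as a single chain of (in)equalities (the symbols are those of the proof above; we write the set of executions as $\mathcal{E}$):

```latex
\forall b_1, b_2 \in B:\; b_1 \neq b_2
\;\Longrightarrow\;
|B| = |\mathcal{E}| = 2^{|D|}
\;\Longrightarrow\;
\max_{R \subseteq D} \big|\, b(R) \,\big| \;\ge\; \log_2 |B| \;=\; |D| \text{ bits.}
```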
∗ logical time sent in E2 and E is the same and thus does not Previously, our construction of E2 is until τY1 . Let us now ∗ ∗ help py to distinguish between E2 and E , resulting in the extend E2 so that E2 and E are the same after τY . Namely, 1 violation of causal consistency again. Hence communication in E2, after τY1 , every client c1 ∈ R1\R2 issues ROT({x, y}); 1 of readers, as Theorem 1 indicates, is still required for this and as illustrated in Figure 10, ry is scheduled at the same ∗ straw-man implementation. time in E2 and in E . Let ~v be the return value of c1’s ROT in either execution. By eventual visibility, in E2, vy = Y1. We now examine 7. RELATED WORK ∗ E . By eventual visibility, as t1 is after X0 and Y0 are 1 Causally consistent systems. Table 2 classifies existing visible, vx, vy 6= ⊥. As rx is before PUT(x, X1) is issued, ∗ systems with ROT support according to the cost of perform- vx 6= X1. By py’s indistinguishability between E2 and E , ing ROT and PUT operations. COPS-SNOW is the only and according to the one-version property, vy = Y1 as in E2. ∗ latency-optimal system. COPS-SNOW achieves latency op- Thus in E , vx = X0 and vy = Y1, a snapshot that is not timality at the expense of more costly writes, which carry causally consistent. A contradiction. detailed dependency information and incur extra communi- Lemma 1 demonstrates a property for any two executions cation overhead. Previous systems fail to achieve at least in E, which implies another property of E: if for any two one of the sub-properties of latency optimality. executions, communication has to be different, then for all ROTs in COPS and Eiger might require two rounds of executions, the number of possibilities of what is communi- client-server communication to complete. The second round cated grows with the number of elements in E. Recall that is needed if the client reads, in the first round, two items |E| is a function of |D|. Hence, we connect the communica- that might belong to different causally consistent snapshots. tion and |D| in Lemma 2. COPS and Eiger rely on fine-grained protocols to track and check the dependencies of replicated updates (see Section 3), Lemma 2 (Lower bound on the cost). Before PUT(y, Y1) which have been shown to limit their scalability [25, 26, 3]. completes, in at least one execution in E, the communication ChainReaction uses a potentially-blocking and potentially of px and py takes at least L(|D|) bits where L is a linear multi-round protocol based on a per-DC sequencer node. function. Orbe, GentleRain, Cure and POCC use a coordinator- based approach similar to what described in Section 3, and Proof of Lemma 2. We index each execution E by the set require two communications rounds. These systems use R of clients which issue ROT({x, y}) at time t1. We have physical clocks and may block ROTs either because of clock therefore 2|D| executions: E = {E(R)|R ⊆ D}. Let b(R) skew or to wait for the receipt of remote updates. be the messages which px and py send in E(R) as defined Occult uses a primary-replica approach and use HLCs to in Lemma 1, and let B = {b(R)|R ⊆ D}. By Lemma 1, avoid blocking due to clock skew. Occult implements ROTs
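Before turning to related work, the counting step at the core of the proof of Lemma 2 can be summarized as follows; this restates the argument above in the notation of the proof and adds no new assumptions:

\[
|B| \;=\; |\mathcal{E}| \;=\; \bigl|\{\,E(R) : R \subseteq D\,\}\bigr| \;=\; 2^{|D|},
\qquad
\max_{R \subseteq D} |b(R)| \;\ge\; \log_2 |B| \;=\; \log_2 2^{|D|} \;=\; |D| \ \text{bits},
\]

so the worst-case communication between px and py grows linearly with |D|, and hence with the number of clients.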

ROT latency optimality: Nonblocking, #Rounds, #Versions. Write cost: Communication (c↔s, s↔s) and Meta-data (c↔s, s↔s).

System              Nonblocking   #Rounds        #Versions   Comm. c↔s / s↔s   Meta-data c↔s / s↔s   Clock
COPS [37]           yes           ≤ 2            ≤ 2         1 / -             |deps| / -            Logical
Eiger [38]          yes           ≤ 2            ≤ 2         1 / -             |deps| / -            Logical
ChainReaction [4]   no            ≥ 2            1           1 / ≥ 1           |deps| / M            Logical
Orbe [25]           no            2              1           1 / -             N×M / -               Logical
GentleRain [26]     no            2              1           1 / -             1 / -                 Physical
Cure [3]            no            2              1           1 / -             M / -                 Physical
Occult† [43]        yes           ≥ 1            ≥ 1         1 / -             O(P) / -              Hybrid
POCC [56]           no            2              1           1 / -             M / -                 Physical
COPS-SNOW [39]      yes           1              1           1 / O(N)          |deps| / O(K)         Logical
Contrarian          yes           1 1/2 (or 2)   1           1 / -             M / -                 Hybrid

Table 2: Characterization of CC systems with ROT support, in a geo-replicated setting. N, M and K represent, respectively, the number of partitions, DCs, and clients in a DC. † indicates a single-master system, and P represents the number of DCs that act as master for at least one partition. c ↔ s (resp. s ↔ s) indicates client-server (resp. inter-server) communication.

7. RELATED WORK

Causally consistent systems. Table 2 classifies existing systems with ROT support according to the cost of performing ROT and PUT operations. COPS-SNOW is the only latency-optimal system. COPS-SNOW achieves latency optimality at the expense of more costly writes, which carry detailed dependency information and incur extra communication overhead. Previous systems fail to achieve at least one of the sub-properties of latency optimality.

ROTs in COPS and Eiger might require two rounds of client-server communication to complete. The second round is needed if the client reads, in the first round, two items that might belong to different causally consistent snapshots. COPS and Eiger rely on fine-grained protocols to track and check the dependencies of replicated updates (see Section 3), which have been shown to limit their scalability [25, 26, 3].

ChainReaction uses a potentially blocking and potentially multi-round protocol based on a per-DC sequencer node. Orbe, GentleRain, Cure and POCC use a coordinator-based approach similar to the one described in Section 3, and require two communication rounds. These systems use physical clocks and may block ROTs either because of clock skew or to wait for the receipt of remote updates.

Occult uses a primary-replica approach and uses HLCs to avoid blocking due to clock skew. Occult implements ROTs that run in potentially more than one round and that potentially span multiple DCs (which makes the system not always-available). Occult requires at least one dependency timestamp for each DC that hosts a master replica.

Unlike these systems, Contrarian leverages HLCs to implement ROTs that are always-available, nonblocking, and always complete in 1 1/2 (or 2) rounds of communication.

Other CC systems include SwiftCloud [64], Bolt-On [11], Saturn [18], Bayou [51, 59], PRACTI [15], ISIS [17], lazy replication [34], causal memory [2], EunomiaKV [29] and CausalSpartan [54]. These systems either do not support ROTs, or target a different model from the one considered in this paper, e.g., they do not shard the data set into partitions. Our theoretical results require at least two partitions. Investigating the cost of LO in other system models is an avenue for future work.

CC is also implemented by systems that support different consistency levels [22], implement strong consistency on top of CC [12], and combine different consistency levels depending on the semantics of operations [36, 13] or on target SLAs [5, 58]. Our theorem provides a lower bound on the overhead of latency-optimal ROTs with CC. Hence, any system that implements CC or a strictly stronger consistency level cannot avoid such overhead. We are investigating how the lower bound on this overhead varies depending on the consistency level, and what its effect on performance is.

Theoretical results on causal consistency. Causality was introduced by Lamport [35]. Hutto and Ahamad [31] provided the first definition of causal consistency, later revisited from different angles [45, 1, 22, 62]. Mahajan et al. have proved that real-time CC is the strongest consistency level that can be obtained in an always-available and one-way convergent system [41]. Attiya et al. have introduced the observable CC model and have shown that it is the strongest that can be achieved by an eventually consistent data store implementing multi-value registers [7].

The SNOW theorem [39] shows that LO can be achieved by any system that i) is not strictly serializable [50] or ii) does not support write transactions. Based on this result, the SNOW paper suggests that any protocol that matches one of these two conditions can be improved to be latency-optimal. The SNOW paper indicates that a way to achieve this is to shift the overhead from ROTs to writes. In this paper, we prove that achieving latency optimality in CC implies an extra cost on writes, which is inherent and significant.

Bailis et al. study the overhead of replication and dependency tracking in geo-replicated CC systems [10]. By contrast, we investigate the inherent cost of latency-optimal CC designs, i.e., even in the absence of (geo-)replication.

8. CONCLUSION

Causally consistent read-only transactions are an attractive primitive for large-scale systems, as they eliminate a number of anomalies and facilitate the task of developers. Furthermore, given that most applications are expected to be read-dominated, low latency of read-only transactions is of paramount importance to overall system performance. It would therefore appear that latency-optimal read-only transactions, which provide a nonblocking, single-version and single-round implementation, are particularly appealing. The catch is that these latency-optimal protocols impose an overhead on writes that is so high that it jeopardizes performance, even in read-heavy workloads.

In this paper, we present an "almost latency-optimal" protocol that maintains the nonblocking and one-version aspects of its latency-optimal counterparts, but sacrifices the one-round property and instead runs in one and a half rounds. On the plus side, however, this protocol avoids the entire overhead that latency-optimal protocols impose on writes. As a result, measurements show that this "almost latency-optimal" protocol outperforms latency-optimal protocols, not only in terms of throughput, but also in terms of latency, for all but the lowest loads and the most read-heavy workloads.

In addition, we show that the overhead of the latency-optimal protocol is inherent. In other words, it is not an artifact of current implementations. In particular, we show that this overhead grows linearly with the number of clients.

9. REFERENCES

[1] Adya, A. Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions. PhD thesis, Cambridge, MA, USA, 1999. AAI0800775.
[2] Ahamad, M., Neiger, G., Burns, J. E., Kohli, P., and Hutto, P. W. Causal memory: Definitions, implementation, and programming. Distributed Computing 9, 1 (1995), 37–49.

[3] Akkoorath, D. D., Tomsic, A. Z., Bravo, M., Li, Z., Crain, T., Bieniusa, A., Preguiça, N., and Shapiro, M. Cure: Strong semantics meets high availability and low latency. In 2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS) (June 2016), pp. 405–414.
[4] Almeida, S., Leitão, J., and Rodrigues, L. ChainReaction: A causal+ consistent datastore based on chain replication. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys '13) (2013), pp. 85–98.
[5] Ardekani, M. S., and Terry, D. B. A self-configurable geo-replicated cloud storage system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (2014), pp. 367–381.
[6] Atikoglu, B., Xu, Y., Frachtenberg, E., Jiang, S., and Paleczny, M. Workload analysis of a large-scale key-value store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '12) (2012), pp. 53–64.
[7] Attiya, H., Ellen, F., and Morrison, A. Limitations of highly-available eventually-consistent data stores. In Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing (PODC '15) (2015), pp. 385–394.
[8] Babaoğlu, Ö., and Marzullo, K. Consistent global states of distributed systems: Fundamental concepts and mechanisms. In Distributed Systems (2nd ed.), S. Mullender, Ed. ACM Press/Addison-Wesley, 1993, pp. 55–96.
[9] Bacon, D. F., Bales, N., Bruno, N., Cooper, B. F., Dickinson, A., Fikes, A., Fraser, C., Gubarev, A., Joshi, M., Kogan, E., Lloyd, A., Melnik, S., Rao, R., Shue, D., Taylor, C., van der Holst, M., and Woodford, D. Spanner: Becoming a SQL system. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17) (2017), pp. 331–343.
[10] Bailis, P., Fekete, A., Ghodsi, A., Hellerstein, J. M., and Stoica, I. The potential dangers of causal consistency and an explicit solution. In Proceedings of the Third ACM Symposium on Cloud Computing (SoCC '12) (2012), pp. 22:1–22:7.
[11] Bailis, P., Ghodsi, A., Hellerstein, J. M., and Stoica, I. Bolt-on causal consistency. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13) (2013), pp. 761–772.
[12] Balegas, V., Duarte, S., Ferreira, C., Rodrigues, R., Preguiça, N., Najafzadeh, M., and Shapiro, M. Putting consistency back into eventual consistency. In Proceedings of the Tenth European Conference on Computer Systems (EuroSys '15) (2015), pp. 6:1–6:16.
[13] Balegas, V., Li, C., Najafzadeh, M., Porto, D., Clement, A., Duarte, S., Ferreira, C., Gehrke, J., Leitão, J., Preguiça, N., Rodrigues, R., Shapiro, M., and Vafeiadis, V. Geo-replication: Fast if possible, consistent if necessary. Data Engineering Bulletin 39, 1 (Mar. 2016), 81–92.
[14] Balmau, O., Didona, D., Guerraoui, R., Zwaenepoel, W., Yuan, H., Arora, A., Gupta, K., and Konka, P. TRIAD: Creating synergies between memory, disk and log in log structured key-value stores. In 2017 USENIX Annual Technical Conference (USENIX ATC 17) (2017), pp. 363–375.
[15] Belaramani, N., Dahlin, M., Gao, L., Nayate, A., Venkataramani, A., Yalagandula, P., and Zheng, J. PRACTI replication. In Proceedings of the 3rd Conference on Networked Systems Design & Implementation (NSDI '06) (2006), pp. 5–5.
[16] Bernstein, P. A., Burckhardt, S., Bykov, S., Crooks, N., Faleiro, J. M., Kliot, G., Kumbhare, A., Rahman, M. R., Shah, V., Szekeres, A., and Thelin, J. Geo-distribution of actor-based services. Proc. ACM Program. Lang. 1, OOPSLA (Oct. 2017), 107:1–107:26.
[17] Birman, K. P., and Joseph, T. A. Reliable communication in the presence of failures. ACM Trans. Comput. Syst. 5, 1 (Jan. 1987), 47–76.
[18] Bravo, M., Rodrigues, L., and Van Roy, P. Saturn: A distributed metadata service for causal consistency. In Proceedings of the Twelfth European Conference on Computer Systems (EuroSys '17) (2017), pp. 111–126.
[19] Calder, B., Wang, J., Ogus, A., Nilakantan, N., Skjolsvold, A., McKelvie, S., Xu, Y., Srivastav, S., Wu, J., Simitci, H., Haridas, J., Uddaraju, C., Khatri, H., Edwards, A., Bedekar, V., Mainali, S., Abbasi, R., Agarwal, A., Haq, M. F. u., Haq, M. I. u., Bhardwaj, D., Dayanand, S., Adusumilli, A., McNett, M., Sankaran, S., Manivannan, K., and Rigas, L. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11) (2011), pp. 143–157.
[20] Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10) (2010), pp. 143–154.
[21] Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., Hsieh, W., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S., Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Saito, Y., Szymaniak, M., Taylor, C., Wang, R., and Woodford, D. Spanner: Google's globally distributed database. ACM Trans. Comput. Syst. 31, 3 (Aug. 2013), 8:1–8:22.

[22] Crooks, N., Pu, Y., Alvisi, L., and Clement, A. Seeing is believing: A client-centric specification of database isolation. In Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC '17) (2017), pp. 73–82.
[23] Crooks, N., Pu, Y., Estrada, N., Gupta, T., Alvisi, L., and Clement, A. TARDiS: A branch-and-merge approach to weak consistency. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16) (2016), pp. 1615–1628.
[24] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: Amazon's highly available key-value store. In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (SOSP '07) (2007), pp. 205–220.
[25] Du, J., Elnikety, S., Roy, A., and Zwaenepoel, W. Orbe: Scalable causal consistency using dependency matrices and physical clocks. In Proceedings of the 4th Annual Symposium on Cloud Computing (SoCC '13) (2013), pp. 11:1–11:14.
[26] Du, J., Iorgulescu, C., Roy, A., and Zwaenepoel, W. GentleRain: Cheap and scalable causal consistency with physical clocks. In Proceedings of the ACM Symposium on Cloud Computing (SoCC '14) (2014), pp. 4:1–4:13.
[27] Google. Protocol buffers. https://developers.google.com/protocol-buffers/, 2017.
[28] Gotsman, A., Yang, H., Ferreira, C., Najafzadeh, M., and Shapiro, M. 'Cause I'm strong enough: Reasoning about consistency choices in distributed systems. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '16) (2016), pp. 371–384.
[29] Gunawardhana, C., Bravo, M., and Rodrigues, L. Unobtrusive deferred update stabilization for efficient geo-replication. In 2017 USENIX Annual Technical Conference (USENIX ATC 17) (2017), pp. 83–95.
[30] Herlihy, M. P., and Wing, J. M. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12, 3 (July 1990), 463–492.
[31] Hutto, P. W., and Ahamad, M. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems (May 1990), pp. 302–309.
[32] Kraska, T., Pang, G., Franklin, M. J., Madden, S., and Fekete, A. MDCC: Multi-data center consistency. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys '13) (2013), pp. 113–126.
[33] Kulkarni, S. S., Demirbas, M., Madappa, D., Avva, B., and Leone, M. Logical physical clocks. In Principles of Distributed Systems (OPODIS) (2014), Springer International Publishing, pp. 17–32.
[34] Ladin, R., Liskov, B., Shrira, L., and Ghemawat, S. Providing high availability using lazy replication. ACM Trans. Comput. Syst. 10, 4 (Nov. 1992), 360–391.
[35] Lamport, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558–565.
[36] Li, C., Leitão, J., Clement, A., Preguiça, N., Rodrigues, R., and Vafeiadis, V. Automating the choice of consistency levels in replicated systems. In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC '14) (2014), pp. 281–292.
[37] Lloyd, W., Freedman, M. J., Kaminsky, M., and Andersen, D. G. Don't settle for eventual: Scalable causal consistency for wide-area storage with COPS. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11) (2011), pp. 401–416.
[38] Lloyd, W., Freedman, M. J., Kaminsky, M., and Andersen, D. G. Stronger semantics for low-latency geo-replicated storage. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (NSDI '13) (2013), pp. 313–328.
[39] Lu, H., Hodsdon, C., Ngo, K., Mu, S., and Lloyd, W. The SNOW theorem and latency-optimal read-only transactions. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16) (2016), pp. 135–150.
[40] Lu, H., Veeraraghavan, K., Ajoux, P., Hunt, J., Song, Y. J., Tobagus, W., Kumar, S., and Lloyd, W. Existential consistency: Measuring and understanding consistency at Facebook. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15) (2015), pp. 295–310.
[41] Mahajan, P., Alvisi, L., and Dahlin, M. Consistency, availability, convergence. Tech. Rep. TR-11-22, Computer Science Department, University of Texas at Austin, May 2011.
[42] Mattern, F. Virtual time and global states of distributed systems. In Parallel and Distributed Algorithms (1989), North-Holland, pp. 215–226.
[43] Mehdi, S. A., Littley, C., Crooks, N., Alvisi, L., Bronson, N., and Lloyd, W. I can't believe it's not causal! Scalable causal consistency with no slowdown cascades. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2017) (2017), pp. 453–468.
[44] Moniz, H., Leitão, J., Dias, R. J., Gehrke, J., Preguiça, N., and Rodrigues, R. Blotter: Low latency transactions for geo-replicated storage. In Proceedings of the 26th International Conference on World Wide Web (WWW '17) (2017), pp. 263–272.

[45] Mosberger, D. Memory consistency models. SIGOPS Oper. Syst. Rev. 27, 1 (Jan. 1993), 18–26.
[46] Nawab, F., Arora, V., Agrawal, D., and El Abbadi, A. Minimizing commit latency of transactions in geo-replicated data stores. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15) (2015), pp. 1279–1294.
[47] Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., McElroy, R., Paleczny, M., Peek, D., Saab, P., Stafford, D., Tung, T., and Venkataramani, V. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (NSDI '13) (2013), pp. 385–398.
[48] Noghabi, S. A., Subramanian, S., Narayanan, P., Narayanan, S., Holla, G., Zadeh, M., Li, T., Gupta, I., and Campbell, R. H. Ambry: LinkedIn's scalable geo-distributed object store. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16) (2016), pp. 253–265.
[49] NTP. The network time protocol. http://www.ntp.org, 2017.
[50] Papadimitriou, C. H. The serializability of concurrent database updates. J. ACM 26, 4 (Oct. 1979), 631–653.
[51] Petersen, K., Spreitzer, M. J., Terry, D. B., Theimer, M. M., and Demers, A. J. Flexible update propagation for weakly consistent replication. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (SOSP '97) (1997), pp. 288–301.
[52] Rahman, M. R., Tseng, L., Nguyen, S., Gupta, I., and Vaidya, N. Characterizing and adapting the consistency-latency tradeoff in distributed key-value stores. ACM Trans. Auton. Adapt. Syst. 11, 4 (Jan. 2017), 20:1–20:36.
[53] Reda, W., Canini, M., Suresh, L., Kostić, D., and Braithwaite, S. Rein: Taming tail latency in key-value stores via multiget scheduling. In Proceedings of the Twelfth European Conference on Computer Systems (EuroSys '17) (2017), pp. 95–110.
[54] Roohitavaf, M., Demirbas, M., and Kulkarni, S. CausalSpartan: Causal consistency for distributed data stores using hybrid logical clocks. In 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS) (Sept. 2017), pp. 184–193.
[55] Sovran, Y., Power, R., Aguilera, M. K., and Li, J. Transactional storage for geo-replicated systems. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11) (2011), pp. 385–400.
[56] Spirovska, K., Didona, D., and Zwaenepoel, W. Optimistic causal consistency for geo-replicated key-value stores. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS) (June 2017), pp. 2626–2629.
[57] Tay, Y. C. Analytical Performance Modeling for Computer Systems, 1st ed. Morgan and Claypool Publishers, 2010.
[58] Terry, D. B., Prabhakaran, V., Kotla, R., Balakrishnan, M., Aguilera, M. K., and Abu-Libdeh, H. Consistency-based service level agreements for cloud storage. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13) (2013), pp. 309–324.
[59] Terry, D. B., Theimer, M. M., Petersen, K., Demers, A. J., Spreitzer, M. J., and Hauser, C. H. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (SOSP '95) (1995), pp. 172–182.
[60] Thomas, R. H. A majority consensus approach to concurrency control for multiple copy databases. ACM Trans. Database Syst. 4, 2 (June 1979), 180–209.
[61] Verbitski, A., Gupta, A., Saha, D., Brahmadesam, M., Gupta, K., Mittal, R., Krishnamurthy, S., Maurice, S., Kharatishvili, T., and Bao, X. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17) (2017), pp. 1041–1052.
[62] Viotti, P., and Vukolić, M. Consistency in non-transactional distributed storage systems. ACM Comput. Surv. 49, 1 (2016), 19:1–19:34.
[63] Vogels, W. Eventually consistent. Commun. ACM 52, 1 (Jan. 2009), 40–44.
[64] Zawirski, M., Preguiça, N., Duarte, S., Bieniusa, A., Balegas, V., and Shapiro, M. Write fast, read in the past: Causal consistency for client-side applications. In Proceedings of the 16th Annual Middleware Conference (Middleware '15) (2015), pp. 75–87.
[65] Zhang, I., Lebeck, N., Fonseca, P., Holt, B., Cheng, R., Norberg, A., Krishnamurthy, A., and Levy, H. M. Diamond: Automating data management and storage for wide-area, reactive applications. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016), pp. 723–738.
[66] Zhang, I., Sharma, N. K., Szekeres, A., Krishnamurthy, A., and Ports, D. R. K. Building consistent transactions with inconsistent replication. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15) (2015), pp. 263–278.
[67] Zhang, Y., Power, R., Zhou, S., Sovran, Y., Aguilera, M. K., and Li, J. Transaction chains: Achieving serializability with low latency in geo-distributed storage systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13) (2013), pp. 276–291.
