
Concise Version Vectors in WinFS

Dahlia Malkhi · Doug Terry

Received: date / Accepted: date

Abstract Conflicts naturally arise in optimistically replicated systems. The common way to detect update conflicts is via version vectors, whose storage and communication overhead are proportional to the number of replicas × the number of objects. These costs may be prohibitive for large systems. This paper presents predecessor vectors with exceptions (PVEs), a novel optimistic replication technique developed for Microsoft's WinFS system. The paper contains a systematic study of PVE's performance gains over traditional schemes. The results demonstrate a dramatic reduction of storage and communication overhead in normal scenarios, during which communication disruptions are infrequent. Moreover, they identify a cross-over threshold in communication failure-rate, beyond which PVE loses efficiency compared with traditional schemes.

D. Malkhi
Microsoft Research Silicon Valley
E-mail: [email protected]

Doug Terry
Microsoft Research Silicon Valley

1 Introduction

Consider an information system, such as an e-mail client, that is composed of multiple data objects, holding folders, files and tags. Data may be replicated in multiple sites. For example, a user's mailbox may reside at a server, on the user's office and home PCs, on the user's laptop, and on a PDA. The system allows concurrent, optimistic updates to its objects from distributed locations without communication or centralized control. For example, the user might hop on the plane with a copy of her mailbox on a laptop and edit various parts of it while disconnected; she may introduce changes on a PDA, and so on. At some point, when these computers are connected, she wishes to synchronize object versions across replicas and be alerted to any conflicts.

This system model arises naturally within the scope of Microsoft's WinFS platform, which was designed to provide peer-to-peer weakly consistent replicated storage facilities. The model is fundamental in distributed systems, and numerous replication methods exist to support it. However, the applications that are targeted by WinFS mandate taking scale more seriously than ever before. In particular, e-mail repositories, log files, digital libraries, and application databases can easily reach millions of objects. Hence, communicating even a single bit per object (e.g., a 'dirty' bit) in order to be able to synchronize replicas might simply be too costly.

In this paper, we present a precise description and correctness proof of the replica reconciliation and conflict detection mechanism inside Microsoft's WinFS system. We name the scheme predecessor vectors with exceptions (PVE). We produce a systematic study of the performance gains of PVE and provide a comparison with the traditional optimistic replication scheme. The results demonstrate a substantial reduction in the storage and communication overhead associated with replica synchronization in most cases. In conditions that allow synchronizations to complete without communication breaks, a pair of replicas needs only communicate a constant number of bits per replica in order to detect discrepancies in their states. Moreover, they need to maintain only a single counter per object in order to determine the causal ordering of objects' versions and detect any conflicting versions arising from concurrent updates. Our study also demonstrates the "cut-off" point in the communication fault-rate, beyond which the PVE technique becomes less attractive than the alternatives.

In order to understand the efficiency leap offered by the PVE scheme, let us review the most well known alternative.
Version Vectors (VVs) [6] are traditionally used in optimistic replication systems in order to decide which replica has the more up-to-date contents for each object, as well as to detect conflicting versions. Per-object version vectors were pioneered in the Locus distributed file system [6], and subsequently employed in various optimistic replication systems [7,11,9].

The version vector for a data object is an array of size R, where R is the number of replicas in the system. Each replica has a pair ⟨replica, counter⟩ in the vector, indicating the number of modifications performed on the object by the replica. For example, suppose that we have three replicas, A, B, and C. An object is initialized with version vector (⟨A,0⟩, ⟨B,0⟩, ⟨C,0⟩). An update to the object initiated at replica A increments A's component, and so generates version vector (⟨A,1⟩, ⟨B,0⟩, ⟨C,0⟩). Later, B may obtain the updated object from A, along with its version vector, and produce another update on the object. The newer object state receives version vector (⟨A,1⟩, ⟨B,1⟩, ⟨C,0⟩). And so on.

A version vector V dominates another vector W if every component of V is no less than the corresponding component of W; V strictly dominates W if it dominates W and one component of V is greater. Due to optimism, objects on different replicas may have version vectors that are incomparable by the domination relation. This corresponds to conflicting versions, indicating that simultaneous updates were introduced to the object at different replicas. For example, continuing the scenario above, suppose that all replicas have version vector (⟨A,1⟩, ⟨B,1⟩, ⟨C,0⟩) for a stored object. Now replicas A and C produce diverging updates on the object simultaneously. These updates generate version vectors (⟨A,2⟩, ⟨B,1⟩, ⟨C,0⟩) and (⟨A,1⟩, ⟨B,1⟩, ⟨C,1⟩), respectively, which are conflicting since neither one dominates the other.
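To make the domination relation concrete, here is a minimal Python sketch (our illustration, not code from any of the cited systems); a version vector is represented as a dictionary from replica id to that replica's update counter:

# A version vector maps each replica id to the number of updates that
# replica has applied to the object.
def dominates(v, w):
    """True if every component of v is at least the matching component of w."""
    return all(v.get(rep, 0) >= cnt for rep, cnt in w.items())

def conflict(v, w):
    """Two versions conflict if neither vector dominates the other."""
    return not dominates(v, w) and not dominates(w, v)

# The diverging updates from the example above:
va = {"A": 2, "B": 1, "C": 0}   # update introduced at replica A
vc = {"A": 1, "B": 1, "C": 1}   # concurrent update introduced at replica C
assert conflict(va, vc)          # neither vector dominates the other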
Consider a system with N objects replicated across R replicas. Further, consider the synchronization between two replicas that have differing versions for q objects. The VV scheme is designed for synchronizing replicas object by object, and incurs the following costs:

1. Store a version vector per object, incurring a storage overhead of Õ(N × R) bits;¹
2. Communicate information that allows the two replicas to determine which objects one should send the other and to detect conflicts. A naive implementation sends all N version vectors, incurring a communication overhead of Õ(N × R). If the replicas store logs of recent updates and maintain additional information about the log entries that have previously been synchronized with other replicas, they may bring the cost down close to Õ(q × R), which is the lowest possible communication overhead with the VV scheme.

¹ For simplicity of notation, Õ(·) indicates the same complexity order as O(·) up to logarithmic factors of N and R, which may be required to encode any single value in our settings.

These cost measures and their analysis are made more precise later in the paper. Note that even for moderate numbers of replicas R, storing N × R counters is a substantial burden when N is large. Moreover, communicating between Õ(q × R) and Õ(N × R) overhead bits may be prohibitive.

In WinFS, the goal is to quickly synchronize heavy-volume replicas, each carrying large magnitudes of objects. In situations where communication disruptions are not the norm, the innovative PVE mechanism in WinFS reduces storage and communication costs by a considerable amount. Replicas need only exchange Õ(R) information bits to determine their differences, i.e. the equivalent of communicating a single version vector. In addition, in most cases, per-object meta-information storage and communication is constant (one counter). Table 1 in Section 6 contains a summary of these complexities. In the remainder of this paper, we describe the foundations of the PVE replication protocol and compare it against the traditional VV scheme.

The contributions of this paper are three-fold. First, we give a precise and detailed formulation of the PVE replica reconciliation protocol employed in WinFS. Second, we develop a performance model capturing the cost measures of interest and quantify the performance gains of the PVE scheme compared with known methods. Third, we evaluate the PVE scheme via simulation under complex system conditions with increasing communication failure rates. This evaluation reveals a cut-off point that characterizes the benefit area of the PVE scheme over traditional version vectors. We note that the full design and architecture of the WinFS replication platform is the result of a large team effort and is beyond the scope of this paper; we refer the interested reader to [10] for a wider coverage of the WinFS architecture.

The next section gives a detailed problem statement. Section 3 provides an informal overview of the PVE scheme. Section 4 provides a precise description of the method and lays the foundation for reasoning about its correctness. Correctness proofs are provided in Section 5. Section 6 contains a performance study. Section 7 surveys related work, and Section 8 concludes.

2 Problem Statement

In this section, we begin with the precise specification of our problem. Later sections provide a rigorous treatment of the solution.

The system consists of a collection of data objects, potentially numerous. Each object might be quite small, e.g. a mail entry or even a status word. Objects are replicated on a set of hosts. Each host may locally introduce updates to any object without any concurrency control. These updates create a partial ordering of object versions, where updates that sequentially follow one another are causally related, but unrelated updates are conflicting.

Our focus is on distributed systems in which updates overwrite previous versions. The alternative would be database or journal systems in which the history of updates on an object is logged and applied at every replica. State-based replication saves storage and computation and is suitable for the kind of information-intensive applications for which WinFS was designed, e.g. a user's Outlook files, where updates may be numerous. In state-based replication systems, only the most recent version of any object needs to be sent during synchronization. Nevertheless, it is worth noting that the method presented in this paper can work, with (minor) appropriate modifications, for log-based replication systems. For brevity, we omit this from discussion in this paper.

The goal is to provide a lightweight replica reconciliation and conflict detection mechanism. The mechanism should provide two communicating replicas with the means to detect precedence ordering on object versions that they hold and detect any conflicts, while requiring only a small amount of per-object overhead. With this mechanism, replicas can bring each other up-to-date and report conflicts.

More precisely, we now describe objects, versions, and causality. An object is identified uniquely by its name. Objects are instantiated with versions, where an object instance has the following fields:

name: the unique identifier.
version: a pair ⟨replica id, counter⟩.
predecessors: a set of preceding versions (including the current version).
data: opaque application-specific information.

Because versions uniquely determine objects' instances, we simply refer to any particular instance by its version. Over space and time, there may be multiple versions with the same object name. We say that these are versions of the same object.

There is a partial, causal ordering among different versions of the same object. When a replica A creates an instance of an object o with version v, the set W of versions previously known by replica A on o causally precedes version v. In notation, W ≺ v. For every version w ∈ W, we likewise say that w causally precedes v; in notation, w ≺ v. Causality is transitive.

Since the system permits concurrent updates, the causality relation is only a partial order, i.e. multiple versions might follow any single version. When two versions do not follow one another, they are conflicting. In other words, if w ⊀ v ∧ v ⊀ w, then v and w are said to conflict.

It is desirable to detect and resolve conflicts, either automatically (when application-specific conflict resolution code is available) or by alerting users who can resolve conflicts manually. In either case, a resolution of conflicting versions is a version that causally follows both. For example, here is a conflict and its resolution: v0 ≺ v ≺ w; v ⊀ u; u ⊀ v; v0 ≺ u ≺ w.

New versions override previous ones, and so replicas are generally only interested in the most recent version available; versions that causally precede it are obsolete and carry no valuable information. This simple rule is complicated by the fact that multiple conflicting versions may exist; in this case, replicas are interested in all concurrent versions until they are resolved.
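As an illustration of this object model, the following Python sketch (ours; the field names mirror the list above, but the code is not part of WinFS) decides precedence and conflicts directly from the predecessors sets:

from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str                    # unique object identifier
    version: tuple               # (replica id, counter)
    predecessors: set = field(default_factory=set)   # includes version itself
    data: bytes = b""            # opaque application-specific information

def precedes(v, w):
    """v precedes w (v is an older version of the same object)."""
    return v.version != w.version and v.version in w.predecessors

def conflicts(v, w):
    """Versions of the same object conflict if neither precedes the other."""
    return (v.name == w.name and v.version != w.version
            and not precedes(v, w) and not precedes(w, v))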
2.1 Performance Measures

This paper is concerned with mechanisms that facilitate synchronization of different replicas. The challenge is to bring the storage and communication costs associated with replica reconciliation (significantly) down. More precisely, we focus on two performance measures:

Storage is the total number of overhead bits stored in order to preserve version ordering.
Communication is (i) the total number of bits communicated between two replicas in order to determine which updates are known to one replica but not the other, and (ii) any overhead data that is transferred along with objects' states in order to determine precedence/conflicts.

3 Overview of the PVE Method

This section provides an informal overview of the PVE scheme. Later sections provide a more formal description and a proof of correctness.

The PVE scheme works as follows. An object version is a pair ⟨replica, counter⟩. Instead of using separate counters for distinct objects, the scheme uses one per-replica counter to enumerate the versions that the replica generates on all objects (the counter is shared across all objects). For example, suppose that replica A first introduces an update to object o1 and second to o2. The versions corresponding to o1 and to o2 will be ⟨A,1⟩ and ⟨A,2⟩, respectively. Note that versions are not full vectors, as in the traditional VV scheme described in the Introduction. Each object has, in addition to its version, a predecessor set that captures the versions that causally precede the current one.
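For concreteness, a per-replica counter can be sketched as follows (a hypothetical illustration; the class and method names are ours):

class Replica:
    def __init__(self, name):
        self.name = name
        self.counter = 0                  # one counter shared by all objects

    def new_version(self, obj_name):
        self.counter += 1                 # enumerate updates across all objects
        return (self.name, self.counter)  # obj_name does not affect the counter

A = Replica("A")
print(A.new_version("o1"))   # ('A', 1)
print(A.new_version("o2"))   # ('A', 2)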

Predecessor sets are captured in PVE using version vectors, though we will show momentarily that, in most cases, PVE can replace these vectors with a null pointer. In order to distinguish these vectors from the traditional version vectors, we call them predecessor vectors. A predecessor vector (PV) contains one version, the latest, per replica. When a replica A generates a new object version, the PV associated with the new version contains the latest versions known by A on the object from each other replica. For example, suppose we have three replicas, A, B, and C. A new object starts with a zeroed predecessor vector (⟨A,0⟩, ⟨B,0⟩, ⟨C,0⟩). Consider the two versions generated by replica A above on o1 and o2, ⟨A,1⟩ and ⟨A,2⟩, respectively. When A creates these versions, no other versions are known on either o1 or o2, hence the PV of ⟨A,1⟩ is (⟨A,1⟩, ⟨B,0⟩, ⟨C,0⟩), and the PV of ⟨A,2⟩ is (⟨A,2⟩, ⟨B,0⟩, ⟨C,0⟩). A (causally) subsequent update to o1 by replica B creates version ⟨B,1⟩, with PV (⟨A,1⟩, ⟨B,1⟩, ⟨C,0⟩). This predecessor vector represents the latest versions known by B on object o1.

The formal definitions capturing the predecessor vectors scheme are given below.

Definition 1 (Per-Replica Counter) Let X be a replica. The versions generated by replica X on objects are the ordered sequence {⟨X,i⟩}, i = 1, 2, ....

Definition 2 (Predecessor Vectors) Let X1, ..., XR be the set of replicas. A predecessor vector (PV) is an R-array of tuples of the form (⟨X1,i1⟩, ..., ⟨XR,iR⟩).
A predecessor vector (⟨X1,i1⟩, ..., ⟨XR,iR⟩) dominates another vector (⟨X1,j1⟩, ..., ⟨XR,jR⟩) if ik ≥ jk for k = 1..R, and it strictly dominates if iℓ > jℓ for some 1 ≤ ℓ ≤ R.
By a natural overload of notation, we say that a predecessor vector (⟨X1,i1⟩, ..., ⟨XR,iR⟩) dominates a version ⟨Xk,jk⟩ if ik ≥ jk; strict domination follows accordingly with strict inequality.

The reader should first note that despite the aggregation of multiple-object versions using one counter, predecessor vectors can express precedence relations between versions of the same object. For example, in the scenario above, version ⟨A,1⟩ precedes ⟨B,1⟩, and is indeed dominated by the PV associated with version ⟨B,1⟩. Moreover, PVs do not create false conflicts. The reason is that incomparable predecessor vectors conflict only if they belong to the same object. So, for example, suppose that, continuing the scenario above, replica A introduces version ⟨A,3⟩ to object o1 with predecessor vector (⟨A,3⟩, ⟨B,1⟩, ⟨C,0⟩); and simultaneously, replica C introduces version ⟨C,1⟩ on o2, with the corresponding PV (⟨A,2⟩, ⟨B,0⟩, ⟨C,1⟩). These versions would be conflicting had they belonged to the same object, but are fine since they are never compared against each other.

Hence, comparing different versions for the same object is now possible as in the traditional use of version vectors. Namely, the same domination relation among predecessor vectors and versions can determine precedence and conflicts of updates to the same object.

Reducing the Overhead. So far we have not introduced any space savings over traditional version vectors. The surprising benefit of PVs is as follows. Let X.knowledge denote the component-wise maximum of the PVs of all the versions held by a replica X. The performance savings stems from the following fact: in order to represent ordering relations of all the versions X stores for all objects, it suffices for replica X to store only X.knowledge. Knowledge aggregates the predecessor vectors of all objects and is used instead of per-object PVs. More specifically, knowledge replaces PVs as follows:

– No PV is stored per object at all. The only vector stored by a replica is its aggregate knowledge vector.
– In order for A to determine which data objects in its store are more up-to-date than B's store, B simply needs to send B.knowledge to A. Using the difference between A.knowledge and B.knowledge, A can determine which versions it should send B.
– Having determined the q relevant newer objects, A sends these objects with (only) a single version each. In addition, A needs to send (once) its knowledge vector.
The reader may be concerned at this point that information is lost regarding the ability to determine version precedence. We now demonstrate why this is not the case. When two replicas, A and B, wish to compare their latest versions of the same object o, say ⟨r,nr⟩ and ⟨s,ns⟩ respectively, they simply compare these against B.knowledge and A.knowledge respectively. If A.knowledge dominates ⟨s,ns⟩, then the version currently held by A for object o, namely ⟨r,nr⟩, strictly succeeds ⟨s,ns⟩. And vice versa. If neither of these knowledge vectors dominates the other version, then these are conflicting versions.

Going back to the scenario presented above, replica A has in store the following: o1.version = ⟨A,3⟩; o2.version = ⟨A,2⟩; knowledge = (⟨A,3⟩, ⟨B,1⟩, ⟨C,0⟩). Replica C stores the following: o1.version = ⟨B,1⟩; o2.version = ⟨C,1⟩; knowledge = (⟨A,2⟩, ⟨B,1⟩, ⟨C,1⟩). When comparing their versions for object o1, A and C will find that A's version is more recent, and when comparing their versions of object o2, they will find C's version to be the recent one.
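The comparison just described can be sketched in Python as follows (our illustration for the faultless case, in which knowledge vectors have no exceptions and are plain maps from replica id to the highest counter known):

def knows(knowledge, version):
    """True if the knowledge map dominates the single version (rep, ctr)."""
    rep, ctr = version
    return knowledge.get(rep, 0) >= ctr

def compare(my_version, my_knowledge, other_version, other_knowledge):
    if my_version == other_version:
        return "equal"
    if knows(my_knowledge, other_version):
        return "mine newer"          # my stored version strictly succeeds the other
    if knows(other_knowledge, my_version):
        return "other newer"
    return "conflict"

# The scenario above, seen from replica A when talking to replica C:
A_knowledge = {"A": 3, "B": 1, "C": 0}
C_knowledge = {"A": 2, "B": 1, "C": 1}
print(compare(("A", 3), A_knowledge, ("B", 1), C_knowledge))   # o1 -> 'mine newer'
print(compare(("A", 2), A_knowledge, ("C", 1), C_knowledge))   # o2 -> 'other newer'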

The result is that the storage overhead in WinFS is Õ(N + R), instead of Õ(N × R). More dramatically, the communication overhead associated with synchronization is reduced. The communication overhead of sending knowledge is Õ(R), and the total communication overhead associated with synchronizing replicas is Õ(q + R).

Dealing with Disrupted Synchronization. Synchronization among two replicas may fail to complete due to network disruption. One way of coping with this is to abort incomplete synchronization procedures; then no further complication to the above scheme is needed.

However, in reality, due to the large volumes that may need to be synchronized, aborting a partially-completed synchronization may not be desirable (and in fact, may create increasingly larger and larger synchronization demands that might become less and less likely to complete). The aggregate knowledge method above introduces a new source of difficulty due to incomplete synchronizations. Let us demonstrate this problem. When replica A receives an object's new version from another replica B, that object does not carry a specific PV. Suppose that before synchronizing with B, the highest version A stores from B on any object is ⟨B,10⟩. If B sends ⟨B,14⟩, then clearly versions ⟨B,11⟩, ⟨B,12⟩, and ⟨B,13⟩ are missing in A's knowledge, hence there are "holes".

It is tempting to try to solve this by a policy that mandates sending all versions from one replica in an order that respects their generation order. In the above scenario, send ⟨B,11⟩ before ⟨B,14⟩, unless that version has been obsoleted by another version. Then, when ⟨B,14⟩ is received, A would know that it must already reflect ⟨B,11⟩, ⟨B,12⟩, and ⟨B,13⟩.

Unfortunately, this strategy is impossible to enforce, as illustrated in the following scenario. Object o1 receives an update from replica A with version ⟨A,1⟩ and PV (⟨A,1⟩, ⟨B,0⟩, ⟨C,0⟩). Meanwhile, object o2 is updated by B, producing version ⟨B,1⟩ with PV (⟨A,0⟩, ⟨B,1⟩, ⟨C,0⟩). Replicas A and B synchronize and exchange their latest updates. Subsequently, object o1 is updated at replica B with version ⟨B,2⟩ and PV (⟨A,1⟩, ⟨B,2⟩, ⟨C,0⟩); and object o2 is updated at replica A with version ⟨A,2⟩ and PV (⟨A,2⟩, ⟨B,1⟩, ⟨C,0⟩). The orderings between all versions are as follows:

o1: [⟨A,1⟩; PV = (⟨A,1⟩, ⟨B,0⟩, ⟨C,0⟩)] ≺ [⟨B,2⟩; PV = (⟨A,1⟩, ⟨B,2⟩, ⟨C,0⟩)]
o2: [⟨B,1⟩; PV = (⟨A,0⟩, ⟨B,1⟩, ⟨C,0⟩)] ≺ [⟨A,2⟩; PV = (⟨A,2⟩, ⟨B,1⟩, ⟨C,0⟩)]

Then replica B synchronizes with replica A, sending it all of its recent updates. Replica A now stores: o1.version = ⟨B,2⟩; o2.version = ⟨A,2⟩; knowledge = (⟨A,2⟩, ⟨B,2⟩, ⟨C,0⟩).

Now suppose that replica C, which has been detached for a while, comes back and synchronizes with replica A. During this synchronization, only the most recent versions of objects o1 and o2 are sent to replica C. In this scenario, there is simply no way to prevent holes. Replica C may first obtain o1's recent version, ⟨B,2⟩, and then have its communication cut. Then version ⟨B,1⟩ (which happens to belong to o2) is missing. A similar situation occurs if replica C obtains o2's recent version first and is then disconnected.

It is worth noting that although seemingly we don't care about the missing, obsoleted versions, we cannot ignore them. If the subsequent versions are lost from the system for some reason, inconsistency may result. For example, in the first case above, the missing o2 version ⟨B,1⟩ is subsumed by a later version ⟨A,2⟩. However, if replica C simply includes ⟨B,2⟩ in its knowledge vector, and replica A crashes such that ⟨A,2⟩ is forever lost from the system, C might never obtain the latest state of o2 from replica B.

The price paid in the PVE scheme for its substantial storage and communication reduction is the need to maintain information about such exceptions. In the above scenario, replica C will need to store exception information as follows. First, C.knowledge will contain (⟨A,0⟩, ⟨B,2⟩, ⟨C,0⟩) with an exception ⟨eB,1⟩.²

² An alternative form of exception is to store (⟨A,0⟩, ⟨B,0⟩, ⟨C,0⟩) with a 'positive exception' ⟨eB,2⟩. The two alternatives result in different storage load under different scenarios, positive exceptions being preferable under long synchronization gaps. For simplicity, we use negative exceptions in the description here, although the method employed in WinFS uses positive exceptions.
Definition 3 (PVs with Exceptions) A predecessor vector with exceptions (PVE) is an R-array of tuples of the form (⟨X1,i1⟩⟨eX1,j1⟩…⟨eX1,jk1⟩, ..., ⟨XR,iR⟩⟨eXR,jR⟩…⟨eXR,jkR⟩).
A version ⟨Xk,jk⟩ is dominated by a predecessor vector X with exceptions as above if ik ≥ jk, and jk is not among the exceptions in the k'th position in X.
A predecessor vector with exceptions X dominates another vector Y if the respective PVs without the exceptions dominate, and no exception included in X is dominated by Y.
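A minimal Python sketch of the domination test for a single version under Definition 3 (our own encoding, not the WinFS representation): each component keeps the highest counter seen for a replica together with the set of exception counters that are explicitly not covered:

# pve[rep] = (highest counter seen from rep, set of exception counters below it)
def dominates_version(pve, version):
    """A PVE dominates <rep, ctr> if ctr is at most the highest counter for rep
    and ctr is not listed as an exception."""
    rep, ctr = version
    highest, missing = pve.get(rep, (0, set()))
    return ctr <= highest and ctr not in missing

# Replica C's knowledge after the disrupted synchronization described above:
C_knowledge = {"A": (0, set()), "B": (2, {1}), "C": (0, set())}
assert dominates_version(C_knowledge, ("B", 2))       # covered
assert not dominates_version(C_knowledge, ("B", 1))   # the exception (a "hole")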

Second, we require that a replica maintain an explicit PV for every new version it obtains via a partial synchronization. These explicit PVs may be omitted only if the replica's knowledge dominates them. Continuing the scenario above, we demonstrate a subtle chain of events which necessitates this additional overhead.

Consider the information stored at replica C after the partial synchronization: o1.version = ⟨B,2⟩; o2.version = ⊥; knowledge = (⟨A,0⟩, ⟨B,2⟩⟨eB,1⟩, ⟨C,0⟩). Suppose that A synchronizes with C and sends it update ⟨A,1⟩ on o1. This update clearly does not follow ⟨B,2⟩ (the current version of o1 held by C), but according to C's knowledge, neither is it succeeded by it – a conflict! The problem, of course, is that C's knowledge no longer dominates version ⟨A,1⟩.

Only at the end of the synchronization procedure is the knowledge of the sending replica merged with the knowledge of the receiving replica. At that point, knowledge at the receiving replica will clearly dominate all of the versions it received during synchronization, and their PVs may be omitted. But if synchronization is cut in the middle, some of these PVs must be kept until such time when the replica's knowledge again dominates them.

In our performance analysis and comparison with other methods, we take into account this cost and measure its effect. Note that it is incurred only due to communication disruptions that prevent synchronization procedures from completing. Our simulations vary the number of such disruptions from small to aggressively high.

4 Causality-Based Replica Reconciliation

In this section, we begin to provide the formal treatment of the PVE replica reconciliation mechanism. Our approach builds the description in two steps. First, we give a generic set-oriented method for replica reconciliation and define the invariants that it maintains. This protocol introduces the basic synchronization and is simple enough that one can easily argue that it satisfies the stated correctness properties. Second, we instantiate the method with the PVE concise predecessor vectors developed for WinFS. This is an abstract version of the actual synchronization protocol used in WinFS. Its correctness is then proven in Section 5.

The key enabler of replica synchronization is a mechanism for representing sets of versions, through which precedence ordering can be captured. To this end, replicas store the following information concerning causality. First, replica r maintains information about the entire set of versions it knows of, represented in r.knowledge. Second, each version v stored at replica r contains in v.predecessors a representation of the entire set of causally preceding versions. More specifically, we require the maintenance of a set r.knowledge per replica r and v.predecessors per stored version v, as follows:

Definition 4 (The Knowledge Invariant) For every replica r and version v, we require r to maintain a set r.knowledge, such that if v ∈ r.knowledge then replica r stores version v or a version w such that v ≺ w.

When a replica is first created, its knowledge set is empty. Local updates produce new versions that are added to the replica's knowledge, as are versions received from other replicas during synchronization.

Definition 5 (The Predecessors Invariant) For all object instances v and w, we require r to maintain a set w.predecessors such that v ∈ w.predecessors if and only if v ≺ w.

When a new data object is created, its predecessors set contains only its own version. Each update to the object produces a new version that is added to the previous version's predecessors to get the predecessors set for the new instance of the object.

Given the above two invariants, it is possible to determine if a version is included in a replica's storage and if one version precedes another or they conflict.

4.1 A Synchronization Framework

Figure 1 presents an asymmetric synchronization protocol that uses knowledge and predecessors. The protocol is a one-way protocol between a requesting replica and a source replica. The requestor contacts a source and obtains all the versions in the source replica's knowledge. These versions are integrated into the requestor's storage, and conflict alarms are raised where needed.

1. Requestor r sends source s its knowledge set r.knowledge.
2. Source s responds with the following:
   (a) For every object o it stores, for which o.version ∉ r.knowledge, it sends o.
3. For every version o received from s, requestor r does the following:
   (a) For every object w in store, such that w.name = o.name:
       if o ∈ w.predecessors then ignore o and stop;
       else if w.version ∈ o.predecessors then delete w;
       else alert conflict.
   (b) Insert o.version into r.knowledge.
   (c) Integrate o.predecessors into r.knowledge.
   (d) Store o.

Fig. 1 A generic replica synchronization protocol using causality information.
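For concreteness, the integration step of Figure 1 can be sketched in Python as follows. This is our own illustration (not the WinFS code); a replica is modeled as a plain record with a store mapping object names to the live instances and a knowledge set of versions, and received instances are assumed to carry the name, version, and predecessors fields of the object model in Section 2:

from types import SimpleNamespace

def new_replica():
    # store: object name -> list of live instances (more than one only while
    # a conflict is unresolved); knowledge: set of (replica, counter) versions.
    return SimpleNamespace(store={}, knowledge=set())

def integrate(requestor, received):
    """Step 3 of Figure 1: fold the instances sent by the source into the requestor."""
    for o in received:
        obsolete = False
        for w in list(requestor.store.get(o.name, [])):
            if o.version in w.predecessors:
                obsolete = True                      # o is already superseded: ignore it
                break
            if w.version in o.predecessors:
                requestor.store[o.name].remove(w)    # w is superseded: delete it
            else:
                print("conflict detected on", o.name)
        if obsolete:
            continue
        requestor.knowledge.add(o.version)                 # step 3(b)
        requestor.knowledge.update(o.predecessors)         # step 3(c)
        requestor.store.setdefault(o.name, []).append(o)   # step 3(d): store o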
mation about the entire set of versions it knows (b) Insert o.version into r.knowledge. of, represented in r.knowledge. Second, each ver- (c) Integrate o.predecessors into r.knowledge. (d) store o sion v stored at replica r contains in v.predecessors a representation of the entire set of causally pre- Fig. 1 A generic replica synchronization protocol using ceding versions. More specifically, we require the causality information. maintenance of a set r.knowledge per replica r and v.predecessors per stored version v, as follows: This protocol clearly maintains the two desired Definition 4 (The Knowledge Invariant) For invariants. However, for practical purposes, repre- every replica r and version v, we require r to main- senting the full knowledge and predecessors sets is tain a set r.knowledge, such that if v ∈ r.knowledge too costly. The challenge is to represent causality in then replica r stores version v or a version w such a space-efficient manner, suitable for very large ob- that v ≺ w. ject sets and moderate-size replica sets, while main- When a replica is first created, its knowledge taining the invariants. The detailed solution follows set is empty. Local updates produce new versions in the next section.

4.2 Concise Version Vectors

The key to our novel conflict-detection technique is to transform the predecessors sets into different sets that can be represented more efficiently. We first require the following technical definition:

Definition 6 (Extrinsic) Let o be some object and o.predecessors its predecessor set. Let S be a set of versions. We denote by S|o.name the reduction of S to versions pertaining to object o.name only. S is called extrinsic to o if S|o.name = o.predecessors.

The surprising storage saving in PVE is derived from the following realization. For any object o, we can use a set extrinsic to o in place of o.predecessors throughout the protocol. In particular, when a replica's knowledge set is extrinsic to o, the knowledge may be used as the predecessors set for o. The main storage savings is derived from using an empty predecessors set for an object to denote (by convention) that the replica's knowledge set may be used instead. The following rule is the root of the PVE storage and communication savings:

Property 1 At any point in the protocol, any predecessors set may be replaced with an extrinsic set. By convention, an empty predecessors set indicates the replica's knowledge set.

We are now ready to introduce the novel PVE conflict detection scheme, which considerably reduces the size of representations of predecessor versions in normal cases.

Versions and Predecessor Vectors. The scheme uses the per-replica counter defined in Definition 1, which enumerates updates generated by the replica on all objects. Hence, a replica r maintains a local counter c. When replica r generates a version on an object o, it increments the local counter and creates version ⟨r,c⟩ on object o. Predecessors are represented using the PVEs defined in Definition 3.

Knowledge. A replica r maintains in r.knowledge a PVE representing all the versions it knows of. Inserting a new version ⟨s,ns⟩ into r.knowledge is done by updating the highest version seen from s to ⟨s,ns⟩, and possibly inserting exceptions if there are holes between ns and the previous highest version from s.
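The insertion of a newly learned version into such a knowledge PVE can be sketched as follows (our illustration; as before, each component is encoded as the highest counter seen for a replica plus a set of exception counters):

def insert_version(knowledge, version):
    """Record <rep, ctr> in a knowledge PVE, adding exceptions for any counters
    between the previously highest one and ctr (the "holes")."""
    rep, ctr = version
    highest, missing = knowledge.get(rep, (0, set()))
    if ctr > highest:
        knowledge[rep] = (ctr, missing | set(range(highest + 1, ctr)))
    elif ctr in missing:
        knowledge[rep] = (highest, missing - {ctr})   # an old hole is filled
    # otherwise ctr was already covered and nothing changes

k = {"B": (10, set())}
insert_version(k, ("B", 14))
print(k)   # {'B': (14, {11, 12, 13})} -- versions 11-13 become exceptions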
Object Predecessors. As already mentioned, an empty (⊥) predecessors set is used whenever r.knowledge is extrinsic to an object. In all other cases, predecessors contains a PVE, describing the set of causally preceding versions on the object.

Generating a New Version. When a replica r generates a new update on an object o, the new version ⟨r,c⟩ is inserted into r.knowledge right away. Then, if o.predecessors is ⊥, nothing needs to be done to it. Implicitly, this means that the versions dominated by r.knowledge causally precede the new version. If o.predecessors is not empty, then the new version is inserted into o.predecessors without exceptions. Implicitly, this means that the set of versions that were dominated by the previous o.predecessors causally precede the new update.

Synchronization. Figure 2 below describes the full PVE synchronization protocol. Space saving using empty predecessors requires caution in maintaining the extrinsic nature of predecessor sets throughout the synchronization protocol.

First, suppose that a requestor r receives from a source s a version v with an extrinsic v.predecessors set. Unlike the simplified protocol in Figure 1, it is incorrect to merge v.predecessors into the r.knowledge set right away, since v.predecessors may contain versions of objects different from v that r does not have. Hence, only v itself can be inserted into r.knowledge.

Second, consider the state of r.knowledge at the end of r's synchronization with s. Every version v sent by s has been inserted into r.knowledge. However, there may be some versions, e.g. w ≺ v, that r.knowledge does not contain. Source replica s does not explicitly send w, because it is included in v.predecessors. But since predecessor sets are not merged into r.knowledge, it may be left not containing w. To address this, at the end of an uninterrupted synchronization with s, the requestor r merges s.knowledge into r.knowledge. The goal of the merging is to produce a vector that represents a union of all the versions included in r.knowledge and s.knowledge, and to replace r.knowledge with it. For example, merging s.knowledge = (⟨A,3⟩, ⟨B,5⟩⟨eB,4⟩, ⟨C,6⟩) into r.knowledge = (⟨A,7⟩⟨eA,6⟩, ⟨B,3⟩⟨eB,2⟩, ⟨C,1⟩) yields (⟨A,7⟩⟨eA,6⟩, ⟨B,5⟩⟨eB,4⟩, ⟨C,6⟩).

Third, should synchronization ever be disrupted in the middle, a requestor r may be left with r.knowledge lacking some versions. This happens if a version v was incorporated into r.knowledge, but some preceding version w ≺ v has not been merged in. As a consequence, in a future synchronization request, say with s′, r may (inefficiently) receive w from s′. Hence, r checks whether it can discard w by testing whether w is contained in v.predecessors (and if so, r also inserts w into r.knowledge for efficiency).
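The merge performed at the end of an uninterrupted synchronization can be sketched in the same encoding (our illustration): a counter is covered by the merged vector exactly when it is covered by either input. The worked example above is reproduced at the bottom:

def covered(pve, rep, ctr):
    highest, missing = pve.get(rep, (0, set()))
    return ctr <= highest and ctr not in missing

def merge(a, b):
    """Union of the version sets represented by the knowledge PVEs a and b."""
    out = {}
    for rep in set(a) | set(b):
        highest = max(a.get(rep, (0, set()))[0], b.get(rep, (0, set()))[0])
        missing = {c for c in range(1, highest + 1)
                   if not covered(a, rep, c) and not covered(b, rep, c)}
        out[rep] = (highest, missing)
    return out

s_knowledge = {"A": (3, set()), "B": (5, {4}), "C": (6, set())}
r_knowledge = {"A": (7, {6}),   "B": (3, {2}), "C": (1, set())}
# merged: A -> (7, {6}), B -> (5, {4}), C -> (6, set()), as in the text
print(merge(s_knowledge, r_knowledge))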

4.3 Properties

The following properties are maintained by our protocol, and are derived from the two invariants given in Definition 4 and Definition 5 (proofs are provided in the next section).

Safety: Every conflicting version received by a requestor is detected.
Nontriviality: Only true conflicts are alerted.
Liveness: At the end of a complete execution of a synchronization procedure, for all objects, the requestor r stores versions that are identical to, or that causally follow, the versions stored by source s.

1. Requestor r sends source s its knowledge set r.knowledge.
2. Source s responds with the following:
   (a) It sends s.knowledge.
   (b) For every object o it stores, for which o.version ∉ r.knowledge, it sends o. If s.knowledge is not extrinsic to o, s sends o.predecessors (otherwise, it leaves o.predecessors empty).
3. For every version o received from s, requestor r does the following:
   (a) For every object w in store, such that w.name = o.name:
       if o.version ∈ w.predecessors, or w.predecessors = ⊥ and o.version ∈ r.knowledge, then ignore o and stop;
       else if w.version ∈ o.predecessors, or o.predecessors = ⊥ and w.version ∈ s.knowledge, then delete w;
       else alert conflict.
   (b) Store o.
   (c) If o.predecessors = ⊥, then set o.predecessors = s.knowledge.
   (d) For every object w in store, such that w.name = o.name (these must be conflicting versions), if w.predecessors = ⊥ then set w.predecessors = r.knowledge.
   (e) Insert o.version into r.knowledge.
4. Merge s.knowledge into r.knowledge.
5. (Lazily) go through versions v such that v.predecessors ≠ ⊥, and if r.knowledge is extrinsic to v then set v.predecessors = ⊥.

Fig. 2 Synchronization using extrinsic predecessors; modifications from the generic protocol are indicated in boldface.

5 Correctness

This section provides a correctness proof for the protocols presented in Section 4.2.

Lemma 1 The Knowledge Invariant of Definition 4 is maintained throughout the update generation and synchronization protocol.

Proof Initially, r.knowledge is empty, and so the invariant trivially holds. The set is updated in the following events.

– When replica r generates a new version ⟨r,nr⟩, the version is inserted into r.knowledge. Clearly, this version is held in storage by r. Moreover, any preceding version ⟨r,n′r⟩, n′r < nr, must also be stored in r, or has been obsoleted by another version on the same object. Hence, no exceptions are needed.
– During synchronization, when a new version for some object o arrives, that version is inserted into r.knowledge, along with any required exceptions. Clearly, since o is stored by r (or is obsoleted by some causally succeeding version of o), the knowledge invariant holds.
– At the end of synchronization, the knowledge vector s.knowledge is merged into r.knowledge. The merging produces a vector that represents a union of s.knowledge and r.knowledge. Since merging is done only at the end of a complete synchronization, the requestor r must already store all versions included in s.knowledge or later ones. Hence, replacing r.knowledge with a vector representing the union maintains the Knowledge Invariant.

Lemma 2 Let o and w be two versions of an object. Let So be a set extrinsic to o, and Sw to w. Then the Predecessors Invariant of Definition 5 is maintained if we use So in place of o.predecessors and Sw in place of w.predecessors.

Proof If o ≺ w, then by definition o ∈ Sw. The converse holds if w ≺ o. If neither version precedes the other, then since the set of versions pertaining to object o.name in So is the same as o.predecessors, we have that w ∉ So. The converse holds for o ∉ Sw.

Lemma 3 The Predecessors Invariant of Definition 5 holds throughout the updates made by the above protocols.

Proof According to Lemma 2, replacing a predecessor vector with an extrinsic set cannot falsify the Predecessors Invariant. Let r be any replica, and o an object. Initially, o.predecessors is set to empty if the replica's knowledge is extrinsic to o. The relationship between o.predecessors and r.knowledge needs to be re-examined in the following events.

– Replica r generates an update on o. In this case, first r.knowledge is updated with the new version, hence if r.knowledge was extrinsic to o before the update, it continues being so after it.
– During synchronization, a new version of o is received from a source s. In this case, the requestor r explicitly stores the predecessor vector that arrives with the new version. Hence, the relationship between o.predecessors and r.knowledge is verified.
– At the end of synchronization, the source's knowledge s.knowledge is merged into r.knowledge. Here, the only case we must consider is that
o.predecessors was ⊥ before this step. Then we need to prove that the merged knowledge must remain extrinsic to o after the merge.

Let w be any version of object o.name included in s.knowledge. If w ∈ r.knowledge, then by the predecessors invariant it also precedes o. Hence, the extrinsic relation holds. In all other cases, we reach a contradiction. Specifically, if o ≺ w, then w would have replaced o in r's storage, and it is impossible that o is still stored. There remains the case that w and o are conflicting versions. However, in this case, in step 3(d) of the protocol above in Figure 2, o.predecessors would not be left empty. Again, we reach an impossible state according to the protocol.

Theorem 1 The protocols above maintain Safety, Liveness, and Nontriviality, as defined in Section 4.3 above.

Proof (Sketch) We already argued that the simple synchronization framework in Figure 1 maintains the desired Safety, Liveness, and Nontriviality properties, given any predecessor and knowledge sets that maintain the two invariants in Definition 4 and Definition 5. Since we proved the maintenance of the invariants, our proof is done.

6 Performance

The storage overhead associated with precedence and conflict detection comprises two components. The per-replica knowledge vector contains aggregate information about all known versions at the replica. In typical, faultless scenarios, the PVE scheme requires Õ(R) space per replica for the knowledge representation. By comparison, the VV scheme has no aggregate information on a replica's knowledge.

Additional storage overhead stems from precedence information. In the PVE scheme, faultless scenarios result in one version being maintained per object, incurring a space of Õ(N). By comparison, the VV scheme keeps Õ(R × N) storage, i.e. one version vector per object. In fairness, the per-replica counters used to generate versions in the PVE scheme may require a larger number of bits than those used in a conventional version vector scheme, since they are incremented for all updates to any object. However, WinFS can easily deal with counters that wrap around by retiring replicas once they have generated too many updates and replacing them with new replicas with fresh counters. Thus, the basic comparison between PVE and VV schemes in reliable communication environments remains valid.

The fault-free (lower-bound) storage overheads for PVE and VV are summarized in Table 1.

When failures occur, the overhead of VV remains unchanged, but the PVE scheme may gradually suffer increasing storage overheads. There are two sources of additional complexity. The first is the need to keep exceptions in the knowledge. The second is the explicit predecessor vectors (and their corresponding exceptions) kept for versions which the replica's knowledge does not dominate. In theory, neither of these components has any strict upper bound, since the set of exceptions may grow with the number of updates. These formal upper bounds are also summarized in Table 1 below. Later we provide simulation results that demonstrate storage growth in the PVE scheme relative to failure rates.

The communication overhead associated with synchronization also has two parts. First, a source and a requestor need to determine which objects have versions yet unknown to the requestor. In the PVE scheme, this is done by conveying the requestor's knowledge vector to the source. The faultless overhead here is Õ(R); the upper bound is again theoretically unbounded.

Let q denote the number of object versions that the source determines it has to send to the requestor. The second component of the communication overhead is the extra precedence information associated with these q objects. In faultless runs of the PVE scheme, this information consists of one version per object. Hence, the overhead is Õ(q). In case of faults, as explained before, some objects sent during synchronization may have explicit predecessor vectors and an unbounded number of exceptions associated with them. Hence, there is no formal upper bound on the communication overhead. Here again, our simulation studies relate this complexity with the fault rate.

As for the VV scheme, the only way to convey knowledge of the latest versions held by a replica is by explicitly listing all of them, which requires Õ(N × R) bits. Therefore, in realistic deployments of VV, replicas may keep a log of all the objects that received updates since the last synchronization with the requestor and send only the version vectors associated with these objects. The communication complexity will then be between Õ(q × R) and Õ(N × R), but the storage overhead increases due to logging.

                Version vectors   PVE
storage l.b.    Õ(N × R)          Õ(N + R)
storage u.b.    Õ(N × R)          unbounded
comm l.b.       Õ(q × R)          Õ(q + R)
comm u.b.       Õ(N × R)          unbounded

Table 1 Lower and upper bounds comparison of PVE with the version-vector scheme.

In face of communication faults, replicas using the PVE method might accumulate over time both knowledge exceptions and object versions that require explicit predecessors. There is no simple formula that describes how frequently exceptions are accrued, as this depends on a variety of parameters and exact causal ordering.

In order to evaluate the effect of communication disruptions on storage in the PVE scheme, we conducted several simple simulations. We ran R = 50 replicas, generating version updates to objects at random. The number of objects varied between N = 100 and N = 1000. Every 100 total updates, a synchronization round was carried out in a round-robin manner, with replica 1 serving updates to 2, replica 2 serving 3, and so on, up to replica R sending updates back to 1. This was repeated 100 times. We expect that other communication patterns, such as randomly selecting synchronization partners, would yield similar simulation results. A failure-probability variable pfail controlled the chances of a communication disruption within every pairwise synchronization. The disruption occurred at the end of the synchronization procedure, thus potentially causing the maximal number of exceptions. We measured the resulting average communication and storage overhead. These are depicted in Figure 3 for two cases, 100 and 1000 objects. We normalize the overhead to per-object overhead. For reference, the per-object storage overhead in standard VVs is exactly R = 50. The best achievable communication overhead with VVs (without logging) is also R = 50, and is depicted for reference.

The figure clearly indicates a tradeoff in the PVE scheme. When communication disruptions are reasonably low, PVE storage and communication overhead is substantially reduced compared with the VV scheme, even for a relatively small number of objects. As the failure rate increases, the number of exceptions in the aggregate vector rises, and the total storage used for knowledge and for predecessor sets increases. The point at which the per-object amortized overhead passes that of a single VV depends on the number of objects. For quite moderate size systems (1000 objects), the cut-off point is beyond a 90 percent communication disruption rate.

7 Related Work

In weakly consistent replicated databases and file systems, conflicts are generally defined as concurrent updates to an object or file. In other words, conflicts arise when two clients independently update the same file at different replicas. More semantic definitions of update conflicts that take into account the needs of particular applications have been supported in replicated database systems like Bayou [14]. Even though WinFS provides a richer data model than a conventional file system, we adopted the same notion of conflicts as in previous replicated file systems but devised a new scheme for detecting when conflicts occur.

In a client-server architecture where a hub-and-spoke or star topology is used for replication, conflict detection is relatively easy. For example, in the Coda system [8], servers maintain a version number for each file. A client records, for each file that it locally caches, the version number that the file had when it was retrieved from the server. When the client reconciles its local updates with the server after a period of disconnection, the client checks the current version of each file on the server. If the server's version for a file differs from the version on which the client based its update, then a conflict has occurred. This simple form of optimistic concurrency control works because the server is a central authority for each file.

Version vectors were devised for systems in which replicas reconcile with each other in a peer-to-peer fashion. Locus was the first system to use version vectors to detect concurrent updates to files [6], although Fischer and Michael proposed a similar data structure for resolving insert/delete ambiguities in replicated dictionaries [4]. Locus stored a version vector with each replica of each file. A file's version vector included an entry for each site on which a copy of the file was replicated; entries in the version vector indicated the number of updates made to the file by each site. Two copies of a file are determined to be in conflict if their associated version vectors are incompatible, meaning that one version vector does not dominate the other. Follow-on systems to Locus, such as Ficus, Rumor, and Roam, utilize this same technique to detect conflicting file updates [7,11].

It has been shown that version vectors or related data structures, like vector clocks, are necessary to detect the causal ordering of events in a distributed system [3,13]. Concerns about the unbounded size of version vectors have caused some researchers to propose compact representations [1,2,5,15] or techniques for pruning entries that are no longer needed, such as entries that are globally known by all replicas [11,12]. These techniques could be adopted for use in WinFS, though they are less necessary since, as described in earlier sections, the PVE scheme maintains a single version vector for an entire replica rather than one for each file.

8 Conclusions

In optimistically replicated systems, metadata must be maintained for detecting conflicts caused by concurrent updates to data objects. The overhead for storing and communicating such metadata can be prohibitive. The traditional technique of using per-object version vectors simply does not scale to systems with thousands of replicas and large numbers of objects. WinFS, a new structured storage platform developed at Microsoft, was designed to support information management applications, such as electronic mail, with potentially millions of fine-grained data objects. WinFS allows objects to be replicated across machines running Windows, and thus must scale from a handful of replicas in a home to thousands of replicas within a global corporation. In this paper, we have shown that the WinFS design meets these scalability demands by requiring only a single version vector per machine along with simple versions for each object. We provide the first proof that this unusually small amount of metadata is sufficient to detect concurrent writes to any data object.

An analytical comparison of the PVE scheme of WinFS to the conventional version vector scheme showed a factor of 50 improvement for scenarios with 50 replicas. These results were confirmed through simulation. As the number of replicas and the number of data objects increase, the benefits of the PVE design become even more pronounced. However, when running over unreliable networks, PVE synchronization sessions can be disrupted before their full completion, and the metadata maintained by PVE can grow over time due to holes in a replica's knowledge. In theory, PVE overheads can exceed the overhead of per-object version vectors. Our simulation studies show that, even for a small number of objects (100) and a modest number of replicas (50), this is only a concern when the percentage of failed synchronization sessions exceeds 40%, an unacceptably high unreliability in practice. For systems of 1000 objects, PVE overheads are strictly less even if 95% of synchronization sessions terminate prematurely.

In the future, we plan to evaluate the PVE design with real workloads gathered from the emerging WinFS applications. These studies should shed further light on the practical benefits of the new conflict detection scheme developed for WinFS.

Acknowledgements

The protocol described in this paper was designed by the Microsoft WinFS product team, which included one of the authors (Doug Terry). We especially acknowledge Irena Hudis and Lev Novik for pushing the idea of concise version vectors. Harry Li, Yuan Yu, and Leslie Lamport helped with the formal specification of the replication protocol and the proof of its correctness.

References

1. J. B. Almeida, P. S. Almeida, and C. Baquero. Bounded version vectors. In Proceedings of the International Symposium on Distributed Computing (DISC), pages 102–116, 2004.
2. A. Arora, S. S. Kulkarni, and M. Demirbas. Resettable vector clocks. In 19th Symposium on Principles of Distributed Computing (PODC), 2000.
3. C. Fidge. Timestamps in message-passing systems that preserve the partial ordering. In Proceedings of the 11th Australian Computer Science Conference, pages 56–66, 1988.
4. M. J. Fischer and A. Michael. Sacrificing serializability to attain high availability of data in an unreliable network. In Proceedings of the SIGACT-SIGMOD Symposium on Principles of Database Systems, March 1982.
5. Y.-W. Huang and P. Yu. Lightweight version vectors for pervasive computing devices. In Proceedings of the IEEE International Workshops on Parallel Processing, pages 43–48, 2000.
6. D. S. Parker (Jr.), G. J. Popek, G. Rudisin, A. Stoughton, B. J. Walker, E. Walton, J. M. Chow, D. Edwards, S. Kiser, and C. Kline. Detection of mutual inconsistency in distributed systems. IEEE Transactions on Software Engineering, 9(3):240–247, May 1983.
7. T. W. Page (Jr.), R. G. Guy, J. S. Heidemann, D. H. Ratner, P. L. Reiher, A. Goel, G. H. Kuenning, and G. Popek. Perspectives on optimistically replicated peer-to-peer filing. Software – Practice and Experience, 11(1), December 1997.
8. J. J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda file system. ACM Transactions on Computer Systems, 10(1):3–25, February 1992.
9. R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing high availability using lazy replication. ACM Transactions on Computer Systems, 10(4):360–391, 1992.
10. L. Novik, I. Hudis, D. B. Terry, S. Anand, V. J. Jhaveri, A. Shah, and Y. Wu. Peer-to-peer replication in WinFS. Technical Report MSR-TR-2006-78, Microsoft, June 2006.
11. D. H. Ratner. Roam: A Scalable Replication System for Mobile and Distributed Computing. PhD thesis, 1998. UCLA Technical Report UCLA-CSD-970044.
12. Y. Saito. Unilateral version vector pruning using loosely synchronized clocks. Technical Report HPL-2002, HP.
13. R. Schwarz and F. Mattern. Detecting causal relationships in distributed computations: In search of the holy grail. Distributed Computing, 7(3):149–174, 1994.
14. D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer, and C. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proceedings of the 15th Symposium on Operating Systems Principles (SOSP), pages 172–183, December 1995.
15. F. Torres-Rojas and M. Ahamad. Plausible clocks: constant size logical clocks for distributed systems. Distributed Computing, 12(4):179–196, 1999.

D. Malkhi
Dahlia Malkhi is a Principal Researcher in the Microsoft Research Silicon Valley lab. She received her Ph.D., M.Sc. and B.Sc. degrees in 1994, 1988, and 1985, respectively, from the Hebrew University of Jerusalem, Israel. During the years 1995–1999 she was a member of the Secure Systems Research Department at AT&T Labs-Research in Florham Park, New Jersey.

Her research interests include all areas of distributed systems.

D. Terry
Doug Terry is a Principal Researcher in the Microsoft Research Silicon Valley lab. His research focuses on the design and implementation of novel distributed systems and addresses issues such as information management, fault-tolerance, and mobility. He currently is serving as Chair of ACM's Special Interest Group on Operating Systems (SIGOPS). Prior to joining Microsoft, Doug was the co-founder and CTO of Cogenia, Chief Scientist of the Computer Science Laboratory at Xerox PARC, and an Adjunct Professor in the Computer Science Division at U. C. Berkeley, where he regularly teaches a graduate course on distributed systems. Doug has a Ph.D. in Computer Science from U. C. Berkeley.

[Figure 3 appears here: two plots ("100 object system" and "1000 object system") of per-object overhead versus percent of disrupted synchronizations, with curves for PVE storage overhead, PVE communication overhead, and VV.]

Fig. 3 Per-object storage and communication overheads for varying communication failure frequency with N = 100 objects (top) and N = 1000 objects (bottom).
