
Concise Version Vectors in WinFS

Dahlia Malkhi · Doug Terry

Received: date / Accepted: date

Abstract Conflicts naturally arise in optimistically replicated systems. The common way to detect update conflicts is via version vectors, whose storage and communication overhead are proportional to the number of replicas × the number of objects. These costs may be prohibitive for large systems. This paper presents predecessor vectors with exceptions (PVEs), a novel optimistic replication technique developed for Microsoft's WinFS system. The paper contains a systematic study of PVE's performance gains over traditional schemes. The results demonstrate a dramatic reduction of storage and communication overhead in normal scenarios, during which communication disruptions are infrequent. Moreover, they identify a cross-over threshold in communication failure-rate, beyond which PVE loses efficiency compared with traditional schemes.

D. Malkhi
Microsoft Research Silicon Valley
E-mail: [email protected]

Doug Terry
Microsoft Research Silicon Valley

1 Introduction

Consider an information system, such as an e-mail client, that is composed of multiple data objects, holding folders, files and tags. Data may be replicated in multiple sites. For example, a user's mailbox may reside at a server, on the user's office and home PCs, on the user's laptop, and on a PDA. The system allows concurrent, optimistic updates to its objects from distributed locations without communication or centralized control. For example, the user might hop on the plane with a copy of her mailbox on a laptop and edit various parts of it while disconnected; she may introduce changes on a PDA, and so on. At some point, when these computers are connected, she wishes to synchronize object versions across replicas and be alerted to any conflicts.

This system model arises naturally within the scope of Microsoft's WinFS platform, which was designed to provide peer-to-peer weakly consistent replicated storage facilities. The model is fundamental in distributed systems, and numerous replication methods exist to support it. However, the applications that are targeted by WinFS mandate taking scale more seriously than ever before. In particular, e-mail repositories, log files, digital libraries, and application databases can easily reach millions of objects. Hence, communicating even a single bit per object (e.g., a 'dirty' bit) in order to be able to synchronize replicas might simply be too costly.

In this paper, we present a precise description and correctness proof of the replica reconciliation and conflict detection mechanism inside Microsoft's WinFS system. We name the scheme predecessor vectors with exceptions (PVE). We produce a systematic study of the performance gains of PVE and provide a comparison with the traditional optimistic replication scheme. The results demonstrate a substantial reduction in the storage and communication overhead associated with replica synchronization in most cases. In conditions that allow synchronizations to complete without communication breaks, a pair of replicas needs only communicate a constant number of bits per replica in order to detect discrepancies in their states. Moreover, they need to maintain only a single counter per object in order to determine the causal ordering of objects' versions and detect any conflicting versions arising from concurrent updates. Our study also demonstrates the "cut-off" point in the communication fault-rate, beyond which the PVE technique becomes less attractive than the alternatives.

In order to understand the efficiency leap offered by the PVE scheme, let us review the most well known alternative.
Version Vectors (VVs) [6] are traditionally used in optimistic replication systems in order to decide which replica has the more up-to-date contents for each object, as well as to detect conflicting versions. Per-object version vectors were pioneered in the Locus distributed file system [6], and subsequently employed in various optimistic replication systems [7,11,9].

The version vector for a data object is an array of size R, where R is the number of replicas in the system. Each replica has a pair ⟨replica, counter⟩ in the vector, indicating the number of modifications performed on the object by the replica. For example, suppose that we have three replicas, A, B, and C. An object is initialized with version vector (⟨A,0⟩, ⟨B,0⟩, ⟨C,0⟩). An update to the object initiated at replica A increments A's component, and so generates version vector (⟨A,1⟩, ⟨B,0⟩, ⟨C,0⟩). Later, B may obtain the updated object from A, along with its version vector, and produce another update on the object. The newer object state receives version vector (⟨A,1⟩, ⟨B,1⟩, ⟨C,0⟩). And so on.

A version vector V dominates another vector W if every component of V is no less than the corresponding component of W; V strictly dominates W if it dominates W and one component of V is greater. Due to optimism, objects on different replicas may have version vectors that are incomparable by the domination relation. This corresponds to conflicting versions, indicating that simultaneous updates were introduced to the object at different replicas. For example, continuing the scenario above, suppose that all replicas have version vector (⟨A,1⟩, ⟨B,1⟩, ⟨C,0⟩) for a stored object. Now replicas A and C produce diverging updates on the object simultaneously. These updates generate version vectors (⟨A,2⟩, ⟨B,1⟩, ⟨C,0⟩) and (⟨A,1⟩, ⟨B,1⟩, ⟨C,1⟩), respectively, which are conflicting since neither one dominates the other.
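To make the domination relation concrete, here is a minimal Python sketch (our illustration, not code from any of the cited systems); a version vector is represented as a dictionary from replica id to that replica's update counter:

# A version vector maps each replica id to the number of updates that
# replica has applied to the object.
def dominates(v, w):
    """True if every component of v is at least the matching component of w."""
    return all(v.get(rep, 0) >= cnt for rep, cnt in w.items())

def conflict(v, w):
    """Two versions conflict if neither vector dominates the other."""
    return not dominates(v, w) and not dominates(w, v)

# The diverging updates from the example above:
va = {"A": 2, "B": 1, "C": 0}   # update introduced at replica A
vc = {"A": 1, "B": 1, "C": 1}   # concurrent update introduced at replica C
assert conflict(va, vc)          # neither vector dominates the other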
Consider a system with N objects replicated across R replicas. Further, consider the synchronization between two replicas that have differing versions for q objects. The VV scheme is designed for synchronizing replicas object by object, and incurs the following costs:

1. Store a version vector per object, incurring a storage overhead of Õ(N × R) bits;¹
2. Communicate information that allows the two replicas to determine which objects one should send the other and to detect conflicts. A naive implementation sends all N version vectors, incurring a communication overhead of Õ(N × R). If the replicas store logs of recent updates and maintain additional information about the log entries that have previously been synchronized with other replicas, they may bring the cost down close to Õ(q × R), which is the lowest possible communication overhead with the VV scheme.

¹ For simplicity of notation, Õ(·) indicates the same complexity order as O(·) up to logarithmic factors of N and R, which may be required to encode any single value in our settings.

These cost measures and their analysis are made more precise later in the paper. Note that even for moderate numbers of replicas R, storing N × R counters is a substantial burden when N is large. Moreover, communicating between Õ(q × R) and Õ(N × R) overhead bits may be prohibitive.

In WinFS, the goal is to quickly synchronize heavy-volume replicas, each carrying large magnitudes of objects. In situations where communication disruptions are not the norm, the innovative PVE mechanism in WinFS reduces storage and communication costs by a considerable amount. Replicas need only exchange Õ(R) information bits to determine their differences, i.e. the equivalent of communicating a single version vector. In addition, in most cases, per-object meta-information storage and communication is constant (one counter). Table 1 in Section 6 contains a summary of these complexities. In the remainder of this paper, we describe the foundations of the PVE replication protocol and compare it against the traditional VV scheme.

The contributions of this paper are three-fold. First, we give a precise and detailed formulation of the PVE replica reconciliation protocol employed in WinFS. Second, we develop a performance model capturing the cost measures of interest and quantify the performance gains of the PVE scheme compared with known methods. Third, we evaluate the PVE scheme via simulation under complex system conditions with increasing communication failure rates. This evaluation reveals a cut-off point that characterizes the benefit area of the PVE scheme over traditional version vectors. We note that the full design and architecture of the WinFS replication platform is the result of a large team effort and is beyond the scope of this paper; we refer the interested reader to [10] for a wider coverage of the WinFS architecture.

The next section gives a detailed problem statement. Section 3 provides an informal overview of the PVE scheme. Section 4 provides a precise description of the method and lays the foundation for reasoning about its correctness. Correctness proofs are provided in Section 5. Section 6 contains a performance study. Section 7 surveys related work, and Section 8 concludes.

2 Problem Statement

In this section, we begin with the precise specification of our problem. Later sections provide a rigorous treatment of the solution.

The system consists of a collection of data objects, potentially numerous. Each object might be quite small, e.g. a mail entry or even a status word. Objects are replicated on a set of hosts. Each host may locally introduce updates to any object without any concurrency control. These updates create a partial ordering of object versions, where updates that sequentially follow one another are causally related, but unrelated updates are conflicting.

Our focus is on distributed systems in which updates overwrite previous versions. The alternative would be database or journal systems in which the history of updates on an object is logged and applied at every replica. State-based replication saves storage and computation and is suitable for the kind of information-intensive applications for which WinFS was designed, e.g. a user's Outlook files, where updates may be numerous. In state-based replication systems, only the most recent version of any object needs to be sent during synchronization. Nevertheless, it is worth noting that the method presented in this paper can work, with (minor) appropriate modifications, for log-based replication systems. For brevity, we omit this from discussion in this paper.

The goal is to provide a lightweight replica reconciliation and conflict detection mechanism. The mechanism should provide two communicating replicas with the means to detect precedence ordering on object versions that they hold and detect any conflicts, while requiring only a small amount of per-object overhead. With this mechanism, replicas can bring each other up-to-date and report conflicts.

More precisely, we now describe objects, versions, and causality. An object is identified uniquely by its name. Objects are instantiated with versions, where an object instance has the following fields:

name: the unique identifier.
version: a pair ⟨replica id, counter⟩.
predecessors: a set of preceding versions (including the current version).
data: opaque application-specific information.

Because versions uniquely determine objects' instances, we simply refer to any particular instance by its version. Over space and time, there may be multiple versions with the same object name. We say that these are versions of the same object.

There is a partial, causal ordering among different versions of the same object. When a replica A creates an instance of an object o with version v, the set W of versions previously known by replica A on o causally precedes version v. In notation, W ≺ v. For every version w ∈ W, we likewise say that w causally precedes v; in notation, w ≺ v. Causality is transitive.

Since the system permits concurrent updates, the causality relation is only a partial order, i.e. multiple versions might follow any single version. When two versions do not follow one another, they are conflicting. In other words, if w ⊀ v ∧ v ⊀ w, then v and w are said to conflict.

It is desirable to detect and resolve conflicts, either automatically (when application-specific conflict resolution code is available) or by alerting users who can resolve conflicts manually. In either case, a resolution of conflicting versions is a version that causally follows both. For example, here is a conflict and its resolution: v0 ≺ v ≺ w; v ⊀ u; u ⊀ v; v0 ≺ u ≺ w.

New versions override previous ones, and so replicas are generally only interested in the most recent version available; versions that causally precede it are obsolete and carry no valuable information. This simple rule is complicated by the fact that multiple conflicting versions may exist; in this case, replicas are interested in all concurrent versions until they are resolved.
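As an illustration of this object model, the following Python sketch (ours; the field names mirror the list above, but the code is not part of WinFS) decides precedence and conflicts directly from the predecessors sets:

from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str                    # unique object identifier
    version: tuple               # (replica id, counter)
    predecessors: set = field(default_factory=set)   # includes version itself
    data: bytes = b""            # opaque application-specific information

def precedes(v, w):
    """v precedes w (v is an older version of the same object)."""
    return v.version != w.version and v.version in w.predecessors

def conflicts(v, w):
    """Versions of the same object conflict if neither precedes the other."""
    return (v.name == w.name and v.version != w.version
            and not precedes(v, w) and not precedes(w, v))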
2.1 Performance Measures

This paper is concerned with mechanisms that facilitate synchronization of different replicas. The challenge is to bring the storage and communication costs associated with replica reconciliation (significantly) down. More precisely, we focus on two performance measures:

Storage is the total number of overhead bits stored in order to preserve version ordering.
Communication is (i) the total number of bits communicated between two replicas in order to determine which updates are known to one replica but not the other, and (ii) any overhead data that is transferred along with objects' states in order to determine precedence/conflicts.

3 Overview of the PVE Method

This section provides an informal overview of the PVE scheme. Later sections provide a more formal description and a proof of correctness.

The PVE scheme works as follows. An object version is a pair ⟨replica, counter⟩. Instead of using separate counters for distinct objects, the scheme uses one per-replica counter to enumerate the versions that the replica generates on all objects (the counter is shared across all objects). For example, suppose that replica A first introduces an update to object o1 and second to o2. The versions corresponding to o1 and to o2 will be ⟨A,1⟩ and ⟨A,2⟩, respectively. Note that versions are not full vectors, as in the traditional VV scheme described in the Introduction. Each object has, in addition to its version, a predecessor set that captures the versions that causally precede the current one.
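For concreteness, a per-replica counter can be sketched as follows (a hypothetical illustration; the class and method names are ours):

class Replica:
    def __init__(self, name):
        self.name = name
        self.counter = 0                  # one counter shared by all objects

    def new_version(self, obj_name):
        self.counter += 1                 # enumerate updates across all objects
        return (self.name, self.counter)  # obj_name does not affect the counter

A = Replica("A")
print(A.new_version("o1"))   # ('A', 1)
print(A.new_version("o2"))   # ('A', 2)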

Predecessor sets are captured in PVE using version vectors, though we will show momentarily that, in most cases, PVE can replace these vectors with a null pointer. In order to distinguish these vectors from the traditional version vectors, we call them predecessor vectors. A predecessor vector (PV) contains one version, the latest, per replica. When a replica A generates a new object version, the PV associated with the new version contains the latest versions known by A on the object from each other replica. For example, suppose we have three replicas, A, B, and C. A new object starts with a zeroed predecessor vector (⟨A,0⟩, ⟨B,0⟩, ⟨C,0⟩). Consider the two versions generated by replica A above on o1 and o2, ⟨A,1⟩ and ⟨A,2⟩, respectively. When A creates these versions, no other versions are known on either o1 or o2, hence the PV of ⟨A,1⟩ is (⟨A,1⟩, ⟨B,0⟩, ⟨C,0⟩), and the PV of ⟨A,2⟩ is (⟨A,2⟩, ⟨B,0⟩, ⟨C,0⟩). A (causally) subsequent update to o1 by replica B creates version ⟨B,1⟩, with PV (⟨A,1⟩, ⟨B,1⟩, ⟨C,0⟩). This predecessor vector represents the latest versions known by B on object o1.

The formal definitions capturing the predecessor vectors scheme are given below.

Definition 1 (Per-Replica Counter) Let X be a replica. The versions generated by replica X on objects are the ordered sequence {⟨X,i⟩}, i = 1, 2, ....

Definition 2 (Predecessor Vectors) Let X1, ..., XR be the set of replicas. A predecessor vector (PV) is an R-array of tuples of the form (⟨X1,i1⟩, ..., ⟨XR,iR⟩).
A predecessor vector (⟨X1,i1⟩, ..., ⟨XR,iR⟩) dominates another vector (⟨X1,j1⟩, ..., ⟨XR,jR⟩) if ik ≥ jk for k = 1..R, and it strictly dominates if iℓ > jℓ for some 1 ≤ ℓ ≤ R.
By a natural overload of notation, we say that a predecessor vector (⟨X1,i1⟩, ..., ⟨XR,iR⟩) dominates a version ⟨Xk,jk⟩ if ik ≥ jk; strict domination follows accordingly with strict inequality.

The reader should first note that despite the aggregation of multiple-object versions using one counter, predecessor vectors can express precedence relations between versions of the same object. For example, in the scenario above, version ⟨A,1⟩ precedes ⟨B,1⟩, and is indeed dominated by the PV associated with version ⟨B,1⟩. Moreover, PVs do not create false conflicts. The reason is that incomparable predecessor vectors conflict only if they belong to the same object. So, for example, suppose that, continuing the scenario above, replica A introduces version ⟨A,3⟩ to object o1 with predecessor vector (⟨A,3⟩, ⟨B,1⟩, ⟨C,0⟩); and simultaneously, replica C introduces version ⟨C,1⟩ on o2, with the corresponding PV (⟨A,2⟩, ⟨B,0⟩, ⟨C,1⟩). These versions would be conflicting had they belonged to the same object, but are fine since they are never compared against each other.

Hence, comparing different versions for the same object is now possible as in the traditional use of version vectors. Namely, the same domination relation among predecessor vectors and versions can determine precedence and conflicts of updates to the same object.

Reducing the Overhead. So far we have not introduced any space savings over traditional version vectors. The surprising benefit of PVs is as follows. Let X.knowledge denote the component-wise maximum of the PVs of all the versions held by a replica X. The performance savings stems from the following fact: in order to represent ordering relations of all the versions X stores for all objects, it suffices for replica X to store only X.knowledge. Knowledge aggregates the predecessor vectors of all objects and is used instead of per-object PVs. More specifically, knowledge replaces PVs as follows:

– No PV is stored per object at all. The only vector stored by a replica is its aggregate knowledge vector.
– In order for A to determine which data objects in its store are more up-to-date than B's store, B simply needs to send B.knowledge to A. Using the difference between A.knowledge and B.knowledge, A can determine which versions it should send B.
– Having determined the q relevant newer objects, A sends these objects with (only) a single version each. In addition, A needs to send (once) its knowledge vector.
The reader may be concerned at this point that information is lost regarding the ability to determine version precedence. We now demonstrate why this is not the case. When two replicas, A and B, wish to compare their latest versions of the same object o, say ⟨r,nr⟩ and ⟨s,ns⟩ respectively, they simply compare these against B.knowledge and A.knowledge respectively. If A.knowledge dominates ⟨s,ns⟩, then the version currently held by A for object o, namely ⟨r,nr⟩, strictly succeeds ⟨s,ns⟩. And vice versa. If neither of these knowledge vectors dominates the other version, then these are conflicting versions.

Going back to the scenario presented above, replica A has in store the following: o1.version = ⟨A,3⟩; o2.version = ⟨A,2⟩; knowledge = (⟨A,3⟩, ⟨B,1⟩, ⟨C,0⟩). Replica C stores the following: o1.version = ⟨B,1⟩; o2.version = ⟨C,1⟩; knowledge = (⟨A,2⟩, ⟨B,1⟩, ⟨C,1⟩). When comparing their versions for object o1, A and C will find that A's version is more recent, and when comparing their versions of object o2, they will find C's version to be the recent one.
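The comparison just described can be sketched in Python as follows (our illustration for the faultless case, in which knowledge vectors have no exceptions and are plain maps from replica id to the highest counter known):

def knows(knowledge, version):
    """True if the knowledge map dominates the single version (rep, ctr)."""
    rep, ctr = version
    return knowledge.get(rep, 0) >= ctr

def compare(my_version, my_knowledge, other_version, other_knowledge):
    if my_version == other_version:
        return "equal"
    if knows(my_knowledge, other_version):
        return "mine newer"          # my stored version strictly succeeds the other
    if knows(other_knowledge, my_version):
        return "other newer"
    return "conflict"

# The scenario above, seen from replica A when talking to replica C:
A_knowledge = {"A": 3, "B": 1, "C": 0}
C_knowledge = {"A": 2, "B": 1, "C": 1}
print(compare(("A", 3), A_knowledge, ("B", 1), C_knowledge))   # o1 -> 'mine newer'
print(compare(("A", 2), A_knowledge, ("C", 1), C_knowledge))   # o2 -> 'other newer'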

The result is that the storage overhead in WinFS is Õ(N + R), instead of Õ(N × R). More dramatically, the communication overhead associated with synchronization is reduced. The communication overhead of sending knowledge is Õ(R), and the total communication overhead associated with synchronizing replicas is Õ(q + R).

Dealing with Disrupted Synchronization. Synchronization among two replicas may fail to complete due to network disruption. One way of coping with this is to abort incomplete synchronization procedures; then no further complication to the above scheme is needed.

However, in reality, due to the large volumes that may need to be synchronized, aborting a partially-completed synchronization may not be desirable (and in fact, may create increasingly larger and larger synchronization demands that might become less and less likely to complete). The aggregate knowledge method above introduces a new source of difficulty due to incomplete synchronizations. Let us demonstrate this problem. When replica A receives an object's new version from another replica B, that object does not carry a specific PV. Suppose that before synchronizing with B, the highest version A stores from B on any object is ⟨B,10⟩. If B sends ⟨B,14⟩, then clearly versions ⟨B,11⟩, ⟨B,12⟩, and ⟨B,13⟩ are missing in A's knowledge, hence there are "holes".

It is tempting to try to solve this by a policy that mandates sending all versions from one replica in an order that respects their generation order. In the above scenario, send ⟨B,11⟩ before ⟨B,14⟩, unless that version has been obsoleted by another version. Then, when ⟨B,14⟩ is received, A would know that it must already reflect ⟨B,11⟩, ⟨B,12⟩, and ⟨B,13⟩.

Unfortunately, this strategy is impossible to enforce, as illustrated in the following scenario. Object o1 receives an update from replica A with version ⟨A,1⟩ and PV (⟨A,1⟩, ⟨B,0⟩, ⟨C,0⟩). Meanwhile, object o2 is updated by B, producing version ⟨B,1⟩ with PV (⟨A,0⟩, ⟨B,1⟩, ⟨C,0⟩). Replicas A and B synchronize and exchange their latest updates. Subsequently, object o1 is updated at replica B with version ⟨B,2⟩ and PV (⟨A,1⟩, ⟨B,2⟩, ⟨C,0⟩); and object o2 is updated at replica A with version ⟨A,2⟩ and PV (⟨A,2⟩, ⟨B,1⟩, ⟨C,0⟩). The orderings between all versions are as follows:

o1: [⟨A,1⟩; PV = (⟨A,1⟩, ⟨B,0⟩, ⟨C,0⟩)] ≺ [⟨B,2⟩; PV = (⟨A,1⟩, ⟨B,2⟩, ⟨C,0⟩)]
o2: [⟨B,1⟩; PV = (⟨A,0⟩, ⟨B,1⟩, ⟨C,0⟩)] ≺ [⟨A,2⟩; PV = (⟨A,2⟩, ⟨B,1⟩, ⟨C,0⟩)]

Then replica B synchronizes with replica A, sending it all of its recent updates. Replica A now stores: o1.version = ⟨B,2⟩; o2.version = ⟨A,2⟩; knowledge = (⟨A,2⟩, ⟨B,2⟩, ⟨C,0⟩).

Now suppose that replica C, which has been detached for a while, comes back and synchronizes with replica A. During this synchronization, only the most recent versions of objects o1 and o2 are sent to replica C. In this scenario, there is simply no way to prevent holes. Replica C may first obtain o1's recent version, ⟨B,2⟩, and then have its communication cut. Then version ⟨B,1⟩ (which happens to belong to o2) is missing. A similar situation occurs if replica C obtains o2's recent version first and is then disconnected.

It is worth noting that although seemingly we don't care about the missing, obsoleted versions, we cannot ignore them. If the subsequent versions are lost from the system for some reason, inconsistency may result. For example, in the first case above, the missing o2 version ⟨B,1⟩ is subsumed by a later version ⟨A,2⟩. However, if replica C simply includes ⟨B,2⟩ in its knowledge vector, and replica A crashes such that ⟨A,2⟩ is forever lost from the system, C might never obtain the latest state of o2 from replica B.

The price paid in the PVE scheme for its substantial storage and communication reduction is the need to maintain information about such exceptions. In the above scenario, replica C will need to store exception information as follows. First, C.knowledge will contain (⟨A,0⟩, ⟨B,2⟩, ⟨C,0⟩) with an exception ⟨eB,1⟩.²

² An alternative form of exception is to store (⟨A,0⟩, ⟨B,0⟩, ⟨C,0⟩) with a 'positive exception' ⟨eB,2⟩. The two alternatives result in different storage load under different scenarios, positive exceptions being preferable under long synchronization gaps. For simplicity, we use negative exceptions in the description here, although the method employed in WinFS uses positive exceptions.
Definition 3 (PVs with Exceptions) A predecessor vector with exceptions (PVE) is an R-array of tuples of the form (⟨X1,i1⟩⟨eX1,j1⟩…⟨eX1,jk1⟩, ..., ⟨XR,iR⟩⟨eXR,jR⟩…⟨eXR,jkR⟩).
A version ⟨Xk,jk⟩ is dominated by a predecessor vector X with exceptions as above if ik ≥ jk, and jk is not among the exceptions in the k'th position in X.
A predecessor vector with exceptions X dominates another vector Y if the respective PVs without the exceptions dominate, and no exception included in X is dominated by Y.
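A minimal Python sketch of the domination test for a single version under Definition 3 (our own encoding, not the WinFS representation): each component keeps the highest counter seen for a replica together with the set of exception counters that are explicitly not covered:

# pve[rep] = (highest counter seen from rep, set of exception counters below it)
def dominates_version(pve, version):
    """A PVE dominates <rep, ctr> if ctr is at most the highest counter for rep
    and ctr is not listed as an exception."""
    rep, ctr = version
    highest, missing = pve.get(rep, (0, set()))
    return ctr <= highest and ctr not in missing

# Replica C's knowledge after the disrupted synchronization described above:
C_knowledge = {"A": (0, set()), "B": (2, {1}), "C": (0, set())}
assert dominates_version(C_knowledge, ("B", 2))       # covered
assert not dominates_version(C_knowledge, ("B", 1))   # the exception (a "hole")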

Second, we require that a replica maintain an explicit PV for every new version it obtains via a partial synchronization. These explicit PVs may be omitted only if the replica's knowledge dominates them. Continuing the scenario above, we demonstrate a subtle chain of events which necessitates this additional overhead.

Consider the information stored at replica C after the partial synchronization: o1.version = ⟨B,2⟩; o2.version = ⊥; knowledge = (⟨A,0⟩, ⟨B,2⟩⟨eB,1⟩, ⟨C,0⟩). Suppose that A synchronizes with C and sends it update ⟨A,1⟩ on o1. This update clearly does not follow ⟨B,2⟩ (the current version of o1 held by C), but according to C's knowledge, neither is it succeeded by it – a conflict! The problem, of course, is that C's knowledge no longer dominates version ⟨A,1⟩.

Only at the end of the synchronization procedure is the knowledge of the sending replica merged with the knowledge of the receiving replica. At that point, knowledge at the receiving replica will clearly dominate all of the versions it received during synchronization, and their PVs may be omitted. But if synchronization is cut in the middle, some of these PVs must be kept until such time when the replica's knowledge again dominates them.

In our performance analysis and comparison with other methods, we take into account this cost and measure its effect. Note that it is incurred only due to communication disruptions that prevent synchronization procedures from completing. Our simulations vary the number of such disruptions from small to aggressively high.

4 Causality-Based Replica Reconciliation

In this section, we begin to provide the formal treatment of the PVE replica reconciliation mechanism. Our approach builds the description in two steps. First, we give a generic set-oriented method for replica reconciliation and define the invariants that it maintains. This protocol introduces the basic synchronization and is simple enough that one can easily argue that it satisfies the stated correctness properties. Second, we instantiate the method with the PVE concise predecessor vectors developed for WinFS. This is an abstract version of the actual synchronization protocol used in WinFS. Its correctness is then proven in Section 5.

The key enabler of replica synchronization is a mechanism for representing sets of versions, through which precedence ordering can be captured. To this end, replicas store the following information concerning causality. First, replica r maintains information about the entire set of versions it knows of, represented in r.knowledge. Second, each version v stored at replica r contains in v.predecessors a representation of the entire set of causally preceding versions. More specifically, we require the maintenance of a set r.knowledge per replica r and v.predecessors per stored version v, as follows:

Definition 4 (The Knowledge Invariant) For every replica r and version v, we require r to maintain a set r.knowledge, such that if v ∈ r.knowledge then replica r stores version v or a version w such that v ≺ w.

When a replica is first created, its knowledge set is empty. Local updates produce new versions that are added to the replica's knowledge, as are versions received from other replicas during synchronization.

Definition 5 (The Predecessors Invariant) For all object instances v and w, we require r to maintain a set w.predecessors such that v ∈ w.predecessors if and only if v ≺ w.

When a new data object is created, its predecessors set contains only its own version. Each update to the object produces a new version that is added to the previous version's predecessors to get the predecessors set for the new instance of the object.

Given the above two invariants, it is possible to determine if a version is included in a replica's storage and if one version precedes another or they conflict.

4.1 A Synchronization Framework

Figure 1 presents an asymmetric synchronization protocol that uses knowledge and predecessors. The protocol is a one-way protocol between a requesting replica and a source replica. The requestor contacts a source and obtains all the versions in the source replica's knowledge. These versions are integrated into the requestor's storage, and conflict alarms are raised where needed.

1. Requestor r sends source s its knowledge set r.knowledge.
2. Source s responds with the following:
   (a) For every object o it stores, for which o.version ∉ r.knowledge, it sends o.
3. For every version o received from s, requestor r does the following:
   (a) For every object w in store, such that w.name = o.name:
       if o ∈ w.predecessors then ignore o and stop;
       else if w.version ∈ o.predecessors then delete w;
       else alert conflict.
   (b) Insert o.version into r.knowledge.
   (c) Integrate o.predecessors into r.knowledge.
   (d) Store o.

Fig. 1 A generic replica synchronization protocol using causality information.
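For concreteness, the integration step of Figure 1 can be sketched in Python as follows. This is our own illustration (not the WinFS code); a replica is modeled as a plain record with a store mapping object names to the live instances and a knowledge set of versions, and received instances are assumed to carry the name, version, and predecessors fields of the object model in Section 2:

from types import SimpleNamespace

def new_replica():
    # store: object name -> list of live instances (more than one only while
    # a conflict is unresolved); knowledge: set of (replica, counter) versions.
    return SimpleNamespace(store={}, knowledge=set())

def integrate(requestor, received):
    """Step 3 of Figure 1: fold the instances sent by the source into the requestor."""
    for o in received:
        obsolete = False
        for w in list(requestor.store.get(o.name, [])):
            if o.version in w.predecessors:
                obsolete = True                      # o is already superseded: ignore it
                break
            if w.version in o.predecessors:
                requestor.store[o.name].remove(w)    # w is superseded: delete it
            else:
                print("conflict detected on", o.name)
        if obsolete:
            continue
        requestor.knowledge.add(o.version)                 # step 3(b)
        requestor.knowledge.update(o.predecessors)         # step 3(c)
        requestor.store.setdefault(o.name, []).append(o)   # step 3(d): store o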
mation about the entire set of versions it knows (b) Insert o.version into r.knowledge. of, represented in r.knowledge. Second, each ver- (c) Integrate o.predecessors into r.knowledge. (d) store o sion v stored at replica r contains in v.predecessors a representation of the entire set of causally pre- Fig. 1 A generic replica synchronization protocol using ceding versions. More specifically, we require the causality information. maintenance of a set r.knowledge per replica r and v.predecessors per stored version v, as follows: This protocol clearly maintains the two desired Definition 4 (The Knowledge Invariant) For invariants. However, for practical purposes, repre- every replica r and version v, we require r to main- senting the full knowledge and predecessors sets is tain a set r.knowledge, such that if v ∈ r.knowledge too costly. The challenge is to represent causality in then replica r stores version v or a version w such a space-efficient manner, suitable for very large ob- that v ≺ w. ject sets and moderate-size replica sets, while main- When a replica is first created, its knowledge taining the invariants. The detailed solution follows set is empty. Local updates produce new versions in the next section.

4.2 Concise Version Vectors

The key to our novel conflict-detection technique is to transform the predecessors sets into different sets that can be represented more efficiently. We first require the following technical definition:

Definition 6 (Extrinsic) Let o be some object and o.predecessors its predecessor set. Let S be a set of versions. We denote by S|o.name the reduction of S to versions pertaining to object o.name only. S is called extrinsic to o if S|o.name = o.predecessors.

The surprising storage saving in PVE is derived from the following realization. For any object o, we can use a set extrinsic to o in place of o.predecessors throughout the protocol. In particular, when a replica's knowledge set is extrinsic to o, the knowledge may be used as the predecessors set for o. The main storage savings is derived from using an empty predecessors set for an object to denote (by convention) that the replica's knowledge set may be used instead. The following rule is the root of the PVE storage and communication savings:

Property 1 At any point in the protocol, any predecessors set may be replaced with an extrinsic set. By convention, an empty predecessors set indicates the replica's knowledge set.

We are now ready to introduce the novel PVE conflict detection scheme, which considerably reduces the size of representations of predecessor versions in normal cases.

Versions and Predecessor Vectors. The scheme uses the per-replica counter defined in Definition 1, which enumerates updates generated by the replica on all objects. Hence, a replica r maintains a local counter c. When replica r generates a version on an object o, it increments the local counter and creates version ⟨r,c⟩ on object o. Predecessors are represented using the PVEs defined in Definition 3.

Knowledge. A replica r maintains in r.knowledge a PVE representing all the versions it knows of. Inserting a new version ⟨s,ns⟩ into r.knowledge is done by updating the highest version seen from s to ⟨s,ns⟩, and possibly inserting exceptions if there are holes between ns and the previous highest version from s.
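The insertion of a newly learned version into such a knowledge PVE can be sketched as follows (our illustration; as before, each component is encoded as the highest counter seen for a replica plus a set of exception counters):

def insert_version(knowledge, version):
    """Record <rep, ctr> in a knowledge PVE, adding exceptions for any counters
    between the previously highest one and ctr (the "holes")."""
    rep, ctr = version
    highest, missing = knowledge.get(rep, (0, set()))
    if ctr > highest:
        knowledge[rep] = (ctr, missing | set(range(highest + 1, ctr)))
    elif ctr in missing:
        knowledge[rep] = (highest, missing - {ctr})   # an old hole is filled
    # otherwise ctr was already covered and nothing changes

k = {"B": (10, set())}
insert_version(k, ("B", 14))
print(k)   # {'B': (14, {11, 12, 13})} -- versions 11-13 become exceptions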
Object Predecessors. As already mentioned, an empty (⊥) predecessors set is used whenever r.knowledge is extrinsic to an object. In all other cases, predecessors contains a PVE, describing the set of causally preceding versions on the object.

Generating a New Version. When a replica r generates a new update on an object o, the new version ⟨r,c⟩ is inserted into r.knowledge right away. Then, if o.predecessors is ⊥, nothing needs to be done to it. Implicitly, this means that the versions dominated by r.knowledge causally precede the new version. If o.predecessors is not empty, then the new version is inserted into o.predecessors without exceptions. Implicitly, this means that the set of versions that were dominated by the previous o.predecessors causally precede the new update.

Synchronization. Figure 2 below describes the full PVE synchronization protocol. Space saving using empty predecessors requires caution in maintaining the extrinsic nature of predecessor sets throughout the synchronization protocol.

First, suppose that a requestor r receives from a source s a version v with an extrinsic v.predecessors set. Unlike the simplified protocol in Figure 1, it is incorrect to merge v.predecessors into the r.knowledge set right away, since v.predecessors may contain versions of objects different from v that r does not have. Hence, only v itself can be inserted into r.knowledge.

Second, consider the state of r.knowledge at the end of r's synchronization with s. Every version v sent by s has been inserted into r.knowledge. However, there may be some versions, e.g. w ≺ v, that r.knowledge does not contain. Source replica s does not explicitly send w, because it is included in v.predecessors. But since predecessor sets are not merged into r.knowledge, it may be left not containing w. To address this, at the end of an uninterrupted synchronization with s, the requestor r merges s.knowledge into r.knowledge. The goal of the merging is to produce a vector that represents a union of all the versions included in r.knowledge and s.knowledge, and to replace r.knowledge with it. For example, merging s.knowledge = (⟨A,3⟩, ⟨B,5⟩⟨eB,4⟩, ⟨C,6⟩) into r.knowledge = (⟨A,7⟩⟨eA,6⟩, ⟨B,3⟩⟨eB,2⟩, ⟨C,1⟩) yields (⟨A,7⟩⟨eA,6⟩, ⟨B,5⟩⟨eB,4⟩, ⟨C,6⟩).

Third, should synchronization ever be disrupted in the middle, a requestor r may be left with r.knowledge lacking some versions. This happens if a version v was incorporated into r.knowledge, but some preceding version w ≺ v has not been merged in. As a consequence, in a future synchronization request, say with s′, r may (inefficiently) receive w from s′. Hence, r checks whether it can discard w by testing whether w is contained in v.predecessors (and if so, r also inserts w into r.knowledge for efficiency).
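The merge performed at the end of an uninterrupted synchronization can be sketched in the same encoding (our illustration): a counter is covered by the merged vector exactly when it is covered by either input. The worked example above is reproduced at the bottom:

def covered(pve, rep, ctr):
    highest, missing = pve.get(rep, (0, set()))
    return ctr <= highest and ctr not in missing

def merge(a, b):
    """Union of the version sets represented by the knowledge PVEs a and b."""
    out = {}
    for rep in set(a) | set(b):
        highest = max(a.get(rep, (0, set()))[0], b.get(rep, (0, set()))[0])
        missing = {c for c in range(1, highest + 1)
                   if not covered(a, rep, c) and not covered(b, rep, c)}
        out[rep] = (highest, missing)
    return out

s_knowledge = {"A": (3, set()), "B": (5, {4}), "C": (6, set())}
r_knowledge = {"A": (7, {6}),   "B": (3, {2}), "C": (1, set())}
# merged: A -> (7, {6}), B -> (5, {4}), C -> (6, set()), as in the text
print(merge(s_knowledge, r_knowledge))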

4.3 Properties

The following properties are maintained by our protocol, and are derived from the two invariants given in Definition 4 and Definition 5 (proofs are provided in the next section).

Safety: Every conflicting version received by a requestor is detected.
Nontriviality: Only true conflicts are alerted.
Liveness: At the end of a complete execution of a synchronization procedure, for all objects, the requestor r stores versions that are identical to, or that causally follow, the versions stored by source s.

1. Requestor r sends source s its knowledge set r.knowledge.
2. Source s responds with the following:
   (a) It sends s.knowledge.
   (b) For every object o it stores, for which o.version ∉ r.knowledge, it sends o. If s.knowledge is not extrinsic to o, s sends o.predecessors (otherwise, it leaves o.predecessors empty).
3. For every version o received from s, requestor r does the following:
   (a) For every object w in store, such that w.name = o.name:
       if o.version ∈ w.predecessors, or w.predecessors = ⊥ and o.version ∈ r.knowledge, then ignore o and stop;
       else if w.version ∈ o.predecessors, or o.predecessors = ⊥ and w.version ∈ s.knowledge, then delete w;
       else alert conflict.
   (b) Store o.
   (c) If o.predecessors = ⊥, then set o.predecessors = s.knowledge.
   (d) For every object w in store, such that w.name = o.name (these must be conflicting versions), if w.predecessors = ⊥ then set w.predecessors = r.knowledge.
   (e) Insert o.version into r.knowledge.
4. Merge s.knowledge into r.knowledge.
5. (Lazily) go through versions v such that v.predecessors ≠ ⊥, and if r.knowledge is extrinsic to v then set v.predecessors = ⊥.

Fig. 2 Synchronization using extrinsic predecessors; modifications from the generic protocol are indicated in boldface.

5 Correctness

This section provides a correctness proof for the protocols presented in Section 4.2.

Lemma 1 The Knowledge Invariant of Definition 4 is maintained throughout the update generation and synchronization protocol.

Proof Initially, r.knowledge is empty, and so the invariant trivially holds. The set is updated in the following events.

– When replica r generates a new version ⟨r,nr⟩, the version is inserted into r.knowledge. Clearly, this version is held in storage by r. Moreover, any preceding version ⟨r,n′r⟩, n′r < nr, must also be stored in r, or has been obsoleted by another version on the same object. Hence, no exceptions are needed.
– During synchronization, when a new version for some object o arrives, that version is inserted into r.knowledge, along with any required exceptions. Clearly, since o is stored by r (or is obsoleted by some causally succeeding version of o), the knowledge invariant holds.
– At the end of synchronization, the knowledge vector s.knowledge is merged into r.knowledge. The merging produces a vector that represents a union of s.knowledge and r.knowledge. Since merging is done only at the end of a complete synchronization, the requestor r must already store all versions included in s.knowledge or later ones. Hence, replacing r.knowledge with a vector representing the union maintains the Knowledge Invariant.

Lemma 2 Let o and w be two versions of an object. Let So be a set extrinsic to o, and Sw to w. Then the Predecessors Invariant of Definition 5 is maintained if we use So in place of o.predecessors and Sw in place of w.predecessors.

Proof If o ≺ w, then by definition o ∈ Sw. The converse holds if w ≺ o. If neither version precedes the other, then since the set of versions pertaining to object o.name in So is the same as o.predecessors, we have that w ∉ So. The converse holds for o ∉ Sw.

Lemma 3 The Predecessors Invariant of Definition 5 holds throughout the updates made by the above protocols.

Proof According to Lemma 2, replacing a predecessor vector with an extrinsic set cannot falsify the Predecessors Invariant. Let r be any replica, and o an object. Initially, o.predecessors is set to empty if the replica's knowledge is extrinsic to o. The relationship between o.predecessors and r.knowledge needs to be re-examined in the following events.

– Replica r generates an update on o. In this case, first r.knowledge is updated with the new version, hence if r.knowledge was extrinsic to o before the update, it continues being so after it.
– During synchronization, a new version of o is received from a source s. In this case, the requestor r explicitly stores the predecessor vector that arrives with the new version. Hence, the relationship between o.predecessors and r.knowledge is verified.
– At the end of synchronization, the source's knowledge s.knowledge is merged into r.knowledge. Here, the only case we must consider is that
o.predecessors was ⊥ before this step. Then we need to prove that the merged knowledge must remain extrinsic to o after the merge.

Let w be any version of object o.name included in s.knowledge. If w ∈ r.knowledge, then by the predecessors invariant it also precedes o. Hence, the extrinsic relation holds. In all other cases, we reach a contradiction. Specifically, if o ≺ w, then w would have replaced o in r's storage, and it is impossible that o is still stored. There remains the case that w and o are conflicting versions. However, in this case, in step 3(d) of the protocol above in Figure 2, o.predecessors would not be left empty. Again, we reach an impossible state according to the protocol.

Theorem 1 The protocols above maintain Safety, Liveness, and Nontriviality, as defined in Section 4.3 above.

Proof (Sketch) We already argued that the simple synchronization framework in Figure 1 maintains the desired Safety, Liveness, and Nontriviality properties, given any predecessor and knowledge sets that maintain the two invariants in Definition 4 and Definition 5. Since we proved the maintenance of the invariants, our proof is done.

6 Performance

The storage overhead associated with precedence and conflict detection comprises two components. The per-replica knowledge vector contains aggregate information about all known versions at the replica. In typical, faultless scenarios, the PVE scheme requires Õ(R) space per replica for the knowledge representation. By comparison, the VV scheme has no aggregate information on a replica's knowledge.

Additional storage overhead stems from precedence information. In the PVE scheme, faultless scenarios result in one version being maintained per object, incurring a space of Õ(N). By comparison, the VV scheme keeps Õ(R × N) storage, i.e. one version vector per object. In fairness, the per-replica counters used to generate versions in the PVE scheme may require a larger number of bits than those used in a conventional version vector scheme, since they are incremented for all updates to any object. However, WinFS can easily deal with counters that wrap around by retiring replicas once they have generated too many updates and replacing them with new replicas with fresh counters. Thus, the basic comparison between PVE and VV schemes in reliable communication environments remains valid.

The fault-free (lower-bound) storage overheads for PVE and VV are summarized in Table 1.

When failures occur, the overhead of VV remains unchanged, but the PVE scheme may gradually suffer increasing storage overheads. There are two sources of additional complexity. The first is the need to keep exceptions in the knowledge. The second is the explicit predecessor vectors (and their corresponding exceptions) kept for versions which the replica's knowledge does not dominate. In theory, neither of these components has any strict upper bound, since the set of exceptions may grow with the number of updates. These formal upper bounds are also summarized in Table 1 below. Later we provide simulation results that demonstrate storage growth in the PVE scheme relative to failure rates.

The communication overhead associated with synchronization also has two parts. First, a source and a requestor need to determine which objects have versions yet unknown to the requestor. In the PVE scheme, this is done by conveying the requestor's knowledge vector to the source. The faultless overhead here is Õ(R); the upper bound is again theoretically unbounded.

Let q denote the number of object versions that the source determines it has to send to the requestor. The second component of the communication overhead is the extra precedence information associated with these q objects. In faultless runs of the PVE scheme, this information consists of one version per object. Hence, the overhead is Õ(q). In case of faults, as explained before, some objects sent during synchronization may have explicit predecessor vectors and an unbounded number of exceptions associated with them. Hence, there is no formal upper bound on the communication overhead. Here again, our simulation studies relate this complexity with the fault rate.

As for the VV scheme, the only way to convey knowledge of the latest versions held by a replica is by explicitly listing all of them, which requires Õ(N × R) bits. Therefore, in realistic deployments of VV, replicas may keep a log of all the objects that received updates since the last synchronization with the requestor and send only the version vectors associated with these objects. The communication complexity will then be between Õ(q × R) and Õ(N × R), but the storage overhead increases due to logging.

                Version vectors   PVE
storage l.b.    Õ(N × R)          Õ(N + R)
storage u.b.    Õ(N × R)          unbounded
comm l.b.       Õ(q × R)          Õ(q + R)
comm u.b.       Õ(N × R)          unbounded

Table 1 Lower and upper bounds comparison of PVE with the version-vector scheme.

In face of communication faults, replicas using the PVE method might accumulate over time both knowledge exceptions and object versions that require explicit predecessors. There is no simple formula that describes how frequently exceptions are accrued, as this depends on a variety of parameters and exact causal ordering.

In order to evaluate the effect of communication disruptions on storage in the PVE scheme, we conducted several simple simulations. We ran R = 50 replicas, generating version updates to objects at random. The number of objects varied between N = 100 and N = 1000. Every 100 total updates, a synchronization round was carried out in a round-robin manner, with replica 1 serving updates to 2, replica 2 serving 3, and so on, up to replica R sending updates back to 1. This was repeated 100 times. We expect that other communication patterns, such as randomly selecting synchronization partners, would yield similar simulation results. A failure-probability variable pfail controlled the chances of a communication disruption within every pairwise synchronization. The disruption occurred at the end of the synchronization procedure, thus potentially causing the maximal number of exceptions. We measured the resulting average communication and storage overhead. These are depicted in Figure 3 for two cases, 100 and 1000 objects. We normalize the overhead to per-object overhead. For reference, the per-object storage overhead in standard VVs is exactly R = 50. The best achievable communication overhead with VVs (without logging) is also R = 50, and is depicted for reference.

The figure clearly indicates a tradeoff in the PVE scheme. When communication disruptions are reasonably low, PVE storage and communication overhead is substantially reduced compared with the VV scheme, even for a relatively small number of objects. As the failure rate increases, the number of exceptions in the aggregate vector rises, and the total storage used for knowledge and for predecessor sets increases. The point at which the per-object amortized overhead passes that of a single VV depends on the number of objects. For quite moderate size systems (1000 objects), the cut-off point is beyond a 90 percent communication disruption rate.

7 Related Work

In weakly consistent replicated databases and file systems, conflicts are generally defined as concurrent updates to an object or file. In other words, conflicts arise when two clients independently update the same file at different replicas. More semantic definitions of update conflicts that take into account the needs of particular applications have been supported in replicated database systems like Bayou [14]. Even though WinFS provides a richer data model than a conventional file system, we adopted the same notion of conflicts as in previous replicated file systems but devised a new scheme for detecting when conflicts occur.

In a client-server architecture where a hub-and-spoke or star topology is used for replication, conflict detection is relatively easy. For example, in the Coda system [8], servers maintain a version number for each file. A client records, for each file that it locally caches, the version number that the file had when it was retrieved from the server. When the client reconciles its local updates with the server after a period of disconnection, the client checks the current version of each file on the server. If the server's version for a file differs from the version on which the client based its update, then a conflict has occurred. This simple form of optimistic concurrency control works because the server is a central authority for each file.

Version vectors were devised for systems in which replicas reconcile with each other in a peer-to-peer fashion. Locus was the first system to use version vectors to detect concurrent updates to files [6], although Fischer and Michael proposed a similar data structure for resolving insert/delete ambiguities in replicated dictionaries [4]. Locus stored a version vector with each replica of each file. A file's version vector included an entry for each site on which a copy of the file was replicated; entries in the version vector indicated the number of updates made to the file by each site. Two copies of a file are determined to be in conflict if their associated version vectors are incompatible, meaning that one version vector does not dominate the other. Follow-on systems to Locus, such as Ficus, Rumor, and Roam, utilize this same technique to detect conflicting file updates [7,11].

It has been shown that version vectors or related data structures, like vector clocks, are necessary to detect the causal ordering of events in a distributed system [3,13]. Concerns about the unbounded size of version vectors have caused some researchers to propose compact representations [1,2,5,15] or techniques for pruning entries that are no longer needed, such as entries that are globally known by all replicas [11,12]. These techniques could be adopted for use in WinFS, though they are less necessary since, as described in earlier sections, the PVE scheme maintains a single version vector for an entire replica rather than one for each file.

8 Conclusions

In optimistically replicated systems, metadata must be maintained for detecting conflicts caused by concurrent updates to data objects. The overhead for storing and communicating such metadata can be prohibitive. The traditional technique of using per-object version vectors simply does not scale to systems with thousands of replicas and large numbers of objects. WinFS, a new structured storage platform developed at Microsoft, was designed to support information management applications, such as electronic mail, with potentially millions of fine-grained data objects. WinFS allows objects to be replicated across machines running Windows, and thus must scale from a handful of replicas in a home to thousands of replicas within a global corporation. In this paper, we have shown that the WinFS design meets these scalability demands by requiring only a single version vector per machine along with simple versions for each object. We provide the first proof that this unusually small amount of metadata is sufficient to detect concurrent writes to any data object.

An analytical comparison of the PVE scheme of WinFS to the conventional version vector scheme showed a factor of 50 improvement for scenarios with 50 replicas. These results were confirmed through simulation. As the number of replicas and the number of data objects increase, the benefits of the PVE design become even more pronounced. However, when running over unreliable networks, PVE synchronization sessions can be disrupted before their full completion, and the metadata maintained by PVE can grow over time due to holes in a replica's knowledge. In theory, PVE overheads can exceed the overhead of per-object version vectors. Our simulation studies show that, even for a small number of objects (100) and a modest number of replicas (50), this is only a concern when the percentage of failed synchronization sessions exceeds 40%, an unacceptably high unreliability in practice. For systems of 1000 objects, PVE overheads are strictly less even if 95% of synchronization sessions terminate prematurely.

In the future, we plan to evaluate the PVE design with real workloads gathered from the emerging WinFS applications. These studies should shed further light on the practical benefits of the new conflict detection scheme developed for WinFS.

Acknowledgements

The protocol described in this paper was designed by the Microsoft WinFS product team, which included one of the authors (Doug Terry). We especially acknowledge Irena Hudis and Lev Novik for pushing the idea of concise version vectors. Harry Li, Yuan Yu, and Leslie Lamport helped with the formal specification of the replication protocol and the proof of its correctness.

References

1. J. B. Almeida, P. S. Almeida, and C. Baquero. Bounded version vectors. In Proceedings of the International Symposium on Distributed Computing (DISC), pages 102–116, 2004.
2. A. Arora, S. S. Kulkarni, and M. Demirbas. Resettable vector clocks. In 19th Symposium on Principles of Distributed Computing (PODC), 2000.
3. C. Fidge. Timestamps in message-passing systems that preserve the partial ordering. In Proceedings of the 11th Australian Computer Science Conference, pages 56–66, 1988.
4. M. J. Fischer and A. Michael. Sacrificing serializability to attain high availability of data in an unreliable network. In Proceedings of the SIGACT-SIGMOD Symposium on Principles of Database Systems, March 1982.
5. Y.-W. Huang and P. Yu. Lightweight version vectors for pervasive computing devices. In Proceedings of the IEEE International Workshops on Parallel Processing, pages 43–48, 2000.
6. D. S. Parker (Jr.), G. J. Popek, G. Rudisin, A. Stoughton, B. J. Walker, E. Walton, J. M. Chow, D. Edwards, S. Kiser, and C. Kline. Detection of mutual inconsistency in distributed systems. IEEE Transactions on Software Engineering, 9(3):240–247, May 1983.
7. T. W. Page (Jr.), R. G. Guy, J. S. Heidemann, D. H. Ratner, P. L. Reiher, A. Goel, G. H. Kuenning, and G. Popek. Perspectives on optimistically replicated peer-to-peer filing. Software – Practice and Experience, 11(1), December 1997.
8. J. J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda file system. ACM Transactions on Computer Systems, 10(1):3–25, February 1992.
9. R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing high availability using lazy replication. ACM Transactions on Computer Systems, 10(4):360–391, 1992.
10. L. Novik, I. Hudis, D. B. Terry, S. Anand, V. J. Jhaveri, A. Shah, and Y. Wu. Peer-to-peer replication in WinFS. Technical Report MSR-TR-2006-78, Microsoft, June 2006.
11. D. H. Ratner. Roam: A Scalable Replication System for Mobile and Distributed Computing. PhD thesis, 1998. UCLA Technical Report UCLA-CSD-970044.
12. Y. Saito. Unilateral version vector pruning using loosely synchronized clocks. Technical Report HPL-2002, HP.
13. R. Schwarz and F. Mattern. Detecting causal relationships in distributed computations: In search of the holy grail. Distributed Computing, 7(3):149–174, 1994.
14. D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer, and C. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proceedings of the 15th Symposium on Operating Systems Principles (SOSP), pages 172–183, December 1995.
15. F. Torres-Rojas and M. Ahamad. Plausible clocks: constant size logical clocks for distributed systems. Distributed Computing, 12(4):179–196, 1999.

D. Malkhi
Dahlia Malkhi is a Principal Researcher in the Microsoft Research Silicon Valley lab. She received her Ph.D., M.Sc. and B.Sc. degrees in 1994, 1988, and 1985, respectively, from the Hebrew University of Jerusalem, Israel. During the years 1995–1999 she was a member of the Secure Systems Research Department at AT&T Labs-Research in Florham Park, New Jersey.

Her research interests include all areas of distributed systems.

D. Terry
Doug Terry is a Principal Researcher in the Microsoft Research Silicon Valley lab. His research focuses on the design and implementation of novel distributed systems and addresses issues such as information management, fault-tolerance, and mobility. He currently is serving as Chair of ACM's Special Interest Group on Operating Systems (SIGOPS). Prior to joining Microsoft, Doug was the co-founder and CTO of Cogenia, Chief Scientist of the Computer Science Laboratory at Xerox PARC, and an Adjunct Professor in the Computer Science Division at U. C. Berkeley, where he regularly teaches a graduate course on distributed systems. Doug has a Ph.D. in Computer Science from U. C. Berkeley.

[Figure 3 appears here: two plots ("100 object system" and "1000 object system") of per-object overhead versus percent of disrupted synchronizations, with curves for PVE storage overhead, PVE communication overhead, and VV.]

Fig. 3 Per-object storage and communication overheads for varying communication failure frequency with N = 100 objects (top) and N = 1000 objects (bottom).
