
Current Concepts in Version Control Systems

Petr Baudiš∗ 2009-09-11

arXiv:1405.3496v1 [cs.SE] 14 May 2014

Abstract

We give the reader a comprehensive overview of the state of the Version Control engineering field, describing and analysing the concepts, architectural approaches and methods researched and included in the currently widely used version control systems, and we propose some possible future research directions.

Preface

This paper was originally written in May 2008 as an adaptation of my Bachelor thesis, and I have been meaning to get back to it ever since: to add some missing information, especially on some of my research ideas, to describe the algebra more formally and to detail the more interesting chapters of historical development in the field.

Unfortunately, it is not likely that I will ever get the time to polish the paper to my full satisfaction; in the meantime, it is slowly getting obsolete, and I believe it has good value even as it stands, especially as introductory material for people getting into this area. Maybe I have missed some of the latest developments; I ask the reader for tolerance of the omissions, and my fellow researchers and authors for follow-ups on this work that would fill in the gaps and explain ongoing new ideas in the field.

Petr Baudiš

1 Introduction

Version control plays an important role in most creative processes involving a computer. It ranges from the simple undo/redo capabilities of most office tools to complicated version control supporting branching and diverse collaboration of thousands of developers on huge projects. This work will focus on the latter end of the scale, describing the modern approaches to advanced version control with emphasis on software development. We will compare various systems and techniques, but we will not judge their general suitability, as the set of requirements for version control can vary widely. We will not focus on the integration of version control into source configuration management (SCM) systems¹ since SCM is an extremely wide area and many other papers already cover SCM in general. [3][15][70]

This work aims to fill the sore need for an introduction-level but comprehensive summary of the current state of the art in version control theory and practice. We hope that it will help new developers entering the field to quickly understand all the concepts and ideas in use. Since the author is (was) a Git developer, we focus somewhat on the Git version control system in our examples, but we try to be comprehensive and cover other systems fairly, as much as is within our power and knowledge.

We also describe for the first time some techniques so far present only in various systems without any explicit documentation, and we present some of our original work. One problem in the field of version control is that most innovation and research happens not in academia or even classical industry, but in the open-source community; very few of the results get published in the form of a scientific paper or technical report, and many of the concepts are known only from mailing list posts, random writeups scattered around the web or merely the source code of the tools. Thus, another aim of this work is to tie up all these sources and provide an anchor point for future researchers.

¹ Source Configuration Management covers areas like version control, build management, bug tracking or development process management.
∗ The work on this paper was sponsored in part by SUSE Labs.


2 Basic Design

The basic task of a version control system is to preserve the history of a collection of files (a project) so that the user can easily compare two historical versions of the content, return to an older version or review the list of changes between two versions.

2.1 Development Process Model

We will focus on version control systems that are based on the distributed development paradigm. While the concept dates many years back, distributed development has only recently seen major adoption, and so far mostly only in the area of open source development.

The classical approach to version control is to have a single central server² storing the project history; the client can have the current version cached locally, and possibly also (parts of) the history, but any operation that changes the history (e.g. committing a new version) requires cooperation with the central server.

This centralized development approach has two major shortcomings. First, the developer has to have network access during development, which is problematic in the era of mobile computing, e.g. when developing during travel or fixing problems at the customer site. Second, access to the central repository implies some inherent bureaucratic overhead: the developer has to gain appropriate commit permissions before being able to version-control any of his work, development of third-party patches is difficult, and since the repository is shared, the developer is usually restricted in experimentation and in creating short-lived branches, often resorting to some kind of manual version control during regular workflow and using the VCS only on official occasions.

On the other hand, since all the development is tracked at a central place, this model may offer advantages to corporate shops because the work and performance of all developers can be tracked easily.³

In the distributed development model, a copy of the project created by version control has full version tracking capabilities: not only is all the history usually copied over (pulled), but it is also possible to commit new changes locally; later, they can be sent (pushed) to another repository in order to be shared with other developers and users. This approach effectively makes every copy a stand-alone repository with stand-alone branches; nevertheless, most systems make it easy to emulate the centralized version control model as a special case, at least for read-only users.

² Possibly with several read-only mirrors for replication, as in the BSD ports setup.
³ In the case of in-company development using distributed version control, this can be alleviated by setting an appropriate policy for the developers to push their work to designated central repositories at regular intervals.

2.2 Snapshot-Oriented History

There are two possible approaches to modeling the history. The first-class object is usually the particular version⁴ as a state (snapshot) of the content at a given moment, while the changes are deduced only by comparing two versions. However, there is also a competing approach which focuses on the changes instead: in that case, the important thing when marking a new version is not the new state of the content, but the difference between the old and new state; a particular version is then regarded as a certain combination of changes.⁵

⁴ We use the terms version, revision and commit interchangeably in this paper based on the context, according to usual practice.
⁵ Note that the conceptual model may have no relation to the actual storage, which can well use deltas even within snapshot-oriented systems.

2.2.1 Branches

With regard to the snapshot-oriented history model, it may suffice in the trivial case to merely represent the succeeding versions as a list with the versions ordered chronologically. But over time it may become desirable to work on several versions of the content in parallel, while also recording further modifications using the version control system; the development has effectively been split into several variants (branches).

Handling branches is an obvious feature of a version control system, as the necessity to branch the development is frequent for many projects targeted at a wider audience. Commonly, aside from the main development branch, another branch is kept based on the latest public release of the project and dedicated to bugfixes for this release.

Moreover, branches have many other uses as well: individual developers or teams may want to store work in progress that is not yet suitable for the main development branch in separate branches, and a developer may even want to save incomplete modifications during a short-time reassignment to a different task or when moving to another machine. The prerequisite for this is that creating a separate branch must have only minimal overhead, both technical (time and space required to create a new branch) and social (it must be possible to easily remove the branch too; also, the unimportant branches should not cloud the view of the important major ones).

The ability to properly record and manipulate branches becomes truly vital as the development gets distributed, since in effect every clone of the project becomes a separate branch. If the system aims to fully and comfortably support the distributed development model, it must make sure that there are no hurdles to creating many new branches; furthermore, there can be no central registry of branches, thus the branch-related data structures must allow branches to grow from different points in history with full independence, and they must deal with naming conflicts⁶ should the branches ever meet again in some repository.

Aside from the obvious operations on branches (referring to them, comparing them, storing them in a reasonable manner, switching between them), perhaps the most technically challenging one is merging: combining changes made separately in two branches into a single revision. In the past, only the most crude methods for merging two branches were available, but with the onset of distributed version control this aspect has grown in importance (a merge essentially happens every time a developer exchanges modifications with the outside world). We will explain the challenges associated with merging in section 6; one important consideration we shall mention now is that many of the merging techniques require finding the latest point in history common to both of the to-be-merged branches (or, more generally, which changes are not common to the history of both branches).

Aside from merging, another desirable operation is so-called cherry-picking: taking individual changes from one branch to another without merging them completely. While most modern systems support this in a rudimentary way, the operation is not natural in snapshot-oriented systems and can result in trouble when merging the branches later, both in duplicated changes stored in the history and in an increased conflict rate.⁷

⁶ In independent repositories, branches and revisions can be given conflicting names, as will be covered in detail later.
⁷ Git provides a git-rerere tool for automatically resolving merge conflicts that have been resolved in a certain way before, thus somewhat reducing this problem.

2.2.2 Data Structures

Given that the development may fork at some point in history, we cannot use simple lists anymore to adequately represent this event; we have to model the history as a directed tree instead, with succeeding versions connected by edges. The node representing a fork point (where not one but several branches grow from the node) has multiple children.

However, when we take merging into consideration, it becomes necessary to also represent the past merges between branches if we are to properly support repeated merges and preserve the detailed history of both branches across the merge. The commonly used method is to use a more general directed acyclic graph (DAG; figure 2 several pages later) to represent the history; compared to a directed tree, a DAG lifts the restriction that a node has only a single parent. Thus, versions representing merges of several branches have the last ("head") versions of all the merged branches as their parents. The acyclic property makes sure that a commit still cannot be its own ancestor, thus keeping the history sane (time travel and quantum effects notwithstanding).

An interesting problem to note is how to actually define a branch. In the case of trees, the branch is defined naturally by the tree structure; in a DAG this becomes less obvious. One approach is to record the branch of each revision explicitly; another is to give up on sorting revisions into explicit branches altogether (e.g. Git). In the latter case, a "branch" is merely a named pointer to the commit at the head of the branch, with no clear distinction of which ancestors belong to the same branch. It might be tempting to keep branch correspondence through a fixed ordering of parent nodes in merge nodes ("first parent comes from the same branch"), however this approach has caveats as described in 6.1.
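The DAG model and the "latest point in history common to both branches" that merging relies on can be illustrated with a small sketch. This is only a toy model under stated assumptions: the commit names, the `PARENTS` and `BRANCHES` tables and the helper functions are hypothetical, and real systems use considerably more efficient algorithms than the brute-force ancestor sets shown here.

```python
from collections import deque

# Hypothetical toy history: each commit id maps to the list of its parent ids.
# A merge commit is simply a node with more than one parent.
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],        # main line continues after B
    "D": ["B"],        # a topic branch forks at B
    "E": ["C", "D"],   # merge of the topic branch into the main line
    "F": ["D"],        # the topic branch continues after the merge
}

# Branches as mere named pointers to head commits (the Git-like approach).
BRANCHES = {"main": "E", "topic": "F"}


def ancestors(commit):
    """Return the set containing `commit` and every commit reachable from it."""
    seen = set()
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(PARENTS[current])
    return seen


def merge_bases(head_a, head_b):
    """Return the common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(head_a) & ancestors(head_b)
    return {
        candidate
        for candidate in common
        if not any(
            candidate != other and candidate in ancestors(other)
            for other in common
        )
    }


if __name__ == "__main__":
    # Before the merge E existed, the two lines of development shared only B.
    print(merge_bases("C", "D"))
    # Because the merge E is recorded in the DAG, a repeated merge starts from D.
    print(merge_bases(BRANCHES["main"], BRANCHES["topic"]))
```

Recording the merge as a node with two parents is what moves the common point from B to D, so a second merge of the same branches only has to consider the changes made after D.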

2.3 Changes-Oriented History

As an alternative to snapshot-oriented history, a radically different approach is to focus on the changes as the first-class object: the user creates and manipulates not a particular version of the project, but instead a particular change; a "version" is then described simply by the given combination of changes recorded. The major advantage is that merging operations may be much more flexible, and inherently more data is preserved, as the changes themselves are recorded and do not have to be deduced a posteriori from the difference between two snapshots.

The main representative of this class is Darcs (see 3.7), whose main trait is the powerful formalism of patch algebra (see 6.6), which makes it possible to easily merge two versions by adding the changes missing in the other version and reordering the changes in a well-defined way.

However, when making full use of smart merging techniques like patch algebra, the operation time can in some cases grow exponentially with the number of revisions in the current implementations [21]. Making the algorithms more efficient is a subject of active research, but at least until very recently, the practicality of the changes-oriented history model was severely limited and this approach was not suitable for large histories. We are not aware of performance studies for the recent Darcs improvements, and it is uncertain whether these complexity issues are inherent to the whole changes-oriented model.

2.4 Objects Identification

The most visible (and diverse) aspect of the history models is the problem of identifying the history objects (snapshots or changes, depending on the model).

The old RCS and CVS tools use a simple scheme where every version is identified by a string of dot-separated numbers; the first number is usually 1, the second number is the version number on the main branch. If it is required to identify a version on a different branch, the BASE_VERSION.BRANCH.VERSION scheme is used: BRANCH is the number of the branch forked from the base version and VERSION is the sequence number of the given version on the branch. This scheme can be used recursively to identify versions on branches of branches etc.

However, we must remember that in a distributed system we cannot see all the identifiers used in all repositories (since they can be on disconnected systems), so we meet special problems when having to simultaneously assign new identifiers in different disconnected repositories. Eventually, we are met with three conflicting requirements⁸ [44][6]:

• Uniqueness: The identifier is guaranteed to be unique and always identifies only a single history object in a given repository. (Note that this requirement is frequently slightly compromised by only hoping that the identifier is unique with a very high probability.)⁹

• Simplicity: The identifier is easy to use: reasonably short and preferably informative to the user (e.g. giving some information on the ordering of the revisions).

• Stability: The identifier does not change over time and between different repositories.

The requirements are of a conflicting nature: we can generally pick only two and sacrifice the third. Different version control systems make different choices here. For example, Git, Mercurial and Monotone sacrifice some simplicity (usability) and assign every commit a 40-digit SHA1 hash [57] depending on the current content and history of the version.¹⁰ To alleviate the UI problem somewhat, the systems allow shortening the long hashes to only a few leading digits, as long as they match uniquely.

⁸ For clarity, in this context we use "simple" instead of "memorable" and "stable" instead of "global" as presented in the literature. The latter correspondence is somewhat loose and given under the assumption that it does not make sense to use a central authority for assigning identifiers in the case of distributed version control; thus, either the identifiers are globally meaningful and stable, or they have only local meaning, and then inter-repository collisions are inevitable and relabeling is necessary.
⁹ Here we mean uniqueness guaranteed by technical means, preferably provably secure. E.g., GNU Arch does use identifiers that are "unique", but only by policy, as one part of the id is the user's e-mail address; these are unique, but there is no technical safeguard that different users will not input the same e-mail address. This approach requires trusting the other repository database to contain correct content.
¹⁰ This also technically sacrifices some uniqueness, but the probability of collision of two hashes is so small that it is deemed unimportant for practical purposes. This has been disputed by [5][34], however that paper has been criticized by the version control [40] and cryptographic [14] communities.
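To make the hash-based choice more concrete, the following sketch derives an identifier from both the content and the parent identifiers, and then resolves abbreviated identifiers by unique prefix. It is a minimal illustration under stated assumptions: the functions `revision_id`, `shortest_unique_prefix` and `resolve` are hypothetical names, and the byte layout being hashed is invented for the example; it is not the object format of Git, Mercurial or Monotone.

```python
import hashlib


def revision_id(content, parent_ids):
    """Derive a 40-hex-digit id from the content and the ids of the parents.

    The layout (parent ids, one per line, followed by the content) is an
    assumption made for this sketch only.
    """
    digest = hashlib.sha1()
    for parent in parent_ids:
        digest.update(parent.encode("ascii") + b"\n")
    digest.update(content)
    return digest.hexdigest()


def shortest_unique_prefix(target, all_ids, minimum=4):
    """Return the shortest prefix of `target` matching no other id in `all_ids`."""
    for length in range(minimum, len(target) + 1):
        prefix = target[:length]
        if sum(1 for known in all_ids if known.startswith(prefix)) == 1:
            return prefix
    return target


def resolve(prefix, all_ids):
    """Expand an abbreviated id; refuse it if it is unknown or ambiguous."""
    matches = [known for known in all_ids if known.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matches {len(matches)} identifiers")
    return matches[0]


if __name__ == "__main__":
    root_a = revision_id(b"first version\n", [])
    root_b = revision_id(b"unrelated project\n", [])
    # Same content, different history: the ids differ.
    child_a = revision_id(b"second version\n", [root_a])
    child_b = revision_id(b"second version\n", [root_b])
    known = {root_a, root_b, child_a, child_b}
    short = shortest_unique_prefix(child_a, known)
    print(short, "->", resolve(short, known))
```

Such identifiers are stable (any repository computes the same value for the same content and history) and unique with very high probability, at the cost of simplicity, which the prefix lookup only partly restores.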

BitKeeper and Bazaar¹¹ sacrifice stability and use RCS-style identifiers that are unique only within a single repository and can shift during merges. Darcs sacrifices uniqueness and uses user-assigned identifiers for individual changes (however, the identifiers are encouraged to be rather elaborate, so the collision probability should not be very high).

¹¹ Technically, Bazaar internally uses globally unique identifiers, but RCS-style identifiers are the primary way of user interaction.

2.5 Content Movement Tracking

One interesting problem in version control is following particular content across file renames, copies, etc. Practically all systems offer a way to list all revisions changing a given file or set of files, however things become more troublesome when a file is renamed or copied, or when we want to check the history of finer-grained objects than whole files.

The problem of showing the history of particular parts of files and following semantic units across files is still open and will in our opinion benefit from further research (as elaborated in 7.1).

Old systems like CVS do not support tracking file renames at all, and the new files start with a fresh history unless the repository is manually modified.¹² Most newer systems support file renames, usually by providing a special command that records the movement; a less common approach, used e.g. by GNU Arch, is to autodetect movement by tracking unique tags within the objects (strings within files or hidden files in directories).

In that case, showing the history of the new file will usually automatically switch to the older file name at the point of the rename, and there is also a possibility of cross-rename merges when a file has been renamed between the two revisions. However, this carries, for example, the danger of making the merging operation complexity proportional to the project history, something many systems seek to avoid. Also, as the support gets naturally extended to recording not only renames but also file copies, it becomes much less obvious which file instances to merge; another natural extension, recording file combinations, makes the situation even more challenging.¹³

A somewhat controversial alternative approach has been taken by Git, where file renames and copies are not explicitly recorded, with the rationale that Git aims to track content on a more fine-grained level and hard-coding rename information would pollute the history with potentially bogus information. [25] Instead, Git can heuristically detect these events by computing similarity indexes between files in the compared revisions,¹⁴ and it provides a so-called "pick-axe" mechanism to show changes adding or removing a particular string. Thus, when feeding pick-axe with some chunk of code, it will show the commits that introduced it to the files, as well as commits that removed it from files, effectively giving the user a rudimentary ability to trace the code history.¹⁵

Most systems support a so-called "annotated view" of a file, showing for each line the commit that introduced it in its current version. However, this operation is frequently very expensive, and on many occasions the line-level granularity is not suitable either.

¹² The modification usually involves manually copying the file under the new name within the repository database. However, this presents problems when the working tree is moved back to an older point in the history or in the case of branches.
¹³ The underlying design of recording renames varies; for example, Subversion records renames as copy+delete operations, which can bring some inherent problems, e.g. for using the rename information during merges. [50]
¹⁴ The similarity index is computed by hashing line-based chunks of the files and comparing how many of the hashes exist in both files. [29]
¹⁵ An obvious idea for an extension is to have the GUI frontends integrate mouse text selection with pickaxe, further simplifying the process.

3 Current Systems

We have already mentioned several version control systems when describing basic design aspects; now, we shall briefly describe the most widespread and interesting ones in more detail. We will focus on systems that are freely available and in use in the open-source community, and even so we will not go through all of them but instead highlight the most influential ones. We will not cover most proprietary systems like ClearCase or SourceSafe in detail since we feel they generally do not bring much innovation into the version control field itself (focusing more on process management, integration with other software etc.) and detailed information is hard to gather from open sources.

Up to CVS, the evolution was rather slow, but then with the growing need for distributed version control, many short-lived projects sprang up and explored various approaches; now the field has rather stabilized again, with three major systems currently used: Bazaar, Git and Mercurial.

3.1 The Two Grandfathers

The first true version control system ever in use was SCCS (Source Code Control System) [63], introduced in 1972 at Bell Labs and distributed commercially. It uses the weave file format (4.2.5) to track individual files separately and provides only a rather arcane user interface. Although its use in the industry is very rare nowadays, the weave storage format has resurfaced recently.

A competing system, RCS (Revision Control System) [54], published in 1985, has seen much more success, even though it also operates only on isolated files. It uses a delta sequence file format (4.2.4), and its rapid adoption compared to SCCS is likely mainly due to its much more user-friendly interface (though it is still rather rudimentary by current standards). RCS is still sometimes used at present for version-controlling individual files, e.g. when maintaining system configuration. However, its concepts and file format have seen even more widespread adoption thanks to CVS.

3.2 Concurrent Versions System

CVS [17] is essentially an RCS spinoff, bringing two major extra capabilities: the ability to version-control multiple files at once and support for parallel work, including network support. It has seen extremely widespread adoption and has long remained the de-facto standard in version control, at least among the free systems.

However, CVS is directly based on RCS, and it is riddled with inherited artifacts of individual file revision control: the need for elaborate locking, difficult handling of branching and merging, and a lack of commit atomicity. When a committed change spans multiple files, the commit will be stored for each file separately and there is no reliable way to reconstruct the whole change.¹⁶

¹⁶ However, relatively reliable heuristics are implemented, e.g. by the cvsps project. [18]

3.3 Modern Centralized Systems

Perforce [49] is a widely commercially used proprietary version control system that provides rather advanced version control features, though it keeps within the centralized model. Compared to CVS, it supports atomic commits (tying changes in multiple files together), tracking file renames, and more comfortable branching and merging support (branches are presented as separate paths in the repository and repeated merges are supported).

In the open-source version control realm, SVN [59] has been developed as a direct CVS replacement and currently appears to remain the final step in the evolution of centralized version control. It shares many of these features, though many workflows are different. It is seeing rapid adoption [58] and has basically dethroned CVS as the default choice of centralized version control for open-source projects.

3.4 Older Distributed Systems

BitKeeper [11] is another proprietary version control system worth mentioning, since it pioneered the distributed version control field¹⁷ and also has relatively well-understood capabilities and internal workings, as it was for a long time available for free usage by open source projects, and some prominent projects like the Linux Kernel used it.

BitKeeper supports atomic commits (though files have revisions tracked individually as well), generic DAG history and the basic distributed version control workflows;¹⁸ however, each repository contains only one branch. RCS-style revision numbers are used, the same revisions can have different numbers in different repositories, and revision numbers on branches can shift during merges. Internally, BitKeeper uses the weave file format and makes full use of its annotation capabilities in the user interface and during merges (see 6.4). It is still actively developed and reportedly commercially popular.

With regard to the history, we shall also mention GNU Arch [32] as the first open-source distributed version control system that has seen wider usage; however, it suffered from bad performance and a difficult user interface. The project has been basically discontinued (though it is still formally maintained), but Bazaar (see below) can trace its ancestry back to GNU Arch to a certain degree; Darcs has also been somewhat inspired by aspects of its design.

¹⁷ The concept of distributed version control has been explored before by e.g. Aegis, but BitKeeper was the first system to bring it to widespread use.
¹⁸ That is, the pull/push operations and good merging support.

3.5 Monotone

Another important milestone is Monotone [39], with its design built around a graph of objects identified by hash numbers computed from their contents,¹⁹ serving as direct inspiration for Git and Mercurial. Monotone itself puts strong emphasis on security and authentication, using RSA certificates during communication, etc. It is actively developed but does not see as wide adoption as the three systems below, mainly due to performance issues.

Git and Mercurial build upon the Monotone object model, which is generally built as follows (Monotone implementation details notwithstanding). There are three main object classes: files (or blobs), manifests (or directory trees) and revisions (or commits/changesets). All objects are referenced by a large-size hash of their content that should have guarantees akin to cryptographic hashes (currently, SHA1 is used in all popular systems). File objects contain the raw contents of particular file versions. Manifest objects contain a listing of all files and directories²⁰ at a particular point of history, with references to objects with the appropriate content. Revision objects form the history DAG as a subgraph of the general object graph by referencing the parent revision (or revisions, in the case of a merge) and the corresponding manifest object. Usually, the log message and authorship information are also part of the revision object.

The reader should notice that this scheme is cascading: changing a single file generates a new manifest object with a unique id (the file hash inside the manifest object changed, thus the manifest object hash will change too), and this will in turn result in a revision object with a unique id as well. Even committing the same content with the same log message in different branches will not create an id clash, since the parent revision hash will be different within both objects.

Tagging can be realized by simply creating a named reference to either the revision object or another object containing the reference and also some tag description, possibly digitally signed by the project maintainer to confirm that the tagged revision is an official release; thanks to the usage of cryptographically strong hashes for references, the cryptographic trust extends to the complete revision content.

¹⁹ Technically, the first system using hashes as identifiers is OpenCM [46], but this system has never really taken off as far as we can tell.
²⁰ Git uses one object per directory instead; there is a trade-off between object reuse and dereferencing overhead in this choice.

3.6 The Three Rulers

In spring of 2005 a rapid series of events resulted in termination of the permission to use BitKeeper for free software projects, providing an immediate impulse for improvement in the area of open-source version control systems, especially to the Linux Kernel community: almost in parallel, the Git and Mercurial projects were started. Independently, Canonical Ltd. (the vendor of the popular Ubuntu) started the Bazaar²¹ project around the same time.

All three systems provide atomic commits, store history in a generic DAG and support basic distributed version control workflows.

Git [24][25] re-uses the generalized Monotone object model; by default it stores each object in a separate file in the database, thus having a very simple and robust but inefficient data model;²² however, extremely efficient storage using Git Packs (4.3.2) is also provided. In its user interface, Git emphasizes (but does not make mandatory) the rather unique concept of the index, providing a staging area above the working tree for composing the next commit.²³ Git is written in C (and a mix of shell and Perl scripts) and is tightly tailored to POSIX systems, which historically meant portability problems, especially with regard to Windows. It is designed as a tool providing a large number of high- and low-level commands, and it is traditionally extended by tools calling the commands externally.

Mercurial [37][68] has a very similar object model to Git and Monotone, but uses the revlog format (4.3.3) for efficient storage. It is written mainly in Python and thus does not suffer from the portability problems of Git; since it uses an interpreted language, the main extension mechanism is writing tightly integrated plugins. Performance-sensitive portions are implemented in C.

While both of the systems above have always emphasized their robust data models and performance, Bazaar's [9] main focus is a simple and easy user interface. Though it internally uses elaborate unique revision identifiers, it presents primarily RCS-like revision numbers to the user, and its user interface has direct support for some common workflows like emulation of the centralized version control setup (checkouts) or the Patch Queue Manager, an advanced alternative to all developers pushing to a central repository.²⁴ Bazaar is also written in Python, making the same tradeoffs as Mercurial.

²¹ At that time, it was called BazaarNG, a successor to an original Baz(aar) project which was itself a fork of GNU Arch.
²² This scheme precludes the need for any locking whatsoever, but there is no compression in use and using many small files means inherent overhead on the filesystem level.
²³ Originally a low-level mechanism, the index was elevated to a first-class user interface concept by its popularity among developers. If users choose to use it, they manually add files (or even just chunks of changes) from the working tree to the index at various points, and then the index is committed, regardless of the working tree state. Thus, the user may modify a file, add it to the index and then, before committing, modify it again, for example with the changes required to compile it locally; the commit operation will use the older version of the file. Explicitly recording all files to commit in the index can encourage greater discipline in making appropriately fine-grained commits. Also, non-trivial conflicts are recorded in the index, making for a natural way of resolving them by simply adding the desired resolution to the index.

3.7 Darcs

Darcs [19] is a somewhat unusual specimen in the arena of version control, as it focuses on changes as the first-class objects instead of trees; any particular version of the repository is merely a combination of patches on an empty tree, and manipulating the patches is the primary focus.

Darcs supports a single branch per repository and names the patches based on user input; particular combinations of patches can be tagged to mark specific versions of the tree.²⁵ Darcs does not organize the patches in a particular graph; they are structured more like a stack, though a graph could be imagined based on interdependencies between the patches. Merges of several patches create auxiliary patches representing the merges.

A particularly interesting feature of Darcs is precisely how it handles patch merges: it uses a formal system called patch algebra (6.6) to describe dependencies between patches and basic operations on the patch stack required to properly merge two sets of patches (branches).

Darcs is written in Haskell, which is a somewhat exotic choice, but it is well portable. Unfortunately, as far as we are aware it suffers from severe performance problems (as discussed in 2.3) when used on large repositories with long histories; more efficient merging algorithms as well as proving the formal correctness of the underlying patch theory are subjects of ongoing research.

4 Storage Models

The storage model of a system can be quite different from the logical design in order to accommodate practical considerations of space and time efficiency. Here we shall explore possible approaches to permanent storage of the data and metadata tracked by version control. All version control systems differ in implementation details and various individual tweaks; we will not try to be exhaustive here and will focus only on the most widespread and most interesting ideas.

First, we will dwell on the currently popular algorithms to compare two objects and create a delta: a complete technical description of their differences.²⁶ Second, we shall elaborate on possible ways of representing an individual delta as well as a whole single-file history; then we will look at the approaches to organization of the whole repository.

4.1 Delta Algorithms

There is a certain variety of delta algorithms in use, and there can even be multiple algorithms used within a single system: some algorithms provide suitable and sensible output for human review, while others are focused on finding the smallest set of differences and providing the minimal required output for storage.

24 Developers send their pull requests or actual patches to a public mailing list, where they can be acknowledged or vetoed by others; a robot then automatically picks them up, checks whether they pass the required test suites and eventually merges them into the main branch. [47]
25 In fact, a Darcs tag is itself a patch, one that is empty and depends on all patches currently applied.
26 E.g. a series of insert–copy commands or the text output of a "diff" tool.

4.1.1 Common Techniques

First, we shall describe several common techniques used when dealing with deltas. The first technique worth mentioning is used, to our knowledge, by all commonly used systems newer than CVS: delta combination [45][68]. When applying a chain of deltas, they are first merged into a single large delta, and only then applied. This reduces the number of data copies from O(deltas) to O(1) and takes advantage of cache locality, dramatically increasing the performance of applying delta chains.

Another interesting technique, used most notably in Subversion, is using skip-deltas when storing delta chains. [45] In order to limit chain lengths, the deltas are not always created against the neighboring revision but instead form a structure resembling a binary tree; thus, reconstructing a particular revision does not require O(revs) deltas but only O(log revs). However, when adding new revisions, parts of the tree or the whole tree need to be reconstructed, and the deltas are usually larger. Newer systems usually solve the problem of overly long delta chains by simply inserting the whole revision text when the chain becomes too long or too large in size.

4.1.2 Myers' Longest Common Subsequence

Perhaps the most widely used algorithm was proposed by Eugene Myers in [7] and is based on recursively finding the longest subsequence of common lines in the list of lines of the compared objects; it is used most notably by the GNU diff tool [33], but also as the user-visible diff algorithm of most systems. The main idea of the algorithm is to perform two breadth-first searches through the space of possible edits (line adds/removals/keeps), starting from the beginning and the end of the object in parallel; when the two searches meet, we have found the shortest edit sequence. The algorithm has quadratic computational complexity, so it is unsuitable for very large inputs.

As noted in [68], while this algorithm is quite suitable for human consumption and does produce optimal longest common subsequences, it does not produce optimal binary deltas, since it weights additions, changes and removals all the same, while binary deltas need to actually record only added data; removed data does not need to be spelled out.27

27 However, this should be easy to work around by adjusting the breadth-first search to add 0-length edges at the beginning of the queue and 1-length edges at the end.

4.1.3 Patience Diff

An interesting variation of the above algorithm, devised by Bram Cohen, was recently introduced as the default diff method in the Bazaar system. [10] A common problem with Myers' algorithm is that the search for common subsections misaligns additions of parentheses on separate lines or other simple syntactic constructs, making the output harder for humans to read.

The main idea of this algorithm is to "take the longest common subsequence on lines which occur exactly once on both sides, then recurse between lines which got matched on that pass." [61] It also achieves better computational complexity by using Patience Sorting [36] to find the longest common subsequence.28

28 The referenced paper describes finding the longest increasing subsequence. To find the longest common subsequence, a sequence of lines corresponding to one object is created, with the elements being the line numbers of the same lines in the other object; lines that are present more than once or never are ignored in the search.

4.1.4 BDiff

The bdiff algorithm [68], which improves on the Ratcliff–Obershelp pattern recognition [48], comes from the Python difflib library [53] and is used by Mercurial, both for delta storage and for providing difference listings to the user. In contrast to looking for the longest common subsequence, it searches for the longest common contiguous substring within the objects and then recursively in the parts preceding and succeeding it. It is quadratic like the Myers algorithm, but has been measured [68] to have better average performance and to output more human-friendly deltas.

4.1.5 XDelta

The tool xdelta [72][73] popularized the use of the Rabin Fingerprinting technique [23] for delta generation: it is effectively a simple modification of the Rabin–Karp substring matching algorithm, inspired by the work on [22]. Subversion [60], Monotone [41] and Git [28] use this variation for internal storage, generating one-way binary diffs between arbitrary non-textual blobs.

The algorithm produces Copy and Insert instructions on the output and depends on fast hashing. The first object is divided into small blocks29 and each block is hashed and added to a hash table (proportional to the input size). Then, a running hash of a same-sized window is computed over the second file; the moment a match is hit, all found blocks of the first object are stretched to the largest possible matches at the current position in the second object, and a Copy instruction is generated for the largest match. An Insert instruction is generated for unmatched data in the second object.

This algorithm is very fast (having much better computational complexity than Myers': O(n) vs. O(n^2)) and produces efficient diffs for arbitrary data; however, it is entirely unsuitable for human consumption.

A similar scheme is also used in zdelta [74], with reportedly better efficiency and with code released under a more permissive licence. ZDelta reuses the LZ77 algorithm [4] for finding the matching parts and Huffman codes for encoding the result.

29 The window size is 16 bytes in Git; a small power of 2 is recommended in general.

4.2 Delta Formats

4.2.1 Trivial Approach

Not really a delta format by itself, the most naive approach to storing history is to simply store every version of every object separately, possibly using some compression method (not taking advantage of the historical context of the object). This is the obvious format used for "poor man's version control" methods like archiving compressed snapshots of the whole project at regular intervals and backing them up. However, perhaps surprisingly, this method does see usage in some modern systems as well; namely, it is used as the basic in-flight storage format in Git.

4.2.2 Unified Diff

The unified diff format [69], albeit not used for the storage of changes itself, is the de-facto standard for the interchange of plaintext changes, being the usual format used in "patch" files.30 For each file, the diff contains all the change chunks, the areas of the file where changes occurred. Each chunk consists of the chunk header localizing the chunk in both versions of the file (starting line number and line length of the chunk), context lines immediately surrounding the changed lines (preceded by a space) and the change itself: added lines preceded by a + sign and removed lines by a - sign; modified lines are written as a +/- line pair. Unidentified lines in the diff are ignored, which allows version control systems to include custom metadata in the diff.31

This format has a number of advantages: it is easy to generate, it is easy for a human to inspect for the purpose of change review (including easily tweaking details of the change in the unified diff itself), it is easy to apply automatically32 and, very importantly, it is possible to apply the differences to a version other than the one the diff was created from. This is made possible by the presence of the context lines: even if the starting line numbers in the chunk header do not fit the context, the program can automatically search the vicinity of the area for the context lines and apply the patch fuzzily. At the same time, the context lines provide a way to verify that the change is still applicable as-is to the target version.

We should also mention a Git extension to the unified diff for showing a diff representation of merges: the combined diff [26] (see figure 1 for an example). Instead of a single column with the ± marks, multiple columns are included, one per merge parent, allowing a concise description of which lines differ against which parents. Furthermore, by excluding hunks that contain only changes coming from one of the parents, we get a diff describing only the changes actually introduced by the merge itself: the manual resolutions of merge conflicts.33

    diff --cc git-am.sh
    index 75886a8,4dce87b..b48096e
    --- a/git-am.sh
    +++ b/git-am.sh
    @@@ -9,9 -8,9 +9,9 @@@ git-am [options] |..
    git-am [options] --skip
    -d,dotest= use and not .dotest
    +d,dotest= (removed -- do not use)
    i,interactive run interactively
    - b,binary pass --allo-binary-replacement to git-apply
    + b,binary pass --allow-binary-replacement to git-apply
    3,3way allow fall back on 3way merging if needed

Figure 1: Combined diff example. [26] The diff and index lines are Git-inserted metadata, and the trailing text at the @@@ line is only to give the reader a semantic context of the hunk.

30 Sometimes, a very similar format called "context diff" is also used; however, unified diffs are much more common.
31 Commonly, informational-only notes about the compared revisions are inserted. However, e.g. Git uses this to extend the diffs with file rename/copy information and can make use of the metadata when applying the patches.
32 UNIX provides a standard patch(1) tool. [52]
33 Conflicts are further explained in 6.2.

4.2.3 VCDIFF (RFC3284)

The VCDIFF format [64] deserves a brief mention, being the official standardized way to exchange binary deltas over the network, specified by RFC3284. However, to our knowledge no widely used version control system actually uses this format for anything, though it is the default output generated by the xdelta tool (4.1.5). The main reason is that it is easier and faster for individual systems to use their native delta formats even for data interchange, and that the VCDIFF format is rather baroque: implementing it is not trivial and it is generally not as compact as the native protocols.

4.2.4 RCS Delta

RCS uses the simplest representation of a sequence of changes, which also spilled over to CVS and heavily influenced the SVN BerkeleyDB schema.34 [54] The representation is in text format and is optimized for quick access to the latest revision, which is stored at the top of the file and always in full. Older revisions are then each stored as a delta against the succeeding one. In the case of branches, the delta direction is reversed and newer revisions are represented by forward deltas; thus, to get the tip of a non-trunk branch, the delta sequence all the way from the latest trunk revision to the branch base and then back to the branch tip must be applied.

The delta itself is simply composed of lines each containing an add (a) or delete (d) command, followed by the line number in the base data and the number of lines, with the lines to insert immediately following the a command.

Aside from the very slow reconstruction of non-trunk branch tips and old revisions in general, the major issue with this format is the requirement to rewrite the whole file on each commit, making the repository database very prone to lock contention and data corruption and making the creation of a new revision quite an expensive operation.

34 The SVN FSFS format is quite different and perhaps most resembles the Mercurial revlogs (4.3.3). We chose not to describe SVN formats in detail since we deem them not particularly interesting in themselves.

4.2.5 Weave

The weave format introduced by SCCS [63] used to be rather obscure for a long time, but currently many version control developers are familiar with it due to its prominent usage by BitKeeper and some newly developed merging algorithms. Still, it has not seen widespread adoption.35

Instead of storing each revision separately, this format intersperses all revisions, listing all the lines that ever appeared in the file together with the list of revisions they appear in. Then, retrieving any revision means simply extracting all the lines that have the revision in their set.

This format shares many disadvantages with the RCS delta format, especially with regard to locking and data corruption potential. In addition, the cost of retrieving any revision is uniformly proportional to the history size. On the other hand, the line annotation36 is available very cheaply (compared to systems using delta-based storage, where retrieving it involves walking the whole history), and having per-line history information enables some interesting merging methods to be used (see 6.4).

35 The Bazaar project actually used it for a brief period but then switched to a more traditional delta chain representation for the benefits of an append-only data structure. [13]
36 Information about who added a particular line, when and in which revision.
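The weave idea from 4.2.5 can be illustrated with a small Python sketch. This is not the SCCS or BitKeeper on-disk format; the class name `Weave`, its methods and the use of `difflib.SequenceMatcher` to place new lines are assumptions made purely for illustration.

```python
import difflib


class Weave:
    """Toy weave: every line that ever existed, plus the revisions containing it."""

    def __init__(self):
        # Each entry is [text, origin_revision, set_of_revisions_containing_it].
        self.entries = []

    def extract(self, rev):
        """Retrieve a revision by keeping the lines whose set contains it."""
        return [text for text, _, revs in self.entries if rev in revs]

    def annotate(self, rev):
        """Per-line (origin revision, text) pairs; no history walk is needed."""
        return [(origin, text) for text, origin, revs in self.entries if rev in revs]

    def add(self, rev, lines, parent=None):
        """Interleave a new revision (a list of lines) into the weave."""
        # Weave positions of the lines visible in the parent revision.
        visible = [i for i, e in enumerate(self.entries) if parent in e[2]]
        old = [self.entries[i][0] for i in visible]
        matcher = difflib.SequenceMatcher(None, old, lines, autojunk=False)
        pending = {}  # weave position -> new entries to insert before it
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines simply gain the new revision in their set.
                for k in range(i1, i2):
                    self.entries[visible[k]][2].add(rev)
            elif tag in ("insert", "replace"):
                at = visible[i1] if i1 < len(visible) else len(self.entries)
                pending[at] = [[t, rev, {rev}] for t in lines[j1:j2]]
        merged = []
        for i, entry in enumerate(self.entries):
            merged.extend(pending.get(i, []))
            merged.append(entry)
        merged.extend(pending.get(len(self.entries), []))
        self.entries = merged
```

Adding revision 1 as `["a", "b", "c"]` and revision 2 (with parent 1) as `["a", "x", "c", "d"]` leaves five entries in the weave; both revisions can be extracted by a single pass over it, which is also why retrieval cost grows with the total history size rather than with the size of the requested revision.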

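The add/delete commands of the RCS delta format (4.2.4) can be applied with a few lines of Python. The sketch below assumes the conventions of the `diff -n` output that RCS stores: `dN M` deletes M lines starting at base line N, `aN M` adds the following M lines after base line N, and all line numbers refer to the base text. The function name `apply_rcs_delta` is hypothetical.

```python
def apply_rcs_delta(base_lines, delta_lines):
    """Apply an RCS-style delta (list of command and text lines) to base_lines."""
    out = []
    pos = 0  # number of base lines already consumed
    it = iter(delta_lines)
    for cmd in it:
        op = cmd[0]
        start, count = (int(n) for n in cmd[1:].split())
        if op == "d":
            # Copy untouched base lines before the deleted range, then skip it.
            out.extend(base_lines[pos:start - 1])
            pos = start - 1 + count
        elif op == "a":
            # Copy base lines up to and including line `start`, then insert.
            out.extend(base_lines[pos:start])
            pos = start
            out.extend([next(it) for _ in range(count)])
        else:
            raise ValueError("unknown delta command: %r" % cmd)
    # Whatever remains of the base text is carried over unchanged.
    out.extend(base_lines[pos:])
    return out
```

Because every command addresses the base text, the whole delta can be applied in one forward pass; reconstructing an old revision still means repeating this pass once per delta in the chain.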
4.3 Repository Database Formats

4.3.1 Trivial Approach

The trivial approach is to simply use separate files for appropriately fine-grained objects. This usage is very common in many systems; it is the approach taken by CVS and Mercurial, where every file in the project directory has its history stored in a corresponding file in the repository. It is also the basic storage format of Git, where each object (commit, tree or blob, a particular state of a version-tracked file) is stored in a separate file.

4.3.2 Git Packs

As mentioned above, Git simply stores its first-class objects directly on the disk as its primary format. However, as explained earlier, this is very inefficient for long-term storage; for that purpose, Git Packs offer a storage format showing extreme size efficiency in practice.37

A Git Pack [26] consists of a pack file and an index file. The index file provides a simple two-level index of the objects keyed by the object id. The pack file contains the compressed Git objects of all basic types (commit, tree and blob) but can also accommodate an extra object type, delta.

The pack-specific delta objects represent a given object by describing binary differences against another object of the same type,38 as yielded by the xdelta algorithm (see 4.1.5). In theory, any object of the same type will do as a base, but Git sorts the objects using a special heuristic and then considers the n following objects in the list as delta candidates,39 picking the smallest delta generated.

The list heuristic is the core of the Git Pack efficiency. [16][28] The important criterion is that blob objects are sorted primarily by the filename they were first reached through, with the last characters being the most significant ones. (Thus, all .c files are grouped together, then all files with the same name in different directories, etc.) The secondary sorting order is by object size, largest objects first; thus, deltas are generated from larger objects to smaller ones, allowing a more efficient representation.

Note that the delta order is in almost no way related to the physical ordering of the objects in the pack file. There are no theoretical restrictions on that; however, Git orders them based on a breadth-first search of the graph of all objects, starting from the top commit objects on the defined branches. In other words, the objects necessary for recreating the latest commits are lumped together at the beginning of the pack file. This particular ordering greatly improves I/O locality for the most common I/O patterns.

One specific exception in the physical ordering is that in the case of delta objects, the base object is stored before the delta object; this ensures locality when reconstructing delta chains. In practice, this rule does not reorder the objects significantly. As Linus' Law states: "Files grow" [16]; thus, the base objects (larger files) tend to come before delta objects (smaller files) anyway, and while it is not enforced, the delta objects tend to become "backward deltas" most of the time.

Git Packs use no locking mechanism: they are not created on the fly, but only at certain points in time during a "garbage collection" operation over the repository, from the loose one-per-object files created in the repository by regular usage.40

4.3.3 Mercurial Revlogs

Another popular data structure is the so-called revlog used by Mercurial. [68][38] While Git packs are designed around heuristics, guarantee very little about worst-case performance and are optimized for work with a hot cache,41 revlogs can guarantee good worst-case seek performance (at the cost of less flexibility for improving average performance) and are tailored for cold cache access.42 Unfortunately, while there are some casual benchmarks available on the web, a detailed comparison with clearly defined cache status to confirm these expectations is yet to be published.

37 Git also uses the same format for native network data transfer.
38 Or, of course, another delta object. However, the delta chain length is upper-bounded, by default by the value of 10.
39 By default, the delta window is n = 10, but commonly values like 50 and rarely even 100 are used for generating ultra-dense packfiles.
40 One exception is when new commits are pulled by the native protocol, which itself is in the Git Pack format; the data is then saved directly as a pack.
41 That is, when the user uses the system continuously for a longer time and most of the relevant data is cached in memory.
42 With a "cold cache", most of the data are still on disk.
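The candidate ordering described for Git Packs in 4.3.2 can be sketched in Python as follows. This only mirrors the prose description (name compared from its last character, then size descending, then a window of n following objects); it is not Git's actual implementation, and the function names and the `(path, size)` tuple layout are assumptions made for the example.

```python
def pack_order(blobs):
    """Order (path, size) pairs the way the delta heuristic is described.

    The primary key is the path compared from its last character backwards,
    so files sharing an extension and then a base name end up adjacent;
    the secondary key puts larger objects first.
    """
    return sorted(blobs, key=lambda blob: (blob[0][::-1], -blob[1]))


def delta_candidates(ordered, index, window=10):
    """Objects considered as delta partners for ordered[index]."""
    return ordered[index + 1:index + 1 + window]
```

For instance, two versions of `src/main.c` of sizes 120 and 150 are placed next to each other with the larger one first, so each object has a similar, slightly smaller successor within its window and can be paired with it cheaply.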

Mercurial uses separate revlogs for each tracked file and two extra revlogs for the manifests and changesets. Each revlog consists of an index file for revision lookup43 and a data file containing a hunk per revision, every hunk being either a full verbatim revision or a delta against the previous revision; the total size of the delta chain is limited by a small multiple of the revision size.44 If the revision is represented by a delta, the index contains a pointer to the base revision so that all the necessary data can be read from the file in a single sweep. Revisions from different branches are all stored in a single revlog, but when forming delta chains only the order of recording the revisions matters, not their branch.

When recording a new revision in a revlog, the revlog is locked for writing and a new hunk is appended to the data file and then recorded in the index. Since the revlogs are append-only, no locking is required for readers, and recovering from an interrupted write is a matter of simply truncating the file. Furthermore, when working with the whole tree (both reading and writing), all the revlog files are visited in a fixed order compatible with the standard system file ordering; this way, the filesystem can maintain the most desirable on-disk layout, and using standard system tools to copy a repository will result in a good on-disk layout as well.

43 A slightly modified RevlogNG format used by newer Mercurial versions supports interleaving of index information and data hunks.
44 Mercurial 1.0 uses the value of 2.

5 User Interaction

In the development of the new generation of version control tools, the importance of a good user interface has been underestimated in the past, although it tends to be one of the most important factors for users when deciding which system to use; a good user interface is therefore one of the main considerations in the development of currently widely used tools.

Traditionally, all of the tools provide a command-line interface, which tends to be the main way developers interact with the system. However, many authors also explore ways to interact with the system efficiently using a GUI, both for actually operating the system and for merely exploring the history. Another separate area are web interfaces for inspection of repositories.

5.1 Command-line Interfaces

One important consideration with command-line interfaces is to keep in mind that users are actually going to type many commands; e.g., GNU Arch requires the usage of extremely long revision identifiers, and for systems using hashes for object identification, it is important to allow the shortcutting mentioned earlier.

Another point to consider is to design the interface carefully and logically around a minimal set of commands, also hiding implementation details whenever possible (e.g., by trying to intelligently default to a suitable merge algorithm instead of requiring the user to decide on one). Git currently provides about 140 commands, while a large portion of them is in fact almost never interesting to end users and is useful only as helper commands within scripts or other commands.45

Note that even with traditional command-line interfaces, there can be interesting variations; for example, Git extends the basic edit–commit workflow with the optional Git index concept, presenting a staging area between the working tree and the next commit; see 3.6 for details.

A user interface can take very unusual forms; a specific one is the integration of version control within the filesystem interface. In that case, accessing a particular revision of a file is usually performed by simply opening a specifically crafted interface. This concept goes as far back as VMS, where the filesystem interface had built-in support for file versioning. [71] Vesta, a configuration management system with version control capabilities, provides primarily filesystem-based access to individual revisions (furthermore with O(1) access guarantees). [65] There also exist alternative filesystem-based interfaces for more traditional version control systems (though mostly in early development stages), e.g. BzrFS [12] or GitFS [31].

45 To simplify the usage of Git, we created a frontend, Cogito, that used the low-level Git interface to create an easy-to-use version control tool carefully designed to simplify many workflows and present a simple and consistent set of commands to the user, with consideration for users used to older systems like CVS or Subversion. Since most features provided by Cogito have been adopted or obsoleted by upstream Git development, Cogito is not maintained anymore (but we still consider the Git UI somewhat lacking).

5.2 Basic GUI Tools

There are several types of GUI tools. The most obvious candidates are all-in-one tools covering most of the system functionality, often reusing other helper applications (such as history browsers); most systems have tools like that, e.g. WinCVS, git-gui, Bzr-Gtk, etc.

An interesting approach is to integrate the version control system with basic system shells or file managers (like the Windows Explorer or Konqueror); version-controlled files show their state in their icon (clean, modified, containing merge conflicts, ...) and all versioning operations are available through the context menu. The classical example is TortoiseCVS; clones like TortoiseSVN, TortoiseBzr, TortoiseGit, TortoiseHg, etc. also exist. Similarly, version control capabilities are often integrated with IDEs.

Another family of tools are simple GUI helpers designed merely to be used in concert with the classical command-line controls; this has been mostly pioneered by BitKeeper, providing citool, mergetool etc. to make operations like reviewing the to-be-committed state or merging files more comfortable. Currently, e.g. Git frontends like git-gui or tig provide a way to control the to-be-committed content down to the per-hunk level of diffs, and there exist system-agnostic tools for graphical review of diffs (xxdiff, kdiff3, ...) and even merging assistance.

Visual UI tools do not always need to be graphical. Many developers prefer to work exclusively in a terminal, and sometimes it may be necessary to work on a remote server over an ssh session; tig is a good example of a terminal tool that provides ASCII-art46 history diagrams and acts as a frontend to much of Git's functionality (including index management down to the diff-hunk level), without requiring any graphics rendering.

46 Graphics constructed in text mode from standard letters and symbols.

5.3 History Browsing

With the advent of distributed version control, it became much more important to properly visualize the project history; suddenly it is not a simple tree of commits, but a more complex graph with possibly many branchings and merges, and the relationships between revisions may not be immediately obvious to the user.

The first popular history visualizers were BitKeeper's revtool and Monotone-viz [42], which focuses on visualising the revision graph, offering it as the main visual object to the user; clicking on nodes displays revision details. During Git's infancy, the tool was adopted as Git-viz, but it did not see widespread usage; however, a similar tool is being used in Bazaar as part of BzrTools.

Instead, a much denser history presentation gained popularity in the Git community with the tool gitk. Its primary visual objects are the revisions themselves, with the first line of the commit message (traditionally regarded as a kind of "subject") and authorship information shown for each commit; the graph is still shown, but only in a narrow area at the margin of the screen, with nodes aligned to the line structure of commits. Reused in other Git user interfaces (both graphical ones and tig), this way of visualization became popular in the Mercurial community as well (hgk); Monotone even produces a similar kind of output in ASCII art with its native mtn log command, and there also exist projects like git-browser which use AJAX technologies to bring this interface to the web.

5.4 Web Interfaces

An important part of the user interface of a version control system is a web interface; most projects have some kind of web interface available so that people wanting to look at the source and the history do not have to install the particular version control tool and download the whole project. A good interface also should not assume the users' familiarity with the particular system, since it gets a much more diverse audience than other tools of the system.

Historically, cvsweb, viewcvs and similar tools for SVN focused primarily on presenting the tree of files, with support for inspecting the individual history of files when needed. While similar approaches are also present in some web interfaces for distributed version control systems (e.g. GitHub), the paradigm shift to primarily presenting the project history is visible here as well, in the widely used gitweb interface and the hgweb clone. On the project front page, the user is not presented with the root directory of the file tree, but instead with a summary of the latest commits, available branches and tags, etc.; browsing the tree is of course possible, but it is not the primary activity anymore.

Something of a rarity, the Gist service47 uses modern web technologies to allow users to build whole repositories from scratch from within the web browser; the user is presented with a text area where they can paste content (akin to various pastebin services), but further modifications are then stored in a Git revision history, and it is possible to add multiple files and eventually pull the history. It is unclear to us, however, how often it is actually used as anything more than a trivial one-off pastebin.

47 Part of the GitHub cloud.

6 Merging Methods

One of the major past challenges with systems encouraging massive branching was devising a powerful merging mechanism that would make the merging process sufficiently easy, quick and robust for frequent usage. This ability is crucial; otherwise people will become reluctant to create branches and the main strength of distributed version control will dwindle. On the other hand, practice shows that merge algorithms should be simple enough to be easily predictable, even if the tradeoff is to require manual intervention by the user more frequently. A common problem of algorithms not based on three-way merge is that while they may generate fewer conflicts, their merge resolutions may be relatively unintuitive and some of their properties are controversial and unfamiliar to the users.

In this section, we will describe the most common merge strategies (three-way merge and recursive merge) as well as some more advanced theoretical developments (mark-merge, PseudoCDV merge). We will also consider the patch algebra based way of merging in changeset-oriented systems.

There is a great number of considerations when designing and comparing merging algorithms and we cannot cover them all here; we recommend the pages dedicated to merging algorithms in RevCtrl [55] for a detailed study of all the aspects.

6.1 Fast-forward Merge

First, we should notice what happens in the revision DAG when a "blank" merge is actually performed: consider merging revisions 1.2 and 2.4 as shown in figure 2. On branch 1, no new revisions appeared since its last merge with 2.2, thus merging the 2.4 revision will effectively merely copy this revision over to branch 1. A system might create a merge-node 1.3 anyway, but this may not be desirable, since many useless merge-nodes will be created in the extremely common scenario of users only tracking some project in a read-only fashion: every time they update their copy, no local revisions will be found but the merge will create a merge-node revision to represent the update anyway.

To avoid this problem, a "virtual" merge strategy of fast-forwarding is used: in case one of the merged branches contains no new revisions since the last merge, it is simply repointed to the head commit of the other branch. Thus, in our scenario, branch 1 is simply repointed to revision 2.4. Then, when committing a new revision to branch 1, 2.4 will be treated as the new fork point.

This technique allows branches of the DAG to converge whenever possible; however, it also breaks the ordering of commit parents: after fast-forwarding to a different branch, merges with the original branch will appear "flipped". The users either cannot rely on the parent ordering, or they have to constrain the system usage by a policy that disallows scenarios that would provoke a fast-forward.48

48 Fast-forwarding can frequently occur in tight push-pull loops performed concurrently on a shared central

[Figure 2: A history DAG example. Nodes: 0; 1.1, 2.1, 3.1; 1.2, 2.2, 3.2; 2.3, 3.3; 2.4.]

6.2 Three-way Merge However, the choice of a particular b has ma- jor impact on the size of d and d , affecting Three-way merge49 is the most common method x y the conflicting portion of d. Thus, b should be of merging in use. Assume merging two version chosen intelligently, usually by taking the least nodes x and y into merged version m. Let b be common ancestor (LCA). the merge base — the “nearest common ances- In case of a tree, LCA can be simply described tor” of x and y. Then, using difference func- as the node nearest to the tree root on a non- tion50 ∆: (text1, text2) → delta (and delta ap- oriented path from x to y, and it will be always plying function ∆−1:(text1, delta) → text2): unique as follows from tree properties. However, this approach is satisfying only for the first-time d = ∆(b, x) x merge of the branches of x and y — if a merge dy = ∆(b, y) between the branches is performed repeatedly, the same b will always get chosen, even though d = d ∪ d x y the previous merge outcomes would be much −1 m = ∆ (b, d) more reasonable choices as the size of dx and dy Simply the combination of differences be- can be assumed to generally increase over time. tween base and both merged versions are ap- This is the main motivation behind making plied to the base version again and the result is the history a DAG instead of a tree; when the merged version. The problematic part here recording the m version, both x and y are is determining d — if d and d concern discrete marked as its ancestors instead of just one of x y 53 portions of the base version, combining them is them. When determining the LCA of descen- trivial. dands of x and y later, m can then be used in- However, when both change the same part stead of the original b. Thus, for example for of base version, each in a different way, they the graph in figure2 earlier, the base of the 2.4 cannot be combined in a simple way. A merge merge of 2.3 and 3.3 would be 2.2. 
conflict arises, only the non-conflicting parts of However, in this case we lose the guarantee of d are applied and for the rest the user is re- b uniqueness, since multiple discrete paths be- −1 tween x and y can exist: to deal with this prob- quired to choose between mx = ∆ (b, dx) and −1 lem, recursive merge extension of the algorithm my = ∆ (b, dy) or combine mx and my manu- ally — usually the system inserts both variants has been devised. to the file, separated by visible one-line mark- ers.51 0 In general, in single-shot usage52 the algo- rithm can work on any b merge base version. repository. One way to avoid them is to tell the devel- X0 Y0 opers not to merge the changes their peers have done in parallel and pushed earlier, but instead to rebase their own changes on top of the already-pushed ones: this op- eration rewrites the history locally, sequencing two com- mits performed in parallel. X1 Y1 49Unfortunately, we were not able to find out who in- troduced the technique of three-way merging, currently commonly known and widely used; the earliest reference we were able to find is in the RCS paper. [54] merge 50See 4.1 for the tour over various functions. 51In some cases, the system can still resolve simple conflicts, for example if they concern only whitespace characters or perhaps even when merging a simple refor- Figure 3: A criss-cross history graph example matting change with a semantic change. Putting more inteligence into conflict resolution is one of the active areas of merge research. However, silently mis-merging incompatible changes can have dangerous consequences, 6.3 Recursive Merge so making the resolution more intelligent at the risk of increasing error rate can be very harmful. Consider the criss-cross merge scenario [56] 52When the algorithm is applied on two arbitrary ver- shown in figure3. In this case, b can take the sions without historical context. 
In a generic DAG, se- lecting inappropriate b within repeated merges can lead 53In case of a DAG, we prune paths that are supersets to information loss, as described below. of other paths from our consideration. 6 MERGING METHODS 17

values of both X0 and Y0, however taking any working on much finer grain level. In case of con- of these can lose data from the other branch de- flict, the main idea of the algorithm is to simply pending on the nature of the merge — consider leave the conflict markers in place when using conflicting change in X0 and Y0, in X1 resolved the result as a base for further merges. This by taking X0 version exclusively and in Y1 by means that earlier conflicts are properly prop- taking Y0. Then, for b = X0, Y0 version will be agated as well as conflicting changes in newer cleanly merged and vice versa — clearly, conflict revisions. However, some rare edge cases are still should be generated instead. mishandled as described in [56]. This particular problem and its more compli- cated variations led some researchers to aban- 6.4 Precise Merge don the three-way merge altogether and focus on other merge methods instead, some of them A weave-based merge algorithm Precise Codev- described below. However, the problem can be ille Merge [62] has been devised by the Codev- at least partially solved within the context of ille version control system.55 It later turned out three-way merge as well, though the practical that Codeville probably independently invented impact has to be yet evaluated carefully. a merging method very similar to what Bit- A basic rule should be that the system, detect- Keeper uses internally. ing multiple possible merge bases, does never Instead of comparing the to-be-merged revi- randomly select one but instead asks the user sions to a base revision, the algorithm directly for intervention, chooses a proper merge base or performs a two-way merge between the two re- possibly intelligently constructs one. visions and then uses the weave-embedded his- One naive approach we devised is to take tory information to decide which side of each LCA(X1,Y1) recursively and repeat until a conflicting hunk is to be chosen. single revision comes out. 
The practicality of Consider vi() being the number of state this approach is dependent on the development changes (addition or removal; also called gen- strategy adopted by a particular project — if the eration count) of line i at revision r. For each criss-cross merge is not repeated for too long and revision, have a list of line chunks: lx, ly. Then, a common base for both branches appears once the weave of the file is iterated line by line: in a while, the LCA will stay in a reasonable distance from the merged versions and conflicts • if a line is not in either of the revisions, would generally stay within reasonable scale. nothing is done However, for long-term criss-cross merges this • if a line is found to be only in one of the approach degenerates to something similar to a revisions, it is appended to appropriate line tree-based history, and it still leaves some more chunk list lx or ly and revision precedence advanced criss-cross problems unsolved [1].54 flag px is set if vi(x) > vi(y), or py if vi(y) > An important enhancement implemented by vi(x) (obviously, vi(x) 6= vi(y)) Fredrik N. Kuivinen is to actually perform the merges themselves recursively: [27][56] • if a line is found to be in both revisions, either:

|B| = 1 : b(B) = B0 1. px and py are unset and this is part of |B| = 2 : b(B) = M(LCA(B0,B1),B0,B1) common lines block and appended to M(B, x, y) = ∆−1(b(B), x ∪ y) the output 2. only p or p is set and this is non- m(x, y) = M(LCA(x, y), x, y) x y conflicting change (then, lx or ly re- That is, in the example above: spectively is appended) 3. both are set and a conflict block from −1 −1 m = ∆ (∆ (0,X0 ∪ Y0),X1 ∪ Y1) lx and ly chunks is generated; in the latter two cases, lx ← ly ← ∅ and It is easily visible that this works properly in px ← py ← 0. case of no conflicts, and reduces conflict rate by 55The system is not otherwise covered here since it has 54It must be also considered that if the merge gener- never seen wider usage and is not developed anymore; it ates unnecessarily large conflicts, risk of human-induced can be probably said that its main purpose ended up to mismerge raises considerably. be to research various merging algorithms. 7 FUTURE RESEARCH 18
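The line-by-line iteration over the weave described in the list above can be illustrated with a minimal Python sketch. This is not Codeville's implementation; it rests on several assumptions that are not in the original text: the weave is modelled as a plain list of (text, v_x, v_y) tuples, a line counts as present in a revision exactly when its generation count is odd (each state change toggles presence), pending chunks are also flushed once at the end of the weave, and the function name precise_merge is hypothetical.

```python
from typing import List, Tuple, Union

# One weave entry: (line text, generation count in x, generation count in y).
# Assumption: a line is present in a revision iff its generation count is odd.
WeaveLine = Tuple[str, int, int]
Conflict = Tuple[str, List[str], List[str]]


def precise_merge(weave: List[WeaveLine]) -> List[Union[str, Conflict]]:
    """Merge revisions x and y using only per-line generation counts."""
    out: List[Union[str, Conflict]] = []
    lx: List[str] = []  # pending lines present only in x
    ly: List[str] = []  # pending lines present only in y
    px = py = False     # precedence flags for the pending chunks

    def flush() -> None:
        """Resolve the pending chunks, then reset chunks and flags."""
        nonlocal lx, ly, px, py
        if px and py:
            out.append(("conflict", lx, ly))  # both sides changed here
        elif px:
            out.extend(lx)                    # x has the newer state
        elif py:
            out.extend(ly)                    # y has the newer state
        lx, ly = [], []
        px = py = False

    for text, vx, vy in weave:
        in_x, in_y = vx % 2 == 1, vy % 2 == 1
        if not in_x and not in_y:
            continue                          # dead in both revisions
        if in_x and in_y:
            flush()                           # a common line closes the hunk
            out.append(text)
            continue
        (lx if in_x else ly).append(text)
        if vx > vy:                           # counts differ when parity differs
            px = True
        else:
            py = True
    flush()                                   # assumed final flush at end of weave
    return out


# x deleted "b" (count 2 vs 1); y added "d" (count 1 vs 0).
CLEAN = [("a", 1, 1), ("b", 2, 1), ("c", 1, 1), ("d", 0, 1), ("e", 1, 1)]
# Both sides inserted a different line at the same place.
CLASH = [("a", 1, 1), ("x-new", 1, 0), ("y-new", 0, 1), ("b", 1, 1)]
```

Under these assumptions, precise_merge(CLEAN) yields the lines a, c, d, e (the deletion and the addition are both honoured), while precise_merge(CLASH) yields a conflict entry between a and b, because both precedence flags are set before the next common line is reached.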

The advantage of this algorithm is the absence of the need to walk the revision graph at all (provided that the system uses the right storage format, which can be difficult while keeping the desirable append-only properties of the most common ones), simplicity and handling of most of the problematic edge cases of the three-way merges. Also, this algorithm handles cherry-picking naturally well compared to three-way merge.

The disadvantage is the requirement for the weave structure, which is potentially expensive to emulate if the system uses a different storage method,^56 and the somewhat controversial usage of the $v_i$ criterion: it is not immediately obvious that lines with more state changes should always win against those with fewer. Due to the fundamental difference in operation from the much more widely used three-way merge, its operation may be confusing to the users in the cases where it chooses a different result.

^56 Vesta currently uses this merge method and constructs weaves on-demand; scalability of this method was not studied, however. [51]

6.5 Mark Merge

The mark merge (or *-merge) algorithm [35] is somewhat unusual. First, all the other merge techniques described here are designed to merge vectors (file content), while this one is a scalar merge algorithm (for example for directory items: the executability flag, or the file name when merging renames). Second, mark merge has a precisely specified user model and a formal proof that the algorithm matches the user model.

The aim of mark merge is to define the most sensible way of merging scalar values while giving foremost priority to explicit user choices. First, a mark means an explicit decision by the user on the scalar value; a marked revision is one in which the user explicitly set the value. Let us have version $r$; then $*(r)$ shall be the set of minimal marked ancestors of $r$, that is, taking the ancestry graph of $r$ and reducing it to marked revisions, the set of revisions with no descendants. Then, when merging versions $x$ and $y$ with values $v_x$ and $v_y$:

• if $v_x = v_y$, the value shall also be the merge result

• if all $*(x)$ revisions are ancestors of $y$, $v_y$ shall be the merge result

• if all $*(y)$ revisions are ancestors of $x$, $v_x$ shall be the merge result

• otherwise, throw a conflict

For the detailed formal user model and proof of correctness please refer to [35].

6.6 Patch Algebra

In Darcs (3.7), a particularly elegant formal system, patch algebra, has been developed for merging two sets of patches in a changeset-oriented version control system. [20]

The main idea of patch algebra is to provide natural and formally-backed semantics for merging within such a system.^57 All patches are defined to have a set of dependencies on other patches^58 and inversions, and independent patches are allowed to commute on the stack. When merging two sets, independent patches are appropriately commuted (sometimes requiring use of inversion patches to work around fuzzy applications) and for conflicting patches, user resolution is requested; the merge operation is then recorded in a special merger patch.^59

^57 But please note that while Darcs makes an attempt at formal description, it is still lacking in many aspects, both regarding precise definitions and with many proofs missing. We must however mention a relatively little-known theoretical work [2] inspired by the Darcs formalism and attempting to rebuild it on sound foundations.

^58 The dependencies are either logical, as explicitly specified by the user, or syntactic (a patch depends on all patches whose lines it touches), as autodetected by the system.

^59 The actual mechanisms used to preserve all the desirable properties of the patch stack are rather complicated; please refer to [20] for a full description.

Given the way the to-be-merged merger patches are unwound, it is believed that the merge behaviour with regard to pathological cases like criss-cross merge is similar to the recursive merge algorithm or better, since Darcs can make use even of intermediate changes between the two merged revisions and their base that the three-way merge approach cannot see. [56][8] However, we are not aware of any work formally analysing and rigorously comparing the properties of Darcs merges with the other approaches.

7 Future Research

On the technical side, some of the existing systems are not portable enough to be usable equally well in different operating systems (especially with regard to Windows vs. Linux and other UNIX systems) and the user interfaces are still lacking in many aspects, as is documentation.

Regarding the theoretical side, we believe three main areas need research most urgently: content movement tracking, management of third-party changes and more precise formal specification.

7.1 Fine-grained Content Movement Tracking

More fine-grained tracking of changes is required to improve merging algorithms, accounting for changes and following the history of arbitrary parts of content; as noted in 2.5, some of the systems provide a way to follow file history across renames and Git introduces the pickaxe mechanism, but better infrastructure is required for practical usage, especially during merging and for visual representation.

7.2 Third-party Changes Tracking

The current distributed architectures still are not flexible enough to accommodate fully distributed development in open-source projects. When a third-party developer publishes some changes and the upstream developers want to merge them only with certain reservations, the third-party developer can either make the required changes on top of his previous changes, leaving the history in a certain degree of disarray and making his proposed changes increasingly harder to review, or rewrite the history of his changes, which creates problems for others who have already accessed the published version since the rewrite essentially creates a fresh new branch unrelated to the older one.

Similar problems arise when the third-party changes need to be ported to a new upstream version, or when a distributor wants to maintain a series of changes in their version of the product; using the traditional merge mechanisms, it quickly becomes difficult to track the changes against upstream cleanly, while the ability to generate a diff of a single change against the latest upstream version can be crucial.

Thus, many people make use of the distributed nature by only maintaining the changes locally and still using patches for changes interchange; or, they use the version control system for tracking upstream and some special frontend on top of that^60 for tracking their changes in the form of a stack. This has user interface issues and it is problematic to publish the repositories since the history is changing constantly. Another alternative (often used by open-source software distributors when maintaining their packages) is to version-control the patches in their diff form directly.

^60 E.g. mq for Mercurial, StGIT for Git

Using patches for changes interchange can have some benefits with regard e.g. to code review practices, but frequently it is merely extra overhead and it incurs extra load on the third-party developers. A new design that would allow seamless tracking of external changes needs to be developed.

7.2.1 TopGit

We have created TopGit [66], a layer over Git that aims to solve the task of maintaining third-party patches properly, allowing for fully distributed development and proper history tracking of all the changes.

We start off from the concept of topic branches, a popular Git technique of having a specialized branch for development of each independent change, then merging them all together; this is merely a “design pattern”, needing no special tool support. However, we extend this in two ways:

1. We create a directed acyclic graph from the topic branches,^61 giving each branch a list of dependencies (other branches).

2. We store metadata for each topic branch within its file tree: the branch description and authorship information within the /.topmsg file and the list of branches it depends on inside /.topdeps.^62

^61 Thus, conceptually one level higher than the DAG of the commits.

^62 These two files are excluded from merges of dependencies.

The topic branches can depend on other topic branches, but also on regular Git branches. Typically, in the main set of third-party patches, each patch will have its own TopGit branch depending on a Git branch tracking the upstream development, then possibly some third-party patches will have dependencies on some of the TopGit branches if they further extend the third-party changes, and at the top of the graph will usually be at least one so-called staging branch that introduces no changes on its own (as opposed to content branches) but ties all the third-party patches together by listing them all in its dependencies.

TopGit provides tools for creating topic branches, querying them, pulling topic branch updates and merging them with local work, but most importantly a command for updating the branches, which recursively updates all the dependencies of a branch, then merges them in.

The project reached basic practical usability and at the time of this writing is used for maintenance of some packages, but its user interface leaves a lot to be desired and some of the functionality is not fully implemented.

Especially problematic is the task of removing a dependency of a patch in a way that preserves the history, makes it possible to easily re-add the dependency later and does not spread the dependency removal to further depending patches that also have the removed patch as a dependency. [67] Another problem is inventing a user interface that makes history exploration easy, since traditional commit browsers present exceedingly complex trees.

7.3 Formal Analysis

Precise semantics of current version control systems are defined only loosely and the lack of formal analysis is visible especially in the area of merging algorithms. In order to get a truly reliable and dependable system with regard to merging “wild” branches exhibiting pathological behaviour, the precise semantics of various merges should be specified formally; many of the currently widely used merging methods lack detailed formal analysis and their behaviour in corner cases is only guessed.

There has been ongoing development in this area [20][2], but it appears to be mostly stalled now, as the recursive three-way merge and similar algorithms appear to work well enough in common cases and thus there is not much incentive to formally describe their behaviour in corner cases. However, we believe that such a formal foundation would also help development of newer changeset-oriented systems, in turn making progress with the third-party changes tracking problem.

8 Conclusion

We have gone through the most important concepts, algorithms and structures used in current version control systems, and put them in the context of practical usage; we hope this gave the reader a comprehensive idea about the area and will help to put their further research in concrete topics into the broad context.

Among the open source projects, the new generation of distributed version control systems has registered massive uptake; while there are still projects using CVS and many projects are using or even migrating to SVN, a large portion of the most high-profile open source projects with many commits per day and code bases of millions of lines of code [43] is using one of the distributed systems described above.

The wide adoption and scale of use clearly indicate that the current systems have reached appropriate scalability and reliability levels. The high number of user interface tools in development also makes it more comfortable to use these systems and smooths the learning curve that has traditionally been very steep for distributed version control systems. However, as described in section 7, many desirable features are still unimplemented and support for some workflows is still very lacking.

8.1 Acknowledgements

This article is based on the Bachelor thesis I wrote at the Faculty of Math and Physics, Charles University. I would like to thank my thesis advisor Martin Mareš for guidance and Bram Cohen, Piet Delport, Junio C. Hamano, Greg , Uwe Kleine-König, Tom Lord, Matt Mackall, Larry McVoy, Kenneth C. Schalk, Nathaniel Smith, Pavel Surynek and Zooko Wilcox-O’Hearn for their review and helpful input.

References

[1] Smith N.: 3-way merge considered harmful, [email protected], 2005-05-01.

[2] Löh A., Swierstra W., Leijen D.: A Principled Approach to Version Control, http://people.cs.uu.nl/andres/VersionControl.html

[3] Asklund U., Bendix L.: A study of configuration management in open source software projects, Software, IEEE Proceedings 149 (2002) 40–46.

[4] Ziv J., Lempel A.: A universal algorithm for data compression, IEEE Transactions on Information Theory, 23(3):337–343, 1977.

[5] Henson V.: An Analysis of Compare-by-hash, USENIX Proceedings of HotOS IX (2003).

[6] Stiegler M.: An Introduction to Petname Systems, http://www.skyhunter.com/marcs/petnames/IntroPetNames.html

[7] Myers E.: An O(ND) Difference Algorithm and its Variations, Algorithmica 1 (1986) 251–266.

[8] Wilcox-O’Hearn Z. B.: badmerge – concrete version with good semantics, https://zooko.com/badmerge/concrete-good-semantics.html

[9] Bentley A., Pool M., et al.: Bazaar, http://www.bazaar-vcs.org/

[10] Cohen B., et al.: Bazaar Source Code, bzrlib/patiencediff.py, http://www.bazaar-vcs.org/Download

[11] McVoy L., et al.: BitKeeper, http://www.bitkeeper.com/

[12] Korn M., et al.: Bzr Virtual Filesystem, https://launchpad.net/bzr-fs/

[13] Bentley A., Pool M., et al.: Bzr Weave Format, http://bazaar-vcs.org/BzrWeaveFormat

[14] Black J.: Compare-by-Hash: A Reasoned Analysis, USENIX Annual Technical Conference–Systems and Experience Track, USENIX 2006, Boston, 2006.

[15] Dart S.: Concepts in configuration management systems, Proceedings of the 3rd international workshop on Software configuration management, Trondheim, 1991, ISBN 0-897914-429-5.

[16] Torvalds L.: Concerning Git’s Packing Heuristics, http://www.kernel.org/pub/software/scm/git/docs/technical/pack-heuristics.txt

[17] Grune D., Berliner B., et al.: CVS, http://www.nongnu.org/cvs/

[18] Mansfield D., et al.: CVSps, http://www.cobite.com/cvsps/

[19] Roundy D.: Darcs, http://darcs.net/

[20] Roundy D.: Darcs manual, http://darcs.net/manual/

[21] Darcs Community: Darcs Wiki: Conflicts FAQ, http://wiki.darcs.net/DarcsWiki/ConflictsFAQ

[22] Tridgell A.: Efficient Algorithms for Sorting and Synchronization, PhD thesis, Australian National University, 1999.

[23] Rabin M. O.: Fingerprinting By Random Polynomials, Technical Report TR-15-81, Center for Research in Computer Technology, Harvard University, 1981.

[24] Torvalds L., Hamano J. C., et al.: Git, http://git.or.cz/

[25] Hamano J. C., et al.: Git – A Stupid Content Tracker, OLS 2006 Proceedings Volume 1, Ottawa, 2006.

[26] Greaves D., Hamano J. C., Torvalds L., et al.: Git Documentation, http://www.kernel.org/pub/software/scm/git/docs/

[27] Kuivinen F., et al.: Git Source Code, builtin-merge-recursive.c, http://www.kernel.org/pub/software/scm/git/

[28] Pitre N., et al.: Git Source Code, diff-delta.c, http://www.kernel.org/pub/software/scm/git/

[29] Hamano J. C., et al.: Git Source Code, diffcore-delta.c, http://www.kernel.org/pub/software/scm/git/

[30] Torvalds L., et al.: Git Source Code, pack-objects.c, http://www.kernel.org/pub/software/scm/git/

[31] Blank M., et al.: gitfs, http://www.sfgoth.com/~mitch/linux/gitfs/

[32] Lord T., et al.: GNU Arch, http://www..org/software/gnu-arch/

[33] Eggert P., et al.: GNU Diffutils, http://www.gnu.org/software/diffutils/

[34] Henson V., Henderson R., et al.: Guidelines for Using Compare-by-Hash, http://infohost.nmt.edu/~val/review/hash2.html

[35] Brownawell T., Smith N.: Improvements to *-merge, [email protected], 2005-08-30.

[36] Aldous D., Diaconis P., et al.: Longest Increasing Subsequences: From Patience Sorting to the Baik-Deift-Johansson Theorem, Bull. Amer. Math. Soc. 36 (1999) 413–432.

[37] Mackall M., et al.: Mercurial, http://www.selenic.com/mercurial/

[38] Mackall M., et al.: Mercurial Source Code, mercurial/revlog.py, http://www.selenic.com/mercurial/release/

[39] Hoare G., Smith N., et al.: Monotone, http://monotone.ca/

[40] Hoare G., et al.: Monotone Documentation: Hash Integrity, http://monotone.ca/docs/Hash-Integrity.html

[41] Hoare G., et al.: Monotone FAQ, http://www.venge.net/mtn-wiki/FAQ

[42] Andrieu O., et al.: Monotone-viz, http://oandrieu.nerim.net/monotone-viz/

[43] Wheeler D.: More Than a Gigabuck: Estimating GNU/Linux’s Size, http://www.dwheeler.com/sloc

[44] Wilcox-O’Hearn Z. B.: Names: Decentralized, Secure, Human-Meaningful: Choose Two, https://zooko.com/distnames.html

[45] Hudson G.: Notes on keeping version histories of files, http://web.mit.edu/ghudson/thoughts/file-versioning

[46] Shapiro J. S., et al.: OpenCM, http://www.opencm.org/

[47] Collins R., et al.: Patch Queue Manager, http://bazaar-vcs.org/PatchQueueManager

[48] Ratcliff J. W., Metzener D. E., et al.: Pattern Matching: The Gestalt Approach, Dr. Dobb’s Journal, July 1988, p. 46.

[49] Seiwald C., et al.: Perforce, http://www.perforce.com/

[50] Hudson G.: Personal conversation on Subversion

[51] Schalk K. C.: Personal conversation on Vesta

[52] IEEE: POSIX 1003.1 2004 Edition, 2004.

[53] Python Community: Python difflib – helpers for producing deltas, http://docs.python.org/lib/module-difflib.html

[54] Tichy W.: RCS: a system for version control, Software: Practice and Experience, 15 (7): 637–654, July 1985.

[55] Research Community: RevCtrl Wiki, http://revctrl.org/

[56] Smith N., Baudis P., Cohen B. et al.: RevCtrl Wiki: Criss Cross Merge, http://revctrl.org/CrissCrossMerge

[57] National Institute of Standards and Technology: Secure Hash Standard (SHS), FIPS180-02, August 2002.

[58] E-Soft Inc.: Security Space Survey of Public Subversion DAV Servers, http://subversion.tigris.org/svn-dav-securityspace-survey.html

[59] Blandy J., Fogel K., et al.: Subversion, http://subversion.tigris.org/

[60] Berlin D.: SVNDIFF version 1, [email protected], 2005-10-21

[61] Cohen B.: The diff problem has been solved, http://bramcohen.livejournal.com/37690.html

[62] Cohen B.: The new Codeville merge algorithm, [email protected], 2005-05-05.

[63] Rochkind M. J.: The Source Code Control System, IEEE Transactions on Software Engineering SE-1:4 (Dec. 1975), 364–370.

[64] Korn D., et al.: The VCDIFF Generic Differencing and Compression Data Format, RFC3284.

[65] Heydon A., Levin R., Mann T., Yu Y.: The Vesta Approach to Software Configuration Management, Compaq SRC Research Report 168, 2001.

[66] Baudis P., et al.: TopGit, http://repo.or.cz/w/topgit.git

[67] Baudis P.: TopGit: how to deal with upstream inclusion, [email protected], 2008-09-14.

[68] Mackall M.: Towards a Better SCM: Revlog and Mercurial, OLS 2006 Proceedings Volume 2, Ottawa, 2006.

[69] Davison W.: Unified context diff tools, comp.source.misc Volume 14, 1990.

[70] Conradi R., Westfechtel B.: Version models for software configuration management, ACM Computing Surveys 30 (1998) 232–282.

[71] McCoy K.: VMS File System Internals, Digital Press, 1990, ISBN 1-55558-056-4.

[72] MacDonald J.: XDelta, http://xdelta.org/

[73] MacDonald J.: Versioned File Archiving, Compression, and Distribution, UC Berkeley, 1999.

[74] Trendafilov D., Memon N., Suel T.: zdelta: An Efficient Delta Compression Tool, Technical Report TR-CIS-2002-02, Polytechnic University Brooklyn, 2002.