
Towards a Better SCM: Revlog and Mercurial

Matt Mackall
Selenic Consulting
[email protected]

Abstract

Large projects need scalable, performant, and robust configuration management systems. If common revision control operations are not cheap, they present a large barrier to proper software engineering practice. This paper investigates the theoretical limits on SCM performance, and examines how existing systems fall short of those ideals.

I then describe the Revlog data storage scheme created for the Mercurial SCM. The Revlog scheme allows all common SCM operations to be performed in near-optimal time, while providing excellent compression and robustness.

Finally, I look at how a full distributed SCM (Mercurial) is built on top of the Revlog scheme, some of the pitfalls we've surmounted in on-disk layout and I/O performance, and the protocols used to efficiently communicate between repositories.

1 Introduction: the need for SCM scalability

As software projects grow larger and open development practices become more widely adopted, the demands on source control management systems are greatly increased. Large projects demand scalability in multiple dimensions, including efficient handling of large numbers of files, large numbers of revisions, and large numbers of developers.

As an example of a moderately large project, we can look at the Linux kernel, a project now in its second decade. The Linux kernel source tree has tens of thousands of files and has collected on the order of a hundred thousand changesets since adopting source control only a few years ago. It also has on the order of a thousand contributors scattered around the globe, and it continues to grow rapidly. So it's not hard to imagine projects growing to manage millions of files, millions of changesets, and many thousands of people developing in parallel over a timescale of decades.

At the same time, certain SCM features become increasingly important. Decentralization is crucial: thousands of users with varying levels of network access can't hope to efficiently cooperate if they're all struggling to push changesets to a central repository due to locking and bandwidth concerns. So it becomes critical that a system be able to painlessly handle per-user development branches and repeated merging with the branches of other developers. Grouping of interdependent changes in multiple files into a single "atomic" changeset becomes a necessity for understanding and working with the huge number of changes that are introduced. And robust, compact storage of the revision history is essential for large numbers of developers to work with their own decentralized repositories.

2 Overview of Scalability Limits and Existing Systems

To intelligently evaluate scalability issues over a timescale of a decade or more, we need to look at the likely trends in both project growth and computer performance.
Many facets of computer performance have historically followed an exponential curve, but at different rates. If we order the more important of these facets by rate, we might have a list like the following:

• CPU speed

• disk capacity

• memory capacity

• memory bandwidth

• LAN bandwidth

• disk bandwidth

• WAN bandwidth

• disk seek rate

So while CPU speed has changed by many orders of magnitude, disk seek rate has only changed slightly. In fact, seek rate is now dominated by disk rotational latency and thus its rate of improvement has already run up against a wall imposed by physics. Similarly, WAN bandwidth runs up against limits of existing communications infrastructure.

So as technology progresses, it makes more and more sense to trade off CPU power to save disk seeks and network bandwidth. We have in fact already long since reached a point where disk seeks heavily dominate many workloads, including ours. So we'll examine the important aspects of source control from the perspective of the facet each is most likely to eventually be constrained by.

For simplicity, we'll make a couple of simplifying assumptions. First, we'll assume files of a constant size, or rather that performance should generally be linearly related to file size. Second, we'll assume for now that a filesystem's block allocation scheme is reasonably good and that fragmentation of single files is fairly small. Third, we'll assume that file lookup is roughly constant time for moderately-sized directories.

With that in mind, let's look at some of the theoretical limits for the most important SCM operations as well as the scalability of existing systems, starting with operations on individual files:

Storage compression: For compressing single file revisions, the best known schemes include SCCS-style "weaves" [8] or RCS-style deltas [5], together with a more generic compression algorithm like gzip. For files in a typical project, this results in average compression on the order of 10:1 to 20:1 with relatively constant CPU overhead (see more about calculating deltas below).

Retrieving arbitrary file revisions: It's easy to see that we can achieve constant-time (O(1) seeks) retrieval of individual file revisions, simply by storing each revision in a separate file in the history repository. In terms of big-O notation, we can do no better. This ignores some details of filesystem scalability, but it's a good approximation. This is perhaps the most fundamental operation in an SCM, so its scalability is crucial.
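To make the cost model concrete, the following minimal Python sketch (hypothetical code, not drawn from any of the systems discussed; store_revision and read_revision are invented names) implements the one-file-per-revision scheme just described, where both storing and retrieving a revision touch exactly one back-end file:

    import hashlib
    import os

    def store_revision(store_dir, data):
        # Name the revision by a strong hash of its contents; writing
        # it creates exactly one new back-end file: O(1) seeks.
        node = hashlib.sha1(data).hexdigest()
        os.makedirs(store_dir, exist_ok=True)
        with open(os.path.join(store_dir, node), "wb") as f:
            f.write(data)
        return node

    def read_revision(store_dir, node):
        # Any revision, however old, is one open-and-read away: O(1) seeks.
        with open(os.path.join(store_dir, node), "rb") as f:
            return f.read()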

Most SCMs, including CVS and BitKeeper, use delta- or weave-based schemes that store all the revisions for a given file in a single back-end file. Reconstructing arbitrary versions requires reading some or all of the history file, thus making performance O(revisions) in disk bandwidth and computation. As a special case, CVS stores the most recent revision of a given file uncompressed, but still must read and parse the entire history file to retrieve it.

SVN uses a skip-delta [7] scheme that requires reading O(log revisions) deltas (and seeks) to reconstruct a revision.

Unpacked git stores a back-end file for each file revision, giving O(1) performance, but packed git requires searching a collection of indices that grow as the project history grows. Also worth noting is that git does not store index information at the file level. Thus operations like finding the previous version of a file to calculate a delta, or finding the common ancestor of two file revisions, can require searching the entirety of the project's history.

Adding file revisions: Similarly, we can see that adding file revisions by the same scheme is also O(1) seeks. Most systems use schemes that require rewriting the entire history file, thus making their performance decrease linearly as the number of revisions increases.

Annotate file history: We could, in principle, incrementally calculate and store an annotated version of each version of a file we store and thus achieve file history annotation in O(1) seeks by adding more up-front CPU and storage overhead. Almost all systems instead take O(revisions) disk bandwidth to construct annotations, if not significantly more. While annotation is traditionally not performance critical, some newer algorithms rely on it [8].

Next, we can look at the performance limits of working with revisions at the project level:

Checking out a project revision: Assuming O(1) seeks to check out file revisions, we might expect O(files) seeks to check out all the files in the project. But if we arrange things so that file revisions are nearly consecutive, we can avoid most of those seeks, and instead our limit becomes O(files) as measured by disk bandwidth. A comparable operation is untarring an archive of that revision's files.

As most systems must read the entirety of a file's history to reconstruct a revision, they'll require O(total file revisions) disk bandwidth and CPU to check out an entire project tree. SCMs that don't visit the repository history in the filesystem's natural layout can easily see this performance degrade into O(total files) random disk seeks, which can happen with SCMs backed by database engines or with systems like git which store objects by hash (unpacked) or creation time (packed).

Committing changes: For a change to a small set of the files in a project, we can expect to be bound by O(changed files) seeks, or, as the number of changes approaches the number of files, O(files) disk bandwidth.

Systems like git can meet this target for commit, as they simply create a new back-end file for each file committed. Systems like CVS require rewriting file histories and thus take increasingly more time as histories grow deeper. Non-distributed systems also need to deal with lock contention, which grows quite rapidly with the number of concurrent users, lock hold time, and network bandwidth and latency.

Finding differences in the working directory: There are two different approaches to this problem. One is to have the user explicitly acquire write permissions on a set of files, which nicely limits the set of files that need to be examined so that we only need O(writable files) comparisons. Another is to allow all files to be edited and to detect changes to all managed files.

As most systems require O(file revisions) to retrieve a file version for comparison, this operation can be expensive. It can be greatly accelerated by keeping a cache of file timestamps and sizes at checkout time.
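The following Python sketch illustrates such a stat cache. It is a simplified illustration with invented names (snapshot, candidate_changes), not Mercurial's actual dirstate format:

    import os

    def snapshot(paths):
        # At checkout time, record each managed file's (size, mtime).
        cache = {}
        for path in paths:
            st = os.stat(path)
            cache[path] = (st.st_size, st.st_mtime)
        return cache

    def candidate_changes(cache):
        # A file whose size and mtime still match its cached entry is
        # assumed unmodified; only mismatches need an expensive content
        # comparison against the repository.
        changed = []
        for path, cached in cache.items():
            st = os.stat(path)
            if (st.st_size, st.st_mtime) != cached:
                changed.append(path)
        return changed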

3 Revlog: A Solid Foundation

The core of the Mercurial system [4] is a storage scheme known as a "revlog," which seeks to address the basic scalability problems of other systems. First and foremost, it aims to provide O(1) seek performance for both reading and writing revisions, while still retaining effective compression and integrity.

Figure 1: Revlog Layout

The fundamental scheme is to store a separate index and data file for each file managed. The index contains a fixed-size record for each revision managed while the data file contains a linear series of compressed hunks, one per revision.

Each hunk is either a full revision or a delta against the immediately preceding hunk or hunks. This somewhat resembles the format of an MPEG video stream, which contains a series of delta frames with occasional full frames for synchronization.

Looking up a given revision is easy—simply calculate the position of the relevant record in the index and read it. It will contain a pointer to the first full revision of the delta chain, and the entire chain can then be read in a single contiguous chunk. Thus we need only O(1) seeks to locate and retrieve a revision.

By limiting the total length of the hunks needed to reconstruct a given version to a small multiple of that version's uncompressed size, we guarantee that we'll never need to read more than O(1) file equivalents from the data file, and our CPU time for reconstruction is similarly bounded.

To add a new version, we simply append a new index record and data hunk to our revlog, again an O(1) operation. This append-only approach provides several advantages. In addition to being fast, we also avoid the risk of corrupting earlier data because we don't rewrite it. If we are careful to only write index entries after data hunks, we also need no special provisions for locking out readers while writes are in progress. And finally, it makes possible a trivial journalling scheme, described later.

Because our system needs to allow more than simple linear development with repeated branching and merging, our revision identifiers cannot be just simple integers. And because our scheme is intended to be decentralized, we'd like to have identifiers that are global. Thus, Mercurial generates each revision's identifier from a strong hash of its contents (currently using SHA1). This hash also provides a robust integrity check when revisions are reconstructed.

As our project can contain arbitrary branching and merging through time, each revlog can have an arbitrary directed acyclic graph (DAG) describing its history. But if we revert a change, we will have the same hash value appearing in two places on the graph, which results in a cycle!

Revlogs address this by incorporating the hash of the parent revisions into a given revision's hash identifier, thus making the hash dependent on both the revision's contents and its position in the tree.
Not only does this avoid our cycle problem, it makes merging DAGs from multiple revlogs trivial: simply throw out any revisions with duplicate hashes and append the remainder.

Each 64-byte revlog index record contains the following information:

2 bytes: flags
6 bytes: hunk offset
4 bytes: hunk length
4 bytes: uncompressed length
4 bytes: base revision
4 bytes: link revision
4 bytes: parent 1 revision
4 bytes: parent 2 revision
32 bytes: hash

Data hunks are composed of a flag byte indicating compression type followed by a full revision or a delta.

4 Considerations

Revlog deltas themselves are quite simple. They're simply a collection of chunks specifying a start point, an end point, and a string of replacement bytes. But calculating deltas and applying deltas both turn out to have some interesting issues.

First, let's consider delta calculation. We carefully tested three algorithms before selecting the one used by revlogs.

The classic algorithm used by GNU diff and most other textual diff tools is the Myers algorithm, which generates optimal output for the so-called longest common subsequence (LCS) problem. This algorithm is fairly complex, and the variant used by GNU diff has heuristics to avoid a couple of forms of quadratic run-time behavior.

While the algorithm is optimal for LCS, it does not in fact generate the shortest deltas for our purposes. This is because it weighs insertions, changes, and deletions equally (with a weight of one). For us, deleting long strings of characters is cheaper than inserting or changing them.

Another algorithm we tried was xdelta [3], which is aimed at calculating diffs for large binary files efficiently. While much simpler and somewhat faster than the Myers algorithm, it also generated slightly larger deltas on average.

Finally, we tried a C reimplementation ("bdiff") of the algorithm found in Python's difflib [1]. In short, this algorithm finds the longest contiguous match in a file, then recursively matches on either side. While also having worst-case quadratic performance like the Myers algorithm, it more often approximates linear performance.

The bdiff algorithm has several advantages. On average, it produced slightly smaller output than either the Myers or xdelta algorithms, and was as fast or faster. It also had the shortest and simplest code of the three. And lastly, when used for textual diffs, it often generates more "intuitive" output than GNU diff.

We may eventually include the xdelta algorithm as a fallback for exceptionally large files, but the bdiff algorithm has proven satisfactory so far.
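Since bdiff is a reimplementation of the difflib [1] algorithm, the chunk-based delta format described at the start of this section can be illustrated directly with Python's difflib. The helper names below are invented for illustration; real revlog deltas operate on raw bytes:

    from difflib import SequenceMatcher

    def make_delta(old, new):
        # Emit (start, end, replacement) chunks that rewrite old into new.
        # difflib finds the longest contiguous match, then recurses on
        # either side -- the strategy bdiff reimplements in C.
        chunks = []
        for tag, a1, a2, b1, b2 in SequenceMatcher(a=old, b=new).get_opcodes():
            if tag != "equal":
                chunks.append((a1, a2, new[b1:b2]))
        return chunks

    def apply_delta(old, chunks):
        # Chunks arrive in ascending order of start offset, so the new
        # text can be assembled in a single left-to-right pass.
        out, pos = [], 0
        for start, end, replacement in chunks:
            out.append(old[pos:start])
            out.append(replacement)
            pos = end
        out.append(old[pos:])
        return "".join(out)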

rev  offset  length  base  linkrev  nodeid        p1            p2
  0       0     453     0        0  ee9b82ca6948  000000000000  000000000000
  1     453     107     0        1  98f3df0f2f4f  ee9b82ca6948  000000000000
  2     560      73     0        2  8553cbcb6563  98f3df0f2f4f  000000000000
  3     633      63     0        3  09f628a628a8  8553cbcb6563  000000000000
  4     696      69     0        4  3413f6c67a5a  09f628a628a8  000000000000
...
 15    1435      69     0       15  69d47ab5fc42  9e64605e7ab0  000000000000
 16    1504     111     0       16  81322d98ee1f  69d47ab5fc42  000000000000
 17    1615     525    17       17  20f563caf71e  81322d98ee1f  000000000000
 18    2140      56    17       18  1d47c3ef857a  20f563caf71e  000000000000
...

Table 1: Part of a typical revlog index
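To make the record layout concrete, the following Python sketch decodes one such 64-byte record. The exact on-disk encoding assumed here (big-endian fields, the packing of the 6-byte offset) illustrates the layout listed above and is not necessarily Mercurial's precise format:

    import struct

    # 2s=flags, 6s=hunk offset, I=hunk length, I=uncompressed length,
    # i=base rev, i=link rev, i=parent 1, i=parent 2, 32s=hash
    RECORD = struct.Struct(">2s6sIIiiii32s")
    assert RECORD.size == 64

    def parse_index_record(data):
        (flags, offset, hunk_len, uncomp_len,
         base, link, p1, p2, node) = RECORD.unpack(data)
        return {
            "flags": flags,
            "offset": int.from_bytes(offset, "big"),
            "hunk length": hunk_len,
            "uncompressed length": uncomp_len,
            "base": base,        # first revision of this delta chain
            "linkrev": link,     # points at the associated changeset
            "parents": (p1, p2),
            "hash": node,
        }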

Next is the issue of applying deltas. In the worst case, we may need to apply many thousands of small deltas to large files. To sequentially apply 1K deltas to a 1M file, we'll effectively have to do 1G of memory copies—not terribly efficient. Mercurial addresses this problem by recursively merging neighboring patches into a single delta. Not only does this reduce us to applying only a single delta, it eliminates or joins the hunks that overlap, further reducing the work. Thus, delta application is reduced in CPU time from O(file size * deltas) to approximately O(file size + delta size). As we've already constrained the total data to be proportional to our original file size, this is again O(1) in terms of file units.

5 Mercurial: A Simple Hierarchy

With revlogs providing a solid basis for storing file revisions, we can now describe Mercurial. Naturally, Mercurial stores a revlog for each managed file. Above that, it stores a "manifest" which contains a list of all file:revision hash pairs in a project-level revision. And finally, it stores a "changeset" that describes the change that corresponds to each manifest. And conveniently, changesets and manifests are also stored in revlogs!

Figure 2: The Mercurial Hierarchy

This schema of file/manifest/changeset hashing was directly inspired by Monotone [2] (which also inspired the scheme used by git [6]). Mercurial adds a couple of important schema improvements beyond using revlogs for efficient storage. As already described, revlog hashes avoid issues with graph cycles and make merging graphs extremely easy. Also, revlogs at each level contain a "linkrev" for each revision that points to the associated changeset, allowing one to quickly walk both up and down the hierarchy. Mercurial also has file-level DAGs, which allow for more efficient log and annotate operations, and more accurate merges in some situations.

Checking out Project Revisions: Checkout is a simple process: retrieve a changeset, retrieve the associated manifest, and retrieve all file revisions associated with that manifest.

Mercurial takes the approach of keeping a cache of managed file sizes and timestamps for rapid detection of file changes for future commits. This also lets us know which files need updating when we check out a changeset. Naturally, Mercurial also tracks which changeset the current working directory is based on, and in the case of merge operations, we'll have two such parents.

The manifest is carefully maintained in sorted order and all operations on files are done in sorted order as well. An early version stored revlogs by hash (as git still does) and that scheme was found to rapidly degrade over time. Simply copying a repo would often reorder it on disk by hash, giving worst-case performance.

With revlogs instead stored in a directory tree mirroring the project and all reads done in sorted order, filesystems and utilities like cp and tar are given every opportunity to optimize on-disk layout so that we minimize seeks between revlogs in normal usage. This gets us very close to O(files) disk bandwidth with typical filesystems.

Performance for full-tree checkouts of a large project tends to be very comparable to the time needed to uncompress and untar an equivalent set of files.

Committing Changes: Commits are atomic and are carefully ordered so as to not need any locking for readers. First, all relevant file revlogs are extended, followed by the manifest and finally the changelog. Adding an entry to the changelog index makes it visible to readers.

Commit operations are also journalled. For each revlog file appended to during a commit, we simply record the original length of the file. If a commit is interrupted, we simply truncate all the files back to their earlier lengths, in reverse order.

Note that because locking is only needed on writes and each user commits primarily to their own private repository, lock contention is effectively nil.

Commit performance is high enough that Mercurial can apply and import a large series of patches into its repository faster than tools like quilt can simply apply them to the working directory.

Cloning: While Mercurial repositories can contain numerous development branches, branching is typically done by "cloning" a repository wholesale. By using hardlinks and copy-on-write techniques, Mercurial can create independent lightweight copies of entire repositories in seconds. This is an important part of Mercurial's distributed model—branches are cheap and discardable.
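The hardlink half of this technique can be sketched in a few lines of Python (clone_store is an invented name, and real Mercurial is more careful about what it links and when it must copy):

    import os

    def clone_store(src, dst):
        # Hardlink every revlog from the source store into the clone
        # instead of copying, so both repositories share all history
        # data on disk. Because revlogs are append-only, a writer must
        # break the link (copy the file) before its first append; that
        # is the copy-on-write step mentioned above.
        for root, dirs, files in os.walk(src):
            rel = os.path.relpath(root, src)
            os.makedirs(os.path.join(dst, rel), exist_ok=True)
            for name in files:
                os.link(os.path.join(root, name),
                        os.path.join(dst, rel, name))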
Pushing and Pulling: A fundamental operation of a distributed SCM is synchronization between the private repositories of different users. In Mercurial, this operation is called "pulling" (importing changes from a remote repository) or "pushing" (sending local changes to a remote repository).

Pulling typically involves finding all changesets present in the remote tree but not in the local tree and asking for them to be sent. Because this may involve many thousands of changesets, we can't simply ask for a list of all changesets and compare. Instead, Mercurial asks the remote end for a list of heads and walks backwards until it finds the root nodes of all remote changesets. It then requests all changesets starting at these root nodes.

To minimize the number of round trips required for this search, we make three optimizations. First, we allow many requests in a single query so that we can search different branches of the graph in parallel. Second, we combine chains of linear changes into single "twigs" so that we can quickly step across large regions of the graph. And finally, we use a special binary search approach to quickly narrow our search if new changes appear in the middle of a twig. This approach lets us find our new changesets in approximately O(log(new changesets)) bandwidth and round trips.
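Stripped of those optimizations, the basic discovery walk looks like the following Python sketch (invented names; remote_parents stands in for querying the remote repository for a node's parents):

    def find_missing(remote_heads, remote_parents, local_nodes):
        # Walk the remote graph backwards from its heads until every
        # path reaches a node we already have locally; everything seen
        # along the way is missing and must be transferred.
        missing = set()
        frontier = list(remote_heads)
        while frontier:
            node = frontier.pop()
            if node in local_nodes or node in missing:
                continue
            missing.add(node)
            # In the real protocol this parent lookup is remote, so each
            # step here could cost a round trip without the batching,
            # twig, and binary-search optimizations described above.
            frontier.extend(remote_parents(node))
        return missing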

Once the outstanding changes are found, the remote end begins to stream the changes across in Mercurial's "bundle" format. Rather than being ordered changeset by changeset, a bundle is instead ordered revlog by revlog. First, all new changelog entries are sent as deltas. As changesets are visited, a list of changed files is constructed.

Armed with a list of new changesets, Mercurial can quickly scan the manifest for changesets with a matching linkrev and send all new manifest deltas. We can also quickly visit our list of changed files and find their relevant deltas (again in sorted order). This minimizes seeking on both ends and avoids visiting unchanged parts of the repository.

Note that because this scheme makes deltas for the same revlog adjacent, the stream can also be more effectively compressed, preserving valuable WAN bandwidth.
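The revlog-by-revlog ordering can be summarized in a short Python sketch (names are invented; deltas_for stands in for generating deltas against revisions the receiver already has):

    def bundle_order(new_changesets, changed_files, deltas_for):
        # Group deltas revlog by revlog: changelog first, then manifest,
        # then each changed file in sorted order, so that both sender
        # and receiver touch each revlog exactly once, sequentially.
        yield from deltas_for("changelog", new_changesets)
        yield from deltas_for("manifest", new_changesets)
        for path in sorted(changed_files):
            yield from deltas_for(path, new_changesets)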

6 Conclusions

By careful attention to scalability and performance issues for all common operations from the ground up, Mercurial performs well even as projects become very large in terms of history, files, or users.

As most of the performance improvements made by Mercurial are in the back-end data representation, they should be applicable to many other systems.

Mercurial is still a young and rapidly improving system. Contributions are encouraged!

References

[1] difflib. http://docs.python.org/lib/module-difflib.html.

[2] Graydon Hoare. Monotone. http://www.venge.net/monotone/.

[3] Davide Libenzi. LibXDiff. http://www.xmailserver.org/xdiff-lib.html.

[4] Matt Mackall. The Mercurial SCM. http://selenic.com/mercurial.

[5] Walter F. Tichy. RCS–a system for version control. http://docs.freebsd.org/44doc/psd/13.rcs/paper.html.

[6] Git - tree history storage tool. http://git.or.cz/.

[7] Unknown. Skip-deltas in Subversion. http://svn.collab.net/repos/svn//notes/skip-deltas.

[8] Various. Weave - revctrl wiki. http://revctrl.org/Weave.