GIT—A Stupid Content Tracker
Total Page:16
File Type:pdf, Size:1020Kb
GIT—A Stupid Content Tracker Junio C. Hamano Twin Sun, Inc. [email protected] Abstract The paper gives an overview of how git evolved and discusses the strengths and weaknesses of its design. Git was hurriedly hacked together by Linus Torvalds, after the Linux kernel project lost its license to use BitKeeper as its source code management system (SCM). It has since 1 Low level design quickly grown to become capable of managing the Linux kernel project source code. Other projects have started to replace their existing Git is a “stupid content tracker.” It is designed SCMs with it. to record and compare the whole tree states ef- ficiently. Unlike traditional source code control Among interesting things that it does are: systems, its data structures are not geared to- ward recording changes between revisions, but for making it efficient to retrieve the state of in- 1. giving a quick whole-tree diff, dividual revisions. 2. quick, simple, stupid-but-safe merge, The unit in git storage is an object. It records: 3. facilitating e-mail based patch exchange workflow, and • blob – the contents of a file (either the 4. helping to pin-point the change that caused contents of a regular file, or the path a particular bug by a bisection search in pointed at by a symbolic link). the development history. • tree – the contents of a directory, by recording the mapping from names to ob- The core git functionality is implemented as a jects (either a blob object or a tree object set of programs to allow higher-layer systems that represents a subdirectory). (Porcelains) to be built on top of it. Several Porcelains have been built on top of git, to sup- • commit – A commit associates a tree port different workflows and individual taste of with meta-information that describes how users. The primary advantage of this architec- the tree came into existence. It records: ture is ease of customization, while keeping the repositories managed by different Porce- – The tree object that describes the lains compatible with each other. project state. 386 • GIT—A Stupid Content Tracker – The author name and time of creation branches by recording the commit object names of the content. of them. – The committer name and time of cre- To keep track of what is being worked on in ation of the commit. the user’s working tree, another data structure – The parent commits of this commit. called index, is used. It associates pathnames – The commit log that describes why with object names, and is used as the staging the project state needs to be changed area for building the next tree to be committed. to the tree contained in this commit from the trees contained in the parent When preparing an index to build the next tree, commits. it also records the stat(2) information from the working tree files to optimize common op- • tag – a tag object names another object erations. For example, when listing the set of and associates arbitrary information with files that are different between the index and it. A person creating the tag can attest that the working tree, git does not have to inspect it points at an authentic object by GPG- the contents of files whose cached stat(2) signing the tag. information match the current working tree. Each object is referred to by taking the SHA1 The core git system consists of many relatively hash (160 bits) of its internal representation, low-level commands (often called Plumbing) and the value of this hash is called its object and a set of higher level scripts (often called name. A tree object maps a pathname to the Porcelain) that use Plumbing commands. Each object name of the blob (or another tree, for a Plumbing command is designed to do a spe- subdirectory). cific task and only that task. The Plumbing commands to move information between the By naming a single tree object that represents recorded history and the working tree are: the top-level directory of a project, the entire di- rectory structure and the contents of any project state can be recreated. In that sense, a tree ob- • Files to index. git-update-index ject is roughly equivalent to a tarball. records the contents of the working tree files to the index; to write out the blob A commit object, by tying its tree object with recorded in the index to the working tree other commit objects in the ancestry chain, files, git-checkout-index is used; gives the specific project state a point in project and git-diff-files compares what history. A merge commit ties two or more lines is recorded in the index and the working of developments together by recording which tree files. With these, an index is built that commits are its parents. A tag object is used to records a set of files in the desired state to attach a label to a specific commit object (e.g. a be committed next. particular release). • Index to recorded history. To write These objects are enough to record the project out the contents of the index as a tree history. A project can have more than one lines object, git-write-index is used; of developments, and they are called branches. git-commit-tree takes a tree object, The latest commit in each line of development zero or more commit objects as its parents, is called the head of the branch, and a reposi- and the commit log message, and creates tory keeps track of the heads of currently active a commit object. git-diff-index 2006 Linux Symposium, Volume One • 387 compares what is recorded in a tree ob- • The development process is highly dis- ject and the index, to serve as a preview tributed. The development history led of what is going to be committed. to v2.6.16 since v2.6.12-rc2 contains changes by more than 1,800 authors that • Recorded history to index. The directory were committed by a few dozen people. structure recorded in a tree object is read by git-read-tree into the index. • The workflow involves many patch ex- changes through the mailing list. Among • git-checkout-index Index to files. 20,000 changes, 16,000 were committed writes the blobs recorded in the index to by somebody other than the original au- the working tree files. thor of the change. By tying these low-level commands together, • Each change tends to touch only a handful Porcelain commands give usability to the files. The source tree is highly modular whole system for the end users. For ex- and a change is often very contained to a ample, git-commit provides a UI to ask small part of the tree. A change touches for the commit log message, create a new only three files on average, and modifies commit object (using git-write-tree and about 160 lines. git-commit-tree), and record the object • The tree reorganization by addition and name of that commit as the updated topmost deletion is not so uncommon, but often commit in the current line of development. happens over time, not as a single rename with some modifications. 4,500 files were added or deleted, but less than 600 were 2 Design Goals renamed. • The workflow involves frequent merges From the beginning, git was designed specifi- between subsystem trees and the mainline. cally to support the workflow of the Linux ker- About 1,500 changes are merges (7%). nel project, and its design was heavily influ- enced by the common types of operations in the • A merge tends to be straightforward. The kernel project. The statistics quoted below are median number of paths involved in the for the 10-month period between the beginning 1,500 merges was 185, and among them, of May 2005 and the end of February 2006. only 10 required manual inspection of content-level merges. • The source tree is fairly large. It has approximately 19,000 files spread across Initial design guidelines came from the above 1,100 directories, and it is growing. project characteristics. • The project is very active. Approximately 20,000 changes were made during the 10- • A few dozen people playing the integrator month period. At the end of the examined role have to handle work by 2,000 contrib- period, around 75% of the lines are from utors, and it is paramount to make it effi- the version from the beginning of the pe- cient form them to perform common oper- riod, and the rest are additions and modifi- ations, such as patch acceptance and merg- cations. ing. 388 • GIT—A Stupid Content Tracker • Although the entire project is large, in- tween them. Tree comparison can take ad- dividual changes tend to be localized. vantage of this to avoid descending into Supporting patch application and merging subdirectories that are represented by the with a working tree with local modifica- same tree object while examining changes. tions, as long as such local modifications do not interfere with the change being pro- cessed, makes the integrators’ job more ef- ficient. 3 Evolution • The application of an e-mailed patch must The very initial version of git, released by Linus be very fast. It is not uncommon to feed Torvalds on April 7, 2005, had only a handful more than 1,000 changes at once during a of commands to: sync from the -mm tree to the mainline. • When existing contents are moved around • initialize the repository; in the project tree, renaming of an entire file (with or without modification at the • update the index to prepare for the next same time) is not a majority.