
Introduction to Revision Control Systems

Bartosz Kostrzewa
October 2014

“The palest ink is better than the best memory” (Chinese proverb)

Contents

1 Notes
  1.1 Outline
  1.2 Revision Control
  1.3 Revision Control in Science
  1.4 What is a Revision Control System?
  1.5 Revision Control Features
    1.5.1 Branches
  1.6 Centralized and Distributed Systems
    1.6.1 CVS, SVN ... - Centralized Version Control Systems
  1.7 git - the stupid content tracker
    1.7.1 storage paradigms - differences vs. snapshots
    1.7.2 revisions vs. commits
    1.7.3 the staging area
    1.7.4 remote repositories
    1.7.5 speciality: cheap branching
  1.8 Branching Development Models
    1.8.1 Trunk, Release
    1.8.2 Rolling Release, Feature Branches
    1.8.3 Stable Branch, Development Branch, Feature Branches
  1.9 Remotes and Branching Models
    1.9.1 Revisiting the “Central” Repo
    1.9.2 What is a “pull request”?
  1.10 Source Code Management and Collaboration
    1.10.1 redmine & github

2 Exercises
  2.1 git and SVN, Similarities and Differences
  2.2 Basic Exercises
    2.2.1 git Configuration
    2.2.2 Setting up a Local Repository
    2.2.3 Making Changes and the Staging Area
    2.2.4 Committing Two Changes Separately
    2.2.5 Unstaging a Change
    2.2.6 Reverting a
    2.2.7 File Operations
    2.2.8 Creating a New Branch
    2.2.9 Interacting with a Remote Repository - Cloning, Fetching, Pulling and Resolving Conflicts
    2.2.10 Interacting with a Remote Repository - Pushing Local Changes to a Remote
  2.3 Advanced Exercises
    2.3.1 The .gitignore File
    2.3.2 Preparing a git Repository for Writing a Paper (or your Thesis)
    2.3.3 Creating “bare” git Repositories
    2.3.4 git stash
    2.3.5 Staging Partial Changes
    2.3.6 Using git for Debugging
    2.3.7 Dealing with and Reverting Merges
    2.3.8 Rewriting History
    2.3.9 Cherry-picking one or more Commits
    2.3.10 Referencing the Commit Hash in a Program
    2.3.11 Using tags
    2.3.12 Converting a Subversion Repository

3 Conclusion

4 References

1 Notes

This is a short set of notes for the lecture with the same title. The slides and the notes are supposed to complement each other and you should read the notes while looking at the slides, unless you have a photographic memory.

1.1 Outline

We will begin with a short overview of the reasons for using version control systems in software development in general and in scientific software development in particular. We will learn about the difference between centralized and decentralized (or distributed) version control systems. Then we will talk about the types of work-flows possible with distributed version control systems and will look in particular at the work-flows enabled by git. The particular work-flow shown here includes pizza and who could disagree with that?

Because a lot of the content of these notes is abstract, the exercises at the end are supposed to familiarize you with git and the various techniques discussed in the lecture.

1.2 Revision Control

Revision control should be an essential part of the software development process. Your software will go through many small incremental steps as you add features, fix bugs or release different versions. Although manageable by hand on a small scale, this process will quickly lead to confusion as to which change was made when and by whom. In the best case, this can result in very painful bug tracking; in the worst case it can lead to work, such as new features, being forgotten and lost.

A revision control system allows you to keep track of the state of a software project at any given time. It does so by saving the current set of files that belong to your program side-by-side with meta-information which identifies when given changes were made and by whom. Some systems will even track relationships between changes made to a software project so that the origin of a given chunk of code can be traced back to the point of its introduction into the code-base.

This additional meta-information lets you follow the development process retrospectively and makes your bug and release management easier. It will also allow you to properly assign credit for developments of your code-base, which is important if you’re trying to understand a bug in somebody else’s function, for example.

Finally, the most basic but most important reason for the necessity of revision control is simply the imperfection of human memory. Chances are that you will work on multiple things at the same time and you will often forget what you were doing as you switch from one project to another. Much like good code commenting, keeping a good revision history with meaningful change messages helps you and your team keep track of your progress and makes it easier to switch between projects as necessary.

1.3 Revision Control in Science

As scientists, I believe we have an additional obligation to use revision control systems. Just as experimental scientists are expected to keep diligent research log books documenting their research process, academics using computers should be able to keep track of the development process of their programs. Because of the nature of software, doing so by hand would be very cumbersome.

The information kept by a revision control system can be directly useful for keeping track of the methodology for the purpose of publications based on some computational work. Similarly, results in publications should be linked to the exact version of your software they were created with, another thing that a revision control system can help you with. This is essential in ensuring that our research remains reproducible and hence testable. Finally, it allows for proper accountability of the work done, which can be very important judging by the recent “scandal” involving the Climatic Research Unit at the University of East Anglia. The history kept by the revision control system can rightly be considered your “Laboratory Notebook”.

1.4 What is a Revision Control System?

Revision Control Systems (RCS) are also referred to as Version Control Systems (VCS) or Source Code Management (Systems) (SCM[S]) or Software Configuration Management (Systems). The basic idea is to provide a system which, in addition to keeping the files related to some software project, also records the development history. It should allow you to move around in this history, either on a per-file basis as in CVS or on a project basis as in SVN or git. As mentioned before, useful supplementary information about the originator of a particular change can also be kept in addition to creation and modification times. Finally, one of the most important features of a revision control system is the preservation of change messages, in other words a “change log”. Writing meaningful, succinct and complete change messages is extremely helpful to you and other developers on the team. The change message “bug-fix” is useless, while:

module M: changed function F to fix bug #4587, clear memory for temporary string before it is reused to prevent output from being garbled by stale information

tells you exactly what was done and why. It also links this particular change to a bug, which is useful if you’re trying to figure out when a particular bug was fixed.

More advanced systems also understand the renaming of directory structures and files. They can also help you with undoing changes that turn out to be incorrect in retrospect. If you follow good practice in revision control, this “undo” functionality can be as simple as one command. As we will see, versatile systems are adept at managing branches as well as their splitting and merging, thereby supporting you in the deployment of new features or bug fixes. This also aids in keeping a structured programming work-flow or at least a structured program history, as we will see later. Finally, these systems offer various features that help with release management and collaborative development.

1.5 Revision Control Features

1.5.1 Branches

The concept of branching will be central to a large part of this document and it is therefore important to describe it here. In basic revision control there will only ever be one copy of the source code that the developers work on. For the purpose of publishing releases, copies might be made at a given point in time and labelled somehow, say “version 1.0”.

Branches are a way of keeping track of multiple copies of a software project and you might think of them as virtual directories. They can come in useful when multiple versions of your software are in use at the same time and you need to fix a bug in a few of them. Alternatively you might keep a “stable” and an “unstable” version of your code, where only the latter has new features added.

More advanced branching systems are possible. We will see that branches can be used to test an idea or as an organizational tool to write a fix for some bug. These advanced features require that the revision control system can handle branches easily and we will see that this is the case in git.

1.6 Centralized and Distributed Version Control Systems

1.6.1 CVS, SVN ... - Centralized Version Control Systems

Centralized version control systems such as CVS or SVN operate by storing the source code and development history of your program on a central server on the internet or local intranet. The developers of a given software project connect to these systems through a special client which allows them to download the source code for the program and submit their changes back into the central database.

In the slide we can see how the development team edits a shopping list with each member adding an item to the central document. The team members must keep in mind to always keep their local version of the remote data up-to-date because conflicts (not at this scale) can become difficult to resolve. This works very well in general but has some major drawbacks:

• most commands rely on client-server communication, making it difficult to work offline or when the server is down
• the systems are often not well-suited for incremental development because a large number of revisions (commits) slows them down substantially
• because of the way that these systems store the database of changes, branching and in particular merging branches can be very expensive (or even impossible) operations
• this storage paradigm also means that if the meta-data is damaged, none or only some of the history can be recovered; in some cases it even results in complete data loss
• resolving conflicts can be very difficult

It is important to realize that centralized revision control systems are very useful tools and they are certainly much better than not using revision control at all. It is possible to argue, however, that they are not necessarily ideally suited for the type of development work that we do.

Git, bzr ... - Distributed Version Control Systems

Distributed version control systems do away with many of these limitations. First of all, every operation is carried out locally on your hard drive, which makes them orders of magnitude faster than server-bound systems. This also means that you can use them purely locally to keep track of the changes to your own projects without setting up a repository on a server. In fact, it generally takes one or two commands to create a new repository and track the files in a given folder. This is very helpful when writing a paper or developing small tools. It also means that all copies of the repository are essentially equivalent, therefore safeguarding your project against data loss because the complete project history can be recovered from any of the “clones”.

Decentralized version control systems are very flexible in the way that the project data can be shared between collaborators. Rather than being limited to interacting over a central server, team members can exchange changes between each other and push them to a central repository individually. Because of the way these version control systems store their data, any conflicts that arise between the different repositories can, in general, be resolved quickly and easily.

This flexibility makes it possible to employ many different and highly efficient work-flows in the software development process. In particular, it becomes quite easy to support teams that rarely meet in person and may work at vastly different times of the day.

In the slide we can see that the team is editing a shopping list. Each member adds an item to their version of the shopping list. Some members decide to pull in the changes from other members and all of them push their changes to the central repository. In the end, all of them can synchronize their local copies by pulling in the final version. Every version and every change is identified by a unique ID, which makes it easy to keep track of which “revision” of the code one is working on.

Of course not everyone would agree that this type of system is beneficial. Without proper care it can become confusing to see which version is “current” in the sense of a linear history. As a result, individual developers will have to make more of an effort to manage their own code-bases and team leaders will have to keep an overview. Having said that, there is nothing that prevents you from using a traditional work-flow with a central repository while benefiting from all the good features that these systems offer, as you can see in slide 11.
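As a minimal sketch of the “one or two commands” mentioned above, the following creates a repository in a scratch directory and records a first commit. The directory name, the file name shopping-list.txt and the identity are made up for illustration; the user.name and user.email settings are local to this repository and are only there so that the example does not depend on any global git configuration.

```shell
# Work in a throw-away directory so that nothing existing is touched.
work=$(mktemp -d)
cd "$work"

# Create an empty repository in a new folder (the name is hypothetical).
git init -q shopping
cd shopping

# Identity used for commits in this repository only (placeholder values).
git config user.name "Alice Example"
git config user.email "alice@example.org"

# Create a file, tell git to track it and record the first commit.
echo "pizza" > shopping-list.txt
git add shopping-list.txt
git commit -q -m "shopping list: add pizza"

# Show the history, one line per commit.
git log --oneline
```

No server is involved at any point; the complete history lives in the .git directory inside the folder.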

1.7 git - the stupid content tracker

1.7.1 storage paradigms - differences vs. snapshots

As mentioned before, it is the database model of distributed version control systems which makes them so powerful. In particular we will take a look at the database model implemented in git. Rather than saving incremental changes to every file in the repository, the system takes “photographs” or snapshots of the current state of the code-base, referencing objects in their development history if they don’t change (as indicated by dashed outlines in the diagram).

This allows git to track the history of every chunk of a file even many revisions back into the past. Many version control systems will be unable to tell you where a given chunk of code came from because of the loss of context (line numbers change, whole sections of code get re-factored etc.) - git is different and vastly more powerful in this respect.

This functionality can become a major factor in bug fixing. A given chunk of code, perhaps containing a bug, can be present in many different releases of your code. If this bug is a show-stopper, you might want to fix it across several releases that are still in active usage. With git, this is quite easy. You simply identify the point in time (the “commit”) at which the bug was introduced and then look at all releases that can trace their history back to this commit.

Another benefit of the database model is the concept of immutability. A new commit will never destroy any history in git, it will only add to it, no matter what sort of (basic) operation is carried out on the repository [1]. No more data loss!

[1] There are, however, ways to completely remove a commit from the history if something was introduced by mistake, for example. One has to be very careful when using these features, especially if the code is developed collaboratively and the history has already been shared with others.

Finally, this reference/snapshot based storage model makes branching and merging very cheap and simple. Also, the number of commits in a repository has only a marginal effect on performance even when doing operations on large chunks of history. This fact encourages the submission of small, easily tracked and traceable commits, the content of which can be easily described in one or two lines in the commit message.

By contrast, the documentation of centralized version control systems even suggests that code should only be checked in if it constitutes a change that can “stand on its own”. It is easy to see why this is the case: a checked-in revision will be used by every other contributor of the software project. Still, the “trunk” of many centralized repositories often accumulates revisions where untested or unfinished code was checked in, giving other developers headaches. As we will see, cheap branching and merging allows you to develop incrementally under full revision control while allowing you the necessary time to only check features into a central repository when they’re ready.
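The snapshot idea can be made visible with a small experiment. In the sketch below (file names and identity are invented for illustration), a second commit changes only one of two files. Asking git for the object ID of the untouched file in both commits returns the same checksum, which suggests that both snapshots reference the same stored object rather than holding two copies.

```shell
# Scratch repository with a placeholder identity.
work=$(mktemp -d)
cd "$work"
git init -q snapshots
cd snapshots
git config user.name "Alice Example"
git config user.email "alice@example.org"

# First snapshot: two files.
echo "pizza" > list.txt
echo "remember the napkins" > notes.txt
git add list.txt notes.txt
git commit -q -m "add shopping list and notes"

# Second snapshot: only list.txt changes.
echo "beer" >> list.txt
git add list.txt
git commit -q -m "list: add beer"

# Object IDs of notes.txt in the previous and in the current commit.
old_notes=$(git rev-parse HEAD~1:notes.txt)
new_notes=$(git rev-parse HEAD:notes.txt)
echo "notes.txt before: $old_notes"
echo "notes.txt after:  $new_notes"

# For comparison: list.txt did change, so its object IDs differ.
old_list=$(git rev-parse HEAD~1:list.txt)
new_list=$(git rev-parse HEAD:list.txt)
echo "list.txt before:  $old_list"
echo "list.txt after:   $new_list"
```

For the bug-fixing scenario described above, git tag --contains followed by a commit ID lists the tags whose history includes that commit.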

1.7.2 revisions vs. commits

After having talked about the data storage models used by different revision control systems it is important to understand the difference between revisions as in CVS or SVN and the commits you will encounter in git.

A revision can be thought of a little like a version number in a linear history of some code-base (or file in the case of CVS). Revisions are convenient because it is relatively easy to say which version of the code one is referring to. However, moving between them can be quite an expensive operation because of the underlying storage model. When you go back a number of revisions, the RCS has to replay the history up to that point and if you have a large number of revisions, this can take a while.

In a distributed revision control system the idea of a linear history is not necessarily representative of the real situation. You might have a given version of the source-code with some additions of your own while the “true” group repository holds some other version. You might have branched off the group repository at some point and added commits of your own, then synchronized some (or all) commits that appeared in the main repository in the meantime.

A commit therefore identifies a set of changes to one or more files relative to some starting situation. git keeps a checksum for every object in a repository, including commits. Although these are long and hard to remember, git attempts to help you out by matching the first few characters that you type to the identifier, as long as the match is unique. The checksum of a commit depends on the content (i.e. which lines were changed, added or removed) as well as the starting version of the file(s) that has (have) been touched. The most recent commit on the branch you are currently working on is also called the “HEAD” commit and is often used in the same way as a revision number in other revision control systems.

As we will see, there are simple ways of extracting a given commit (or set of commits) from one repository and applying it to another, even if the histories of the repositories do not match exactly.

Special significance is given to commits relating to the merging of branches. One would generally say: “I merged branch XYZ into the master branch” and git will generate a special commit for this operation. This type of commit keeps track of which histories were merged and more importantly, which commits were applied to the target branch as a result. This has two major benefits: firstly, commits which exist in both histories do not suddenly appear twice. Secondly, one could undo all the commits related to a merge by reverting the merge commit, a functionality which is not offered by SVN, for example.

Finally, in order to “pull” new commits from a remote repository you don’t necessarily need to synchronize all files or commit all the changes that you have made to your local copy. As long as the new commits that you plan to “pull in” do not overlap with any changes that you haven’t yet committed, you can proceed without losing your work.
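The following sketch illustrates two of the points above: a short, unique prefix of a checksum is enough to name a commit, and a merge can be undone by reverting the merge commit. The repository, branch and file names as well as the identity are hypothetical. The first branch is explicitly called master here so that the example does not depend on the configured default branch name.

```shell
# Scratch repository with a placeholder identity.
work=$(mktemp -d)
cd "$work"
git init -q merges
cd merges
git symbolic-ref HEAD refs/heads/master   # name the initial branch "master"
git config user.name "Alice Example"
git config user.email "alice@example.org"

echo "pizza" > list.txt
git add list.txt
git commit -q -m "list: add pizza"

# Work on branch XYZ ...
git checkout -q -b XYZ
echo "olives" > toppings.txt
git add toppings.txt
git commit -q -m "toppings: add olives"

# ... while master moves on as well.
git checkout -q master
echo "beer" >> list.txt
git commit -q -a -m "list: add beer"

# Merge XYZ into master; git records a merge commit with two parents.
git merge -q -m "merge branch XYZ into master" XYZ

# A short prefix identifies the commit as long as it is unique.
short=$(git rev-parse --short HEAD)
full=$(git rev-parse "$short")
echo "$short is short for $full"

# Undo everything the merge brought in by reverting the merge commit.
# "-m 1" names the first parent (master before the merge) as the line to keep.
git revert --no-edit -m 1 HEAD
ls
```

The revert is itself a new commit, so the merge and its undoing both remain visible in the history.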

1.7.3 the staging area

An important aspect of git is the introduction of the “staging area”, a sort of temporary space which can store the changes you want to submit in your next commit. This means that the development work-flow is: edit, add the file you edited to the staging area and then commit. We will see that one can even stage partial changes to a file if necessary.

While this seems like more work initially, it turns out that it allows a much more flexible work-flow. For instance, you might want to split up a long work session into a number of commits for clarity and logical consistency. The staging area also provides a natural protection against committing code that is not yet ready by mistake. Changes that have been added to the staging area can be “unstaged”, completed and re-added without risking losing them in the process.
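The edit, stage, commit cycle and the “unstage” step can be sketched as follows. File names and identity are again invented for illustration. git reset HEAD followed by a file name is one way of taking a change out of the staging area; the edit itself stays in the working directory.

```shell
# Scratch repository with a placeholder identity.
work=$(mktemp -d)
cd "$work"
git init -q staging
cd staging
git config user.name "Alice Example"
git config user.email "alice@example.org"

echo "pizza" > list.txt
echo "draft" > notes.txt
git add list.txt notes.txt
git commit -q -m "add list and notes"

# One work session touches both files ...
echo "beer" >> list.txt
echo "unfinished idea" >> notes.txt

# ... but only the change to list.txt is meant for the next commit.
git add list.txt

# notes.txt is staged by mistake and then taken out of the staging area again.
git add notes.txt
git reset -q HEAD notes.txt

# Staged changes are shown in the first column, unstaged ones in the second.
git status --short

# Only the staged change ends up in the commit.
git commit -q -m "list: add beer"
git status --short
```

After the commit, notes.txt is still listed as modified, so the unfinished edit has been neither committed nor lost.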

1.7.4 remote repositories

Another unique git feature is the fact that almost everything can be turned into a remote repository for the purpose of pulling changes or sharing them with others. If someone sends you their clone of some git repository by e-mail, you can define the directory that you saved it to as a “remote” and interact with it in much the same way as you would with code on a group server, for example.

This also makes it trivially easy to share code within working groups: simply use the network file system that you certainly have available. Alternatively you could share via some cloud storage provider like Dropbox. Finally, if you have a machine with SSH access, you have a secure git server with access restrictions.

To make a repository public (read-only), it is sufficient to copy it to some publicly accessible directory on some web server and follow a number of instructions to enable fetching via http. If you want, there are additional tools that will expose the revision history via a web interface, like gitweb. In addition, there are at least three options for running git somewhat like an SVN server: git daemon, gitosis and gitolite.
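As a sketch of the “directory as a remote” idea, the example below simulates receiving a colleague’s clone in a directory called received and registers that directory as a remote named colleague. All names, paths and identities are hypothetical, and the initial branch is explicitly called master so that the example does not depend on local configuration.

```shell
work=$(mktemp -d)
cd "$work"

# Your own repository.
git init -q mine
cd mine
git symbolic-ref HEAD refs/heads/master   # name the initial branch "master"
git config user.name "Alice Example"
git config user.email "alice@example.org"
echo "pizza" > list.txt
git add list.txt
git commit -q -m "list: add pizza"
cd ..

# Stand-in for the clone that a colleague worked on and sent back to you.
git clone -q mine received
cd received
git config user.name "Bob Example"
git config user.email "bob@example.org"
echo "beer" >> list.txt
git commit -q -a -m "list: add beer"
cd ..

# Register the directory as a remote and fetch its history.
cd mine
git remote add colleague "$work/received"
git fetch -q colleague

# Inspect what the colleague has that you do not, then integrate it.
git log --oneline master..colleague/master
git merge -q colleague/master
```

The same commands work when the remote is a path on a network file system, or an SSH location written as user@host:path.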

1.7.5 speciality: cheap branching

One of the “killer features” of git is cheap branching. Creating a new branch in git is one line and the creation takes a fraction of a second. This new branch is exactly equivalent to the branch it was split off from. Importantly, merging two branches (or merging a branch back, as shown in the image on slide 16) is a first class operation in git and therefore extremely fast and painless.

Branching allows developers to implement and test new features in a completely independent code-base, thereby not hindering the work of others with new revisions popping up every couple of hours. Also, the development process remains exactly the same. You make small changes to the code-base and commit these in logical units, making it easy to follow your process. With other version control systems a feature or bug-fix of several hundred lines could be just one commit, making it very difficult to check exactly what was changed and why.

One might argue that having many commits for one bug-fix, for example, might be confusing. However, git can produce a diff between the commit that the new branch is split off from and the end of the development work. In addition, we will see in the exercises that if you want to condense the work into just one commit at a later stage, you can do that too.

Because merging the changes back into the “master” branch (or any other branch, for that matter) is very cheap and simple in git, the branch concept should be used for most development work. Merging branches also merges their histories, thereby keeping the development process transparent. In contrast to other revision control systems it is not necessary to keep your code-base 100% in sync with the “main” repository. Any changes that occur while you are working on your new feature can be integrated at any stage of your progress. A drawback of this is of course that if you wait too long you might run into the situation that many conflicting changes have accumulated, making the merge more complicated.

Branches should be used to implement and test new features, develop fixes for bugs and they can even be used for release management together with tagging (e.g. when creating a new tag for a release you also create a branch for this release; any future bug-fixes can then be easily integrated into this version of the code-base).
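A small sketch of this branch, commit, diff and merge cycle is given below. The branch name feature-drinks, the file name and the identity are made up for illustration, and the initial branch is explicitly called master.

```shell
# Scratch repository with a placeholder identity.
work=$(mktemp -d)
cd "$work"
git init -q branching
cd branching
git symbolic-ref HEAD refs/heads/master   # name the initial branch "master"
git config user.name "Alice Example"
git config user.email "alice@example.org"

echo "pizza" > list.txt
git add list.txt
git commit -q -m "list: add pizza"

# Creating (and switching to) a new branch is a single command.
git checkout -q -b feature-drinks

# Several small commits in logical units.
echo "beer" >> list.txt
git commit -q -a -m "list: add beer"
echo "juice" >> list.txt
git commit -q -a -m "list: add juice"

# Everything the branch changed since it split off from master, as one diff.
git diff master...feature-drinks

# Merge the finished work back; the individual commits stay in the history.
git checkout -q master
git merge -q feature-drinks
git log --oneline
```

The three-dot form of git diff compares the tip of the feature branch with the commit it was split off from, regardless of how many commits lie in between.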

1.8 Branching Development Models

1.8.1 Trunk, Release

This kind of branching model is often used with centralized version control systems and is popular for the development of commercial software. All the development takes place in the “trunk” branch and all contributors have write access. As stable versions of the code are released, new branches are created as copies of “trunk” at that given point in time. The code in “trunk” may alternate between a “stable” phase during which only bug-fixes are allowed and an “unstable” phase during which new features are added.

This system seems very natural and has many benefits. First of all, it is easy to see what constitutes the “most recent” version of the software. If you work in a team which is able to meet often and manage the development process or if you have very modular code which avoids conflicts by virtue of having different people working on different parts of the code, this system should work well. In situations where none or only one of those conditions is fulfilled, conflicts can occur frequently, resulting in constant “update, resolve” cycles which may eat a lot of time. Finally, when the developers working on the code have vastly different experience levels, it becomes difficult to enforce good coding style because code review is not an integral part of the revision control process.

1.8.2 Rolling Release, Feature Branches

The fact that git has very good support for the merging of branches has led to the idea of using so-called “feature branches” to systematize the development of new features and bug fixes. In this particular development model, popular with many open-source projects, development is tracked in a single master branch. Additions are made by creating feature branches, stabilizing the given feature and then merging back into the master branch. Each empty circle on the diagram represents one commit and not all branches are shown in the diagram.

When the code-base is deemed stable enough for a release, a tag is created to identify a particular point in the revision history, as marked here by the filled blue circles. One goal of this kind of development model is to keep the code in the master branch as runnable as possible at all times. In particular, features are not merged until they have been reviewed and tested.

Notice in particular that many developments can take place at the same time. “feature A” is developed at the same time as a bug is fixed, just in time for release 0.1. “feature B”, on the other hand, is not quite ready at this point and postponed. Early in the development of “feature B” the author realizes that the functionality can be generalized and begins work on “feature C”, which turns out to be quite complicated to implement. As the release of version 0.2 nears, work on “feature B” is completed and these changes are then merged both into the master branch as well as the “feature C” branch, because B and C touch some of the same files.

This kind of development model is very well suited for distributed teams and in some sense integrates code review into the revision control process. When you have completed a given feature and are ready to merge it into the main code-base, you can ask a colleague to go through your code and run some tests, to make sure that everything works as expected. In some situations only the team leader would have write access to the master branch on the group work server, in which case a final code review would go through them before they pull in the code.

A major benefit of a branching development model is that it allows experimentation without disturbing work in other branches. If an experiment doesn’t work out, the branch is simply deleted. If an idea turns out to be great, it is polished up for merging and integrated into the code-base. At every point in time the developer can choose to synchronize with developments in other branches or postpone any conflict resolution to one final step.

This model has two major drawbacks. Firstly, the ease of creating new branches can lead to a situation where some developers do too many things at the same time without ever really finishing any of their ideas. Secondly, when one is working on a monolithic code-base, say one file with many thousands of lines and hundreds of crucial global variables, conflicts might arise too frequently, slowing development. In both cases one could argue, however, that these are just symptoms of an underlying problem: the former can be fixed by better (self-)management of developers while in the latter case one must say: “monolithic code is evil”.
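The following sketch walks through a miniature version of this model: a feature branch is merged into master, the release is marked with a tag, and an experiment that did not work out is deleted. The branch names, the tag name v0.1, the file names and the identity are all hypothetical.

```shell
# Scratch repository with a placeholder identity.
work=$(mktemp -d)
cd "$work"
git init -q rolling
cd rolling
git symbolic-ref HEAD refs/heads/master   # name the initial branch "master"
git config user.name "Alice Example"
git config user.email "alice@example.org"

echo "core" > core.txt
git add core.txt
git commit -q -m "initial code-base"

# Feature A is developed on its own branch ...
git checkout -q -b feature-A
echo "feature A" > a.txt
git add a.txt
git commit -q -m "implement feature A"

# ... and merged once it is reviewed; --no-ff records an explicit merge commit.
git checkout -q master
git merge -q --no-ff -m "merge feature A" feature-A

# Mark the release with an annotated tag.
git tag -a v0.1 -m "release 0.1"

# An experiment that does not work out is simply deleted.
git checkout -q -b experiment
echo "bad idea" > x.txt
git add x.txt
git commit -q -m "try out a bad idea"
git checkout -q master
git branch -D experiment    # -D deletes the branch although it was never merged

git tag -l
git log --oneline
```

The tag keeps pointing at the state of the release even as master moves on.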

1.8.3 Stable Branch, Development Branch, Feature Branches

The final branching model that we’re going to discuss here is one which combines the benefits of the previous two models. It is based on the establishment of a “master” and a “develop” branch with the idea that whatever is in the master branch constitutes tested, runnable code while the development branch might sometimes fail to work as intended. All development work is carried out on the development branch and new releases are regularly merged into master, as they stabilize. Sometimes one might have the situation that a bug is fixed and then “back-ported” to the most recent release in the master branch.

In this model it is clearly easier to see what is current, but the fact that the development branch is not required to be in a runnable state may encourage some sloppiness. On the other hand, major architectural changes sometimes cannot be made in a feature branch without major headaches (because they may touch many or most files of the project), in which case working directly on the development branch makes a lot of sense.
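One possible way of “back-porting” a fix from the development branch to master is git cherry-pick, sketched below. This is only an illustration under invented names (branches master and develop, the two files, the identity); whether back-ports are done by cherry-picking or in some other way is a decision for your team.

```shell
# Scratch repository with a placeholder identity.
work=$(mktemp -d)
cd "$work"
git init -q stable
cd stable
git symbolic-ref HEAD refs/heads/master   # name the initial branch "master"
git config user.name "Alice Example"
git config user.email "alice@example.org"

# The released code on master contains a bug (here: a typo).
echo "piza" > list.txt
git add list.txt
git commit -q -m "release 1.0"

# Development continues on the develop branch.
git checkout -q -b develop
echo "experimental" > feature.txt
git add feature.txt
git commit -q -m "start work on new feature"

# The bug is found and fixed during development.
echo "pizza" > list.txt
git commit -q -a -m "list: fix spelling of pizza"
fix=$(git rev-parse HEAD)

# Back-port only the fix to master, without the unfinished feature.
git checkout -q master
git cherry-pick "$fix"
git log --oneline
```

master now contains the fix but not feature.txt, which stays on develop until it has stabilized.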

1.9 Remotes and Branching Models

1.9.1 Revisiting the “Central” Repo

It should be clear by now that in a distributed revision control system there is no one “true” repository. Yet, a team can very easily designate that the repository on the group work server is special, perhaps only the team heads have write access there and certainly any releases of the code are based on that repository. Still, from a revision control point of view, all clones of the repository are equivalent. They can all have master branches and one can establish feature branches locally as well as on the group repository.

But how does this combine with the ideas about branching development models discussed in the previous sections? The answer lies in the fact that the feature branches shown in the previous sections do not have to live in the same repository. Since everyone can pull from everyone else, specific work-sharing and code reviews are framed as so-called “pull requests”.

1.9.2 What is a “pull request”?

This slide shows a typical situation in collaborative software development using a distributed revision control system. Bob contacts Alice, who is the team leader for this project, to inform her that he has a new feature ready to be introduced into the main code-base. In the development process he cloned Alice’s repository and created a new branch for the development of feature A. He then made a few logically connected, self-consistent commits.

Alice can add his repository as a remote and take a look at his changes using a number of interfaces provided by git. In this process she notices that some of the code isn’t quite up to the coding standards for her project and she asks Bob to document routine F a little bit better. In the meantime other people add other features to the code-base, represented by the commits in Alice’s master branch. After adding the comments that Alice requested, Bob notices that there is now one conflicting change in one of the files that he’s worked on. In order to make the pull-request painless for Alice, he merges in the changes that have occurred since he branched off and resolves the conflict.

As we will see later, git offers another technique for this particular step called “rebasing”. Rather than merging two branches, it is possible for git to extract the work that Bob has done to implement his feature and “replay” the commits on Alice’s current master branch. This makes it appear as though Bob branched off the very latest commit on Alice’s master branch and as a result, the revision log will have one fewer merge commit. Rebasing needs to be handled with care because it literally rewrites the revision history and should never be used if some changes have already been shared and some new developments have been based on these changes. Whether you prefer the one or the other for the purpose of synchronizing a feature branch with the target of a pull-request should be discussed in your team.

This kind of work-flow requires a little bit more work from the more experienced developers on a project, but it pays off in producing code that potentially has fewer bugs and is more consistent in terms of coding style. Having said that, it is of course completely optional and development using git can also proceed in a traditional way where everyone has write access to the group repository. Even if that is the case, code review can still take place via the mechanism discussed above.

1.10 Source Code Management and Collaboration

1.10.1 redmine & github

The last topic that will be covered before moving on to the exercises is about interfaces for collaborative software development. These are systems which allow links to be established between bug reports, feature requests, discussions, documentation and the source code repository in which all these things are developed.

In practice, there are three types of these systems. The first type is generally part of some integrated development environment (IDE) and will not be discussed here.

The second type is generally a web interface that can be run on the web server of some development team and requires some effort to set up. An example of such an interface is redmine, a mature and widely used open-source project which provides bug tracking, project and team management, a Wiki for documentation and planning, a discussion system as well as interfaces for various revision control systems. It is accessed through a browser and allows team members (or the public, depending on the setup) to browse the source code and revision history, discuss issues and perform code review. There are many plugins which implement a myriad of functionalities and the system is certainly worthwhile if you have someone who can keep it up and running. Another, slightly more comfortable system is launchpad, made by Canonical, the developers of Ubuntu.

The third type consists of service providers which run systems comparable to redmine, but with more functionality, on their own web servers. These services, both commercial ones like github and free ones like savannah, have the benefit that a third party runs the service, reducing the administrative burden. One has to consider the possibility, though, that a service might go bankrupt, and it can be argued that it’s a bad idea to base one’s development process on a provider which could disappear due to market forces.
Also, some people might disagree with hosting their project with a third party. In terms of functionality these services can be truly excellent. Github, in particular, offers very thorough support for collaborative software development using git and pull requests. The whole procedure (cloning repositories, creating branches and managing pull requests, i.e. creating, discussing, updating as well as accepting or rejecting them) can be done from the browser. If you’re so inclined you can even use the online text editor to make changes to the code.

2 Exercises

These exercises will hopefully allow you to frame the abstract information from the lecture with concrete examples. They consist of two sets, the first of which introduces git and helps you understand the basic commands that you will meet when you use it to work on some project. The second set consists of a number of specific use-cases and more specialised commands which should help you to deepen your understanding.

2.1 git and SVN, Similarities and Differences

If you have ever used SVN, it is strongly recommended that you read through at least the first couple of paragraphs of the git SVN Crash Course at http://git-scm.com/course/svn.html. There are many git commands with the same or similar names as in SVN but they are used for very different purposes. It may be a good idea to keep the page open in the background.

2.2 Basic Exercises

2.2.1 git Configuration

We presume here that you have already installed git and are ready to go. One of the purposes of revision control is proper attribution of work; for this, git uses a user name and e-mail. In addition, when you make a new commit, git will let you edit the commit message in your favourite editor, which is why you should set it up here. It is very useful to remember that a shell with good auto-completion will provide you with all the available options when you write git config --global and press the [TAB] key. In the following, if you don’t specify a value for a given key, git will print the current value. Go ahead and adjust the user name, user e-mail and editor.

    git config --global user.name "John Doe"
    git config --global user.email "[email protected]"
    git config --global core.editor "vi"

There are two types of configuration databases considered by git. One is the global one that we just edited, the other is local to a git repository. For instance, you might want to have different identities when contributing to projects in your free time and at work. The local information is set via git config --local using the same key names as above; it overrides the global setting for that repository.
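The precedence of local over global settings can be checked without touching your real configuration. The following is a minimal sketch under a few assumptions: HOME is pointed at a temporary directory so that "global" means a scratch file, and the e-mail addresses and the directory name work-project are hypothetical.

```shell
set -e
sandbox="$(mktemp -d)"
export HOME="$sandbox"                      # assumption: scratch HOME, your real ~/.gitconfig stays untouched
unset XDG_CONFIG_HOME GIT_CONFIG_GLOBAL
cd "$sandbox"

git config --global user.name "John Doe"
git config --global user.email "john@home.example"   # hypothetical address

git init -q work-project
cd work-project
git config --local user.email "john@work.example"    # hypothetical address

git config user.email            # effective value here: john@work.example
git config --global user.email   # the global value is unchanged: john@home.example
git config user.name             # no local value, so the global one applies: John Doe
```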

2.2.2 Setting up a Local Repository

Note: this section is mandatory for all the following exercises, do it.

As explained in the lecture, one of the most important features of distributed version control systems is the fact that they are local. Hence, putting any directory on your machine under version control involves only a few commands. In the code below, calling git commit will open the editor that you have configured above. Any line NOT starting with # will be taken as the commit message, which in practice should be meaningful, succinct and comprehensive. To accept the commit message and finalize the commit, use the “save & quit” functionality of your editor (e.g. :wq for vi).

    mkdir newrepo
    cd newrepo
    git init         # creates the .git directory
    git status       # see the current status of the local repository
    touch README     # create the first file in the repository
    git add README   # put README into the "staging area"
    git commit       # the initial commit finalizes the initialization
    git log          # see the commit history

NOTES:

• A git repository is only fully initialized after the first commit.

• For some operations, git will provide a standard commit message; in all other cases you have to write something. A commit will be aborted if the commit message is empty!

• When we add files to a repository we say that they are now “under revision control”. Files that are under revision control can also be referred to as “tracked”, in contrast to files which aren’t, which we call “untracked”.

• git cannot track empty directories, so adding an empty directory will not work. Once there’s a file in the directory and you add this file, the directory will be known to git.

• In the following exercises you should make frequent use of the git status command to see what is going on under the hood.

2.2.3 Making Changes and the Staging Area

Git has a feature called the “staging area” which allows you to plan what to include in the next commit. For example, in order to add a new file to the repository you just created, you would type:

    vi newfile.c
    git add newfile.c
    git commit

and the syntax for modifying an existing file is equivalent:

    vi newfile.c        # edit the existing file
    git add newfile.c   # add the changes to the staging area
    git commit          # commit the changes

A short note on git commit: specifying the -m option allows you to provide the commit message in quotes:

    git commit -m "commit message"

which will finalize the commit with that message right away.

2.2.4 Committing Two Changes Separately

Git gives you a lot of flexibility when it comes to your work-flow. In particular, it has facilities in place to avoid commits that are too large to be useful. If you are implementing or debugging a feature, it makes sense to split the work into logical units so that you and others can later follow the development history. This also makes it easier to revert certain changes while preserving others, if it turns out that this is necessary. The following commands will create two commits even though you edited everything in one go.

    vi file1.c        # create file1.c with some content
    vi file2.c        # create file2.c with some content
    git add file1.c   # add changes to file1.c to staging area
    git commit        # commit changes to first file
    git add file2.c   # add changes to file2.c to staging area
    git commit        # commit changes to second file

NOTES: there are short-cuts.

• To commit everything that has changed simply do “git commit -a”. This will do “git add” on all changed files and then immediately commit (it will not add new files, however).

• To add all (tracked) files with changes to the staging area, type “git add -u”.
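That “git commit -a” only picks up tracked files is easy to verify in a scratch repository. A minimal sketch; the file names and the identity are made up for illustration:

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Jane Doe"             # hypothetical identity, local to this repo
git config user.email "jane@example.com"

echo one > tracked.c
git add tracked.c
git commit -q -m "add tracked.c"

echo two >> tracked.c                       # modify a tracked file
echo new > untracked.c                      # create a file git does not know about yet

git commit -q -a -m "update tracked.c"      # stages and commits tracked.c only
git status --short                          # untracked.c is still listed as untracked (??)
```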

2.2.5 Unstaging a Change

It will happen that you stage changes and then realize that they’re not quite ready. Or perhaps you decide that the commit you are preparing should really be two. In any case, you can of course unstage your changes without losing them. In the following, let’s edit file1.c and file2.c again, add both, unstage one of them and commit the other. Then we edit the file we unstaged and commit it with the finalized changes.

    vi file1.c
    vi file2.c
    git add file1.c file2.c
    git reset HEAD file2.c    # unstage file2.c, the changes stay in the working copy
    git commit                # commits the changes to file1.c only
    vi file2.c                # finalize the changes to file2.c
    git add file2.c
    git commit

2.2.6 Reverting a Commit

Sometimes it might happen that you’ve committed something which turns out to have been a bad idea. Luckily, git makes it quite easy to revert such a change. In the following, let’s revert the last commit, the change in file2.c. Please note that you will have to look up the commit hash in your own log. Don’t copy the one written here, it won’t work (also, I just typed some random letters and numbers).

    git log   # look up the commit hash which you would like to revert
    git revert 78asd987faojsfoiuer97f87faqe

Note that this procedure preserves the actual information in the commit. If, at a later stage, you realize that you need these changes after all, you can revert the revert. We can go ahead and do this right away:

    git log   # look up the commit hash of the revert above
    git revert lkjsafiu879834asdfjasf897234

NOTES:

• Reverting commits involving the merge of two branches may be somewhat more complicated to handle and we will cover this situation in one of the advanced exercises.

• Remember that git will match partial hashes as long as the match is unique. So instead of copying and pasting “78asd987faojsfoiuer97f87faqe” from the log, you could probably get by with just writing “78asd” and the commands above would work just fine.
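Reverting, reverting the revert and using an abbreviated hash can all be tried in one go. This is a minimal sketch in a scratch repository; the file contents and the identity are made up, and --no-edit is used so that git accepts its default commit message without opening an editor.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Jane Doe"             # hypothetical identity, local to this repo
git config user.email "jane@example.com"

echo v1 > file2.c
git add file2.c
git commit -q -m "add file2.c"
echo v2 > file2.c
git commit -q -a -m "change file2.c"

short="$(git rev-parse --short HEAD)"       # abbreviated hash of the last commit
git revert --no-edit "$short"               # undo the change, file2.c contains v1 again
git revert --no-edit HEAD                   # revert the revert, file2.c contains v2 again
git log --oneline                           # four commits, nothing was lost
```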

2.2.7 File Operations

Unlike other revision control systems, git will understand what should happen after standard filesystem operations like mv or rm on files that it tracks, as long as you remember to call git add on the moved files. However, the correct way of dealing with files in a repository is to use git for these operations.

    vi file3.c   # create file3.c and put it under version control
    git add file3.c
    git commit -m "added file3.c"
    git mv file3.c file4.c
    git commit
    git rm file4.c
    git commit

At this point, in order to recover the file, you could revert the last commit. But what if you accidentally call git rm and don’t want to create a commit for a mistake? Since the command is somewhat special, because it combines a change to the staging area with a file operation, undoing it is unfortunately also a little more work. We will now meet the checkout command, which marks an important difference between git and SVN. Let’s begin by reverting the commit that deleted file4.c (remember you need to look up your own commit hash):

    git log   # look up commit hash
    git revert ad2cfad02a85210a533d5421f69d16cdecd8d8ac

and now we delete it again and realize our mistake.

    git rm file4.c            # uhoh...
    git status                # the deletion is now staged for a commit
    git reset -- file4.c      # remove the operation from the staging area
    git status                # hmm, that’s odd, "deleted file4.c" is now an unstaged change
    ls                        # indeed, the file is not back... what now?
    git checkout -- file4.c   # grab the file from the repository

So what happened in those six lines? The third line should be clear but the sixth is odd: we are recreating a file that was visibly gone even after we “reset” the staging area. The answer lies in the fact that the current directory with all the files is only a “working copy”, a representation of the repository which really lives in a database in the .git directory. When we call git checkout with the path separator -- and a filename which we know exists in the repository, we are copying it from the database into the working directory. Let’s do a quick demonstration of this.

    vi file4.c                # make some changes to file4.c
    git status                # should show file4.c as modified
    git checkout -- file4.c   # check out the file from the repo
    git status                # our changes are GONE!

NOTES:

• Commit messages for file operations can be tedious, but git provides a nice template in the commented out section that you see in the editor. You can simply uncomment the relevant part and you’re good to go!

• git checkout can also be used for grabbing specific versions of one file from remote repositories; we will revisit this at some later stage.

• In many git commands you can refer to commits, branches, remote branches and paths at the same time. In order to not confuse git, you use the path separator -- as shown above.
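The combination of a commit reference and the path separator can already be tried locally. The following minimal sketch grabs an older version of a single file; HEAD~1 (the commit before the latest one), the file contents and the identity are illustrative choices.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Jane Doe"             # hypothetical identity, local to this repo
git config user.email "jane@example.com"

echo "version 1" > file4.c
git add file4.c
git commit -q -m "file4.c, version 1"
echo "version 2" > file4.c
git commit -q -a -m "file4.c, version 2"

git checkout HEAD~1 -- file4.c              # commit first, then --, then the path
cat file4.c                                 # the working copy now holds version 1
git checkout HEAD -- file4.c                # go back to the committed state
cat file4.c                                 # version 2 again
```

Note that checking out a file from another commit also places that version in the staging area, so git status reports it as a staged change until you check out the current version again or commit.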

2.2.8 Creating a New Branch

Branching is a very important aspect of git. As explained in the lecture, branches can be thought of as “virtual directories”. In fact, git stores all the information for a repository in the hidden .git directory while the files in the working directory of the repository are only a “representation” of the code. If you switch branches, all the (tracked) files in the working directory are replaced. Because it is really cheap to create and merge branches, they can be used to introduce new features without touching the “master” copy of the code; “master” is also the name given to the default branch when a new git repository is created. In addition to working with files as we saw before, the git checkout command is also used for switching between branches (it has some other uses which we won’t cover here).

    git branch                # see the currently available branches and
                              # which branch you’re working on (starred)
    git branch -a             # see all branches, including remote ones

    git branch newfeature     # create a new branch called "newfeature"
    git branch                # see that you have just created a new branch
    git checkout newfeature   # switch to the newfeature branch

    vi newfeature.h
    git add newfeature.h
    git commit

    git checkout master       # switch back to the master branch
    ls                        # see that newfeature.h is not present
    git merge newfeature      # merge the changes from the
                              # newfeature branch into master

    git branch -d newfeature  # delete the now redundant branch

NOTES:

• “git branch -d” will warn you if you haven’t merged the changes into the branch that you branched off from. If you want to drop an unmerged branch, use the “-D” switch.

• “git checkout -b newfeature” is a short-cut for creating a branch and switching to it immediately.

• If you have uncommitted changes in your working directory, you will only be able to switch between branches if those changes do not affect the same files in the two branches. This can be a problem at times but can be worked around either by creating another clone of the repository and switching branches there or, as we will see in the advanced section, using git stash.
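The difference between “-d” and “-D” is easiest to see with a branch that has not been merged. A minimal sketch in a scratch repository; the identity and file names are made up, and the first branch is explicitly named master to match the text.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master     # make sure the first branch is "master"
git config user.name "Jane Doe"             # hypothetical identity, local to this repo
git config user.email "jane@example.com"

echo base > main.c
git add main.c
git commit -q -m "base"

git checkout -q -b newfeature               # create the branch and switch to it
echo feature > newfeature.h
git add newfeature.h
git commit -q -m "add newfeature.h"
git checkout -q master

# newfeature has not been merged into master, so -d refuses to delete it
if git branch -d newfeature 2>/dev/null; then
    echo "deleted"
else
    echo "refused: newfeature is not merged"
fi

git branch -D newfeature                    # force the deletion, the commit becomes unreachable
git branch                                  # only master is left
```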

2.2.9 Interacting with a Remote Repository - Cloning, Fetching, Pulling and Resolving Conflicts

Although git is useful even when used locally for tracking your own projects, revision control systems really shine when working in groups or teams. Unlike other systems, git does not require a server to be useful. Also unlike other systems, if you have SSHD running on some machine, you can easily set up your own “git server” with SSH access, as we will see in the advanced section. You can also share git repositories on a simple web server for read-only access. When collaborating, the first step is usually to clone the remote repository residing somewhere. For the purpose of playing around with this we are going to start by cloning the repository we created above:

    cd ..
    git clone newrepo newrepo.copy
    cd newrepo.copy
    git remote -v           # show which remotes have been set up
                            # ’origin’ is the default after cloning
    git status
    git log                 # show the log of the current newrepo.copy branch
    git log origin/master   # show the log of the master branch
                            # of "newrepo"
    git branch -a           # show all branches, including remote ones

The “git remote -v” command above lists the “remotes” which have been set up for this repository. To interact with a remote repository, three commands are mainly used: git fetch, git pull and git push. It is important to understand that git keeps a local cache of a remote repository and that its history, up to the point in time at which it was last fetched, is also available offline. To demonstrate interaction with a remote, let’s go back to the original version of our repository and change a file.

    cd ..
    cd newrepo
    vi file1.c
    git commit -a -m "incredibly important changes to file1.c"
    cd ..
    cd newrepo.copy

We will now proceed in three steps: first we are going to check out the commit history of the local cache of the “origin” remote. As you will see, our new commit will be absent because we haven’t yet fetched. We will then fetch the changes, which will update the local cache of “origin”. Finally, we are going to pull in the changes, the step which is going to merge the remote changes with our clone of the repository. We could also do a pull right away, but then we would not be able to inspect any changes before we integrate them into our clone.

    git log origin/master    # look at the log of origin/master
    git fetch origin         # update local cache of "origin"
    git log origin/master    # look at the log again
    git diff origin/master   # display the differences between our clone
                             # and the origin/master branch
    git pull origin master   # actually merge the remote changes
    git log

As you can see, the change from “newrepo” has now been integrated into our clone. When we call git pull, git internally calls the merging algorithms that are also used in git merge above.

A subtle point: There is an issue which might be important to some people so let’s deal with it right here. You can skip it at a first pass if you like, in most cases it really isn’t very important. Looking at the log of newrepo.copy in the last command above, you can see that there’s no mention of the fact that we pulled from “newrepo”. This is because git supports something called a “fast-forward” merge. When the branch you are pulling into has no commits of its own that the remote lacks, git can simply move the branch forward to the latest commit of the remote, and this makes it look as if there has never been a merge. Some people object to this because they would like to preserve history for every operation. Let’s go back to “newrepo”, introduce another change to file1.c and commit it (you know how to do this by now). Going back to “newrepo.copy”, we will now pull a little differently:

    git pull --no-ff origin master
    git log

As you can see, git now opened our editor to add a commit message. Looking at the log, it is now clear that we merged changes from the “origin” repository at a given point in time. Note also that there are TWO commits now: one refers to the change to file1.c while the other refers to the merge. If you plan to use git mostly like SVN, on a central server where everyone has write access, it makes sense to allow fast-forward merges.
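The difference between the two kinds of pull can be reproduced from scratch. The following is a minimal sketch, not the exact exercise repositories: it recreates “newrepo” and “newrepo.copy” in a temporary directory with made-up contents and identity, passes --no-edit so that no editor is opened, and sets pull.rebase to false as an assumption to make sure pulling means merging here.

```shell
set -e
cd "$(mktemp -d)"

git init -q newrepo
cd newrepo
git symbolic-ref HEAD refs/heads/master     # make sure the first branch is "master"
git config user.name "Jane Doe"             # hypothetical identity
git config user.email "jane@example.com"
echo a > file1.c
git add file1.c
git commit -q -m "add file1.c"
cd ..

git clone -q newrepo newrepo.copy
cd newrepo.copy
git config user.name "Jane Doe"
git config user.email "jane@example.com"
git config pull.rebase false                # assumption: pull by merging

# first change in newrepo, pulled as a fast-forward
cd ../newrepo
echo b >> file1.c
git commit -q -a -m "change 1"
cd ../newrepo.copy
git pull -q origin master
git rev-list --merges --count HEAD          # number of merge commits so far

# second change in newrepo, pulled with an explicit merge commit
cd ../newrepo
echo c >> file1.c
git commit -q -a -m "change 2"
cd ../newrepo.copy
git pull -q --no-ff --no-edit origin master
git rev-list --merges --count HEAD          # one merge commit more than before
git log --oneline --graph
```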

Continue here if you skipped: The next topic that we are going to cover in this section is that of conflicts and the resolution thereof. In a collaborative environment, other people will of course work on code and make changes to files. As long as these changes don’t touch the same lines in a given file as your own changes do, git will be able to deal with these automatically. If the same lines in two repositories change, however, manual intervention becomes necessary.

“newrepo” and “newrepo.copy” should now be in a synchronized state, so in order to create a conflict, edit the same line of file1.c in both repositories and commit the changes (of course, you need to make different changes, otherwise there won’t be a conflict!). In “newrepo.copy”, now do:

    git fetch origin
    git pull origin master

git will stop the merge, notifying you of the conflict and telling you what to do. If you now edit file1.c, you will see that git has inserted some markers directly into the file. At any conflicting section you will find the lines:

    <<<<<<< HEAD
    code from newrepo.copy
    =======
    code from newrepo
    >>>>>>> c4dc807e54aa6f64d9e3d782468568d027b5278a

These indicate the two conflicting bits of code. The first, indicated by “HEAD”, is from the repository being pulled into (i.e. “newrepo.copy”) while the second, ended by a commit hash, is from the repository being pulled from (i.e. “newrepo”). You can now choose which code you would prefer or rewrite the section altogether, if that’s the appropriate response to the conflict. Finally, you remove the markers, including the ======= separator. All that’s left to do now is to commit the resolved file:

    git add file1.c
    git commit

As you can see, git has already prepared a commit message which you could expand on. It might be useful to mention which version you ended up choosing or whether you even chose to rewrite the section.

The final issue that we are going to work on is related to checking out other people’s work, for the purpose of code review for instance. For this purpose, change out of the repositories and create yet another clone of “newrepo”, let’s call it “newrepo.bob”.

    git clone newrepo newrepo.bob
    cd newrepo.bob
    git branch amazing_feature

Now switch to “newrepo.copy” and do the following:

    git remote add bob ../newrepo.bob   # add a new remote
    git fetch bob
    git remote -v                       # note that you now have a local cache of another
                                        # remote repository
    git branch -a
    git checkout -b bobs_amazing_feature bob/amazing_feature
    git branch -a

As you can see, you have now created a branch named “bobs_amazing_feature” which is based on the “amazing_feature” branch in the “bob” remote repository. You are now able to check Bob’s work.

NOTES:

• In addition to cloning local directories, git of course supports true remote repositories.

    # for read/write ssh access to a repository
    git clone [email protected]:/path/to/git/repository [localname]

    # for public read-only access
    git clone git://server.com/repositories/project
    git clone http[s]://server.com/repositories/project

  By default, git will clone into a directory with the same name as the repository (minus the .git suffix, if there is one). If you want a different directory, specify this via [localname]:

    git clone ssh://server.com/repositories/weirdprojectname.git myprojectdir

• If you know that there will be conflicts between your changes and the remote repository that you are preparing to pull from, you can also use git merge with the -X ours or -X theirs strategy option to resolve the conflicting sections automatically by preferring your own or the remote changes.
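Automatic conflict resolution can be sketched as follows. In the versions of git that I am aware of, the preference is passed to the merge as a strategy option, -X ours or -X theirs; for simplicity the example merges a local branch rather than a remote one, and the branch name theirs_branch, the file contents and the identity are made up.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master     # make sure the first branch is "master"
git config user.name "Jane Doe"             # hypothetical identity, local to this repo
git config user.email "jane@example.com"

echo "original line" > file1.c
git add file1.c
git commit -q -m "add file1.c"

git checkout -q -b theirs_branch            # stands in for the remote side
echo "their change" > file1.c
git commit -q -a -m "their change to file1.c"

git checkout -q master
echo "our change" > file1.c                 # same line, different content: a conflict
git commit -q -a -m "our change to file1.c"

git merge -q --no-edit -X theirs theirs_branch   # conflicting sections are taken from their side
cat file1.c                                      # their change
```

With -X ours the file would keep “our change” instead. Only conflicting sections are affected; non-conflicting changes from both sides are still merged as usual.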

2.2.10 Interacting with a Remote Repository - Pushing Local Changes to a Remote

Collaborative work in distributed revision control is organized in terms of pull requests for the simple reason that the repository owner should always retain full control over their repository. In fact, by default we cannot push to a branch in a remote repository if it is currently checked out (i.e. if it appears with a star when the repository owner issues the git branch command). To illustrate, let’s try the following (in “newrepo.copy”):

    vi file1.c
    git commit -a -m "important changes to file1.c from newrepo.copy"
    git push origin master

As you can see it produces an error message. This makes sense, as it would be a problem if the working directory of your repository could change while you’re working on it because someone else pushes changes to it. Note that if a team member has write access to your repository they CAN change other branches as well as create new ones. However, one should generally not push to someone else’s repository because it can mess up the branch structure. When a repository is set up as a group repository, it is made “bare”, i.e. without a working copy and hence without an “active branch”. We will see an example of how to do this in the advanced exercises. For now, make a clone of the repository in the “exercise 2.2.10” directory in the archive that you downloaded with these notes. This one is set up such that you can also push without errors. Change into the directory of your clone and try:

    vi file.c
    git add file.c
    git commit
    git push origin master

If someone else pushed changes which would be in conflict with yours, git will notify you of this and request that you do a “git pull” to synchronize and resolve the conflicts. After that you would be able to push without issues. You can also create new branches on the remote by pushing a branch from your clone:

    git checkout -b newbranch             # create new branch locally and switch to it
    git push origin newbranch:newbranch   # push branch to remote
    git branch -a
    git push origin :newbranch            # delete newbranch on remote
    git branch -a

Make a note of the syntax involving branch names and the : character. Also note that to delete a remote branch, you simply leave out the local branch in the notation involving :.
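If you do not have the exercise archive at hand, the local:remote notation can also be tried against a scratch bare repository. This is only a sketch under assumptions: shared.git stands in for the prepared exercise repository, and the remote branch name jane_newbranch, the file contents and the identity are made up.

```shell
set -e
cd "$(mktemp -d)"
git init -q --bare shared.git               # stand-in for the group repository
git clone -q shared.git work                # git warns that the repository is empty
cd work
git symbolic-ref HEAD refs/heads/master     # make sure the first branch is "master"
git config user.name "Jane Doe"             # hypothetical identity, local to this repo
git config user.email "jane@example.com"

echo x > file.c
git add file.c
git commit -q -m "add file.c"
git push -q origin master

git checkout -q -b newbranch
git push -q origin newbranch:jane_newbranch   # local name on the left, remote name on the right
git branch -r                                 # origin/jane_newbranch and origin/master

git push -q origin :jane_newbranch            # empty left-hand side: delete the remote branch
git branch -r                                 # only origin/master is left
```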

NOTES:

• When you don’t specify which remote branch to pull from or which branch to push to, git follows a default behaviour involving values in the configuration in the .git directory. To be sure what you’re doing, it makes sense to always specify branch names. Remember that auto-completion via [TAB] is your friend.

• The local and remote branch names do not have to be the same. In fact, in a shared repository it might make sense to add some personal identifier to branches that you create.

This concludes the basic section of the exercises. You should now be able to interact with most git repositories and collaborate in software projects.

2.3 Advanced Exercises

In the following few exercises we’re going to take a look at how to solve specific problems using git that will hopefully be useful to you in practice.

2.3.1 The .gitignore File

In some situations it might be important to tell git which files it should NOT put under revision control. This would be the case with temporary files or binary files, for instance. For this purpose you can create a .gitignore file which lists, one per line, filenames or patterns (such as *.dat) that git should never add to the repository. The format also supports negation by prefixing a pattern with the ! character. You can have multiple .gitignore files in the different directories of your repository. If that’s the case, any conflicting rules that you set in a subdirectory override the rules from the parent directory.

2.3.2 Preparing a git Repository for Writing a Paper (or your Thesis)

To make the information from the previous section more concrete, let’s set up a git repository for writing a paper in LaTeX. It makes a lot of sense to write papers under revision control: not only can you compare different versions, you also don’t have to deal with “sending it around” only to be confused about who has added what, when and why, and which version is currently the most recent. As you know, upon compilation with LaTeX, a few temporary files are created in addition to the output .dvi, .ps or .pdf. Neither the temporary files nor the binary output files should be put under revision control. On the other hand, keeping figures under revision control makes perfect sense and git has good support for binaries through compression algorithms. We begin by initializing a directory for the paper:

    mkdir paper
    cd paper
    git init
    vi .gitignore

In the .gitignore file we now put something like:

    *.log
    *.aux
    *.out
    *.backup
    *.fls
    *.blg
    *.bbl
    *.fdb_latexmk
    *.pdf
    *.dvi
    *.ps
    *~

However, this is not enough: what if we want to include plots in .pdf format? For this reason we will store all graphics in an extra directory and adjust the ignore rules for this directory accordingly:

    mkdir graphics
    cd graphics
    vi .gitignore

with these contents:

    !*.pdf
    !*.ps

In this subdirectory, the rules for .ps and .pdf files are overridden and we tell git to allow them to be put under revision control here. Now we just need to finalize with the initial commit:

    cd ..
    git add .gitignore graphics/.gitignore
    git commit -m "initial commit: set up .gitignore for LaTeX"
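Whether the two .gitignore files interact as intended can be checked by letting git add everything it is willing to add. A minimal sketch with a shortened ignore list; the file names paper.pdf, paper.aux and graphics/plot.pdf are made up for illustration.

```shell
set -e
cd "$(mktemp -d)"
git init -q paper
cd paper

# shortened version of the top-level ignore rules
printf '%s\n' '*.log' '*.aux' '*.pdf' '*.ps' > .gitignore

mkdir graphics
# in graphics/, .pdf and .ps files are allowed again
printf '%s\n' '!*.pdf' '!*.ps' > graphics/.gitignore

touch paper.pdf paper.aux graphics/plot.pdf

git add .        # stages everything that is not ignored
git ls-files     # lists the two .gitignore files and graphics/plot.pdf
```

paper.pdf and paper.aux do not show up in the list, while graphics/plot.pdf does.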

2.3.3 Creating “bare” git Repositories

As described in the basic exercises, pushing to the active branch of a git repository is not allowed for good reasons. But if you’re sharing a given repository, say on a group server, it makes no sense for there to even be an active branch or a working copy of the source code. A repository consisting only of the contents of the .git directory is called “bare” and this is the kind of thing that you would put on a shared network filesystem or an SSH server.

As an empty repository: If you know that you will share a repository from the beginning, you can create it as bare from the start, either directly on the server or locally, copying it over afterwards. This also applies if you want to create a repository from existing but untracked files. You start by running (in this case you can do it locally):

    git init --bare sharedrepo.git

where the .git suffix is a convention to mark that this is a bare repository. Let’s say you created this on a server and would now like to populate it with files. To do so, you first need to clone a working copy of the repository. Note that git will warn you that you are trying to clone an empty repository:

    git clone sharedrepo my_sharedrepo
    # or if the repository is on an SSH server
    git clone [email protected]:/home/user/repositories/sharedrepo my_sharedrepo

Note how git will find this repository even though you did not specify the suffix. Once you have a working copy, change into it and start adding files. Commit them as usual and finish by pushing.

    cd my_sharedrepo
    touch file1.c file2.c   # or copy over some files from an untracked source tree
    git add *.c
    git commit -m "initial commit: initialized files"
    git push origin master

As you can see, git has now created a new master branch in the shared repository. After the initial commit, this can now be used as a remote by others.

From an existing “non-bare” repository: You will probably often find yourself in the situation that you’ve developed something and would like other people to contribute to it in a more direct manner than through pull requests. In this case you can create a bare repository from your existing one. Let’s go back to “newrepo” from the basic section. From the parent directory we can clone it with a special flag:

    git clone --bare newrepo newrepo_bare.git

If you take a look at this new clone, you will see that it doesn’t contain a working tree and you can upload this directly to your group server for the purpose of collaborative development.

2.3.4 git stash

In one of the notes in the basic exercises it was mentioned that switching between branches if you have uncommitted changes is not possible if the changes in the current branch and the branch that you are switching to would conflict. Sometimes it might be necessary to switch branches but maybe you are unwilling to commit your changes in their current state. For this and other reasons git offers yet another way to deal with your changes, the so-called stash. This is an unintelligent storage area which will save all the changes that you’ve made to the current branch since the last commit. Let’s make a change to “newrepo” to see how it works:

    cd newrepo
    vim file1.c
    git status
    git stash save "work in progress in file1.c"
    git status       # ah, the changes are gone, we can switch to another branch now
                     # but where are they?
    git stash list   # here they are
    git stash pop
    git status       # and now they’re back and the stash is empty
    git stash list
    # we can now finish them off and commit
    vi file1.c
    git commit -a -m "some more changes to file1.c"

NOTES:

• The stash can be very useful at times but it is somewhat confusing, so it should be used with caution.

• The pop command that we used above applies the topmost stash element to the working copy and then deletes the stash item, so use it with care.

• If you stash another set of changes, it will be added to the list of stashes. Remember that pop always applies the topmost one.

• To apply a stash item without deleting it, use git stash apply for the topmost one and git stash apply stash@{N} for the N’th one.

• To delete a stash item, use git stash drop, which will delete the topmost item without applying it. To delete a particular stash item, specify it as above.
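The notes above can be tried out safely in a scratch repository. This is only a sketch: the repository name “stash_demo”, the file content and the identity values are made up.

```shell
set -e
# throw-away identity, the stash creates commits internally (hypothetical values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q stash_demo
cd stash_demo
echo "line 1" > file1.c
git add file1.c
git commit -q -m "initial commit"

echo "first experiment" >> file1.c
git stash save "first experiment"     # becomes stash@{0}
echo "second experiment" >> file1.c
git stash save "second experiment"    # now stash@{0}, the first one moves to stash@{1}
git stash list

git stash apply 'stash@{1}'   # apply the older item but keep it in the list
git stash list                # still two items
git checkout -- file1.c       # throw the applied change away again
git stash drop 'stash@{1}'    # delete the older item without applying it
git stash pop                 # apply the remaining item and delete it
remaining=$(git stash list | wc -l | tr -d ' ')
cat file1.c
```

At the end the stash is empty and only the second experiment has made it back into the working copy.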

2.3.5 Staging Partial Changes

In some situations it might be desirable to commit several changes to the same file separately. git will allow you to do this with some effort. Since the interface for this functionality is quite opaque, you might want to skip it on a first reading. Beginning again in “newrepo”, let’s make some changes to two lines of our favourite file:

    vi file1.c
    git add -p

Git will show you a partial diff between the current version of the file and the one in the last commit (in this case it will probably be a full diff because I presume file1.c is quite small). Git refers to each block of changes to be staged as a “hunk”. By typing ? we can have git display some help. We would like to stage only the first line that we changed, so unless that is already what git has selected for us, we type e for “edit”. Git will open our editor, showing something like this:

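     context
    -line1 original
    +line1 new
     context
     context
    -line2 original
    +line2 new

where the lines with a minus sign would not appear if you added new lines rather than changing them. Now since we want to stage only the changes to line1, we comment out the changes for line2 by inserting # characters at the beginning of those two lines. (If git rejects the edited hunk, follow the instructions shown in the editor instead: turn the - line into a context line by replacing the minus sign with a space and delete the + line.)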

     context
    -line1 original
    +line1 new
     context
     context
    #-line2 original
    #+line2 new

We save and quit and git will have staged a partial change.

    git status
    git commit -m "partial change to file1.c"

We can now simply add and commit the remaining change or use git add -p again if we have more than two hunks.

2.3.6 Using git for Debugging

Imagine you run your software on machines that cost hundreds of millions of Euros and where every hour of computer time is worth several thousand Euros of taxpayer money. Imagine now that after a year of doing so, you discover a potentially disastrous bug in your code which seriously puts your results in question, although it was subtle enough to have gone undiscovered so far. Consider further that you notice that some old results are seemingly unaffected by this bug; this means it must have been introduced somewhere between the production of those results and the present moment. Since you have identified the bug and have an idea of a date when the bug did not exist in your code, you can now go back in history to pinpoint exactly when it was introduced in order to determine which of your results should be recomputed.

advanced usage of git log: We have met the path separator -- before. When applied to git log, it will list all commits that have touched the specified file(s), which can be very useful when trying to determine when a certain line was changed.

    git log -- file1.c
    git log -p -- file1.c # log with diffs, ’-p’ for ’patch’

In addition to showing the commit history for a single file, please take some time to study the documentation for git log by looking at the manual page: man git log. Try out the various formatting features, some of which include helpful commit trees which will make it easier to understand how a complicated code was put together.

    git log --graph --oneline --all

In the case of “newrepo”, try to see the representation of when we created “newbranch” and merged it into the master branch.
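When you are hunting for the commit that changed one particular line, the -S option of git log can narrow the list further: it only shows commits which added or removed the given string. A small sketch in a scratch repository follows; the repository name “log_demo”, the file content and the identity values are made up.

```shell
set -e
# throw-away identity for the demo commits (hypothetical values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q log_demo
cd log_demo

echo "double tolerance = 1e-6;" > file1.c
git add file1.c
git commit -q -m "add tolerance"
echo "int iterations = 100;" >> file1.c
git commit -q -a -m "add iteration count"
sed 's/1e-6/1e-3/' file1.c > file1.tmp
mv file1.tmp file1.c
git commit -q -a -m "loosen tolerance"

# all commits that touched file1.c, newest first
git log --oneline -- file1.c
# only the commits that added or removed the string "1e-3"
git log --oneline -S"1e-3" -- file1.c
found=$(git log --format=%s -S"1e-3" -- file1.c)
```

Of the three commits only “loosen tolerance” is listed by the second command.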

graphical user interfaces: There is a very useful graphical user interface for git called gitk. If it is not installed, you can find it in the repositories of your Linux distribution and, if you are using Mac OS or Windows, wherever you found git to begin with. When you call it in a repository, it will give you a graphical representation of the commit history including a nice tree graph. Although the interface is a bit archaic, once you understand it, you can search for commits touching specific paths or even adding or removing specific words or lines in your code. If you have it installed or can install it, try it out!

git bisect: Let us now assume that you don’t know exactly where the bug is, but you have a way of detecting whether your code is affected by the bug, say with some simple (or not so simple) test-case. As in the previous situation, we may know that the code worked at some given point in time, so we can use that time as a starting point for our search. Git offers a very useful tool called git bisect, which will help us with finding the point at which the bug was introduced. It does so by automatically going back N/2 commits in time, where N is the number of commits since the starting point that we give to git (or since the first commit if we don’t give a starting point). There, we can run our test-case and see if the code is affected. If it is, the bug was introduced earlier; if it isn’t, it must have entered later. We tell git whether we need to go into the past or into the future by telling it whether the test-case failed or worked, and it will then go back or advance by N/4 commits. As you can imagine, a few iterations of this strategy (with ever decreasing moves into the past or future) will allow us to pinpoint where the bug has been introduced. To do this exercise, look into the repository in the “exercise 2.3.6” directory and proceed as follows.
First we clone the repository so that we can recover more easily if we get one of the steps wrong. Then we take a look at the log; as you can see, the author of this particular piece of software ignored the crucial advice of writing good commit messages and destroyed any hope of figuring out what the bug is from the log alone...

    git clone exercise_2.3.6 my_ex_2.3.6
    cd my_ex_2.3.6
    git log
    ./test_case.sh
    git checkout -b find_bug # let’s work in a new branch
                             # to make things less messy
    git bisect start
    git bisect bad # tell git that the current version is broken
    # we know that the code still worked at the point specified by the commit
    # below, so we tell git to start looking from there
    git bisect good 08164edd49d2c88d0808fd73481596403923fa0e
    ./test_case.sh

And now we proceed cyclically. We run the test case to see whether the bug is present or not. If it isn’t, we say git bisect good, if the bug is present we say git bisect bad. In the end, git will output the first bad commit:

    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx is the first bad commit

After this, we can go back to the starting situation and then inspect the changes made by that commit to understand where the bug is and how it was introduced:

    git bisect reset
    git show xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Subsequently you will be able to fix it. Also, you will know exactly how much money you wasted by running with a broken version of the code and which of your results need to be retracted (or checked, at the very least).
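If the test-case reports its result through its exit code, the cycle of marking commits good or bad can be left to git itself with git bisect run. The sketch below builds a scratch repository in which a “bug” (the word BROKEN in code.txt) appears with the sixth of eight commits; the repository name “bisect_demo”, the file, the grep-based test and the identity values are all made up and have nothing to do with the exercise repository.

```shell
set -e
# throw-away identity for the demo commits (hypothetical values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q bisect_demo
cd bisect_demo

# eight commits, the "bug" sneaks in with commit 6
for i in 1 2 3 4 5 6 7 8; do
    echo "change $i" >> code.txt
    if [ "$i" -eq 6 ]; then echo "BROKEN" >> code.txt; fi
    git add code.txt
    git commit -q -m "commit $i"
    if [ "$i" -eq 6 ]; then culprit=$(git rev-parse HEAD); fi
done
first_commit=$(git rev-list --max-parents=0 HEAD)

git bisect start
git bisect bad                    # the current version is broken
git bisect good "$first_commit"   # the very first commit still worked
# the command must exit with 0 for a good commit and with 1 for a bad one
git bisect run sh -c '! grep -q BROKEN code.txt'
found=$(git rev-parse refs/bisect/bad)   # the first bad commit
git bisect reset
git show --stat "$found"
```

With eight commits git needs only about three test runs to arrive at “commit 6”.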

2.3.7 Dealing with and Reverting Merges

Given the advanced nature of this problem and the existence of an excellent article on the subject, this is less of an exercise than a reading assignment. It would be good for your understanding to go to http://git-scm.com/blog/2010/03/02/undoing-merges.html and read the article. The notation in the diagrams there is reversed from the point of view of these notes: the arrow always points backwards in time at the parent commit and time runs left to right. As you can see, even though working with many different branches has very clear benefits, it can also sometimes complicate your life.

2.3.8 Rewriting History

This section mainly discusses the various uses of the rebase command. You should be very careful when using it because it changes the actual revision history and the various commit hashes. As a result, if you use it on a branch which has already been shared with someone else, you will break their clone of the branch in the sense that they won’t be able to pull from you anymore (because git will not see common reference commits). Using this command can make other people very angry if they’ve based something on your work, so make sure to discuss it first. In principle, except for the first two, there is almost never a need to use these commands. However, you might sometimes find yourself in a situation where you realize that your development history looks needlessly complicated and makes it difficult to follow the structure behind your thinking. Since the revision history is part of your output as a software developer and since you can very safely assume that it will at some point be useful either to you, your colleagues or your successor, you should put some thought into structuring it well and invest some effort into making it comprehensible.

Amending the last Commit: If you haven’t shared a certain piece of code yet and you realize that you’ve either forgotten something in your last commit or you’re unhappy with your commit message, you can adjust it by making changes to some files, calling git add on them and committing the changes via git commit --amend. Just issuing git commit --amend without staging anything will allow you to edit the commit message right away. Just as for the following commands, it is important to understand that this command will change the commit hash, the commit message (if you changed it) and the commit content (if you added any additional changes) and break other people’s clones of this particular branch if they have already pulled the previous version of the commit from you.
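The change of the commit hash is easy to see in a scratch repository. In this sketch the repository name “amend_demo”, the files and the identity values are made up.

```shell
set -e
# throw-away identity for the demo commits (hypothetical values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q amend_demo
cd amend_demo

echo "int x;" > file1.c
echo "int y;" > file2.c
git add file1.c
git commit -q -m "add file1.c and file2.c"   # oops, file2.c was forgotten
before=$(git rev-parse HEAD)

git add file2.c
git commit -q --amend -m "add file1.c and file2.c"
after=$(git rev-parse HEAD)

echo "before: $before"
echo "after:  $after"
git log --oneline   # still a single commit, but under a new hash
```

Anyone who had already pulled the commit under its old hash would now have a history which no longer matches yours.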

Rebasing on top of Another Branch: When you are working on some new feature and it takes a while, it is recommended to regularly follow the developments in the central repository if you’re not working alone. As we saw above, one way of doing so is via git pull, which uses git merge under the hood to merge the branch on the server with your feature branch. This ensures that when the time comes to request your feature to be tested and pulled in, there is no or minimal work to resolve any conflicts. However, these regular merges have the drawback that they will leave lots of merge commits in the revision history, unless they are fast-forward merges.

There exists a method for doing away with these merges and making it appear as if you were always basing your work on the latest and greatest commit from the central repository: git rebase. This command initially compares the commit histories of your branch and the branch that you are trying to rebase onto. It attempts to find a point shared by the two branches, then it reverts all the changes that you made from that point onwards, applies all the changes that have happened in the central repository since and then attempts to apply all your changes to this new base. If there are no conflicts, it will now appear as if you were developing on the central repository without ever having to do a merge. If there are conflicts, you will need to resolve them commit by commit with the aid of the rebase tool. This can sometimes be a bit annoying because your development is applied commit by commit and so certain conflicts may occur multiple times.

To test this out, let’s make two clones of “newrepo”:

    git clone newrepo newrepo.rebase1
    git clone newrepo newrepo.rebase2

Now go to the newrepo directory and make two different changes to the same file and commit them. In the “newrepo.rebase1” directory, make two changes to a different file and commit them. Finally, in “newrepo.rebase2”, make changes to the same file as in the first step, but make different changes and commit them. Now we are going to see how rebase works.

    cd newrepo.rebase1
    git log # make a note of the commit hashes
    git fetch origin
    git log origin/master
    git rebase origin/master
    git log # now the commit hashes have changed and are on top of the
            # commits from origin/master

Since you didn’t have any conflicts, the rebase algorithm was able to change history automatically. Now we will repeat the same exercise with “newrepo.rebase2”. As you see, the rebase will fail and you will be asked to fix the problem, call git add on the file and then call git rebase --continue. You should now understand why rebasing should never be done with a branch that has already been shared with others. Since the commit hashes change (and possibly their content if there are conflicts), the revision history that has been shared and the rebased commit history are not compatible anymore.
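For reference, here is the conflict-free case as a script which sets up everything from scratch. It is only a sketch: the “newrepo” created here is a stand-in with made-up content, the branch is forced to be called master so that the names match the text, and the identity values are hypothetical.

```shell
set -e
# throw-away identity for the demo commits (hypothetical values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"

git init -q newrepo
cd newrepo
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
echo "a" > file1.c
echo "b" > file2.c
git add file1.c file2.c
git commit -q -m "initial commit"
cd ..
git clone -q newrepo newrepo.rebase1

# work in the "central" repository touches file1.c
cd newrepo
echo "upstream change" >> file1.c
git commit -q -a -m "upstream: change file1.c"
cd ..

# local work in the clone touches a different file
cd newrepo.rebase1
echo "local change" >> file2.c
git commit -q -a -m "local: change file2.c"
old=$(git rev-parse HEAD)

git fetch -q origin
git rebase origin/master
new=$(git rev-parse HEAD)
git log --oneline --graph   # linear history, local commit on top
echo "old hash: $old"
echo "new hash: $new"
```

The local commit keeps its message and its changes but gets a new hash, because its parent is now the latest commit of origin/master.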

Interactive Rebase - Dropping or Merging Commits: The interactive mode of the rebase command is useful in three situations: if you are unsatisfied with some of the past commit messages, if you realize that some changes that you made in some past commit were completely wrong and you would like to remove them without leaving “revert” commits in the history, or if you realize that some of your commits should be merged to form logically connected contributions. Let’s try this out by first cloning “newrepo” again, say into “newrepo.rebase_int”. Then we choose a commit from the log up to which we want to change history, let’s say the fifth.

    cd newrepo.rebase_int
    git log > before_rebase.log
    git rebase -i HEAD~5

Here we also meet a new notation: HEAD is a short name for the latest commit on the current branch, while the ~ character followed by a number N allows you to reference the commit N steps in the past of whatever is written in front of it. This line means: do an interactive rebase 5 commits into the past. Git will open your editor with a list of commits and instructions. You can issue commands for the various commits by changing the first word in each line to one of the commands (or their one-letter abbreviations) shown in the instructions. Try this out by editing one commit and squashing another and then take a look at the difference between the commit logs:

    git log > after_rebase.log
    # at this point, resize your terminal window to be at least
    # 120 characters wide and look at a side-by-side diff
    diff -W 120 -y before_rebase.log after_rebase.log
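To see what an edited instruction list does without typing into an editor, the list can also be changed by a small script. In the sketch below the environment variable GIT_SEQUENCE_EDITOR makes git hand the list to sed instead of your editor; sed then turns the second line from “pick” into “fixup”, which melds that commit into the one before it and discards its message. The repository name “rebase_int_demo”, the commits and the identity values are made up.

```shell
set -e
# throw-away identity for the demo commits (hypothetical values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q rebase_int_demo
cd rebase_int_demo

for i in 1 2 3 4; do
    echo "change $i" >> file1.c
    git add file1.c
    git commit -q -m "commit $i"
done
git log --oneline > before_rebase.log

# The list shows the last three commits, oldest first. Changing "pick" to
# "fixup" on the second line melds "commit 3" into "commit 2".
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2s/^pick/fixup/'" git rebase -i HEAD~3

git log --oneline > after_rebase.log
cat before_rebase.log
cat after_rebase.log
```

Afterwards the history consists of three commits, and “commit 2” contains the changes of the former “commit 3” as well.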

2.3.9 Cherry-picking one or more Commits

Imagine you are working on a software project which follows the principles described in this document. The developers take care to make small, logically connected changes to the source code and construct well-documented commits when introducing new features or fixing bugs. Imagine further that you are in charge of maintaining an older version of the software, the development of which has been discontinued but which is still in active use. Let’s assume that a bug is found and a fix is developed based on the most current version of the software. The bug is sufficiently isolated to be fixable in one or a few well localized commits (this will often be the case for modular codebases). Because you are in charge of maintenance, you are asked to backport the fix to the old version.

Since the fix has been developed based on a much newer source-tree, merging with this branch would pull in a lot of changes that you don’t want to have in the old version. Extracting just this one commit is called “cherry-picking”. To do this exercise, look at the repository in “exercise 2.3.9”. It contains a code which has a master branch representing “current” development and a “v1.0” branch for maintaining the old version of the code. On the master branch, the second to last commit is an important bugfix that you would like to propagate to version 1.0. Start by looking up the commit hash of this commit and copy it into your clipboard (or make a note of it in some other way). Now switch to the “v1.0” branch (remember, use git checkout) and issue the following command:

    git cherry-pick yyyyyyyyyyyyyyyyyyyyyyyyyyyyy

As you will see, git will extract the commit from the master branch and immediately apply it to your source-tree.

NOTES:

• In practice this has the potential of being complicated by conflicts or incompatibilities in the file structure between the two commit histories, but it often works and is worth a try. Take a look at the manual page for the command for more information.

• In a real situation you would start by creating a work in progress branch from the branch “v1.0”, do the cherry-picking in that branch and, if you succeed, merge this into “v1.0”.
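The whole procedure, including the work in progress branch from the last note, might look as follows in a scratch repository. This is a sketch and not the content of the exercise directory: the repository name “cherry_demo”, the files, the branch name “v1.0_backport” and the identity values are made up.

```shell
set -e
# throw-away identity for the demo commits (hypothetical values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q cherry_demo
cd cherry_demo
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master

echo "feature A" > features.txt
echo "result = 1 + 1 + 1;" > core.c       # the bug: one term too many
git add features.txt core.c
git commit -q -m "version 1.0"
git branch v1.0                           # maintenance branch for the old version

# development continues on master
echo "feature B" >> features.txt
git commit -q -a -m "add feature B"
echo "result = 1 + 1;" > core.c
git commit -q -a -m "fix addition bug in core.c"
fix=$(git rev-parse HEAD)
echo "feature C" >> features.txt
git commit -q -a -m "add feature C"

# backport only the bugfix, on a work in progress branch
git checkout -q -b v1.0_backport v1.0
git cherry-pick "$fix"
git checkout -q v1.0
git merge -q v1.0_backport                # a fast-forward merge
git log --oneline
```

The “v1.0” branch now contains the fix to core.c, while features B and C have stayed on master.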

2.3.10 Referencing the Commit Hash in a Program

You are hopefully convinced at this point that a revision control system can be very helpful in debugging a program and seeing which of your results have been affected by some problem. There is an issue with this which we have ignored up to now, however. Just because you know that you had a version of your code which was correct at some point in time doesn’t mean that you also know which version this was. At most you might know that it was a particular version number, say “4.2”, but you’re unlikely to know which branch it was compiled from and what was the latest commit in this branch.

Luckily, with some effort, you can actually reference the commit hash in your programs. Unfortunately, this is a situation where Subversion does a much better job because it includes native support for including revision numbers into release code. In the case of commit hashes, we need to work a bit harder and an example of this is shown in the “exercise 2.3.10” directory. The extra effort consists of having to run git from your Makefile, extracting the commit hash and somehow including it in your code. In the code snippet that you will find in the above-mentioned directory, this is achieved through the GIT-VERSION-GEN script which has been adapted from the source distribution of git. Take a look at “test.c” as well as the Makefile. If you’re curious, look also at the logic in GIT-VERSION-GEN. Now compile the “test” executable (you will need gcc) and try it out:

    make
    ls # git_hash.h has appeared, take a look!
    ./test

Now make a change to the repository by adding a file and committing it and try again:

    make clean
    make
    ./test

As you can see, we now have a program which embeds the most recent commit hash every time it is compiled. If we now include this also in the meta-information of our results, we will always know (almost) exactly which version of our program a given result has been created with. Note that it is important that the “git_hash.h” target is “PHONY” so that it is executed every time you call make, as it would be quite difficult to inform make about the fact that the commit hash has changed.
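The core of such a step fits into a few lines of shell. The sketch below is not the GIT-VERSION-GEN script from the exercise but a much reduced stand-in: it writes the current commit hash into a header as a C string constant. The macro name GIT_HASH, the repository name “hash_demo” and the identity values are made up.

```shell
set -e
# throw-away identity for the demo commits (hypothetical values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q hash_demo
cd hash_demo
echo "int main(void) { return 0; }" > test.c
git add test.c
git commit -q -m "initial commit"

# write the hash of the current commit into a header file;
# GIT_HASH is a hypothetical macro name
hash=$(git rev-parse HEAD)
printf '#ifndef GIT_HASH\n#define GIT_HASH "%s"\n#endif\n' "$hash" > git_hash.h
cat git_hash.h
```

A program which includes this header can print GIT_HASH next to its results. The hash says nothing about uncommitted changes in the working copy, which is one reason for the “almost” above.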

2.3.11 Using tags

Commit hashes are somewhat annoying objects to be working with. They are of course crucial for the functioning of distributed version control, but they would be a nuisance if we had to refer to a release version of our software as “ce0f0a673a39d1bddcdf42dc6aaa69addf1647f6”. Luckily the “tag” mechanism allows you to construct an alias which refers to a particular commit. In “newrepo”, do:

    git log # make a note of the newest commit hash
    git tag v1.0
    git rev-parse v1.0 # look up which commit the tag "v1.0"
                       # points at

NOTES:

• There is another important feature to tags: PGP signing. If you use cryptographic e-mail signatures and e-mail encryption (and you should), you can “sign” a commit or tag cryptographically, thereby certifying that you really are the originator of this commit and making it very difficult for the NSA to impersonate you.
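Besides the simple tag used above, git also knows annotated tags, which carry their own message, author and date and which are the kind of tag that can be signed (git tag -s, which requires a PGP key and is therefore left out here). The sketch below creates an annotated tag in a scratch repository and then uses git describe to name a later commit relative to it; the repository name “tag_demo”, the file and the identity values are made up.

```shell
set -e
# throw-away identity for the demo commits and the tag (hypothetical values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q tag_demo
cd tag_demo

echo "release" > file1.c
git add file1.c
git commit -q -m "release commit"
release=$(git rev-parse HEAD)
git tag -a v1.0 -m "version 1.0"   # annotated tag with its own message

echo "more work" >> file1.c
git commit -q -a -m "work after the release"

# an annotated tag is an object of its own, so we ask explicitly
# for the commit that it points at
tagged=$(git rev-parse 'v1.0^{commit}')
# name the current commit relative to the most recent annotated tag
described=$(git describe)
echo "v1.0 points at $tagged"
echo "current version: $described"
```

The output of git describe has the form v1.0-1-g followed by an abbreviated hash, meaning one commit after “v1.0”. This is more readable than a bare hash when it is embedded in a program or in result files.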

2.3.12 Converting a Subversion Repository

This is an extended topic which has been documented quite well externally and so there is no need to do it again. You might find these documents very useful:

• http://jmoses.co/2014/03/21/moving-from-svn-to-git.html - note especially the links regarding problems with Subversion; the description of the process itself is not the best

• http://john.albin.net/git/convert-subversion-to-git - the description here is complete and includes the caveats that might trip you up

3 Conclusion

With this we conclude the course. I hope that you will be able to apply the knowledge learned in practice and that it will prove useful to you. Keep in mind that git is quite a complicated piece of software, but this also makes it very versatile and adaptable. It is well documented and the manpages are worth a read. If you know which tool you are interested in, the command man git toolname² will give you the manpage. For more general and more advanced information, check out the references below.

4 References

ProGit: http://git-scm.com/book - this is an extensive book which hits upon almost all use-cases that you might run into. It’s available for free online or in a nice printed version.

Git - SVN Crashcourse: http://git-scm.com/course/svn.html - this is a very nice side-by-side comparison of common commands in svn and git. Highly recommended if you’re used to svn because so many commands have similar or the same names but behave completely differently.

² Some older man versions might require you to write “git-toolname” instead.
