
MERGE COMMIT CONTRIBUTIONS IN GIT REPOSITORIES

A Thesis

Presented to

The Graduate Faculty of The University of Akron

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

Drew T. Guarnera

August, 2015

MERGE COMMIT CONTRIBUTIONS IN GIT REPOSITORIES

Drew T. Guarnera

Thesis

Approved:

Advisor: Dr. Michael L. Collard

Faculty Reader: Dr. Kathy J. Liszka

Faculty Reader: Dr. Chien-Chung Chan

Interim Department Chair: Dr. David N. Steer

Accepted:

Dean of the College: Dr. Chand Midha

Interim Dean of the Graduate School: Dr. Chand Midha

ABSTRACT

Git’s non-linear historical graph is analyzed for the presence of merge commits and the contribution of merge conflicts to the source code. Git’s distributed model encourages parallel development by providing lightweight branching. This creates a need to integrate new changes from separate branches into main development branches using Git’s merge facilities. It is hypothesized that the role of merge commits is primarily that of historical unifier and that, as such, they do not regularly contribute new code changes in the commits they generate. To this end, new metrics were defined using differencing data associated with merge commits to formally analyze code contributions of merge commits. An empirical study of 23 open-source repositories was performed and a tool, Grim, was developed to collect the necessary historical and commit metadata from each repository. It was found that merge commits represent less than 20% of the overall history of a project for a majority of the repositories, and merge conflicts appear even less frequently at around 10%. Merges with conflicts make up around 20% of merges for most of the repositories, and the source code contributions made by merge commits most often involve a reorganization of existing changes and not the addition of new source code changes.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER

I. INTRODUCTION

II. GIT

Structure

Branching

Merge Operations

III. ROLE OF MERGE COMMITS

IV. GRIM

V. EMPIRICAL STUDY

Data Collection

Custom Difference Based Metrics

New Additions

New Removals

New Edits

Selected Edits

Data Analysis

VI. THREATS TO VALIDITY

VII. RELATED WORK

VIII. CONCLUSION AND FUTURE WORK

BIBLIOGRAPHY

LIST OF TABLES

Table

1. Details about each system used in the empirical study

2. Grim runtime performance benchmarks

3. Presence of merge commits and commits with conflict in each repository

4. Average number of source changes metrics across all repositories

LIST OF FIGURES

Figure

1. Branch merging and resulting commit state transitions

2. Example output from ‘git show’ command

CHAPTER I

INTRODUCTION

Version control is a critical tool in the development of software. It affords developers the ability to work with their code in a sandbox-like environment, thus allowing the flexibility to experiment with modifications before finalizing these changes to the repository, i.e., codebase. In the event that issues should arise, version control supports rolling back changes to revert a codebase to a previous state. While the aforementioned features alone are enough to consider version control valuable, these tools also maintain the entire evolutionary history of each individual commit, i.e., change to a repository. This historical record and its associated metadata have been used by many researchers in the field of Mining Software Repositories (MSR) [1] [2] to wide effect for the purpose of gaining insight into software projects and general development trends.

While a variety of version-control systems exist and vary based on how the repository is managed, they break down into two major categories. Until recently, the most popular version-control systems were centralized version control systems (CVCS), such as CVS and Subversion, which represent the subject of the largest body of work in the field of MSR. These centralized systems rely on a dedicated software repository hosted where all contributors have access. Because nearly all actions with a CVCS require communication with the server hosting the main repository, committing changes is an atomic operation. Thus, there is no natural facility in place to work and save changes safely within the version-control system locally on a development machine without committing them to the main repository. This creates a unique problem for users of this type of system as they must either ensure all work is completed before committing their changes to the repository, or commit incomplete work to the repository. Many tend to opt for the former choice, but this can prevent developers from committing small, incremental changes with less disruption to the software system, and leaves them at the mercy of the undo features of their source-code editor in the event that the changes yield negative side effects.

More modern version control systems such as Git are distributed version control systems (DVCS). The DVCS model has most of the same features of CVCS and can even have a central repository for the consolidation of all commits to a project. The feature that separates DVCS from CVCS is that each local copy a developer uses is a complete copy of the repository, including all historical and metadata information. While the most obvious benefit of this feature is redundancy for backups, it also introduces a greater level of development flexibility for software engineers. With DVCS, developers can clone (i.e., copy) a repository to their development machine and make commits to their local copy without altering a central repository or any other developer’s repository. This allows developers to work on a feature or fix and build up a series of commits locally and, upon completion of their task, push (i.e., transfer) all of these commits to another repository. This ability to commit locally, request any historical commit logs, revert changes, and use many other features without the need to communicate with a remote central repository also makes many DVCS operations significantly faster than their CVCS counterparts. The benefits of the DVCS approach have not gone unnoticed by the development community [3], and as such many projects have moved to a DVCS tool, especially in the open-source community.

Git in particular has become the DVCS tool of choice for many open source projects, from small development groups all the way up to big companies like Facebook, Microsoft, Google, and Netflix [4]. Git is also the core of one of the largest software repository hosting services, Github, which reported 3.5 million users and housed 6 million Git-based repositories as of April 10, 2013 [5]. While the popularity of Git continues to rise, it has not yet received the same level of attention by the research community for mining and exploration as previous version-control systems such as Subversion or CVS.

Git’s distributed model encourages non-linear commit activity by providing lightweight parallel development paths with branching. Work is done on branches for a variety of reasons: bug fixes, refactorings, new features, or other development activities. Branches create a divergent historical path, and eventually the commits that represent these development efforts will need to be integrated into other branches, such as a stable “master” or “release” branch. Merges in Git are the way developers can combine their work from disparate branches and unify the parallel histories.

Merges may occur once for a branch to integrate bug fixes into the mainline of development, at which time the branch is deleted or left unused. Merges may also occur multiple times in the history of a development branch into a long-running master or release branch. Many merges in Git can be performed automatically by the internal Git merge tool. The ability to merge allows developers to organize changes to the code base, and is a primary reason that parallel development is so well supported by Git. However, overlapping parallel development can produce a merge conflict, which must be resolved manually by the developer.

While these branch and merge features are helpful for developer productivity, they also introduce complexity to the historical interpretation of data when mining Git repositories. This work intends to explore the structure of Git commit history, focusing on a better understanding of merge commits, including their presence and their contributions to the historical commit data within Git repositories. In this study, twenty-three repositories from Github were mined for data using a custom application, Grim, and each repository was examined for the presence of merge commits. The number of total merges is collected as well as the number of merges that cause merge conflicts (i.e., a state in which user intervention is required to complete a merge). Each merge commit found to introduce a conflict was then further analyzed. As part of this analysis, four custom metrics based on differencing data (new additions, new removals, new edits, and selected edits) are presented and formalized to quantify the merge commit contributions. The metrics, in combination with other Git commit metadata and the historical structure of the repositories, are used to answer the following research questions this work proposes:

• RQ1: How much of the history of a project are merge commits?

• RQ2: Do merge conflicts occur frequently?

• RQ3: Do merge conflicts introduce new edits, and specifically what kind?

• RQ4: Are new edits or reorganizations of existing edits more common in merge conflicts?

Answering these questions will help to improve the accuracy of change metric data collection and enhance the understanding of what information is most relevant to project development when it comes to merge commits. For researchers, this information will help to better understand how to mine Git repositories and to comprehend the significance of merge commits to a project’s history. The practitioner of software development can use the information presented to better understand the impact of Git processes on the historical presentation of their project’s evolution.

This thesis is presented as follows: Chapter II discusses Git from a brief history to the internal structure of the repositories. In addition, branching and merging with Git is shown, detailing the commands used and the historical impact for merging and branching operations. Chapter III discusses merge commits with respect to their analysis and their role in a repository’s history. Chapter IV discusses Grim, the tool developed to perform Git repository analysis and the extension developed for the purpose of this study. Chapter V presents the empirical study in detail, including repository selection, analysis of the data collected from the repositories, and the findings of a subsequent manual analysis of selected merge commits from a source code level. Chapter VI discusses possible threats to the validity of this study. Chapter VII discusses related works, and Chapter VIII concludes the thesis with a summation of the work presented and the direction of future work.

CHAPTER II

GIT

Git was developed in 2005 by Linus Torvalds, the creator of the Linux kernel, as a replacement for the version-control system BitKeeper. While Git took some of its inspiration from BitKeeper, it was designed from the ground up to be flexible enough to support the non-linear development needs of the Linux kernel [3]. The result of this focus is a version-control system that supports almost any development workflow while tracking multiple, simultaneous development paths on a given project. Here, I define workflow as the way in which developers utilize the branching and merging operations in Git with respect to the development process. The underlying structure of Git and its branching and merging operations will be discussed in the rest of this chapter.

Structure

Internally, Git is a database of objects that stores repository information as a directed acyclic graph (DAG) [6]. Each object is identified by a unique SHA-1 hash and plays a unique role in the object hierarchy. The lowest level object is called a blob. A blob is simply the compressed content of a single file managed by the repository. Blobs are in turn all linked together by tree objects. A tree object is used to represent directories and contains references to blob objects, as well as links to other tree objects, to recreate the directory structure of the repository. Commit objects are created to tie all of the aforementioned objects together to represent the current state of a repository at the time the commit was created.

When a change occurs to a repository and a commit is issued, Git uses the cryptographic hash IDs of the objects to determine where changes have occurred. Unlike other version-control systems such as SVN, Git does not store deltas of the changes from commit to commit. Instead, the new state is recorded, i.e., the new contents of the file. New blob objects are created for any changed or new files to represent the entirety of the files, and new tree objects are generated to point to the new blobs, including references to any existing, unchanged trees or blobs as well. This new object may cause changes to ripple throughout the graph, as any tree objects that now need to point to the newly created trees will also have new tree objects generated to include the updated references. This method of structuring the object database by way of references makes the structure incredibly flexible and supports Git’s most important feature, branching.
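The way a hash ID follows from file content can be illustrated with a short Python sketch. It assumes Git's usual SHA-1 object format, in which a blob is hashed as the header "blob <size>" plus a NUL byte, followed by the raw content; the function name git_blob_id and the sample file contents are hypothetical and used only for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes a small header ("blob <size>" and a NUL byte) followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    old = git_blob_id(b"int x = 1;\n")
    new = git_blob_id(b"int x = 2;\n")
    # A changed file hashes to a different ID, so a new blob object is needed.
    print(old != new)
    # Identical content hashes to the same ID, so an unchanged blob can be reused.
    print(git_blob_id(b"int x = 1;\n") == old)
```

Because the ID depends only on the content, comparing IDs is enough to tell whether a file changed between two commits, without comparing the files themselves.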

Branching

Git’s biggest strength is its ability to create and track numerous branches, where parallel development tasks can be performed apart from mainline development. Since branches in Git are so lightweight, developers use them in many ways to fit their workflows [7]. For focused development tasks, developers will often create branches for specific bug fixes or features. This allows them to work on the features and commit progress whenever they like without having to worry about polluting a known stable branch. These branches are usually referred to as topic branches. In other instances, branches can be used as long-running staging areas for changes. For example, a project may have a “development” branch where only cutting-edge features are implemented, while a “stable” branch houses all the changes for the project that are fully tested and functional. The branches used in this way are historical branches. Branches can even be used as mediation points, where code can be committed and placed under review before a member of the development team accepts it for integration into a historical branch. There are many other ways to use Git branches in a development workflow and different terminology used to describe their specific roles in the process, but the flexibility of Git’s branching makes all of this possible.

By default all Git repositories are initialized with a default branch called “master” using the command ‘git init’. Internally, a branch is just a single file that contains the ID of the branch, which is the commit object it is currently referencing, and the name of the file is whatever colloquial name a developer has used to designate it. To keep track of what current branch a developer is working on, Git uses a variable called HEAD, which simply points to the branch file. It is important to note that Git commit objects have no concept of a branch and do not record this information in their metadata. To that end, the concept of a named branch in Git is purely for developer convenience when referring to various development paths or concurrent activities.

Git stores repository data in a normally hidden directory, “.git/”. Each time a new branch is created, a new file is stored in .git/refs/heads. Anytime a new commit is added to a branch, the branch pointer file is updated with the new commit ID. Since branches are just references to commit IDs, a developer can make a new branch from the latest commit in the current working branch using git branch [branch name], or from any commit ID in the project’s history using git branch [branch name] [commit SHA]. When a user wants to switch to a different branch, the command git checkout [branch name] can be issued and Git will restore the repository to the state of the commit ID within the branch file. In addition to changing the local repository state, Git also updates HEAD to point to the branch that is currently checked out. When a branch is no longer needed, due to its changes having been transferred to another branch, or in the event of a failed experiment or dead feature, a branch can be deleted using the command git branch -d [branch name]. As mentioned before, Git does not store branch names in any critical structural commit metadata, so this operation simply deletes the branch file from .git/refs/heads/ and all change sets are still valid.
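The branch files described above can be read directly, as in the following minimal Python sketch. It assumes the repository keeps its branches as loose files under refs/heads; Git may also store references in a packed-refs file, which this sketch does not read. The function names current_branch and branch_commit are hypothetical.

```python
from pathlib import Path
from typing import Optional


def current_branch(git_dir: Path) -> Optional[str]:
    """Return the branch name HEAD points to, or None if HEAD is detached."""
    head = (git_dir / "HEAD").read_text().strip()
    prefix = "ref: refs/heads/"
    return head[len(prefix):] if head.startswith(prefix) else None


def branch_commit(git_dir: Path, branch: str) -> Optional[str]:
    """Return the commit ID stored in the loose branch file for `branch`."""
    ref_file = git_dir / "refs" / "heads" / branch
    if ref_file.is_file():
        return ref_file.read_text().strip()
    # The branch may live in packed-refs instead; that case is not handled here.
    return None
```

For a regular clone, git_dir would be the “.git” directory; for a bare clone it is the repository directory itself.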

Merge Operations

Merges can happen for a variety of reasons. Since each developer works with a standalone copy of a repository, it is not uncommon for developers to make several commits to their local repository before issuing a push with the command ‘git push’, which uploads their new commit states to a central repository where all changes are collected. If another developer were to try to push changes to a central repository that contains commit objects her repository does not have, the push would be declined and the developer would be required to issue a pull command using ‘git pull’, which downloads commit states from the repository. When the new commits are pulled, an implicit merge operation takes place and a temporary local branch is created so Git can merge the new commits into the developer’s repository.

Other times, a merge is done explicitly in conjunction with the many branch-based development workflows that Git supports. One common example is merging a topic branch into a historical branch. In this situation, a developer has been working on resolving an issue in a topic branch. Upon completion of the development task, the developer would like to merge these changes into a historical branch to contribute the changes to mainline development. To accomplish this task, a developer would first check out the historical branch using git checkout [historical branch name] and then use git merge [topic branch name] to merge the two together. Behind the scenes, Git takes the two commit states for each branch and traverses the DAG of objects to discover the best common ancestor for those two commits. Once a common ancestor is found, a three-way merge is performed to combine the changes into a merge commit where the first parent is the destination for the merge and the second parent is the source of the merge.
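The search for a best common ancestor can be sketched in Python over a small parent map. This is an illustrative, quadratic-time sketch under the assumption that a "best" common ancestor is one that is not itself an ancestor of another common ancestor; it is not Git's actual implementation, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def ancestors(parents: Dict[str, List[str]], commit: str) -> Set[str]:
    """Return `commit` plus every commit reachable through parent links."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def best_common_ancestors(parents: Dict[str, List[str]], a: str, b: str) -> Set[str]:
    """Return the common ancestors of a and b that no other common ancestor descends from."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != other and c in ancestors(parents, other) for other in common)
    }


if __name__ == "__main__":
    # Two branches that diverge from C1: C1 <- C2 <- C3 and C1 <- C4 <- C5.
    history = {"C1": [], "C2": ["C1"], "C3": ["C2"], "C4": ["C1"], "C5": ["C4"]}
    print(best_common_ancestors(history, "C3", "C5"))
```

With the ancestor in hand, a three-way merge compares each branch tip against that ancestor rather than comparing the two tips directly.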

A merge commit is not always necessary to combine the state of the two parents. In the event that the common ancestor state is the same as the destination state, the second parent could be combined with a fast-forward merge. In this instance, the pointer for the first parent is simply updated to reference the commit state of the source parent. On the opposite end of the spectrum, merging a commit might result in a merge conflict. A conflict occurs when Git cannot derive on its own which of the changes introduced by the commits to keep in a given file in the repository. This usually occurs when commits are made by multiple developers that change the same line or region in a file. Since both changes are equally likely to be the correct choice from the perspective of the merging algorithm, this condition requires user intervention to make the final decision on how to resolve the conflict.

Sometimes, such as in the case of a release or major update, it is necessary to merge multiple branches together all at once. The command to perform this merge is similar to the one above with the exception that multiple branch names are appended to the command. When a merge involves more than two parents, this is called an octopus merge [8] named after the algorithm used to resolve the merge. While this type of merge does not happen frequently, it is utilized as an alternative to generating multiple merge commits from piecewise merge operations over multiple branches. It is interesting to note that this particular merging strategy is unable to handle conflicts of more than two files, so all branches involved in this merge must first be conflict free with the destination branch or the merge operation will fail.

When working with multiple developers and branches of parallel development, merges are almost a certainty. However, some development groups prefer a different way of combining their branches called rebasing. Rebasing is a destructive operation that alters the history of a repository. Issuing a rebase causes Git to take the existing commits from a merge source, delete them, and recreate new commits to be replayed on top of a merge destination. When the rebase is finished, the merge source will contain all of the commits from the merge destination and a subsequent merge will be a fast-forward merge with no merge commit.

The rebase operation provides a cleaner and more concise view of the history compared to issuing merge commits throughout the history. However, it does have limitations in that a rebase should not be used on any commit history that has already been made available to other members of a development team. Doing so will alter the cryptographic hash values of the rewritten commits and cause issues with the consistency of the commit objects for all the local repositories that represent the project before the rebase took place (i.e., the repositories of all other members of the development staff who previously cloned the repository). Rebasing is intended to rewrite local repository history before it is combined with a publicly available repository. When a blending of public branches is required, using merging ensures consistency in the state of the repository between all developers. While there are two camps on this issue, one preferring rebasing for a clean and organized history, and the other preferring the most accurate representation of the project’s development, when used properly, both merging and rebasing effectively unite the history of two or more branches. The next chapter will discuss the merge commits of Git in more detail.

CHAPTER III

ROLES OF MERGE COMMITS

While all Git commits in the DAG have the same responsibility of pointing to the correct objects in the database to represent the state of a repository at any given time, not all commits have the same role with regard to their historical significance. Regular commits, or commits with a single parent, are used to introduce new changes to a software repository. Since a regular commit has one parent, and the only way to create a new Git commit is to make a change to the repository, this commit must be in some way an extension of the commit that directly precedes it. Changes such as refactorings, bug fixes, or feature additions are all added to the history in this way. This is not always the case when looking at merge commits.

The default operation of a merge in Git takes two separate existing states and blends them into a new state. While this seems straightforward, the historical impact of this is very different from that of a regular commit. The key difference between a regular commit and a merge commit is the concept of existing work. A regular commit has no additional reference point in the history other than the parent commit that came before it; thus, any alteration of its state must come from new changes. A merge commit, on the other hand, is blended from two (or more) pre-existing states and as such does not require any new changes to be classified as a new state by Git. This fundamental difference has a subtle yet important impact.

Assume that a bug fix is done on a branch called “Issue123” and needs to be merged into a branch used for automatic nightly testing called “Nightly”. As seen in Figure 1, the current state or commit of Issue123 is C5 and is represented by the set {C5, C4, C1}, while the current state of Nightly is C3, represented by the set {C3, C2, C1}. When the command ‘git merge Issue123’ is issued from the Nightly branch, a new commit C6 is created in the Nightly branch and is a union of the state of both Nightly and Issue123. The change set of C6 can be represented as {C6} ∪ {C3, C2, C1} ∪ {C5, C4, C1}.

Figure 1: Branch merging and resulting commit state transitions

It is important to note that in the set of states, C6 is only present if the merge commit itself has changes. If one were to attempt to examine the source changes that took place at each commit within these branches using a standard approach to compare a commit with its parent as the branches are traversed, Issue123 and Nightly would be represented accurately until the merge commit. In this case, a diff of C6 and C3 would show the changes {C6, C5, C4, C1} and a diff of C6 and C5 would show the changes {C6, C3, C2, C1}. This poses a problem for analysis because the change data collected would be represented twice by analyzing the merge commit, as the change sets {C3, C2, C1} and {C5, C4, C1} were already found while traversing each branch independently before the merge commit. Analysis for change metrics that does not take this into account would end up recounting data. In the next chapter a tool for accurately extracting data on these changes will be presented.
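The recounting problem can be made concrete with a small Python sketch over the history in Figure 1. As a simplifying assumption, the sketch models a diff against a parent as a plain set difference of states, so the shared ancestor C1 drops out of the merge diffs, unlike in the set notation used in the text. The names PARENTS, state, and naive_counts are hypothetical.

```python
from typing import Dict, List, Set

# Parent links for the history in Figure 1 (C6 merges Issue123 into Nightly).
PARENTS: Dict[str, List[str]] = {
    "C1": [], "C2": ["C1"], "C3": ["C2"],
    "C4": ["C1"], "C5": ["C4"], "C6": ["C3", "C5"],
}


def state(commit: str) -> Set[str]:
    """Return the set of commits whose changes are contained in the state at `commit`."""
    result = {commit}
    for parent in PARENTS[commit]:
        result |= state(parent)
    return result


def naive_counts() -> Dict[str, int]:
    """Count how often each commit's changes surface when every commit
    is diffed against each of its parents."""
    counts = {commit: 0 for commit in PARENTS}
    for commit, parents in PARENTS.items():
        # A root commit has no parent, so it is compared against the empty state.
        bases = [state(p) for p in parents] or [set()]
        for base in bases:
            for changed in state(commit) - base:
                counts[changed] += 1
    return counts


if __name__ == "__main__":
    for commit, count in sorted(naive_counts().items()):
        print(commit, count)
```

In this model the changes of C2 through C5 are each seen twice: once on their own branch and once more through the merge commit C6, which is the double counting a merge-aware analysis has to avoid.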

CHAPTER IV

GRIM

To answer the research questions posed about merge commits in Git, an extension was developed for Grim (Git Repository Information Miner). The base Grim tool was developed in collaboration with Ms. Heather Michaud from multiple class projects. Grim is a Python command line application designed to interface with the Git command line client in order to traverse and reconstruct a Git repository’s DAG to retrieve information of interest. To achieve this, Grim first runs

git log --branches --date=iso --pretty=format:"%cN%aN%cd%ad%H%P%s%b"

The log command by itself is a simple mechanism to see the history of the branch that is currently checked out. However, by supplying the option --branches, we are able to obtain all of the commits from all branches. It is important to note that for the option --branches to work properly, the Git repositories for analysis must be cloned or downloaded locally using the command git clone --bare [repository uri]. This bare form of the repository copies all branch references locally so the Git client knows what branches are available for traversal when the log command is used. If a repository were cloned without using the option --bare, only the default branch reference would be present and commits solely in other branches would not be shown. The default output format for the log command is intended for human consumption and omits some of the available information from a commit object; this format is overridden using the option --pretty=format: to output data in a format that is more straightforward to parse. The specific format string requests the committer name (%cN), author name (%aN), commit date (%cd), author date (%ad), commit hash (%H), and parent hash(es) (%P), all delimited by a custom string.
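The log collection step might be sketched in Python as follows. This is not Grim's code: the delimiter Grim uses is not shown in the text, so the sketch assumes the ASCII unit and record separator bytes, emitted through git's %x1f and %x1e hex escapes, and the names read_log and parse_log are hypothetical.

```python
import subprocess
from typing import Any, Dict, List

FIELD_SEP = "\x1f"   # assumed field delimiter (ASCII unit separator)
RECORD_SEP = "\x1e"  # assumed record terminator (ASCII record separator)
FIELDS = ["committer", "author", "commit_date", "author_date",
          "sha", "parents", "subject", "body"]
# %x1f and %x1e make git print the two separator bytes literally.
LOG_FORMAT = "%x1f".join(
    ["%cN", "%aN", "%cd", "%ad", "%H", "%P", "%s", "%b"]) + "%x1e"


def read_log(repo_path: str) -> str:
    """Run git log over all branches of a local repository and return the raw output."""
    cmd = ["git", "log", "--branches", "--date=iso",
           "--pretty=format:" + LOG_FORMAT]
    return subprocess.run(cmd, cwd=repo_path, check=True,
                          capture_output=True, text=True).stdout


def parse_log(raw: str) -> List[Dict[str, Any]]:
    """Split raw log output into one dictionary per commit."""
    commits = []
    for record in raw.split(RECORD_SEP):
        record = record.lstrip("\n")  # git separates entries with a newline
        if not record:
            continue
        commit: Dict[str, Any] = dict(zip(FIELDS, record.split(FIELD_SEP)))
        # %P lists parent hashes separated by spaces; a root commit has none.
        commit["parents"] = commit["parents"].split()
        commits.append(commit)
    return commits
```

A call such as parse_log(read_log(path)) would yield one dictionary per commit; control characters are chosen as separators here because they are unlikely to appear in names or commit messages.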

Once this data is all collected, it is stored in three map data structures. The first map (children) stores a commit’s SHA as a key and a list of that commit’s parent SHAs as values. This representation is more of a top-down approach and indicates that any commit SHA used as a key that returns a list with more than one parent SHA is automatically a merge commit. The second map (parents) is complementary to the first map in that it takes a bottom-up approach and stores a parent SHA as a key and a list of child commits that reference it. This map indicates that any parent SHA used as a key that returns a list of two or more children must be a branch point in the history. It is important to note that this concept of branch points isn’t natively stored in Git and must be derived by Grim. These two maps are not only useful in traversing the graph, but they also allow for identification of the structural role any given commit plays in constant time. The third and final map has keys for every commit SHA in the repository’s history and retrieves an object that represents all of the metadata values from a commit. This is useful when traversing the graph or performing any other analysis which might require examination of the metadata.
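The three maps can be sketched in Python as shown below. This is an illustration rather than Grim's code; the input is assumed to be a list of dictionaries with "sha" and "parents" keys, and the names parents_of and children_of are hypothetical stand-ins for the maps the text calls children and parents, respectively.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def build_maps(commits: List[Dict[str, Any]]) -> Tuple[
        Dict[str, List[str]], Dict[str, List[str]], Dict[str, Dict[str, Any]]]:
    """Build the three lookup maps from a list of parsed commits."""
    parents_of: Dict[str, List[str]] = {}                   # commit SHA -> its parent SHAs
    children_of: Dict[str, List[str]] = defaultdict(list)   # parent SHA -> its child SHAs
    metadata: Dict[str, Dict[str, Any]] = {}                # commit SHA -> all commit fields
    for commit in commits:
        sha = commit["sha"]
        parents_of[sha] = list(commit["parents"])
        for parent in commit["parents"]:
            children_of[parent].append(sha)
        metadata[sha] = commit
    return parents_of, children_of, metadata


def merge_commits(parents_of: Dict[str, List[str]]) -> List[str]:
    """A commit with more than one parent is a merge commit."""
    return [sha for sha, parents in parents_of.items() if len(parents) > 1]


def branch_points(children_of: Dict[str, List[str]]) -> List[str]:
    """A commit with two or more children is a branch point."""
    return [sha for sha, children in children_of.items() if len(children) > 1]
```

Once the maps are built, checking whether a single commit is a merge or a branch point is a dictionary lookup followed by a length check.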

To perform merge commit analysis with Grim, the following command is run to process a local Git repository as input: python grim.py merges [local path to repository] -o [directory for data]. While Grim has other functionality, the ‘merges’ argument switches the context of Grim to support the merge commit analysis, and -o is the location for all the output results. When performing merge analysis, Grim iterates over the children map and stores the SHA ID of any child that has two or more parents to find all merge commits, and also tracks the specific number of parents. The commits are all run through the Git command git show --unified=0 --find-renames. The git show command, when used on a merge commit, has two unique outputs.

If a merge was resolved automatically without intervention, the show command will display no differencing information. On the other hand, if a merge commit involved a conflict, the command outputs the files in conflict and a difference of the changes, or a diff, computed between those files with respect to the resulting resolution. Setting the option --unified to 0 removes all context lines from the differencing result that are not involved in the conflict resolution. Removing unmodified context lines from the diff ensures each hunk, or contiguous block of changes, does not have any unnecessary information. The option --find-renames is used to augment the diff to check if a file was a simple rename instead of a series of deletions and additions. Git does not track the activity of renaming a file in the repository, as the cryptographic hash IDs associated with blobs take care of determining file similarity. During a diff, however, the option --find-renames applies a default 50% similarity index to see if any changes discovered in the diff were the result of a rename and then omits them from the diff output. This is not perfect, but it cuts down on difference results showing all the lines from a file as removed and then added under a different filename.

What is important to note about the diff information shown here is that it is not quite the same as a traditional diff. The first 18 lines in Figure 2 represent information about the commit. Line 1 is the SHA-1 ID of the commit and line 2 shows that the merge has only two parent hash IDs (separated by a space), which is important later on in the diff. Lines 5 through 18 represent the commit message and, while it can contain useful historical information, it is completely editable by the developer. As such, it is not always the most reliable source of information. Each line that begins with “diff --cc [filename]” indicates the start of the diff results for a new file. In this instance, the diff for this commit contains differencing for two files, cache.h (line 19) and setup.c (line 31). Each time a file has differencing data, there will be at least one hunk indication, represented by a line starting with ‘@@@’. In the example in Figure 2, the difference data for cache.h has two hunks, one at line 23 and the second at line 26. Within the source lines that make up the difference data, ‘+’, ‘-’, and empty space notation is used to represent change, but the position of these characters also has significance. The position is based on the number of parents involved in the merge, with each character representing the state of the file in that parent. Looking closer at the differencing data of setup.c starting on line 36, it can be seen that the first character of the line is a ‘-’. This indicates that the version of setup.c found in the first parent, shown in line 2 to be 5e97f46, contained that line but that it does not appear in the version of the file resulting from the merge conflict resolution. Conversely, on line 39 the ‘+’ in the first position indicates that this line did not exist in the version of setup.c from parent 5e97f46 but exists in the new file after the conflict resolution. An empty space in any column indicates that the line is not in conflict with the parent of that file.

Figure 2: Example output from ‘git show’ command
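The layout described above, with one “diff --cc” line per file followed by one or more ‘@@@’ hunks, can be split apart with a short routine. The following is a minimal sketch under stated assumptions, not Grim's parser: the function name parse_combined_diff and the returned structure are hypothetical, and the input is assumed to be the text printed by the show command for a single merge commit.

```python
def parse_combined_diff(show_output):
    """Split the show output of one merge commit into files and hunks.

    Returns a list of (filename, hunks) pairs, where hunks is a list of
    lists, each holding the body lines of one hunk.
    """
    files = []
    hunks = None  # hunks of the file currently being read
    for line in show_output.splitlines():
        if line.startswith("diff --cc "):
            # Start of the diff results for a new file.
            hunks = []
            files.append((line[len("diff --cc "):], hunks))
        elif hunks is not None and line.startswith("@@@"):
            # A hunk header; merges with more parents use more '@' signs.
            hunks.append([])
        elif hunks:
            # Inside a hunk: keep the body line with its marker columns.
            hunks[-1].append(line)
        # Anything else (commit header, message, index and file header
        # lines that precede the first hunk) is skipped.
    return files
```

Skipping everything before the first hunk of a file matters because the file header lines also begin with ‘-’ and ‘+’ characters and would otherwise be mistaken for changed source lines.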

With these rules in mind, the extension to Grim parses the result of the show command and prepares three output files. The first is a comma-separated value (CSV) file storing information about merges with conflicts. The second is a text file that contains all the raw output of the git show command for manual verification and additional analysis. The third is a plain text file that contains some basic repository statistics, including the total number of commits and total number of merges. In terms of performance, Grim scales well and is able to perform all of the aforementioned operations on Linux, the largest repository in terms of total number of commits and merge commits, in about six minutes, with the remaining repositories taking roughly a minute or less. Table 2 in Chapter V shows Grim’s runtimes for all the selected repositories in this study.

CHAPTER V

EMPIRICAL STUDY

To address the research questions of this thesis, an empirical study was performed on a variety of public GitHub repositories. The following sections detail the method of data collection and the analyses performed.

Data Collection

For this empirical study, 23 repositories, shown in Table 1, were cloned from GitHub using ‘git clone --bare’ and last updated with the latest commit data as of June 17, 2015. The repositories were selected to represent a sufficient set of projects from a variety of application domains, implementation languages, sizes based on lines of code, and historical sizes based on number of commits. The full commit history from all branches of the 23 repositories was traversed by the Grim tool for the presence of merge commits (any commit with two or more parents) using the command

python grim.py merges [repository path] -o [data output directory]

Grim’s performance benchmarks on the data set can be seen in Table 2. Any commit found by Grim to have two or more parents and line differencing data is collected and considered to be a merge conflict. Each merge conflict has the following information recorded by Grim: merge conflict commit ID (SHA-1 hash), name of the conflicting file, number of parents associated with the merge commit, and four custom metrics based on their appearance rate in the line differencing data. These metrics are all recorded per merge conflict, per file within each merge conflict, and per hunk within each file in a merge commit. Summaries of the data collected for each system are shown in Table 3 and Table 4.
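The two selection rules used during data collection (a merge commit is any commit with two or more parents, and a merge conflict is a merge commit whose show output carries differencing data) can be sketched as follows. This is a hedged illustration rather than Grim's code: it assumes the commit graph is listed with ‘git rev-list --parents --all’, which prints each commit ID followed by its parent IDs, and the function names find_merge_commits and is_merge_conflict are hypothetical.

```python
def find_merge_commits(rev_list_output):
    """Map each merge commit ID to the list of its parent IDs.

    Expects text in the form printed by `git rev-list --parents --all`:
    one commit per line, its ID followed by the IDs of its parents.
    """
    merges = {}
    for line in rev_list_output.splitlines():
        fields = line.split()
        if len(fields) >= 3:  # commit ID plus two or more parents
            merges[fields[0]] = fields[1:]
    return merges


def is_merge_conflict(parents, show_output):
    """Apply the heuristic from the text: a merge with diff data."""
    has_diff = any(
        line.startswith("diff --cc ")
        for line in show_output.splitlines()
    )
    return len(parents) >= 2 and has_diff
```

Both functions work on text that has already been captured, so the heuristic can be checked against saved command output without access to the repository.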

Custom Difference Based Metrics

For the purposes of this study, four custom metrics were created to represent source changes found in merge commit differencing data. The four metrics cover a range of merge commit code contribution situations including new additions, new removals, new edits, and selected edits. Each metric is defined, and the criteria for determining its appearance are explained, in the following sections.

New Additions

The new additions metric is defined as any line in the differencing data where the first n characters (where n is the number of parents in the merge) are all ‘+’ characters. The ‘+’ character in the merge diff indicates that the line in question did not appear in a parent commit but appears in the resulting merge commit. The position of the ‘+’ character denotes which parent of the merge did not have the change; thus, if a line is preceded by the same number of ‘+’ characters as the number of parents, the change did not exist in any of the parents prior to the merge commit and as such is added content with respect to the new merge commit state. An example of this can be seen in lines 40 and 41 of Figure 2 in Chapter IV.

New Removals

The new removals metric is defined in a similar way to new additions, with the exception that the character being observed is the ‘-’ character. The ‘-’ character indicates that the line in question did appear in a parent commit but no longer appears in the resulting merge commit. Once again, the position of the ‘-’ character denotes which parent of the merge previously contained the line in question; thus, if a line is preceded by the same number of ‘-’ characters as the number of parents, the line has been removed with respect to all parents of the merge commit and no longer exists in the resulting merge commit.

Table 1: Details about each system used in the empirical study

System       Domain                          Language      Total KLOC  First Commit Date  Total Commits (all branches)
Atom         Text Editor                     CoffeeScript  90.08       08/18/2011         24359
Bullet       Physics Library                 C++           1,402.87    05/25/2006         3301
Cassandra    Database Management             Java          523.35      03/02/2009         17149
CMake        Build System                    C             937.68      08/29/2000         28895
Cocos2d-x    Game Engine                     C++           1,791.11    07/06/2010         33811
Django       Web Framework                   Python        719.12      07/13/2005         30752
Docker       Application Container Engine    Go            267.40      01/18/2013         16532
Git          Version Control System          C             743.93      04/07/2005         41746
Libgdx       Game Engine                     Java          1,752.57    03/06/2010         11250
Libgit2      Git Library (3rd Party)         C             233.63      11/31/2008         8827
Linux        OS Kernel                       C             19,512.34   04/16/2005         520224
Mongo        Database                        C++           3,172.14    11/19/2007         33076
Mono         Compiler, Runtime, and Library  C#            8,925.57    06/08/2001         111607
Monodevelop  IDE                             C#            2,038.42    09/23/2005         38130
MSBuild      Build System                    C#            469.16      03/11/2015         116
Node.js      Web Platform                    Javascript    4,005.49    02/16/2009         11638
Nunit        Unit Testing Framework          C#            261.08      07/16/2009         2114
Oclint       Static Analysis (linter)        C++           24.14       11/11/2012         763
Rust         Programming Language            Rust          704.24      06/16/2010         43764
Spring       Web Framework                   Java          1,141.05    07/10/2008         12241
Textmate     Text Editor                     C++           122.92      08/09/2012         3419
Tmux         Terminal Multiplexer            C             60.60       07/09/2007         5327
XBMC         Multimedia Player               C++           2,639.67    11/23/2009         36400

Table 2: Grim runtime performance benchmarks

System       Total Commits (all branches)  Runtime (sec)
Linux        520,224                       374.43
Mono         111,607                       12.82
Rust         43,764                        18.09
Git          41,746                        23.14
Monodevelop  38,130                        7.69
XBMC         36,400                        11.36
Cocos2d-x    33,811                        16.34
Mongo        33,076                        2.98
Django       30,752                        1.71
CMake        28,895                        8.91
Atom         24,359                        3.12
Cassandra    17,149                        65.24
Docker       16,532                        5.87
Spring       12,241                        0.75
Node.js      11,638                        2.00
Libgdx       11,250                        2.98
Libgit2      8,827                         1.37
Tmux         5,327                         0.40
Textmate     3,419                         0.13
Bullet       3,301                         0.48
Nunit        2,114                         0.53
Oclint       763                           0.09
MSBuild      116                           0.05

New Edits

The new edits metric is defined as the overall code contribution of the merge commit regardless of activity performed as long as the changes represented in the merge commit are new with respect to its parents. The value of this metric is the summation of the new additions and new removals metrics found for each conflict. This metric serves to represent the same granularity level for comparison to the selected edits metric discussed in the following section.

Selected Edits

The fourth metric is called selected edits. Selected edits are defined as any line in the diff output that contains a mixture of either ‘+’ or ‘-’ and one or more empty space characters (‘ ’) in the first n characters of the differencing data. When a mixture of these characters appears in the diff line, this indicates that the content of at least one parent is not in conflict, as represented by the empty space (‘ ’). This indicates that an existing line from one of the parent commits has made it through to the final version and was “selected” by the developer resolving the merge commit. Such a line does not constitute “new” work, as it existed before the merge commit.
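The four metric definitions reduce to a test on the first n characters of each hunk line, which can be sketched as follows. This is an illustrative sketch, not Grim's implementation: the names classify_line and count_metrics are hypothetical, and the input is assumed to contain only hunk body lines, since file header lines also begin with ‘+’ or ‘-’ characters and must be excluded beforehand.

```python
def classify_line(line, n_parents):
    """Classify one hunk body line by its first n_parents characters.

    Returns "new_addition", "new_removal", "selected_edit", or None.
    """
    # Pad short lines so that every line has n_parents marker columns.
    prefix = line[:n_parents].ljust(n_parents)
    if prefix == "+" * n_parents:
        return "new_addition"   # absent from every parent
    if prefix == "-" * n_parents:
        return "new_removal"    # removed with respect to every parent
    if " " in prefix and ("+" in prefix or "-" in prefix):
        return "selected_edit"  # at least one parent is not in conflict
    return None


def count_metrics(hunk_lines, n_parents):
    """Tally the four metrics over the body lines of one hunk."""
    counts = {"new_additions": 0, "new_removals": 0, "selected_edits": 0}
    key_for = {
        "new_addition": "new_additions",
        "new_removal": "new_removals",
        "selected_edit": "selected_edits",
    }
    for line in hunk_lines:
        kind = classify_line(line, n_parents)
        if kind is not None:
            counts[key_for[kind]] += 1
    # New edits are the sum of new additions and new removals.
    counts["new_edits"] = counts["new_additions"] + counts["new_removals"]
    return counts
```

Summing the per-hunk counts gives the per-file totals, and summing those gives the per-merge totals, matching the three levels at which the metrics are recorded.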

Data Analysis

The analysis of the data collected from the repositories is presented below with respect to the research questions posed by this thesis and presents the historical significance and contributions of merge commits to a Git repository.

• RQ1: How much of the history of a project are merge commits?

• RQ2: Do merge conflicts occur frequently?

• RQ3: Do merge conflicts introduce new edits, and specifically what kind?

• RQ4: Are new edits or reorganizations of existing edits more common in merge conflicts?

To answer these questions, an in-depth analysis was performed of the historical metadata available from the Git-based repositories as well as the custom source change metrics derived from merge commit differencing data.

RQ1: Do merge commits appear often in a project’s history? Table 3 shows that out of the 23 repositories analyzed, merge commits account for less than 20% of the entire history of commits in 16 of the systems (almost 70%). Comparing the percentage of merge commits with the total number of commits in each repository reveals no real correlation between the two. This could indicate that the presence of merge commits is more closely related to the process in place among the project maintainers, and the way Git fits into that workflow, than to the overall size of a repository’s history. One example of this could be developers working on local branches, private to their own development workstations, who upon completing a task choose to rebase their changes on top of a historical branch to avoid merge commits and keep the project history clean and easier to follow. This finding confirms the initial hypothesis concerning the roles of branches in that regular commits make up a greater portion of the history and make a more significant contribution to software systems with regard to source changes.

RQ2: Do merge conflicts occur frequently? Table 3 shows that in 19 of the repositories, merge conflicts make up less than 20% of the merge commits in their history, with the remaining 4 repositories never exceeding 31%. Two repositories, MSBuild and Textmate, had no merge conflicts at all. MSBuild is the smallest repository in the study in terms of number of commits. Additionally, all the merge commits to MSBuild are pull requests. A pull request is a way of packaging up a series of commits to be transferred to another Git user for integration into their repository. While this can be achieved with Git on its own, GitHub offers a facility to make this type of workflow more convenient. Pull requests are reviewed by maintainers before they are merged into the repository. This allows maintainers to be more selective about the changes they accept to avoid undesirable situations such as merge conflicts. Looking into Textmate’s history, all the merges in that repository are also merge commits from pull requests, and the absence of conflicts can be explained the same way as in MSBuild.

When looking at merge conflicts with respect to the total number of commits in a repository’s history, merge conflicts made up less than 3% of the entire history in all but one system. That system, Cassandra, had roughly three times the share of merge conflicts of any other system, at about 9%. To further understand this disparity, the history of the merge conflicts was examined manually. This investigation revealed that Cassandra has multiple historical branches designated for different versions of the software. In each merge commit these branches are used as either a source or destination branch. Historical branches based on version involve some level of parallel development: whether it is new features for the latest version or bug fixes for an older version, the development paths would likely diverge and naturally increase the odds of generating conflicts. Combined with the fact that Cassandra’s history has the fourth highest percentage of merge commits out of the 23 repositories, this suggests that the increased number of merge conflicts is a by-product of the way in which the maintainers use Git to manage their source code.
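The three percentages reported per system in Table 3 follow directly from the three raw counts. A minimal sketch of the arithmetic is given below; the function name merge_percentages and the dictionary keys are hypothetical and chosen only for this illustration.

```python
def merge_percentages(total_commits, total_merges, merges_with_conflicts):
    """Derive the three percentage columns of Table 3 from raw counts."""

    def pct(part, whole):
        # Guard against empty histories to avoid dividing by zero.
        return round(100.0 * part / whole, 2) if whole else 0.0

    return {
        # Share of the whole history made up of merge commits.
        "merge_commits": pct(total_merges, total_commits),
        # Share of the whole history made up of merges with conflicts.
        "merge_conflicts": pct(merges_with_conflicts, total_commits),
        # Share of merge commits that involved a conflict.
        "merges_with_conflicts": pct(merges_with_conflicts, total_merges),
    }
```

Applied to the Cassandra counts from Table 3 (17149 commits, 5699 merges, 1547 merges with conflicts), this yields 33.23, 9.02, and 27.15, matching the reported row.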

RQ3: Do merge conflicts introduce new edits to the resulting merge revisions (additions or removals)? Table 4 shows that in all but 3 of the repositories, new edits were introduced in a merge conflict. In 18 of those 20 repositories, the average number of new edits in a merge conflict resolution ranges from 3 to 169 LOC. The other two, Mono and Monodevelop, both had a significantly higher average of new edits, at about 733 and 759 LOC respectively. In every repository with new edits, new additions appear more frequently than new removals. Manual investigation of Mono and Monodevelop showed that a large number of the merge conflict commits with the most new edits represent Subversion history imported into Git. More recent merge commits in the Mono and Monodevelop history (from around 2013) show that merge commits with conflicts occurred with merges between branches, while merges from pull requests introduced no conflicts.

RQ4: Are new edits or reorganizations of existing edits more common in merge conflicts? Table 4 shows that in all but one repository with merge conflicts, the number of existing (selected) edits is higher than the number of new edits by a significant margin, in some cases by a factor of 10 or more. This indicates that when merge commits occur, it is more likely that existing work will require reorganization or finessing between branches than the introduction of completely new edits. While this result was expected, the size of the disparity between the metrics is surprising, especially given operations such as git merge --no-commit [branch name], where the merge occurs but creation of the merge commit is delayed to allow a developer to make more source modifications. In instances where that command is used, the merge commit would present itself like a conflict with new edits. What this data indicates is that this practice is not widely adopted, or that new edits in a merge commit are reserved primarily for conflict resolution.

Table 3: Presence of merge commits and commits with conflict in each repository

System       Total Commits  Total Merges  Total Merges with Conflicts  % Merge Commits  % Merge Conflicts  % Merges with Conflicts
Atom         24359          1886          143                          7.74%            0.59%              7.58%
Bullet       3301           151           1                            4.57%            0.03%              0.66%
Cassandra    17149          5699          1547                         33.23%           9.02%              27.15%
CMake        28895          4469          69                           15.47%           0.24%              1.54%
Cocos2d-x    33811          11587         554                          34.27%           1.64%              4.78%
Django       30752          649           45                           2.11%            0.15%              6.93%
Docker       16532          6252          155                          37.82%           0.94%              2.48%
Git          41746          9504          1193                         22.77%           2.86%              12.55%
Libgdx       11250          2107          59                           18.73%           0.52%              2.80%
Libgit2      8827           1875          39                           21.24%           0.44%              2.08%
Linux        520224         37057         3110                         7.12%            0.60%              8.39%
Mongo        33076          2294          87                           6.94%            0.26%              3.79%
Mono         111607         2034          419                          1.82%            0.38%              20.60%
Monodevelop  38130          1673          442                          4.39%            1.16%              26.42%
MSBuild      116            49            0                            42.24%           0.00%              0.00%
Node.js      11638          361           109                          3.10%            0.94%              30.19%
Nunit        2114           379           38                           17.93%           1.80%              10.03%
Oclint       763            105           13                           13.76%           1.70%              12.38%
Rust         43764          9639          197                          22.02%           0.45%              2.04%
Spring       12241          370           5                            3.02%            0.04%              1.35%
Textmate     3419           5             0                            0.15%            0.00%              0.00%
Tmux         5327           171           17                           3.21%            0.32%              9.94%
XBMC         36400          5288          28                           14.53%           0.08%              0.53%

CHAPTER VI

THREATS TO VALIDITY

While Git is able to provide rich historical data about a software system, there are circumstances in which certain information can only be deduced [9]. In the counting of merge conflicts, the baseline was to detect any merge that contained additional content or changes. While many of these commits also contained details about a conflict within their commit messages, it is possible to introduce changes into a merge commit by using the git merge --no-commit [branch name] command. In the event that this command is used, whether a merge conflict occurred or not, the results of the git show [commit id] command would show differencing data. In these instances, this could inflate the number of merge conflicts, but the number of merge commits would still remain accurate.

Table 4: Average number of source change metrics across all repositories

System       Total KLOC  Total Merges with Conflicts  Avg. New Additions (LOC)  Avg. New Removals (LOC)  Avg. New Edits (LOC)  Avg. Selected Edits (LOC)
Atom         90.08       143                          3.52                      0.27                     3.79                  49.55
Bullet       1,402.87    1                            0                         0                        0                     4
Cassandra    523.35      1547                         14.89                     1.83                     16.72                 149.59
CMake        937.68      69                           143.28                    0.01                     143.29                1810.48
Cocos2d-x    1,791.11    554                          102.79                    65.46                    168.25                289.58
Django       719.12      45                           102.36                    21.84                    124.2                 500.76
Docker       267.40      155                          8.95                      1                        9.95                  139.87
Git          743.93      1193                         11.91                     0.11                     12.02                 105.27
Libgdx       1,752.57    59                           15.46                     0.46                     15.92                 56.02
Libgit2      233.63      39                           6.41                      0.87                     7.28                  40.87
Linux        19,512.34   3110                         5.42                      0.31                     5.73                  184.38
Mongo        3,172.14    87                           35.63                     1.41                     37.04                 240.41
Mono         8,925.57    419                          597.76                    134.84                   732.6                 1299.05
Monodevelop  2,038.42    442                          665.24                    93.84                    759.08                203.38
MSBuild      469.16      0                            0                         0                        0                     0
Node.js      4,005.49    109                          36.65                     33.35                    70                    333.96
Nunit        261.08      38                           58.84                     33.13                    91.97                 1134.13
Oclint       24.14       13                           2.92                      0.77                     3.69                  60.23
Rust         704.24      197                          15.74                     7.71                     23.45                 193.29
Spring       1,141.05    5                            50.6                      0                        50.6                  2014.4
Textmate     122.92      0                            0                         0                        0                     0
Tmux         60.60       17                           10.18                     3.94                     14.12                 174.12
XBMC         2,639.67    28                           5.5                       1                        6.5                   123.04

While the granularity of new edits that have occurred during a merge can be broken down into new additions and new removals, the selected edits category cannot be presented with the same level of granularity. The default differencing data uses a standard line based approach, and as such lacks the necessary understanding of what occurred during the edits with respect to any syntactic elements or relocations. A new differencing approach that understood these concepts or an in-depth manual analysis of the states of any files involved both before and after the merge commit would be necessary and remains for future work.

Git has built-in features to rewrite history. While these are useful to developers in creating clean project histories, these destructive operations, such as rebasing, alter the development record of the project, which may then not represent all development events as they occurred. As a result, the data acquired from a public central repository, like those found on GitHub, may be either an exact historical record of the development process or an abridged synopsis. Since it is not possible to tell whether history has been altered, all information as it appears in the public repository is taken as factual.

CHAPTER VII

RELATED WORK

Many studies have been performed on traditional version control systems [2] such as CVS or Subversion, but little research has been done on the more modern distributed version control system Git. Some of the research involving Git studies the impacts of DVCS on software [10] [11]. Other research specific to Git has been performed by Lee, Seo, and Seo [12], who use Git’s historical structure for a statistical analysis of the number of diverging paths (or branches) in Git repositories. Meng, Miller, Williams, and Bernat [13] also traverse the history of a Git repository, but for the purpose of determining authorship of continuously edited source code at a line level of granularity.

In this thesis, analysis of the Git repository structure is supplemented with commit metadata and source code changes. With all of these facets of information used in tandem, this thesis is able to present a deeper context into the development process as a whole.

Specifically focusing on merge conflicts, researchers have examined various types of merge algorithms and merge conflict detection and resolution [14] [15], the ability to collaboratively resolve conflicts [16], language-aware merging tools [17], and the frequency of merge conflicts [18]. Estler et al. [18] performed a study of 105 student developers and found that over 94% of the developers dealt with some merge conflicts, but few dealt with them frequently. In contrast to this work, the thesis presented here studies the merges and conflicts appearing in 23 open-source projects contributed to by experienced developers, thus gaining insight into actual development practices and workflows in the open-source community. Phillips, Sillito, and Walker [19] performed a survey of developers to determine current branching and merging practices and how they relate to the type of version control system used. It was found that merging satisfaction was highly correlated with the type of VCS used by developers, with DVCS leading to greater merge satisfaction.

CHAPTER VIII

CONCLUSIONS AND FUTURE WORK

Little previous research has been done exclusively on Git, and the work that has been done looks at the historical metadata and the structure of Git independently. This thesis advances the research by combining the two in order to provide a deeper understanding of a Git project’s historical record. The role of merge commits has been defined and is supported by an empirical study of 23 open-source repositories. Multiple research questions were posed and answered concerning the frequency of merge commits, merge conflicts, and the extent to which they contribute to source code. In addition, this thesis contributes change metrics that differentiate merge conflict changes to source code using the concepts of new additions, new removals, new edits, and selected edits.

The results presented indicate that the examination of the Git history, in particular merge commits, alongside source change data provided by differencing can be helpful in deducing basic Git workflow practices used by development teams. Additionally, the custom metrics for source code contributions from merge commits provide a clearer view of merge conflict resolutions by quantifying developer effort from the standpoint of new edits. This kind of insight can prove valuable to MSR for data analysis, and for development groups to better understand the impacts of Git usage on their applications. With this in mind, various directions for further study will be explored to enhance and further refine the results.

As merges were revealed to have multiple sources, branches and pull requests, future work involves further breaking down the analysis by those sources to better represent the workflow in practice. Future work additionally includes an investigation into utilizing an alternative to standard line-based differencing, or performing in-depth manual analysis of the states of any files involved both before and after the merge commit, to better understand syntactic changes and improve the accuracy of the change metrics. These extensions should help to improve heuristics for derived historical data in Git. With improved heuristics, Grim can be expanded to take advantage of this additional information to enhance its results and expand its feature set. Lastly, once a heuristic engine is formalized, a survey of open-source projects and developers could be performed to verify the accuracy of the engine.

BIBLIOGRAPHY

[1] A. E. Hassan, “Mining Software Repositories to Assist Developers and Support Managers,” in Proceedings of the 22nd IEEE International Conference on Software Maintenance (ICSM’06), Philadelphia, Pennsylvania, USA, September 24-27 2006, pp. 339–342.

[2] H. H. Kagdi, M. L. Collard, and J. I. Maletic, “A survey and taxonomy of approaches for mining software repositories in the context of software evolution,” J. Softw. Maint. Evol., vol. 19, pp. 77-131, 2007.

[3] D. Spinellis, “Git,” IEEE Software, vol. 29, no. 3, pp. 100–101, 2012.

[4] Git. (2015, June 23). git-scm homepage [Online]. Available: https://git-scm.com/

[5] T. Preston-Werner (2013, April 10). Five years [Online]. Available: https://github.com/blog/1470-five-years

[6] S. Chacon and B. Straub, Pro Git, 2nd ed. Apress, 2014.

[7] B. O'Sullivan, “Making sense of revision-control systems,” Communications of the ACM (CACM), vol. 52, no. 9, pp. 56–62, 2009.

[8] R. E. Silverman, Git Pocket Guide, O’Reilly, 2013.

[9] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. Germán, and P. T. Devanbu, “The promises and perils of mining git,” in Proceedings of the 6th International Working Conference on Mining Software Repositories, (MSR’09), Vancouver, Canada May 16-17 2009, pp. 1–10.

[10] C. Brindescu, M. Codoban, S. Shmarkatiuk, and D. Dig, “How do centralized and distributed version control systems impact software changes?,” in Proceedings of the 36th International Conference on Software Engineering (ICSE’14), Hyderabad India May 31 - June 7 2014, pp. 322–333.

[11] C. Rodriguez-Bustos and J. Aponte, “How distributed version control systems impact open source software projects,” in Proceedings of the 9th Working Conference on Mining Software Repositories (MSR’12), Zurich, Switzerland June 2-3 2012, pp. 36–39.

[12] H. Lee, B.-K. Seo, and E. Seo, “A Git Source Repository Analysis Tool Based on a Novel Branch-Oriented Approach,” in Proceedings of the 4th International Conference on Information Science and Applications (ICISA’13), Pattaya, Thailand June 24-26 2013, pp. 1–4.

[13] X. Meng, B. P. Miller, W. R. Williams, and A. R. Bernat, “Mining Software Repositories for Accurate Authorship,” in Proceedings of the 29th IEEE International Conference on Software Maintenance (ICSM’13), Eindhoven, The Netherlands, September 22-28 2013, pp. 250–259.

[14] T. Mens, “A state-of-the-art survey on software merging,” IEEE Transactions on Software Engineering, vol. 28, no. 5, pp. 449–462, May 2002.

[15] D. Dig, K. Manzoor, R. Johnson, and T. N. Nguyen, “Effective Software Merging in the Presence of Object-Oriented Refactorings,” IEEE Transactions on Software Engineering, vol. 34, no. 3, pp. 321–335, 2008.

[16] A. Nieminen, “Real-time collaborative resolving of merge conflicts,” in Proceedings of the 8th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom’12), Pittsburgh, PA, October 14-17 2012, pp. 540-543.

[17] S. Apel, J. Liebig, B. Brandl, C. Lengauer, and C. Kästner, “Semistructured merge: rethinking merge in revision control systems,” in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE’11), Szeged, Hungary, September 5-9 2011, pp. 190-200.

[18] H. C. Estler, M. Nordio, C. A. Furia, and B. Meyer, “Awareness and Merge Conflicts in Distributed Software Development,” in Proceedings of the IEEE 9th International Conference on Global Software Engineering (ICGSE’14), Shanghai, China, August 18-21 2014, pp. 26-35.

[19] S. Phillips, J. Sillito, and R. J. Walker, “Branching and merging: an investigation into current version control practices,” in Proceedings of the 4th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE’11), Honolulu, Hawaii May 21 2011, pp. 9–15.
