MERGE COMMIT CONTRIBUTIONS IN GIT REPOSITORIES

A Thesis
Presented to
The Graduate Faculty of The University of Akron

In Partial Fulfillment
of the Requirements for the Degree
Master of Science

Drew T. Guarnera
August, 2015

Thesis Approved:
Advisor: Dr. Michael L. Collard
Faculty Reader: Dr. Kathy J. Liszka
Faculty Reader: Dr. Chien-Chung Chan
Interim Department Chair: Dr. David N. Steer

Accepted:
Dean of the College: Dr. Chand Midha
Interim Dean of the Graduate School: Dr. Chand Midha

ABSTRACT

Git’s non-linear historical graph is analyzed for the presence of merge commits and the contribution of merge conflicts to the source code. Git’s distributed version control model encourages parallel development by providing light-weight branching. This creates a need to integrate new changes from separate branches into main development branches using Git’s merge facilities. It is hypothesized that the role of merge commits is primarily that of historical unifier and that, as such, they do not regularly contribute new code changes in the commits they generate. To this end, new metrics were defined using differencing data associated with merge commits to formally analyze code contributions of merge commits. An empirical study of 23 open-source repositories was performed and a tool, Grim, was developed to collect the necessary historical and commit metadata from each repository. It was found that merge commits represent less than 20% of the overall history of a project for a majority of the repositories, and merge conflicts appear even less frequently at around 10%.
Merges with conflicts make up around 20% of merges for most of the repositories, and the source code contributions made by merge commits most often involve a reorganization of existing changes rather than the addition of new source code changes.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER
I. INTRODUCTION
II. GIT
    Structure
    Branching
    Merge Operations
III. ROLE OF MERGE COMMITS
IV. GRIM
V. EMPIRICAL STUDY
    Data Collection
    Custom Difference Based Metrics
        New Additions
        New Removals
        New Edits
        Selected Edits
    Data Analysis
VI. THREATS TO VALIDITY
VII. RELATED WORK
VIII. CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY

LIST OF TABLES

Table
1. Details about each system used in the empirical study
2. Grim runtime performance benchmarks
3. Presence of merge commits and commits with conflict in each repository
4. Average number of source changes metrics across all repositories

LIST OF FIGURES

Figure
1. Branch merging and resulting commit state transitions
2. Example output from ‘git show’ command

CHAPTER I

INTRODUCTION

Version control is a critical tool in the development of software. It affords developers the ability to work with their code in a sandbox-like environment, thus allowing the flexibility to experiment with modifications before finalizing these changes to the repository, i.e., codebase. In the event that issues should arise, version control supports rolling back changes to revert a codebase to a previous state. While the aforementioned features alone are enough to consider version control valuable, these tools also maintain the entire evolutionary history of each individual commit, i.e., change to a repository. This historical record and its associated metadata have been used by many researchers in the field of Mining Software Repositories (MSR) [1] [2] to wide effect for the purpose of gaining insight into software projects and general development trends.

While a variety of version-control systems exist and vary based on how the repository is managed, they break down into two major categories. Until recently, the most popular version-control systems were centralized version control systems (CVCS), such as CVS and Subversion, which represent the subject of the largest body of work in the field of MSR. These centralized systems rely on a dedicated software repository hosted in a location that all contributors can access. Because nearly all actions with a CVCS require communication with the server hosting the main repository, committing changes is an atomic operation. Thus, there is no natural facility in place to work and save changes safely within the version-control system locally on a development machine without committing them to the main repository.
This creates a unique problem for users of this type of system, as they must either ensure all work is completed before committing their changes to the repository, or commit incomplete work to the repository. Many tend to opt for the former choice, but this can prevent developers from committing small, incremental changes with less disruption to the software system, and leaves them at the mercy of the undo features of their source-code editor in the event that the changes yield negative side effects.

More modern version control systems such as Git and Mercurial are distributed version control systems (DVCS). The DVCS model has most of the same features as CVCS and can even have a central repository for the consolidation of all commits to a project. The feature that separates DVCS from CVCS is that each local copy a developer uses is a complete copy of the repository, including all historical and metadata information. While the most obvious benefit of this feature is redundancy for backups, it also introduces a greater level of development flexibility for software engineers. With DVCS, developers can clone (i.e., copy) a repository to their development machine and make commits to their local copy without altering a central repository or any other developer’s repository. This allows developers to work on a feature or fix, build up a series of commits locally, and upon completion of their task push (i.e., transfer) all of these commits to another repository. This ability to commit locally, request any historical commit logs, revert changes, and use many other features without the need to communicate with a remote central repository also makes many DVCS operations significantly faster than their CVCS counterparts. The benefits of the DVCS approach have not gone unnoticed by the development community [3], and as such many projects have moved to a DVCS tool, especially in the open-source community.
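
The clone, local commit, and push workflow described above can be illustrated with a toy model. The sketch below is not Git's implementation: the `Repository` class and its `clone`, `commit`, and `push` methods are hypothetical names introduced only for illustration, and history is assumed to be a simple linear list of commit messages.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Repository:
    """Hypothetical stand-in for a repository: an ordered list of commit messages."""

    commits: List[str] = field(default_factory=list)

    def clone(self) -> "Repository":
        # A clone receives the complete history, not just the latest snapshot.
        return Repository(commits=list(self.commits))

    def commit(self, message: str) -> None:
        # Committing changes only this copy; no server communication is modeled.
        self.commits.append(message)

    def push(self, remote: "Repository") -> int:
        # Transfer the commits the remote does not have yet.
        # Assumption: the remote history must be a prefix of the local history.
        if self.commits[: len(remote.commits)] != remote.commits:
            raise ValueError("histories have diverged; a merge would be needed")
        new_commits = self.commits[len(remote.commits):]
        remote.commits.extend(new_commits)
        return len(new_commits)


central = Repository(["initial import"])
local = central.clone()

# Two commits are made locally; the central repository is not touched.
local.commit("add parser")
local.commit("fix parser bug")
commits_before_push = list(central.commits)

# Pushing transfers both local commits in one step.
pushed = local.push(central)
```

In this model, `central` still holds only its initial commit until `push` is called, which mirrors the point that local work does not alter any other repository. The diverged-history check is a simplification: it rejects the push instead of modeling a merge.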
Git in particular has become the DVCS tool of choice for many open-source projects, from small development groups all the way up to large companies like Facebook, Microsoft, Google, and Netflix [4]. Git is also the core of one of the largest software repository hosting services, GitHub, which reported 3.5 million users and housed 6 million Git-based repositories as of April 10, 2013 [5]. While the popularity of Git continues to rise, it has not yet received the same level of attention from the research community for mining and exploration as previous version-control systems such as Subversion or CVS.

Git’s distributed model encourages non-linear commit activity by providing light-weight
