MERGE COMMIT CONTRIBUTIONS IN GIT REPOSITORIES


A Thesis
Presented to
The Graduate Faculty of The University of Akron
In Partial Fulfillment of the Requirements for the Degree
Master of Science

Drew T. Guarnera
August, 2015

Thesis Approved:
    Advisor: Dr. Michael L. Collard
    Faculty Reader: Dr. Kathy J. Liszka
    Faculty Reader: Dr. Chien-Chung Chan
    Interim Department Chair: Dr. David N. Steer

Accepted:
    Dean of the College: Dr. Chand Midha
    Interim Dean of the Graduate School: Dr. Chand Midha
    Date: _______________

ABSTRACT

Git’s non-linear historical graph is analyzed for the presence of merge commits and the contribution of merge conflicts to the source code. Git’s distributed version control model encourages parallel development by providing light-weight branching. This creates a need to integrate new changes from separate branches into main development branches using Git’s merge facilities. It is hypothesized that the role of merge commits is primarily that of historical unifier and that, as such, they do not regularly contribute new code changes in the commits they generate. To this end, new metrics were defined using differencing data associated with merge commits to formally analyze the code contributions of merge commits. An empirical study of 23 open-source repositories was performed, and a tool, Grim, was developed to collect the necessary historical and commit metadata from each repository. It was found that merge commits represent less than 20% of the overall history of a project for a majority of the repositories, and merge conflicts appear even less frequently, at around 10%.
Merges with conflicts make up around 20% of merges for most of the repositories, and the source code contributions made by merge commits most often involve a reorganization of existing changes rather than the addition of new source code changes.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER
I. INTRODUCTION
II. GIT
    Structure
    Branching
    Merge Operations
III. ROLE OF MERGE COMMITS
IV. GRIM
V. EMPIRICAL STUDY
    Data Collection
    Custom Difference Based Metrics
        New Additions
        New Removals
        New Edits
        Selected Edits
    Data Analysis
VI. THREATS TO VALIDITY
VII. RELATED WORK
VIII. CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY

LIST OF TABLES

1. Details about each system used in the empirical study
2. Grim runtime performance benchmarks
3. Presence of merge commits and commits with conflict in each repository
4. Average number of source changes metrics across all repositories

LIST OF FIGURES

1. Branch merging and resulting commit state transitions
2. Example output from ‘git show’ command

CHAPTER I
INTRODUCTION

Version control is a critical tool in the development of software. It affords developers the ability to work with their code in a sandbox-like environment, allowing them the flexibility to experiment with modifications before finalizing these changes to the repository, i.e., the codebase. In the event that issues arise, version control supports rolling back changes to revert a codebase to a previous state. While these features alone are enough to make version control valuable, these tools also maintain the entire evolutionary history of each individual commit, i.e., change to a repository. This historical record and its associated metadata have been used by many researchers in the field of Mining Software Repositories (MSR) [1] [2] to wide effect for the purpose of gaining insight into software projects and general development trends.

While a variety of version-control systems exist and vary based on how the repository is managed, they break down into two major categories. Until recently, the most popular version-control systems were centralized version control systems (CVCS), such as CVS and Subversion, which are the subject of the largest body of work in the field of MSR. These centralized systems rely on a dedicated software repository hosted where all contributors have access. Because nearly all actions with a CVCS require communication with the server that hosts the main repository, committing changes is an atomic operation. Thus, there is no natural facility in place to work and save changes safely within the version-control system locally on a development machine without committing them to the main repository.
This creates a unique problem for users of this type of system, as they must either ensure all work is completed before committing their changes to the repository, or commit incomplete work to the repository. Many tend to opt for the former choice, but this can prevent developers from committing small, incremental changes with less disruption to the software system, and leaves them at the mercy of the undo features of their source-code editor in the event that the changes yield negative side effects.

More modern version control systems such as Git and Mercurial are distributed version control systems (DVCS). The DVCS model has most of the same features as CVCS and can even have a central repository for the consolidation of all commits to a project. The feature that separates DVCS from CVCS is that each local copy a developer uses is a complete copy of the repository, including all historical and metadata information. While the most obvious benefit of this feature is redundancy for backups, it also introduces a greater level of development flexibility for software engineers. With DVCS, developers can clone (i.e., copy) a repository to their development machine and make commits to their local copy without altering a central repository or any other developer’s repository. This allows developers to work on a feature or fix, build up a series of commits locally, and, upon completion of their task, push (i.e., transfer) all of these commits to another repository. This ability to commit locally, request any historical commit logs, revert changes, and use many other features without the need to communicate with a remote central repository also makes many DVCS operations significantly faster than their CVCS counterparts.

The benefits of the DVCS approach have not gone unnoticed by the development community [3], and as such many projects have moved to a DVCS tool, especially in the open-source community.
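As a rough illustration of the clone, local-commit, and push workflow described above, the following Python sketch models a repository as a plain list of commit messages. The `Repository` class and its `clone`, `commit`, and `push` methods are hypothetical names invented for this illustration; they are not Git's data model or API, and `push` covers only the simple case in which the remote holds no commits that the local copy lacks.

```python
"""Toy model of a distributed workflow (an illustration, not Git itself)."""
from dataclasses import dataclass, field


@dataclass
class Repository:
    # Hypothetical stand-in for a repository: an ordered list of commit
    # messages represents the complete history.
    history: list = field(default_factory=list)

    def clone(self):
        """Return an independent repository holding the full history."""
        return Repository(history=list(self.history))

    def commit(self, message):
        """Record a commit locally; no other repository is touched."""
        self.history.append(message)

    def push(self, remote):
        """Transfer the commits the remote lacks (no-divergence case only)."""
        shared = len(remote.history)
        if self.history[:shared] != remote.history:
            raise ValueError("histories diverged; a merge would be needed")
        remote.history.extend(self.history[shared:])


central = Repository(["initial commit"])
local = central.clone()            # complete copy, including history
local.commit("add parser")         # local only
local.commit("fix parser bug")     # local only
assert central.history == ["initial commit"]  # central is still unchanged

local.push(central)                # transfer both commits in one step
print(central.history)
```

In this sketch the central repository is untouched until `push` is called, which mirrors the point made above: a series of commits can be built up locally and transferred together once the task is complete. When two copies have diverged, the sketch simply raises an error; integrating such histories is where merging comes in.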
Git in particular has become the DVCS tool of choice for many open-source projects, from small development groups all the way up to big companies like Facebook, Microsoft, Google, and Netflix [4]. Git is also the core of one of the largest software repository hosting services, GitHub, which reported 3.5 million users and housed 6 million Git-based repositories as of April 10, 2013 [5]. While the popularity of Git continues to rise, it has not yet received the same level of attention from the research community for mining and exploration as previous version-control systems such as Subversion or CVS. Git’s distributed model encourages non-linear commit activity by providing light-weight