Detection of Named Branch Origin for Git Commits

DETECTION OF NAMED BRANCH ORIGIN FOR GIT COMMITS

A Thesis Presented to The Graduate Faculty of The University of Akron
In Partial Fulfillment of the Requirements for the Degree Master of Science

Heather M. Michaud
August, 2015

Thesis Approved:
Advisor: Dr. Michael L. Collard
Faculty Reader: Dr. Kathy J. Liszka
Faculty Reader: Dr. Zhong-Hui Duan
Interim Department Chair: Dr. David N. Steer

Accepted:
Dean of the College: Dr. Chand Midha
Interim Dean of the Graduate School: Dr. Chand Midha

ABSTRACT

The named branch on which a change is committed in a Git repository provides valuable insight into the evolution of a software project, including a natural and logical ordering of commits categorized by the developer at the time of the change. In addition, the name of the branch provides semantic context as to the nature of the changes along that branch. However, this branch name is unrecorded in the historical archive of Git repositories. In this thesis, a heuristics-based algorithm is presented to detect the named branch origin of commits based on the merge commit messages. An empirical evaluation shows precision levels reaching an average of 87% as seen when applied to generated test repositories and an average recall of over 97% when applied to generated test repositories and forty-four open source systems. This is shown to constitute an enormous increase in recall when compared to the only existing algorithm for branch name detection. Additionally, a detailed explanation of common merge commit messages, merge types, and branch names as found in over forty open-source projects is discussed.
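The abstract's idea of reading branch names out of merge commit messages can be illustrated with a minimal sketch. It is not the thesis's algorithm: the function name parse_merge_subject and the three patterns are hypothetical, chosen to cover Git's default "Merge branch" and "Merge remote-tracking branch" subjects and GitHub's "Merge pull request" subject. Real repositories contain many non-default messages that these patterns would miss.

```python
import re

# Illustrative patterns only (an assumption, not the thesis's rule set).
_PATTERNS = [
    # GitHub pull-request merge: "Merge pull request #12 from alice/fix-login"
    re.compile(r"^Merge pull request #\d+ from (?P<source>\S+)"),
    # Git default: "Merge remote-tracking branch 'origin/topic' [into target]"
    re.compile(
        r"^Merge remote-tracking branch '(?P<source>[^']+)'"
        r"(?: into (?P<target>\S+))?"
    ),
    # Git default: "Merge branch 'topic' [of <url>] [into target]"
    re.compile(
        r"^Merge branch '(?P<source>[^']+)'"
        r"(?: of \S+)?(?: into (?P<target>\S+))?"
    ),
]


def parse_merge_subject(subject):
    """Return (source_branch, target_branch) for a recognised merge subject.

    target_branch is None when the message does not name it; Git's default
    message usually omits "into ..." when merging into master.
    Returns None if the subject matches none of the patterns.
    """
    for pattern in _PATTERNS:
        match = pattern.match(subject)
        if match:
            groups = match.groupdict()
            return groups["source"], groups.get("target")
    return None


if __name__ == "__main__":
    for line in [
        "Merge branch 'feature-x' into develop",
        "Merge pull request #12 from alice/fix-login",
        "Fix typo in README",
    ]:
        print(line, "->", parse_merge_subject(line))
```

A parser like this only recovers the names mentioned in a merge commit; assigning those names to the non-merge commits along each branch is the part that requires heuristics over the commit graph.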
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER
I. INTRODUCTION
II. GIT STRUCTURE
    Branch operations
    Merge operations
        Merge
        Fast-forward
        Pull-requests
        Rebase
        Cherry-pick
    Remote repositories
III. ROLE OF MERGE COMMITS
    Text extraction
IV. CASE STUDY: MERGE COMMIT TYPES
V. HEURISTICS-BASED ORIGIN DETECTION
    Branch Heads Heuristic
    Parents of Merges Heuristic
    Ancestral Heuristic
    Majority-Origin Heuristic
VI. ALGORITHM EVALUATION
    Generated test repositories
    Grim extension
    Analyses
VII. APPLICATION OF HEURISTICS ON OPEN-SOURCE SYSTEMS
VIII. THREATS TO VALIDITY
IX. RELATED WORK
X. CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY

LIST OF TABLES

1. Variety of default merge commit messages based on branch types
2. Summary of merge commit types in 44 open-source projects
3. List of collected open-source repositories used in the case study
4. Distribution of explicit merge types on 44 open-source systems
5. Commonly occurring non-default merge commit messages
6. Recall of heuristics-based algorithm stages on 44 systems

LIST OF FIGURES

1. An example Git repository shown as a directed acyclic graph
2. Stages of commits to branches in example Git repository
3. Branch creation and checkout
4. Before and after repository state of a git merge operation
5. Before and after a fast-forward merge
6. Stages of a git rebase operation
7. A repository state before and after a git cherry-pick
8. An example Git repository with labeled origins of algorithm baseline
9. Heuristics-based algorithm pseudocode
10. Branch head heuristics pseudocode
11. An example Git repository with labeled origins after the branch head heuristic has also been applied
12. Fast-forwards and branch deletions affect the result of the branch head heuristic
13. Merge parent heuristic pseudocode
14. An example Git repository with labeled origins after the merge parents heuristic has also been applied
15. Ancestral heuristic pseudocode
16. An example Git repository with labeled origins after the ancestral heuristic has also been applied
17. Majority origin heuristic pseudocode
18. An example Git repository with labeled origins after the majority origin heuristic has also been applied
19. Bash source-code for generating test repositories
20. Example generated repository with fifteen non-merge commits

CHAPTER I
INTRODUCTION

Mining source-code repositories has become an intrinsic part of software engineering analyses since the advent of version-control systems (VCS). Grouping the commits into categories makes the historical information of the repository easier to interpret and helps identify evolutionary patterns and trends. Oftentimes, commits or issues are grouped according to the author [1], a sliding time-window [2-5], the size [6] or type [7] of the change, the files that were changed [5,8], branch patterns [9-10], or data-mining clustering methods [11].

Git, the most popular VCS to date [12], allows the developer to create named branches (i.e., independent, diverging paths from the mainline of development) which can later be merged back into the mainline. Because branching and merging operations are so flexible and efficient, they have been seamlessly integrated into developer workflow. A typical Git workflow involves creating a short-lived topic branch to implement a specific feature or bug fix, which is then merged back into the main branch upon task completion. Long-term branches are used