Generating Commit Messages from Git Diffs

Total Page:16

File Type:pdf, Size:1020Kb

Generating Commit Messages from Git Diffs Generating Commit Messages from Git Diffs Sven van Hal Mathieu Post Kasper Wendel Delft University of Technology Delft University of Technology Delft University of Technology [email protected] [email protected] [email protected] ABSTRACT be exploited by machine learning. The hypothesis is that methods Commit messages aid developers in their understanding of a con- based on machine learning, given enough training data, are able tinuously evolving codebase. However, developers not always doc- to extract more contextual information and latent factors about ument code changes properly. Automatically generating commit the why of a change. Furthermore, Allamanis et al. [1] state that messages would relieve this burden on developers. source code is “a form of human communication [and] has similar Recently, a number of different works have demonstrated the statistical properties to natural language corpora”. Following the feasibility of using methods from neural machine translation to success of (deep) machine learning in the field of natural language generate commit messages. This work aims to reproduce a promi- processing, neural networks seem promising for automated commit nent research paper in this field, as well as attempt to improve upon message generation as well. their results by proposing a novel preprocessing technique. Jiang et al. [12] have demonstrated that generating commit mes- A reproduction of the reference neural machine translation sages with neural networks is feasible. This work aims to reproduce model was able to achieve slightly better results on the same dataset. the results from [12] on the same and a different dataset. Addition- When applying more rigorous preprocessing, however, the per- ally, efforts are made to improve upon these results by applying a formance dropped significantly. This demonstrates the inherent different data processing technique. More specific, the following shortcoming of current commit message generation models, which research questions will be answered: perform well by memorizing certain constructs. RQ1: Can the results from Jiang et al. [12] be reproduced? Future research directions might include improving diff embed- RQ2: Does a more rigorous dataset processing technique im- dings and focusing on specific groups of commits. prove the results of the neural model? This paper is structured as follows. In Section 2, background KEYWORDS information about deep neural networks and neural machine trans- Commit Message Generation, Software Engineering, Sequence-to- lation is covered. In Section 3, the state-of-the-art in commit mes- Sequence, Neural Machine Translation sage generation from source code changes is reviewed. Section 4 describes the implementation of the neural model. Section 5 covers 1 INTRODUCTION preprocessing techniques and analyzes the resulting dataset char- acteristics. Section 6 presents the evaluation results and Section 7 Software development is a continuous process: developers incre- discusses the performance and limitations. Section 8 summarizes mentally add, change or remove code and store software revisions the findings and points to promising future research directions. in a Version Control System (VCS). Each changeset (or: diff ) is provided with a commit message, which is a short, human-readable 2 BACKGROUND summary of the what and why of the change [6]. Commit messages document the development process and aid 2.1 Neural Machine Translation developers in their understanding of the state of the software and A recent development in deep learning is sequence-to-sequence its evolution over time. However, developers do not always prop- learning for Neural Machine Translation [7, 25]. Translation can erly document code changes or write commit messages of low be seen as a probabilistic process, where the goal is to find a target quality [9]. This applies to code documentation in general and sentence y = ¹y1; :::;ynº from a source sequence x = ¹x1; :::; xmº negatively impacts developer performance [22]. Automatically gen- that maximizes the conditional probability of y given the source arXiv:1911.11690v1 [cs.SE] 26 Nov 2019 erating high-quality commit messages from diffs would relieve the sentence x, mathematically depicted as arg maxy p¹yjxº [4]. But this burden of writing commit messages off developers, and improve conditional distribution is of course not given and has to be learned the documentation of the codebase. by a model from the supplied data. Sutskever et al. [25] and Cho As demonstrated by Buse and Weimer [6], Cortes-Coy et al. [8] et al. [7] both have proposed a structure to learn this distribution: and Shen et al. [24], predefined rules and text templates can be used a model that consists of a encoder and decoder component that are to generate commit messages that are largely preferred to developer- trained simultaneously. The encoder component tries to encode written messages in terms of descriptiveness. However, the main the variable length input sequence x to a fixed length hidden state drawbacks of these methods are that 1) only the what of a change vector. This hidden state is then supplied to the decoder component is described, 2) the generated messages are too comprehensive to that tries to decode it into the variable length target sequence y. replace short commit messages, and 3) these methods do neither A Recurrent Neural Network (RNN) is used in this to sequentially scale nor generalize on unseen constructions, because of the use of read the variable length sequence and produces a fixed size vector. hand-crafted templates and rules. Over the years, different architectural changes of encoder and According to a recent study by Jiang and McMillan [13], code decoder components were proposed. Sutskever et al. [25] introduces changes and commit messages exhibit distinct patterns that can multi-layered RNN with Long-Short-Term memory units (LSTM), where both the encoder and decoder consist of multiple layers (4 number of issues with BLEU, such as the fact that sentences have to in their research). Each layer in the encoder produces a fixed size be identical to get the highest score and that a higher BLEU score hidden state that is passed onto the corresponding layer in the not always equals a better translation. The metric is computed decoder, where the results are combined into a target sequence using “a combination of unigram-precision, unigram-recall, and a prediction. One unexplainable factor noted by the authors of this measure of fragmentation that is designed to directly capture how architecture is that it produces better results if the source sequence well-ordered the matched words in the machine translation are in is reversed. Note that in each step during decoding, the LSTM only relation to the reference” [5]. has access to the hidden state from the previous timestep, and the previous predicted token. Cho et al. [7] uses a slightly different approach in their model and 3 RELATED WORK uses Gated Recurrent Units (GRU) as RNN components. The encoder The first works about commit message generation were published reads the source sequence sequentially and produces a hidden state, independently at the same time by Loyola et al. [18] and Jiang et al. denoted as the context vector. Decoding of a token can now be done [12]. Both approaches feature a similar attentional RNN encoder- based on the previously hidden state of the decoder, the previous decoder architecture. predicted token, and the generated context vector. The intuition Loyola et al. [18] use a vanilla encoder-decoder architecture, of this architecture is that it reduces information compression as similar to the architecture Iyer et al. [11] used for code summariza- each decoder step has access to the whole source sequence. The tion. The encoder network is simply a lookup table for the input decoder hidden states now only need to retrain information that token embedding. The decoder network is a RNN with dropout- was previously predicted. regularized long short-term memory (LSTM) cells. Dropout is also Still, the performance of this process suffers when input sen- used at the encoder layer and reduces the risk of overfitting on the tences start to increase and the information can not be compressed training data. A global attention model is used to help the decoder into hidden states [7]. Bahdanau et al. [4] therefore extended the focus on the most important parts of the diffs. encoder decoder model such that it learns to align and translate Jiang et al. [12] propose a more intricate architecture, where the jointly with the help of attention. At each time decoding step, the encoder network is also a RNN. This way, the token embedding model searches for a set of position in the source sentence where can be trained for better model performance. The authors do not the most relevant information is concentrated. The context vectors implement the network themselves, but instead use Nematus, a corresponding to these positions and the previous generated pre- specialized toolkit for neural machine translation [23]. Besides dicted tokens are then used for prediction of the target sequence. It using dropout in all layers, Nematus also uses the computationally is also possible to compute attention in different ways as shown by more efficient GRU cells instead of LSTM cells. Luong et al. [19]. Liu et al. [17] investigate the model and results of [12] and found that memorization is the largest contributor to their good results. 2.2 Evaluation Metrics For almost all correctly generated commit messages, a very similar BLEU [21] is the most frequently used similarity metric to evaluate commits was found in the training set. By removing the noisy com- the quality of machine translations. BLEU measures how many mits, the model performance drops by 55%. To illustrate the short- word sequences from the reference text occur in the generated comings, Liu et al. [17] propose NNGen, a naive nearest-neighbor text and uses a (slightly modified) n-gram precision to generate a based approach that re-uses commit messages from similar diffs.
Recommended publications
  • Version Control 101 Exported from Please Visit the Link for the Latest Version and the Best Typesetting
    Version Control 101 Exported from http://cepsltb4.curent.utk.edu/wiki/efficiency/vcs, please visit the link for the latest version and the best typesetting. Version Control 101 is created in the hope to minimize the regret from lost files or untracked changes. There are two things I regret. I should have learned Python instead of MATLAB, and I should have learned version control earlier. Version control is like a time machine. It allows you to go back in time and find out history files. You might have heard of GitHub and Git and probably how steep the learning curve is. Version control is not just Git. Dropbox can do version control as well, for a limited time. This tutorial will get you started with some version control concepts from Dropbox to Git for your needs. More importantly, some general rules are suggested to minimize the chance of file losses. Contents Version Control 101 .............................................................................................................................. 1 General Rules ................................................................................................................................... 2 Version Control for Files ................................................................................................................... 2 DropBox or Google Drive ............................................................................................................. 2 Version Control on Confluence ...................................................................................................
    [Show full text]
  • Introduction to Version Control with Git
    Warwick Research Software Engineering Introduction to Version Control with Git H. Ratcliffe and C.S. Brady Senior Research Software Engineers \The Angry Penguin", used under creative commons licence from Swantje Hess and Jannis Pohlmann. March 12, 2018 Contents 1 About these Notes1 2 Introduction to Version Control2 3 Basic Version Control with Git4 4 Releases and Versioning 11 Glossary 14 1 About these Notes These notes were written by H Ratcliffe and C S Brady, both Senior Research Software Engineers in the Scientific Computing Research Technology Platform at the University of Warwick for a series of Workshops first run in December 2017 at the University of Warwick. This document contains notes for a half-day session on version control, an essential part of the life of a software developer. This work, except where otherwise noted, is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Li- cense. To view a copy of this license, visit http://creativecommons.org/ licenses/by-nc-nd/4.0/. The notes may redistributed freely with attribution, but may not be used for commercial purposes nor altered or modified. The Angry Penguin and other reproduced material, is clearly marked in the text and is not included in this declaration. The notes were typeset in LATEXby H Ratcliffe. Errors can be reported to [email protected] 1.1 Other Useful Information Throughout these notes, we present snippets of code and pseudocode, in particular snippets of commands for shell, make, or git. These often contain parts which you should substitute with the relevant text you want to use.
    [Show full text]
  • Sistemas De Control De Versiones De Última Generación (DCA)
    Tema 10 - Sistemas de Control de Versiones de última generación (DCA) Antonio-M. Corbí Bellot Tema 10 - Sistemas de Control de Versiones de última generación (DCA) II HISTORIAL DE REVISIONES NÚMERO FECHA MODIFICACIONES NOMBRE Tema 10 - Sistemas de Control de Versiones de última generación (DCA) III Índice 1. ¿Qué es un Sistema de Control de Versiones (SCV)?1 2. ¿En qué consiste el control de versiones?1 3. Conceptos generales de los SCV (I) 1 4. Conceptos generales de los SCV (II) 2 5. Tipos de SCV. 2 6. Centralizados vs. Distribuidos en 90sg 2 7. ¿Qué opciones tenemos disponibles? 2 8. ¿Qué podemos hacer con un SCV? 3 9. Tipos de ramas 3 10. Formas de integrar una rama en otra (I)3 11. Formas de integrar una rama en otra (II)4 12. SCV’s con los que trabajaremos 4 13. Git (I) 5 14. Git (II) 5 15. Git (III) 5 16. Git (IV) 6 17. Git (V) 6 18. Git (VI) 7 19. Git (VII) 7 20. Git (VIII) 7 21. Git (IX) 8 22. Git (X) 8 23. Git (XI) 9 Tema 10 - Sistemas de Control de Versiones de última generación (DCA) IV 24. Git (XII) 9 25. Git (XIII) 9 26. Git (XIV) 10 27. Git (XV) 10 28. Git (XVI) 11 29. Git (XVII) 11 30. Git (XVIII) 12 31. Git (XIX) 12 32. Git. Vídeos relacionados 12 33. Mercurial (I) 12 34. Mercurial (II) 12 35. Mercurial (III) 13 36. Mercurial (IV) 13 37. Mercurial (V) 13 38. Mercurial (VI) 14 39.
    [Show full text]
  • Higher Inductive Types (Hits) Are a New Type Former!
    Git as a HIT Dan Licata Wesleyan University 1 1 Darcs Git as a HIT Dan Licata Wesleyan University 1 1 HITs 2 Generator for 2 equality of equality HITs Homotopy Type Theory is an extension of Agda/Coq based on connections with homotopy theory [Hofmann&Streicher,Awodey&Warren,Voevodsky,Lumsdaine,Garner&van den Berg] 2 Generator for 2 equality of equality HITs Homotopy Type Theory is an extension of Agda/Coq based on connections with homotopy theory [Hofmann&Streicher,Awodey&Warren,Voevodsky,Lumsdaine,Garner&van den Berg] Higher inductive types (HITs) are a new type former! 2 Generator for 2 equality of equality HITs Homotopy Type Theory is an extension of Agda/Coq based on connections with homotopy theory [Hofmann&Streicher,Awodey&Warren,Voevodsky,Lumsdaine,Garner&van den Berg] Higher inductive types (HITs) are a new type former! They were originally invented[Lumsdaine,Shulman,…] to model basic spaces (circle, spheres, the torus, …) and constructions in homotopy theory 2 Generator for 2 equality of equality HITs Homotopy Type Theory is an extension of Agda/Coq based on connections with homotopy theory [Hofmann&Streicher,Awodey&Warren,Voevodsky,Lumsdaine,Garner&van den Berg] Higher inductive types (HITs) are a new type former! They were originally invented[Lumsdaine,Shulman,…] to model basic spaces (circle, spheres, the torus, …) and constructions in homotopy theory But they have many other applications, including some programming ones! 2 Generator for 2 equality of equality Patches Patch a a 2c2 diff b d = < b c c --- > d 3 3 id a a b b
    [Show full text]
  • Homework 0: Account Setup for Course and Cloud FPGA Intro Questions
    Cloud FPGA Homework 0 Fall 2019 Homework 0 Jakub Szefer 2019/10/20 Please follow the three setup sections to create BitBucket git repository, install LATEX tools or setup Overleaf account, and get access to the course's git repository. Once you have these done, answer the questions that follow. Submit your solutions as a single PDF file generated from a template; more information is at end in the Submission Instructions section. Setup BitBucket git Repository This course will use git repositories for code development. Each student should setup a free BitBucket (https://bitbucket.org) account and create a git repository for the course. Please make the repository private and give WRITE access to your instructor ([email protected]). Please send the URL address of the repository to the instructor by e-mail. Make sure there is a README:md file in the repository (access to the repository will be tested by a script that tries to download the README:md from the repository address you share). Also, if you are using a Apple computer, please add :gitignore file which contains one line: :DS Store (to prevent the hidden :DS Store files from accidentally being added to the repository). If you have problems accessing BitBucket git from the command line, please see the Appendix. Setup LATEX and Overleaf Any written work (including this homework's solutions) will be submitted as PDF files generated using LATEX [1] from provided templates. Students can setup a free Overleaf (https://www. overleaf.com) account to edit LATEX files and generate PDFs online; or students can install LATEX tools on their computer.
    [Show full text]
  • Brno University of Technology Bulk Operation
    BRNO UNIVERSITY OF TECHNOLOGY VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ FACULTY OF INFORMATION TECHNOLOGY FAKULTA INFORMAČNÍCH TECHNOLOGIÍ DEPARTMENT OF INFORMATION SYSTEMS ÚSTAV INFORMAČNÍCH SYSTÉMŮ BULK OPERATION ORCHESTRATION IN MULTIREPO CI/CD ENVIRONMENTS HROMADNÁ ORCHESTRÁCIA V MULTIREPO CI/CD PROSTREDIACH MASTER’S THESIS DIPLOMOVÁ PRÁCE AUTHOR Bc. JAKUB VÍŠEK AUTOR PRÁCE SUPERVISOR Ing. MICHAL KOUTENSKÝ VEDOUCÍ PRÁCE BRNO 2021 Brno University of Technology Faculty of Information Technology Department of Information Systems (DIFS) Academic year 2020/2021 Master's Thesis Specification Student: Víšek Jakub, Bc. Programme: Information Technology and Artificial Intelligence Specializatio Computer Networks n: Title: Bulk Operation Orchestration in Multirepo CI/CD Environments Category: Networking Assignment: 1. Familiarize yourself with the principle of CI/CD and existing solutions. 2. Familiarize yourself with the multirepo approach to software development. 3. Analyze the shortcomings of existing CI/CD solutions in the context of multirepo development with regard to user comfort. Focus on scheduling and deploying bulk operations on multiple interdependent repositories as part of a single logical branching pipeline. 4. Propose and design a solution to these shortcomings. 5. Implement said solution. 6. Test and evaluate the solution's functionality in a production environment. Recommended literature: Humble, Jez, and David Farley. Continuous delivery. Upper Saddle River, NJ: Addison- Wesley, 2011. Forsgren, Nicole, Jez Humble, and Gene Kim. Accelerate : building and scaling high performing technology organizations. Portland, OR: IT Revolution Press, 2018. Nicolas Brousse. 2019. The issue of monorepo and polyrepo in large enterprises. In Proceedings of the Conference Companion of the 3rd International Conference on Art, Science, and Engineering of Programming (Programming '19). Association for Computing Machinery, New York, NY, USA, Article 2, 1-4.
    [Show full text]
  • Version Control – Agile Workflow with Git/Github
    Version Control – Agile Workflow with Git/GitHub 19/20 November 2019 | Guido Trensch (JSC, SimLab Neuroscience) Content Motivation Version Control Systems (VCS) Understanding Git GitHub (Agile Workflow) References Forschungszentrum Jülich, JSC:SimLab Neuroscience 2 Content Motivation Version Control Systems (VCS) Understanding Git GitHub (Agile Workflow) References Forschungszentrum Jülich, JSC:SimLab Neuroscience 3 Motivation • Version control is one aspect of configuration management (CM). The main CM processes are concerned with: • System building • Preparing software for releases and keeping track of system versions. • Change management • Keeping track of requests for changes, working out the costs and impact. • Release management • Preparing software for releases and keeping track of system versions. • Version control • Keep track of different versions of software components and allow independent development. [Ian Sommerville,“Software Engineering”] Forschungszentrum Jülich, JSC:SimLab Neuroscience 4 Motivation • Keep track of different versions of software components • Identify, store, organize and control revisions and access to it • Essential for the organization of multi-developer projects is independent development • Ensure that changes made by different developers do not interfere with each other • Provide strategies to solve conflicts CONFLICT Alice Bob Forschungszentrum Jülich, JSC:SimLab Neuroscience 5 Content Motivation Version Control Systems (VCS) Understanding Git GitHub (Agile Workflow) References Forschungszentrum Jülich,
    [Show full text]
  • Making the Most of Git and Github
    Contributing to Erlang Making the Most of Git and GitHub Tom Preston-Werner Cofounder/CTO GitHub @mojombo Quick Git Overview Git is distributed Tom PJ Chris Git is snapshot-based The Codebase 1 Snapshots have zero or more parents 1 2 Branching in Git is easy 1 2 3 4 Merging in Git is easy too 1 2 3 5 4 A branch is just a pointer to a snapshot master 1 2 3 5 4 Branches move as new snapshots are taken master 1 2 3 5 6 4 Tags are like branches that never move master 1 2 3 5 6 4 v1.0.0 Contributing to Erlang Fork, Clone, and Configure Install and Configure Git git config --global user.name "Tom Preston-Werner" git config --global user.email [email protected] Sign up on GitHub Fork github.com/erlang/otp Copy your clone URL Clone the repo locally git clone [email protected]:mojombo/otp.git replace with your username Verify the clone worked $ cd otp $ ls AUTHORS bootstrap EPLICENCE configure.in INSTALL-CROSS.md erl-build-tool-vars.sh INSTALL-WIN32.md erts INSTALL.md lib Makefile.in make View the history $ git log Add a remote for the upstream (erlang/otp) $ git remote add upstream \ git://github.com/erlang/otp.git Repositories GitHub GitHub erlang/otp mojombo/otp upstream origin Local otp Create a branch List all branches $ git branch * dev Create a branch off of “dev” and switch to it $ git checkout -b mybranch Both branches now point to the same commit dev mybranch Make Changes Each commit should: Contain a single logical change Compile cleanly Not contain any cruft Have a good commit message Review your changes $ git status $ git diff Commit your
    [Show full text]
  • Colors in Bitbucket Pull Request
    Colors In Bitbucket Pull Request Ligulate Bay blueprints his hays craving gloomily. Drearier and anaglyphic Nero license almost windingly, though Constantinos divulgating his complaints limits. Anglophilic and compartmentalized Lamar exemplified her clippings eternalised plainly or caping valorously, is Kristopher geoidal? Specifically I needed to axe at route eager to pull them a tenant ID required to hustle up. The Blue Ocean UI has a navigation bar possess the toll of its interface, Azure Repos searches the designated folders in reading order confirm, but raise some differences. Additionally for GitHub pull requests this tooltip will show assignees labels reviewers and build status. While false disables it a pull. Be objective to smell a stride, and other cases can have? Configuring project version control settings. When pulling or. This pull list is being automatically deployed with Vercel. Best practice rules to bitbucket pull harness review coverage is a vulnerability. By bitbucket request in many files in revision list. Generally speaking I rebase at lest once for every pull request I slide on GitHub It today become wildly. Disconnected from pull request commits, color coding process a remote operations. The color tags option requires all tags support. Give teams bitbucket icon now displays files from the pull request sidebar, colors in bitbucket pull request, we consider including a repo authentication failures and. Is their question about Bitbucket Cloud? Bitbucket open pull requests Bitbucket open pull requests badge bitbucketpr-rawuserrepo Bitbucket Server open pull requests Bitbucket Server open pull. Wait awhile the browser to finish rendering before scrolling. Adds syntax highlight for pull requests Double click fabric a broad to deny all occurrences.
    [Show full text]
  • Create a Pull Request in Bitbucket
    Create A Pull Request In Bitbucket Waverley is unprofitably bombastic after longsome Joshuah swings his bentwood bounteously. Despiteous Hartwell fathomsbroaches forcibly. his advancements institutionalized growlingly. Barmiest Heywood scandalize some dulocracy after tacit Peyter From an effect is your own pull remote repo bitbucket create the event handler, the bitbucket opens the destination branch for a request, if i am facing is Let your pet see their branches, commit messages, and pull requests in context with their Jira issues. You listen also should the Commits tab at the top gave a skill request please see which commits are included, which provide helpful for reviewing big pull requests. Keep every team account to scramble with things, like tablet that pull then got approved, when the build finished, and negotiate more. Learn the basics of submitting a on request, merging, and more. Now we made ready just send me pull time from our seven branch. Awesome bitbucket cloud servers are some nifty solutions when pull request a pull. However, that story ids will show in the grasp on all specified stories. Workzone can move the trust request automatically when appropriate or a percentage of reviewers have approved andor on successful build results. To cost up the webhook and other integration parameters, you need two set although some options in Collaborator and in Bitbucket. Go ahead but add a quote into your choosing. If you delete your fork do you make a saw, the receiver can still decline your request ask the repository to pull back is gone. Many teams use Jira as the final source to truth of project management.
    [Show full text]
  • Scaling Git with Bitbucket Data Center
    Scaling Git with Bitbucket Data Center Considerations for large teams switching to Git Contents What is Git, why do I want it, and why is it hard to scale? 01 Scaling Git with Bitbucket Data Center 05 What about compliance? 11 Why choose Bitbucket Data Center? 13 01 What is Git, why do I want it, and why is it hard to scale? So. Your software team is expanding and taking on more high-value projects. That’s great news! The bad news, however, is that your centralized version control system isn’t really cutting it anymore. For growing IT organizations, Some of the key benefits Codebase safety moving to a distributed version control system is now of adopting Git are: Git is designed with maintaining the integrity considered an inevitable shift. This paper outlines some of managed source code as a top priority, using secure algorithms to preserve your code, change of the benefits of Git as a distributed version control system history, and traceability against both accidental and how Bitbucket Data Center can help your company scale and malicious change. Distributed development its Git-powered operations smoothly. Community Distributed development gives each developer a working copy of the full repository history, Git has become the expected version control making development faster by speeding up systems in many circles, and is very popular As software development increases in complexity, and the commit process and reducing developers’ among open source projects. This means its easy development teams become more globalized, centralized interdependence, as well as their dependence to take advantage of third party libraries and on a network connection.
    [Show full text]
  • Changeset-Based Topic Modeling of Software Repositories
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 1, NO. 1, MONTH YEAR 1 Changeset-Based Topic Modeling of Software Repositories Christopher S. Corley, Kostadin Damevski, Nicholas A. Kraft Abstract—The standard approach to applying text retrieval models to code repositories is to train models on documents representing program elements. However, code changes lead to model obsolescence and to the need to retrain the model from the latest snapshot. To address this, we previously introduced an approach that trains a model on documents representing changesets from a repository and demonstrated its feasibility for feature location. In this paper, we expand our work by investigating: a second task (developer identification), the effects of including different changeset parts in the model, the repository characteristics that affect the accuracy of our approach, and the effects of the time invariance assumption on evaluation results. Our results demonstrate that our approach is as accurate as the standard approach for projects with most changes localized to a subset of the code, but less accurate when changes are highly distributed throughout the code. Moreover, our results demonstrate that context and messages are key to the accuracy of changeset-based models and that the time invariance assumption has a statistically significant effect on evaluation results, providing overly-optimistic results. Our findings indicate that our approach is a suitable alternative to the standard approach, providing comparable accuracy while eliminating retraining costs. Index Terms—changesets; feature location; developer identification; program comprehension; mining software repositories; online topic modeling F 1 INTRODUCTION Online topic models, such as online LDA [7], natively sup- Researchers have identified numerous applications for text port the online addition of new documents, but they still retrieval (TR) models in facilitating software maintenance cannot accommodate modifications to existing documents.
    [Show full text]