Generating Commit Messages from Git Diffs
Sven van Hal, Mathieu Post and Kasper Wendel
Delft University of Technology

ABSTRACT
Commit messages aid developers in their understanding of a continuously evolving codebase. However, developers do not always document code changes properly. Automatically generating commit messages would relieve this burden on developers.

Recently, a number of different works have demonstrated the feasibility of using methods from neural machine translation to generate commit messages. This work aims to reproduce a prominent research paper in this field, as well as attempt to improve upon their results by proposing a novel preprocessing technique.

A reproduction of the reference neural machine translation model was able to achieve slightly better results on the same dataset. When applying more rigorous preprocessing, however, the performance dropped significantly. This demonstrates the inherent shortcoming of current commit message generation models, which perform well by memorizing certain constructs.

Future research directions might include improving diff embeddings and focusing on specific groups of commits.

KEYWORDS
Commit Message Generation, Software Engineering, Sequence-to-Sequence, Neural Machine Translation

1 INTRODUCTION
Software development is a continuous process: developers incrementally add, change or remove code and store software revisions in a Version Control System (VCS). Each changeset (or: diff) is provided with a commit message, which is a short, human-readable summary of the what and why of the change [6].

Commit messages document the development process and aid developers in their understanding of the state of the software and its evolution over time. However, developers do not always properly document code changes, or they write commit messages of low quality [9]. This applies to code documentation in general and negatively impacts developer performance [22]. Automatically generating high-quality commit messages from diffs would relieve developers of the burden of writing commit messages and improve the documentation of the codebase.

As demonstrated by Buse and Weimer [6], Cortes-Coy et al. [8] and Shen et al. [24], predefined rules and text templates can be used to generate commit messages that are largely preferred to developer-written messages in terms of descriptiveness. However, the main drawbacks of these methods are that 1) only the what of a change is described, 2) the generated messages are too comprehensive to replace short commit messages, and 3) these methods neither scale nor generalize to unseen constructions, because of the use of hand-crafted templates and rules.

According to a recent study by Jiang and McMillan [13], code changes and commit messages exhibit distinct patterns that can be exploited by machine learning. The hypothesis is that methods based on machine learning, given enough training data, are able to extract more contextual information and latent factors about the why of a change. Furthermore, Allamanis et al. [1] state that source code is "a form of human communication [and] has similar statistical properties to natural language corpora". Following the success of (deep) machine learning in the field of natural language processing, neural networks seem promising for automated commit message generation as well.

Jiang et al. [12] have demonstrated that generating commit messages with neural networks is feasible. This work aims to reproduce the results from [12] on the same and a different dataset. Additionally, efforts are made to improve upon these results by applying a different data processing technique. More specifically, the following research questions will be answered:

RQ1: Can the results from Jiang et al. [12] be reproduced?
RQ2: Does a more rigorous dataset processing technique improve the results of the neural model?

This paper is structured as follows. In Section 2, background information about deep neural networks and neural machine translation is covered. In Section 3, the state-of-the-art in commit message generation from source code changes is reviewed. Section 4 describes the implementation of the neural model. Section 5 covers preprocessing techniques and analyzes the resulting dataset characteristics. Section 6 presents the evaluation results and Section 7 discusses the performance and limitations. Section 8 summarizes the findings and points to promising future research directions.

2 BACKGROUND

2.1 Neural Machine Translation
A recent development in deep learning is sequence-to-sequence learning for Neural Machine Translation [7, 25]. Translation can be seen as a probabilistic process, where the goal is to find a target sentence y = (y_1, ..., y_n) from a source sequence x = (x_1, ..., x_m) that maximizes the conditional probability of y given the source sentence x, mathematically depicted as arg max_y p(y | x) [4]. This conditional distribution is of course not given and has to be learned by a model from the supplied data. Sutskever et al. [25] and Cho et al. [7] have both proposed a structure to learn this distribution: a model that consists of an encoder and a decoder component that are trained simultaneously. The encoder component tries to encode the variable-length input sequence x into a fixed-length hidden state vector. This hidden state is then supplied to the decoder component, which tries to decode it into the variable-length target sequence y. A Recurrent Neural Network (RNN) is used here to sequentially read the variable-length sequence and produce a fixed-size vector.

Over the years, different architectural changes to the encoder and decoder components were proposed. Sutskever et al. [25] introduce a multi-layered RNN with Long Short-Term Memory (LSTM) units, where both the encoder and decoder consist of multiple layers (4 in their research). Each layer in the encoder produces a fixed-size hidden state that is passed on to the corresponding layer in the decoder, where the results are combined into a target sequence prediction. One unexplainable factor noted by the authors of this architecture is that it produces better results if the source sequence is reversed. Note that in each step during decoding, the LSTM only has access to the hidden state from the previous timestep and the previously predicted token.

Cho et al. [7] use a slightly different approach in their model and use Gated Recurrent Units (GRU) as RNN components. The encoder reads the source sequence sequentially and produces a hidden state, denoted as the context vector. Decoding of a token can now be done based on the previous hidden state of the decoder, the previously predicted token, and the generated context vector. The intuition of this architecture is that it reduces information compression, as each decoder step has access to the whole source sequence. The decoder hidden states now only need to retain information that was previously predicted.

Still, the performance of this process suffers when input sentences grow longer and the information can no longer be compressed into the hidden states [7]. Bahdanau et al. [4] therefore extended the encoder-decoder model such that it learns to align and translate jointly with the help of attention. At each decoding step, the model searches for a set of positions in the source sentence where the most relevant information is concentrated. The context vectors corresponding to these positions and the previously generated tokens are then used for the prediction of the target sequence. It is also possible to compute attention in different ways, as shown by Luong et al. [19].
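To make the encoder-decoder idea concrete, the sketch below shows a minimal attentional GRU encoder-decoder in PyTorch. This is an illustrative sketch under assumed hyperparameters and class names (Encoder, AttnDecoder), not the architecture used by Jiang et al. [12] or Loyola et al. [18]; the scoring layer is a simplified stand-in for the alignment model of Bahdanau et al. [4].

```python
# Minimal sketch of an attentional GRU encoder-decoder (PyTorch assumed).
# Names and sizes are illustrative, not taken from the reproduced paper.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                       # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embed(src))
        return outputs, hidden                    # per-token states + final state

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.Linear(2 * hid_dim, 1)     # simplified additive-style scoring
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, hidden, enc_outputs):
        # prev_token: (batch, 1), hidden: (1, batch, hid), enc_outputs: (batch, src_len, hid)
        src_len = enc_outputs.size(1)
        dec_state = hidden[-1].unsqueeze(1).expand(-1, src_len, -1)
        scores = self.attn(torch.cat([dec_state, enc_outputs], dim=2))
        weights = torch.softmax(scores, dim=1)            # attention over source positions
        context = (weights * enc_outputs).sum(dim=1, keepdim=True)
        rnn_in = torch.cat([self.embed(prev_token), context], dim=2)
        output, hidden = self.rnn(rnn_in, hidden)
        return self.out(output.squeeze(1)), hidden        # logits over target vocabulary

# Toy usage: encode two dummy "diffs" and take one decoding step.
enc, dec = Encoder(vocab_size=1000), AttnDecoder(vocab_size=1000)
src = torch.randint(0, 1000, (2, 7))
enc_out, hidden = enc(src)
logits, hidden = dec(torch.zeros(2, 1, dtype=torch.long), hidden, enc_out)
```

During decoding, one would feed the highest-scoring (or sampled) token back in as prev_token and repeat until an end-of-sequence token is produced.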
2.2 Evaluation Metrics
BLEU [21] is the most frequently used similarity metric to evaluate the quality of machine translations. BLEU measures how many word sequences from the reference text occur in the generated text and uses a (slightly modified) n-gram precision to generate a similarity score. The METEOR metric [5] addresses a number of issues with BLEU, such as the fact that sentences have to be identical to get the highest score and that a higher BLEU score does not always equal a better translation. The metric is computed using "a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference" [5].
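As an illustration of how such a score is computed in practice, the snippet below uses NLTK's sentence_bleu on a made-up generated commit message and reference; the messages and the choice of smoothing are assumptions for the example, not part of the paper's evaluation setup.

```python
# Hedged illustration of BLEU scoring with NLTK; the commit messages are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "fix null pointer exception in parser".split()
candidate = "fix null pointer in parser".split()

# Smoothing avoids a zero score when some higher-order n-gram has no match.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```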
3 RELATED WORK
The first works on commit message generation were published independently at the same time by Loyola et al. [18] and Jiang et al. [12]. Both approaches feature a similar attentional RNN encoder-decoder architecture.

Loyola et al. [18] use a vanilla encoder-decoder architecture, similar to the architecture Iyer et al. [11] used for code summarization. The encoder network is simply a lookup table for the input token embedding. The decoder network is an RNN with dropout-regularized long short-term memory (LSTM) cells. Dropout is also used at the encoder layer and reduces the risk of overfitting on the training data. A global attention model is used to help the decoder focus on the most important parts of the diffs.

Jiang et al. [12] propose a more intricate architecture, where the encoder network is also an RNN. This way, the token embedding can be trained for better model performance. The authors do not implement the network themselves, but instead use Nematus, a specialized toolkit for neural machine translation [23]. Besides using dropout in all layers, Nematus also uses the computationally more efficient GRU cells instead of LSTM cells.

Liu et al. [17] investigated the model and results of [12] and found that memorization is the largest contributor to their good results. For almost all correctly generated commit messages, a very similar commit was found in the training set. After removing the noisy commits, the model performance drops by 55%. To illustrate the shortcomings, Liu et al. [17] propose NNGen, a naive nearest-neighbor-based approach that re-uses commit messages from similar diffs.
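To clarify the retrieval idea behind such a nearest-neighbor baseline, the sketch below represents diffs as bag-of-words vectors and reuses the commit message of the most similar training diff. It is a simplified illustration with toy data (scikit-learn is assumed), not the NNGen implementation, and it omits the re-ranking NNGen performs among the top retrieved candidates.

```python
# Simplified nearest-neighbor retrieval over diffs: reuse the message of the
# most similar training diff (cosine similarity on bag-of-words vectors).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_diffs = ["- int count = 0 + int count = 1",
               "- if (valid) + if (valid && ready)"]      # toy training diffs
train_msgs  = ["change default value of count",
               "add extra condition to validity check"]   # their commit messages

vectorizer = CountVectorizer().fit(train_diffs)
train_vecs = vectorizer.transform(train_diffs)

def nearest_neighbor_message(new_diff):
    sims = cosine_similarity(vectorizer.transform([new_diff]), train_vecs)
    return train_msgs[sims.argmax()]                      # reuse closest message

print(nearest_neighbor_message("- int count = 0 + int count = 2"))
```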