Improving Code Autocompletion with Transfer Learning
Wen Zhou, Seohyun Kim, Vijayaraghavan Murali, Gareth Ari Aye
Facebook Inc., Menlo Park, U.S.A.

Abstract—Software language models have achieved promising results predicting code completion usages, and several industry studies have described successful IDE integrations. Recently, accuracy in autocompletion prediction improved 12.8% [1] from training on a real-world dataset collected from programmers' IDE activity. But what if limited examples of IDE autocompletion in the target programming language are available for model training? In this paper, we investigate the efficacy of pretraining autocompletion models on non-IDE, non-autocompletion, and different-language example code sequences. We find that these unsupervised pretrainings improve model accuracy by over 50% on very small fine-tuning datasets and over 10% on 50k labeled examples. We confirm the real-world impact of these pretrainings in an online setting through A/B testing on thousands of IDE autocompletion users, finding that pretraining is responsible for increases of up to 6.63% in autocompletion usage.

Index Terms—Machine learning, neural networks, software language models, naturalness, code completion, integrated development environments, software tools

I. INTRODUCTION

Autocompletion is the most frequently used IDE feature [2]. Significant attention has been given to improving suggestion prediction through machine learning [3]–[6] by feeding code to models as a sequence of tokens or even AST nodes [7]. Figure 1 shows an example of autocomplete powered by deep learning in an IDE. Several recent studies [1] and [8] have demonstrated the strengths of real-world and weaknesses of synthetic datasets in training and evaluating autocompletion models. Concept drift between real-world and synthetic examples can be ameliorated by only showing models real-world autocompletion selections. But what if autocompletion examples from IDE usage in the target language are hard to come by?

Fig. 1: Example of autocomplete in an IDE.

Consider the various stages of code that could be used for autocompletion training data. Figure 2 shows the different code datasets. First, we have the code in the IDE: these are snapshots of code that is currently being written in an IDE. From here, there are two actions that provide two more datasets. The first is after a developer uploads the code to code review and it is accepted. This committed code in central version control provides a more permanent, mature snapshot of the code. The other is when a developer accepts autocompletion suggestions from the IDE. The selection, along with the surrounding context, is logged. This provides us with a supervised dataset for autocompletion.

Fig. 2: Different stages of code that could be used for autocompletion training data. The IDE dataset consists of snapshots of source code files collected during code authoring. Once a developer submits the code to code review, it becomes part of Facebook's version control data. From the IDE, a developer can also choose an autocompletion suggestion, which would be logged (along with the surrounding context code).

Even with a large user population, the tool's vendor may not be able to log autocompletion events to build a training dataset. But perhaps the vendor is able to collect random, non-autocompletion code sequences during code authoring. These would suffice to train a task-agnostic language model (LM) for autocompletion prediction, but under the assumption that tokens used in autocompletion follow the same distribution as tokens in arbitrary code sequences. One flaw in this assumption is that modern IDE autocompletion tools restrict the token types which can be suggested, commonly disallowing punctuation, literals, and variable declarations. Furthermore, several recent studies [1] [8] have shown a myriad of differences between random tokens sampled from source code and those used in autocompletion. Can knowledge from these unlabeled code sequences transfer to the autocompletion task?
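As a concrete illustration of the supervised autocompletion dataset described above (selections logged together with their surrounding context), a single training example might be represented as below. This sketch is purely hypothetical: the paper does not specify the logging schema, and the field names and values here are invented for illustration.

# Hypothetical representation of one logged autocompletion event; the actual
# schema used in this work is not described in this section.
from dataclasses import dataclass
from typing import List

@dataclass
class AutocompletionExample:
    context_tokens: List[str]   # code tokens preceding the completion point
    suggestions: List[str]      # candidates shown by the IDE
    selected: str               # the token the developer accepted (the label)

example = AutocompletionExample(
    context_tokens=["response", "=", "requests", "."],
    suggestions=["get", "post", "put"],
    selected="get",
)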
On the other hand, a huge corpus of open-source code commits written by millions of developers and spanning all major programming languages is available on GitHub. There is an intuitive relationship between code commits and developers' IDE activity since a commit is the first code artifact that a developer will produce after code authoring. Can knowledge from these commits transfer to modeling code authoring behaviors?

Another potential cause for insufficient labeled data is a non-uniform programming language usage distribution. While there are undeniably shared concepts and constructs across programming languages, there are often major differences in syntax, programming style, and language constructs. Additionally, language-specific differences in IDE tools may impact programmers' autocompletion behaviors. At Facebook, we have a large developer activity dataset in the Hack programming language containing millions of real-world autocompletion examples. But there is also a long tail of less frequently used languages such as OCaml. These languages lack enough labeled data for robust model training, but they are still used by a sufficient developer population to warrant autocompletion support. Can knowledge from the more popular languages transfer to others where labeled data is insufficient?

In this paper, we explore these questions regarding transfer learning for autocompletion empirically. Advances in Transformer-based neural models [9]–[13] have popularized transfer learning in the deep learning community. Often transfer learning consists of "pre-training" such a model on large corpora of unlabeled data in an unsupervised manner and then "fine-tuning" on a (possibly smaller) corpus of labeled data. The latter fine-tuning corpus is typically drawn from the same distribution as the test dataset. In our study, the evaluation task is to predict IDE autocompletion selections made by programmers in a given language, and our fine-tuning datasets consist of real-world autocompletion events collected from IDE usage at Facebook. We explore the effect of transfer learning by pretraining models on non-IDE, non-autocompletion, and different programming language code sequences. Specifically, we answer the following research questions:

RQ1: How do autocompletion models benefit from combining unsupervised pretraining with task-specific fine-tuning? How does their performance improve across offline and online evaluation?

Our experimental results show that pretraining on code sequences collected during code authoring and fine-tuning on tokens selected through autocompletion produces models which outperform training on only one of these two datasets. We show further that such a model drives greater online tool usage.

RQ2: What is the impact of pretraining on a large source code dataset obtained from outside of code authoring? Can these pretrained software language models be fine-tuned on IDE autocompletion to achieve better accuracy with fewer real-world examples?

We show that starting from a model pretrained on files appearing in version control commits decreases the number of real-world autocompletion examples required to achieve high accuracy. Additionally, our experiments suggest that there is diminishing marginal benefit to pretraining as the number of real-world examples grows.

RQ3: Consider the case where a large training corpus is available in one language but not another. Can pretraining a multilingual model on the language with more training data benefit the language with less data?

To answer this question, we pretrain a multilingual model on examples from one language before fine-tuning on examples from another. Our results show that fewer target-language examples are required to achieve high accuracy after pretraining on different-language examples. We again observe diminishing marginal benefit to pretraining on different-language examples as the number of available target-language examples grows.
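To make the pretrain-then-fine-tune setup above concrete, the following is a minimal, illustrative PyTorch sketch of a causal software language model trained first with a next-token objective on unlabeled code sequences and then fine-tuned on logged autocompletion selections. The model, its size, the hyperparameters, and the data stand-ins (TinyCausalLM, ide_corpus, autocomplete_corpus) are hypothetical; this is not the authors' production pipeline, which uses GPT-2 and BART as described in the Contributions below.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, D_MODEL, SEQ_LEN = 32_000, 256, 128

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: -inf above the diagonal so each position attends only
        # to itself and earlier positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.encoder(self.embed(tokens), mask=mask)
        return self.head(hidden)

def next_token_loss(model, batch):
    # Language-modeling objective: predict token t+1 from tokens 1..t.
    logits = model(batch[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), batch[:, 1:].reshape(-1))

def train(model, batches, lr):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for batch in batches:
        optimizer.zero_grad()
        next_token_loss(model, batch).backward()
        optimizer.step()

# Random stand-ins for batches of token ids; in practice these would be built
# from the IDE code sequences and the logged autocompletion selections.
ide_corpus = [torch.randint(0, VOCAB_SIZE, (8, SEQ_LEN)) for _ in range(50)]
autocomplete_corpus = [torch.randint(0, VOCAB_SIZE, (8, SEQ_LEN)) for _ in range(10)]

model = TinyCausalLM()
train(model, ide_corpus, lr=1e-4)           # 1) unsupervised pretraining
train(model, autocomplete_corpus, lr=1e-5)  # 2) task-specific fine-tuning

The key point the sketch illustrates is that both stages optimize the same next-token objective over the same model weights; only the data distribution (and typically the learning rate and dataset size) changes between pretraining and fine-tuning.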
Contributions

1) We pretrain two transformer software language models, GPT-2 [9] and BART [11], on source code files obtained from version control commits and show how their performance on autocompletion prediction improves through fine-tuning on real-world IDE code sequences by 2.18%.
2) The GPT-2 model is trained on two real-world datasets: code sequences logged during IDE authoring and autocompletion selections. A third variant is pretrained on the former and fine-tuned on the latter corpus to demonstrate how the combination of pretraining and task-specific fine-tuning leads to a superior model, outperforming the base model by 3.29%.
3) We show that pretraining on a different programming language boosts accuracy by 13.1% when comparing a model pretrained on Hack examples and fine-tuned on 10k Python examples versus only training on Python examples.
4) We prove that improvements across these three transfer learning dimensions (task, domain, and language) translate into increased autocompletion tool usage by 3.86%, 6.63%, and 4.64%, respectively, comparing these models through online A/B tests.

TABLE I: Various datasets used in this paper.

                       Python                      Hack
                 # tokens     # samples      # tokens     # samples
Autocompletion   543,401,684  3,201,299      642,886,605  3,792,485
IDE              540,200,385  3,201,299      639,094,120  3,792,485
Commit           -            -              629,192,335  3,452,434

Outline

The rest of this paper is organized to first introduce our experimental setup in Section II. In this section we describe the corpora, language models, and evaluation methodologies employed in our study. Section III reports experimental results, supporting the answers to our research questions. Section IV, Section V, and Section VI discuss threats to validity, related work, and future work, respectively. Finally, Section VII concludes.

B. Tokenization

One difficulty in modeling source code as compared to natural language is that code introduces new vocabulary at a far higher rate [15]. Recognizing and predicting rare and novel tokens from an open vocabulary poses a challenge when our models are trained to recognize a fixed set of terms.
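One common mitigation for this open-vocabulary problem in software language modeling, not necessarily the approach taken in this paper, is subword segmentation: a rare or previously unseen identifier is decomposed into smaller units the model has already learned. The sketch below is an illustrative greedy longest-match segmenter over a tiny, hypothetical subword vocabulary; it is not the tokenizer used by the authors.

# Illustrative sketch only: greedy longest-match subword segmentation over a
# tiny, hypothetical vocabulary. It shows how a rare identifier can be broken
# into known pieces instead of mapping to a single out-of-vocabulary token.
SUBWORD_VOCAB = {
    "get", "set", "User", "Profile", "By", "Id", "Name", "_", "(", ")", ".",
}

def segment(token: str) -> list:
    """Split `token` into the longest known pieces, left to right,
    falling back to single characters for unknown spans."""
    pieces, i = [], 0
    while i < len(token):
        match = None
        # Try the longest candidate piece first.
        for j in range(len(token), i, -1):
            if token[i:j] in SUBWORD_VOCAB:
                match = token[i:j]
                break
        if match is None:
            match = token[i]  # unknown character: emit it on its own
        pieces.append(match)
        i += len(match)
    return pieces

# A rare identifier the model has likely never seen as a whole token...
print(segment("getUserProfileById"))  # ['get', 'User', 'Profile', 'By', 'Id']
# ...and one containing spans outside the vocabulary.
print(segment("getUserXid"))          # ['get', 'User', 'X', 'i', 'd']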