Sequence Model Design for Code Completion in the Modern IDE
Gareth Ari Aye (Google Inc., Columbia University) and Gail E. Kaiser (Columbia University)

ABSTRACT

Code completion plays a prominent role in modern integrated development environments (IDEs). Machine learning has become ubiquitous in analogous natural language writing and search software, surfacing more relevant autocompletions and search suggestions in fewer keystrokes. Prior research has reported training high-accuracy, deep neural networks for modeling source code, but little attention has been given to the practical constraints imposed by interactive developer tools.

In particular, neural language models for source code modeling like the one described in Maybe Deep Neural Networks are the Best Choice for Modeling Source Code [20] are framed around code completion, but only report accuracy of next-token prediction. However, in order for a language model (LM) to work well within real-world code completion systems, it must also

• always make suggestions that produce valid code that type-checks, to support code completion's role in correctness checking,
• return instantaneous results to help programmers code more efficiently in fewer keystrokes, and
• be small enough to fit comfortably on disk and in memory on developer workstations, since virtually all modern IDEs run locally and support offline usage.

To meet these additional requirements, we propose a novel design for predicting top-k next tokens that combines static analysis' ability to enumerate all valid keywords and in-scope identifiers with the ability of a language model to place a probability distribution over them. Our model mixes character-level input representation with token output to represent out-of-vocabulary (OOV) tokens meaningfully and minimize prediction latency. OOV tokens can be predicted through detection of local repetition common in software. This design achieves state-of-art accuracy in source code modeling and fits the constraints imposed by real-world code completion implementations in modern IDEs.

CCS CONCEPTS

• Software and its engineering → Software maintenance tools

KEYWORDS

Machine learning, neural networks, software language models, naturalness, code completion, integrated development environments, software tools

1 INTRODUCTION

Code completion is a tremendously popular tool for coding assistance, implemented across a wide range of programming languages and environments. In An Empirical Investigation of Code Completion Usage by Professional Software Developers, Marasoiu et al. map out the diversity of use cases it fulfills for programmers, including correctness checking, typing assistance, and API search [24]. A study of programmers' behaviors within the Eclipse IDE found that autocomplete was used up to several times per minute [28], as often as copy-paste! Historically, completion suggestions have been based primarily on static analysis and, as a result, suffered from low relevance [9]. Applying the constraints imposed by a programming language's grammar and type system produces all valid suggestions but says nothing about which are likely.

1.1 Language modeling

In order to provide more relevant autocomplete results, researchers have looked to exploit the naturalness of source code through language modeling [14, 30]. Given a token sequence S = t_1 t_2 ... t_n, a language model (LM) estimates the probability p(S) as a product of conditional probabilities:

p(S) = \prod_{i=1}^{n} p(t_i | t_1, t_2, ..., t_{i-1})
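To make this factorization concrete, here is a minimal sketch (our illustration, not the paper's implementation) that estimates the conditional probabilities with a simple bigram model, truncating the context t_1, ..., t_{i-1} to just the preceding token:

```python
from collections import defaultdict

# Bigram approximation of the chain rule above: p(S) is estimated as
# the product of p(t_i | t_{i-1}) learned from raw co-occurrence counts.
class BigramLM:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        for tokens in sequences:
            for prev, curr in zip(tokens, tokens[1:]):
                self.counts[prev][curr] += 1

    def prob(self, prev, curr):
        # p(curr | prev) by maximum likelihood; 0.0 for unseen contexts.
        total = sum(self.counts[prev].values())
        return self.counts[prev][curr] / total if total else 0.0

    def sequence_prob(self, tokens):
        # Chain rule: multiply the conditional probability of each token.
        p = 1.0
        for prev, curr in zip(tokens, tokens[1:]):
            p *= self.prob(prev, curr)
        return p

lm = BigramLM()
lm.train([["for", "(", "var", "i", "=", "0", ";", "i", "<", "n", ";"]])
print(lm.sequence_prob(["i", "=", "0"]))  # 0.5 * 1.0 on this toy corpus
```

A real source code LM conditions on far longer contexts than a single token, which is what motivates the neural approaches discussed next.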
State-of-art LMs for natural language are typically based on recurrent neural networks (RNNs) and have shown remarkable prediction accuracy in tasks ranging from machine translation to speech recognition [25]. Figure 3 depicts the basic RNN configuration: an input sequence x_1, x_2, ..., x_n that the network learns to encode into hidden states h_1, h_2, ..., h_n and decode as output. Each hidden state is computed as a combination of the previous hidden state and the next input sequence value.

[Figure 3: Unfolded recurrent neural network]

In source code modeling, the input sequence x_1, x_2, ..., x_n consists of vectors representing the previous n tokens, abstract syntax tree (AST) nodes, characters, or even partial tokens. Commonly these inputs are represented by their integer index into an input vocabulary V, and the network learns a dimensionality reduction f(x) : ℝ^{|V|} → ℝ^m, for m << |V|, through an embedding layer. Whereas one-hot encoded vectors x, x' ∈ ℝ^{|V|} are mutually orthogonal (x ⊥ x'), the more compact vector space ℝ^m can capture semantic relationships among the transformed input vectors.

For the classification problem of selecting a high-probability next word x from a vocabulary V, the RNN is often connected to a softmax output and optimized to minimize cross-entropy loss. At a high level, the softmax probability function S(x) : ℝ^{|V|} → ℝ^{|V|} produces an output vector y such that ∑_{i=1}^{|V|} y_i = 1. This function allows the network to learn during training to decode the final RNN hidden state as probabilities of each word x ∈ V.
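As a concrete illustration of these two components (ours, with arbitrary sizes; not the paper's architecture), the sketch below implements the embedding lookup f(x) : ℝ^{|V|} → ℝ^m and a numerically stable softmax over the output vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10_000, 128      # |V| and m, with m << |V|

# Embedding layer: a learned |V| x m matrix. Looking up a row is
# equivalent to multiplying a one-hot vector in R^{|V|} by the matrix,
# but far cheaper.
E = rng.normal(size=(vocab_size, embed_dim))

def embed(token_ids):
    # f(x): R^{|V|} -> R^m for each input token index.
    return E[token_ids]

def softmax(logits):
    # S(x): R^{|V|} -> R^{|V|}; entries are positive and sum to 1.
    z = logits - logits.max()            # subtract max for stability
    exp = np.exp(z)
    return exp / exp.sum()

# Stand-in for the decoded final RNN hidden state.
logits = rng.normal(size=vocab_size)
probs = softmax(logits)
assert np.isclose(probs.sum(), 1.0)
print(int(np.argmax(probs)))             # index of the likeliest next token
```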
One of the most important choices in building an LM for source code is how this output vocabulary V is constructed. As in the input sequence vocabulary, it could range from scanner tokens to individual characters. The model's vocabulary has implications for what can be predicted and how quickly predictions can be made. For example, a character-level model with V = {0, 1, ..., 255} corresponding to ASCII characters can output any arbitrary alphanumeric word, but requires numerous prediction requests. On the other hand, choosing V to be the set of keywords for a given programming language makes it so that only keywords and nothing else can be predicted, but whole tokens can be predicted in a single request.

1.2 Modern IDE constraints

Popular IDEs such as Visual Studio, IntelliJ, and Eclipse have in common that support for various programming languages is provided through a plugin architecture. This enables programmers to augment their IDE with additional language-specific functionality by downloading and installing extensions. These plugins provide interactive functionality to assist programmers writing software and include features like syntax highlighting, reporting of errors and warnings, and code completion.

Since one of the values that autocomplete provides is typing assistance, developers are interested in instantaneous completion suggestions. The user experience literature states that completion results must arrive within 100 ms of user action to be perceived as instantaneous [26, 32]. This latency upper bound puts limits on LM size, the amount of processing that can be done before and after model predictions, and the number of predictions that can be made within an autocomplete request.

With regard to model size, deep neural networks for language modeling might contain hundreds of millions or even billions of parameters [19]. Since each parameter represents a decimal value, an LM can quickly exceed the memory and disk capacity of a developer machine.

[Figure 1: Completion suggestions before our ML ranking]
[Figure 2: Completion suggestions after our ML ranking]

Finally, programmers expect that accepting completion suggestions will, at the very least, produce programs that compile. As a matter of fact, Marasoiu et al. found in a study of Dart programmers' usage of code completion that developers often leverage autocomplete as a quick check of their code's correctness [24]. In general, an LM cannot guarantee valid output. Penalizing invalid suggestions more heavily than valid but incorrect ones at training time by incorporating type analysis is an option, but would only decrease the likelihood of invalid suggestions. To ensure that suggestions are always valid, the model should be asked to rank already validated tokens, or else any invalid suggestions it produces must be filtered out post-prediction (see the ranking sketch below).

1.3 Summary of contributions

• This work details and evaluates a design for incorporating the predictive power of language modeling within existing IDE code completion systems.
• We discuss and compare prior work on neural source code modeling to address the open vocabulary problem.
• State-of-art accuracy is reported for source code modeling on a dynamic programming language.
• The largest of three corpora we studied is made available along with our source code [3].

This paper is organized to first give a brief history of the research that influenced our design. We then delve into the specifics of the design with a focus on modeling. We cover input representation, neural architecture, and training configurations, contrasting the details and performance of a character-input model with a token-input model. The design section concludes with a discussion of how model results can be synthesized with the list of keywords and in-scope identifiers produced by type analysis. We then characterize each of the datasets used for model training and evaluation, and report prediction accuracy in the context of comparable source code modeling results. An additional benchmark for prediction latency is reported to demonstrate the fitness of our approach for the IDE setting. Lastly, along with a concluding discussion, we consider threats to the validity of our work.

2 BACKGROUND

The idea of leveraging machine learning to improve code completion suggestion ranking was proposed as early as 2009 in Bruch et al. [9].
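The ranking pattern referenced in Section 1.2, which this line of work builds on, can be sketched briefly. The names below (rank_completions, lm_prob) are our hypothetical illustration, not code from the paper or from Bruch et al.: static analysis enumerates the valid candidates, a learned model scores them, and the IDE surfaces the top-k.

```python
from typing import Callable, List

def rank_completions(
    valid_candidates: List[str],            # keywords + in-scope identifiers
    lm_prob: Callable[[str], float],        # p(token | context) from a model
    k: int = 5,
) -> List[str]:
    # Only statically valid tokens are ranked, so every suggestion
    # type-checks; the learned model contributes likelihood, not validity.
    scored = sorted(valid_candidates, key=lm_prob, reverse=True)
    return scored[:k]

# Toy usage with a stand-in probability table instead of a real model.
probs = {"toString": 0.45, "hashCode": 0.05, "length": 0.30, "isEmpty": 0.20}
print(rank_completions(list(probs), lambda t: probs.get(t, 0.0), k=3))
# -> ['toString', 'length', 'isEmpty']
```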