arXiv:1903.05734v1 [cs.SE] 13 Mar 2019

Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Rafael-Michael Karampatsis
The University of Edinburgh
Edinburgh, United Kingdom

Charles Sutton
Google AI, The University of Edinburgh and The Alan Turing Institute

ABSTRACT
Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. But traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. But the open vocabulary version of neural network language models for code has not been previously introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best in class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new project, resulting in increased test performance. We showcase our methodology on code corpora in three different languages of over a billion tokens each, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported.

CCS CONCEPTS
• Software and its engineering → Software maintenance tools

KEYWORDS
naturalness, language models, BPE, neural networks, code, software tools

This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.

1 INTRODUCTION
Large corpora of open source software projects present an opportunity for creating new software development tools based on machine learning [4]. Many of these methods are based on the hypothesis that much of software is natural, that is, because software is written for humans to read, it displays some of the same statistical properties as natural language. To quantify the degree of naturalness of a piece of software, Hindle et al. [40] propose the use of statistical language modeling. A language model (LM) is a probability distribution over strings; by training an LM on a large corpus of well-written code, we hope that the LM will assign high probability to new code that is similar to the training set, in other words, code that is well-written, easy to read, and natural. There is now a large literature on language modeling and naturalness of code [6, 12, 22, 37, 40, 58, 69].
Such models have enabled research on a broad suite of new software engineering tools. For example, the ability to automatically quantify the naturalness of software has enabled new tools for autocompletion [40, 64], improving code readability [2, 63], and program repair [62]. Furthermore, recent work in natural language processing (NLP) [24, 59] has shown that LMs and sequence models in general learn word embeddings, which can then be useful for downstream tasks, in the same way that older, word2vec-style continuous embeddings [54] have formed the foundation of important software engineering tools. Examples of such tools are suggesting readable function and class names [3], summarizing source code [5, 43], predicting bugs [60], detecting clones [73], comment generation [42], fixing syntactic errors [47], and variable de-obfuscation [9]. Therefore, improved LMs for code have the potential to enable improvements in a diverse variety of software engineering tools.
However, a provocative recent paper by Hellendoorn and Devanbu [37] argues that there is a key general challenge in deep learning models of code, which significantly hinders their usefulness. This is the out-of-vocabulary (OOV) problem: new identifier names are continuously invented by developers [6]. These appear as OOV tokens at test time and cannot be predicted by a language model with a fixed vocabulary, because they have not occurred in the training set. Although this problem also exists in natural language text, it is much more severe in code. For instance, the vocabulary size of 74046 tokens used in [37] covers only about 85% of identifier occurrences and 20% of distinct identifiers that appear in the test set. Therefore, standard methods from NLP for dealing with OOV tokens [17], such as introducing a single special token to represent them, still fall short of fully addressing the problem for code. Instead, Hellendoorn and Devanbu [37] present an n-gram LM with several code-specific extensions. This enhanced n-gram model shows improved performance over an off-the-shelf neural language model for code, a surprising result because in natural language, neural models consistently outperform n-gram models. Based on these improvements, Hellendoorn and Devanbu raise the provocative suggestion that deep models might not be the best choice for modeling source code.
In this paper, we argue that, to the contrary, perhaps deep networks are a good choice for modeling source code, because it is possible to overcome the limitations highlighted by [37]. More specifically, we address the key challenge of deep models of source code that was highlighted in the previous work, namely the OOV problem, by introducing an open-vocabulary neural language model for
s med sent atu- for ob- oc- ng bu ed p e f - - - source code. An open vocabulary model is not restricted to a fixed- based on the assumption that software is characterized by simi- sized vocabulary determined at training time; for example, some lar statistical properties to natural language; viz., the naturalness types of open vocabulary models predict novel tokens character- hypothesis [40]. It should be expected that software is repetitive by-character. Our open vocabulary model is based on the idea of and predictable. Indeed this successfully verified in [30]. Hindle et subword units, following previous work from neural machine trans- al. [40] introduced the naturalness hypothesis and showcased that lation [65]. Subword units are a way of combining the strengths of Java code is actually less entropic than a natural language (Eng- character-level and token-level models. Each subword unit is a se- lish). Nguyen et al. [58] augmented n-gram LMs with semantic in- quence of characters that occurs as a subsequence of some token formation such as the role of a token in the program, e.g., variable, in the training set; the model outputs a sequence of subword units operator, etc. On the other hand, other researchers attempted to instead of a sequence of tokens. Including all single characters as exploit the localness of code by augmenting n-gram models with a subword units will allow the model to predict all possible tokens, cache component [69]. The cache contains n-grams that previously so there is no need for special OOV handling. The vocabulary of appeared in the current file or in nearby ones, such as files from subword units in the model is inferred from the training set using the same package. The model was shown to offer lower entropies a compression-based heuristic called byte pair encoding (BPE). for code but not for English. Later, Hellendoorn and Devanbu [37] The use of subword unit NLMs has two main advantages for extended this idea to nested scopes, outperforming vanilla NLMs code: First, even for an OOV token t that has never occurred in and achieving best-in-class performance on a Java corpus. the training data, its subword units will have occurred in train- Beyond n-gram models, several other types of methods have ing, so the model can predict what will follow t based on what been employed to model code. A generative model for code called tended to follow its subword units in the training data. Second, probabilistic higher order grammar (PHOG) was introduced by [12], training of a subword unit NLM on large corpora corpora is much which generalizes probabilistic context free grammars [44]. Also, faster than a token level model, as a relatively smaller number of both simple RNNs were used in [74] and LSTMs [22] to learn an subword units can still lead to good performance. On large open- NLM for code. The LSTMs were shown to perform much better source corpora, we show that our model is indeed more effective than simple RNNs. This conclusion was later confirmed by Hel- for code language modeling than previous n-gram or neural mod- lendoorn and Devanbu [37]. Lastly, [48] attempt to improve code els. On Java, our model is able to achieve a predictive performance completion performance on OOV words by augmenting an RNN of 3.15 bits per token and 70.84% MRR in a cross project evalua- with a pointer network [72]. Their pointer network learns a copy tion. 
Simultaneously achieving predictive performance of 1.04 bits mechanism over the current context using attention that is use- per token and 81.16% MRR in a within project setting, This is the ful in code completion. Once an OOV token has been used once, best performance that we are aware of in the literature for a single the copy mechanism can learn to re-use it, but unlike our model it model. cannot predict its first usage and it is not designed to learn depen- In particular, our contributions can be summarised as follows: dencies between the OOV token and the next tokens as it learns no representations for OOV words, in contrast to our method. That • We present the first open-vocabulary neural said, the subword units learned by our model can be used by any (NLM) for source code. This is based on an automatic seg- sort of neural model for code, so it could be fruitful in future work mentation of code tokens into smaller subword units that to combine our approach with this previous work, for example by learns the common statistical internal patterns within iden- augmenting our network with a pointer network component. tifier names. Applications of code language models. Probabilistic code models • We show that our model outperforms previous neural and have enabled many applications in software engineering. One ex- n-gram LMs for code across three programming languages: ample is recommender systems aiming to aid developers in writing Java, C, and Python. For C and Python, we are the first to or maintaining code [51]. Hindle et al. used a token-level LM for showcase deep learning results for corpora of our size. In- code completion [40], while later, Franks et al. [29] improved on terestingly, we show that while the n-gram LMs are unable performance using the cache n-gram from [69] and built a code to improve their performance on larger training sets, neural suggestion tool for Eclipse [32]. Another application are recom- LMs are able to improve their performance given more data. mendation systems for variable, method, and class names sugges- • Ours is the largest neural LM for code reported in the litera- tion [2, 3, 5] that employ relevant code tokens as the LM context. ture, trained on 1.7 billion tokens, which is 107 times larger Campbell et al. [15] used n-gram language models to detect syn- than in previous work. tax error locations in Java code. Lastly, Ray et al. [62] showcased • We present a simple heuristic to adapt a NLM trained on that buggy code has on average lower probability than correct one a large corpus of projects to a single project of interest, an and that LMs can spot defects as effectively as popular tools like important scenario for practical development tools. This al- FindBugs. lows us to report results on the code maintenance scenario of [37], which was previously infeasible for NLMs. Out-of-vocabulary problem. Even after an LM has been trained on a large corpus of projects, many identifier names are still en- countered which are out of vocabulary (OOV) token that have not 2 RELATED WORK been previously encountered in the training set. These are called Language modelingfor code. Numerous researchers have applied neologisms by [3]. Indeed on our corpora we find that OOV words language modeling techniques on code corpora. These studies are are many times more common in code than in natural language. 
2 Traditional LMs are closed vocabulary, meaning that the vocabu- natural language, and found that it had close to state-of-the-art lary is fixed based on a training corpus, and they are unable to pre- performance. dict OOV words, which is an obvious issue for code. Apart from that, many identifiers appearing in the training set are rare. This 3 METHODOLOGY sparsity could potentially confuse the model or result in slower In this section, we present our neural LM for code based on sub- estimation of parameters. word units. We begin by giving a brief background on the neural To combat this problem, previous research in language model- language models we use (Section 3.1). Then we describe how we ing for code has segmented identifiers via a heuristic [3], which construct the vocabulary of subword units (Section 3.2), followed splits them on camel case and underscores. Even though the re- by a search procedure for producingk-best completions (Section 3.3), sulting segmentation has the ability to handle some neologisms, it and finally we discuss how we adapt the model on a new project is limited to only combinations of subtokens appearing in the train- (Section 3.4). ing set and thus unable to achieve an open vocabulary. Addition- ally, many of these subtokens are still infrequent, which hinders 3.1 Neural Language Model with GRU Cell the model’s ability to assign high scores to their compositions. A State-of-the-art LMs for natural language are currently based on re- separate line of work is that several studies have empirically com- current neural networks (RNNs) [50, 53, 67]. RNN language models pared different techniques for automatically splitting identifiers scan an input sequence forward one token at a time, predicting a [27, 39]. However, this work considers a different problem than us. distribution over each token given all of the previous ones. RNNs That previous work focuses on splitting identifiers into words in a with gated units, such as long short-term memory (LSTM) units way that matches human judgements. Although our approach also [41] and gated recurrent units (GRUs) [19], have been found to performs identifier splitting, our goal is to split identifiers in a way outperform other methods for language modeling [22, 67]. Intu- that improves a language model; in many cases, our subword units itively, the advantage of an RNN over older language models, such will be sequences of characters that are not words. In our applica- as n-gram language models, is that an n-gram model uses only a tion, it is of no interest whether our subword units correspond to short window of n tokens to predict the next token, whereas an words that humans recognize, because our subword units can be RNN can potentially take into account the entire previous history trivially reassembled into complete tokens before they are shown of the sequence. The advantage of gated units are that the gates to a developer. allow the network to learn when to forget information from the hidden state and take newer, more important information into ac- count [41]. Among different kinds of gated units, GRUs have been Open vocabulary language models. The occurrence of OOV en- shown to perform comparably to LSTMs across different applica- tries in test data is not a code specific problem but has also been tions [21]. In our initial experiments we found GRUs to slightly an in issue in NLP, especially in morphologically-rich languages, outperform LSTMs when trained on the Java corpus, so we use for example. 
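For contrast with the learned segmentation used in this paper, the conventional splitting heuristic discussed above can be sketched in a few lines of Python. This is only an illustration of camel-case and underscore splitting in the spirit of [3], not a reimplementation of that work's exact rules:

    import re

    def split_identifier(name):
        """Heuristic identifier splitting on underscores and camel case.
        Illustrative only; prior work's exact splitting rules may differ."""
        parts = []
        for chunk in name.split('_'):
            # runs of capitals (e.g. acronyms) or a capitalised/lowercase word
            parts.extend(re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z0-9]+', chunk))
        return parts

    print(split_identifier('getHTTPResponseCode'))  # ['get', 'HTTP', 'Response', 'Code']
    print(split_identifier('attribute_context'))    # ['attribute', 'context']

Such a splitter can only ever produce subtokens that already occur in the training data, which is exactly the limitation that the BPE-based open vocabulary removes.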
Alternative models have been proposed where the them in our model. vocabulary is open. Character language models are one such so- Our model is a single layer GRU NLM built upon subword units lution where each word is represented as a sequence of its char- which have been learned from BPE as described in Section 3.2. For acters. While recurrent NLMs have shown excellent performances each vocabulary entry we learn a continuous representation of 512 for character-level language modeling [38, 68], the performance of features, while the GRU state is of the same size. In all our experi- such models is usually worse than those built on the word level ments we used a learning rate of 0.1, dropout of 0.5 [66] and a max- [55]. In order for character model to capture the same dependen- imum of 50 training iterations using stochastic gradient descent cies as a token level one, the length of the sequences that gradients with a minibatch of 32 for the small training sets and a minibatch are calculated upon needs to be increased. This can cause train- size of 64 for the full training sets. After each iteration we tested ing to become more difficult due to the vanishing or exploding the network on a validation set and measured its cross entropy (see gradient problems, even for LSTM [41] and GRU [18] recurrent Section 5.1). If the cross entropy is larger than the previous epoch networks which are designed specifically to address gradient prob- then we halve the learning rate and this can happen for a maxi- lems. mum of 4 times, otherwise training stops. During training of the Another option that attempts to open the vocabulary is to rep- global model we unroll the GRU for 200 timesteps. Our implemen- resent words as a sequence of segments that when merged result tation is open source, written in Tensorflow [1] and it is available to the original word. For example, language models have been pro- in a public GitHub repository.1 posed at the morpheme level[49], the phone level [10], and the syllable level [55]. Other models combine character level models with a caching mechanism to reuse generated tokens [20, 46]. Fi- 3.2 Selecting Subword Units Using Byte Pair nally, other researchers have attempted to also learn the segmenta- Encoding tion on a text corpus for [65], which is the ap- Traditional language models in NLP most commonly operate at the proach that we build on in our work. Subword unit segmentations token level [22, 67], meaning that the RNN predicts one token at have been used to capture morphology [70] in a language model- a time. But for code, this strategy leads to large vocabulary sizes, ing task but the network’s output were on the word level and thus because identifiers in programming languages often correspond to unable output OOV entries. Simultaneously with our work, [52] independently developed a recent LM based on subword units for 1https://github.com/mast-group/OpenVocabCodeNLM 3 Java Code: procedure is called a merge operation (S1, S2) → S1S2. The algo- public AttributeContext(Method setter , Object value) rithm stops after a given maximum number of merge operations { this .value = value; to be performed is reached. We clarify that as in [65] we do not this .setter = setter; consider merging pairs that cross token boundaries, that is, where } the merged token would contain internally, so that every sub- Subword Units: word unit is a character subsequence of a token in the data. 
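To make the model of Section 3.1 concrete, the following is a minimal sketch of a single-layer GRU language model over subword units, written against the modern tf.keras API rather than the paper's released TensorFlow implementation; the vocabulary size shown is an example, and the minibatching, validation-based learning-rate halving, and unrolling details described above are omitted:

    import tensorflow as tf

    def build_subword_nlm(vocab_size, dim=512, dropout=0.5):
        # Single-layer GRU over BPE subword units: 512-dimensional embeddings
        # and GRU state, dropout 0.5, as described in Section 3.1.
        return tf.keras.Sequential([
            tf.keras.layers.Embedding(vocab_size, dim),
            tf.keras.layers.GRU(dim, return_sequences=True, dropout=dropout),
            tf.keras.layers.Dense(vocab_size),  # logits over the subword vocabulary
        ])

    model = build_subword_nlm(vocab_size=5000)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # initial learning rate 0.1
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )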
The public Attribute Context ( Method set ter final output of the algorithm is the new vocabulary, which contains , Object value ) { this . value all the initial characters plus the symbols created from the merge = value ; this . set ter = set ter ; } operations, and the ordered list of merge operations performed in each iteration. We run the BPE algorithm on a held out dataset of Figure 1: Subword units list for a Java function. projects that are separate from the training, validation, and test sets. We experimented with three different encoding sizes, i.e., the maximum number of merge operations: 2000, 5000, and 10000 op- entire phrases in natural language. Because the number of unique erations. identifiers increases with the size of the corpus [6], this problem To train the LM, we first segment the train, validation, and test makes it infeasible to train code LMs on large corpora. As we later sets using the learned encoding. To do this, we transform each to- illustrate in Section 6 the vocabulary of three different giga-token ken into a sequence of its characters, adding symbols after code corpora is an order larger than an equivalent English one. every token. Then we apply in order the merge operations from In our code LM, we address this problem by having the model BPE to merge the characters into subword units in the vocabulary.2 predict subword units rather than full tokens at each time step of Finally, we train and test a GRU LM in the usual way on the data the RNN. A subword unit is an n-gram of characters that appear that has been segmented into subword units. as a subsequence of some token in the corpus. An example of a Java source file segmented into subword units is shown in Figure 1. 3.3 Predicting Best k Tokens Notice that we include a special subword unit that marks the In an autocompletion setting, it might be desirable to present a end of a token, allowing us to convert from a sequence of subword ranked list of k predicted tokens rather than a single best predic- units back into a sequence of tokens. The subword units in the tion. But because our model is based on subword units, it is not model are chosen adaptively based on statistical frequency, as we completely trivial to generate top k predictions of full tokens, be- will describe shortly. The effect of this is that more common tokens, cause a single token could be made from many subword units. We like public in Figure 1 are assigned a full subword unit, whereas approximate these using a beam-search-like algorithm. If the beam less common tokens, like setter, are divided into smaller units is large enough the algorithm can give a good approximation of the that are individually more common. top-k complete tokens. The use of a subword unit LM for code has two potential advan- More specifically, the NLM defines a probability p(s1 ... sN ) for tages. First, because the model has a smaller vocabulary size, it may any sequence of subword units. The goal of the search procedure have better performance because of a reduced level of data spar- is: given a history s1 ... sN of subword units that already appear in sity. Second, the model can synthesize OOV tokens that have not a source file, predict the complete token that is most likely to occur been seen in the training data via the smaller subtoken units. The next. A complete token is a sequence of subword units w1 ... wM vocabulary of subword units is learned before training the NLM that comprise exactly one token: that is, wM ends with and by segmenting a corpus of code. 
This is done in such a way that none of the earlier subword units do. The goal of the search algo- more frequent character n-grams are more likely to be included in rithm is to find the k highest probability complete tokens, where the vocabulary of subwords units. This strategy results in a core we denote a single token as the sequence of units w1 ... wM , that vocabulary of subword units that occurs frequently across differ- maximize the model’s probability p(w1 ... wM |s1 ... sN ). Impor- ent projects and captures statistical patterns of characters within tantly, the length M of the new complete token is not fixed in ad- identifiers. vance, but the goal is to search over complete tokens of different In order to learn the segmentation we use a modification of byte length. pair encoding (BPE) [31]. BPE is a data compression algorithm that The algorithm is illustrated in Algorithm 1. Given a value of k iteratively finds the most frequent pair of bytes in the vocabulary and a beam size b, it starts by querying the model to obtain its pre- appearing in a given sequence, and then replaces it with a new dictions of possible subword units, ranked by probability; in our unused entry. Sennrich, Haddow, and Birch [65] first adapted the pseudocode, we assume that the model’s predict function returns a algorithm for word segmentation so that instead of merging pairs ranked list of a given size, and that V is the total size of the vocab- of bytes, it merges pairs of characters or character sequences. The ulary. The algorithm uses two priority queues: one called candidates learned segmentation was used in their neural translation system which ranks the sequences of subword units that still need to be ex- and resulted in improved translation of rare words. plored during the search, and one called bestTokens which contains The algorithm starts with a vocabulary containing all single the k highest probability complete tokens that have been expanded characters in the data set plus the symbol. All symbol pairs so far. Each candidate is a structure with two fields, text which is appearing in the vocabulary are counted and we then replace all the appearances of the most frequent pair (S1, S2) with a unique 2The BPE implementation that we used was taken from new single symbol S1S2, which we also add to the vocabulary. This https://github.com/rsennrich/subword-nmt 4 the concatenation of all the subword units in the candidate, and Algorithm 1 Predicts top k most likely tokens according to the prob which is the product of the of each subword unit model to follow the history of subword units s1 ... sN . extend in the candidate. The candidate class has an method which 1: procedure PredictTopK(model, s1 . . . sN , k, b, V ) updates both of these fields in order to add one additional subword 2: subwords, probs ← model.predict(V, s1 . . . sN ) unit to the end of the candidate. Both of the priority queues are 3: # Initialize priority queues of completed tokens sorted by the probability of the candidate. 4: bestTokens ← k highest probability tokens from subwords The main loop of the search is in lines 9-23. In each iteration, 5: candidates ← b highest probability non-tokens from subwords the algorithm pops the b best candidates from the candidates queue, 6: total ← sum(c.prob for c in bestTokens) expanding them and scoring their expansions, in which each can- 7: # Main search loop. Expand b best incomplete tokens didate is extended by one additional subword unit. 
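The merge-learning loop just described can be sketched directly in Python. The paper uses the subword-nmt implementation, so the code below is an illustration of the algorithm rather than the exact tool, and '</t>' merely stands in for the special end-of-token symbol (the actual marker glyph was lost in extraction here):

    import collections

    END = '</t>'  # stand-in for the end-of-token symbol

    def _merge(symbols, pair):
        # Replace every adjacent occurrence of `pair` with its concatenation.
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        return out

    def learn_bpe(token_counts, num_merges):
        """Learn BPE merge operations from a {token: count} dictionary."""
        vocab = {tuple(tok) + (END,): n for tok, n in token_counts.items()}
        merges = []
        for _ in range(num_merges):
            pairs = collections.Counter()
            for symbols, n in vocab.items():
                for pair in zip(symbols, symbols[1:]):
                    pairs[pair] += n
            if not pairs:
                break
            best = max(pairs, key=pairs.get)   # most frequent adjacent pair (S1, S2)
            merges.append(best)
            vocab = {tuple(_merge(symbols, best)): n for symbols, n in vocab.items()}
        return merges

    def segment(token, merges):
        """Apply the learned merge operations, in order, to a single token."""
        symbols = list(token) + [END]
        for pair in merges:
            symbols = _merge(symbols, pair)
        return symbols

    merges = learn_bpe({'setter': 10, 'getter': 7, 'set': 5, 'value': 12}, num_merges=10)
    print(segment('setter', merges))   # a list of subword units, the last ending with '</t>'

Because merges are learned and applied within single tokens, no merge can cross a token boundary, matching the constraint stated above.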
If an expansion 8: lowest ← min(t.prob for t in bestTokens) creates a token, that is, the new subword unit ends with , then 9: while termination criterion not met do it is pushed onto the token queue and the worst token is popped. 10: toExpand ← candidates.popNBest(b) This maintains the invariant that bestTokens has size k. If the new ex- 11: for all candidate ∈ toExpand do pansion is not a complete token, then it is pushed onto the candidates 12: subwords, probs ← model.predict(b, candidate.text) queue, where it can potentially be expanded in the next iteration. 13: for all w ∈ subwords do This search procedure is repeated until any of the following ter- 14: newCandidate ← candidate.extend(w, probs[w]) 15: if isToken(newCandidate) then mination criteria has been satisfied at line 9: 16: bestTokens.push(newCandidate) (a) The number of complete tokens that have been explored dur- 17: bestTokens.pop() # Retain top k ing the search exceeds a threshold (in our implementation, we 18: lowest ← min(t.prob for t in bestTokens) use tokensDone > 5000). 19: total ← total + newCandidate.prob (b) The cumulative probability of all the tokens that have been 20: tokensDone ← tokensDone + 1 explored exceeds the threshold, i.e. total > 0.8 21: else (c) A sufficient number of search iterations have been completed, 22: candidates.add(newCandidate) ← + 1 i.e. iters > 7. 23: iters iters (d) The probability of the best candidate is less than the worst cur- 24: return bestTokens rent complete top-k tokens, that is,

min{c.prob | c ∈ bestTokens} ≥ max{c.prob | c ∈ candidates)}. after testing on it. This series of updates is equivalent to a single training epoch on the new project. (In our evaluations in Section 6, Expanding a candidate cannot increase its probability, so at this we will split up the project files in such a way that we are never point we are guaranteed that no better complete tokens will be training on our test set.) We unroll the GRU for 20 time steps in- found in the remainder of the search. stead of 200 as in our global models, in order to update the param- These criteria ensure that the beam search always terminates. eters more frequently. Our choice of applying only one update is motivated by the following reasons. First, it is faster, allowing the 3.4 Dynamic adaptation to new projects model to quickly adapt to new identifiers in the project. Second, It is important to be able to quickly adapt a global LM, which has taking too many gradient steps over the new project could cause been trained on a diverse corpus of projects, to have better perfor- the LM to give too much weight to the new project, losing infor- mance on a new project of interest. We call this dynamic adaptation. mation about the large training set of the global model. For example, suppose that an organization distributes an IDE that 4 DATASETS contains a code LM trained on a large number of Github projects, and a separate company is using the IDE for a new project. The LM In our experiments we used code corpora from three popular pro- will be expected to perform worse on the new project [40], so for gramming languages: Java, C, and Python. Although these languages performance it would be desirable to retrain the model on the new are related, they also have differences that might be hypothesized project. However, for confidentiality reasons, the company may to affect the performance of LMs. Java was an obvious choice since well be unwilling to send their code back to the IDE vendor for re- it has extensively been used in related work [6, 22, 37, 40, 58, 69]. training the model. Although a within-project model might be an Unlike Java, C is not object oriented, and the language makes it 3 alternative, a single project does not provide much training data possible to write exceptionally terse code. Finally, Python is also for an NLM, especially in the early stages of a project. Therefore, object oriented but it is mainly a dynamic language with little use it would be desirable to have a fast way of adapting the LM to a of static typing. These differences between C and Python make new project, in a way that requires access only to the trained LM them interesting to consider alongside Java. and the source code of the new project. In principle, we could train For Java we used the Java Github corpus of Allamanis et al. from scratch a new model on both the original training set and the [6], which consists of more than 14000 popular open source Java new project, but this would be computationally expensive. projects. Following the procedure described in [6], the C corpus Instead, we use a simple method of dynamically adapting our was mined in [26] and the Python corpus was mined in [28]. For 4 global neural LMs to a new project. Given a new project, we start lexical analysis in Java we used the lexer implemented in [37] , with the global LM and update the model parameters by taking a 3For examples, see https://www.ioccc.org/. 
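A simplified Python sketch of the search in Algorithm 1 follows; the candidate bookkeeping and the termination thresholds mirror the description above, while the model interface (model.predict returning ranked (unit, probability) pairs and a model.vocab_size attribute) is an assumption made for illustration:

    import heapq
    from dataclasses import dataclass, field

    END = '</t>'  # end-of-token marker on the last subword unit of a complete token

    @dataclass(order=True)
    class Candidate:
        neg_prob: float                      # negated so heapq pops the most probable first
        units: tuple = field(compare=False)  # subword units predicted so far

        @property
        def prob(self):
            return -self.neg_prob

    def predict_top_k(model, history, k, b, max_iters=7, max_tokens=5000, prob_mass=0.8):
        """Approximate the k most probable complete tokens following `history`."""
        best, candidates = [], []
        for unit, p in model.predict(model.vocab_size, history):
            c = Candidate(-p, (unit,))
            (best if unit.endswith(END) else candidates).append(c)
        heapq.heapify(candidates)
        best = heapq.nsmallest(k, best)      # keep the k most probable complete tokens
        total, tokens_done, iters = sum(c.prob for c in best), len(best), 0

        while iters < max_iters and tokens_done < max_tokens and total < prob_mass:
            if not candidates or (best and candidates[0].prob <= min(c.prob for c in best)):
                break                        # no incomplete candidate can beat the kept tokens
            to_expand = [heapq.heappop(candidates) for _ in range(min(b, len(candidates)))]
            for cand in to_expand:
                for unit, p in model.predict(b, history + list(cand.units)):
                    new = Candidate(-(cand.prob * p), cand.units + (unit,))
                    if unit.endswith(END):
                        best = heapq.nsmallest(k, best + [new])
                        total += new.prob
                        tokens_done += 1
                    else:
                        heapq.heappush(candidates, new)
            iters += 1
        return [(''.join(c.units).replace(END, ''), c.prob) for c in best]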
single gradient step on each encountered sequence in the project 4https://github.com/SLP-team/SLP-Core 5 while C and Python code lexical analysis was performed via the Our subword unit models define a distribution over subword Pygments5 library. In Python, we do not add any special tokens to units rather than directly over tokens. To compute the cross en- represent whitespace. For all three languages, we preprocessed the tropy for subword unit models, we segment each token ti into sub- data by replacing occurrences of non-ASCII character sequences word units ti = wi1 ... wiM . Then we compute the product such as Chinese ideograms inside strings with a special token that M = did not occur elsewhere in the corpus. p(ti |t1,..., ti−1) Ö p(wim |t1,..., ti−1,wi1 ... wi,m−1), For Python and C we sampled 1% of the corpus for validation m=1 and 1% for testing. Another 10% of the corpus was sampled as a where the right hand side can be computed directly by the sub- separate data set upon which BPE was run to learn a subword en- word unit NLM. This probability allows us to compute the cross coding. The rest of the data was used for training. We also report entropy Hp (C). (The technical reason that this method is correct results on a smaller subset of 2% of our full training set. For Java, we is that discrete probability distributions are preserved under 1:1 used a slightly different procedure to make our experiment compa- correspondences.) rable to a previous study [37]. We divide the data into five subsets as in the other two languages. The validation and test sets are the 5.2 Extrinsic evaluation same as in [37], and our “small train” set is the same as their train- As an extrinsic performance measure, we report the performance ing set. To obtain the full Java train set, we collect all of the files in of our LMs on code completion, which is the task of predicting the Java Github corpus that do not occur in the validation or test each token in a test corpus given all of the previous tokens in the set. Of these, we sampled 1000 random projects for the subword file. To measure performance on this task, we use mean reciprocal encoding data set, and the remaining projects were used as the full rank (MRR). MRR has previously been used in a plethora of code train set. completion evaluations in relevant work [13, 37, 64, 69]. The recip- rocal rank of a query response is the multiplicative inverse of the 5 EVALUATION rank of the first correct answer. MRR is the average of reciprocal Our model was evaluated on both intrinsic and extrinsic evalua- ranks or results for a sample of queries Q defined as tion measures. An intrinsic metric judges the quality of an LM’s |Q | 1 1 predictions by themselves, in isolation from a larger task. On the MRR = . (2) |Q| Õ rank other hand, extrinsic methods perform indirect evaluation by as- i=1 i sessing how a model affects performance of some other task. We For example, if the correct suggestion always occurs at rank 2 then next describe the specific metrics used in our experiments. the MRR is 0.5 roughly, at rank 10 the MRR is 0.1, and so on. A simplified description of MRR is that it averages top-k prediction 5.1 Intrinsic evaluation accuracy across various k. In this specific scenario k ∈[1, 10] since A good language model should assign high probability to a real sen- the models output a list of top-10 best tokens. tence while simultaneously assigning a low probability to a wrong one. 
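Before turning to the evaluation metrics, the dynamic adaptation procedure of Section 3.4 can be sketched as a single pass of gradient updates over a project's files with a short unroll length. The sketch below uses tf.keras rather than the paper's original code; the evaluation hook, the data format, and the restoring of global parameters between projects are assumptions or elisions:

    import tensorflow as tf

    def adapt_to_project(model, project_files, seq_len=20, lr=0.1):
        """One adaptation pass: each file is scored first and only then used for
        gradient updates, so the model never trains on data before predicting it.
        `project_files` is assumed to be a list of integer subword-id sequences."""
        optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        for ids in project_files:
            # ... evaluate/predict on `ids` here, before updating on it ...
            for start in range(0, len(ids) - seq_len, seq_len):
                x = tf.constant([ids[start:start + seq_len]])
                y = tf.constant([ids[start + 1:start + seq_len + 1]])
                with tf.GradientTape() as tape:
                    loss = loss_fn(y, model(x, training=True))  # one step per 20-unit window
                grads = tape.gradient(loss, model.trainable_variables)
                optimizer.apply_gradients(zip(grads, model.trainable_variables))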
In many applications in software engineering, accurate prob- 5.3 Test Scenarios ability scoring is necessary, meaning that code fragments that are Our model was evaluated in three scenarios introduced in previous more likely to occur in human-written code should be assigned work [37], which they call static, dynamic, and maintenance set- higher probability. Precise scoring of code fragments is essential tings. Each setting simulates a different way of incorporating code for tasks like translating a program from one programming lan- LMs within an IDE. For all settings, the task is to predict each to- guage to another [45, 56], code completion [29, 64], and code syn- ken in the test set, but the training sets for each setting are slightly thesis from natural language and vice versa [7, 14, 23, 25, 57, 61]. different. The intrinsic metric that we use is cross entropy, which is a stan- dard measure employed in previous work. The cross entropy de- Static tests. The model is first trained on a fixed training corpus, and is later evaluated on a separate test dataset. Essentially this is fines a score over a a sequence of code tokens t1, t2, ..., t . For |C | a cross-project setting where the train, validation, and tests are all each token ti , the probability of each token is estimated using the disjoint from each other and contain separate projects. This sim- model under evaluation and it is denoted by p(ti |t1, ...,ti−1). Then the average per token entropy is defined as: ulates the setting where a single global LM is trained on a large corpus of projects and then deployed to many different customers |C | without any adaption. 1 = , ..., . Hp (C) − Õ logp(ti |t1 ti−1) (1) Dynamic tests. In this setting, the model is allowed to update its |C| = i 1 parameters after it has made predictions on files in the test set. In Cross entropy corresponds to the average number of bits required addition to the original training set, the model is allowed to retrain in every prediction. Thus lower values are better. This metric not on files in the test set, after it has been scored on its predictions. only takes into account whether the highest ranked prediction is Note that the model is required to make its predictions on the test- correct, but also rewards predictions with high confidence. ing file before the file is added to the training set, so that we are never training on test data. For our neural LMs, we adapt the model at test time using the procedure described in Section 3.4. After we 5http://pygments.org/docs/lexers/ have finished evaluating each test project we restore the model to 6 Table 1: Corpus statistics for each code corpus.
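Both evaluation metrics are straightforward to compute from the model's outputs. A small self-contained sketch, with illustrative function names and data layout:

    import math

    def token_log2_prob(subword_probs):
        # A token's probability is the product of its subword-unit probabilities,
        # so its log-probability is the sum of their logs.
        return sum(math.log2(p) for p in subword_probs)

    def cross_entropy_bits(per_token_log2_probs):
        # Eq. (1): average negative log2-probability per token (bits; lower is better).
        return -sum(per_token_log2_probs) / len(per_token_log2_probs)

    def mrr(ranks):
        # Eq. (2): mean reciprocal rank; `ranks` holds the 1-based rank of the correct
        # token in each top-10 list, or None when it is absent (reciprocal rank 0).
        return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

    # A token segmented into units with probabilities 0.5 and 0.25 has log2-probability -3.
    print(cross_entropy_bits([token_log2_prob([0.5, 0.25])]))  # 3.0
    print(mrr([1, 2, None, 10]))                               # (1 + 0.5 + 0 + 0.1) / 4 = 0.4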

Java C Python

Tokens Projects Tokens Projects Tokens Projects

FullTrain 1436.39M 13362 1685.71M 4601 1056.33M 27535 SmallTrain 15.74M 107 37.64M 177 20.55M 307 SubwordEncoding 64.84M 1000 241.38M 741 124.32M 2867 Validation 3.83M 36 21.97M 141 14.65M 520 Test 5.33M 38 20.88M 73 14.42M 190 the global cross-project one learned from the train set. This simu- models, so we chose to not include them in this comparison. Fol- lates a setting in which some files are available from the test project lowing the previous work, we evaluate the models on the Github of interest for dynamic adaptation. Java dataset (Section 4) using the evaluation framework described in Section 6. Software Maintenance tests. This scenario is perhaps the closest RQ2. Are subword unit NLMs effectively trainable on large code to real world usage. It simulates everyday development where a corpora, such as giga-token corpora? Does the additional training programmer makes small changes to existing code. In this setting, data yield significant performance improvements? Training on a larger the LMs are tested on one file at a time in the test set. For each file, corpuscan usuallybe expectedto improve themodel’sperformance, the full training set plus all other files in the test project apart from but this is not guaranteed, because the impact of more data tends the file of interest is used as training data. Because this requires to have diminishing returns, as after some point the model’s per- retraining the model once for each file in the test set, this scenario formance saturates and does not continue to improve with more was previously deemed infeasible for NLMs in [37]. data. This saturation point will be different for different models, so it is an interesting research question to ask whether neural LMs 6 RESEARCH QUESTIONS can make better use of large corpora than n-gram models, because When evaluating our models, we focus on the following research NLMs are more complex models that can take into account larger questions. amounts of context. RQ1. How does the performance of subunit neural LMs compare However, training on larger data uses more resources because to state-of-the-art LMs for code? In this RQ, we evaluate whether the amount of parameters that have to be estimated also grows the use of subword units allows the training of effective neural with the amount of training data. For example, n-gram language language models. Specifically, we compare subword unit NLMs to models need to estimate counts for every new n-gram that appears standard n-gram language models [40], cache LMs [69], state-of- in the training corpus, while NLMs need to learn input and out- the-art n-gram models with nested caching [37], and token-level put embedding matrices whose dimension scales with the vocab- NLMs [74]. Unfortunately, it was not possible to PHOG [12] be- ulary size. Token-level models are especially difficult to train on cause the public implementation of PHOG does not include the large corpora since they have particular difficulty with the huge code for the first stage that generates the HOG rules. This stage vocabularies. A huge vocabulary results in a massive model that is is necessary to be able to run the implementation on any dataset unable to fit in any modern GPU and would be too slow to train other than the one used in the original paper [12]. Their dataset is even with the help of softmax approximation techniques such as both smaller than our full dataset, and the split is based on a within noise contrastive estimation [36] and adaptive softmax [33]. 
Fur- project scenario, rather than the cross-project scenario that is the thermore, the use of approximation techniques could potentially focus of our paper, so it is not possible to address our RQs using cause a small decrease in the model’s performance. Character level their data. We also do not compare to the recent pointer network models will be too slow to train on very large training sets and im- augmented RNN [48] as it was evaluated on the dataset of [12] and practical to use. This problem is also especially relevant to code no implementation of their model is available. LMs, because the vocabulary size for a code corpus grows much As described in Section 3, we hypothesize that NLMs might out- more quickly than a natural language corpus. For example, the one perform all of the n-gram models because of their ability to take billion word benchmark corpus for English has a vocabulary of larger context into account, and that subword unit NLMs might 0.8 million words [16], while our similar size training set for Java perform better than token level NLMs because of the improved has a vocabulary of 6.5 million when in both cases all words are ability to handle OOV and rare words. We do not report results for discarded with count below 3. This huge difference in vocabulary character-level models since as discussed in Section 2 these models size is also true for other programming languages. Specifically, the have not been proved to offer improvement in NLP and their use training set used in our experiments for Java, C, and Python have is impractical. We also do not include results for subtoken models vocabulary sizes 10.5, 8, and 13 million respectively when no vo- segmented via the heuristic in [3] as in preliminary experiments, cabulary threshold is used. Additionally, while OOV rate is only we found that subtoken models were less effective than token level 0.32% for the English corpus test set, for the Java one, it is larger 7 than 13%. The model required for a token-level code corpus of this n-gram models, even the nested cache models of [37] that are de- size cannot fit in modern GPUs and is only trainable on multiple signed specifically for code. To specifically evaluate the effect of CPU threads using parallelisation and softmax approximation tech- relaxing the closed vocabulary assumption, we compare our open niques [16]. vocabulary NLM to a closed vocabulary one. The closed vocabu- RQ3. How does the performance of subword unit NLMs vary across lary NLM uses exactly the same architecture as our open vocabu- programming languages? In principle the learning methods for NLMs lary models (single layer GRU with the same model size and hy- are language agnostic; however, the majority of studies evaluate perparameters), but is trained on complete tokens rather than sub- only on Java code, so it is an important research question to verify word units. We note that the closed vocabulary NLM that we re- that current LM techniques for code are equally effective on other port has much better results than the NLM language model that is programming languages. One might hypothesize that the terseness used as a baseline in [37]; this is primarily because our model in- of C or the lack of static type information in Python would make corporates a fully connected hidden layer, and also dropout, which it more difficult to achieve good performance from an LM. We test has been shown to improve the performance of RNN LMs [50]. 
So this hypothesis by measuring cross entropy and MRR in the static our token-level NLM baseline is much harder to beat than those re- and dynamic setting across the three language corpora described ported in previous work. Even so, we find that our open vocabulary in Section 4. NLM has much better predictive performance than the closed vo- RQ4. Is the dynamic updating procedure effective at dynamically cabulary model. The difference in performance between the open updating subword unit NLMs to new projects? Past research has fo- and closed vocabulary NLMs is larger for the dynamic and mainte- cused on the strong locality that characterises code [40, 62, 69]. nance settings than for the static setting. We hypothesize that this As a consequence, we expect new projects to introduce many new is because in the open vocabulary model, dynamic adaptation can identifiers that do not appear even in a large cross-project corpus. help the model to learn patterns about OOV words in the test set; For this reason, it has been shown that n-gram models can bene- this is not possible for a model with a closed vocabulary. fit significantly from dynamically adapting to the test corpus, as We report the performance of the open vocabulary NLMs with described in Section 6. In this RQ we ask whether NLMs can also different vocabulary sizes, obtained after 2000, 5000, and 10000 BPE benefit from dynamic adaptation, and whether the procedure that merge operations. We see that performance is similar across the we introduce in Section 3.4 is effective at dynamically adapting different vocabulary sizes, indicating that a large vocabulary size NLMs to new projects. We test this hypothesis by comparing our is not required for good performance. dynamic adaption method for subword unit NLMs against two ad- Finally, following [37], note that we cannot report results for the vanced n-gram models that have been shown to benefit from adap- nested or cache n-gram models on the static setting because these tation: cache LMs [69] and nested cache LMs [37]. We use the dy- models make use of information from the test. Consequently, even namic and software maintenance settings described in Section 6, if we do not adapt their global model on the test set, their additional following [37]. A naive approach to the software maintenance set- components are always adapted on it. Also, following [37] we do ting would require retraining the model from scratch for every not report results from the closed vocabulary NLM on the mainte- file in the test corpus, which was rightly deemed to be infeasible nance setting due to the massive time and space requirements of for NLMs by [37]. Instead, we apply our dynamic adaptation pro- this experiment. cedure from Section 3.4, which is much more efficient because it Based on these results, we conclude that even when trained on a trains for only one epoch on each test file. relatively small corpus, open vocabulary NLMs are effective mod- els of code. Indeed, to our knowledge, our model has state of the 7 RESULTS art performance on this data set. 7.1 RQ1. Performance of models In Tables 2 and 3, we show the performance of the different mod- els on the static, dynamic, and maintenance settings for Java. Note 7.2 RQ2. Large Corpora that for cross entropy, lower numbers are better, whereas for MRR, When trained on larger corpora the performance of traditional n- higher numbers are better. 
The entropy results for the n-gram mod- gram models and their variations like the nested cache model gets els are copied from [37]; these are comparable to ours because we saturated and they are unable to effectively leverage the extra in- use the same training and test split as their work. For MRR evalu- formation [37]. In contrast, our model is able to better leverage the ation we used v0.1 of their model hosted in GitHub, which is the 6 increase in training data as shown in Tables 2 and 3. As expected version used in their reported experiments. In [37] the reported the entropy of our NLM decreased significantly, by about 1.5 bits MRR had been calculated on the entirety of the test set). While and MRR increased by about 6% for all encoding sizes in the static for their NLM baselines it was measured only on the first 1 mil- scenario when trained on the full corpus. Essentially, this means lion tokens of the test set. For this reason we recalculated MRR for that the additional training data helps our NLM learn to synthesize the above on the first 1 million tokens of our test set. The closed identifiers from subword units better and with higher confidence. vocabulary NLM is our own implementation. The improvements are smaller but still exist when the model From the tables it can be seen that on both metrics, our open vo- is dynamically adapted on a test project. For all encoding sizes the cabulary NLM has better predictive performance than any of the models improve by 0.5 bits in entropy and by about 2 to 3% in MRR. 6The nested cache n-gram implementation can be found in In contrast, the nested cache n-gram model entropy decreases by https://github.com/SLP-team/SLP-Core less than 0.1 bits and MRR less than 0.4%. Similar improvements for 8 Table 2: Performance on next token prediction, measured by Table 4: Training VRAM Requirements for our BPE NLM cross-entropy. with an encoding of 2000, 5000, and 10000 operations.

Table 2 (cross-entropy; caption above):

Model                       Static    Dynamic   Maintenance
Small Train
n-gram                      6.25      5.54      5.30
Nested n-gram [37]          -         3.65      2.94
Cache n-gram [69]           -         3.43      3.32
Nested Cache n-gram [37]    -         2.57      2.23
NLM (closed vocab)          4.30      3.07      -
Open Vocab NLM (2000)       4.90      2.33      1.46
Open Vocab NLM (5000)       4.78      2.27      1.51
Open Vocab NLM (10000)      4.77      2.54      1.60
Full Train
Nested Cache n-gram [37]    -         2.49      2.17
Open Vocab NLM (2000)       3.59      1.84      1.03
Open Vocab NLM (5000)       3.35      1.72      1.06
Open Vocab NLM (10000)      3.15      1.70      1.04

Table 4 (training VRAM requirements; caption above):

BPE Ops     VRAM Required
2000        1251MiB
5000        1387MiB
10000       2411MiB

Table 3: Performance of language models on suggesting the next token, measured by mean reciprocal rank (MRR).

Model                       Static    Dynamic   Maintenance
Small Train
n-gram                      53.16%    56.21%    58.32%
Nested n-gram [37]          -         66.66%    71.43%
Cache n-gram [69]           -         69.09%    70.23%
Nested Cache n-gram [37]    -         74.55%    77.04%
NLM (closed vocab)          62.35%    71.01%    -
Open Vocab NLM (2000)       62.87%    76.94%    77.48%
Open Vocab NLM (5000)       63.80%    77.51%    78.49%
Open Vocab NLM (10000)      63.75%    77.32%    78.69%
Full Train
Nested Cache n-gram [37]    -         75.02%    77.38%
Open Vocab NLM (2000)       68.96%    78.99%    78.85%
Open Vocab NLM (5000)       69.87%    79.88%    80.31%
Open Vocab NLM (10000)      70.84%    80.36%    81.16%

Table 5: Code completion VRAM requirements for our BPE NLM with an encoding of 2000, 5000, and 10000 operations.

BPE Ops     VRAM Required
2000        259MiB
5000        297MiB
10000       355MiB

the nested cache n-gram model were also reported in [37], supporting our findings. From that we conclude that subword unit NLMs can utilize a large code corpus better than n-gram models. However, if one lacks a model trained on a large corpus or there is not enough time to train one, then satisfactory performance can still be achieved by training on a small corpus.
In addition, we note that training on giga-token or larger corpora is scalable. Table 4 shows the VRAM requirements for training when a BPE segmentation with 2000, 5000, or 10000 operations and a batch size of 32 is used. Obviously, training time will be a lot larger than for the smaller set, but we see from the earlier results that the predictive performance of the open vocabulary models was stable across different vocabulary sizes. This means that we can use a relatively moderate vocabulary size with the open vocabulary NLM and still obtain good performance. As training is a one-off process that does not need to be repeated, and a pre-trained model can be downloaded and loaded on a GPU in a matter of seconds, this results in a model applicable in real time even when trained on huge corpora. More importantly, as reported by [37], a token level NLM with a vocabulary of only 76K tokens required a few days to be trained on the smaller training corpus. Training our model on the same data was a matter of less than 12 hours on a single GPU.
Lastly, our model can be integrated into IDEs to facilitate code completion, as queries for the next token can be answered in real time, and the required memory is less than that of training since batching is no longer necessary. The decreased memory requirements are illustrated in Table 5 and are less than 400 MB for any of the encodings we used.

7.3 RQ3. Multiple Languages
In Tables 6 and 7 we report the performance of our open vocabulary NLMs on Java, C, and Python. Tables 8 and 9 present the results for the dynamic setting. We see that performance on C and Python is at least as good as on Java, providing evidence that our methodology for training subword unit NLMs is indeed language agnostic. We caution the reader not to interpret these results as a comparison of the programming languages as to which is more predictable, which is more terse, etc. The first reason for this caution is that the training corpora have slightly different sizes across the different languages. Unfortunately, it does not seem possible to define a fair notion of "same training set size" across programming languages, because tokens in one language might be more informative than others, e.g.
Python code has a larger proportion of iden- pora is scalable. Table 4 shows the VRAM requirements for train- tifiers. Even if it were possible to do this, different languages have ing when a BPE segmentation with 2000, 5000, 10000 operations different standard libraries and are typically used to solve problems and a batch size of 32 is used. Obviously, training time will bealot in different domains. All of these concerns pose serious threats to larger than the smaller set, but we see from the earlier results that validity to any attempt to compare programming languages via lan- the model’s predictive performance of the open vocabulary mod- guage modeling, so we do not attempt to draw such conclusions in els was stable across different vocabulary sizes. This means that this work. 9 Table 6: Cross-entropy in the static scenario for Java, Python, Table 8: Cross-entropy in the dynamic scenario for Java, and C. Note that training set sizes vary between the lan- Python, and C. Note that training set sizes vary between the guages as this is not meant to be a comparison between languages as this is not meant to be a comparison between them. them.

Table 6 (cross-entropy, static scenario):

Model                       Java    C       Python
Small Train
Open Vocab NLM (2000)       4.90    4.61    4.28
Open Vocab NLM (5000)       4.78    4.40    4.03
Open Vocab NLM (10000)      4.77    4.32    3.95
Full Train
Open Vocab NLM (2000)       3.59    3.48    3.56
Open Vocab NLM (5000)       3.35    3.43    3.27
Open Vocab NLM (10000)      3.15    3.11    3.04

Table 8 (cross-entropy, dynamic scenario):

Model                       Java    C       Python
Small Train
Open Vocab NLM (2000)       2.33    1.79    3.28
Open Vocab NLM (5000)       2.27    1.74    3.16
Open Vocab NLM (10000)      2.54    1.72    3.13
Full Train
Open Vocab NLM (2000)       1.84    1.69    2.83
Open Vocab NLM (5000)       1.72    1.59    2.67
Open Vocab NLM (10000)      1.70    1.56    2.52

Table 7: MRR in the static scenario for Java, Python, and C. Note that training set sizes vary between the languages as this is not meant to be a comparison between them.

Table 9: MRR in the dynamic scenario for Java, Python, and C. Note that training set sizes vary between the languages as this is not meant to be a comparison between them.

Table 7 (MRR, static scenario):

Model                       Java      C         Python
Small Train
Open Vocab NLM (2000)       62.87%    63.27%    82.07%
Open Vocab NLM (5000)       63.80%    64.51%    81.56%
Open Vocab NLM (10000)      63.75%    66.09%    81.66%
Full Train
Open Vocab NLM (2000)       68.96%    67.19%    82.92%
Open Vocab NLM (5000)       69.87%    67.64%    83.18%
Open Vocab NLM (10000)      70.84%    70.35%    84.31%

Table 9 (MRR, dynamic scenario):

Model                       Java      C         Python
Small Train
Open Vocab NLM (2000)       76.94%    72.22%    86.27%
Open Vocab NLM (5000)       77.51%    72.76%    86.31%
Open Vocab NLM (10000)      77.32%    73.31%    86.28%
Full Train
Open Vocab NLM (2000)       78.99%    72.49%    86.51%
Open Vocab NLM (5000)       79.88%    73.63%    86.85%
Open Vocab NLM (10000)      80.36%    74.32%    87.06%

7.4 RQ4. Dynamic Adaptation

Finally, to evaluate the effect of the dynamic adaptation method for our subword unit NLMs, consider again the results in Tables 2 and 3. As [37] point out, it is straightforward to adapt an n-gram LM, because we can simply add and remove counts. Indeed, we see that all of the advanced n-gram models in the dynamic and maintenance settings perform better than any of the NLM models in the static setting. This result holds both for the small train set and for the full train set. In other words, the improvement due to dynamic adaptation is greater than the improvement due to an NLM. Once we apply the dynamic adaptation method to our Open Vocabulary NLM, however, the picture changes. With dynamic adaptation, our model achieves better cross-entropy than the current state-of-the-art [37]. From this we conclude that our dynamic adaptation method is indeed effective at fine-tuning a global subword unit NLM to a specific test project.

We note that evaluating NLMs in this scenario was previously deemed infeasible, since multiple models had to be created, each trained on the entirety of the test set minus one file. Nevertheless, the small size of our model allowed the experiments for this scenario to be completed in a few days. We achieved this by training each model only on information from the same project. For large test projects, we first split them into multiple partitions and for each one we trained a model on the rest. All files from the same partition can then load this model and need only train on the other files from the same partition. This strategy offered considerable speed gains. Furthermore, the experiment could be sped up significantly as parallelization is fairly easy and both memory and computation requirements are fairly small, thus achievable even with a single GPU.
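The partition-based scheme just described can be sketched as follows. This is an illustration we provide for clarity: fine_tune and evaluate are hypothetical stand-ins for the actual training and evaluation routines, and details such as file ordering and optimizer settings are omitted.

    def evaluate_with_dynamic_adaptation(base_model, project_files, k, fine_tune, evaluate):
        """Approximate leave-one-file-out adaptation on one test project.

        `fine_tune(model, files)` is assumed to return a new model adapted on
        `files`; `evaluate(model, file)` is assumed to return that file's score.
        """
        # Split the project's files into k partitions.
        partitions = [project_files[i::k] for i in range(k)]
        scores = []
        for p, partition in enumerate(partitions):
            # One adapted model per partition, trained on all other partitions.
            other_files = [f for q, part in enumerate(partitions) if q != p
                           for f in part]
            partition_model = fine_tune(base_model, other_files)
            for target_file in partition:
                # Each file only needs a further update on its partition-mates.
                neighbours = [f for f in partition if f != target_file]
                adapted = fine_tune(partition_model, neighbours)
                scores.append(evaluate(adapted, target_file))
        return sum(scores) / len(scores)

Because each partition's model is independent of the others, the per-partition loop parallelizes trivially, which is the source of the additional speed-up mentioned above.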
8 CONCLUSIONS

We have presented a new open-vocabulary neural language model for source code. By defining the model on subword units, which are character subsequences of tokens, the model is able to handle neologisms, that is, new identifier names which have not appeared in its training data, while keeping the size of the model relatively small. We are able to train a neural language model on over one billion tokens of code, a data set over a hundred times larger than had been used for previous neural LMs for code. On the problem of predicting the next token, the resulting model outperforms recent state-of-the-art models based on adding nested caches to n-gram language models. We hope that the simplicity of our model will allow advances in deep learning for code by allowing the implementation of more complex architectural ideas such as attention [8, 71]. Also, improved language models for code have the potential to enable new tools for aiding code readability [2], program repair [11, 15, 35, 62], program synthesis [34] and translation between programming languages [45, 56]. Finally, the general technique of using subword units is not limited to language modeling, but can easily be incorporated into many neural models of code tokens. Therefore, we hope that this idea could have broad application throughout software engineering, such as in models to suggest readable function and class names [3], summarizing source code [5, 43], predicting bugs [60], detecting code clones [73], comment generation [42], fixing syntactic errors [47], and variable de-obfuscation [9].

REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http://tensorflow.org/ Software available from tensorflow.org.
[2] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2014. Learning Natural Coding Conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). ACM, New York, NY, USA, 281–293. https://doi.org/10.1145/2635868.2635883
[3] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting Accurate Method and Class Names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 38–49. https://doi.org/10.1145/2786805.2786849
[4] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. Comput. Surveys (2018).
[5] Miltiadis Allamanis, Hao Peng, and Charles A. Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. CoRR abs/1602.03001 (2016). arXiv:1602.03001 http://arxiv.org/abs/1602.03001
[6] Miltiadis Allamanis and Charles Sutton. 2013. Mining Source Code Repositories at Massive Scale Using Language Modeling. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR '13). IEEE Press, Piscataway, NJ, USA, 207–216. http://dl.acm.org/citation.cfm?id=2487085.2487127
[7] Miltos Allamanis, Daniel Tarlow, Andrew Gordon, and Yi Wei. 2015. Bimodal Modelling of Source Code and Natural Language. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), David Blei and Francis Bach (Eds.). JMLR Workshop and Conference Proceedings, 2123–2132. http://jmlr.org/proceedings/papers/v37/allamanis15.pdf
[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014). http://arxiv.org/abs/1409.0473
[9] Rohan Bavishi, Michael Pradel, and Koushik Sen. 2018. Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts. CoRR abs/1809.05193 (2018). arXiv:1809.05193 http://arxiv.org/abs/1809.05193
[10] Issam Bazzi. 2002. Modelling Out-of-vocabulary Words for Robust Speech Recognition. Ph.D. Dissertation. Cambridge, MA, USA. AAI0804528.
[11] Sahil Bhatia and Rishabh Singh. 2016. Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks. CoRR abs/1603.06129 (2016). http://arxiv.org/abs/1603.06129
[12] Pavol Bielik, INF ETHZ, Veselin Raychev, and Martin Vechev. 2016. PHOG: Probabilistic Model for Code. Proceedings of the 33rd International Conference on Machine Learning (ICML-16) (2016).
[13] Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from Examples to Improve Code Completion Systems. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering (ESEC/FSE '09). ACM, New York, NY, USA, 213–222. https://doi.org/10.1145/1595696.1595728
[14] Rudy Bunel, Matthew J. Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. 2018. Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis. CoRR abs/1805.04276 (2018). arXiv:1805.04276 http://arxiv.org/abs/1805.04276
[15] Joshua Charles Campbell, Abram Hindle, and José Nelson Amaral. 2014. Syntax Errors Just Aren't Natural: Improving Error Reporting with Language Models. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014). ACM, New York, NY, USA, 252–261. https://doi.org/10.1145/2597073.2597102
[16] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. 2013. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. CoRR abs/1312.3005 (2013). arXiv:1312.3005 http://arxiv.org/abs/1312.3005
[17] Stanley F. Chen and Joshua Goodman. 1996. An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics (ACL '96). Association for Computational Linguistics, Stroudsburg, PA, USA, 310–318. https://doi.org/10.3115/981863.981904
[18] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. CoRR abs/1409.1259 (2014). http://arxiv.org/abs/1409.1259
[19] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. CoRR abs/1406.1078 (2014). http://arxiv.org/abs/1406.1078
[20] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2016. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704 (2016).
[21] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR abs/1412.3555 (2014). http://arxiv.org/abs/1412.3555
[22] Hoa Khanh Dam, Truyen Tran, and Trang Pham. 2016. A deep language model for software code. arXiv preprint arXiv:1608.02715 (2016).
[23] Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, Sailesh R, and Subhajit Roy. 2016. Program Synthesis using Natural Language. CoRR abs/1509.00413 (2016). arXiv:1509.00413 http://arxiv.org/abs/1509.00413
[24] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
[25] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdelrahman Mohamed, and Pushmeet Kohli. 2017. RobustFill: Neural Program Learning under Noisy I/O. CoRR abs/1703.07469 (2017). arXiv:1703.07469 http://arxiv.org/abs/1703.07469
[26] Sergey Dudoladov. 2013. Statistical NLP for computer program source code: An information theoretic perspective on programming language verbosity. Master's thesis. School of Informatics, University of Edinburgh, United Kingdom.
[27] Eric Enslen, Emily Hill, Lori Pollock, and K Vijay-Shanker. 2009. Mining source code to automatically split identifiers for software analysis. (2009).
[28] Stefan Fiott. 2015. An Investigation of Statistical Language Modelling of Different Programming Language Types Using Large Corpora. Master's thesis. School of Informatics, University of Edinburgh, United Kingdom.
[29] Christine Franks, Zhaopeng Tu, Premkumar T. Devanbu, and Vincent Hellendoorn. 2015. CACHECA: A Cache Language Model Based Code Suggestion Tool. In ICSE (2), Antonia Bertolino, Gerardo Canfora, and Sebastian G. Elbaum (Eds.). IEEE Computer Society, 705–708. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7174815
[30] Mark Gabel and Zhendong Su. 2010. A Study of the Uniqueness of Source Code. In Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '10). ACM, New York, NY, USA, 147–156. https://doi.org/10.1145/1882291.1882315
[31] Philip Gage. 1994. A New Algorithm for Data Compression. C Users J. 12, 2 (Feb. 1994), 23–38. http://dl.acm.org/citation.cfm?id=177910.177914
[32] Erich Gamma and Kent Beck. 2003. Contributing to Eclipse: Principles, Patterns, and Plugins. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA.
[33] Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2016. Efficient softmax approximation for GPUs. CoRR abs/1609.04309 (2016). arXiv:1609.04309 http://arxiv.org/abs/1609.04309
[34] Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. 2017. Program synthesis. Foundations and Trends® in Programming Languages 4, 1-2 (2017), 1–119.
[35] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing common C language errors by deep learning. In National Conference on Artificial Intelligence (AAAI).
[36] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. J. Mach. Learn. Res. 13, 1 (Feb. 2012), 307–361. http://dl.acm.org/citation.cfm?id=2503308.2188396
[37] Vincent J. Hellendoorn and Premkumar Devanbu. 2017. Are Deep Neural Networks the Best Choice for Modeling Source Code?. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, New York, NY, USA, 763–773. https://doi.org/10.1145/3106237.3106290
[38] Michiel Hermans and Benjamin Schrauwen. 2013. Training and Analyzing Deep Recurrent Neural Networks. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'13). Curran Associates Inc., USA, 190–198. http://dl.acm.org/citation.cfm?id=2999611.2999633
[39] Emily Hill, David Binkley, Dawn Lawrie, Lori Pollock, and K Vijay-Shanker. 2014. An empirical study of identifier splitting techniques. Empirical Software Engineering 19, 6 (2014), 1754–1780.
[40] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the Naturalness of Software. In Proceedings of the 34th International Conference on Software Engineering (ICSE '12). IEEE Press, Piscataway, NJ, USA, 837–847. http://dl.acm.org/citation.cfm?id=2337223.2337322
[41] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[42] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep Code Comment Generation. In Proceedings of the 26th Conference on Program Comprehension (ICPC '18). ACM, New York, NY, USA, 200–210. https://doi.org/10.1145/3196321.3196334
[43] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, 2073–2083. http://www.aclweb.org/anthology/P16-1195
[44] D. Jurafsky and J.H. Martin. 2000. Speech and language processing. Prentice Hall.
[45] Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. 2014. Phrase-Based Statistical Translation of Programming Languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software (Onward! 2014). ACM, New York, NY, USA, 173–184.
[46] Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2017. Learning to create and reuse words in open-vocabulary neural language modeling. arXiv preprint arXiv:1704.06986 (2017).
[47] Srinivas S. Kruthiventi, Kumar Ayush, and R. Venkatesh Babu. 2015. DeepFix: A Fully Convolutional Neural Network for predicting Human Eye Fixations. CoRR abs/1510.02927 (2015). http://dblp.uni-trier.de/db/journals/corr/corr1510.html#KruthiventiAB15
[48] Jian Li, Yue Wang, Michael R. Lyu, and Irwin King. 2018. Code Completion with Neural Attention and Pointer Networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, 4159–4165. https://doi.org/10.24963/ijcai.2018/578
[49] Thang Luong, Richard Socher, and Christopher Manning. 2013. Better Word Representations with Recursive Neural Networks for Morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 104–113. http://www.aclweb.org/anthology/W13-3512
[50] G. Melis, C. Dyer, and P. Blunsom. 2017. On the State of the Art of Evaluation in Neural Language Models. ArXiv e-prints (July 2017).
[51] Kim Mens and Angela Lozano. 2014. Source Code-Based Recommendation Systems. Chapter 10, 93–130.
[52] Sebastian J. Mielke and Jason Eisner. 2018. Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model. CoRR abs/1804.08205 (2018). arXiv:1804.08205 http://arxiv.org/abs/1804.08205
[53] Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura (Eds.). ISCA, 1045–1048.
[54] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13). Curran Associates Inc., USA, 3111–3119. http://dl.acm.org/citation.cfm?id=2999792.2999959
[55] Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, Le Hai Son, Stefan Kombrink, and Jan Černocký. 2012. Subword Language Modeling with Neural Networks. (08 2012).
[56] Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N. Nguyen. 2013. Lexical Statistical Machine Translation for Language Migration. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013). ACM, New York, NY, USA, 651–654. https://doi.org/10.1145/2491411.2494584
[57] Thanh Nguyen, Peter C. Rigby, Anh Tuan Nguyen, Mark Karanfil, and Tien N. Nguyen. 2016. T2API: Synthesizing API Code Usage Templates from English Texts with Statistical Translation. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 1013–1017. https://doi.org/10.1145/2950290.2983931
[58] Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. 2013. A Statistical Semantic Language Model for Source Code. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013). ACM, New York, NY, USA, 532–542. https://doi.org/10.1145/2491411.2491458
[59] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2227–2237. https://doi.org/10.18653/v1/N18-1202
[60] Michael Pradel and Koushik Sen. 2018. DeepBugs: A Learning Approach to Name-based Bug Detection. Proc. ACM Program. Lang. 2, OOPSLA, Article 147 (Oct. 2018), 25 pages. https://doi.org/10.1145/3276517
[61] Mukund Raghothaman, Yi Wei, and Youssef Hamadi. 2016. SWIM: Synthesizing What I Mean: Code Search and Idiomatic Snippet Synthesis. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 357–367. https://doi.org/10.1145/2884781.2884808
[62] Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, and Premkumar Devanbu. 2016. On the "Naturalness" of Buggy Code. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 428–439. https://doi.org/10.1145/2884781.2884848
[63] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting Program Properties from "Big Code". In ACM Symposium on Principles of Programming Languages (POPL).
[64] Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code Completion with Statistical Language Models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). ACM, New York, NY, USA, 419–428. https://doi.org/10.1145/2594291.2594321
[65] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural Machine Translation of Rare Words with Subword Units. CoRR abs/1508.07909 (2015). arXiv:1508.07909 http://arxiv.org/abs/1508.07909
[66] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15 (2014), 1929–1958. http://jmlr.org/papers/v15/srivastava14a.html
[67] Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From Feedforward to Recurrent LSTM Neural Networks for Language Modeling. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 23, 3 (March 2015), 517–529. https://doi.org/10.1109/TASLP.2015.2400218
[68] Ilya Sutskever, James Martens, and Geoffrey Hinton. 2011. Generating Text with Recurrent Neural Networks. In Proceedings of the 28th International Conference on International Conference on Machine Learning (ICML'11). Omnipress, USA, 1017–1024. http://dl.acm.org/citation.cfm?id=3104482.3104610
[69] Zhaopeng Tu, Zhendong Su, and Premkumar T. Devanbu. 2014. On the localness of software. In Proceedings of the 22nd Symposium on the Foundations of Software Engineering. ACM, 269–280. https://doi.org/10.1145/2635868.2635875
[70] Clara Vania and Adam Lopez. 2017. From Characters to Words to in Between: Do We Capture Morphology?. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2016–2027. https://doi.org/10.18653/v1/P17-1184
[71] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[72] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 2692–2700. http://papers.nips.cc/paper/5866-pointer-networks.pdf
[73] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep Learning Code Fragments for Code Clone Detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016). ACM, New York, NY, USA, 87–98. https://doi.org/10.1145/2970276.2970326
[74] Martin White, Christopher Vendome, Mario Linares-Vásquez, and Denys Poshyvanyk. 2015. Toward Deep Learning Software Repositories. In Proceedings of the 12th Working Conference on Mining Software Repositories (MSR '15). IEEE Press, Piscataway, NJ, USA, 334–345. http://dl.acm.org/citation.cfm?id=2820518.2820559
