PYMT5: Multi-Mode Translation of Natural Language and PYTHON Code with Transformers
Colin B. Clement∗ (Microsoft Cloud and AI) [email protected]
Dawn Drain (Microsoft Cloud and AI) [email protected]
Jonathan Timcheck† (Stanford University) [email protected]
Alexey Svyatkovskiy (Microsoft Cloud and AI) [email protected]
Neel Sundaresan (Microsoft Cloud and AI) [email protected]

∗ Corresponding author
† Work done during a Microsoft internship

arXiv:2010.03150v1 [cs.LG] 7 Oct 2020

Abstract

Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PYMT5, the PYTHON method text-to-text transfer transformer, which is trained to translate between all pairs of PYTHON method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million PYTHON methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PYMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CODESEARCHNET test set, our best model predicts 92.1% syntactically correct method bodies, achieves a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieves a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation.

1 Introduction

Software is a keystone of modern society, touching billions of people through services and devices daily. Writing and documenting the source code of this software are challenging and labor-intensive tasks; software developers need to repeatedly refer to online documentation resources in order to understand existing code bases and make progress. Developer productivity can be improved by the presence of source code documentation and a development environment featuring intelligent, machine-learning-based code completion and analysis tools.

Recent progress in natural language processing (NLP), especially encoder/decoder-based transformer models (Vaswani et al., 2017) and pre-training (Radford et al., 2018; Lewis et al., 2019), has led to state-of-the-art performance on language modeling, classification (Devlin et al., 2018), translation (Raffel et al., 2019), summarization (Liu and Lapata, 2019), grammar correction (Bryant et al., 2017), entity recognition, dialogue generation (Budzianowski and Vulić, 2019), and more. Along with these quantitative advances has come a deeper understanding of the learned hidden representations which power transformers (Kovaleva et al., 2019; Voita et al., 2019; Clark et al., 2019; Ethayarajh, 2019). While they are arguably not ‘natural,’ programming languages are increasingly becoming modeling playgrounds for NLP: since these languages by definition have a grammar, syntax, and known relationships between entities, they offer enticing opportunities for an even deeper probing of NLP models and tasks.
Beyond theoretical importance, many NLP tasks have practical utility in software development environments: language modeling or generation can be used for code completion (Raychev et al., 2014; Bruch et al., 2009; Svyatkovskiy et al., 2019, 2020), translation/summarization to generate documentation or natural language summaries (Moreno et al., 2013; Scalabrino et al., 2017; Wan et al., 2018; Alon et al., 2018) or even to summarize a set of code changes (Moreno et al., 2014), translation and grammar error correction to patch and detect bugs (Zhai et al., 2019), and joint embedding of code and natural language for code search (Husain et al., 2019; Gu et al., 2018).

In this work we focus on jointly modeling both source code (PYTHON) and concomitant natural language documentation (docstrings) with transformers, through the study of dual tasks: generating method code bodies from signatures and docstrings, and generating docstrings from signatures and method code bodies. While previous work (Allamanis et al., 2015; Yin and Neubig, 2017) has leveraged the grammar of code to extract features like the Abstract Syntax Tree for modeling (treating code and natural language as separate modalities), we follow examples like Barone and Sennrich (2017) and treat PYTHON and its docstrings as fundamentally no different from other ‘natural’ languages, representing both source code and natural language docstrings as sequences of tokens sharing the same vocabulary. Here we present a multi-mode translation method resulting in PYMT5, the PYTHON method text-to-text transfer transformer (inspired by the text-to-text transfer transformer T5 (Raffel et al., 2019)). Our single model can both learn code/language generation and understand the relationships between them.

The paper is organized as follows: we begin in sec. 2 by presenting examples of the performance of our novel multi-mode PYMT5, the PYTHON method text-to-text transfer transformer model, which we trained to translate between all pairs of combinations of method signatures, docstrings, and bodies which do not have the same feature in both the source and target. In sec. 2.1 we describe our training data and the pre-processing steps for source code and natural language we followed, and compare our corpus to existing parallel docstring-method corpora such as CODESEARCHNET (CSN) (Husain et al., 2019) and that presented by Barone et al. (Barone and Sennrich, 2017). In sec. 2.2 we explain our BART-like (Lewis et al., 2019) pre-training scheme, demonstrating a 25× speed-up in training time for docstring generation. Next, in sec. 2.3 we analyze and classify PYTHON docstrings, enabling style-conditioned docstring generation in PYMT5. In sections 3 and 4, we discuss PYMT5 results on method generation and docstring generation respectively, and compare it to two GPT2 models, one randomly initialized and one pre-trained on English.

2 Multi-mode training

Figure 1 shows examples of inputs and outputs of our model PYMT5 for 3 example tasks: (top, blue) predicting a body from a method signature, (middle, red) predicting a whole method from a natural language docstring, and (bottom, green) predicting a body from a signature and docstring. Note that the comment ‘# target <specification>’ instructs the model to choose a particular form of output. Further note that PYMT5 correctly learns to interpret natural language: it interprets ‘even’ as being related to ‘(example % 2) == 0’, and ‘greater than 1000’ as ‘number > 1000’. The model also produces syntactically correct code (as we will discuss later, we never show the model syntactically incorrect code), and correctly infers the types of ‘lst’ and ‘numbers’ to be iterables containing numbers.

Figure 1: Real examples of PYMT5 performing method generation using combinations of signatures and docstrings. A leading comment in the input sequence instructs the model to output a particular combination of features, e.g. ‘# target signature and body’ instructs PYMT5 to predict both a signature and body.

PYMT5 can also be prompted with source code to produce a docstring summary in various styles. Figure 2 shows the model prompted with one of the methods generated by PYMT5 in Fig. 1 (top, blue), in both a ‘one line’ (top, blue) style and a ‘Numpydoc’ (bottom, red) style. It infers the intent from the signature name and code, and even infers that the type of the argument is a list and the return type is int. It produces the same terse one-sentence summary of the function in both cases.

Figure 2: PYMT5 performing docstring generation on an example method (count_even_numbers_in_list), showing the output when the target prefix indicates one-line (top, blue) and Numpydoc (bottom, red) docstring styles.

In order to teach PYMT5 to maximally relate the separate method features (signatures, docstrings, bodies), we trained it to translate between all pairs of feature combinations in which the same feature does not appear in both the source and target. This scheme is also advantageous as our corpus is unbalanced, with only 1/5 of methods featuring docstrings, and so the model can learn to leverage all the features whether they are present or not. Additionally, it has been shown that code is more ‘predictable’ than natural language (Hindle et al., 2012). If the method and argument names are a dominating signal due to their relatively rigid structure, the model may learn to ignore the content of docstrings. This multi-mode method overcomes that by training the model to generate method bodies from docstrings alone. See the appendix for a more detailed description of the multi-mode training scheme.
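To make the combinatorics of this scheme concrete, the sketch below (not the authors' code) enumerates such source/target pairs for a single parsed method. The dictionary format, the helper name multimode_examples, and the exact way features and the control comment are concatenated are assumptions; only the ‘# target ...’ prompt style is taken from the examples in Figures 1 and 2, and the docstring-style conditioning of sec. 2.3 is omitted.

from itertools import combinations

FEATURES = ("signature", "docstring", "body")

def multimode_examples(method):
    """Yield (source, target) training pairs for one method.

    `method` maps feature names to text, e.g.
    {"signature": "def f(x):", "docstring": "Add one.", "body": "    return x + 1"}.
    No feature appears on both sides of any pair.
    """
    present = [f for f in FEATURES if method.get(f)]
    # All non-empty subsets of the features this method actually has.
    subsets = [set(c) for r in range(1, len(present) + 1)
               for c in combinations(present, r)]
    for src in subsets:
        for tgt in subsets:
            if src & tgt:  # a feature may not appear in both source and target
                continue
            # Leading control comment naming the requested output features,
            # mirroring the '# target <specification>' prompts of Fig. 1.
            control = "# target " + " and ".join(f for f in FEATURES if f in tgt)
            source = "\n".join([control] + [method[f] for f in FEATURES if f in src])
            target = "\n".join(method[f] for f in FEATURES if f in tgt)
            yield source, target

For a method with all three features present, this enumeration yields every ordered pair of disjoint feature subsets, so a docstring-only source can be asked to produce a body, a signature, or both, which is exactly the property that forces the model to attend to docstring content.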
…space or tab conventions, successfully parsing 97.3% of the 2.3 million unique PYTHON files. We used the PYTHON module astunparse to take the AST for each method and unparse it back into source code, so that our fine-tuned model was never trained on syntactically incorrect code. The statistics of our method-docstring corpus are summarized in Table 1. Our parallel method-docstring corpus is twice as large as the next largest irrespective of language and over 15× as large as the next largest PYTHON parallel corpus, both in CSN.

For each method, we ignored comments as they generally represent trivia and are not part of the normal language syntax. We cleaned the docstrings by removing non-ASCII characters, normalizing Unicode, and replacing commit hashes, file paths, and URLs with placeholder tokens.
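As a rough illustration of this preprocessing, the sketch below parses a file with the standard ast module, re-emits each method with astunparse so that only syntactically valid, normalized code is kept, and cleans its docstring. The placeholder tokens (<URL>, <HASH>, <PATH>) and the regular expressions are illustrative assumptions, not the authors' exact rules.

import ast
import re
import unicodedata

import astunparse  # third-party package; the paper uses it to unparse ASTs

def extract_normalized_methods(source: str):
    """Yield (name, normalized_code, docstring) for each method in a file.

    Unparsing the AST guarantees the emitted code is syntactically valid
    and normalizes away the original whitespace conventions.
    """
    tree = ast.parse(source)  # raises SyntaxError on unparsable files
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            yield node.name, astunparse.unparse(node), ast.get_docstring(node)

def clean_docstring(doc: str) -> str:
    """Normalize Unicode, drop non-ASCII characters, and replace URLs,
    commit hashes, and file paths with placeholder tokens (names are guesses)."""
    doc = unicodedata.normalize("NFKD", doc)
    doc = doc.encode("ascii", "ignore").decode("ascii")
    doc = re.sub(r"https?://\S+", "<URL>", doc)         # URLs
    doc = re.sub(r"\b[0-9a-f]{7,40}\b", "<HASH>", doc)  # commit hashes
    doc = re.sub(r"(?:/[\w.\-]+){2,}", "<PATH>", doc)   # unix-style paths
    return doc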