Semi-Supervised Code Generation by Editing Intents

Edgar Chen
Advised by Graham Neubig and Bogdan Vasilescu
Carnegie Mellon University
[email protected]

Abstract

In this work, we explore the semantic parsing task of generating code in a general-purpose programming language (e.g. Python) from a natural language description. One significant issue with the current state of code generation from natural language is the lack of data in a ready-to-translate form, i.e., written as a command with variable names specified. We propose a deep generative model to rewrite and augment utterances such as StackOverflow questions into an acceptable form by adding variable names and other important information. We treat the rewritten natural language representation as a latent variable, and sample an edited natural language sentence on which we condition the code generation process. We train the models separately in a supervised setting and together in an unsupervised setting.

Intent: count the occurrences of a list item
Code: l . count ( 'b' )

Figure 1: A datapoint from StackOverflow, representing a Question (intent) and its Answer (snippet).

1 Introduction

Programmers learning a new programming language like Java or Python can often struggle with implementing a simple command in code, due to unfamiliarity with syntax or other issues. Even experienced programmers can forget the specific details of certain library functions. Often, programmers stuck on such details will query search engines with natural language descriptions, looking for reference materials or contents on question-and-answer sites such as StackOverflow (SO). However, this means that the programmer must search for the right question, then the right answer, then understand the code in the answer well enough to implement their own version of it to attain the results they are interested in, which can be tough for a novice.

Semantic parsing is the category of tasks in NLP which involve transforming unstructured natural language text into a structured logical representation. These representations can range from simple grammars for querying knowledge bases (Berant et al., 2013) to smartphone voice commands which can be executed by virtual agents such as Siri and Alexa (Kumar et al., 2017). NLP researchers have shown significant progress on learning these representations from data. However, these representations are often limited in their expressivity due to limitations in the expressive power of the syntax and the limited domain of the task. The ability to generate code in general-purpose programming languages is desirable, as such languages are both sufficiently expressive for programmable tasks and can easily be understood and integrated by programmers into existing systems and codebases.

One necessity for learning a strong code generation system is data which includes a concise natural language intent for which the corresponding code snippet captures all of the instructions and detail in the intent, while not including any extraneous code. Previous works have focused on cases where code is annotated, line by line, with detailed natural language descriptions that include variable names (Oda et al., 2015), or where the code can only interact with a fixed number of variables (i.e. pieces in a game) (Ling et al., 2016). We would like our models to capture the complexities of all questions that programmers can ask, which can be achieved by mining data from a question-and-answer site like StackOverflow, which allows users the freedom to ask any code-related question. However, the question text from an SO post is often insufficient information to actually generate code.

In a usual code generation task, an example input would be a fully-formed command such as "How to get maximal element in list 'my_list'", and its corresponding output would be a line of code: max(my_list). On StackOverflow, however, a question posted by users would take the form "How to get maximal element in a list?", expressing only an abstract intent, while the answer to the post would still be max(my_list). Despite the fact that the intent and snippet agree, the intent does not contain enough information to generate the entire code snippet, as it is missing the variable name of the list.

One approach to fixing this issue would be manual annotation, but that process is slow and requires a programmer skilled in the target programming language, and thus does not scale easily. We instead take a two-step approach to utilizing this data. First, we construct a model which rewrites the abstract natural language intents into concrete commands by augmenting them with variables from the code and other important information such as variable types. Then, we construct a model which generates code snippets from these rewritten intents. Because data with rewritten intents is sparse, we treat the rewritten intents as underlying latent variables which we infer to perform the final code generation step, and train our models in tandem, semi-supervised. This approach surpasses our baseline approach of ignoring the rewriting step and generating code directly from the intent, as well as learning rewritten intent generation using only the small supervised dataset.

Intent: count the occurrences of a list item
Code: l . count ( 'b' )
Rewritten Intent: count the occurrences of item 'b' in list l

Figure 2: A datapoint separated into 3 parts: intent in blue, code in green, and generable tokens in red.

2 Related Work

Gradient-based optimization of generative latent-variable models was first popularized with the Variational Auto-Encoder (VAE) (Kingma and Welling, 2013). However, Kingma and Welling focused on sampling and training continuous latent variables. Miao and Blunsom (2016) repurpose the ideas from the VAE to train models with structured, discrete latent variables, and demonstrate that the latent variables can be used as summaries in sentence compression to good effect.

Lin et al. (2017) have worked on learning to generate shell script commands from natural language. Their approach relies on synthesizing a template for a Bash one-liner, then filling in the arguments. They show good results in both accuracy and human evaluation, and show that their model is able to capture relatively complex intents as commands and arguments. However, they limit the scope of the task to running Unix command-line utilities with arguments, which means that many of the powerful elements of Bash syntax are lost, such as control flow, I/O redirection, and variables, as most command-line use of Bash does not include these features. Our work is a step closer to generating code in a general-purpose language which allows these more complex syntactic features.

3 Methods

3.1 Problem Description

A sample of fully annotated data is shown in Figure 2. We would like to infer high-quality rewritten intents from the intent (English) and code (Python). The quality of a rewritten intent is measured by its faithfulness to the original intent and its completeness in providing a fully-formed natural language command that allows humans and machines to write code with no additional information. To facilitate the generation of these rewritten intents, we require that they are composed mostly of copies of words from the intent and of variable names and literals (strings, integers, etc.) from the code. This means that the rewritten intent will mostly contain content from the original intent, with some variable names possibly added to allow for a complete generation. In Figure 2, for example, without the variable names inserted into the rewritten intent, it is impossible to write the code; even if we know that we want to call the count function, there is no indication as to which list it should be called from or what object it should be called on. The rewritten intent tells us that the list is l and that count should be called on 'b'. One last element of the rewritten intent is the generable tokens: these are common words such as prepositions (in, of, on) and types (list, file) which are sometimes not found in the intents due to their brevity. We allow the model to generate these for higher fidelity and a clearer match with the manually annotated rewritten intents.

3.2 Model

The model can be separated into two parts: a rewritten intent generator and a code generator.

3.2.1 Rewritten Intent Generator

The base model is an encoder-decoder model (Cho et al., 2014). We use fine-tuned embeddings e(w) to convert each token in the snippet and intent into a vector, then encode each sequence separately with bidirectional LSTMs (Hochreiter and Schmidhuber, 1997). The decoder is an LSTM which allows for copies using pointer networks, as described in Gu et al. (2016). We would like to generate probabilities for the rewritten intent z = [z_1, z_2, ..., z_|z|]. The intent x = [x_1, x_2, ..., x_|x|] and snippet y = [y_1, y_2, ..., y_|y|] can be encoded and concatenated into a single list of vectors:

    r = [BiLSTM(e(x_1), ..., e(x_|x|)); BiLSTM(e(y_1), ..., e(y_|y|))]    (1)

Let h_t be the current hidden state of the LSTM. Then we calculate the copy scores α_t:

    α_t,i = v_a^T tanh(W_a [h_t; r_i])    (2)

where v_a, W_a are trainable parameters. We explicitly disallow copies of code tokens which are not variable names or literals, e.g. "foo" or 3. Then we compute scores β_t for the generable tokens; these are computed directly from the hidden state:

    β_t = W_g h_t    (3)

Once we have computed these two, we compute the final generation probabilities

    q_t = softmax([α_t; β_t])    (4)

The probability of generating a copy or token with index i as the next word is Pr(z_t = i) = q_t,i. Finally, we calculate h_{t+1}. If we decided to copy a token, we would like to use information about the copied token, so we augment the input to the LSTM f, the embedding of the copied word e(z_t), with the encoding of the copied token r[z_t]. If z_t was not copied and instead generated, we let r[z_t] = 0:

    h_{t+1} = f(h_t, [e(z_t); r[z_t]])    (5)

Figure 3: Diagram of the Rewritten Intent Generator.

3.2.2 Code Generator

The code generator is also an encoder-decoder LSTM model. We use a bidirectional LSTM to encode the rewritten intent, obtaining a list of encodings s = BiLSTM(z_1, z_2, ..., z_|z|). Then we calculate scores for copying and generating code tokens ŷ = [ŷ_1, ..., ŷ_|ŷ|]. For copies, we employ the same strategy as before: letting the current hidden state be h_t, we calculate, only for i such that w_i is a code token,

    γ_t,i = v_b^T tanh(W_b [h_t; s_i])    (6)

where v_b, W_b are trainable parameters. To calculate the probabilities of the code tokens, we calculate

    δ_t = U_c (softmax(v_c^T tanh(W_c [h_t; s_i])))^T r    (7)

where U_c, v_c, W_c are trainable parameters. To get the final token generation probabilities Pr(ŷ_t = i) = p_t,i, we take

    p_t = softmax([γ_t; δ_t])    (8)

Since the purpose of this model is not primarily to copy, we do not use the encoding of copied tokens when computing the next hidden state:

    h_{t+1} = f(h_t, e(ŷ_t))    (9)

3.2.3 Training

The training process roughly follows the setup in Miao and Blunsom (2016), which also uses a discrete VAE architecture. Supervised data is trained in the usual sequence-to-sequence manner of maximizing the log-likelihoods Σ_t log q_t and Σ_t log p_t. Let φ and θ be the parameters of the rewritten intent and code generation models, respectively. Unsupervised data is trained by sampling from the latent variable distribution q_φ(z|x, y), then computing the probability distribution p_θ(y|z) in the code generation step. We want to optimize the variational lower bound of p_θ, which is

    E_{q_φ(z|x,y)}[log p_θ(y|z)] − D_KL[q_φ(z|x, y) || p_0(z)]    (10)
    ≤ log ∫ p_θ(y|z) p_0(z) dz = log p(y)    (11)

Here, p_0 is the prior language model on the rewritten intent distribution we are sampling from. Intuitively, this means that we want to balance two objectives: drawing samples that let the code generation model generate code well, and drawing them from the prior language model to ensure that they still look like words. Instead of directly optimizing L, we control the importance of these objectives by introducing a coefficient λ and instead optimizing

    E_{q_φ(z|x,y)}[log p_θ(y|z)] − λ D_KL[q_φ(z|x, y) || p_0(z)]    (12)

If λ is small, the model will focus more on samples which perform well on code generation, as opposed to samples which are likely to appear in the language model. Our ultimate goal is not to enforce the language model constraints on the rewritten intent tightly, but to create a good code generation model and a rewritten intent model which supports it well.

We optimize the parameters φ and θ differently. For the parameters θ, we want to maximize the log-likelihood of generating the correct code snippet given z:

    ∂L/∂θ = ∂/∂θ E_{q_φ(z|x,y)}[log p_θ(y|z)]    (13)
    ≈ ∂/∂θ (1/M) Σ_m log p_θ(y|z^(m))    (14)
    ≈ ∂/∂θ max_m log p_θ(y|z^(m))    (15)

Equations 14 and 15 suggest two different strategies for updating the parameters. In both cases, we sample M samples from q_φ(z|x, y), the rewritten intent generator. In Equation 14, we take the average of the log-likelihood and optimize based on all the samples; in Equation 15 we optimize only on the sample which has the minimum loss. The first is the standard method of approximating the expectation; the second avoids optimizing the code generation process based on bad samples from the rewritten intent generator. The effectiveness of these strategies is explored in the experiments section.

We update the parameters φ in the rewritten intent generator by letting the learning signal be

    l(z, x, y) = log p_θ(y|z) − λ(log q_φ(z|x, y) − log p_0(z))

Then we take the gradient:

    ∂L/∂φ = E_{q_φ(z|x,y)}[l(z, x, y) ∂ log q_φ(z|x, y)/∂φ]    (16)
    ≈ (1/M) Σ_m l(z^(m), x, y) ∂ log q_φ(z^(m)|x, y)/∂φ    (17)

taking M samples as we did before. However, this estimator has high variance because of the variability in quality of samples from q_φ(z^(m)|x, y). Thus, we use REINFORCE (Williams, 1992) to reduce the variance of the training process, and introduce a baseline B(x, y). For the baseline to be effective, we minimize the following expectation during training:

    E_{q_φ(z|x,y)}[(l(z, x, y) − B(x, y))^2]    (18)

We use an MLP which takes as input the bags of words for x and y and the lengths of the sentences as our baseline B. Our final gradient estimator is then

    ∂L/∂φ ≈ (1/M) Σ_m (l(z^(m), x, y) − B(x, y)) ∂ log q_φ(z^(m)|x, y)/∂φ    (19)

4 Experiments

4.1 Dataset and Preparation

The data consists of three parallel English-Python corpora mined from Stack Overflow in a collaborative effort between Pengcheng Yin, Bowen Deng, myself, and the two advisors on this project, Graham Neubig and Bogdan Vasilescu. Our mining method is described in Yin et al. (2018); we calculate alignment scores between the questions asked on SO and snippets of code posted in the answers. Specifically, we calculate the alignment scores by training a logistic regression classifier on a list of hand-tuned features such as vote count and snippet length, as well as machine-learned scores which measure the ability of a model to translate the natural language into code and vice versa. Although the initial annotation and mining process worked well on the top 1000 questions, extending to the top 10000 questions on SO proved to be difficult. While the top 1000 questions are mostly about common Python language constructs and usage of popular functions in popular libraries, the top 10000 questions present a much more varied set of challenges, often diving deeply into a specific library or asking how to fix a bug in a specific program. We decided to annotate the model outputs with the highest alignment scores so that we could teach the model which datapoints it misclassified, particularly the types of datapoints which did not appear in the top 1000. We retrained the model on the new data and were able to make significant improvements, to the point where >60% of the data with the highest-ranking alignment scores was correctly aligned, as opposed to 25-30% previously.

• Supervised: 490 question and answer pairs mined from the top 1000 Python questions on StackOverflow and manually annotated. This annotation includes the original English question (intent), the Python code snippet which reflects a correct answer (snippet), and a natural language sentence in the form of a command with all the variable names specific to the snippet, which reflects the original intent of the question (rewritten intent).

• Unsupervised Verified: 390 question and answer pairs mined from the top 10000 Python questions on StackOverflow and manually verified. Only English (intent) and Python (snippet) data are provided.

• Unsupervised: 2242 question and answer pairs mined from the top 10000 Python questions on StackOverflow, with no manual verification. Only English (intent) and Python (snippet) data are provided.

4.2 Implementation

The model is implemented in DyNet (Neubig et al., 2017), with 2-layer LSTMs with hidden layer and input size 128. We use the Adam optimizer (Kingma and Ba, 2014) and dropout (Zaremba et al., 2014) to improve the optimization process. For the following experiments, we pre-trained with the Django data, which consists of line-by-line annotations of the Python Django library (Oda et al., 2015), and with the supervised data for 20 epochs, then trained with a random sample of 10% of the supervised data points combined with all of the unsupervised datapoints. We use a 60/40 train/test split and measure the losses.

4.3 Results

When unsupervised data is added to the training in epoch 21, both Figure 4 and Figure 5 reflect increased ability to generate code. The model seems to learn how to better utilize the rewritten intents it creates to generate code, even if it does not learn to generate rewritten intents closer to our annotations. This could potentially be because rewritten intents can be written in many ways, and the model only selects one style for code generation. Figure 6 contains outputs of the program. Notably, the code generation tends to fall into similar patterns, suggesting that there is not enough of a learning signal in our current model to learn rewritten intents well from the unsupervised data.

5 Conclusion and Future Work

There is still a long way to go in learning from the StackOverflow dataset. This work itself is far from complete; there are still areas in which the model could be optimized for higher performance, as it is not currently generating good outputs. To this end, we plan to improve the quantity and quality of the annotated data, the optimization process, and the model architecture. However, even if this work were complete, it would be just a single step towards generating correct code from StackOverflow. There are still many issues to be covered, from referencing the right variables and arguments to using the appropriate syntax. Unlike in tasks such as Machine Translation, mostly-correct generated code will not produce the appropriate output when run, whereas a mostly-correct translated sentence is still comprehensible to speakers of the target language. We hope that this work, when complete, will encourage the development of more robust code generation models.

Figure 4: Results of training on the unsupervised, verified data.

References

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes.

Anjishnu Kumar, Arpit Gupta, Julian Chan, Sam Tucker, Björn Hoffmeister, and Markus Dreyer. 2017. Just ASK: Building an architecture for extensible self-service spoken language understanding. CoRR, abs/1711.00549.

Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, and Michael D. Ernst. 2017. Program synthesis from natural language using recurrent neural networks.

Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, and Phil Blunsom. 2016. Latent predictor networks for code generation.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit.

Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and S. Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation. In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 574–584.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.
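The copy-versus-generate distribution used by the rewritten intent generator (the α, β, and q_t computation in Equations 2–4) can be sketched in NumPy. This is a toy illustration with made-up dimensions and random parameters, not the paper's DyNet implementation; the names W_a, v_a, and W_g mirror the trainable parameters in the equations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8        # hidden size (toy)
n_src = 5    # encoded intent+snippet tokens (candidates for copying)
n_gen = 3    # vocabulary of generable tokens (prepositions, types, ...)

r = rng.normal(size=(n_src, d))   # encoder states, one per copyable token
h_t = rng.normal(size=d)          # current decoder hidden state

W_a = rng.normal(size=(d, 2 * d))  # copy scorer parameters (Eq. 2)
v_a = rng.normal(size=d)
W_g = rng.normal(size=(n_gen, d))  # generable-token scorer (Eq. 3)

# Copy scores alpha_t (Eq. 2): one score per encoder position.
alpha = np.array([v_a @ np.tanh(W_a @ np.concatenate([h_t, r_i])) for r_i in r])

# Generable-token scores beta_t (Eq. 3), directly from the hidden state.
beta = W_g @ h_t

# A single softmax over copies and generable tokens gives q_t (Eq. 4);
# Pr(z_t = i) = q_t[i], with the first n_src entries being copies.
q_t = softmax(np.concatenate([alpha, beta]))
```

Because copies and generations share one softmax, no separate gating variable is needed to decide between the two modes.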
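The difference between the two decoders' state updates (Equations 5 and 9) is only what is fed to the recurrent cell f: the intent rewriter concatenates the copied token's encoder state r[z_t] (or zeros, if the token was generated) onto the word embedding, while the code generator feeds the embedding alone. A toy sketch under assumed dimensions, with f stubbed as a single tanh layer rather than a real LSTM cell:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                              # embedding/hidden size (toy)
W_f = rng.normal(size=(d, 3 * d))  # stub rewriter cell: input is [h; e; r]

def f_rewriter(h, e, r_z):
    # Eq. 5: the word embedding is concatenated with the copied
    # token's encoding before entering the cell.
    return np.tanh(W_f @ np.concatenate([h, e, r_z]))

h_t = rng.normal(size=d)
e_z = rng.normal(size=d)     # embedding of the chosen token z_t
r_enc = rng.normal(size=d)   # encoder state of the copied source token

h_copy = f_rewriter(h_t, e_z, r_enc)       # z_t was copied
h_gen = f_rewriter(h_t, e_z, np.zeros(d))  # z_t was generated: r[z_t] = 0

# The code generator's cell (Eq. 9) would instead take only e(y_hat_t)
# as input, since copying is not its primary purpose.
```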
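The two θ-update strategies (Equations 14 and 15) differ only in how the M sampled rewritten intents are reduced: averaging the log-likelihood over all samples versus keeping only the best-scoring sample. A minimal sketch with stand-in log-likelihood values (the numbers are invented for illustration):

```python
import numpy as np

# Stand-in values of log p_theta(y | z^(m)) for M = 4 sampled rewritten intents.
log_liks = np.array([-2.3, -8.1, -1.7, -5.0])

# Strategy of Eq. 14: average over all samples (standard Monte Carlo estimate).
loss_avg = -log_liks.mean()

# Strategy of Eq. 15: train only on the sample with the minimum loss
# (i.e. the highest log-likelihood), shielding the code generator from
# bad samples drawn from the rewritten intent generator.
loss_best = -log_liks.max()

assert loss_best <= loss_avg
```

The second strategy trades estimator fidelity for robustness: a single bad sample from q_φ no longer drags the code generator's update toward unusable rewritten intents.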
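The φ-update described in Section 3.2.3 (Equations 16–19) is a score-function (REINFORCE) estimator: each sample's gradient of log q_φ is weighted by its learning signal minus a baseline. A minimal sketch of the per-sample weighting only, with invented stand-in numbers; the real baseline B is a learned MLP over bags of words and sentence lengths, whereas here it is simply set to the batch mean for illustration:

```python
import numpy as np

# Stand-in learning signals l(z^(m), x, y) for M = 4 samples.
signals = np.array([-3.0, -1.2, -4.5, -2.0])

# Illustrative baseline choice; the paper learns B(x, y) with an MLP instead.
baseline = signals.mean()

# Eq. 19: each sample's grad log q_phi is weighted by (l - B), and the
# average over samples approximates the expectation in Eq. 16.
weights = (signals - baseline) / len(signals)
```

Subtracting a baseline that does not depend on the sample leaves the estimator unbiased while reducing its variance; with this particular mean baseline the weights sum to zero.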

Figure 5: Results of training on the unsupervised data.

intent: count the occurr of a list item
code: dict ( ( ( x , l . count ( x ) ) for x in set ( l ) ) )
gold_rewritten: count the occurr of item in list l
pred_rewritten: count the occurr of item x is in x ) of key ) ) l x x
pred_code: sorted ( l , key = lambda x : ( x [ 1 ] , x [ 1 ] ) )
------
intent: count the occurr of a list item
code: l . count ( 'b' )
gold_rewritten: count the occurr of item 'b' in list l
pred_rewritten: count the occurr 'b' from list l case of of of of of
pred_code: re . findall ( 'b' ( 'b' )
------
intent: revers a string
code: a_string [ : : ( - 1 ) ]
gold_rewritten: revers a string a_string
pred_rewritten: revers a string a_string to 1 is save last of a_string
pred_code: a_string . join ( { 1 ]
------
intent: check if a variabl exist
code: if ( 'myvar' in globals ( ) ) : pass
gold_rewritten: check if a variabl 'myvar' exist
pred_rewritten: check if object 'myvar' variabl 'myvar' and 1 pass pass pass pass
pred_code: super ( 'myvar' ) . reverse ( )
------
intent: add new item to dictionari
code: default_data . update ( { 'item4' : 4 , 'item5' : 5 , } )
gold_rewritten: add key 'item4' 4 and 'item5' 5 to dictionari default_data
pred_rewritten: check new item 'item4' 5 : dictionari 'item4' 'item5' 4 'item5' , 'item5'
pred_code: datetime . datetime . strptime ( <unk> 1 ] ) . 5
------
intent: add key to a dictionari
code: data . update ( dict ( a = 1 ) )
gold_rewritten: add key a to dictionari data with 1
pred_rewritten: key a to a dictionari data from 1 case 1 1 right 1 ) )
pred_code: datetime . datetime . strptime ( a ( a 2 ) )

Figure 6: Output of the program on unsupervised data.

Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from Stack Overflow. In Mining Software Repositories (MSR).

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization.