Arxiv:1603.06744V2 [Cs.CL] 8 Jun 2016 Verse Predictors (Koehn Et Al., 2007; Wong and Marginal Likelihood Over Latent Predictors and Gen- Mooney, 2006)

Latent Predictor Networks for Code Generation

Wang Ling♦ Edward Grefenstette♦ Karl Moritz Hermann♦ Toma´sˇ Kociskˇ y´♦♣ Andrew Senior♦ Fumin Wang♦ Phil Blunsom♦♣

♦Google DeepMind ♣University of Oxford {lingwang,etg,kmh,tkocisky,andrewsenior,awaw,pblunsom}@google.com

Abstract

Many language generation tasks require the production of text conditioned on both structured and unstructured inputs. We present a novel neural network architecture which generates an output sequence conditioned on an arbitrary number of input functions. Crucially, our approach allows both the choice of conditioning context and the granularity of generation, for example characters or tokens, to be Figure 1: Example MTG and HS cards. marginalised, thus permitting scalable and effective training. Using this framework, we address the problem of generating programming code from a mixed natural lan- dictor to copy “The Foundation” from the input, guage and structured specification. We and a another one to find the answer “Issac Asi- create two new data sets for this paradigm mov” by searching through a database. However, derived from the collectible trading card training multiple predictors is in itself a challeng- games Magic the Gathering and Hearth- ing task, as no annotation exists regarding the pre- stone. On these, and a third preexisting dictor used to generate each output token. Fur- corpus, we demonstrate that marginalis- thermore, predictors generate segments of differ- ing multiple predictors allows our model ent granularity, as database queries can generate to outperform strong benchmarks. multiple tokens while a word softmax generates a single token. In this work we introduce Latent 1 Introduction Predictor Networks (LPNs), a novel neural archi- The generation of both natural and formal lan- tecture that fulfills these desiderata: at the core guages often requires models conditioned on di- of the architecture is the exact computation of the arXiv:1603.06744v2 [cs.CL] 8 Jun 2016 verse predictors (Koehn et al., 2007; Wong and marginal likelihood over latent predictors and gen- Mooney, 2006). Most models take the restrictive erated segments allowing for scalable training. approach of employing a single predictor, such as We introduce a new corpus for the automatic a word softmax, to predict all tokens of the output generation of code for cards in Trading Card sequence. To illustrate its limitation, suppose we Games (TCGs), on which we validate our model 1. wish to generate the answer to the question “Who TCGs, such as Magic the Gathering (MTG) and wrote The Foundation?” as “The Foundation was Hearthstone (HS), are games played between two written by Isaac Asimov”. The generation of the players that build decks from an ever expanding words “Issac Asimov” and “The Foundation” from pool of cards. Examples of such cards are shown a word softmax trained on annotated data is un- in Figure 1. Each card is identified by its attributes likely to succeed as these words are sparse. A ro- bust model might, for example, employ one pre- 1Dataset available at https://deepmind.com/publications.html (e.g., name and cost) and has an effect that is de- MTG HS scribed in a text box. Digital implementations of Programming Language Java Python these games implement the game logic, which in- Cards 13,297 665 cludes the card effects. This is attractive from Cards (Train) 11,969 533 a data extraction perspective as not only are the Cards (Validation) 664 66 Cards (Test) 664 66 data annotations naturally generated, but we can Singular Fields 6 4 also view the card as a specification communi- Text Fields 8 2 cated from a designer to a software engineer. Words In Description (Average) 21 7 This dataset presents additional challenges to Characters In Code (Average) 1,080 352 prior work in code generation (Wong and Mooney, 2006; Jones et al., 2012; Lei et al., 2013; Artzi Table 1: Statistics of the two TCG datasets. et al., 2015; Quirk et al., 2015), including the handling of structured input—i.e. cards are com- health) and four text fields (cost, type, name, and posed by multiple sequences (e.g., name and description), whereas HS cards have eight singu- description)—and attributes (e.g., attack and cost), lar fields (attack, health, cost and durability, rar- and the length of the generated sequences. Thus, ity, type, race and class) and two text fields (name we propose an extension to attention-based neu- and description). Text fields are tokenized by ral models (Bahdanau et al., 2014) to attend over splitting on whitespace and punctuation, with ex- structured inputs. Finally, we propose a code com- ceptions accounting for domain specific artifacts pression method to reduce the size of the code (e.g., Green mana is described as “{G}” in MTG). without impacting the quality of the predictions. Empty fields are replaced with a “NIL” token. Experiments performed on our new datasets, The code for the HS card in Figure 1 is shown and a further pre-existing one, suggest that our ex- in Figure 2. The effect of “drawing cards until the tensions outperform strong benchmarks. player has as many cards as the opponent” is im- The paper is structured as follows: We first plemented by computing the difference between describe the data collection process (Section 2) the players’ hands and invoking the draw method and formally define our problem and our base- that number of times. This illustrates that the map- line method (Section 3). Then, we propose our ping between the description and the code is non- extensions, namely, the structured attention mech- linear, as no information is given in the text regard- anism (Section 4) and the LPN architecture (Sec- ing the specifics of the implementation. tion 5). We follow with the description of our code compression algorithm (Section 6). Our model class DivineFavor(SpellCard): def __init__(self): is validated by comparing with multiple bench- super().__init__("Divine Favor", 3, marks (Section 7). Finally, we contextualize our CHARACTER_CLASS.PALADIN, CARD_RARITY.RARE) def use(self, player, game): findings with related work (Section 8) and present super().use(player, game) difference= len(game.other_player.hand) the conclusions of this work (Section 9). - len(player.hand) fori in range(0, difference): player.draw() 2 Dataset Extraction

We obtain data from open source implementations Figure 2: Code for the HS card “Divine Favor”. of two different TCGs, MTG in Java2 and HS in Python.3 The statistics of the corpora are illustrated in Table 1. In both corpora, each card is im- 3 Problem Definition plemented in a separate class file, which we strip Given the description of a card x, our decoding of imports and comments. We categorize the con- problem is to find the code yˆ so that: tent of each card into two different groups: singular fields that contain only one value; and text yˆ = argmax log P (y | x) (1) fields, which contain multiple words representing y different units of meaning. In MTG, there are six Here log P (y | x) is estimated by a given model. singular fields (attack, defense, rarity, set, id, and We define y = y1..y|y| as the sequence of char- 2github.com/magefree/mage/ acters of the code with length |y|. We index each 3github.com/danielyule/hearthbreaker/ input field with k = 1..|x|, where |x| quantifies the number of input fields. |xk| denotes the number of tokens in xk and xki selects the i-th token. 4 Structured Attention Background When |x| = 1, the attention model of Bahdanau et al. (2014) applies. Following the chain rule, log P (y|x) = P t=1..|y| log P (yt|y1..yt−1, x), each token yt is predicted conditioned on the previously generated sequence y ..y and input sequence x = 1 t−1 1 Figure 3: Illustration of the structured attention x ..x . Probability are estimated with a soft- 11 1|x1| mechanism operating on a single time stamp t. max over the vocabulary Y :

p(yt|y1..yt−1, x1) = softmax(ht) (2) yt∈Y character level (cf. C2W row). A context-aware where ht is the Recurrent Neural Network (RNN) representation is built for words in the text fields state at time stamp t, which is modeled as using a bidirectional LSTM (cf. Bi-LSTM row). g(yt−1, ht−1, zt). g(·) is a recurrent update func- Computing attention over multiple input fields is tion for generating the new state ht based on problematic as each input field’s vectors have dif- the previous token yt−1, the previous state ht−1, ferent sizes and value ranges. Thus, we learn a and the input text representation zt. We imple- linear projection mapping each input token xki to ment g using a Long Short-Term Memory (LSTM) a vector with a common dimensionality and value RNNs (Hochreiter and Schmidhuber, 1997). range (cf. Linear row). Denoting this process as The attention mechanism generates the repre- f(xki), we extend Equation 3 as: sentation of the input sequence x = x11..x1|x1|, and zt is computed as the weighted sum zt = aki = softmax(v(f(xki), ht−1)) (4) xki∈x P a h(x ), where a is the attention co- i=1..|x1| i 1i i efficient obtained for token x1i and h is a func- Here a scalar coefficient aki is computed for each tion that maps each x1i to a continuous vector. In input token xki (cf. “Tanh”, “Linear”, and “Soft- general, h is a function that projects x1i by learn- max” rows). Thus, the overall input representation ing a lookup table, and then embedding contex- zt is computed as: tual words by defining an RNN. Coefficients ai X z = a f(x ) are computed with a softmax over input tokens t ij ki (5) k=1..|x|,i=1..|xk| x11..x1|x1|:

ai = softmax (v(h(x1i), ht−1)) (3) 5 Latent Predictor Networks x1i∈x Background In order to decode from x to y, Function v computes the affinity of each token x 1i many words must be copied into the code, such and the current output context h . A common t−1 as the name of the card, the attack and the cost implementation of v is to apply a linear projection values. If we observe the HS card in Figure 1 from h(x ): h (where : is the concatenation 1i t−1 and the respective code in Figure 2, we observe operation) into a fixed size vector, followed by a that the name “Divine Favor” must be copied into tanh and another linear projection. the class name and in the constructor, along with Our Approach We extend the computation of the cost of the card “3”. As explained earlier, zt for cases when x corresponds to multiple fields. this problem is not specific to our task: for in- Figure 3 illustrates how the MTG card “Serra An- stance, in the dataset of Oda et al. (2015), a model gel” is encoded, assuming that there are two singu- must learn to map from timeout = int ( lar fields and one text field. We first encode each timeout ) to “convert timeout into an integer.”, token xki using the C2W model described in Ling where the name of the variable “timeout” must et al. (2015), which is a replacement for lookup ta- be copied into the output sequence. The same is- bles where word representations are learned at the sue exists for proper nouns in machine translation Figure 4: Generation process for the code init(‘Tirion Fordring’,8,6,6) using LPNs.

which are typically copied from one language to row). The same applies to the generation of the the other. Pointer networks (Vinyals et al., 2015) attack, health and cost values as each of these pre- address this by deﬁning a probability distribution dictors is an element in R. Thus, we deﬁne our ob- over a set of units that can be copied c = c1..c|c|. jective function as a marginal log likelihood func- The probability of copying a unit ci is modeled as: tion over a latent variable ω:

p(c ) = softmax(v(h(c ), q)) (6) i i X ci∈c log P (y | x) = log P (y, ω | x) (7) As in the attention model (Equation 3), v is a func- ω∈ω¯ tion that computes the affinity between an embed- Formally, ω is a sequence of pairs rt, st, where ded copyable unit h(ci) and an arbitrary vector q. rt ∈ R denotes the predictor that is used at timestamp t and s the generated string. We decom- Our Approach Combining pointer networks t pose P (y, ω | x) as the product of the probabilities with a character-based softmax is in itself difficult of segments s and predictors r : as these generate segments of different granularity t t and there is no ground truth of which predictor to Y P (y, ω | x) = P (st, rt | y1..yt−1, x) = use at each time stamp. We now describe Latent rt,st∈ω Predictor Networks, which model the conditional Y P (s | y ..y , x, r )P (r | y ..y , x) probability log P (y|x) over the latent sequence of t 1 t−1 t t 1 t−1 r ,s ∈ω predictors used to generate y. t t We assume that our model uses multiple pre- where the generation of each segment is per- dictors r ∈ R, where each r can generate formed in two steps: select the predictor rt with multiple segments st = yt..yt+|st|−1 with ar- probability P (rt | y1..yt−1, x) and then gener- bitrary length |st| at time stamp t. An ex- ate st conditioned on predictor rt with probabil- ample is illustrated in Figure 4, where we ob- ity log P (st | y1..yt−1, x, rt). The probability of serve that to generate the code init(‘Tirion each predictor is computed using a softmax over Fordring’,8,6,6), a pointer network can all predictors in R conditioned on the previous 13 be used to generate the sequences y7 =Tirion state ht−1 and the input representation zt (cf. “Se- 22 and y14=Fordring (cf. “Copy From Name” lect Predictor” box). Then, the probability of gen- row). These sequences can also be generated us- erating the segment st depends on the predictor ing a character softmax (cf. “Generate Characters” type. We define three types of predictors: Character Generation Generate a single char- α14 is the sum of the path that generates the se- acter from observed characters from the training quence init(‘Tirion with characters alone, data. Only one character is generated at each time and the path that generates the word Tirion by stamp with probability given by Equation 2. copying from the input. β21 includes the probability of all paths that follow the generation of Copy Singular Field For singular fields only Fordring, which include 2×3×3 different paths the field itself can be copied, for instance, the due to the three decision points that follow (e.g. value of the attack and cost attributes or the type generating 8 using a character softmax vs. copy- of card. The size of the generated segment is the ing from the cost). Finally, ξ refers to the path number of characters in the copied field and the r that generates Fordring character by character. segment is generated with probability 1. While the number of possible paths grows ex- Copy Text Field For text fields, we allow each ponentially, α and β can be computed efficiently of the words xki within the field to be copied. using the forward-backward algorithm for Semi- The probability of copying a word is learned with Markov models (Sarawagi and Cohen, 2005), a pointer network (cf. “Copy From Name” box), where we associate P (rt | y1..yt−1, x) to edges where h(ci) is set to the representation of the word and P (st | y1..yt−1, x, rt) to nodes in the Markov f(xki) and q is the concatenation ht−1 : zt of the chain. state and input vectors. This predictor generates a The derivative ∂ log P (y|x) can be com- ∂P (st|y1..yt−1,x,rt) segment with the size of the copied word. puted using the same logic: It is important to note that the state vector ht−1 is generated by building an RNN over the se- ∂αt,s P (st | y1..yt−1, x, rt)β + ξr quence of characters up until the time stamp t − 1, t t+|st|−1 t = P (y | x)∂P (st | y1..yt−1, x, rt) i.e. the previous context yt−1 is encoded at the α β character level. This allows the number of pos- t,rt t+|st|−1 sible states to remain tractable at training time. α|y|+1

5.1 Inference Once again, we denote αt,rt = αtP (rt | At training time we use back-propagation to max- y1..yt−1, x) as the cumulative probability of all imize the probability of observed code, according values of ω that lead to st, exclusive. to Equation 7. Gradient computation must be per- An intuitive interpretation of the derivatives is formed with respect to each computed probabil- that gradient updates will be stronger on prob- ity P (rt | y1..yt−1, x) and P (st | y1..yt−1, x, rt). ability chains that are more likely to generate The derivative ∂ log P (y|x) yields: the output sequence. For instance, if the model ∂P (rt|y1..yt−1,x) learns a good predictor to copy names, such as Fordring, other predictors that can also gener- ∂αtP (rt | y1..yt−1, x)βt,rt + ξrt αtβt,rt = ate the same sequences, such as the character soft- P (y | x)∂P (rt | y1..yt−1, x) α|y|+1 max will allocate less capacity to the generation Here αt denotes the cumulative probability of all of names, and focus on elements that they excel at values of ω up until time stamp t and α|y|+1 yields (e.g. generation of keywords). the marginal probability P (y | x). βt,r = P (st | t 5.2 Decoding y1..yt−1)βt+|st|−1 denotes the cumulative probability starting from predictor rt at time stamp t, ex- Decoding is performed using a stack-based de- clusive. This includes the probability of the gener- coder with beam search. Each state S corre- ated segment P (st | y1..yt−1, x, rt) and the proba- sponds to a choice of predictor rt and segment st bility of all values of ω starting from timestamp t+ at a given time stamp t. This state is scored as |st|−1, that is, all possible sequences that generate V (S) = log P (st | y1..yt−1, x, rt) + log P (rt | segment y after segment st is produced. For com- y1..yt−1, x) + V (prev(S)), where prev(S) de- pleteness, ξr denotes the cumulative probabilities notes the predecessor state of S. At each time of all ω that do not include rt. To illustrate this, stamp, the n states with the highest scores V we refer to Figure 4 and consider the timestamp are expanded, where n is the size of the beam. t = 14, where the segment s14 =Fordring is For each predictor rt, each output st generates a generated. In this case, the cumulative probability new state. Finally, at each timestamp t, all states which produce the same output up to that point are X v size merged by summing their probabilities. X1 card)⇓{⇓super(card);⇓}⇓@Override⇓public 1041 X2 bility 1002 6 Code Compression X3 ;⇓this. 964 X4 (UUID ownerId)⇓{⇓super(ownerId 934 As the attention-based model traverses all input X5 public 907 X6 new 881 units at each generation step, generation becomes X7 copy() 859 quite expensive for datasets such as MTG where X8 }”)X3expansionSetCode = ” 837 the average card code contains 1,080 characters. X9 X6CardType[]{CardType. 815 X10 ffect 794 While this is not the essential contribution in our paper, we propose a simple method to compress Table 2: First 10 compressed units in MTG. We the code while maintaining the structure of the replaced newlines with ⇓ and spaces with . code, allowing us to train on datasets with longer code (e.g., MTG). Once vˆ is obtained, we replace all occurrences The idea behind that method is that many of vˆ with a new non-terminal symbol. This pro- keywords in the programming language (e.g., cess is repeated until a desired average size for the public and return) as well as frequently used code is reached. While training is performed on functions and classes (e.g., Card) can be learned the compressed code, the decoding will undergo without character level information. We exploit an additional step, where the compressed code is this by mapping such strings onto additional sym- restored by expanding the all Xi. Table 2 shows bols Xi (e.g., public class copy() → the ﬁrst 10 replacements from the MTG dataset, “X1 X2 X3()”). Formally, we seek the string vˆ reducing its average size from 1080 to 794. among all strings V (max) up to length max that maximally reduces the size of the corpus: 7 Experiments

vˆ = argmax (len(v) − 1)C(v) (8) Datasets Tests are performed on the two v∈V (max) datasets provided in this paper, described in Ta- where C(v) is the number of occurrences of ble 1. Additionally, to test the model’s ability of v in the training corpus and len(v) its length. generalize to other domains, we report results in (len(v) − 1)C(v) can be seen as the number of the Django dataset (Oda et al., 2015), comprising characters reduced by replacing v with a non- of 16000 training, 1000 development and 1805 test terminal symbol. To find q(v) efficiently, we lever- annotations. Each data point consists of a line of age the fact that C(v) ≤ C(v0) if v contains v0. It Python code together with a manually created nat- follows that (max − 1)C(v) ≤ (max − 1)C(v0), ural language description. which means that the maximum compression ob- Neural Benchmarks We implement two stan- tainable for v at size max is always lower than dard neural networks, namely a sequence-to- 0 that of v . Thus, if we can find a v¯ such that sequence model (Sutskever et al., 2014) and an 0 (len(¯v) − 1)C(¯v) > (max − 1)C(v ), that is v¯ attention-based model (Bahdanau et al., 2014). at the current size achieves a better compression The former is adapted to work with multiple in- 0 rate than v at the maximum length, then it fol- put fields by concatenating them, while the latter lows that all sequences that contain v can be dis- uses our proposed attention model. These models carded as candidates. Based on this idea, our itera- are denoted as “Sequence” and “Attention”. tive search starts by obtaining the counts C(v) for all segments of size s = 2, and computing the best Machine Translation Baselines Our problem scoring segment v¯. Then, we build a list L(s) of can also be viewed in the framework of seman- all segments that achieve a better compression rate tic parsing (Wong and Mooney, 2006; Lu et al., than v¯ at their maximum size. At size s + 1, only 2008; Jones et al., 2012; Artzi et al., 2015). Unfor- segments that contain a element in L(s − 1) need tunately, these approaches define strong assump- to be considered, making the number of substrings tions regarding the grammar and structure of the to be tested to be tractable as s increases. The al- output, which makes it difficult to generalize for gorithm stops once s reaches max or the newly other domains (Kwiatkowski et al., 2010). How- generated list L(s) contains no elements. ever, the work in Andreas et al. (2013) provides evidence that using machine translation systems ial. For instance, the correctness cards with con- without committing to such assumptions can lead ditional (e.g. if player has no cards, to results competitive with the systems described then draw a card) or non-deterministc (e.g. above. We follow the same approach and create put a random card in your hand) ef- a phrase-based (Koehn et al., 2007) model and a fects cannot be simply validated by running the hierarchical model (or PCFG) (Chiang, 2007) as code. benchmarks for the work presented here. As these models are optimized to generate words, not char- Setup The multiple input types (Figure 3) are acters, we implement a tokenizer that splits on all hyper-parametrized as follows: The C2W model punctuation characters, except for the “ ” charac- (cf. “C2W” row) used to obtain continuous vec- ter. We also facilitate the task by splitting Camel- tors for word types uses character embeddings of Case words (e.g., class TirionFordring size 100 and LSTM states of size 300, and gener- → class Tirion Fordring). Otherwise all ates vectors of size 300. We also report on results class names would not be generated correctly by using word lookup tables of size 300, where we these methods. We used the models implemented replace singletons with a special unknown token in Moses to generate these baselines using stan- with probability 0.5 during training, which is then dard parameters, using IBM Alignment Model 4 used for out-of-vocabulary words. For text fields, for word alignments (Och and Ney, 2003), MERT the context (cf. “Bi-LSTM” row) is encoded with for tuning (Sokolov and Yvon, 2011) and a 4-gram a Bi-LSTM of size 300 for the forward and back- Kneser-Ney Smoothed language model (Heafield ward states. Finally, a linear layer maps the differ- et al., 2013). These models will be denoted as ent input tokens into a common space with of size “Phrase” and “Hierarchical”, respectively. 300 (cf. “Linear” row). As for the attention model, we used an hidden layer of size 200 before ap- Retrieval Baseline It was reported in (Quirk et plying the non-linearity (row “Tanh”). As for the al., 2015) that a simple retrieval method that out- decoder (Figure 4), we encode output characters puts the most similar input for each sample, mea- with size 100 (cf. “output (y)” row), and an LSTM sured using Levenshtein Distance, leads to good state of size 300 and an input representation of results. We implement this baseline by computing size 300 (cf. “State(h+z)” row). For each pointer the average Levenshtein Distance for each input network (e.g., “Copy From Name” box), the inter- field. This baseline is denoted “Retrieval”. section between the input units and the state units are performed with a vector of size 200. Train- Evaluation A typical metric is to compute the ing is performed using mini-batches of 20 sam- accuracy of whether the generated code exactly ples using AdaDelta (Zeiler, 2012) and we report matches the reference code. This is informative results using the iteration with the highest BLEU as it gives an intuition of how many samples can score on the validation set (tested at intervals of be used without further human post-editing. How- 5000 mini-batches). Decoding is performed with a ever, it does not provide an illustration on the de- beam of 1000. As for compression, we performed gree of closeness to achieving the correct code. a grid search over compressing the code from 0% Thus, we also test using BLEU-4 (Papineni et to 80% of the original average length over inter- al., 2002) at the token level. There are clearly vals of 20% for the HS and Django datasets. On problems with these metrics. For instance, source the MTG dataset, we are forced to compress the code can be correct without matching the refer- code up to 80% due to performance issues when ence. The code in Figure 2, could have also been training with extremely long sequences. implemented by calling the draw function in an cycle that exists once both players have the same 7.1 Results number of cards in their hands. Some tasks, such Baseline Comparison Results are reported in as the generation of queries (Zelle and Mooney, Table 3. Regarding the retrieval results (cf. “Re- 1996), have overcome this problem by executing trieval” row), we observe the best BLEU the query and checking if the result is the same scores among the baselines in the card datasets as the annotation. However, we shall leave the (cf. “MTG” and “HS” columns). A key advantage study of these methologies for future work, as of this method is that retrieving existing entities adapting these methods for our tasks is not triv- guarantees that the output is well formed, with no MTG HS Django Component Comparison We present ablation BLEU Acc BLEU Acc BLEU Acc results in order to analyze the contribution of each Retrieval 54.9 0.0 62.5 0.0 18.6 14.7 of our modifications. Removing the C2W model Phrase 49.5 0.0 34.1 0.0 47.6 31.5 (cf. “– C2W” row) yields a small deterioration, as Hierarchical 50.6 0.0 43.2 0.0 35.9 9.5 word lookup tables are more susceptible to spar- Sequence 33.8 0.0 28.5 0.0 44.1 33.2 sity. The only exception is in the HS dataset, Attention 50.1 0.0 43.9 0.0 58.9 38.8 where lookup tables perform better. We believe Our System 61.4 4.8 65.6 4.5 77.6 62.3 that this is because the small size of the training – C2W 60.9 4.4 67.1 4.5 75.9 60.9 – Compress - - 59.7 6.1 76.3 61.3 set does not provide enough evidence for the char- – LPN 52.4 0.0 42.0 0.0 63.3 40.8 acter model to scale to unknown words. Surpris- – Attention 39.1 0.5 49.9 3.0 48.8 34.5 ingly, running our model compression code (cf. “– Compress” row) actually yields better results. Ta- Table 3: BLEU and Accuracy scores for the pro- ble 4 provides an illustration of the results for dif- posed task on two in-domain datasets (HS and ferent compression rates. We obtain the best re- MTG) and an out-of-domain dataset (Django). sults with an 80% compression rate (cf. “BLEU Scores” block), while maximising the time each Compression 0% 20% 40% 60% 80% card is processed (cf. “Seconds Per Card” block). Seconds Per Card While the reason for this is uncertain, it is simi- Softmax 2.81 2.36 1.88 1.42 0.94 lar to the finding that language models that output LPN 3.29 2.65 2.35 1.93 1.41 characters tend to under-perform those that output BLEU Scores Softmax 44.2 46.9 47.2 51.4 52.7 words (Jozefowicz´ et al., 2016). This applies when LPN 59.7 62.8 61.1 66.4 67.1 using the regular optimization process with a character softmax (cf. “Softmax” rows), but also when Table 4: Results with increasing compression rates using the LPN (cf. “LPN” rows). We also note with a regular softmax (cf. “Softmax”) and a LPN that the training speed of LPNs is not significantly (cf. “LPN”). Performance values (cf. “Seconds Per lower as marginalization is performed with a dy- Card” block) are computed using one CPU. namic program. Finally, a significant decrease is observed if we remove the pointer networks (cf. “– LPN” row). These improvements also generalize syntactic errors such as producing a non-existent to sequence-to-sequence models (cf. “– Attention” function call or generating incomplete code. As row), as the scores are superior to the sequence-to- BLEU penalizes length mismatches, generating sequence benchmark (cf. “Sequence” row). code that matches the length of the reference provides a large boost. The phrase-based transla- Result Analysis Examples of the code gener- tion model (cf. “Phrase” row) performs well in ated for two cards are illustrated in Figure 5. the Django (cf. “Django” column), where map- We obtain the segments that were copied by the ping from the input to the output is mostly mono- pointer networks by computing the most likely tonic, while the hierarchical model (cf. “Hierar- predictor for those segments. We observe from the chical” row) yields better performance on the card marked segments that the model effectively copies datasets as the concatenation of the input fields the attributes that match in the output, including needs to be reordered extensively into the out- the name of the card that must be collapsed. As put sequence. Finally, the sequence-to-sequence expected, the majority of the errors originate from model (cf. “Sequence” row) yields extremely low inaccuracies in the generation of the effect of the results, mainly due to the lack of capacity needed card. While it is encouraging to observe that a to memorize whole input and output sequences, small percentage of the cards are generated cor- while the attention based model (cf. “Attention” rectly, it is worth mentioning that these are the re- row) produces results on par with phrase-based sult of many cards possessing similar effects. The systems. Finally, we observe that by including all “Madder Bomber” card is generated correctly as the proposed components (cf. “Our System” row), there is a similar card “Mad Bomber” in the train- we obtain significant improvements over all base- ing set, which implements the same effect, except lines in the three datasets and is the only one that that it deals 3 damage instead of 6. Yet, it is a obtains non-zero accuracies in the card datasets. promising result that the model was able to capture this difference. However, in many cases, effects et al., 2015), Bayesian Tree Transducers (Jones et that radically differ from seen ones tend to be gen- al., 2012; Lei et al., 2013) and Probabilistic Con- erated incorrectly. In the card “Preparation”, we text Free Grammars (Andreas et al., 2013). The observe that while the properties of the card are work in natural language programming (Vadas and generated correctly, the effect implements a unre- Curran, 2005; Manshadi et al., 2013), where users lated one, with the exception of the value 3, which write lines of code from natural language, is also is correctly copied. Yet, interestingly, it still gener- related to our work. Finally, the reverse map- ates a valid effect, which sets a minion’s attack to ping from code into natural language is explored 3. Investigating better methods to accurately gen- in (Oda et al., 2015). erate these effects will be object of further studies. Character-based sequence-to-sequence models have previously been used to generate code from natural language in (Mou et al., 2015). Inspired by these works, LPNs provide a richer framework by employing attention models (Bahdanau et al., 2014), pointer networks (Vinyals et al., 2015) and character-based embeddings (Ling et al., 2015). Our formulation can also be seen as a generaliza- tion of Allamanis et al. (2016), who implement a special case where two predictors have the same granularity (a sub-token softmax and a pointer network). Finally, HMMs have been employed in neural models to marginalize over label sequences in (Collobert et al., 2011; Lample et al., 2016) by modeling transitions between labels. Figure 5: Examples of decoded cards from HS. Copied segments are marked in green and incor- 9 Conclusion rect segments are marked in red. We introduced a neural network architecture named Latent Prediction Network, which allows efficient marginalization over multiple predictors. 8 Related Work Under this architecture, we propose a generative model for code generation that combines a char- While we target widely used programming lan- acter level softmax to generate language-specific guages, namely, Java and Python, our work is tokens and multiple pointer networks to copy key- related to studies on the generation of any ex- words from the input. Along with other exten- ecutable code. These include generating regu- sions, namely structured attention and code com- lar expressions (Kushman and Barzilay, 2013), pression, our model is applied on on both exist- and the code for parsing input documents (Lei ing datasets and also on a newly created one with et al., 2013). Much research has also been in- implementations of TCG game cards. Our experi- vested in generating formal languages, such as ments show that our model out-performs multiple database queries (Zelle and Mooney, 1996; Be- benchmarks, which demonstrate the importance of rant et al., 2013), agent specific language (Kate combining different types of predictors. et al., 2005) or smart phone instructions (Le et al., 2013). Finally, mapping natural language into a sequence of actions for the generation of References executable code (Branavan et al., 2009). Fi- [Allamanis et al.2016] M. Allamanis, H. Peng, and nally, a considerable effort in this task has fo- C. Sutton. 2016. A Convolutional Attention Net- cused on semantic parsing (Wong and Mooney, work for Extreme Summarization of Source Code. 2006; Jones et al., 2012; Lei et al., 2013; Artzi ArXiv e-prints, February. et al., 2015; Quirk et al., 2015). Recently pro- [Andreas et al.2013] Jacob Andreas, Andreas Vlachos, posed models focus on Combinatory Categorical and Stephen Clark. 2013. Semantic parsing as ma- Grammars (Kushman and Barzilay, 2013; Artzi chine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Lin- Twentieth National Conference on Artificial Intelli- guistics, pages 47–52, August. gence (AAAI-05), pages 1062–1068, Pittsburgh, PA, July. [Artzi et al.2015] Yoav Artzi, Kenton Lee, and Luke Zettlemoyer. 2015. Broad-coverage ccg semantic [Koehn et al.2007] Philipp Koehn, Hieu Hoang, parsing with amr. In Proceedings of the 2015 Con- Alexandra Birch, Chris Callison-Burch, Marcello ference on Empirical Methods in Natural Language Federico, Nicola Bertoldi, Brooke Cowan, Wade Processing, pages 1699–1710, September. Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrejˇ Bojar, Alexandra Constantin, and Evan [Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Herbst. 2007. Moses: Open source toolkit for Cho, and Yoshua Bengio. 2014. Neural machine statistical machine translation. In Proceedings of translation by jointly learning to align and translate. the 45th Annual Meeting of the ACL on Interactive CoRR, abs/1409.0473. Poster and Demonstration Sessions, pages 177–180.

[Berant et al.2013] Jonathan Berant, Andrew Chou, [Kushman and Barzilay2013] Nate Kushman and Roy Frostig, and Percy Liang. 2013. Semantic pars- Regina Barzilay. 2013. Using semantic unifica- ing on freebase from question-answer pairs. In Pro- tion to generate regular expressions from natural ceedings of the 2013 Conference on Empirical Meth- language. In Proceedings of the 2013 Conference ods in Natural Language Processing, pages 1533– of the North American Chapter of the Association 1544. for Computational Linguistics: Human Language Technologies, pages 826–836, Atlanta, Georgia, [Branavan et al.2009] S. R. K. Branavan, Harr Chen, June. Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to [Kwiatkowski et al.2010] Tom Kwiatkowski, Luke actions. In Proceedings of the Joint Conference of Zettlemoyer, Sharon Goldwater, and Mark Steed- the 47th Annual Meeting of the ACL and the 4th In- man. 2010. Inducing probabilistic ccg grammars ternational Joint Conference on Natural Language from logical form with higher-order unification. In Processing of the AFNLP, pages 82–90. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages [Chiang2007] David Chiang. 2007. Hierarchi- 1223–1233. cal phrase-based translation. Comput. Linguist., 33(2):201–228, June. [Lample et al.2016] G. Lample, M. Ballesteros, S. Sub- ramanian, K. Kawakami, and C. Dyer. 2016. Neural [Collobert et al.2011] Ronan Collobert, Jason Weston, Architectures for Named Entity Recognition. ArXiv Leon´ Bottou, Michael Karlen, Koray Kavukcuoglu, e-prints, March. and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., [Le et al.2013] Vu Le, Sumit Gulwani, and Zhendong 12:2493–2537, November. Su. 2013. Smartsynth: Synthesizing smartphone automation scripts from natural language. In Pro- [Heafield et al.2013] Kenneth Heafield, Ivan ceeding of the 11th Annual International Confer- Pouzyrevsky, Jonathan H. Clark, and Philipp ence on Mobile Systems, Applications, and Services, Koehn. 2013. Scalable modified Kneser-Ney lan- pages 193–206. guage model estimation. In Proceedings of the 51th Annual Meeting on Association for Computational [Lei et al.2013] Tao Lei, Fan Long, Regina Barzilay, Linguistics, pages 690–696. and Martin Rinard. 2013. From natural language specifications to program input parsers. In Proceed- [Hochreiter and Schmidhuber1997] Sepp Hochreiter ings of the 51st Annual Meeting of the Association and Jurgen¨ Schmidhuber. 1997. Long short- for Computational Linguistics (Volume 1: Long Pa- term memory. Neural Comput., 9(8):1735–1780, pers), pages 1294–1303, Sofia, Bulgaria, August. November. [Ling et al.2015] Wang Ling, Tiago Lu´ıs, Lu´ıs Marujo, [Jones et al.2012] Bevan Keeley Jones, Mark Johnson, Ramon´ Fernandez Astudillo, Silvio Amir, Chris and Sharon Goldwater. 2012. Semantic parsing Dyer, Alan W Black, and Isabel Trancoso. 2015. with bayesian tree transducers. In Proceedings of Finding function in form: Compositional character the 50th Annual Meeting of the Association for Com- models for open vocabulary word representation. In putational Linguistics, pages 488–496. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. [Jozefowicz´ et al.2016] Rafal Jozefowicz,´ Oriol Vinyals, Mike Schuster, Noam Shazeer, and [Lu et al.2008] Wei Lu, Hwee Tou Ng, Wee Sun Lee, Yonghui Wu. 2016. Exploring the limits of and Luke S. Zettlemoyer. 2008. A generative model language modeling. CoRR, abs/1602.02410. for parsing natural language to meaning representations. In Proceedings of the 2008 Conference on [Kate et al.2005] Rohit J. Kate, Yuk Wah Wong, and Empirical Methods in Natural Language Process- Raymond J. Mooney. 2005. Learning to transform ing, EMNLP ’08, pages 783–792, Stroudsburg, PA, natural to formal languages. In Proceedings of the USA. Association for Computational Linguistics. [Manshadi et al.2013] Mehdi Hafezi Manshadi, Daniel [Wong and Mooney2006] Yuk Wah Wong and Ray- Gildea, and James F. Allen. 2013. Integrating pro- mond J. Mooney. 2006. Learning for semantic gramming by example and natural language pro- parsing with statistical machine translation. In Pro- gramming. In Marie desJardins and Michael L. ceedings of the Main Conference on Human Lan- Littman, editors, AAAI. AAAI Press. guage Technology Conference of the North Amer- ican Chapter of the Association of Computational [Mou et al.2015] Lili Mou, Rui Men, Ge Li, Lu Zhang, Linguistics, pages 439–446. and Zhi Jin. 2015. On end-to-end program generation from user intention by deep neural networks. [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: CoRR, abs/1510.07211. an adaptive learning rate method. CoRR, abs/1212.5701. [Och and Ney2003] Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical [Zelle and Mooney1996] John M. Zelle and Ray- alignment models. Comput. Linguist., 29(1):19–51, mond J. Mooney. 1996. Learning to parse database March. queries using inductive logic programming. In [Oda et al.2015] Yusuke Oda, Hiroyuki Fudaba, Gra- AAAI/IAAI, pages 1050–1055, Portland, OR, Au- ham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki gust. AAAI Press/MIT Press. Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation. In 30th IEEE/ACM Inter- national Conference on Automated Software Engi- neering (ASE), Lincoln, Nebraska, USA, November. [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. [Quirk et al.2015] Chris Quirk, Raymond Mooney, and Michel Galley. 2015. Language to code: Learn- ing semantic parsers for if-this-then-that recipes. In Proceedings of the 53rd Annual Meeting of the As- sociation for Computational Linguistics, pages 878– 888, Beijing, China, July. [Sarawagi and Cohen2005] Sunita Sarawagi and William W. Cohen. 2005. Semi-markov conditional random fields for information extraction. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1185–1192. MIT Press. [Sokolov and Yvon2011] Artem Sokolov and François Yvon. 2011. Minimum Error Rate Semi-Ring. In Mikel Forcada and Heidi Depraetere, editors, Pro- ceedings of the European Conference on Machine Translation, pages 241–248, Leuven, Belgium. [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215. [Vadas and Curran2005] David Vadas and James R. Curran. 2005. Programming with unrestricted natural language. In Proceedings of the Australasian Language Technology Workshop 2005, pages 191– 199, Sydney, Australia, December. [Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Infor- mation Processing Systems 28, pages 2674–2682. Curran Associates, Inc.