Data Recombination for Neural Semantic Parsing

Robin Jia and Percy Liang
Computer Science Department, Stanford University
[email protected]  [email protected]

Abstract

Modeling crisp logical regularities is crucial in semantic parsing, making it difficult for neural models with no task-specific prior knowledge to achieve good results. In this paper, we introduce data recombination, a novel framework for injecting such prior knowledge into a model. From the training data, we induce a high-precision synchronous context-free grammar, which captures important conditional independence properties commonly found in semantic parsing. We then train a sequence-to-sequence recurrent network (RNN) model with a novel attention-based copying mechanism on datapoints sampled from this grammar, thereby teaching the model about these structural properties. Data recombination improves the accuracy of our RNN model on three semantic parsing datasets, leading to new state-of-the-art performance on the standard GeoQuery dataset for models with comparable supervision.

Figure 1: An overview of our system. Given a dataset, we induce a high-precision synchronous context-free grammar. We then sample from this grammar to generate new "recombinant" examples, which we use to train a sequence-to-sequence RNN. (The figure shows original examples such as "what are the major cities in utah ?" and "what states border maine ?", induction of a synchronous CFG, sampled recombinant examples such as "what are the major cities in [states border [maine]] ?", "what are the major cities in [states border [utah]] ?", "what states border [states border [maine]] ?", and "what states border [states border [utah]] ?", and training of the sequence-to-sequence RNN.)

1 Introduction

Semantic parsing, the precise translation of natural language utterances into logical forms, has many applications, including question answering (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Liang et al., 2011; Berant et al., 2013), instruction following (Artzi and Zettlemoyer, 2013b), and regular expression generation (Kushman and Barzilay, 2013). Modern semantic parsers (Artzi and Zettlemoyer, 2013a; Berant et al., 2013) are complex pieces of software, requiring hand-crafted features, lexicons, and grammars.

Meanwhile, recurrent neural networks (RNNs) have made swift inroads into many structured prediction tasks in NLP, including machine translation (Sutskever et al., 2014; Bahdanau et al., 2014) and syntactic parsing (Vinyals et al., 2015b; Dyer et al., 2015). Because RNNs make very few domain-specific assumptions, they have the potential to succeed at a wide variety of tasks with minimal feature engineering. However, this flexibility also puts RNNs at a disadvantage compared to standard semantic parsers, which can generalize naturally by leveraging their built-in awareness of logical compositionality.

In this paper, we introduce data recombination, a generic framework for declaratively injecting prior knowledge into a domain-general structured prediction model.


In data recombination, prior knowledge about a task is used to build a high-precision generative model that expands the empirical distribution by allowing fragments of different examples to be combined in particular ways. Samples from this generative model are then used to train a domain-general model. In the case of semantic parsing, we construct a generative model by inducing a synchronous context-free grammar (SCFG), creating new examples such as those shown in Figure 1; our domain-general model is a sequence-to-sequence RNN with a novel attention-based copying mechanism. Data recombination boosts the accuracy of our RNN model on three semantic parsing datasets. On the GEO dataset, data recombination improves test accuracy by 4.3 percentage points over our baseline RNN, leading to new state-of-the-art results for models that do not use a seed lexicon for predicates.

2 Problem statement

We cast semantic parsing as a sequence-to-sequence task. The input utterance x is a sequence of words x_1, ..., x_m ∈ V^(in), the input vocabulary; similarly, the output logical form y is a sequence of tokens y_1, ..., y_n ∈ V^(out), the output vocabulary. A linear sequence of tokens might appear to lose the hierarchical structure of a logical form, but there is precedent for this choice: Vinyals et al. (2015b) showed that an RNN can reliably predict tree-structured outputs in a linear fashion.

We evaluate our system on three existing semantic parsing datasets. Figure 2 shows sample input-output pairs from each of these datasets.

• GeoQuery (GEO) contains natural language questions about US geography paired with corresponding database queries. We use the standard split of 600 training examples and 280 test examples introduced by Zettlemoyer and Collins (2005). We preprocess the logical forms to De Bruijn index notation to standardize variable naming.

• ATIS (ATIS) contains natural language queries for a flights database paired with corresponding database queries written in lambda calculus. We train on 4473 examples and evaluate on the 448 test examples used by Zettlemoyer and Collins (2007).

• Overnight (OVERNIGHT) contains logical forms paired with natural language paraphrases across eight varied subdomains. Wang et al. (2015) constructed the dataset by generating all possible logical forms up to some depth threshold, then getting multiple natural language paraphrases for each logical form from workers on Amazon Mechanical Turk. We evaluate on the same train/test splits as Wang et al. (2015).

In this paper, we only explore learning from logical forms. In the last few years, there has been an emergence of semantic parsers learned from denotations (Clarke et al., 2010; Liang et al., 2011; Berant et al., 2013; Artzi and Zettlemoyer, 2013b). While our system cannot directly learn from denotations, it could be used to rerank candidate derivations generated by one of these other systems.

Figure 2: One example from each of our domains. We tokenize logical forms as shown, thereby casting semantic parsing as a sequence-to-sequence task.
  GEO       x: "what is the population of iowa ?"
            y: _answer ( NV , ( _population ( NV , V1 ) , _const ( V0 , _stateid ( iowa ) ) ) )
  ATIS      x: "can you list all flights from chicago to milwaukee"
            y: ( _lambda $0 e ( _and ( _flight $0 ) ( _from $0 chicago : _ci ) ( _to $0 milwaukee : _ci ) ) )
  Overnight x: "when is the weekly standup"
            y: ( call listValue ( call getProperty meeting.weekly_standup ( string start_time ) ) )
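As a concrete illustration of this sequence-to-sequence framing, the short sketch below (ours, not from the paper's released code) turns the GEO example of Figure 2 into the word sequence x and token sequence y; the whitespace-tokenization convention, with spaces already inserted around parentheses and commas, is an assumption for the example.

```python
# A minimal sketch of casting semantic parsing as sequence transduction:
# both the utterance and the logical form become flat token sequences.
utterance = "what is the population of iowa ?"
logical_form = ("_answer ( NV , ( _population ( NV , V1 ) , "
                "_const ( V0 , _stateid ( iowa ) ) ) )")

x = utterance.split()      # input word sequence x_1, ..., x_m
y = logical_form.split()   # output token sequence y_1, ..., y_n

print(x)      # ['what', 'is', 'the', 'population', 'of', 'iowa', '?']
print(y[:6])  # ['_answer', '(', 'NV', ',', '(', '_population']
```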

3 Sequence-to-sequence RNN Model

Our sequence-to-sequence RNN model is based on existing attention-based neural machine translation models (Bahdanau et al., 2014; Luong et al., 2015a), but also includes a novel attention-based copying mechanism. Similar copying mechanisms have been explored in parallel by Gu et al. (2016) and Gulcehre et al. (2016).

3.1 Basic Model

Encoder. The encoder converts the input sequence x_1, ..., x_m into a sequence of context-sensitive embeddings b_1, ..., b_m using a bidirectional RNN (Bahdanau et al., 2014). First, a word embedding function φ^(in) maps each word x_i to a fixed-dimensional vector. These vectors are fed as input to two RNNs: a forward RNN and a backward RNN. The forward RNN starts with an initial hidden state h_0^F, and generates a sequence of hidden states h_1^F, ..., h_m^F by repeatedly applying the recurrence

    h_i^F = LSTM(φ^(in)(x_i), h_{i-1}^F).    (1)

The recurrence takes the form of an LSTM (Hochreiter and Schmidhuber, 1997). The backward RNN similarly generates hidden states h_m^B, ..., h_1^B by processing the input sequence in reverse order. Finally, for each input position i, we define the context-sensitive embedding b_i to be the concatenation of h_i^F and h_i^B.

Decoder. The decoder is an attention-based model (Bahdanau et al., 2014; Luong et al., 2015a) that generates the output sequence y_1, ..., y_n one token at a time. At each time step j, it writes y_j based on the current hidden state s_j, then updates the hidden state to s_{j+1} based on s_j and y_j. Formally, the decoder is defined by the following equations:

    s_1 = tanh(W^(s) [h_m^F, h_1^B]).    (2)
    e_{ji} = s_j^T W^(a) b_i.    (3)
    α_{ji} = exp(e_{ji}) / Σ_{i'=1}^{m} exp(e_{ji'}).    (4)
    c_j = Σ_{i=1}^{m} α_{ji} b_i.    (5)
    P(y_j = w | x, y_{1:j-1}) ∝ exp(U_w [s_j, c_j]).    (6)
    s_{j+1} = LSTM([φ^(out)(y_j), c_j], s_j).    (7)

When not specified, i ranges over {1, ..., m} and j ranges over {1, ..., n}. Intuitively, the α_{ji}'s define a probability distribution over the input words, describing what words in the input the decoder is focusing on at time j. They are computed from the unnormalized attention scores e_{ji}. The matrices W^(s), W^(a), and U, as well as the embedding function φ^(out), are parameters of the model.

3.2 Attention-based Copying

In the basic model of the previous section, the next output word y_j is chosen via a simple softmax over all words in the output vocabulary. However, this model has difficulty generalizing to the long tail of entity names commonly found in semantic parsing datasets. Conveniently, entity names in the input often correspond directly to tokens in the output (e.g., "iowa" becomes iowa in Figure 2).[1]

To capture this intuition, we introduce a new attention-based copying mechanism. At each time step j, the decoder generates one of two types of actions. As before, it can write any word in the output vocabulary. In addition, it can copy any input word x_i directly to the output, where the probability with which we copy x_i is determined by the attention score on x_i. Formally, we define a latent action a_j that is either Write[w] for some w ∈ V^(out) or Copy[i] for some i ∈ {1, ..., m}. We then have

    P(a_j = Write[w] | x, y_{1:j-1}) ∝ exp(U_w [s_j, c_j]),    (8)
    P(a_j = Copy[i] | x, y_{1:j-1}) ∝ exp(e_{ji}).    (9)

The decoder chooses a_j with a softmax over all these possible actions; y_j is then a deterministic function of a_j and x. During training, we maximize the log-likelihood of y, marginalizing out a.

Attention-based copying can be seen as a combination of the standard softmax output layer of an attention-based model (Bahdanau et al., 2014) and a Pointer Network (Vinyals et al., 2015a); in a Pointer Network, the only way to generate output is to copy a symbol from the input.

[1] On GEO and ATIS, we make a point not to rely on orthography for non-entities such as "state" to _state, since this leverages information not available to previous models (Zettlemoyer and Collins, 2005) and is much less language-independent.
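To make equations (3)–(9) concrete, here is a minimal NumPy sketch (our illustration, not the released Theano implementation) of a single decoder step: it computes the attention scores e_{ji}, the context vector c_j, and a single softmax over the Write[w] and Copy[i] actions. The function name, array shapes, and random initialization are assumptions made only for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def copy_decoder_action_dist(s_j, B, W_a, U):
    """One decoder step of the attention-based copying model (a sketch).

    s_j : (d_s,)             current decoder hidden state
    B   : (m, d_b)           context-sensitive encoder embeddings b_1..b_m
    W_a : (d_s, d_b)         attention matrix of eq. (3)
    U   : (V_out, d_s + d_b) output matrix applied to [s_j, c_j], eqs. (6)/(8)

    Returns one distribution over V_out Write[w] actions followed by m Copy[i] actions.
    """
    e = B @ (W_a.T @ s_j)                            # e_ji = s_j^T W_a b_i, eq. (3)
    alpha = softmax(e)                               # attention weights, eq. (4)
    c_j = alpha @ B                                  # context vector, eq. (5)
    write_scores = U @ np.concatenate([s_j, c_j])    # unnormalized Write scores, eq. (8)
    copy_scores = e                                  # unnormalized Copy scores, eq. (9)
    return softmax(np.concatenate([write_scores, copy_scores]))

# Toy usage with random parameters (shapes are illustrative).
m, d_s, d_b, V_out = 7, 4, 6, 10
rng = np.random.RandomState(0)
probs = copy_decoder_action_dist(rng.randn(d_s), rng.randn(m, d_b),
                                 rng.randn(d_s, d_b), rng.randn(V_out, d_s + d_b))
assert abs(probs.sum() - 1.0) < 1e-9  # first V_out entries: Write[w]; last m: Copy[i]
```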

Examples
  ("what states border texas ?",
   answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(texas)))))
  ("what is the highest mountain in ohio ?",
   answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(ohio))))))

Rules created by ABSENTITIES
  ROOT → ⟨"what states border STATEID ?",
          answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(STATEID))))⟩
  STATEID → ⟨"texas", texas⟩
  ROOT → ⟨"what is the highest mountain in STATEID ?",
          answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(STATEID)))))⟩
  STATEID → ⟨"ohio", ohio⟩

Rules created by ABSWHOLEPHRASES
  ROOT → ⟨"what states border STATE ?", answer(NV, (state(V0), next_to(V0, NV), STATE))⟩
  STATE → ⟨"states border texas", state(V0), next_to(V0, NV), const(V0, stateid(texas))⟩
  ROOT → ⟨"what is the highest mountain in STATE ?",
          answer(NV, highest(V0, (mountain(V0), loc(V0, NV), STATE)))⟩

Rules created by CONCAT-2
  ROOT → ⟨SENT1 SENT2, SENT1 SENT2⟩
  SENT → ⟨"what states border texas ?",
          answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(texas))))⟩
  SENT → ⟨"what is the highest mountain in ohio ?",
          answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(ohio)))))⟩

Figure 3: Various grammar induction strategies illustrated on GEO. Each strategy converts the rules of an input grammar into rules of an output grammar. This figure shows the base case where the input grammar has rules ROOT → ⟨x, y⟩ for each (x, y) pair in the training dataset.

Our approach generalizes data augmentation, which is commonly employed to inject prior knowledge into a model. Data augmentation techniques focus on modeling invariances: transformations like translating an image or adding noise that alter the inputs x, but do not change the output y. These techniques have proven effective in areas like computer vision (Krizhevsky et al., 2012) and speech recognition (Jaitly and Hinton, 2013).

In semantic parsing, however, we would like to capture more than just invariance properties. Consider an example with the utterance "what states border texas ?". Given this example, it should be easy to generalize to questions where "texas" is replaced by the name of any other state: simply replace the mention of Texas in the logical form with the name of the new state. Underlying this phenomenon is a strong conditional independence principle: the meaning of the rest of the sentence is independent of the name of the state in question. Standard data augmentation is not sufficient to model such phenomena: instead of holding y fixed, we would like to apply simultaneous transformations to x and y such that the new x still maps to the new y. Data recombination addresses this need.

4.2 General Setting

In the general setting of data recombination, we start with a training set D of (x, y) pairs, which defines the empirical distribution p̂(x, y). We then fit a generative model p̃(x, y) to p̂ which generalizes beyond the support of p̂, for example by splicing together fragments of different examples. We refer to examples in the support of p̃ as recombinant examples. Finally, to train our actual model p_θ(y | x), we maximize the expected value of log p_θ(y | x), where (x, y) is drawn from p̃.

4.3 SCFGs for Semantic Parsing

For semantic parsing, we induce a synchronous context-free grammar (SCFG) to serve as the backbone of our generative model p̃. An SCFG consists of a set of production rules X → ⟨α, β⟩, where X is a category (non-terminal), and α and β are sequences of terminal and non-terminal symbols. Any non-terminal symbols in α must be aligned to the same non-terminal symbol in β, and vice versa. Therefore, an SCFG defines a set of joint derivations of aligned pairs of strings. In our case, we use an SCFG to represent joint derivations of utterances x and logical forms y (which for us is just a sequence of tokens).

After we induce an SCFG G from D, the corresponding generative model p̃(x, y) is the distribution over pairs (x, y) defined by sampling from G, where we choose production rules to apply uniformly at random.

It is instructive to compare our SCFG-based data recombination with WASP (Wong and Mooney, 2006; Wong and Mooney, 2007), which uses an SCFG as the actual semantic parsing model. The grammar induced by WASP must have good coverage in order to generalize to new inputs at test time. WASP also requires the implementation of an efficient algorithm for computing the conditional probability p(y | x). In contrast, our SCFG is only used to convey prior knowledge about conditional independence structure, so it only needs to have high precision; our RNN model is responsible for boosting recall over the entire input space. We also only need to forward sample from the SCFG, which is considerably easier to implement than conditional inference.

Below, we examine various strategies for inducing a grammar G from a dataset D. We first encode D as an initial grammar with rules ROOT → ⟨x, y⟩ for each (x, y) ∈ D. Next, we will define each grammar induction strategy as a mapping from an input grammar G_in to a new grammar G_out. This formulation allows us to compose grammar induction strategies (Section 4.3.4).

4.3.1 Abstracting Entities

Our first grammar induction strategy, ABSENTITIES, simply abstracts entities with their types. We assume that each entity e (e.g., texas) has a corresponding type e.t (e.g., state), which we infer based on the presence of certain predicates in the logical form (e.g., stateid). For each grammar rule X → ⟨α, β⟩ in G_in, where α contains a token (e.g., "texas") that string matches an entity (e.g., texas) in β, we add two rules to G_out: (i) a rule where both occurrences are replaced with the type of the entity (e.g., state), and (ii) a new rule that maps the type to the entity (e.g., STATEID → ⟨"texas", texas⟩; we reserve the category name STATE for the next section). Thus, G_out generates recombinant examples that fuse most of one example with an entity found in a second example. A concrete example from the GEO domain is given in Figure 3.

4.3.2 Abstracting Whole Phrases

Our second grammar induction strategy, ABSWHOLEPHRASES, abstracts both entities and whole phrases with their types. For each grammar rule X → ⟨α, β⟩ in G_in, we add up to two rules to G_out. First, if α contains tokens that string match to an entity in β, we replace both occurrences with the type of the entity, similarly to rule (i) from ABSENTITIES. Second, if we can infer that the entire expression β evaluates to a set of a particular type (e.g., state), we create a rule that maps the type to ⟨α, β⟩. In practice, we also use some simple rules to strip question identifiers from α, so that the resulting examples are more natural. Again, refer to Figure 3 for a concrete example.

This strategy works because of a more general conditional independence property: the meaning of any semantically coherent phrase is conditionally independent of the rest of the sentence, the cornerstone of compositional semantics. Note that this assumption is not always correct in general: for example, phenomena like anaphora that involve long-range context dependence violate this assumption. However, this property holds in most existing semantic parsing datasets.

4.3.3 Concatenation

The final grammar induction strategy is a surprisingly simple approach we tried that turns out to work. For any k ≥ 2, we define the CONCAT-k strategy, which creates two types of rules. First, we create a single rule that has ROOT going to a sequence of k SENT's. Then, for each root-level rule ROOT → ⟨α, β⟩ in G_in, we add the rule SENT → ⟨α, β⟩ to G_out. See Figure 3 for an example.

Unlike ABSENTITIES and ABSWHOLEPHRASES, concatenation is very general, and can be applied to any sequence transduction problem. Of course, it also does not introduce additional information about compositionality or independence properties present in semantic parsing. However, it does generate harder examples for the attention-based RNN, since the model must learn to attend to the correct parts of the now-longer input sequence. Related work has shown that training a model on more difficult examples can improve generalization, the most canonical case being dropout (Hinton et al., 2012; Wager et al., 2013).
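The following sketch is our own illustration of these ideas, not the paper's released code: the Rule class, the ENTITY_TYPES table, and the functions abs_entities and sample are hypothetical names. It represents SCFG rules as aligned (α, β) token sequences, applies an ABSENTITIES-style transformation (Section 4.3.1) given an entity-to-category table, and forward-samples recombinant (x, y) pairs by expanding non-terminals uniformly at random, with each non-terminal expanding the same way on both sides.

```python
import random
from collections import defaultdict

# An SCFG rule X -> <alpha, beta>, where non-terminals are UPPERCASE category
# tokens appearing (aligned by name) on both sides.
class Rule:
    def __init__(self, lhs, alpha, beta):
        self.lhs, self.alpha, self.beta = lhs, alpha, beta

ENTITY_TYPES = {"texas": "STATEID", "ohio": "STATEID"}  # entity -> category (illustrative)

def abs_entities(rules):
    """ABSENTITIES-style induction: abstract string-matched entities with their type."""
    out = list(rules)
    for r in rules:
        for ent, cat in ENTITY_TYPES.items():
            if ent in r.alpha and ("stateid ( %s" % ent) in " ".join(r.beta):
                alpha = [cat if t == ent else t for t in r.alpha]
                beta = [cat if t == ent else t for t in r.beta]
                out.append(Rule(r.lhs, alpha, beta))   # rule (i): abstract both occurrences
                out.append(Rule(cat, [ent], [ent]))    # rule (ii): category -> entity
    return out

def sample(rules, start="ROOT", rng=random):
    """Forward-sample an aligned (x, y) pair, choosing rules uniformly at random."""
    by_lhs = defaultdict(list)
    for r in rules:
        by_lhs[r.lhs].append(r)

    def expand(cat):
        rule = rng.choice(by_lhs[cat])
        # One shared sub-derivation per non-terminal, so alpha and beta stay aligned.
        subs = {t: expand(t) for t in set(rule.alpha) | set(rule.beta) if t in by_lhs}
        x = [tok for t in rule.alpha for tok in (subs[t][0] if t in subs else [t])]
        y = [tok for t in rule.beta for tok in (subs[t][1] if t in subs else [t])]
        return x, y

    return expand(start)

# Toy usage: the base grammar has one ROOT rule per training example (Figure 3).
base = [
    Rule("ROOT",
         "what states border texas ?".split(),
         "answer ( NV , ( state ( V0 ) , next_to ( V0 , NV ) , "
         "const ( V0 , stateid ( texas ) ) ) )".split()),
    Rule("ROOT",
         "what is the highest mountain in ohio ?".split(),
         "answer ( NV , highest ( V0 , ( mountain ( V0 ) , loc ( V0 , NV ) , "
         "const ( V0 , stateid ( ohio ) ) ) ) )".split()),
]
x, y = sample(abs_entities(base))
print(" ".join(x))  # e.g. "what states border ohio ?"
```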

4.3.4 Composition

We note that grammar induction strategies can be composed, yielding more complex grammars. Given any two grammar induction strategies f_1 and f_2, the composition f_1 ∘ f_2 is the grammar induction strategy that takes in G_in and returns f_1(f_2(G_in)). For the strategies we have defined, we can perform this operation symbolically on the grammar rules, without having to sample from the intermediate grammar f_2(G_in).

5 Experiments

We evaluate our system on three domains: GEO, ATIS, and OVERNIGHT. For ATIS, we report logical form exact match accuracy. For GEO and OVERNIGHT, we determine correctness based on denotation match, as in Liang et al. (2011) and Wang et al. (2015), respectively.

5.1 Choice of Grammar Induction Strategy

We note that not all grammar induction strategies make sense for all domains. In particular, we only apply ABSWHOLEPHRASES to GEO and OVERNIGHT. We do not apply ABSWHOLEPHRASES to ATIS, as the dataset has little nesting structure.

5.2 Implementation Details

We tokenize logical forms in a domain-specific manner, based on the syntax of the formal language being used. On GEO and ATIS, we disallow copying of predicate names to ensure a fair comparison to previous work, as string matching between input words and predicate names is not commonly used. We prevent copying by prepending underscores to predicate tokens; see Figure 2 for examples.

On ATIS alone, when doing attention-based copying and data recombination, we leverage an external lexicon that maps natural language phrases (e.g., "kennedy airport") to entities (e.g., jfk:ap). When we copy a word that is part of a phrase in the lexicon, we write the entity associated with that lexicon entry. When performing data recombination, we identify entity alignments based on matching phrases and entities from the lexicon.

We run all experiments with 200 hidden units and 100-dimensional word vectors. We initialize all parameters uniformly at random within the interval [-0.1, 0.1]. We maximize the log-likelihood of the correct logical form using stochastic gradient descent. We train the model for a total of 30 epochs with an initial learning rate of 0.1, and halve the learning rate every 5 epochs, starting after epoch 15. We replace word vectors for words that occur only once in the training set with a universal word vector. Our model is implemented in Theano (Bergstra et al., 2010).

When performing data recombination, we sample a new round of recombinant examples from our grammar at each epoch. We add these examples to the original training dataset, randomly shuffle all examples, and train the model for the epoch. Figure 4 gives pseudocode for this training procedure. One important hyperparameter is how many examples to sample at each epoch: we found that a good rule of thumb is to sample as many recombinant examples as there are examples in the training dataset, so that half of the examples the model sees at each epoch are recombinant.

At test time, we use beam search with beam size 5. We automatically balance missing right parentheses by adding them at the end. On GEO and OVERNIGHT, we then pick the highest-scoring logical form that does not yield an executor error when the corresponding denotation is computed. On ATIS, we just pick the top prediction on the beam.

Figure 4: The training procedure with data recombination. We first induce an SCFG, then sample new recombinant examples from it at each epoch.
  function TRAIN(dataset D, number of epochs T, number of examples to sample n)
    Induce grammar G from D
    Initialize RNN parameters θ randomly
    for each iteration t = 1, ..., T do
      Compute current learning rate η_t
      Initialize current dataset D_t to D
      for i = 1, ..., n do
        Sample new example (x', y') from G
        Add (x', y') to D_t
      end for
      Shuffle D_t
      for each example (x, y) in D_t do
        θ ← θ + η_t ∇ log p_θ(y | x)
      end for
    end for
  end function
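As a rough, framework-agnostic illustration of Figure 4 and the settings of Section 5.2, here is a Python sketch of the training loop (ours, not the released Theano code). The callables induce_grammar, sample_example, and sgd_step are assumed hooks standing in for grammar induction, SCFG forward sampling, and the gradient update θ ← θ + η_t ∇ log p_θ(y | x); the learning-rate expression is one reading of the stated schedule.

```python
import random

def train(dataset, model, induce_grammar, sample_example, sgd_step,
          num_epochs=30, lr0=0.1, num_samples=None):
    """Training with data recombination (a sketch of Figure 4).

    dataset        : list of (x, y) token-sequence pairs
    induce_grammar : callable mapping a dataset to an SCFG-like grammar object
    sample_example : callable mapping a grammar to one recombinant (x, y) pair
    sgd_step       : callable(model, x, y, lr) applying one stochastic gradient step
    """
    grammar = induce_grammar(dataset)
    # Rule of thumb from Section 5.2: sample as many recombinant examples as there
    # are original examples, so half of each epoch's data is recombinant.
    n = len(dataset) if num_samples is None else num_samples
    for t in range(num_epochs):
        # Initial rate 0.1, halved every 5 epochs starting after epoch 15.
        lr = lr0 * 0.5 ** max(0, (t - 10) // 5)
        epoch_data = list(dataset) + [sample_example(grammar) for _ in range(n)]
        random.shuffle(epoch_data)
        for x, y in epoch_data:
            sgd_step(model, x, y, lr)
    return model
```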

5.3 Impact of the Copying Mechanism

First, we measure the contribution of the attention-based copying mechanism to the model's overall performance. On each task, we train and evaluate two models: one with the copying mechanism, and one without. Training is done without data recombination. The results are shown in Table 1.

Table 1: Test accuracy on GEO, ATIS, and OVERNIGHT, both with and without copying. On OVERNIGHT, we average across all eight domains.
                 GEO   ATIS  OVERNIGHT
  No Copying     74.6  69.9  76.7
  With Copying   85.0  76.3  75.8

On GEO and ATIS, the copying mechanism helps significantly: it improves test accuracy by 10.4 percentage points on GEO and 6.4 points on ATIS. However, on OVERNIGHT, adding the copying mechanism actually makes our model perform slightly worse. This result is somewhat expected, as the OVERNIGHT dataset contains a very small number of distinct entities. It is also notable that both systems surpass the previous best system on OVERNIGHT by a wide margin.

We choose to use the copying mechanism in all subsequent experiments, as it has a large advantage in realistic settings where there are many distinct entities in the world. The concurrent work of Gu et al. (2016) and Gulcehre et al. (2016), both of whom propose similar copying mechanisms, provides additional evidence for the utility of copying on a wide range of NLP tasks.

5.4 Main Results

For our main results, we train our model with a variety of data recombination strategies on all three datasets. These results are summarized in Tables 2 and 3. We compare our system to the baseline of not using any data recombination, as well as to state-of-the-art systems on all three datasets.

Table 2: Test accuracy using different data recombination strategies on GEO and ATIS. AE is ABSENTITIES, AWP is ABSWHOLEPHRASES, C2 is CONCAT-2, and C3 is CONCAT-3.
                                    GEO   ATIS
  Previous Work
    Zettlemoyer and Collins (2007)        84.6
    Kwiatkowski et al. (2010)       88.9
    Liang et al. (2011) [2]         91.1
    Kwiatkowski et al. (2011)       88.6  82.8
    Poon (2013)                           83.5
    Zhao and Huang (2015)           88.9  84.2
  Our Model
    No Recombination                85.0  76.3
    ABSENTITIES                     85.4  79.9
    ABSWHOLEPHRASES                 87.5
    CONCAT-2                        84.6  79.0
    CONCAT-3                              77.5
    AWP + AE                        88.9
    AE + C2                               78.8
    AWP + AE + C2                   89.3
    AE + C3                               83.3

We find that data recombination consistently improves accuracy across the three domains we evaluated on, and that the strongest results come from composing multiple strategies. Combining ABSWHOLEPHRASES, ABSENTITIES, and CONCAT-2 yields a 4.3 percentage point improvement over the baseline without data recombination on GEO, and an average of 1.7 percentage points on OVERNIGHT. In fact, on GEO, we achieve test accuracy of 89.3%, which surpasses the previous state-of-the-art, excluding Liang et al. (2011), which used a seed lexicon for predicates. On ATIS, we experiment with concatenating more than 2 examples, to make up for the fact that we cannot apply ABSWHOLEPHRASES, which generates longer examples. We obtain a test accuracy of 83.3 with ABSENTITIES composed with CONCAT-3, which beats the baseline by 7 percentage points and is competitive with the state-of-the-art.

Data recombination without copying. For completeness, we also investigated the effects of data recombination on the model without attention-based copying. We found that recombination helped significantly on GEO and ATIS, but hurt the model slightly on OVERNIGHT. On GEO, the best data recombination strategy yielded test accuracy of 82.9%, for a gain of 8.3 percentage points over the baseline with no copying and no recombination; on ATIS, data recombination gives test accuracies as high as 74.6%, a 4.7 point gain over the same baseline. However, no data recombination strategy improved average test accuracy on OVERNIGHT; the best one resulted in a 0.3 percentage point decrease in test accuracy. We hypothesize that data recombination helps less on OVERNIGHT in general because the space of possible logical forms is very limited, making it more like a large multiclass classification task. Therefore, it is less important for the model to learn good compositional representations that generalize to new logical forms at test time.

[2] The method of Liang et al. (2011) is not comparable to ours, as they used a seed lexicon mapping words to predicates. We explicitly avoid using such prior knowledge in our system.

                        BASKETBALL  BLOCKS  CALENDAR  HOUSING  PUBLICATIONS  RECIPES  RESTAURANTS  SOCIAL  Avg.
  Previous Work
    Wang et al. (2015)      46.3     41.9     74.4      54.0       59.0        70.8       75.9      48.2   58.8
  Our Model
    No Recombination        85.2     58.1     78.0      71.4       76.4        79.6       76.2      81.4   75.8
    ABSENTITIES             86.7     60.2     78.0      65.6       73.9        77.3       79.5      81.3   75.3
    ABSWHOLEPHRASES         86.7     55.9     79.2      69.8       76.4        77.8       80.7      80.9   75.9
    CONCAT-2                84.7     60.7     75.6      69.8       74.5        80.1       79.5      80.8   75.7
    AWP + AE                85.2     54.1     78.6      67.2       73.9        79.6       81.9      82.1   75.3
    AWP + AE + C2           87.5     60.2     81.0      72.5       78.3        81.0       79.5      79.6   77.5

Table 3: Test accuracy using different data recombination strategies on the OVERNIGHT tasks.

5.5 Effect of Longer Examples

Interestingly, strategies like ABSWHOLEPHRASES and CONCAT-2 help the model even though the resulting recombinant examples are generally not in the support of the test distribution. In particular, these recombinant examples are on average longer than those in the actual dataset, which makes them harder for the attention-based model. Indeed, for every domain, our best accuracy numbers involved some form of concatenation, and often involved ABSWHOLEPHRASES as well. In comparison, applying ABSENTITIES alone, which generates examples of the same length as those in the original dataset, was generally less effective.

We conducted additional experiments on artificial data to investigate the importance of adding longer, harder examples. We experimented with adding new examples via data recombination, as well as adding new independent examples (e.g. to simulate the acquisition of more training data). We constructed a simple world containing a set of entities and a set of binary relations. For any n, we can generate a set of depth-n examples, which involve the composition of n relations applied to a single entity. Example data points are shown in Figure 5.

Figure 5: A sample of our artificial data.
  Depth-2 (same length)
    x: "rel:12 of rel:17 of ent:14"
    y: ( _rel:12 ( _rel:17 _ent:14 ) )
  Depth-4 (longer)
    x: "rel:23 of rel:36 of rel:38 of rel:10 of ent:05"
    y: ( _rel:23 ( _rel:36 ( _rel:38 ( _rel:10 _ent:05 ) ) ) )

We train our model on various datasets, then test it on a set of 500 randomly chosen depth-2 examples. The model always has access to a small seed training set of 100 depth-2 examples. We then add one of four types of examples to the training set:

• Same length, independent: New randomly chosen depth-2 examples.[3]
• Longer, independent: Randomly chosen depth-4 examples.
• Same length, recombinant: Depth-2 examples sampled from the grammar induced by applying ABSENTITIES to the seed dataset.
• Longer, recombinant: Depth-4 examples sampled from the grammar induced by applying ABSWHOLEPHRASES followed by ABSENTITIES to the seed dataset.

To maintain consistency between the independent and recombinant experiments, we fix the recombinant examples across all epochs, instead of resampling at every epoch. In Figure 6, we plot accuracy on the test set versus the number of additional examples added of each of these four types.

Figure 6: The results of our artificial data experiments. We see that the model learns more from longer examples than from same-length examples. (The plot shows test accuracy (%) against the number of additional examples, from 0 to 500, for the four conditions: same length vs. longer, independent vs. recombinant.)

[3] Technically, these are not completely independent, as we sample these new examples without replacement. The same applies to the longer "independent" examples.
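A minimal sketch (ours; the world size, naming scheme, and function name make_example are assumptions beyond what the paper specifies) of how such depth-n artificial examples can be generated: each example composes n randomly chosen relations applied to a single entity, producing paired utterances and logical forms like those in Figure 5.

```python
import random

def make_example(depth, num_rels=50, num_ents=20, rng=random):
    """Generate one depth-`depth` artificial example: `depth` relations
    composed and applied to a single entity (cf. Figure 5)."""
    rels = ["rel:%02d" % rng.randrange(num_rels) for _ in range(depth)]
    ent = "ent:%02d" % rng.randrange(num_ents)
    x = " of ".join(rels + [ent])          # e.g. "rel:12 of rel:17 of ent:14"
    y = ""
    for r in rels:
        y += "( _%s " % r
    y += "_%s" % ent + " )" * len(rels)    # e.g. "( _rel:12 ( _rel:17 _ent:14 ) )"
    return x, y

rng = random.Random(0)
seed_train = [make_example(2, rng=rng) for _ in range(100)]  # 100 depth-2 seed examples
test_set = [make_example(2, rng=rng) for _ in range(500)]    # 500 depth-2 test examples
longer = [make_example(4, rng=rng) for _ in range(500)]      # depth-4 "longer" examples
print(seed_train[0])
```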

As expected, independent examples are more helpful than the recombinant ones, but both help the model improve considerably. In addition, we see that even though the test dataset only has short examples, adding longer examples helps the model more than adding shorter ones, in both the independent and recombinant cases. These results underscore the importance of training on longer, harder examples.

6 Discussion

In this paper, we have presented a novel framework we term data recombination, in which we generate new training examples from a high-precision generative model induced from the original training dataset. We have demonstrated its effectiveness in improving the accuracy of a sequence-to-sequence RNN model on three semantic parsing datasets, using a synchronous context-free grammar as our generative model.

There has been growing interest in applying neural networks to semantic parsing and related tasks. Dong and Lapata (2016) concurrently developed an attention-based RNN model for semantic parsing, although they did not use data recombination. Grefenstette et al. (2014) proposed a non-recurrent neural model for semantic parsing, though they did not run experiments. Mei et al. (2016) use an RNN model to perform the related task of instruction following.

Our proposed attention-based copying mechanism bears a strong resemblance to two models that were developed independently by other groups. Gu et al. (2016) apply a very similar copying mechanism to text summarization and single-turn dialogue generation. Gulcehre et al. (2016) propose a model that decides at each step whether to write from a "shortlist" vocabulary or copy from the input, and report improvements on machine translation and text summarization. Another piece of related work is Luong et al. (2015b), who train a neural system to copy rare words, relying on an external system to generate alignments.

Prior work has explored using paraphrasing for data augmentation on NLP tasks. Zhang et al. (2015) augment their data by swapping out words for synonyms from WordNet. Wang and Yang (2015) use a similar strategy, but identify similar words and phrases based on cosine distance between vector space embeddings. Unlike our data recombination strategies, these techniques only change inputs x, while keeping the labels y fixed. Additionally, these paraphrasing-based transformations can be described in terms of grammar induction, so they can be incorporated into our framework.

In data recombination, data generated by a high-precision generative model is used to train a second, domain-general model. Generative oversampling (Liu et al., 2007) learns a generative model in a multiclass classification setting, then uses it to generate additional examples from rare classes in order to combat label imbalance. Uptraining (Petrov et al., 2010) uses data labeled by an accurate but slow model to train a computationally cheaper second model. Vinyals et al. (2015b) generate a large dataset of constituency parse trees by taking sentences that multiple existing systems parse in the same way, and train a neural model on this dataset.

Some of our induced grammars generate examples that are not in the test distribution, but nonetheless aid in generalization. Related work has also explored the idea of training on altered or out-of-domain data, often interpreting it as a form of regularization. Dropout training has been shown to be a form of adaptive regularization (Hinton et al., 2012; Wager et al., 2013). Guu et al. (2015) showed that encouraging a knowledge base completion model to handle longer path queries acts as a form of structural regularization.

Language is a blend of crisp regularities and soft relationships. Our work takes RNNs, which excel at modeling soft phenomena, and uses a highly structured tool, synchronous context-free grammars, to infuse them with an understanding of crisp structure. We believe this paradigm for simultaneously modeling the soft and hard aspects of language should have broader applicability beyond semantic parsing.

Acknowledgments

This work was supported by the NSF Graduate Research Fellowship under Grant No. DGE-114747, and the DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF-15-1-0462.

Reproducibility. All code, data, and experiments for this paper are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x50757a37779b485f89012e4ba03b6f4f/.

References

Y. Artzi and L. Zettlemoyer. 2013a. UW SPF: The University of Washington semantic parsing framework. arXiv preprint arXiv:1311.3011.

Y. Artzi and L. Zettlemoyer. 2013b. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics (TACL), 1:49–62.

D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).

J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Python for Scientific Computing Conference.

J. Clarke, D. Goldwasser, M. Chang, and D. Roth. 2010. Driving semantic parsing from the world's response. In Computational Natural Language Learning (CoNLL), pages 18–27.

L. Dong and M. Lapata. 2016. Language to logical form with neural attention. In Association for Computational Linguistics (ACL).

C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Association for Computational Linguistics (ACL).

E. Grefenstette, P. Blunsom, N. de Freitas, and K. M. Hermann. 2014. A deep architecture for semantic parsing. In ACL Workshop on Semantic Parsing, pages 22–27.

J. Gu, Z. Lu, H. Li, and V. O. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Association for Computational Linguistics (ACL).

C. Gulcehre, S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio. 2016. Pointing the unknown words. In Association for Computational Linguistics (ACL).

K. Guu, J. Miller, and P. Liang. 2015. Traversing knowledge graphs in vector space. In Empirical Methods in Natural Language Processing (EMNLP).

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

N. Jaitly and G. E. Hinton. 2013. Vocal tract length perturbation (VTLP) improves speech recognition. In International Conference on Machine Learning (ICML).

A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105.

N. Kushman and R. Barzilay. 2013. Using semantic unification to generate regular expressions from natural language. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL), pages 826–836.

T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and M. Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Empirical Methods in Natural Language Processing (EMNLP), pages 1223–1233.

T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and M. Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Empirical Methods in Natural Language Processing (EMNLP), pages 1512–1523.

P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599.

A. Liu, J. Ghosh, and C. Martin. 2007. Generative oversampling for mining imbalanced datasets. In International Conference on Data Mining (DMIN).

M. Luong, H. Pham, and C. D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421.

M. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Association for Computational Linguistics (ACL), pages 11–19.

H. Mei, M. Bansal, and M. R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Association for the Advancement of Artificial Intelligence (AAAI).

S. Petrov, P. Chang, M. Ringgaard, and H. Alshawi. 2010. Uptraining for accurate deterministic question parsing. In Empirical Methods in Natural Language Processing (EMNLP).

H. Poon. 2013. Grounded unsupervised semantic parsing. In Association for Computational Linguistics (ACL).

I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.

O. Vinyals, M. Fortunato, and N. Jaitly. 2015a. Pointer networks. In Advances in Neural Information Processing Systems (NIPS), pages 2674–2682.

O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. 2015b. Grammar as a foreign language. In Advances in Neural Information Processing Systems (NIPS), pages 2755–2763.

S. Wager, S. I. Wang, and P. Liang. 2013. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems (NIPS).

W. Y. Wang and D. Yang. 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Empirical Methods in Natural Language Processing (EMNLP).

Y. Wang, J. Berant, and P. Liang. 2015. Building a semantic parser overnight. In Association for Computational Linguistics (ACL).

Y. W. Wong and R. J. Mooney. 2006. Learning for semantic parsing with statistical machine translation. In North American Association for Computational Linguistics (NAACL), pages 439–446.

Y. W. Wong and R. J. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In Association for Computational Linguistics (ACL), pages 960–967.

M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI), pages 1050–1055.

L. S. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI), pages 658–666.

L. S. Zettlemoyer and M. Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL), pages 678–687.

X. Zhang, J. Zhao, and Y. LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (NIPS).

K. Zhao and L. Huang. 2015. Type-driven incremental semantic parsing with polymorphism. In North American Association for Computational Linguistics (NAACL).
