Data Recombination for Neural Semantic Parsing

Data Recombination for Neural Semantic Parsing Robin Jia Percy Liang Computer Science Department Computer Science Department Stanford University Stanford University [email protected] [email protected] Abstract Original Examples what are the major cities in utah ? Modeling crisp logical regularities is cru- what states border maine ? cial in semantic parsing, making it difficult Induce Grammar for neural models with no task-specific prior knowledge to achieve good results. Synchronous CFG In this paper, we introduce data recombination, a novel framework for inject- Sample New Examples ing such prior knowledge into a model. From the training data, we induce a high- Recombinant Examples precision synchronous context-free gram- what are the major cities in [states border [maine]] ? mar, which captures important conditional what are the major cities in [states border [utah]] ? independence properties commonly found what states border [states border [maine]] ? in semantic parsing. We then train a what states border [states border [utah]] ? sequence-to-sequence recurrent network Train Model (RNN) model with a novel attention-based copying mechanism on datapoints sam- Sequence-to-sequence RNN pled from this grammar, thereby teaching the model about these structural properties. Data recombination improves the ac- Figure 1: An overview of our system. Given a curacy of our RNN model on three se- dataset, we induce a high-precision synchronous mantic parsing datasets, leading to new context-free grammar. We then sample from this state-of-the-art performance on the stan- grammar to generate new “recombinant” exam- dard GeoQuery dataset for models with ples, which we use to train a sequence-to-sequence comparable supervision. RNN. 1 Introduction have made swift inroads into many structured pre- Semantic parsing—the precise translation of nat- diction tasks in NLP, including machine trans- ural language utterances into logical forms—has lation (Sutskever et al., 2014; Bahdanau et al., many applications, including question answer- 2014) and syntactic parsing (Vinyals et al., 2015b; ing (Zelle and Mooney, 1996; Zettlemoyer and Dyer et al., 2015). Because RNNs make very few Collins, 2005; Zettlemoyer and Collins, 2007; domain-specific assumptions, they have the poten- Liang et al., 2011; Berant et al., 2013), instruc- tial to succeed at a wide variety of tasks with min- tion following (Artzi and Zettlemoyer, 2013b), imal feature engineering. However, this flexibil- and regular expression generation (Kushman and ity also puts RNNs at a disadvantage compared Barzilay, 2013). Modern semantic parsers (Artzi to standard semantic parsers, which can generalize and Zettlemoyer, 2013a; Berant et al., 2013) naturally by leveraging their built-in awareness of are complex pieces of software, requiring hand- logical compositionality. crafted features, lexicons, and grammars. In this paper, we introduce data recombina- Meanwhile, recurrent neural networks (RNNs) tion, a generic framework for declaratively inject- 12 Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 12–22, Berlin, Germany, August 7-12, 2016. c 2016 Association for Computational Linguistics (2015b) showed that an RNN can reliably predict GEO x:“what is the population of iowa ?” tree-structured outputs in a linear fashion. y: _answer ( NV , ( We evaluate our system on three existing se- _population ( NV , V1 ) , _const ( V0 , _stateid ( iowa ) ) ) ) mantic parsing datasets. Figure 2 shows sample ATIS input-output pairs from each of these datasets. x:“can you list all flights from chicago to milwaukee” GeoQuery (GEO) contains natural language y: ( _lambda $0 e ( _and • ( _flight $0 ) questions about US geography paired with ( _from $0 chicago : _ci ) corresponding Prolog database queries. We ( _to $0 milwaukee : _ci ) ) ) use the standard split of 600 training exam- Overnight x:“when is the weekly standup” ples and 280 test examples introduced by y: ( call listValue ( call Zettlemoyer and Collins (2005). We prepro- getProperty meeting.weekly_standup cess the logical forms to De Brujin index no- ( string start_time ) ) ) tation to standardize variable naming. ATIS (ATIS) contains natural language • Figure 2: One example from each of our domains. queries for a flights database paired with We tokenize logical forms as shown, thereby cast- corresponding database queries written in ing semantic parsing as a sequence-to-sequence lambda calculus. We train on 4473 examples task. and evaluate on the 448 test examples used by Zettlemoyer and Collins (2007). ing prior knowledge into a domain-general struc- Overnight (OVERNIGHT) contains logical • tured prediction model. In data recombination, forms paired with natural language para- prior knowledge about a task is used to build a phrases across eight varied subdomains. high-precision generative model that expands the Wang et al. (2015) constructed the dataset by empirical distribution by allowing fragments of generating all possible logical forms up to different examples to be combined in particular some depth threshold, then getting multiple ways. Samples from this generative model are natural language paraphrases for each logi- then used to train a domain-general model. In the cal form from workers on Amazon Mechan- case of semantic parsing, we construct a genera- ical Turk. We evaluate on the same train/test tive model by inducing a synchronous context-free splits as Wang et al. (2015). grammar (SCFG), creating new examples such as those shown in Figure 1; our domain-general In this paper, we only explore learning from log- model is a sequence-to-sequence RNN with a ical forms. In the last few years, there has an novel attention-based copying mechanism. Data emergence of semantic parsers learned from de- recombination boosts the accuracy of our RNN notations (Clarke et al., 2010; Liang et al., 2011; model on three semantic parsing datasets. On the Berant et al., 2013; Artzi and Zettlemoyer, 2013b). GEO dataset, data recombination improves test ac- While our system cannot directly learn from deno- curacy by 4.3 percentage points over our baseline tations, it could be used to rerank candidate deriva- RNN, leading to new state-of-the-art results for tions generated by one of these other systems. models that do not use a seed lexicon for predi- 3 Sequence-to-sequence RNN Model cates. Our sequence-to-sequence RNN model is based 2 Problem statement on existing attention-based neural machine translation models (Bahdanau et al., 2014; Luong et al., We cast semantic parsing as a sequence-to- 2015a), but also includes a novel attention-based sequence task. The input utterance x is a sequence copying mechanism. Similar copying mechanisms (in) of words x1, . , xm , the input vocabulary; have been explored in parallel by Gu et al. (2016) ∈ V similarly, the output logical form y is a sequence and Gulcehre et al. (2016). of tokens y , . , y (out), the output vocab- 1 n ∈ V ulary. A linear sequence of tokens might appear 3.1 Basic Model to lose the hierarchical structure of a logical form, Encoder. The encoder converts the input se- but there is precedent for this choice: Vinyals et al. quence x1, . , xm into a sequence of context- 13 sensitive embeddings b1, . , bm using a bidirec- model has difficulty generalizing to the long tail of tional RNN (Bahdanau et al., 2014). First, a word entity names commonly found in semantic parsing (in) embedding function φ maps each word xi to a datasets. Conveniently, entity names in the input fixed-dimensional vector. These vectors are fed as often correspond directly to tokens in the output input to two RNNs: a forward RNN and a back- (e.g., “iowa” becomes iowa in Figure 2).1 ward RNN. The forward RNN starts with an initial To capture this intuition, we introduce a new F hidden state h0, and generates a sequence of hid- attention-based copying mechanism. At each time F F den states h1, . , hm by repeatedly applying the step j, the decoder generates one of two types of recurrence actions. As before, it can write any word in the output vocabulary. In addition, it can copy any in- F (in) F hi = LSTM(φ (xi), hi 1). (1) put word xi directly to the output, where the prob- − ability with which we copy xi is determined by The recurrence takes the form of an LSTM the attention score on xi. Formally, we define a (Hochreiter and Schmidhuber, 1997). The back- latent action aj that is either Write[w] for some ward RNN similarly generates hidden states w (out) or Copy[i] for some i 1, . , m . B B ∈ V ∈ { } hm, . , h1 by processing the input sequence in We then have reverse order. Finally, for each input position i, we define the context-sensitive embedding bi to be P (aj = Write[w] x, y1:j 1) exp(Uw[sj, cj]), F B | − ∝ the concatenation of hi and hi (8) Decoder. The decoder is an attention-based P (aj = Copy[i] x, y1:j 1) exp(eji). (9) | − ∝ model (Bahdanau et al., 2014; Luong et al., 2015a) The decoder chooses aj with a softmax over all that generates the output sequence y1, . , yn one token at a time. At each time step j, it writes these possible actions; yj is then a deterministic function of aj and x. During training, we maxi- yj based on the current hidden state sj, then up- mize the log-likelihood of y, marginalizing out a. dates the hidden state to sj+1 based on sj and yj. Formally, the decoder is defined by the following Attention-based copying can be seen as a com- equations: bination of a standard softmax output layer of an attention-based model (Bahdanau et al., 2014) and (s) F B a Pointer Network (Vinyals et al., 2015a); in a s1 = tanh(W [hm, h1 ]). (2) Pointer Network, the only way to generate output e = s W (a)b . (3) ji j> i is to copy a symbol from the input.

Load more