Semantic Parsing with Semi-Supervised Sequential Autoencoders
Tomáš Kočiský†‡  Gábor Melis†  Edward Grefenstette†  Chris Dyer†  Wang Ling†  Phil Blunsom†‡  Karl Moritz Hermann†
†Google DeepMind  ‡University of Oxford
{tkocisky,melisgl,etg,cdyer,lingwang,pblunsom,kmh}@google.com

Abstract

We present a novel semi-supervised approach for sequence transduction and apply it to semantic parsing. The unsupervised component is based on a generative model in which latent sentences generate the unpaired logical forms. We apply this method to a number of semantic parsing tasks focusing on domains with limited access to labelled training data and extend those datasets with synthetically generated logical forms.

1 Introduction

Neural approaches, in particular attention-based sequence-to-sequence models, have shown great promise and obtained state-of-the-art performance for sequence transduction tasks including machine translation (Bahdanau et al., 2015), syntactic constituency parsing (Vinyals et al., 2015), and semantic role labelling (Zhou and Xu, 2015). A key requirement for effectively training such models is an abundance of supervised data.

In this paper we focus on learning mappings from input sequences x to output sequences y in domains where the latter are easily obtained, but annotation in the form of (x, y) pairs is sparse or expensive to produce, and propose a novel architecture that accommodates semi-supervised training on sequence transduction tasks. To this end, we augment the transduction objective (x ↦ y) with an autoencoding objective where the input sequence is treated as a latent variable (y ↦ x ↦ y), enabling training from both labelled pairs and unpaired output sequences. This is common in situations where we encode natural language into a logical form governed by some grammar or database.

While such an autoencoder could in principle be constructed by stacking two sequence transducers, modelling the latent variable as a series of discrete symbols drawn from multinomial distributions creates serious computational challenges, as it requires marginalising over the space of latent sequences Σ_x^*. To avoid this intractable marginalisation, we introduce a novel differentiable alternative for draws from a softmax which can be used with the reparametrisation trick of Kingma and Welling (2014). Rather than drawing a discrete symbol in Σ_x from a softmax, we draw a distribution over symbols from a logistic-normal distribution at each time step. These serve as continuous relaxations of discrete samples, providing a differentiable estimator of the expected reconstruction log likelihood.

We demonstrate the effectiveness of our proposed model on three semantic parsing tasks: the GEOQUERY benchmark (Zelle and Mooney, 1996; Wong and Mooney, 2006), the SAIL maze navigation task (MacMahon et al., 2006) and the Natural Language Querying corpus (Haas and Riezler, 2016) on OpenStreetMap. As part of our evaluation, we introduce simple mechanisms for generating large amounts of unsupervised training data for two of these tasks.

In most settings, the semi-supervised model outperforms the supervised model, both when trained on additional generated data as well as on subsets of the existing data.
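To make the relaxation described above concrete, the short sketch below draws a single relaxed "symbol" from a logistic-normal distribution and backpropagates through the draw. It assumes a PyTorch implementation purely for illustration (the paper does not prescribe a framework), and every tensor name and size is a placeholder rather than part of the original model.

# Minimal sketch, assuming PyTorch: instead of sampling a discrete symbol from
# a softmax, sample a point on the probability simplex from a logistic-normal
# distribution via the reparametrisation trick, so gradients reach the
# distribution parameters mu and log_sigma_sq.
import torch

vocab_size = 5                      # |Sigma_x|, toy value for illustration
mu = torch.zeros(vocab_size, requires_grad=True)
log_sigma_sq = torch.zeros(vocab_size, requires_grad=True)

eps = torch.randn(vocab_size)                        # eps ~ N(0, I)
gamma = mu + torch.exp(0.5 * log_sigma_sq) * eps     # gamma = mu + sigma * eps
x_tilde = torch.softmax(gamma, dim=-1)               # a distribution over Sigma_x

# Any differentiable loss computed from x_tilde now yields gradients with
# respect to mu and log_sigma_sq, unlike a discrete sample would.
loss = -torch.log(x_tilde[2])       # pretend symbol 2 should be likely (toy loss)
loss.backward()
print(mu.grad, log_sigma_sq.grad)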
Table 1: Examples of natural language x and logical form y from the three corpora and tasks used in this paper. Note that the SAIL corpus requires additional information in order to map from the instruction to the action sequence.

Dataset  Example
GEO      x: what are the high points of states surrounding mississippi
         y: answer(high_point_1(state(next_to_2(stateid('mississippi')))))
NLMAPS   x: Where are kindergartens in Hamburg?
         y: query(area(keyval('name','Hamburg')),nwr(keyval('amenity','kindergarten')),qtype(latlong))
SAIL     x: turn right at the bench into the yellow tiled hall (1, 6, 90)
         y: FORWARD - FORWARD - RIGHT - STOP (3, 6, 180)

Figure 1: SEQ4 model with attention-sequence-to-sequence encoder and decoder. Circle nodes represent random variables.

2 Model

Our sequential autoencoder is shown in Figure 1. At a high level, it can be seen as two sequence-to-sequence models with attention (Bahdanau et al., 2015) chained together. More precisely, the model consists of four LSTMs (Hochreiter and Schmidhuber, 1997), hence the name SEQ4. The first, a bidirectional LSTM, encodes the sequence y; next, an LSTM with stochastic output, described below, draws a sequence of distributions x̃ over words in vocabulary Σ_x. The third LSTM encodes these distributions for the last one to attend over and reconstruct y as ŷ. We now give the details of these parts.

2.1 Encoding y

The first LSTM of the encoder half of the model reads the sequence y, represented as a sequence of one-hot vectors over the vocabulary Σ_y, using a bidirectional RNN into a sequence of vectors h^y_{1:L_y}, where L_y is the sequence length of y,

h^y_t = [f^→_y(y_t, h^{y,→}_{t−1}); f^←_y(y_t, h^{y,←}_{t+1})]    (1)

where f^→_y, f^←_y are non-linear functions applied at each time step to the current token y_t and their recurrent states h^{y,→}_{t−1}, h^{y,←}_{t+1}, respectively.

Both the forward and backward functions project the one-hot vector into a dense vector via an embedding matrix, which serves as input to an LSTM.
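As a concrete reading of Equation 1, the following is a minimal sketch, again assuming PyTorch and toy sizes; the embedding, the bidirectional LSTM, and all variable names are illustrative stand-ins, not the authors' implementation.

# Sketch of the bidirectional encoder of Section 2.1 (Equation 1): tokens of y
# are embedded and read by a bidirectional LSTM, and the forward and backward
# states at each position are concatenated into h^y_t.
import torch
import torch.nn as nn

vocab_y, embed, hidden = 30, 32, 64        # |Sigma_y|, embedding and state sizes (toy)
L_y = 6                                    # sequence length of y

embedding = nn.Embedding(vocab_y, embed)   # the embedding matrix applied to one-hot y_t
bilstm = nn.LSTM(embed, hidden, bidirectional=True, batch_first=True)

y = torch.randint(0, vocab_y, (1, L_y))    # a toy input sequence y (token ids)
h_y, _ = bilstm(embedding(y))              # h_y has shape (1, L_y, 2 * hidden)
# h_y[:, t, :] plays the role of h^y_t = [f_y^->(...); f_y^<-(...)] in Equation 1.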
2.2 Predicting a Latent Sequence x̃

Subsequently, we wish to predict x. Predicting a discrete sequence of symbols through draws from multinomial distributions over a vocabulary is not an option, as we would not be able to backpropagate through this discrete choice. Marginalising over the possible latent strings or estimating the gradient through naïve Monte Carlo methods would be a prohibitively high-variance process, because the number of strings is exponential in the maximum length (which we would have to specify manually), with the vocabulary size as base. To allow backpropagation, we instead predict a sequence of distributions x̃ over the symbols of Σ_x with an RNN attending over h^y = h^y_{1:L_y}, which will later serve to reconstruct y:

x̃ = q(x | y) = ∏_{t=1}^{L_x} q(x̃_t | {x̃_1, …, x̃_{t−1}}, h^y)    (2)

where q(x | y) models the mapping y ↦ x. We define q(x̃_t | {x̃_1, …, x̃_{t−1}}, h^y) in the following way: let the vector x̃_t be a distribution over the vocabulary Σ_x drawn from a logistic-normal distribution, the parameters of which, μ_t, log(σ²)_t ∈ R^{|Σ_x|}, are predicted by an LSTM attending over the outputs of the encoder (Equation 2), where |Σ_x| is the size of the vocabulary Σ_x. The use of a logistic-normal distribution serves to regularise the model in the semi-supervised learning regime, which is described at the end of this section. Formally, this process, depicted in Figure 2, is as follows:

h^{x̃}_t = f_x̃(x̃_{t−1}, h^{x̃}_{t−1}, h^y)    (3)
μ_t, log(σ²_t) = l(h^{x̃}_t)    (4)
ε ∼ N(0, I)    (5)
γ_t = μ_t + σ_t ⊙ ε    (6)
x̃_t = softmax(γ_t)    (7)

where the function f_x̃ is an LSTM and l a linear transformation to R^{2|Σ_x|}. We use the reparametrisation trick from Kingma and Welling (2014) to draw from the logistic normal, allowing us to backpropagate through the sampling step.

Figure 2: Unsupervised case of the SEQ4 model.

2.3 Encoding x̃

Moving on to the decoder part of our model, in the third LSTM we embed and encode x̃:

h^x_t = [f^→_x(x̃_t, h^{x,→}_{t−1}); f^←_x(x̃_t, h^{x,←}_{t+1})]    (8)

When x is observed, during supervised training and also when making predictions, instead of the distribution x̃ we feed the one-hot encoded x to this part of the model.

2.4 Reconstructing y

In the final LSTM, we decode into y:

p(ŷ | x̃) = ∏_{t=1}^{L_y} p(ŷ_t | {ŷ_1, …, ŷ_{t−1}}, h^x)    (9)

Equation 9 is implemented as an LSTM attending over h^x, producing a sequence of symbols ŷ based on recurrent states h^ŷ, aiming to reproduce the input y:

h^ŷ_t = f_ŷ(ŷ_{t−1}, h^ŷ_{t−1}, h^x)    (10)
ŷ_t ∼ softmax(l′(h^ŷ_t))    (11)

where f_ŷ is the non-linear function, and the actual probabilities are given by a softmax function after a linear transformation l′ of h^ŷ. At training time, rather than ŷ_{t−1}, we feed the ground-truth y_{t−1}.

2.5 Loss function

The complete model described in this section gives a reconstruction function y ↦ ŷ. We define a loss on this reconstruction which accommodates the unsupervised case, where x is not observed in the training data, and the supervised case, where (x, y) pairs are available. Together, these allow us to train the SEQ4 model in a semi-supervised setting, which experiments will show provides some benefits over a purely supervised training regime.

Unsupervised case: When x is not observed, the loss we minimise during training is the reconstruction loss on y, expressed as the negative log-likelihood NLL(ŷ, y) of the true labels y relative to the predictions ŷ.
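A minimal rendering of NLL(ŷ, y) under the same PyTorch assumption is sketched below; in the full model the per-step scores would come from the decoder of Section 2.4 (Equation 11) rather than from a random tensor, and all names and sizes here are illustrative.

# Sketch of the unsupervised reconstruction loss NLL(y_hat, y): the decoder's
# per-step scores over Sigma_y are compared against the true sequence y with a
# negative log-likelihood summed over time steps.
import torch
import torch.nn.functional as F

vocab_y, L_y = 30, 6                           # |Sigma_y| and sequence length (toy values)
decoder_logits = torch.randn(L_y, vocab_y, requires_grad=True)   # stand-in for l'(h^y_hat_t)
y_true = torch.randint(0, vocab_y, (L_y,))     # the observed output sequence y

# NLL(y_hat, y): negative log-likelihood of the true labels under the
# predicted distributions, summed over the L_y time steps.
reconstruction_loss = F.cross_entropy(decoder_logits, y_true, reduction="sum")
reconstruction_loss.backward()                 # gradients would reach the whole SEQ4 chain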