Semantic Parsing with Semi-Supervised Sequential Autoencoders

Tomáš Kočiský†‡  Gábor Melis†  Edward Grefenstette†  Chris Dyer†  Wang Ling†  Phil Blunsom†‡  Karl Moritz Hermann†
†Google DeepMind   ‡University of Oxford
{tkocisky,melisgl,etg,cdyer,lingwang,pblunsom,kmh}@google.com

Abstract

We present a novel semi-supervised approach for sequence transduction and apply it to semantic parsing. The unsupervised component is based on a generative model in which latent sentences generate the unpaired logical forms. We apply this method to a number of semantic parsing tasks focusing on domains with limited access to labelled training data and extend those datasets with synthetically generated logical forms.

1 Introduction

Neural approaches, in particular attention-based sequence-to-sequence models, have shown great promise and obtained state-of-the-art performance for sequence transduction tasks including machine translation (Bahdanau et al., 2015), syntactic constituency parsing (Vinyals et al., 2015), and semantic role labelling (Zhou and Xu, 2015). A key requirement for effectively training such models is an abundance of supervised data.

In this paper we focus on learning mappings from input sequences x to output sequences y in domains where the latter are easily obtained, but annotation in the form of (x, y) pairs is sparse or expensive to produce, and propose a novel architecture that accommodates semi-supervised training on sequence transduction tasks. To this end, we augment the transduction objective (x ↦ y) with an autoencoding objective where the input sequence is treated as a latent variable (y ↦ x ↦ y), enabling training from both labelled pairs and unpaired output sequences. This is common in situations where we encode natural language into a logical form governed by some grammar or database.

While such an autoencoder could in principle be constructed by stacking two sequence transducers, modelling the latent variable as a series of discrete symbols drawn from multinomial distributions creates serious computational challenges, as it requires marginalising over the space of latent sequences Σ*_x. To avoid this intractable marginalisation, we introduce a novel differentiable alternative for draws from a softmax which can be used with the reparametrisation trick of Kingma and Welling (2014). Rather than drawing a discrete symbol in Σ_x from a softmax, we draw a distribution over symbols from a logistic-normal distribution at each time step. These serve as continuous relaxations of discrete samples, providing a differentiable estimator of the expected reconstruction log likelihood.

We demonstrate the effectiveness of our proposed model on three semantic parsing tasks: the GEOQUERY benchmark (Zelle and Mooney, 1996; Wong and Mooney, 2006), the SAIL maze navigation task (MacMahon et al., 2006) and the Natural Language Querying corpus (Haas and Riezler, 2016) on OpenStreetMap. As part of our evaluation, we introduce simple mechanisms for generating large amounts of unsupervised training data for two of these tasks.

In most settings, the semi-supervised model outperforms the supervised model, both when trained on additional generated data as well as on subsets of the existing data.


Dataset  |  Example
GEO      |  x: what are the high points of states surrounding mississippi
         |  y: answer(high_point_1(state(next_to_2(stateid('mississippi')))))
NLMAPS   |  x: Where are kindergartens in Hamburg?
         |  y: query(area(keyval('name','Hamburg')),nwr(keyval('amenity','kindergarten')),qtype(latlong))
SAIL     |  x: turn right at the bench into the yellow tiled hall
         |  y: (1, 6, 90) FORWARD - FORWARD - RIGHT - STOP (3, 6, 180)

Table 1: Examples of natural language x and logical form y from the three corpora and tasks used in this paper. Note that the SAIL corpus requires additional information in order to map from the instruction to the action sequence.

Figure 1: SEQ4 model with attention-sequence-to-sequence encoder and decoder. Circle nodes represent random variables.

2 Model

Our sequential autoencoder is shown in Figure 1. At a high level, it can be seen as two sequence-to-sequence models with attention (Bahdanau et al., 2015) chained together. More precisely, the model consists of four LSTMs (Hochreiter and Schmidhuber, 1997), hence the name SEQ4. The first, a bidirectional LSTM, encodes the sequence y; next, an LSTM with stochastic output, described below, draws a sequence of distributions x̃ over words in vocabulary Σ_x. The third LSTM encodes these distributions for the last one to attend over and reconstruct y as ŷ. We now give the details of these parts.

2.1 Encoding y

The first LSTM of the encoder half of the model reads the sequence y, represented as a sequence of one-hot vectors over the vocabulary Σ_y, using a bidirectional RNN into a sequence of vectors h^y_{1:L_y} where L_y is the sequence length of y,

$h^y_t = \big(f^{\rightarrow}_y(y_t, h^{y,\rightarrow}_{t-1});\ f^{\leftarrow}_y(y_t, h^{y,\leftarrow}_{t+1})\big)$   (1)

where f_y^→, f_y^← are non-linear functions applied at each time step to the current token y_t and their recurrent states h^{y,→}_{t−1}, h^{y,←}_{t+1}, respectively.

Both the forward and backward functions project the one-hot vector into a dense vector via an embedding matrix, which serves as input to an LSTM.

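To make the encoding step concrete, the sketch below runs a toy bidirectional encoder over a one-hot sequence and concatenates forward and backward states, as in Equation 1. It is only an illustration: a plain tanh RNN step stands in for the LSTM, the forward and backward passes share weights purely for brevity, and all names and dimensions are made up for the example.

```python
import numpy as np

def rnn_step(x, h, W_x, W_h):
    # One recurrent step; a plain tanh RNN stands in for the LSTM of Equation 1.
    return np.tanh(W_x @ x + W_h @ h)

def encode_y(one_hot_y, E, W_x, W_h):
    """Bidirectional encoding of y: concatenate forward and backward states."""
    L, d = len(one_hot_y), W_h.shape[0]
    fwd, bwd = [np.zeros(d)], [np.zeros(d)]
    embedded = [E @ y_t for y_t in one_hot_y]           # embed one-hot tokens
    for t in range(L):                                  # left-to-right pass
        fwd.append(rnn_step(embedded[t], fwd[-1], W_x, W_h))
    for t in reversed(range(L)):                        # right-to-left pass
        bwd.append(rnn_step(embedded[t], bwd[-1], W_x, W_h))
    bwd = bwd[:0:-1]                                    # align backward states with positions 0..L-1
    return [np.concatenate([f, b]) for f, b in zip(fwd[1:], bwd)]

# toy usage: vocabulary of 5 symbols, a 3-token sequence, hidden size 4
rng = np.random.default_rng(0)
V, d, emb = 5, 4, 6
E, W_x, W_h = rng.normal(size=(emb, V)), rng.normal(size=(d, emb)), rng.normal(size=(d, d))
y = [np.eye(V)[i] for i in (0, 3, 2)]
h_y = encode_y(y, E, W_x, W_h)
print(len(h_y), h_y[0].shape)   # 3 states, each of size 2*d
```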
2.2 Predicting a Latent Sequence x̃

Subsequently, we wish to predict x. Predicting a discrete sequence of symbols through draws from multinomial distributions over a vocabulary is not an option, as we would not be able to backpropagate through this discrete choice. Marginalising over the possible latent strings or estimating the gradient through naïve Monte Carlo methods would be a prohibitively high variance process because the number of strings is exponential in the maximum length (which we would have to manually specify) with the vocabulary size as base. To allow backpropagation, we instead predict a sequence of distributions x̃ over the symbols of Σ_x with an RNN attending over h^y = h^y_{1:L_y}, which will later serve to reconstruct y:

$\tilde{x} = q(x \mid y) = \prod_{t=1}^{L_x} q(\tilde{x}_t \mid \{\tilde{x}_1, \cdots, \tilde{x}_{t-1}\}, h^y)$   (2)

where q(x|y) models the mapping y ↦ x. We define q(x̃_t | {x̃_1, ..., x̃_{t−1}}, h^y) in the following way: let the vector x̃_t be a distribution over the vocabulary Σ_x drawn from a logistic-normal distribution¹, the parameters of which, μ_t, log(σ²)_t ∈ ℝ^{|Σ_x|}, are predicted by an LSTM attending over the outputs of the encoder (Equation 2), where |Σ_x| is the size of the vocabulary Σ_x. The use of a logistic-normal distribution serves to regularise the model in the semi-supervised learning regime, which is described at the end of this section. Formally, this process, depicted in Figure 2, is as follows:

$h^{\tilde{x}}_t = f_{\tilde{x}}(\tilde{x}_{t-1}, h^{\tilde{x}}_{t-1}, h^y)$   (3)
$\mu_t, \log(\sigma^2_t) = l(h^{\tilde{x}}_t)$   (4)
$\epsilon \sim \mathcal{N}(0, I)$   (5)
$\gamma_t = \mu_t + \sigma_t \epsilon$   (6)
$\tilde{x}_t = \mathrm{softmax}(\gamma_t)$   (7)

where the f_x̃ function is an LSTM and l a linear transformation to ℝ^{2|Σ_x|}. We use the reparametrisation trick from Kingma and Welling (2014) to draw from the logistic normal, allowing us to backpropagate through the sampling process.

¹The logistic-normal distribution is the exponentiated and normalised (i.e. taking softmax) normal distribution.

Figure 2: Unsupervised case of the SEQ4 model.

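The reparametrised draw of Equations 5–7 is small enough to sketch directly. The snippet below assumes μ_t and log(σ²)_t have already been produced by the linear layer l of Equation 4; because the noise ε is sampled independently of the parameters, gradients can flow through μ_t and σ_t. Names are illustrative only.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def draw_x_tilde(mu, log_sigma_sq, rng):
    """Reparametrised draw from the logistic-normal (Equations 5-7):
    sample eps ~ N(0, I), shift and scale it, then softmax the result."""
    eps = rng.standard_normal(mu.shape)              # Equation 5
    gamma = mu + np.exp(0.5 * log_sigma_sq) * eps    # Equation 6, with sigma = exp(0.5 * log sigma^2)
    return softmax(gamma)                            # Equation 7: a distribution over Sigma_x

# toy usage: a vocabulary of 6 latent symbols; mu and log(sigma^2) would come
# from the linear layer l applied to the LSTM state (Equations 3-4)
rng = np.random.default_rng(1)
mu = rng.normal(size=6)
log_sigma_sq = rng.normal(size=6)
x_tilde_t = draw_x_tilde(mu, log_sigma_sq, rng)
print(x_tilde_t, x_tilde_t.sum())  # non-negative entries summing to 1
```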
2.3 Encoding x̃

Moving on to the decoder part of our model, in the third LSTM, we embed² and encode x̃:

$h^x_t = \big(f^{\rightarrow}_x(\tilde{x}_t, h^{x,\rightarrow}_{t-1});\ f^{\leftarrow}_x(\tilde{x}_t, h^{x,\leftarrow}_{t+1})\big)$   (8)

When x is observed, during supervised training and also when making predictions, instead of the distribution x̃ we feed the one-hot encoded x to this part of the model.

²Multiplying the distribution over words and an embedding matrix averages the word embedding of the entire vocabulary weighted by their probabilities.

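Footnote 2 describes what encoding a distribution, rather than a one-hot vector, amounts to: multiplying x̃_t by the embedding matrix yields the probability-weighted average of all symbol embeddings, and a one-hot x_t reduces to an ordinary lookup. A minimal illustration, with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
V, emb = 6, 4
E = rng.normal(size=(V, emb))          # embedding matrix, one row per symbol in Sigma_x

x_tilde_t = np.array([0.05, 0.6, 0.1, 0.1, 0.1, 0.05])   # a latent distribution over symbols
one_hot_x_t = np.eye(V)[1]                               # an observed symbol (supervised case)

expected_embedding = x_tilde_t @ E     # probability-weighted average of all embeddings
single_embedding = one_hot_x_t @ E     # picks out exactly one row of E

print(np.allclose(single_embedding, E[1]))   # True: one-hot input reduces to a lookup
```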
2.4 Reconstructing y

In the final LSTM, we decode into y:

$p(\hat{y} \mid \tilde{x}) = \prod_{t=1}^{L_y} p(\hat{y}_t \mid \{\hat{y}_1, \cdots, \hat{y}_{t-1}\}, h^{\tilde{x}})$   (9)

Equation 9 is implemented as an LSTM attending over h^x̃ producing a sequence of symbols ŷ based on recurrent states h^ŷ, aiming to reproduce input y:

$h^{\hat{y}}_t = f_{\hat{y}}(\hat{y}_{t-1}, h^{\hat{y}}_{t-1}, h^{\tilde{x}})$   (10)
$\hat{y}_t \sim \mathrm{softmax}(l'(h^{\hat{y}}_t))$   (11)

where f_ŷ is the non-linear function, and the actual probabilities are given by a softmax function after a linear transformation l′ of h^ŷ. At training time, rather than ŷ_{t−1} we feed the ground truth y_{t−1}.

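Both the latent predictor and the reconstruction decoder are LSTMs that attend over a sequence of encoded states. The paper follows the attention of Bahdanau et al. (2015); the sketch below uses simple dot-product scoring instead, purely to keep the example short, so it should be read as a stand-in rather than the exact attention function used.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(query, memory):
    """Weight the encoded states by their similarity to the current decoder
    state and return the weighted average (the attention context)."""
    scores = np.array([query @ m for m in memory])
    weights = softmax(scores)
    context = sum(w * m for w, m in zip(weights, memory))
    return context, weights

# toy usage: 4 encoder states of size 5 and one decoder state of size 5
rng = np.random.default_rng(4)
memory = [rng.normal(size=5) for _ in range(4)]
decoder_state = rng.normal(size=5)
context, weights = attend(decoder_state, memory)
print(weights, weights.sum())   # a distribution over the 4 encoded positions
```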
2.5 Loss function

The complete model described in this section gives a reconstruction function y ↦ ŷ. We define a loss on this reconstruction which accommodates the unsupervised case, where x is not observed in the training data, and the supervised case, where (x, y) pairs are available. Together, these allow us to train the SEQ4 model in a semi-supervised setting, which experiments will show provides some benefits over a purely supervised training regime.

Unsupervised case  When x isn't observed, the loss we minimise during training is the reconstruction loss on y, expressed as the negative log-likelihood NLL(ŷ, y) of the true labels y relative to the predictions ŷ. To this, we add as a regularising term the KL divergence KL[q(γ|y) ‖ p(γ)] which effectively penalises the mean and variance of q(γ|y) from diverging from those of a prior p(γ), which we model as a diagonal Gaussian N(0, I). This has the effect of smoothing the logistic normal distribution from which we draw the distributions over symbols of x, guarding against overfitting of the latent distributions over x to symbols seen in the supervised case discussed below. The unsupervised loss is therefore formalised as

$\mathcal{L}^{\mathrm{unsup}} = \mathrm{NLL}(\hat{y}, y) + \alpha\,\mathrm{KL}[q(\gamma \mid y) \,\|\, p(\gamma)]$   (12)

where the regularising factor α is tuned on validation, and

$\mathrm{KL}[q(\gamma \mid y) \,\|\, p(\gamma)] = \sum_{i=1}^{L_x} \mathrm{KL}[q(\gamma_i \mid y) \,\|\, p(\gamma)]$   (13)

We use a closed form of these individual KL divergences, described by Kingma and Welling (2014).

Supervised case  When x is observed, we additionally minimise the prediction loss on x, expressed as the negative log-likelihood NLL(x̃, x) of the true labels x relative to the predictions x̃, and do not impose the KL loss. The supervised loss is thus

$\mathcal{L}^{\mathrm{sup}} = \mathrm{NLL}(\tilde{x}, x) + \mathrm{NLL}(\hat{y}, y)$   (14)

In both the supervised and unsupervised case, because of the continuous relaxation on generating x̃ and the reparameterisation trick, the gradient of the losses with regard to the model parameters is well defined throughout SEQ4.

Semi-supervised training and inference  We train with a weighted combination of the supervised and unsupervised losses described above. Once trained, we simply use the x ↦ y decoder segment of the model to predict y from sequences of symbols x represented as one-hot vectors. When the decoder is trained without the encoder in a fully supervised manner, it serves as our supervised sequence-to-sequence baseline model under the name S2S.

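A sketch of the two losses: the unsupervised objective of Equations 12–13 with the closed-form KL between a diagonal Gaussian and the standard normal prior (Kingma and Welling, 2014), the supervised objective of Equation 14, and a weighted combination of the two. The NLL values and the mixing weights are placeholders; in the model they come from the decoders and from validation tuning.

```python
import numpy as np

def kl_diag_gaussian_vs_standard(mu, log_sigma_sq):
    """Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)] (Kingma and Welling, 2014),
    applied per latent time step as in Equation 13."""
    return 0.5 * np.sum(np.exp(log_sigma_sq) + mu ** 2 - 1.0 - log_sigma_sq)

def unsupervised_loss(nll_y, mus, log_sigma_sqs, alpha):
    # Equation 12: reconstruction NLL on y plus the summed KL regulariser (Equation 13).
    kl = sum(kl_diag_gaussian_vs_standard(m, s) for m, s in zip(mus, log_sigma_sqs))
    return nll_y + alpha * kl

def supervised_loss(nll_x, nll_y):
    # Equation 14: prediction loss on x plus reconstruction loss on y, no KL term.
    return nll_x + nll_y

# toy usage with placeholder NLL values; in SEQ4 these come from the decoders
rng = np.random.default_rng(3)
mus = [rng.normal(size=6) for _ in range(3)]            # one (mu, log sigma^2) pair per latent step
log_sigma_sqs = [rng.normal(size=6) for _ in range(3)]
L_unsup = unsupervised_loss(nll_y=12.3, mus=mus, log_sigma_sqs=log_sigma_sqs, alpha=0.1)
L_sup = supervised_loss(nll_x=8.7, nll_y=12.3)
semi_supervised_objective = 0.5 * L_sup + 0.5 * L_unsup   # weighted combination (weights illustrative)
print(L_unsup, L_sup, semi_supervised_objective)
```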
3 Tasks and Data Generation

We apply our model to three tasks outlined in this section. Moreover, we explain how we generated additional unsupervised training data for two of these tasks. Examples from all datasets are in Table 1.

3.1 GeoQuery

The first task we consider is the prediction of a query on the GEO corpus, which is a frequently used benchmark for semantic parsing. The corpus contains 880 questions about US geography together with executable queries representing those questions. We follow the approach established by Zettlemoyer and Collins (2005) and split the corpus into 600 training and 280 test cases. Following common practice, we augment the dataset by referring to the database during training and test time. In particular, we use the database to identify and anonymise variables (cities, states, countries and rivers) following the method described in Dong and Lapata (2016).

Most prior work on the GEO corpus relies on standard semantic parsing methods together with custom heuristics or pipelines for this corpus. The recent paper by Dong and Lapata (2016) is of note, as it uses a sequence-to-sequence model for training which is the unidirectional equivalent to S2S, and also to the decoder part of our SEQ4 network.

3.2 Open Street Maps

The second task we tackle with our model is the NLMAPS dataset by Haas and Riezler (2016). The dataset contains 1,500 training and 880 testing instances of natural language questions with corresponding machine readable queries over the geographical OpenStreetMap database. The dataset contains natural language questions in both English and German but we focus only on single language semantic parsing, similar to the first task in Haas and Riezler (2016). We use the data as is, with the only pre-processing step being the tokenization of both natural language and query form³.

³We removed quotes, added spaces around (), and separated the question mark from the last word in each question.
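Footnote 3 summarises the only pre-processing applied to NLMAPS. A rough reimplementation of those three rules might look as follows; the exact script used by the authors is not given, so treat this as an approximation.

```python
import re

def tokenize_nlmaps(text):
    """Rough tokenisation following footnote 3: strip quotes, space out
    parentheses, and split a trailing question mark off the last word."""
    text = text.replace("'", "").replace('"', "")
    text = re.sub(r"([()])", r" \1 ", text)        # add spaces around ( and )
    text = re.sub(r"\?$", " ?", text.strip())      # separate the final question mark
    return text.split()

print(tokenize_nlmaps("Where are kindergartens in Hamburg?"))
print(tokenize_nlmaps("query(area(keyval('name','Hamburg')),qtype(latlong))"))
```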
3.3 Navigational Instructions to Actions

The SAIL corpus and task were developed to train agents to follow free-form navigational route instructions in a maze environment (MacMahon et al., 2006; Chen and Mooney, 2011). It consists of a small number of mazes containing features such as objects, wall and floor types. These mazes come together with a large number of human instructions paired with the required actions⁴ to reach the goal state described in those instructions.

⁴There are four actions: LEFT, RIGHT, GO, STOP.

We use the sentence-aligned version of the SAIL route instruction dataset containing 3,236 sentences (Chen and Mooney, 2011). Following previous work, we accept an action sequence as correct if and only if the final position and orientation exactly match those of the gold data. We do not perform any pre-processing on this dataset.

3.4 Data Generation

As argued earlier, we are focusing on tasks where aligned data is sparse and expensive to obtain, while it should be cheap to get unsupervised, monomodal data. Although this is a reasonable assumption for real-world data, the datasets considered have no such component, thus the approach taken here is to generate random database queries or maze paths, i.e. the machine readable side of the data, and train a semi-supervised model. The alternative not explored here would be to generate natural language questions or instructions instead, but that is more difficult to achieve without human intervention. For this reason, we generate the machine readable side of the data for the GEOQUERY and SAIL tasks⁵.

For GEOQUERY, we fit a 3-gram Kneser-Ney (Chen and Goodman, 1999) model to the queries in the training set and sample about 7 million queries from it. We ensure that the sampled queries are different from the training queries, but do not enforce validity. This intentionally simplistic approach is to demonstrate the applicability of our model.

The SAIL dataset has only three mazes. We added a fourth one and over 150k random paths, including duplicates. The new maze is larger (21×21 grid) than the existing ones, and seeks to approximately replicate the key statistics of the other three mazes (maximum corridor length, distribution of objects, etc). Paths within that maze are created by randomly sampling start and end positions.

⁵Our randomly generated unsupervised datasets can be downloaded from http://deepmind.com/publications
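To illustrate the query-generation recipe, the sketch below fits a count-based trigram model to a couple of toy queries and samples new ones, discarding any that coincide with the training set. The real setup uses Kneser-Ney smoothing (Chen and Goodman, 1999) and samples millions of queries; the plain maximum-likelihood trigram here is a simplification, and validity of the sampled queries is, as in the paper, not enforced.

```python
import random
from collections import defaultdict

def fit_trigram(queries):
    """Count-based trigram model over tokenised queries (a simplification of
    the Kneser-Ney model used in the paper)."""
    counts = defaultdict(lambda: defaultdict(int))
    for q in queries:
        toks = ["<s>", "<s>"] + q.split() + ["</s>"]
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            counts[(a, b)][c] += 1
    return counts

def sample_query(counts, rng, max_len=40):
    ctx, out = ("<s>", "<s>"), []
    while len(out) < max_len:
        nxt = counts.get(ctx)
        if not nxt:
            break
        tok = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if tok == "</s>":
            break
        out.append(tok)
        ctx = (ctx[1], tok)
    return " ".join(out)

train = ["answer ( population ( stateid ( STATE ) ) )",
         "answer ( capital ( stateid ( STATE ) ) )"]
model = fit_trigram(train)
rng = random.Random(0)
# keep only samples that differ from the training queries, as in Section 3.4
samples = {q for q in (sample_query(model, rng) for _ in range(100)) if q not in train}
print(sorted(samples))
```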
4 Experiments

We evaluate our model on the three tasks in multiple settings. First, we establish a supervised baseline to compare the S2S model with prior work. Next, we train our SEQ4 model in a semi-supervised setting on the entire dataset with the additional monomodal training data described in the previous section.

Finally, we perform an "ablation" study where we discard some of the training data and compare S2S to SEQ4. S2S is trained solely on the reduced data in a supervised manner, while SEQ4 is once again trained semi-supervised on the same reduced data plus the machine readable part of the discarded data (SEQ4-) or on the extra generated data (SEQ4+).

Training  We train the model using standard gradient descent methods. As none of the datasets used here contain development sets, we tune hyperparameters by cross-validating on the training data. In the case of the SAIL corpus we train on three folds (two mazes for training and validation, one for test each) and report weighted results across the folds following prior work (Mei et al., 2016).

4.1 GeoQuery

The evaluation metric for GEOQUERY is the accuracy of exactly predicting the machine readable query. As results in Table 2 show, our supervised S2S baseline model performs slightly better than the comparable model by Dong and Lapata (2016). The semi-supervised SEQ4 model with the additional generated queries improves on it further.

The ablation study in Table 3 demonstrates a widening gap between supervised and semi-supervised as the amount of labelled training data gets smaller. This suggests that our model can leverage unlabelled data even when only a small amount of labelled data is available.

Model                            Accuracy
Zettlemoyer and Collins (2005)   79.3
Zettlemoyer and Collins (2007)   86.1
Liang et al. (2013)              87.9
Kwiatkowski et al. (2011)        88.6
Zhao and Huang (2014)            88.9
Kwiatkowski et al. (2013)        89.0
Dong and Lapata (2016)           84.6
Jia and Liang (2016)⁶            89.3
S2S                              86.5
SEQ4                             87.3

Table 2: Non-neural and neural model results on GEOQUERY using the train/test split from Zettlemoyer and Collins (2005).

⁶Jia and Liang (2016) used hand crafted grammars to generate additional supervised training data.

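The evaluation described here reduces to two small computations: exact-match accuracy over predicted queries, and, for SAIL, an average over folds weighted by test-set size. A minimal sketch, with illustrative function names and toy inputs:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of queries predicted exactly, the GEOQUERY metric used here."""
    assert len(predictions) == len(references)
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def weighted_fold_accuracy(fold_results):
    """Average per-fold accuracies weighted by test-set size (cf. the SAIL folds)."""
    total = sum(n for _, n in fold_results)
    return sum(acc * n for acc, n in fold_results) / total

preds = ["answer(capital(stateid(STATE)))", "answer(population(stateid(STATE)))"]
golds = ["answer(capital(stateid(STATE)))", "answer(area(stateid(STATE)))"]
print(exact_match_accuracy(preds, golds))                          # 0.5
print(weighted_fold_accuracy([(0.6, 1000), (0.5, 500), (0.7, 1500)]))
```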
Sup. data   S2S    SEQ4-   SEQ4+
5%          21.9   30.1    26.2
10%         39.7   42.1    42.1
25%         62.4   70.4    67.1
50%         80.3   81.2    80.4
75%         85.3   84.1    85.1
100%        86.5   86.5    87.3

Table 3: Results of the GEOQUERY ablation study.

Sup. data   S2S     SEQ4-
5%          3.22    3.74
10%         17.61   17.12
25%         33.74   33.50
50%         49.52   53.72
75%         66.93   66.45
100%        78.03   78.03

Table 5: Results of the NLMAPS ablation study.

Model                     Accuracy
Haas and Riezler (2016)   68.30
S2S                       78.03

Table 4: Results on the NLMAPS corpus.

4.2 Open Street Maps

We report results for the NLMAPS corpus in Table 4, comparing the supervised S2S model to the results posted by Haas and Riezler (2016). While their model used a semantic parsing pipeline including alignment, stemming, language modelling and CFG inference, the strong performance of the S2S model demonstrates the strength of fairly vanilla attention-based sequence-to-sequence models. It should be pointed out that the previous work reports the number of correct answers when queries were executed against the dataset, while we evaluate on the strict accuracy of the generated queries. While we expect these numbers to be nearly equivalent, our evaluation is strictly harder as it does not allow for reordering of query arguments and similar relaxations.

We investigate the SEQ4 model only via the ablation study in Table 5 and find little gain through the semi-supervised objective. Our attempt at cheaply generating unsupervised data for this task was not successful, likely due to the complexity of the underlying database.

4.3 Navigational Instructions to Actions

Model extension  The experiments for the SAIL task differ slightly from the other two tasks in that the language input does not suffice for choosing an action. While a simple instruction such as 'turn left' can easily be translated into the action sequence LEFT-STOP, more complex instructions such as 'Walk forward until you see a lamp' require knowledge of the agent's position in the maze.

To accomplish this we modify the model as follows. First, when encoding action sequences, we concatenate each action with a representation of the maze at the given position, representing the maze-state akin to Mei et al. (2016) with a bag-of-features vector. Second, when decoding action sequences, the RNN outputs an action which is used to update the agent's position and the representation of that new position is fed into the RNN as its next input.

Training regime  We cross-validate over the three mazes in the dataset and report overall results weighted by test size (cf. Mei et al. (2016)). Both our supervised and semi-supervised model perform worse than the state-of-the-art (see Table 6), but the latter enjoys a comfortable margin over the former. As the S2S model broadly reimplements the work of Mei et al. (2016), we put the discrepancy in performance down to the particular design choices that we did not follow in order to keep the model here as general as possible and comparable across tasks.

The ablation studies (Table 7) show little gain for the semi-supervised approach when only using data from the original training set, but substantial improvement with the additional unsupervised data.

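A minimal sketch of the maze-state concatenation described under 'Model extension' above: each action is represented as a one-hot vector concatenated with a bag-of-features description of the agent's current cell. The feature set and the toy maze below are invented for the example; the actual features follow Mei et al. (2016).

```python
import numpy as np

ACTIONS = ["LEFT", "RIGHT", "GO", "STOP"]
FEATURES = ["wall_ahead", "lamp", "bench", "yellow_floor"]   # illustrative feature set

def maze_state_features(maze, position, orientation):
    """Bag-of-features vector for the agent's current cell; the real feature
    inventory follows Mei et al. (2016), here we just look up a toy maze."""
    present = maze.get((position, orientation), set())
    return np.array([1.0 if f in present else 0.0 for f in FEATURES])

def encode_action(action, maze, position, orientation):
    # Concatenate the one-hot action with the maze-state vector, as in Section 4.3.
    one_hot = np.eye(len(ACTIONS))[ACTIONS.index(action)]
    return np.concatenate([one_hot, maze_state_features(maze, position, orientation)])

toy_maze = {((1, 6), 90): {"bench", "yellow_floor"}}
vec = encode_action("RIGHT", toy_maze, position=(1, 6), orientation=90)
print(vec)   # 4 action dimensions followed by 4 feature dimensions
```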
Model                          Accuracy
Chen and Mooney (2011)         54.40
Kim and Mooney (2012)          57.22
Andreas and Klein (2015)       59.60
Kim and Mooney (2013)          62.81
Artzi et al. (2014)            64.36
Artzi and Zettlemoyer (2013)   65.28
Mei et al. (2016)              69.98
S2S                            58.60
SEQ4                           63.25

Table 6: Results on the SAIL corpus.

Sup. data   S2S     SEQ4-   SEQ4+
5%          37.79   41.48   43.44
10%         40.77   41.26   48.67
25%         43.76   43.95   51.19
50%         48.01   49.42   55.97
75%         48.99   49.20   57.40
100%        49.49   49.49   58.28

Table 7: Results of the SAIL ablation study. Results are from models trained on L and Jelly maps, tested on Grid only, hence the discrepancy between the 100% result and S2S in Table 6.

5 Discussion

Supervised training  The prediction accuracies of our supervised baseline S2S model are mixed with respect to prior results on their respective tasks. For GEOQUERY, S2S performs significantly better than the most similar model from the literature (Dong and Lapata, 2016), mostly due to the fact that y and x are encoded with bidirectional LSTMs. With a unidirectional LSTM we get similar results to theirs.

On the SAIL corpus, S2S performs worse than the state of the art. As the models are broadly equivalent we attribute this difference to a number of task-specific choices and optimisations⁷ made in Mei et al. (2016) which we did not reimplement for the sake of using a common model across all three tasks.

For NLMAPS, S2S performs much better than the state-of-the-art, exceeding the previous best result by 11% despite a very simple tokenization method and a lack of any form of entity anonymisation.

⁷In particular we don't use beam search and ensembling.

Semi-supervised training  In both the case of GEOQUERY and the SAIL task we found the semi-supervised model to convincingly outperform the fully supervised model. The effect was particularly notable in the case of the SAIL corpus, where performance increased from 58.60% accuracy to 63.25% (see Table 6). It is worth remembering that the supervised training regime consists of three folds of tuning on two maps with subsequent testing on the third map, which carries a risk of overfitting to the training maps. The introduction of the fourth unsupervised map clearly mitigates this effect. Table 8 shows some examples of unsupervised logical forms being transformed into natural language, which demonstrate how the model can learn to sensibly ground unsupervised data.

Input from unsupervised data (y)                  |  Generated latent representation (x)
answer smallest city loc_2 state stateid STATE    |  what is the smallest city in the state of STATE
answer city loc_2 state next_to_2 stateid STATE   |  what are the cities in states which border STATE
answer mountain loc_2 countryid COUNTRY           |  what is the lakes in COUNTRY
answer state next_to_2 state all                  |  which states longer states show peak states to

Table 8: Positive and negative examples of latent language together with the randomly generated logical form from the unsupervised part of the GEOQUERY training. Note that the natural language (x) does not occur anywhere in the training data in this form.

Ablation performance  The experiments with additional unsupervised data prove the feasibility of our approach and clearly demonstrate the usefulness of the SEQ4 model for the general class of sequence-to-sequence tasks where supervised data is hard to come by. To analyse the model further, we also look at the performance of both S2S and SEQ4 when reducing the amount of supervised training data available to the model. We compare three settings: the supervised S2S model with reduced training data, SEQ4- which uses the removed training data in an unsupervised fashion (throwing away the natural language) and SEQ4+ which uses the randomly generated unsupervised data described in Section 3. The S2S model behaves as expected on all three tasks, its performance dropping with the size of the training data. The performance of SEQ4- and SEQ4+ requires more analysis.

In the case of GEOQUERY, having unlabelled data from the true distribution (SEQ4-) is a good thing when there is enough of it, as clearly seen when only 5% of the original dataset is used for supervised training and the remaining 95% is used for unsupervised training. The gap shrinks as the amount of supervised data is increased, which is as expected. On the other hand, using a large amount of extra, generated data from an approximating distribution (SEQ4+) does not help as much initially when compared with the unsupervised data from the true distribution. However, as the size of the unsupervised dataset in SEQ4- becomes the bottleneck this gap closes and eventually the model trained on the extra data achieves higher accuracy.

For the SAIL task the semi-supervised models do better than the supervised results throughout, with the model trained on randomly generated additional data consistently outperforming the model trained only on the original data. This gives further credence to the risk of overfitting to the training mazes already mentioned above.

Finally, in the case of the NLMAPS corpus, the semi-supervised approach does not appear to help much at any point during the ablation. These indistinguishable results are likely due to the task's complexity, causing the ablation experiments to either have too little supervised data to sufficiently ground the latent space to make use of the unsupervised data, or, in the higher percentages, too little unsupervised data to meaningfully improve the model.

6 Related Work

Semantic parsing  The tasks in this paper all broadly belong to the domain of semantic parsing, which describes the process of mapping natural language to a formal representation of its meaning. This is extended in the SAIL navigation task, where the formal representation is a function of both the language instruction and a given environment.

Semantic parsing is a well-studied problem with numerous approaches including inductive logic programming (Zelle and Mooney, 1996), string-to-tree (Galley et al., 2004) and string-to-graph (Jones et al., 2012) transducers, grammar induction (Kwiatkowski et al., 2011; Artzi and Zettlemoyer, 2013; Reddy et al., 2014) or machine translation (Wong and Mooney, 2006; Andreas et al., 2013). While a large part of the relevant literature focuses on defining the grammar of the logical forms (Zettlemoyer and Collins, 2005), other models learn purely from aligned pairs of text and logical form (Berant and Liang, 2014), or from more weakly supervised signals such as question-answer pairs together with a database (Liang et al., 2011). Recent work of Jia and Liang (2016) induces a synchronous context-free grammar and generates additional training examples (x, y), which is one way to address data scarcity issues. The semi-supervised setup proposed here offers an alternative solution to this issue.

Discrete autoencoders  Very recently there has been some related work on discrete autoencoders for natural language processing (Suster et al., 2016; Marcheggiani and Titov, 2016, i.a.). This work presents a first approach to using effectively discretised sequential information as the latent representation without resorting to draconian assumptions (Ammar et al., 2014) to make marginalisation tractable. While our model is not exactly marginalisable either, the continuous relaxation makes training far more tractable. A related idea was recently presented in Gülçehre et al. (2015), who use monolingual data to improve machine translation by fusing a sequence-to-sequence model and a language model.

7 Conclusion

We described a method for augmenting a supervised sequence transduction objective with an autoencoding objective, thereby enabling semi-supervised training where previously a scarcity of aligned data might have held back model performance. Across multiple semantic parsing tasks we demonstrated the effectiveness of this approach, improving model performance by training on randomly generated unsupervised data in addition to the original data.

Going forward it would be interesting to further analyse the effects of sampling from a logistic-normal distribution as opposed to a softmax in order to better understand how this impacts the distribution in the latent space. While we focused on tasks with little supervised data and additional unsupervised data in y, it would be straightforward to reverse the model to train it with additional labelled data in x, i.e. on the natural language side. A natural extension would also be a formulation where semi-supervised training was performed in both x and y. For instance, machine translation lends itself to such a formulation where for many language pairs parallel data may be scarce while there is an abundance of monolingual data.

References

Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional Random Field Autoencoders for Unsupervised Structured Prediction. In Proceedings of NIPS.

Jacob Andreas and Dan Klein. 2015. Alignment-based Compositional Semantics for Instruction Following. In Proceedings of EMNLP, September.

Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic Parsing as Machine Translation. In Proceedings of ACL, August.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions. Transactions of the Association for Computational Linguistics, 1(1):49–62.

Yoav Artzi, Dipanjan Das, and Slav Petrov. 2014. Learning Compact Lexicons for CCG Semantic Parsing. In Proceedings of EMNLP, October.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR.

Jonathan Berant and Percy Liang. 2014. Semantic Parsing via Paraphrasing. In Proceedings of ACL, June.

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–393.

David L. Chen and Raymond J. Mooney. 2011. Learning to Interpret Natural Language Navigation Instructions from Observations. In Proceedings of AAAI, August.

Li Dong and Mirella Lapata. 2016. Language to Logical Form with Neural Attention. arXiv preprint arXiv:1601.01280.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of HLT-NAACL, May.

Çağlar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On Using Monolingual Corpora in Neural Machine Translation. arXiv preprint arXiv:1503.03535.

Carolin Haas and Stefan Riezler. 2016. A corpus and semantic parser for multilingual natural language querying of openstreetmap. In Proceedings of NAACL, June.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL).

Bevan Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-Based Machine Translation with Hyperedge Replacement Grammars. In Proceedings of COLING 2012, December.

Joohyun Kim and Raymond J. Mooney. 2012. Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision. In Proceedings of EMNLP-CoNLL, July.

Joohyun Kim and Raymond Mooney. 2013. Adapting Discriminative Reranking to Grounded Language Learning. In Proceedings of ACL, August.

Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In Proceedings of ICLR.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical Generalization in CCG Grammar Induction for Semantic Parsing. In Proceedings of EMNLP.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of EMNLP.

Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning Dependency-based Compositional Semantics. In Proceedings of the ACL-HLT.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions. In Proceedings of AAAI.

Diego Marcheggiani and Ivan Titov. 2016. Discrete-state variational autoencoders for joint discovery and factorization of relations. Transactions of ACL.

Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences. In Proceedings of AAAI.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale Semantic Parsing without Question-Answer Pairs. Transactions of the Association for Computational Linguistics, 2:377–392.

Simon Suster, Ivan Titov, and Gertjan van Noord. 2016. Bilingual Learning of Multi-sense Embeddings with Discrete Autoencoders. CoRR, abs/1603.09128.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a Foreign Language. In Proceedings of NIPS.

Yuk Wah Wong and Raymond J. Mooney. 2006. Learning for Semantic Parsing with Statistical Machine Translation. In Proceedings of NAACL.

John M. Zelle and Raymond J. Mooney. 1996. Learning to Parse Database Queries using Inductive Logic Programming. In Proceedings of AAAI/IAAI, pages 1050–1055, August.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. In UAI, pages 658–666. AUAI Press.

Luke Zettlemoyer and Michael Collins. 2007. Online Learning of Relaxed CCG Grammars for Parsing to Logical Form. In Proceedings of EMNLP-CoNLL, June.

Kai Zhao and Liang Huang. 2014. Type-driven incremental semantic parsing with polymorphism. arXiv preprint arXiv:1411.5379.

Jie Zhou and Wei Xu. 2015. End-to-end Learning of Semantic Role Labeling Using Recurrent Neural Networks. In Proceedings of ACL.
