Learning from Executions for Semantic Parsing

Bailin Wang, Mirella Lapata and Ivan Titov
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
[email protected], {mlap, ititov}@inf.ed.ac.uk

Abstract

Semantic parsing aims at translating natural language (NL) utterances onto machine-interpretable programs, which can be executed against a real-world environment. The expensive annotation of utterance-program pairs has long been acknowledged as a major bottleneck for the deployment of contemporary neural models to real-life applications. In this work, we focus on the task of semi-supervised learning where a limited amount of annotated data is available together with many unlabeled NL utterances. Based on the observation that programs which correspond to NL utterances must always be executable, we propose to encourage a parser to generate executable programs for unlabeled utterances. Due to the large search space of executable programs, conventional methods that use approximations based on beam search, such as self-training and top-k marginal likelihood training, do not perform as well. Instead, we view the problem of learning from executions from the perspective of posterior regularization and propose a set of new training objectives. Experimental results on OVERNIGHT and GEOQUERY show that our new objectives outperform conventional methods, bridging the gap between semi-supervised and supervised learning.

Utterance: list all 3 star rated thai restaurants

| Program Candidates                                         | Gold | Exe |
| select restaurant where star_rating = thai                 |  ✗   |  ✗  |
| select restaurant where cuisine > 3                        |  ✗   |  ✗  |
| select restaurant where star_rating = 3                    |  ✗   |  ✓  |
| select restaurant where star_rating = 3 and cuisine = thai |  ✓   |  ✓  |

Figure 1: Candidate programs for an utterance can be classified by executability (Exe); note that the gold program is always in the set of executable programs. We propose to utilize the weak yet freely available signal of executability for learning.

1 Introduction

Semantic parsing is the task of mapping natural language (NL) utterances to meaning representations (aka programs) that can be executed against a real-world environment such as a knowledge base or a relational database. While neural sequence-to-sequence models (Dong and Lapata, 2016; Jia and Liang, 2016a) have achieved much success in this task in recent years, they usually require a large amount of labeled data (i.e., utterance-program pairs) for training. However, annotating utterances with programs is expensive as it requires expert knowledge of meaning representations (e.g., logical forms, SQL) and the environment against which they are executed (e.g., a knowledge base, a relational database). An alternative to annotation is to collect answers (or denotations) of programs, rather than programs themselves (Liang et al., 2013; Berant et al., 2013). In this work, we focus on the more extreme setting where there are no annotations available for a large number of utterances. This setting resembles a common real-life scenario where massive numbers of user utterances can be collected when deploying a semantic parser (Iyer et al., 2017). Effectively utilizing the unlabeled data makes it possible for a semantic parser to improve over time without human involvement.

Our key observation is that not all candidate programs for an utterance will be semantically valid. This implies that only some candidate programs can be executed and obtain non-empty execution results.¹ As illustrated in Figure 1, executability is a weak signal that can differentiate between semantically valid and invalid programs. On unlabeled utterances, we can encourage a parser to focus only on executable programs, ignoring non-executable ones. Moreover, the executability of a program can be obtained from an executor for free, without requiring human effort. Executability has previously been used to guide the decoding of a semantic parser (Wang et al., 2018). We take a step further and directly use this weak signal for learning from unlabeled utterances.

¹ In the rest of this paper, we extend the meaning of 'executability' and use it to refer to the case where a program is executable and obtains non-empty results.

To learn from executability, we resort to marginal likelihood training, i.e., maximizing the marginal likelihood of all executable programs for an unlabeled NL utterance. However, the space of all possible programs is exponentially large, as is the space of executable ones. Hence, simply marginalizing over all executable programs is intractable. Typical approximations use beam search to retrieve a handful of ('seen') programs, which are used to approximate the entire space. Such approximations can lead to optimization getting trapped in undesirable local minima. For example, we observe that encouraging a model to exploit seen executable programs hinders exploration and reinforces the preference for shorter programs, as discussed in Section 6.3. This happens because shorter programs are both more likely to be among 'seen' programs (probably due to locally-normalized autoregressive modeling) and more likely to be executable.
To alleviate these issues, we derive three new alternative objectives, relying on a new interpretation of marginal likelihood training from the perspective of posterior regularization. Our proposed objectives encode two kinds of inductive biases: 1) discouraging seen non-executable programs, which plays a similar role to encouraging seen executable ones but does not share its drawback of hindering exploration; 2) encouraging sparsity among executable programs, which encourages a parser to focus on only a subset of executable programs by softly injecting a sparsity constraint. This is desirable, as there are only one or few correct programs for each utterance (see Figure 1), and an accurate parser should assign probability mass only to this subset. We collectively call these objectives X-PR, as a shorthand for Execution-guided Posterior Regularization.

We conduct experiments on two representative semantic parsing tasks: text-to-LF (logical form) parsing over a knowledge base and text-to-SQL (Zelle and Mooney, 1996) parsing over a relational database. Concretely, we evaluate our methods on the OVERNIGHT (Wang et al., 2015a) and GEOQUERY datasets. We simulate the semi-supervised learning setting by treating 70% of the training data as unlabeled. Empirical results show that our method can substantially boost the performance of a parser, trained only on labeled data, by utilizing a large amount of unlabeled data.

Our contributions are summarized as follows:

• We show how to exploit unlabeled utterances by taking advantage of their executability.
• To better learn from executability, we propose a set of new objectives based on posterior regularization.
• Our method can help a base parser achieve substantially better performance by utilizing unlabeled data.

Our code, datasets, and splits are publicly available at https://github.com/berlino/tensor2struct-public.

2 Related Work

Semi-Supervised Semantic Parsing  In the context of semantic parsing, semi-supervised models using limited amounts of parallel data and large amounts of unlabeled data treat either utterances or programs as discrete latent variables and induce them in the framework of generative models (Kočiský et al., 2016; Yin et al., 2018). A challenge with these methods is that (combinatorially) complex discrete variables make optimization very hard, even with the help of variational inference. In this work, we instead seek to directly constrain the discriminative parser with signals obtained from executions. Our method can potentially be integrated into these generative models to regularize discrete variables.

(Underspecified) Sequence-Level Rewards  There have been attempts in recent years to integrate sequence-level rewards into sequence-to-sequence training as a way of accommodating task-specific objectives. For example, BLEU can be optimized for coherent text generation (Bosselut et al., 2018) and machine translation (Wu et al., 2018) via reinforcement learning or beam search (Wiseman and Rush, 2016). In this work, we resort to marginal likelihood training to exploit binary executability rewards for semantic parsing (i.e., whether a program is executable or not), which has been shown to be more effective than REINFORCE (Guu et al., 2017).

More importantly, our binary reward is underspecified, i.e., there exist many spurious programs that enjoy the same reward as the gold program. This issue of learning from underspecified rewards underlies many weakly-supervised tasks, e.g., learning from denotations (Liang et al., 2013; Berant et al., 2013) and weakly supervised question answering (Min et al., 2019). Previous work tried to model latent alignments (Wang et al., 2019) between NL and programs to alleviate this issue. In this work, we take an orthogonal direction and propose several training objectives that alleviate the impact of spurious programs.

Execution for Semantic Parsing  Execution has been utilized in semantic parsing (Wang et al., 2018) and in the related area of program synthesis (Chen et al., 2019). These approaches exploit the execution of partial programs to guide the search for plausible complete programs. Although partial execution is feasible for SQL-style programs, it cannot be trivially extended to general meaning representations (e.g., logical forms). In this work, we explore a more general setting where execution results can only be obtained for complete programs.

3 Executability as Learning Signal

In this section, we formally define our semi-supervised learning setting and show how to incorporate executability into the training objective whilst relying on the marginal likelihood training framework. We also present two conventional approaches to optimizing marginal likelihood.

3.1 Problem Definition

Given a set of labeled NL-program pairs {(x^l_i, y^l_i)}_{i=1}^{N} and a set of unlabeled NL utterances {x_j}_{j=1}^{M}, where N and M denote the sizes of the respective datasets, we would like to learn a neural parser p(y|x, θ), parameterized by θ, that maps utterances to programs. The objective to minimize consists of two parts:

    J = (1/N) Σ_{i=1}^{N} L_sup(x^l_i, y^l_i) + λ (1/M) Σ_{j=1}^{M} L_unsup(x_j)    (1)

where L_sup and L_unsup denote the supervised and unsupervised loss, respectively. For labeled data, we use the negative log-likelihood of gold programs; for unlabeled data, we instead use the log marginal likelihood (MML) of all executable programs. Specifically, they are defined as follows:

    L_sup(x, y) = − log p(y|x, θ)    (2)
    L_unsup(x) = − log Σ_y R(y) p(y|x, θ)    (3)

where R(y) is a binary reward function that returns 1 if y is executable and 0 otherwise. In practice, this function is implemented by running a task-specific executor, e.g., a SQL executor.
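As an illustration, the sketch below shows one way R(y) could be realized for text-to-SQL parsing under the paper's extended notion of executability (footnote 1): the program must run without error and return a non-empty result. The sqlite3-based executor and the function name are our own illustrative assumptions, not the released implementation.

```python
import sqlite3

def reward(sql: str, db_path: str) -> int:
    """Hypothetical binary reward R(y) for a candidate SQL program y:
    returns 1 iff the query executes and yields a non-empty result."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return 0               # ill-formed or semantically invalid program
    return 1 if rows else 0    # empty denotations also get zero reward
```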

Another alternative for the unsupervised loss is REINFORCE (Sutton et al., 1999), i.e., maximizing the expected R(y) with respect to p(y|x, θ). However, as presented in Guu et al. (2017), this objective usually underperforms MML, which is consistent with our initial experiments.²

² We review the comparison between REINFORCE and MML in the appendix.

3.2 Self-Training and Top-K MML

MML in Equation (3) requires marginalizing over all executable programs, which is intractable. Conventionally, we resort to beam search to explore the space of programs and collect executable ones. To illustrate, we can divide the space of programs into four parts based on whether they are executable and observed, as shown in Figure 2a. For example, programs in P_SE ∪ P_SN are seen in the sense that they are retrieved by beam search. Programs in P_SE ∪ P_UE are all executable, though only programs in P_SE can be directly observed.

Two common approximations of Equation (3) are Self-Training (ST) and Top-K MML, defined as follows:

    L_ST(x, θ) = − log p(y*|x, θ)    (4)
    L_top-k(x, θ) = − log Σ_{y∈P_SE} p(y|x, θ)    (5)

where y* denotes the most probable executable program, approximated by the most probable one from beam search.

It is obvious that both methods only exploit programs in P_SE, i.e., executable programs retrieved by beam search. In cases where a parser successfully includes the correct programs in P_SE, both approximations should work reasonably well. However, if a parser is uncertain and P_SE does not contain the gold program, it will mistakenly exploit incorrect programs during learning, which is problematic.

A naive solution for improving Self-Training or Top-K MML is to explore a larger space, e.g., to increase the beam size so as to retrieve more executable programs. However, this would inevitably increase the computational cost of learning. We also show in the appendix that increasing the beam size, once it exceeds a certain threshold, is no longer beneficial for learning. In this work, we instead propose better approximations without increasing the beam size.
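To make Equations (4) and (5) concrete, the following sketch computes both losses from a beam of candidates, assuming beam scores are log-probabilities under the parser and executability flags come from an executor such as the one sketched in Section 3.1. This is a schematic rendering of the two approximations, not the authors' code.

```python
import torch

def st_and_topk_losses(beam_logps: torch.Tensor, exe: torch.Tensor):
    """beam_logps: log p(y|x, θ) per beam candidate, shape [K];
    exe: binary executability flags R(y), shape [K].
    Returns (L_ST, L_topk) computed over the seen executable set P_SE."""
    exe_logps = beam_logps[exe.bool()]          # log-probs of P_SE
    if exe_logps.numel() == 0:
        return None, None                       # no executable candidate: skip
    loss_st = -exe_logps.max()                  # Eq. (4): delta on y*
    loss_topk = -torch.logsumexp(exe_logps, 0)  # Eq. (5): marginal over P_SE
    return loss_st, loss_topk
```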

Figure 2a (partitioned program space):

|                         | Seen Programs | Unseen Programs |
| Executable Programs     | P_SE (contains y*) | P_UE       |
| Non-Executable Programs | P_SN          | P_UN            |

Figure 2b (five objectives to approximate MML):

    L_ST(x, θ) = − log p(y*|x, θ)
    L_top-k(x, θ) = − log Σ_{y∈P_SE} p(y|x, θ)
    L_repulsion(x, θ) = − log (1 − Σ_{y∈P_SN} p(y|x, θ))
    L_gentle(x, θ) = − p(P_SE ∪ P_SN) log Σ_{y∈P_SE} p(y|x, θ) − p(P_UE ∪ P_UN) log Σ_{y∈P_UE ∪ P_UN} p(y|x, θ)
    L_sparse(x, θ) = − Σ_{y∈P_SE} q_sparse(y) log p(y|x, θ)

Figure 2: In (a) the program space is partitioned along two dimensions: executability and observability. The asterisk denotes the most probable executable program y*; P stands for program, and the subscripts S, U, E, and N stand for seen, unseen, executable, and non-executable, respectively. In (b) we show two commonly used objectives (Self-Training and Top-K MML) and the three objectives proposed in this work.

4 Method

We first present a view of MML in Equation (3) from the perspective of posterior regularization. This new perspective helps us derive three alternative approximations of MML: Repulsion MML, Gentle MML, and Sparse MML.

4.1 Posterior Regularization

Posterior regularization (PR) allows injecting linear constraints into the posterior distributions of generative models, and it can be extended to discriminative models (Ganchev et al., 2010). In our case, we try to constrain the parser p(y|x, θ) to only assign probability mass to executable programs. Instead of imposing hard constraints, we softly penalize the parser if it is far away from a desired distribution q(y), which is defined by E_q[R(y)] = 1. Since R is a binary reward function, q(y) is constrained to only place mass on executable programs, whose rewards are 1. We denote the family of all such desired distributions by Q.

Specifically, the objective of PR is to penalize the KL-divergence between Q and p:

    J_Q(θ) = D_KL[Q || p(y|x, θ)] = min_{q∈Q} D_KL[q(y) || p(y|x, θ)]    (6)

By definition, the objective has the following upper bound:

    J(θ, q) = D_KL[q(y) || p(y|x, θ)] = − Σ_y q(y) log p(y|x, θ) − H(q)    (7)

where q ∈ Q and H denotes the entropy. We can use block-coordinate descent, an EM-style iterative algorithm, to optimize it:

    E-step:  q^{t+1} = argmin_{q∈Q} D_KL[q(y) || p(y|x, θ^t)]
    M-step:  θ^{t+1} = argmin_θ − Σ_y q^{t+1}(y) log p(y|x, θ)

During the E-step, we try to find a distribution q from the constrained set Q that is closest to the current parser p in terms of KL-divergence. We then use q as a 'soft label' and minimize the cross-entropy between q and p during the M-step. Note that q is a constant vector and has no gradient wrt. θ during the M-step.

The E-step has a closed-form solution:

    q^{t+1}(y) = p(y|x, θ^t) / p(P_SE ∪ P_UE)   if y ∈ P_SE ∪ P_UE;   0 otherwise    (8)

where p(P_SE ∪ P_UE) = Σ_{y'∈P_SE∪P_UE} p(y'|x, θ^t). q^{t+1}(y) is essentially a re-normalized version of p over executable programs. Interestingly, if we use this solution in the M-step, the gradient wrt. θ is equivalent to the gradient of MML in Equation (3). That is, optimizing PR with the EM algorithm is equivalent to optimizing MML.³ The connection between EM and MML is not new; it has been well-studied for classification problems (Amini and Gallinari, 2002; Grandvalet and Bengio, 2004). In our problem, we additionally introduce PR to accommodate the executability constraint, and instantiate the general EM algorithm.

³ Please see the appendix for the proof and analysis of PR.

Although the E-step has a closed-form solution, computing q is still intractable due to the large search space of executable programs. However, the PR view provides new insight into what it means to approximate MML: in essence, conventional methods can be viewed as computing an approximate solution for q. Specifically, Self-Training corresponds to a delta distribution that only focuses on the most probable y*:

    q_ST^{t+1}(y) = 1   if y = y*;   0 otherwise

Meanwhile, Top-K MML corresponds to a re-normalized distribution over P_SE:

    q_top-k^{t+1}(y) = p(y|x, θ^t) / p(P_SE)   if y ∈ P_SE;   0 otherwise

Most importantly, this perspective leads us to derive three new approximations of MML, which we collectively call X-PR.

4.2 Repulsion MML and Gentle MML

As mentioned previously, Self-Training and Top-K MML should be reasonable approximations in cases where gold programs are retrieved, i.e., when they are in the seen executable subset (P_SE in Figure 2a). However, if a parser is uncertain, i.e., beam search cannot retrieve the gold programs, exclusively exploiting P_SE programs is undesirable. Hence, we consider ways of taking unseen executable programs (P_UE in Figure 2a) into account. Since we never directly observe unseen programs (P_UE or P_UN), our heuristics do not discriminate between unseen executable and non-executable programs (P_UE ∪ P_UN). In other words, upweighting P_UE programs will inevitably upweight P_UN.

Based on the intuition that the correct program is included either in the seen executable programs (P_SE) or in the unseen programs (P_UE and P_UN), we can simply push a parser away from seen non-executable programs (P_SN). Hence, we call this method Repulsion MML. Specifically, the first heuristic approximates Equation (8) as follows:

    q_repulsion^{t+1}(y) = p(y|x, θ^t) / (1 − p(P_SN))   if y ∉ P_SN;   0 otherwise

Another way to view this heuristic is that we distribute the probability mass of seen non-executable programs (P_SN) over all other programs. In contrast, the second heuristic is more 'conservative' about unseen programs, as it tends to trust the seen executable programs (P_SE) more. Specifically, the second heuristic uses the following approximation to solve the E-step:

    q_gentle^{t+1}(y) = (p(P_SE ∪ P_SN) / p(P_SE)) p(y|x, θ^t)   if y ∈ P_SE
    q_gentle^{t+1}(y) = p(y|x, θ^t)                              if y ∈ P_UE ∪ P_UN
    q_gentle^{t+1}(y) = 0                                        if y ∈ P_SN

Intuitively, it shifts the probability mass of seen non-executable programs (P_SN) directly onto seen executable programs (P_SE). Meanwhile, it neither upweights nor downweights unseen programs. We call this heuristic Gentle MML. Compared with Self-Training and Top-K MML, Repulsion MML and Gentle MML lead to better exploration of the program space, as only seen non-executable (P_SN) programs are discouraged.
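Both objectives are straightforward to compute on a beam, as in the sketch below: Repulsion MML needs only the seen non-executable mass, while Gentle MML mixes the seen executable and leftover (unseen) masses with weights that are detached, mirroring the constant soft label q. This is our own schematic rendering of Figure 2b, not the released implementation.

```python
import torch

def repulsion_and_gentle_losses(beam_logps: torch.Tensor, exe: torch.Tensor):
    """beam_logps: log p(y|x, θ) per beam candidate, shape [K];
    exe: True for P_SE, False for P_SN. The mass of unseen programs
    (P_UE ∪ P_UN) is the leftover 1 - p(P_SE ∪ P_SN)."""
    exe = exe.bool()
    if not exe.any():
        return None, None                        # nothing executable was seen
    p = beam_logps.exp()
    p_sn = p[~exe].sum()                         # mass of seen non-executable
    p_seen = p.sum()                             # mass of P_SE ∪ P_SN
    # Repulsion MML (Figure 2b): push mass away from P_SN; all remaining
    # programs, seen or unseen, are renormalized implicitly.
    loss_repulsion = -torch.log1p(-p_sn)
    # Gentle MML (Figure 2b): shift the P_SN mass onto P_SE and leave the
    # unseen mass untouched; the mixture weights act as constants.
    log_p_se = torch.logsumexp(beam_logps[exe], 0)
    w = p_seen.detach()
    loss_gentle = -w * log_p_se - (1 - w) * torch.log1p(-p_seen)
    return loss_repulsion, loss_gentle
```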

4.3 Sparse MML

Sparse MML is based on the intuition that in most cases there is only one or few correct programs among all executable programs. As mentioned in Section 2, spurious programs that are executable but do not reflect the semantics of an utterance are harmful. One piece of empirical evidence from previous work (Min et al., 2019) is that Self-Training outperforms Top-K MML for weakly-supervised question answering. Hence, exploiting all seen executable programs can be sub-optimal. Following recent work on sparse distributions (Martins and Astudillo, 2016; Niculae et al., 2018), we propose to encourage sparsity of the 'soft label' q. Encouraging sparsity is also related to the minimum entropy and low-density separation principles which are commonly used in semi-supervised learning (Grandvalet and Bengio, 2004; Chapelle and Zien, 2005).

To achieve this, we first interpret the entropy term H in Equation (7) as a regularizer of q. It is known that entropy regularization always results in a dense q, i.e., all executable programs are assigned non-zero probability. Inspired by SparseMax (Martins and Astudillo, 2016), we instead use the L2 norm for regularization. Specifically, we replace our PR objective in Equation (7) with the following one:

    J_sparse(θ, q) = − Σ_y q(y) log p(y|x, θ) + (1/2) ||q||₂²

where q ∈ Q. Similarly, it can be optimized by the EM algorithm:

    E-step:  q^{t+1} = SparseMax_Q(log p(y|x, θ^t))
    M-step:  θ^{t+1} = argmin_θ − Σ_y q^{t+1}(y) log p(y|x, θ)

where the E-step can be solved by the SparseMax operator, which denotes the Euclidean projection from the vector of logits log p(y|x, θ^t) onto the simplex Q. Again, we solve the E-step approximately. One such approximation is top-k SparseMax, which constrains the number of non-zeros in q to be at most k; it can be solved by applying a top-k operator followed by SparseMax (Correia et al., 2020). In our case, we use beam search to approximate the top-k operator, and the resulting approximation of the E-step is defined as follows:

    q_sparse^{t+1} = SparseMax_{y∈P_SE}(log p(y|x, θ^t))

Intuitively, q_sparse^{t+1} occupies the middle ground between Self-Training (which uses y* only) and Top-K MML (which uses all P_SE programs). With the sparsity of q introduced by SparseMax, the M-step will only promote a subset of P_SE programs.
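A minimal sketch of the SparseMax projection (Martins and Astudillo, 2016) follows; applied to the beam logits of the seen executable programs, it yields the sparse soft label q_sparse used in the M-step. This is a plain re-implementation of the standard algorithm, not code from the paper.

```python
import torch

def sparsemax(logits: torch.Tensor) -> torch.Tensor:
    """Euclidean projection of a score vector onto the probability simplex;
    many coordinates of the result are exactly zero."""
    z, _ = torch.sort(logits, descending=True)
    cssv = z.cumsum(0)
    k = torch.arange(1, logits.numel() + 1, dtype=logits.dtype)
    in_support = 1 + k * z > cssv           # coordinates kept in the support
    k_max = in_support.nonzero().max() + 1  # support size k(z)
    tau = (cssv[k_max - 1] - 1) / k_max     # threshold τ(z)
    return torch.clamp(logits - tau, min=0.0)

# e.g. q_sparse = sparsemax(beam_logps[exe.bool()]) over the programs in P_SE
```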
Summary  We propose three new approximations of MML for learning from executions. They are designed to complement Self-Training and Top-K MML by discouraging seen non-executable programs and by introducing sparsity. In the following sections, we empirically show that they are superior to Self-Training and Top-K MML for semi-supervised semantic parsing. The approximations we propose may also be beneficial for learning from denotations (Liang et al., 2013; Berant et al., 2013) and weakly supervised question answering (Min et al., 2019), but we leave this to future work.

5 Semantic Parsers

In principle, our X-PR framework is model-agnostic, i.e., it can be coupled with any semantic parser for semi-supervised learning. In this work, we use a neural parser that achieves state-of-the-art performance across semantic parsing tasks; specifically, we use RAT-SQL (Wang et al., 2020), which features a relation-aware encoder and a grammar-based decoder. The parser was originally developed for text-to-SQL parsing, and we adapt it to text-to-LF parsing. In this section, we briefly review the encoder and decoder of this parser. For more details, please refer to Wang et al. (2020).

5.1 Relation-Aware Encoding

Relation-aware encoding was originally designed to handle schema encoding and schema linking for text-to-SQL parsing. We generalize these two notions for both text-to-LF and text-to-SQL parsing as follows:

• environment encoding: encoding environments, i.e., a knowledge base consisting of a set of triples, or a relational database represented by its schema
• environment linking: linking mentions to the intended elements of environments, i.e., mentions of entities and properties of knowledge bases, or mentions of tables and columns of relational databases

Relation-aware attention is introduced to inject discrete relations between environment items, and between the utterance and environments, into the self-attention mechanism of Transformer (Devlin et al., 2019). The details of relation-aware encoding can be found in the appendix.

5.2 Grammar-Based Decoding

Typical sequence-to-sequence models (Dong and Lapata, 2016; Jia and Liang, 2016a) treat programs as sequences, ignoring their internal structure. As a result, the well-formedness of generated programs cannot be guaranteed. Grammar-based decoders aim to remedy this issue. For text-to-LF parsing, we use the type-constrained decoder proposed by Krishnamurthy et al. (2017); for text-to-SQL parsing, we use an AST (abstract syntax tree) based decoder following Yin and Neubig (2018). Note that grammar-based decoding can only ensure the syntactic correctness of generated programs; executable programs are additionally semantically correct. For example, all programs in Figure 1 are well-formed, but the first two programs are semantically incorrect.

| Model         | BASKETBALL | BLOCKS | CALENDAR | HOUSING | PUBLICATIONS | RECIPES | RESTAURANTS | SOCIAL | Avg. | GEO  |
| Lower bound   | 82.2 | 54.1 | 64.9 | 61.4 | 64.3 | 72.7 | 71.7 | 76.7 | 68.5 | 60.6 |
| Self-Training | 84.7 | 52.6 | 67.9 | 59.8 | 68.6 | 80.1 | 71.1 | 77.4 | 70.3 | 64.2 |
| Top-K MML     | 83.1 | 55.3 | 68.5 | 56.9 | 67.7 | 73.7 | 69.9 | 76.2 | 68.9 | 61.3 |
| Repulsion MML | 84.9 | 56.3 | 70.8 | 60.9 | 70.3 | 79.8 | 72.0 | 78.3 | 71.7 | 64.9 |
| Gentle MML    | 84.1 | 58.1 | 70.2 | 63.0 | 71.5 | 78.7 | 72.3 | 76.4 | 71.8 | 65.6 |
| Sparse MML    | 83.9 | 58.6 | 72.6 | 60.3 | 75.2 | 80.6 | 72.6 | 77.8 | 72.7 | 67.4 |
| Upper bound   | 87.7 | 62.9 | 82.1 | 71.4 | 78.9 | 82.4 | 82.8 | 80.8 | 78.6 | 74.2 |

Table 1: Execution accuracy of supervised and semi-supervised models on all domains of OVERNIGHT and GEOQUERY. In semi-supervised learning, 30% of the original training examples are treated as labeled and the remaining 70% as unlabeled. Lower bound refers to supervised models that only use labeled examples and discard unlabeled ones, whereas upper bound refers to supervised models that have access to gold programs of unlabeled examples. Avg. refers to the average accuracy over the eight OVERNIGHT domains. We average runs over three random splits of the original training data for semi-supervised learning.

6 Experiments

To evaluate X-PR, we present experiments on semi-supervised semantic parsing. We also analyze how the objectives affect the training process.

6.1 Semi-Supervised Learning

We simulate the setting of semi-supervised learning on standard text-to-LF and text-to-SQL parsing benchmarks. Specifically, we randomly sample 30% of the original training data as the labeled data, and use the remaining 70% as the unlabeled data. For text-to-LF parsing, we use the OVERNIGHT dataset (Wang et al., 2015a), which has eight different domains, each with a different size ranging between 801 and 4,419 examples; for text-to-SQL parsing, we use GEOQUERY (Zelle and Mooney, 1996), which contains 880 utterance-SQL pairs. The semi-supervised setting is very challenging, as leveraging only 30% of the original training data results in only around 300 labeled examples in four domains of OVERNIGHT and also in GEOQUERY.

Supervised Lower and Upper Bounds  As baselines, we train two supervised models. The first one only uses the labeled data (30% of the original training data) and discards the unlabeled data of the semi-supervised setting. We view this baseline as a lower bound in the sense that any semi-supervised method is expected to surpass it. The second one has extra access to the gold programs of the unlabeled data, which means it uses the full original training data. We view this baseline as an upper bound for semi-supervised learning; we cannot expect to approach it, as the executability signal is much weaker than direct supervision. By comparing the performance of the second baseline (upper bound) with previous methods (Jia and Liang, 2016b; Herzig and Berant, 2017; Su and Yan, 2017), we verify that our semantic parsers are state-of-the-art; please refer to the Appendix for detailed comparisons. Our main experiments aim to show how the proposed objectives can mitigate the gap between the lower- and upper-bound baselines by utilizing the 70% unlabeled data.

Semi-Supervised Training and Tuning  We use stochastic gradient descent to optimize Equation (1). At each training step, we sample two batches, one from the labeled and one from the unlabeled data. In preliminary experiments, we found that it is crucial to pre-train a parser on the supervised data alone; this is not surprising, as all of the objectives for learning from execution rely on beam search, which would only introduce noise with an untrained parser. That is, λ in Equation (1) is set to 0 during the initial updates, and is switched to a normal value afterwards.

We leave out 100 labeled examples for tuning the hyperparameters. The hyperparameters of the semantic parser are only tuned during the development of the supervised baselines, and are fixed for semi-supervised learning. The only hyperparameter we tune in the semi-supervised setting is λ in Equation (1), which controls how much the unsupervised objective influences learning. After tuning, we use all the labeled examples for supervised training and use the last checkpoints for evaluation on the test set.
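The procedure above can be summarized by the sketch below; the parser API, the number of pre-training steps, and the value of λ are all hypothetical placeholders (λ is tuned on held-out labeled examples, and the pre-training length is not specified here).

```python
def train_step(parser, optimizer, labeled_batch, unlabeled_batch, step,
               pretrain_steps=1000, lam=0.3):
    """One update of Equation (1). λ stays 0 while the parser is pre-trained
    on labeled data only, then switches to its tuned value; parser.sup_loss
    and parser.unsup_loss (one of the X-PR objectives) are assumed helpers."""
    loss = parser.sup_loss(labeled_batch)                   # Eq. (2)
    if step >= pretrain_steps:                              # λ switched on
        loss = loss + lam * parser.unsup_loss(unlabeled_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```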

Figure 3: Effect of different learning objectives in terms of (a) average ratios and (b) coverage of gold programs (view in color).

6.2 Main Results

Our experiments evaluate the objectives presented in Figure 2 under a semi-supervised learning setting. Our results are shown in Table 1.

Self-Training and Top-K MML  First, Top-K MML, which exploits more executable programs than Self-Training, does not yield better performance in six domains of OVERNIGHT and in GEOQUERY. This observation is consistent with Min et al. (2019), where Top-K MML underperforms Self-Training for weakly-supervised question answering. Self-Training outperforms the lower bound in five domains of OVERNIGHT, and on average. In contrast, Top-K MML obtains performance similar to the lower bound in terms of average accuracy.

X-PR Objectives  In each domain of OVERNIGHT and in GEOQUERY, the objective that achieves the best performance is always within X-PR. In terms of average accuracy on OVERNIGHT, all our objectives perform better than Self-Training and Top-K MML. Among X-PR, Sparse MML performs best in five domains of OVERNIGHT, leading to a margin of 4.2% over the lower bound in terms of average accuracy. In GEOQUERY, Sparse MML also obtains the best performance.

Based on the same intuition of discouraging seen non-executable programs, Repulsion MML achieves a similar average accuracy to Gentle MML in OVERNIGHT. In contrast, Gentle MML tends to perform better in domains whose parsers are weak (such as HOUSING and BLOCKS), as indicated by their lower bounds. In GEOQUERY, Gentle MML performs slightly better than Repulsion MML. Although it does not always outperform Repulsion MML, it retrieves more accurate programs and also generates longer programs (see the next section for details).

To see how much labeled data would be needed for a supervised model to reach the same accuracy as our semi-supervised models, we conduct experiments using 40% of the original training examples as the labeled data. The supervised model achieves 72.6% on average in OVERNIGHT, implying that 'labeling' 33.3% more examples would yield the same accuracy as our best-performing objective (Sparse MML).

6.3 Analysis

To better understand the effect of the different objectives, we analyze the training process of semi-supervised learning. For the sake of brevity, we focus our analysis on the CALENDAR domain, but we have drawn similar conclusions for the other domains.

Length Ratio  During preliminary experiments, we found that all training objectives tend to favor short executable programs for unlabeled utterances. To quantify this, we define the metric of average ratio as follows:

    ratio = Σ_i Σ_{y∈P_SE(x_i)} |y| / Σ_i |x_i| |P_SE(x_i)|    (9)

where P_SE(x_i) denotes the seen executable programs of x_i, |x| and |y| denote the length of an utterance and of a program, respectively, and |P_SE(x_i)| denotes the number of seen executable programs. Intuitively, the average ratio reveals the range of programs, in terms of length, that an objective is exploiting. This metric is computed in an online manner, where x_i ranges over the data fed to the training process.
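The following sketch computes this statistic over a stream of training examples; each item is assumed to pair the utterance tokens with the executable programs retrieved by beam search at that step.

```python
def average_ratio(stream):
    """Online estimate of Equation (9); `stream` yields pairs of
    (utterance_tokens, seen_executable_programs)."""
    total_len, norm = 0, 0
    for utterance, seen_exe in stream:
        total_len += sum(len(prog) for prog in seen_exe)   # Σ_y |y|
        norm += len(utterance) * len(seen_exe)             # |x_i| |P_SE(x_i)|
    return total_len / norm if norm else 0.0
```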

As shown in Figure 3a, Top-K MML favors shorter programs, especially during the initial steps. In contrast, Repulsion MML and Gentle MML prefer longer programs. For reference, we can compute the gold ratio by assuming that P_SE(x_i) contains only the gold program. The gold ratio for CALENDAR is 2.01, indicating that all objectives still prefer programs that are shorter than the gold programs. However, by not directly exploiting seen executable programs, Repulsion MML and Gentle MML alleviate this issue compared with Top-K MML.

Coverage  Next, we analyze how much an objective can help a parser retrieve gold programs for the unlabeled data. Since the original data contains the gold programs of the unlabeled data, we utilize them to define the metric of coverage as follows:

    coverage = Σ_i I[ŷ_i ∈ P_SE(x_i)] / |{x_i}|    (10)

where I is an indicator function and ŷ_i denotes the gold program of an utterance x_i. Intuitively, this metric measures how often a gold program is captured in P_SE.
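A sketch of this metric is shown below; since it is used only for analysis, the gold programs are assumed to be available alongside the beam output.

```python
def coverage(examples):
    """Sketch of Equation (10): the fraction of utterances whose gold
    program appears among the seen executable programs P_SE."""
    hits = sum(1 for gold, seen_exe in examples if gold in seen_exe)
    return hits / len(examples) if examples else 0.0
```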

PSE. As shown in Figure 3b, Self-Training, which Jianfeng Gao, Po-Sen Huang, and Yejin Choi. 2018. only exploits one program at a time, is relatively Discourse-aware neural rewards for coherent text generation. In Proceedings of the 2018 Conference weak in terms of retrieving more gold programs. of the North American Chapter of the Association In contrast, Repulsion MML retrieves more gold for Computational Linguistics: Human Language programs than the others. Technologies, Volume 1 (Long Papers), pages 173– As mentioned in Section 4.3, SparseMax can be 184, New Orleans, Louisiana. Association for Com- putational Linguistics. viewed as an interpolation between Self-Training and Top-K MML. This is also reflected in both Olivier Chapelle and Alexander Zien. 2005. Semi- metrics: Sparse MML always occupies the middle- supervised classification by low density separation. In AISTATS, volume 2005, pages 57–64. Citeseer. ground performance between ST and Top-K MML. Interestingly, although Sparse MML is not best in Xinyun Chen, Chang Liu, and Dawn Song. 2019. terms of both diagnostic metrics, it still achieves Execution-guided neural program synthesis. In 7th International Conference on Learning Representa- the best accuracy in this domain. tions, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. 7 Conclusion Gonçalo M. Correia, Vlad Niculae, Wilker Aziz, and In this work, we propose to learn a semi- André F. T. Martins. 2020. Efficient marginalization supervised semantic parser from the weak yet of discrete and structured latent variables via spar- freely available executability signals. Due to the sity. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Informa- large search space of executable programs, conven- tion Processing Systems 2020, NeurIPS 2020, De- tional approximations of MML training, i.e, Self- cember 6-12, 2020, virtual. Training and Top-K MML, are often sub-optimal. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and We propose a set of alternative objectives, namely Kristina Toutanova. 2019. BERT: Pre-training of X-PR, through the lens of posterior regularization. deep bidirectional transformers for language under- Empirical results on semi-supervised learning show standing. In Proceedings of the 2019 Conference that X-PR can help a parser achieve substantially of the North American Chapter of the Association for Computational Linguistics: Human Language better performance than conventional methods, fur- Technologies, Volume 1 (Long and Short Papers), ther bridging the gap between semi-supervised pages 4171–4186, Minneapolis, Minnesota. Associ- learning and supervised learning. In the future, we ation for Computational Linguistics. Li Dong and Mirella Lapata. 2016. Language to logi- Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gard- cal form with neural attention. In Proceedings of the ner. 2017. Neural semantic parsing with type con- 54th Annual Meeting of the Association for Compu- straints for semi-structured tables. In Proceedings tational Linguistics (Volume 1: Long Papers), pages of the 2017 Conference on Empirical Methods in 33–43, Berlin, Germany. Association for Computa- Natural Language Processing, pages 1516–1526, tional Linguistics. Copenhagen, Denmark. Association for Computa- tional Linguistics. Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for Percy Liang, Michael I. Jordan, and Dan Klein. 2013. structured latent variable models. 
The Journal of Learning dependency-based compositional seman- Machine Learning Research, 11:2001–2049. tics. Computational Linguistics, 39(2):389–446.

Yves Grandvalet and Yoshua Bengio. 2004. Semi- André F. T. Martins and Ramón Fernandez Astudillo. supervised learning by entropy minimization. In 2016. From softmax to sparsemax: A sparse model Advances in Neural Information Processing Systems of attention and multi-label classification. In Pro- 17 [Neural Information Processing Systems, NIPS ceedings of the 33nd International Conference on 2004, December 13-18, 2004, Vancouver, British Machine Learning, ICML 2016, New York City, Columbia, Canada], pages 529–536. NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1614– Kelvin Guu, Panupong Pasupat, Evan Liu, and Percy 1623. JMLR.org. Liang. 2017. From language to programs: Bridg- ing reinforcement learning and maximum marginal Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and likelihood. In Proceedings of the 55th Annual Meet- Luke Zettlemoyer. 2019. A discrete hard EM ap- ing of the Association for Computational Linguistics proach for weakly supervised question answering. (Volume 1: Long Papers), pages 1051–1062, Van- In Proceedings of the 2019 Conference on Empirical couver, Canada. Association for Computational Lin- Methods in Natural Language Processing and the guistics. 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 2851– Jonathan Herzig and Jonathan Berant. 2017. Neural Se- 2864, Hong Kong, China. Association for Computa- mantic Parsing over Multiple Knowledge-bases. In tional Linguistics. Proceedings of the 55th Annual Meeting of the As- sociation for Computational Linguistics (Volume 2: Vlad Niculae, André F. T. Martins, Mathieu Blondel, Short Papers), volume 2, pages 623–628, Strouds- and Claire Cardie. 2018. Sparsemap: Differentiable burg, PA, USA. sparse structured inference. In Proceedings of the 35th International Conference on Machine Learning, Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant ICML 2018, Stockholmsmässan, Stockholm, Sweden, Krishnamurthy, and Luke Zettlemoyer. 2017. Learn- July 10-15, 2018, volume 80 of Proceedings of Ma- ing a neural semantic parser from user feedback. In chine Learning Research, pages 3796–3805. PMLR. Proceedings of the 55th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Yu Su and Xifeng Yan. 2017. Cross-domain seman- Long Papers), pages 963–973, Vancouver, Canada. tic parsing via paraphrasing. In Proceedings of the Association for Computational Linguistics. 2017 Conference on Empirical Methods in Natu- ral Language Processing, pages 1235–1246, Copen- Robin Jia and Percy Liang. 2016a. Data recombination hagen, Denmark. for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Compu- Richard S Sutton, David A McAllester, Satinder P tational Linguistics (Volume 1: Long Papers), pages Singh, Yishay Mansour, et al. 1999. Policy gradient 12–22, Berlin, Germany. Association for Computa- methods for reinforcement learning with function ap- tional Linguistics. proximation. In NIPs, volume 99, pages 1057–1063. Citeseer. Robin Jia and Percy Liang. 2016b. Data Recombina- tion for Neural Semantic Parsing. In Proceedings Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr of the 54th Annual Meeting of the Association for Polozov, and Matthew Richardson. 2020. RAT-SQL: Computational Linguistics (Volume 1: Long Papers), Relation-aware schema encoding and linking for pages 12–22, Stroudsburg, PA, USA. text-to-SQL parsers. 
In Proceedings of the 58th An- nual Meeting of the Association for Computational Tomáš Kociský,ˇ Gábor Melis, Edward Grefenstette, Linguistics, pages 7567–7578, Online. Association Chris Dyer, Wang Ling, Phil Blunsom, and for Computational Linguistics. Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. In Pro- Bailin Wang, Ivan Titov, and Mirella Lapata. 2019. ceedings of the 2016 Conference on Empirical Meth- Learning semantic parsers from denotations with la- ods in Natural Language Processing, pages 1078– tent structured alignments and abstract programs. In 1087, Austin, Texas. Association for Computational Proceedings of the 2019 Conference on Empirical Linguistics. Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- A Method guage Processing (EMNLP-IJCNLP), pages 3774– 3785, Hong Kong, China. Association for Computa- MML vs. RL Another choice for learning from tional Linguistics. executions is to maximize the expected reward:

Chenglong Wang, Kedar Tatwawadi, Marc JRL(x, θ) = Ep(y|x,θ)[R(y)] (11) Brockschmidt, Po-Sen Huang, Yi Mao, Olek- sandr Polozov, and Rishabh Singh. 2018. Robust Recall that our MML objective is defined as: text-to- generation with execution-guided decoding. arXiv preprint arXiv:1807.03100. X JMML(x, θ) = log R(y)p(y|x, θ) (12) Yushi Wang, Jonathan Berant, and Percy Liang. 2015a. y Building a semantic parser overnight. In Proceed- ings of the 53rd Annual Meeting of the Association As shown in previous work (Guu et al., 2017), the for Computational Linguistics and the 7th Interna- gradients (wrt. θ) of RL and MML have the same tional Joint Conference on Natural Language Pro- form: cessing (Volume 1: Long Papers), pages 1332–1342, Beijing, China. Association for Computational Lin- guistics. X ∇θJ (x, θ) = q(y)∇θ log p(y|x, θ) (13) Yushi Wang, Jonathan Berant, and Percy Liang. 2015b. y Building a Semantic Parser Overnight. In Proceed- ings of the 53rd Annual Meeting of the Association where q can be viewed as a soft-label that is in- for Computational Linguistics and the 7th Interna- duced from p. For RL, qRL(y) = pθ(y|x, θ)R(y); tional Joint Conference on Natural Language Pro- R(y)pθ(y|x) for MML, qMML(y) = P R(y0)p (y0|x) = cessing (Volume 1: Long Papers), volume 1, pages y0 θ 1332–1342, Stroudsburg, PA, USA. pθ(y|x, R(y) = 1). Compared with RL, MML additionally renormalize the p over executable pro- Sam Wiseman and Alexander M. Rush. 2016. grams to obtain q. As a result, MML outputs a Sequence-to-sequence learning as beam-search op- timization. In Proceedings of the 2016 Conference ‘stronger’ gradient than RL due to the renormaliza- on Empirical Methods in Natural Language Process- tion. In our initial experiments, we found that RL ing, pages 1296–1306, Austin, Texas. Association is hard to converge whereas MML is not. for Computational Linguistics. Proof of the E-Step Solution Denote the set of Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie- executable programs as V. Since q(y) is 0 for non- Yan Liu. 2018. A study of reinforcement learning executable programs, we only need to compute it for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Nat- for executable programs in V. We need to solve the ural Language Processing, pages 3612–3621, Brus- following optimization problem: sels, Belgium. Association for Computational Lin- guistics. X X min − q(y) log p(y|x, θ) + q(y) log q(y) q Pengcheng Yin and Graham Neubig. 2018. TRANX: A y∈V y∈V transition-based neural abstract syntax parser for se- X mantic parsing and code generation. In Proceedings s.t. q(y) = 1 of the 2018 Conference on Empirical Methods in y∈V Natural Language Processing: System Demonstra- tions, pages 7–12, Brussels, Belgium. Association q(y) ≥ 0 for Computational Linguistics. (14)

Pengcheng Yin, Chunting Zhou, Junxian He, and Gra- By solving it with KKT conditions, we can see ham Neubig. 2018. StructVAE: Tree-structured la- that q(y) ∝ p(y|x, θ). Since q(y) needs to sum up tent variable models for semi-supervised semantic p(y|x,θ) to 1, it is easy to obtain that q(y) = P . parsing. In Proceedings of the 56th Annual Meet- y∈V p(y|x,θ) ing of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 754–765, Mel- B Experiments bourne, Australia. Association for Computational Linguistics. Relation-Aware Encoding of Semantic Parsers Relation-aware encoding was originally introduced John M Zelle and Raymond J Mooney. 1996. Learn- ing to parse database queries using inductive logic for text-to-SQL parsing in Wang et al.(2020) to ac- programming. In Proceedings of the national con- commodate discrete relations among schema items ference on artificial intelligence, pages 1050–1055. (e.g., tables and columns) and linking between an Type of x Type of y Edge label Description
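For completeness, the stationarity step of the KKT argument can be spelled out as follows (a standard Lagrangian computation written in our notation; the nonnegativity constraints are inactive at the optimum):

```latex
% Lagrangian of problem (14) with multiplier \lambda for \sum_y q(y) = 1:
\mathcal{L}(q,\lambda) = -\sum_{y \in V} q(y)\log p(y|x,\theta)
  + \sum_{y \in V} q(y)\log q(y) + \lambda\Big(\sum_{y \in V} q(y) - 1\Big)
% Setting \partial\mathcal{L}/\partial q(y) = 0:
-\log p(y|x,\theta) + \log q(y) + 1 + \lambda = 0
\quad\Longrightarrow\quad
q(y) = p(y|x,\theta)\, e^{-(1+\lambda)} \propto p(y|x,\theta),
% and the sum-to-one constraint fixes the normalizer:
q(y) = \frac{p(y|x,\theta)}{\sum_{y' \in V} p(y'|x,\theta)}.
```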

B Experiments

Relation-Aware Encoding of Semantic Parsers  Relation-aware encoding was originally introduced for text-to-SQL parsing in Wang et al. (2020) to accommodate discrete relations among schema items (e.g., tables and columns) and linking between an utterance and schema items. Let {x_i^(0)}_{i=0}^{N−1} denote the input to the parser, consisting of the NL tokens and the (linearized version of the) environment; relation-aware encoding changes the multi-head self-attention (with H heads and hidden size d_x) as follows:

    e_ij^(n,h) = x_i^(n) W_Q^(n,h) (x_j^(n) W_K^(n,h) + r_ij^K)ᵀ / sqrt(d_z/H)
    α_ij^(n,h) = softmax_j(e_ij^(n,h))    (15)
    z_i^(n,h) = Σ_{j=1}^{n} α_ij^(n,h) (x_j^(n) W_V^(n,h) + r_ij^V)

where α_ij^(n,h) denotes the attention weights of head h at layer n, 0 ≤ h < H, 0 ≤ n < N, and W_Q^(h), W_K^(h), W_V^(h) ∈ R^{d_x×(d_x/H)}. Most importantly, r_ij^K and r_ij^V are key and value embeddings of the discrete relation r_ij between items i and j. They are incorporated to bias attention towards discrete relations.

In this work, we re-use the relations from Wang et al. (2020) for our text-to-SQL parsing task on the GEOQUERY dataset. For text-to-LF parsing on the OVERNIGHT dataset, we elaborate on how we define discrete relations. The input of text-to-LF parsing is an utterance and a fixed knowledge base K, which is represented as a set of triples of the form (entity1, property, entity2). To feed the input into a transformer, we first linearize it into an ordered sequence which contains the list of utterance tokens, followed by a sequence of entities and properties. To model the relations among the items of this sequence, we define relations between each pair of items, denoted by (x, y), in Table 2.

| Type of x       | Type of y       | Edge label        | Description                                  |
| Entity          | Entity          | RELATED-F         | there exists a property p s.t. (x, p, y) ∈ K |
|                 |                 | RELATED-R         | there exists a property p s.t. (y, p, x) ∈ K |
| Entity          | Property        | HAS-PROPERTY-F    | there exists an entity e s.t. (x, y, e) ∈ K  |
|                 |                 | HAS-PROPERTY-R    | there exists an entity e s.t. (e, y, x) ∈ K  |
| Property        | Entity          | PROP-TO-ENT-F     | there exists an entity e s.t. (y, x, e) ∈ K  |
|                 |                 | PROP-TO-ENT-R     | there exists an entity e s.t. (e, x, y) ∈ K  |
| Utterance Token | Entity          | EXACT-MATCH       | x and y are the same word                    |
|                 |                 | PARTIAL-MATCH     | token x is contained in entity y             |
| Entity          | Utterance Token | EXACT-MATCH-R     | y and x are the same word                    |
|                 |                 | PARTIAL-MATCH-R   | token y is contained in entity x             |
| Utterance Token | Property        | P-EXACT-MATCH     | x and y are the same word                    |
|                 |                 | P-PARTIAL-MATCH   | token x is contained in property y           |
| Property        | Utterance Token | P-EXACT-MATCH-R   | y and x are the same word                    |
|                 |                 | P-PARTIAL-MATCH-R | token y is contained in property x           |

Table 2: Relation types used for text-to-LF parsing.
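To make Equation (15) concrete, a single-head sketch of relation-aware self-attention is given below; dimensions, the number of relation types, and all names are illustrative choices rather than the RAT-SQL configuration.

```python
import torch
import torch.nn as nn

class RelationAwareAttention(nn.Module):
    """Single-head sketch of Equation (15): discrete relations r_ij enter
    attention through learned key/value embeddings r^K_ij and r^V_ij."""
    def __init__(self, d_x: int = 256, n_relations: int = 32):
        super().__init__()
        self.wq = nn.Linear(d_x, d_x, bias=False)
        self.wk = nn.Linear(d_x, d_x, bias=False)
        self.wv = nn.Linear(d_x, d_x, bias=False)
        self.rel_k = nn.Embedding(n_relations, d_x)   # r^K
        self.rel_v = nn.Embedding(n_relations, d_x)   # r^V

    def forward(self, x: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # x: [n, d_x] item states; rel_ids: [n, n] discrete relation ids
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        rk, rv = self.rel_k(rel_ids), self.rel_v(rel_ids)   # [n, n, d_x]
        # e_ij = q_i · (k_j + r^K_ij) / sqrt(d_x)
        scores = (q.unsqueeze(1) * (k.unsqueeze(0) + rk)).sum(-1)
        alpha = torch.softmax(scores / x.size(-1) ** 0.5, dim=-1)
        # z_i = Σ_j α_ij (v_j + r^V_ij)
        return (alpha.unsqueeze(-1) * (v.unsqueeze(0) + rv)).sum(dim=1)
```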

Statistics of Datasets  The numbers of examples in each domain are shown in Table 3. Four domains of OVERNIGHT, as well as GEOQUERY, contain only around 1,000 examples.

Results of Supervised Models  The results of supervised models (upper bounds) are shown in Table 4. Our parser achieves the best performance among models without cross-domain training. This confirms that we use a strong base parser for our experiments on semi-supervised learning.

Beam Size  We tried beam sizes in {4, 8, 16, 32, 64} and finally picked 16, which performs best on the OVERNIGHT dataset. We also tried a larger beam size of 128 during preliminary experiments; however, the model is extremely slow to train and does not outperform the one with beam size 16.

Semi-Supervised Learning with 10% Data  We also investigate a semi-supervised setting where only 10% of the original training data is treated as labeled. This is more challenging, as four domains of OVERNIGHT, as well as GEOQUERY, only have around 100 labeled utterance-program pairs in this setting. The results are shown in Table 5. In terms of average accuracy, Sparse MML achieves the best performance. The margin of improvement over the lower bound is relatively smaller than when using 30% of the data (3% vs 4.2%). As all objectives rely on beam search, a weak base parser, as indicated by the lower bound, is probably harder to improve. By taking unseen programs into account, Repulsion MML and Gentle MML tend to be more useful in domains such as HOUSING, PUBLICATIONS, and RECIPES, where the base parsers are very weak (accuracy around 40%). Moreover, Gentle MML performs best in GEOQUERY.

|     | BASKETBALL | BLOCKS | CALENDAR | HOUSING | PUBLICATIONS | RECIPES | RESTAURANTS | SOCIAL | GEO |
| all | 1952       | 1995   | 837      | 941     | 801          | 1080    | 1657        | 4419   | 880 |

Table 3: Data sizes of the OVERNIGHT and GEOQUERY datasets.

| Model                     | BASKETBALL | BLOCKS | CALENDAR | HOUSING | PUBLICATIONS | RECIPES | RESTAURANTS | SOCIAL | Avg. |
| Wang et al. (2015b)       | 46.3 | 41.9 | 74.4 | 54.5 | 59.0 | 70.8 | 75.9 | 48.2 | 58.8 |
| Jia and Liang (2016b)     | 85.2 | 58.1 | 78.0 | 71.4 | 76.4 | 79.6 | 76.2 | 81.4 | 75.8 |
| Herzig and Berant (2017)  | 85.2 | 61.2 | 77.4 | 67.7 | 74.5 | 79.2 | 79.5 | 80.2 | 75.6 |
| Su and Yan (2017)         | 86.2 | 60.2 | 79.8 | 71.4 | 78.9 | 84.7 | 81.6 | 82.9 | 78.2 |
| Ours                      | 87.7 | 62.9 | 82.1 | 71.4 | 78.9 | 82.4 | 82.8 | 80.8 | 78.6 |
| Herzig and Berant (2017)* | 86.2 | 62.7 | 82.1 | 78.3 | 80.7 | 82.9 | 82.2 | 81.7 | 79.6 |
| Su and Yan (2017)*        | 88.2 | 62.7 | 82.7 | 78.8 | 80.7 | 86.1 | 83.7 | 83.1 | 80.8 |

Table 4: Test accuracy of supervised models on all domains of OVERNIGHT. Models marked with * are augmented with cross-domain training.

| Model         | BASKETBALL | BLOCKS | CALENDAR | HOUSING | PUBLICATIONS | RECIPES | RESTAURANTS | SOCIAL | Avg. | GEO  |
| Lower bound   | 68.0 | 35.3 | 40.5 | 34.4 | 39.1 | 45.8 | 59.3 | 63.7 | 48.3 | 42.7 |
| Self-Training | 66.5 | 37.8 | 48.2 | 39.2 | 41.0 | 41.2 | 63.0 | 63.8 | 50.1 | 44.1 |
| Top-K MML     | 70.3 | 36.8 | 42.3 | 30.7 | 37.3 | 44.4 | 61.1 | 62.2 | 48.1 | 40.1 |
| Repulsion MML | 70.8 | 37.3 | 44.6 | 37.8 | 39.8 | 44.4 | 60.2 | 67.8 | 50.3 | 44.8 |
| Gentle MML    | 68.3 | 38.1 | 45.8 | 39.2 | 43.5 | 44.9 | 59.9 | 64.0 | 50.5 | 46.6 |
| Sparse MML    | 73.4 | 42.1 | 46.4 | 38.1 | 42.2 | 43.1 | 63.0 | 64.9 | 51.3 | 45.2 |
| Upper bound   | 87.7 | 62.9 | 82.1 | 71.4 | 78.9 | 82.4 | 82.8 | 80.8 | 78.6 | 74.2 |

Table 5: Execution accuracy of supervised and semi-supervised models on all domains of OVERNIGHT and GEOQUERY. In semi-supervised learning, 10% of the original training examples are treated as labeled and the remaining 90% as unlabeled. Lower bound refers to supervised models that only use labeled examples and discard unlabeled ones, whereas upper bound refers to supervised models that have access to gold programs of unlabeled examples. Avg. refers to the average accuracy over the eight OVERNIGHT domains.