Learning from Executions for Semantic Parsing

Bailin Wang, Mirella Lapata and Ivan Titov
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
[email protected], {mlap, ititov}@inf.ed.ac.uk

Abstract

Semantic parsing aims at translating natural language (NL) utterances onto machine-interpretable programs, which can be executed against a real-world environment. The expensive annotation of utterance-program pairs has long been acknowledged as a major bottleneck for the deployment of contemporary neural models to real-life applications. In this work, we focus on the task of semi-supervised learning where a limited amount of annotated data is available together with many unlabeled NL utterances. Based on the observation that programs which correspond to NL utterances must always be executable, we propose to encourage a parser to generate executable programs for unlabeled utterances. Due to the large search space of executable programs, conventional methods that use approximations based on beam search, such as self-training and top-k marginal likelihood training, do not perform as well. Instead, we view the problem of learning from executions from the perspective of posterior regularization and propose a set of new training objectives. Experimental results on OVERNIGHT and GEOQUERY show that our new objectives outperform conventional methods, bridging the gap between semi-supervised and supervised learning.

Utterance: list all 3 star rated thai restaurants

| Program Candidates                                         | Gold | Exe |
| select restaurant where star_rating = thai                 |  ✗   |  ✗  |
| select restaurant where cuisine > 3                        |  ✗   |  ✗  |
| select restaurant where star_rating = 3                    |  ✗   |  ✓  |
| select restaurant where star_rating = 3 and cuisine = thai |  ✓   |  ✓  |

Figure 1: Candidate programs for an utterance can be classified by executability (Exe); note that the gold program is always in the set of executable programs. We propose to utilize the weak yet freely available signal of executability for learning.

1 Introduction

Semantic parsing is the task of mapping natural language (NL) utterances to meaning representations (aka programs) that can be executed against a real-world environment such as a knowledge base or a relational database. While neural sequence-to-sequence models (Dong and Lapata, 2016; Jia and Liang, 2016a) have achieved much success in this task in recent years, they usually require a large amount of labeled data (i.e., utterance-program pairs) for training. However, annotating utterances with programs is expensive as it requires expert knowledge of meaning representations (e.g., logical forms, SQL) and the environment against which they are executed (e.g., a knowledge base, a relational database). An alternative to annotation is to collect answers (or denotations) of programs, rather than programs themselves (Liang et al., 2013; Berant et al., 2013). In this work, we focus on the more extreme setting where there are no annotations available for a large number of utterances. This setting resembles a common real-life scenario where massive numbers of user utterances can be collected when deploying a semantic parser (Iyer et al., 2017). Effectively utilizing the unlabeled data makes it possible for a semantic parser to improve over time without human involvement.

Our key observation is that not all candidate programs for an utterance will be semantically valid. This implies that only some candidate programs can be executed and obtain non-empty execution results.¹ As illustrated in Figure 1, executability is a weak signal that can differentiate between semantically valid and invalid programs. On unlabeled utterances, we can encourage a parser to focus only on executable programs, ignoring non-executable ones. Moreover, the executability of a program can be obtained from an executor for free, without requiring human effort. Executability has previously been used to guide the decoding of a semantic parser (Wang et al., 2018). We take a step further and directly use this weak signal for learning from unlabeled utterances.

¹ In the rest of this paper, we extend the meaning of 'executability' and use it to refer to the case where a program is executable and obtains non-empty results.

To learn from executability, we resort to marginal likelihood training, i.e., maximizing the marginal likelihood of all executable programs for an unlabeled NL utterance. However, the space of all possible programs is exponentially large, as is the space of executable ones. Hence, simply marginalizing over all executable programs is intractable. Typical approximations use beam search to retrieve a handful of ('seen') programs, which are used to approximate the entire space. Such approximations can lead to optimization getting trapped in undesirable local minima. For example, we observe that encouraging a model to exploit seen executable programs hinders exploration and reinforces the preference for shorter programs, as discussed in Section 6.3. This happens because shorter programs are both more likely to be among 'seen' programs (probably due to locally-normalized autoregressive modeling) and more likely to be executable.
To alleviate these issues, we derive three new alternative objectives, relying on a new interpretation of marginal likelihood training from the perspective of posterior regularization. Our proposed objectives encode two kinds of inductive biases: 1) discouraging seen non-executable programs, which plays a similar role to encouraging seen executable ones but does not share its drawback of hindering exploration; 2) encouraging sparsity among executable programs, which encourages a parser to focus on only a subset of executable programs by softly injecting a sparsity constraint. This is desirable, as there are only one or few correct programs for each utterance (see Figure 1), and an accurate parser should assign probability mass only to this subset. We collectively call these objectives X-PR, as a shorthand for Execution-guided Posterior Regularization.

We conduct experiments on two representative semantic parsing tasks: text-to-LF (logical form) parsing over a knowledge base and text-to-SQL (Zelle and Mooney, 1996) parsing over a relational database. Concretely, we evaluate our methods on the OVERNIGHT (Wang et al., 2015a) and GEOQUERY datasets. We simulate the semi-supervised learning setting by treating 70% of the training data as unlabeled. Empirical results show that our method can substantially boost the performance of a parser, trained only on labeled data, by utilizing a large amount of unlabeled data.

Our contributions are summarized as follows:

• We show how to exploit unlabeled utterances by taking advantage of their executability.
• To better learn from executability, we propose a set of new objectives based on posterior regularization.
• Our method can help a base parser achieve substantially better performance by utilizing unlabeled data.

Our code, datasets, and splits are publicly available at https://github.com/berlino/tensor2struct-public.

2 Related Work

Semi-Supervised Semantic Parsing  In the context of semantic parsing, semi-supervised models using limited amounts of parallel data and large amounts of unlabeled data treat either utterances or programs as discrete latent variables and induce them in the framework of generative models (Kočiský et al., 2016; Yin et al., 2018). A challenge with these methods is that (combinatorially) complex discrete variables make optimization very hard, even with the help of variational inference. In this work, we instead seek to directly constrain the discriminative parser with signals obtained from executions. Our method can potentially be integrated into these generative models to regularize discrete variables.

(Underspecified) Sequence-Level Rewards  There have been attempts in recent years to integrate sequence-level rewards into sequence-to-sequence training as a way of accommodating task-specific objectives. For example, BLEU can be optimized for coherent text generation (Bosselut et al., 2018) and machine translation (Wu et al., 2018) via reinforcement learning or beam search (Wiseman and Rush, 2016). In this work, we resort to marginal likelihood training to exploit binary executability rewards for semantic parsing (i.e., whether a program is executable or not), which has been shown to be more effective than REINFORCE (Guu et al., 2017).

More importantly, our binary reward is underspecified, i.e., there exist many spurious programs that enjoy the same reward as the gold program. This issue of learning from underspecified rewards underlies many weakly-supervised tasks, e.g., learning from denotations (Liang et al., 2013; Berant et al., 2013) and weakly supervised question answering (Min et al., 2019). Previous work tried to model latent alignments (Wang et al., 2019) between NL and programs to alleviate this issue. In this work, we take an orthogonal direction and propose several training objectives that alleviate the impact of spurious programs.

Execution for Semantic Parsing  Execution has been utilized in semantic parsing (Wang et al., 2018) and in the related area of program synthesis (Chen et al., 2019). These approaches exploit the execution of partial programs to guide the search for plausible complete programs. Although partial execution is feasible for SQL-style programs, it cannot be trivially extended to general meaning representations (e.g., logical forms). In this work, we explore a more general setting where execution results can only be obtained for complete programs.

3 Executability as Learning Signal

In this section, we formally define our semi-supervised learning setting and show how to incorporate executability into the training objective whilst relying on the marginal likelihood training framework. We also present two conventional approaches to optimizing marginal likelihood.

3.1 Problem Definition

Given a set of labeled NL-program pairs {(x^l_i, y^l_i)}_{i=1}^{N} and a set of unlabeled NL utterances {x_j}_{j=1}^{M}, where N and M denote the sizes of the respective datasets, we would like to learn a neural parser p(y|x, θ), parameterized by θ, that maps utterances to programs. The objective to minimize consists of two parts:

    J = (1/N) Σ_{i=1}^{N} L_sup(x^l_i, y^l_i) + λ (1/M) Σ_{j=1}^{M} L_unsup(x_j)    (1)

where L_sup and L_unsup denote the supervised and unsupervised loss, respectively. For labeled data, we use the negative log-likelihood of gold programs; for unlabeled data, we instead use the log marginal likelihood (MML) of all executable programs. Specifically, they are defined as follows:

    L_sup(x, y) = − log p(y|x, θ)    (2)
    L_unsup(x) = − log Σ_y R(y) p(y|x, θ)    (3)

where R(y) is a binary reward function that returns 1 if y is executable and 0 otherwise. In practice, this function is implemented by running a task-specific executor, e.g., a SQL executor.
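As an illustration, the sketch below shows one way R(y) could be realized for text-to-SQL parsing under the paper's extended notion of executability (footnote 1): the program must run without error and return a non-empty result. The sqlite3-based executor and the function name are our own illustrative assumptions, not the released implementation.

```python
import sqlite3

def reward(sql: str, db_path: str) -> int:
    """Hypothetical binary reward R(y) for a candidate SQL program y:
    returns 1 iff the query executes and yields a non-empty result."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return 0               # ill-formed or semantically invalid program
    return 1 if rows else 0    # empty denotations also get zero reward
```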

Another alternative for the unsupervised loss is REINFORCE (Sutton et al., 1999), i.e., maximizing the expected R(y) with respect to p(y|x, θ). However, as presented in Guu et al. (2017), this objective usually underperforms MML, which is consistent with our initial experiments.²

² We review the comparison between REINFORCE and MML in the appendix.

3.2 Self-Training and Top-K MML

MML in Equation (3) requires marginalizing over all executable programs, which is intractable. Conventionally, we resort to beam search to explore the space of programs and collect executable ones. To illustrate, we can divide the space of programs into four parts based on whether they are executable and observed, as shown in Figure 2a. For example, programs in P_SE ∪ P_SN are seen in the sense that they are retrieved by beam search. Programs in P_SE ∪ P_UE are all executable, though only programs in P_SE can be directly observed.

Two common approximations of Equation (3) are Self-Training (ST) and Top-K MML, defined as follows:

    L_ST(x, θ) = − log p(y*|x, θ)    (4)
    L_top-k(x, θ) = − log Σ_{y∈P_SE} p(y|x, θ)    (5)

where y* denotes the most probable executable program, approximated by the most probable one from beam search.

It is obvious that both methods only exploit programs in P_SE, i.e., executable programs retrieved by beam search. In cases where a parser successfully includes the correct programs in P_SE, both approximations should work reasonably well. However, if a parser is uncertain and P_SE does not contain the gold program, it will mistakenly exploit incorrect programs during learning, which is problematic.

A naive solution for improving Self-Training or Top-K MML is to explore a larger space, e.g., to increase the beam size so as to retrieve more executable programs. However, this would inevitably increase the computational cost of learning. We also show in the appendix that increasing the beam size, once it exceeds a certain threshold, is no longer beneficial for learning. In this work, we instead propose better approximations without increasing the beam size.
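To make Equations (4) and (5) concrete, the following sketch computes both losses from a beam of candidates, assuming beam scores are log-probabilities under the parser and executability flags come from an executor such as the one sketched in Section 3.1. This is a schematic rendering of the two approximations, not the authors' code.

```python
import torch

def st_and_topk_losses(beam_logps: torch.Tensor, exe: torch.Tensor):
    """beam_logps: log p(y|x, θ) per beam candidate, shape [K];
    exe: binary executability flags R(y), shape [K].
    Returns (L_ST, L_topk) computed over the seen executable set P_SE."""
    exe_logps = beam_logps[exe.bool()]          # log-probs of P_SE
    if exe_logps.numel() == 0:
        return None, None                       # no executable candidate: skip
    loss_st = -exe_logps.max()                  # Eq. (4): delta on y*
    loss_topk = -torch.logsumexp(exe_logps, 0)  # Eq. (5): marginal over P_SE
    return loss_st, loss_topk
```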

Figure 2a (partitioned program space):

|                         | Seen Programs | Unseen Programs |
| Executable Programs     | P_SE (contains y*) | P_UE       |
| Non-Executable Programs | P_SN          | P_UN            |

Figure 2b (five objectives to approximate MML):

    L_ST(x, θ) = − log p(y*|x, θ)
    L_top-k(x, θ) = − log Σ_{y∈P_SE} p(y|x, θ)
    L_repulsion(x, θ) = − log (1 − Σ_{y∈P_SN} p(y|x, θ))
    L_gentle(x, θ) = − p(P_SE ∪ P_SN) log Σ_{y∈P_SE} p(y|x, θ) − p(P_UE ∪ P_UN) log Σ_{y∈P_UE ∪ P_UN} p(y|x, θ)
    L_sparse(x, θ) = − Σ_{y∈P_SE} q_sparse(y) log p(y|x, θ)

Figure 2: In (a) the program space is partitioned along two dimensions: executability and observability. The asterisk denotes the most probable executable program y*; P stands for program, and the subscripts S, U, E, and N stand for seen, unseen, executable, and non-executable, respectively. In (b) we show two commonly used objectives (Self-Training and Top-K MML) and the three objectives proposed in this work.

4 Method

We first present a view of MML in Equation (3) from the perspective of posterior regularization. This new perspective helps us derive three alternative approximations of MML: Repulsion MML, Gentle MML, and Sparse MML.

4.1 Posterior Regularization

Posterior regularization (PR) allows injecting linear constraints into the posterior distributions of generative models, and it can be extended to discriminative models (Ganchev et al., 2010). In our case, we try to constrain the parser p(y|x, θ) to only assign probability mass to executable programs. Instead of imposing hard constraints, we softly penalize the parser if it is far away from a desired distribution q(y), which is defined by E_q[R(y)] = 1. Since R is a binary reward function, q(y) is constrained to only place mass on executable programs, whose rewards are 1. We denote the family of all such desired distributions by Q.

Specifically, the objective of PR is to penalize the KL-divergence between Q and p:

    J_Q(θ) = D_KL[Q || p(y|x, θ)] = min_{q∈Q} D_KL[q(y) || p(y|x, θ)]    (6)

By definition, the objective has the following upper bound:

    J(θ, q) = D_KL[q(y) || p(y|x, θ)] = − Σ_y q(y) log p(y|x, θ) − H(q)    (7)

where q ∈ Q and H denotes the entropy. We can use block-coordinate descent, an EM-style iterative algorithm, to optimize it:

    E-step:  q^{t+1} = argmin_{q∈Q} D_KL[q(y) || p(y|x, θ^t)]
    M-step:  θ^{t+1} = argmin_θ − Σ_y q^{t+1}(y) log p(y|x, θ)

During the E-step, we try to find a distribution q from the constrained set Q that is closest to the current parser p in terms of KL-divergence. We then use q as a 'soft label' and minimize the cross-entropy between q and p during the M-step. Note that q is a constant vector and has no gradient wrt. θ during the M-step.

The E-step has a closed-form solution:

    q^{t+1}(y) = p(y|x, θ^t) / p(P_SE ∪ P_UE)   if y ∈ P_SE ∪ P_UE;   0 otherwise    (8)

where p(P_SE ∪ P_UE) = Σ_{y'∈P_SE∪P_UE} p(y'|x, θ^t). q^{t+1}(y) is essentially a re-normalized version of p over executable programs. Interestingly, if we use this solution in the M-step, the gradient wrt. θ is equivalent to the gradient of MML in Equation (3). That is, optimizing PR with the EM algorithm is equivalent to optimizing MML.³ The connection between EM and MML is not new; it has been well-studied for classification problems (Amini and Gallinari, 2002; Grandvalet and Bengio, 2004). In our problem, we additionally introduce PR to accommodate the executability constraint, and instantiate the general EM algorithm.

³ Please see the appendix for the proof and analysis of PR.

Although the E-step has a closed-form solution, computing q is still intractable due to the large search space of executable programs. However, the PR view provides new insight into what it means to approximate MML: in essence, conventional methods can be viewed as computing an approximate solution for q. Specifically, Self-Training corresponds to a delta distribution that only focuses on the most probable y*:

    q_ST^{t+1}(y) = 1   if y = y*;   0 otherwise

Meanwhile, Top-K MML corresponds to a re-normalized distribution over P_SE:

    q_top-k^{t+1}(y) = p(y|x, θ^t) / p(P_SE)   if y ∈ P_SE;   0 otherwise

Most importantly, this perspective leads us to derive three new approximations of MML, which we collectively call X-PR.

4.2 Repulsion MML and Gentle MML

As mentioned previously, Self-Training and Top-K MML should be reasonable approximations in cases where gold programs are retrieved, i.e., when they are in the seen executable subset (P_SE in Figure 2a). However, if a parser is uncertain, i.e., beam search cannot retrieve the gold programs, exclusively exploiting P_SE programs is undesirable. Hence, we consider ways of taking unseen executable programs (P_UE in Figure 2a) into account. Since we never directly observe unseen programs (P_UE or P_UN), our heuristics do not discriminate between unseen executable and non-executable programs (P_UE ∪ P_UN). In other words, upweighting P_UE programs will inevitably upweight P_UN.

Based on the intuition that the correct program is included either in the seen executable programs (P_SE) or in the unseen programs (P_UE and P_UN), we can simply push a parser away from seen non-executable programs (P_SN). Hence, we call this method Repulsion MML. Specifically, the first heuristic approximates Equation (8) as follows:

    q_repulsion^{t+1}(y) = p(y|x, θ^t) / (1 − p(P_SN))   if y ∉ P_SN;   0 otherwise

Another way to view this heuristic is that we distribute the probability mass of seen non-executable programs (P_SN) over all other programs. In contrast, the second heuristic is more 'conservative' about unseen programs, as it tends to trust the seen executable programs (P_SE) more. Specifically, the second heuristic uses the following approximation to solve the E-step:

    q_gentle^{t+1}(y) = (p(P_SE ∪ P_SN) / p(P_SE)) p(y|x, θ^t)   if y ∈ P_SE
    q_gentle^{t+1}(y) = p(y|x, θ^t)                              if y ∈ P_UE ∪ P_UN
    q_gentle^{t+1}(y) = 0                                        if y ∈ P_SN

Intuitively, it shifts the probability mass of seen non-executable programs (P_SN) directly onto seen executable programs (P_SE). Meanwhile, it neither upweights nor downweights unseen programs. We call this heuristic Gentle MML. Compared with Self-Training and Top-K MML, Repulsion MML and Gentle MML lead to better exploration of the program space, as only seen non-executable (P_SN) programs are discouraged.
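Both objectives are straightforward to compute on a beam, as in the sketch below: Repulsion MML needs only the seen non-executable mass, while Gentle MML mixes the seen executable and leftover (unseen) masses with weights that are detached, mirroring the constant soft label q. This is our own schematic rendering of Figure 2b, not the released implementation.

```python
import torch

def repulsion_and_gentle_losses(beam_logps: torch.Tensor, exe: torch.Tensor):
    """beam_logps: log p(y|x, θ) per beam candidate, shape [K];
    exe: True for P_SE, False for P_SN. The mass of unseen programs
    (P_UE ∪ P_UN) is the leftover 1 - p(P_SE ∪ P_SN)."""
    exe = exe.bool()
    if not exe.any():
        return None, None                        # nothing executable was seen
    p = beam_logps.exp()
    p_sn = p[~exe].sum()                         # mass of seen non-executable
    p_seen = p.sum()                             # mass of P_SE ∪ P_SN
    # Repulsion MML (Figure 2b): push mass away from P_SN; all remaining
    # programs, seen or unseen, are renormalized implicitly.
    loss_repulsion = -torch.log1p(-p_sn)
    # Gentle MML (Figure 2b): shift the P_SN mass onto P_SE and leave the
    # unseen mass untouched; the mixture weights act as constants.
    log_p_se = torch.logsumexp(beam_logps[exe], 0)
    w = p_seen.detach()
    loss_gentle = -w * log_p_se - (1 - w) * torch.log1p(-p_seen)
    return loss_repulsion, loss_gentle
```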

4.3 Sparse MML

Sparse MML is based on the intuition that in most cases there is only one or few correct programs among all executable programs. As mentioned in Section 2, spurious programs that are executable but do not reflect the semantics of an utterance are harmful. One piece of empirical evidence from previous work (Min et al., 2019) is that Self-Training outperforms Top-K MML for weakly-supervised question answering. Hence, exploiting all seen executable programs can be sub-optimal. Following recent work on sparse distributions (Martins and Astudillo, 2016; Niculae et al., 2018), we propose to encourage sparsity of the 'soft label' q. Encouraging sparsity is also related to the minimum entropy and low-density separation principles which are commonly used in semi-supervised learning (Grandvalet and Bengio, 2004; Chapelle and Zien, 2005).

To achieve this, we first interpret the entropy term H in Equation (7) as a regularizer of q. It is known that entropy regularization always results in a dense q, i.e., all executable programs are assigned non-zero probability. Inspired by SparseMax (Martins and Astudillo, 2016), we instead use the L2 norm for regularization. Specifically, we replace our PR objective in Equation (7) with the following one:

    J_sparse(θ, q) = − Σ_y q(y) log p(y|x, θ) + (1/2) ||q||₂²

where q ∈ Q. Similarly, it can be optimized by the EM algorithm:

    E-step:  q^{t+1} = SparseMax_Q(log p(y|x, θ^t))
    M-step:  θ^{t+1} = argmin_θ − Σ_y q^{t+1}(y) log p(y|x, θ)

where the E-step can be solved by the SparseMax operator, which denotes the Euclidean projection from the vector of logits log p(y|x, θ^t) onto the simplex Q. Again, we solve the E-step approximately. One such approximation is top-k SparseMax, which constrains the number of non-zeros in q to be at most k; it can be solved by applying a top-k operator followed by SparseMax (Correia et al., 2020). In our case, we use beam search to approximate the top-k operator, and the resulting approximation of the E-step is defined as follows:

    q_sparse^{t+1} = SparseMax_{y∈P_SE}(log p(y|x, θ^t))

Intuitively, q_sparse^{t+1} occupies the middle ground between Self-Training (which uses y* only) and Top-K MML (which uses all P_SE programs). With the sparsity of q introduced by SparseMax, the M-step will only promote a subset of P_SE programs.
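A minimal sketch of the SparseMax projection (Martins and Astudillo, 2016) follows; applied to the beam logits of the seen executable programs, it yields the sparse soft label q_sparse used in the M-step. This is a plain re-implementation of the standard algorithm, not code from the paper.

```python
import torch

def sparsemax(logits: torch.Tensor) -> torch.Tensor:
    """Euclidean projection of a score vector onto the probability simplex;
    many coordinates of the result are exactly zero."""
    z, _ = torch.sort(logits, descending=True)
    cssv = z.cumsum(0)
    k = torch.arange(1, logits.numel() + 1, dtype=logits.dtype)
    in_support = 1 + k * z > cssv           # coordinates kept in the support
    k_max = in_support.nonzero().max() + 1  # support size k(z)
    tau = (cssv[k_max - 1] - 1) / k_max     # threshold τ(z)
    return torch.clamp(logits - tau, min=0.0)

# e.g. q_sparse = sparsemax(beam_logps[exe.bool()]) over the programs in P_SE
```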
Summary  We propose three new approximations of MML for learning from executions. They are designed to complement Self-Training and Top-K MML by discouraging seen non-executable programs and by introducing sparsity. In the following sections, we empirically show that they are superior to Self-Training and Top-K MML for semi-supervised semantic parsing. The approximations we propose may also be beneficial for learning from denotations (Liang et al., 2013; Berant et al., 2013) and weakly supervised question answering (Min et al., 2019), but we leave this to future work.

5 Semantic Parsers

In principle, our X-PR framework is model-agnostic, i.e., it can be coupled with any semantic parser for semi-supervised learning. In this work, we use a neural parser that achieves state-of-the-art performance across semantic parsing tasks; specifically, we use RAT-SQL (Wang et al., 2020), which features a relation-aware encoder and a grammar-based decoder. The parser was originally developed for text-to-SQL parsing, and we adapt it to text-to-LF parsing. In this section, we briefly review the encoder and decoder of this parser. For more details, please refer to Wang et al. (2020).

5.1 Relation-Aware Encoding

Relation-aware encoding was originally designed to handle schema encoding and schema linking for text-to-SQL parsing. We generalize these two notions for both text-to-LF and text-to-SQL parsing as follows:

• environment encoding: encoding environments, i.e., a knowledge base consisting of a set of triples, or a relational database represented by its schema
• environment linking: linking mentions to the intended elements of environments, i.e., mentions of entities and properties of knowledge bases, or mentions of tables and columns of relational databases

Relation-aware attention is introduced to inject discrete relations between environment items, and between the utterance and environments, into the self-attention mechanism of Transformer (Devlin et al., 2019). The details of relation-aware encoding can be found in the appendix.

5.2 Grammar-Based Decoding

Typical sequence-to-sequence models (Dong and Lapata, 2016; Jia and Liang, 2016a) treat programs as sequences, ignoring their internal structure. As a result, the well-formedness of generated programs cannot be guaranteed. Grammar-based decoders aim to remedy this issue. For text-to-LF parsing, we use the type-constrained decoder proposed by Krishnamurthy et al. (2017); for text-to-SQL parsing, we use an AST (abstract syntax tree) based decoder following Yin and Neubig (2018). Note that grammar-based decoding can only ensure the syntactic correctness of generated programs; executable programs are additionally semantically correct. For example, all programs in Figure 1 are well-formed, but the first two programs are semantically incorrect.

| Model         | BASKETBALL | BLOCKS | CALENDAR | HOUSING | PUBLICATIONS | RECIPES | RESTAURANTS | SOCIAL | Avg. | GEO  |
| Lower bound   | 82.2 | 54.1 | 64.9 | 61.4 | 64.3 | 72.7 | 71.7 | 76.7 | 68.5 | 60.6 |
| Self-Training | 84.7 | 52.6 | 67.9 | 59.8 | 68.6 | 80.1 | 71.1 | 77.4 | 70.3 | 64.2 |
| Top-K MML     | 83.1 | 55.3 | 68.5 | 56.9 | 67.7 | 73.7 | 69.9 | 76.2 | 68.9 | 61.3 |
| Repulsion MML | 84.9 | 56.3 | 70.8 | 60.9 | 70.3 | 79.8 | 72.0 | 78.3 | 71.7 | 64.9 |
| Gentle MML    | 84.1 | 58.1 | 70.2 | 63.0 | 71.5 | 78.7 | 72.3 | 76.4 | 71.8 | 65.6 |
| Sparse MML    | 83.9 | 58.6 | 72.6 | 60.3 | 75.2 | 80.6 | 72.6 | 77.8 | 72.7 | 67.4 |
| Upper bound   | 87.7 | 62.9 | 82.1 | 71.4 | 78.9 | 82.4 | 82.8 | 80.8 | 78.6 | 74.2 |

Table 1: Execution accuracy of supervised and semi-supervised models on all domains of OVERNIGHT and GEOQUERY. In semi-supervised learning, 30% of the original training examples are treated as labeled and the remaining 70% as unlabeled. Lower bound refers to supervised models that only use labeled examples and discard unlabeled ones, whereas upper bound refers to supervised models that have access to gold programs of unlabeled examples. Avg. refers to the average accuracy over the eight OVERNIGHT domains. We average runs over three random splits of the original training data for semi-supervised learning.

6 Experiments

To evaluate X-PR, we present experiments on semi-supervised semantic parsing. We also analyze how the objectives affect the training process.

6.1 Semi-Supervised Learning

We simulate the setting of semi-supervised learning on standard text-to-LF and text-to-SQL parsing benchmarks. Specifically, we randomly sample 30% of the original training data as the labeled data, and use the remaining 70% as the unlabeled data. For text-to-LF parsing, we use the OVERNIGHT dataset (Wang et al., 2015a), which has eight different domains, each with a different size ranging between 801 and 4,419 examples; for text-to-SQL parsing, we use GEOQUERY (Zelle and Mooney, 1996), which contains 880 utterance-SQL pairs. The semi-supervised setting is very challenging, as leveraging only 30% of the original training data results in only around 300 labeled examples in four domains of OVERNIGHT and also in GEOQUERY.

Supervised Lower and Upper Bounds  As baselines, we train two supervised models. The first one only uses the labeled data (30% of the original training data) and discards the unlabeled data of the semi-supervised setting. We view this baseline as a lower bound in the sense that any semi-supervised method is expected to surpass it. The second one has extra access to the gold programs of the unlabeled data, which means it uses the full original training data. We view this baseline as an upper bound for semi-supervised learning; we cannot expect to approach it, as the executability signal is much weaker than direct supervision. By comparing the performance of the second baseline (upper bound) with previous methods (Jia and Liang, 2016b; Herzig and Berant, 2017; Su and Yan, 2017), we verify that our semantic parsers are state-of-the-art; please refer to the Appendix for detailed comparisons. Our main experiments aim to show how the proposed objectives can mitigate the gap between the lower- and upper-bound baselines by utilizing the 70% unlabeled data.

Semi-Supervised Training and Tuning  We use stochastic gradient descent to optimize Equation (1). At each training step, we sample two batches, one from the labeled and one from the unlabeled data. In preliminary experiments, we found that it is crucial to pre-train a parser on the supervised data alone; this is not surprising, as all of the objectives for learning from execution rely on beam search, which would only introduce noise with an untrained parser. That is, λ in Equation (1) is set to 0 during the initial updates, and is switched to a normal value afterwards.

We leave out 100 labeled examples for tuning the hyperparameters. The hyperparameters of the semantic parser are only tuned during the development of the supervised baselines, and are fixed for semi-supervised learning. The only hyperparameter we tune in the semi-supervised setting is λ in Equation (1), which controls how much the unsupervised objective influences learning. After tuning, we use all the labeled examples for supervised training and use the last checkpoints for evaluation on the test set.
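The procedure above can be summarized by the sketch below; the parser API, the number of pre-training steps, and the value of λ are all hypothetical placeholders (λ is tuned on held-out labeled examples, and the pre-training length is not specified here).

```python
def train_step(parser, optimizer, labeled_batch, unlabeled_batch, step,
               pretrain_steps=1000, lam=0.3):
    """One update of Equation (1). λ stays 0 while the parser is pre-trained
    on labeled data only, then switches to its tuned value; parser.sup_loss
    and parser.unsup_loss (one of the X-PR objectives) are assumed helpers."""
    loss = parser.sup_loss(labeled_batch)                   # Eq. (2)
    if step >= pretrain_steps:                              # λ switched on
        loss = loss + lam * parser.unsup_loss(unlabeled_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```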

Figure 3: Effect of different learning objectives in terms of (a) average ratios and (b) coverage of gold programs (view in color).

6.2 Main Results

Our experiments evaluate the objectives presented in Figure 2 under a semi-supervised learning setting. Our results are shown in Table 1.

Self-Training and Top-K MML  First, Top-K MML, which exploits more executable programs than Self-Training, does not yield better performance in six domains of OVERNIGHT and in GEOQUERY. This observation is consistent with Min et al. (2019), where Top-K MML underperforms Self-Training for weakly-supervised question answering. Self-Training outperforms the lower bound in five domains of OVERNIGHT, and on average. In contrast, Top-K MML obtains performance similar to the lower bound in terms of average accuracy.

X-PR Objectives  In each domain of OVERNIGHT and in GEOQUERY, the objective that achieves the best performance is always within X-PR. In terms of average accuracy on OVERNIGHT, all our objectives perform better than Self-Training and Top-K MML. Among X-PR, Sparse MML performs best in five domains of OVERNIGHT, leading to a margin of 4.2% over the lower bound in terms of average accuracy. In GEOQUERY, Sparse MML also obtains the best performance.

Based on the same intuition of discouraging seen non-executable programs, Repulsion MML achieves a similar average accuracy to Gentle MML in OVERNIGHT. In contrast, Gentle MML tends to perform better in domains whose parsers are weak (such as HOUSING and BLOCKS), as indicated by their lower bounds. In GEOQUERY, Gentle MML performs slightly better than Repulsion MML. Although it does not always outperform Repulsion MML, it retrieves more accurate programs and also generates longer programs (see the next section for details).

To see how much labeled data would be needed for a supervised model to reach the same accuracy as our semi-supervised models, we conduct experiments using 40% of the original training examples as the labeled data. The supervised model achieves 72.6% on average in OVERNIGHT, implying that 'labeling' 33.3% more examples would yield the same accuracy as our best-performing objective (Sparse MML).

6.3 Analysis

To better understand the effect of the different objectives, we analyze the training process of semi-supervised learning. For the sake of brevity, we focus our analysis on the CALENDAR domain, but we have drawn similar conclusions for the other domains.

Length Ratio  During preliminary experiments, we found that all training objectives tend to favor short executable programs for unlabeled utterances. To quantify this, we define the metric of average ratio as follows:

    ratio = Σ_i Σ_{y∈P_SE(x_i)} |y| / Σ_i |x_i| |P_SE(x_i)|    (9)

where P_SE(x_i) denotes the seen executable programs of x_i, |x| and |y| denote the length of an utterance and of a program, respectively, and |P_SE(x_i)| denotes the number of seen executable programs. Intuitively, the average ratio reveals the range of programs, in terms of length, that an objective is exploiting. This metric is computed in an online manner, where x_i ranges over the data fed to the training process.
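The following sketch computes this statistic over a stream of training examples; each item is assumed to pair the utterance tokens with the executable programs retrieved by beam search at that step.

```python
def average_ratio(stream):
    """Online estimate of Equation (9); `stream` yields pairs of
    (utterance_tokens, seen_executable_programs)."""
    total_len, norm = 0, 0
    for utterance, seen_exe in stream:
        total_len += sum(len(prog) for prog in seen_exe)   # Σ_y |y|
        norm += len(utterance) * len(seen_exe)             # |x_i| |P_SE(x_i)|
    return total_len / norm if norm else 0.0
```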

As shown in Figure 3a, Top-K MML favors shorter programs, especially during the initial steps. In contrast, Repulsion MML and Gentle MML prefer longer programs. For reference, we can compute the gold ratio by assuming that P_SE(x_i) contains only the gold program. The gold ratio for CALENDAR is 2.01, indicating that all objectives still prefer programs that are shorter than the gold programs. However, by not directly exploiting seen executable programs, Repulsion MML and Gentle MML alleviate this issue compared with Top-K MML.

Coverage  Next, we analyze how much an objective can help a parser retrieve gold programs for the unlabeled data. Since the original data contains the gold programs of the unlabeled data, we utilize them to define the metric of coverage as follows:

    coverage = Σ_i I[ŷ_i ∈ P_SE(x_i)] / |{x_i}|    (10)

where I is an indicator function and ŷ_i denotes the gold program of an utterance x_i. Intuitively, this metric measures how often a gold program is captured in P_SE.
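A sketch of this metric is shown below; since it is used only for analysis, the gold programs are assumed to be available alongside the beam output.

```python
def coverage(examples):
    """Sketch of Equation (10): the fraction of utterances whose gold
    program appears among the seen executable programs P_SE."""
    hits = sum(1 for gold, seen_exe in examples if gold in seen_exe)
    return hits / len(examples) if examples else 0.0
```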

PSE. As shown in Figure 3b, Self-Training, which Jianfeng Gao, Po-Sen Huang, and Yejin Choi. 2018. only exploits one program at a time, is relatively Discourse-aware neural rewards for coherent text generation. In Proceedings of the 2018 Conference weak in terms of retrieving more gold programs. of the North American Chapter of the Association In contrast, Repulsion MML retrieves more gold for Computational Linguistics: Human Language programs than the others. Technologies, Volume 1 (Long Papers), pages 173– As mentioned in Section 4.3, SparseMax can be 184, New Orleans, Louisiana. Association for Com- putational Linguistics. viewed as an interpolation between Self-Training and Top-K MML. This is also reflected in both Olivier Chapelle and Alexander Zien. 2005. Semi- metrics: Sparse MML always occupies the middle- supervised classification by low density separation. In AISTATS, volume 2005, pages 57–64. Citeseer. ground performance between ST and Top-K MML. Interestingly, although Sparse MML is not best in Xinyun Chen, Chang Liu, and Dawn Song. 2019. terms of both diagnostic metrics, it still achieves Execution-guided neural program synthesis. In 7th International Conference on Learning Representa- the best accuracy in this domain. tions, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. 7 Conclusion Gonçalo M. Correia, Vlad Niculae, Wilker Aziz, and In this work, we propose to learn a semi- André F. T. Martins. 2020. Efficient marginalization supervised semantic parser from the weak yet of discrete and structured latent variables via spar- freely available executability signals. Due to the sity. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Informa- large search space of executable programs, conven- tion Processing Systems 2020, NeurIPS 2020, De- tional approximations of MML training, i.e, Self- cember 6-12, 2020, virtual. Training and Top-K MML, are often sub-optimal. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and We propose a set of alternative objectives, namely Kristina Toutanova. 2019. BERT: Pre-training of X-PR, through the lens of posterior regularization. deep bidirectional transformers for language under- Empirical results on semi-supervised learning show standing. In Proceedings of the 2019 Conference that X-PR can help a parser achieve substantially of the North American Chapter of the Association for Computational Linguistics: Human Language better performance than conventional methods, fur- Technologies, Volume 1 (Long and Short Papers), ther bridging the gap between semi-supervised pages 4171–4186, Minneapolis, Minnesota. Associ- learning and supervised learning. In the future, we ation for Computational Linguistics. Li Dong and Mirella Lapata. 2016. Language to logi- Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gard- cal form with neural attention. In Proceedings of the ner. 2017. Neural semantic parsing with type con- 54th Annual Meeting of the Association for Compu- straints for semi-structured tables. In Proceedings tational Linguistics (Volume 1: Long Papers), pages of the 2017 Conference on Empirical Methods in 33–43, Berlin, Germany. Association for Computa- Natural Language Processing, pages 1516–1526, tional Linguistics. Copenhagen, Denmark. Association for Computa- tional Linguistics. Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for Percy Liang, Michael I. Jordan, and Dan Klein. 2013. structured latent variable models. 
The Journal of Learning dependency-based compositional seman- Machine Learning Research, 11:2001–2049. tics. Computational Linguistics, 39(2):389–446.

Yves Grandvalet and Yoshua Bengio. 2004. Semi- André F. T. Martins and Ramón Fernandez Astudillo. supervised learning by entropy minimization. In 2016. From softmax to sparsemax: A sparse model Advances in Neural Information Processing Systems of attention and multi-label classification. In Pro- 17 [Neural Information Processing Systems, NIPS ceedings of the 33nd International Conference on 2004, December 13-18, 2004, Vancouver, British Machine Learning, ICML 2016, New York City, Columbia, Canada], pages 529–536. NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1614– Kelvin Guu, Panupong Pasupat, Evan Liu, and Percy 1623. JMLR.org. Liang. 2017. From language to programs: Bridg- ing reinforcement learning and maximum marginal Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and likelihood. In Proceedings of the 55th Annual Meet- Luke Zettlemoyer. 2019. A discrete hard EM ap- ing of the Association for Computational Linguistics proach for weakly supervised question answering. (Volume 1: Long Papers), pages 1051–1062, Van- In Proceedings of the 2019 Conference on Empirical couver, Canada. Association for Computational Lin- Methods in Natural Language Processing and the guistics. 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 2851– Jonathan Herzig and Jonathan Berant. 2017. Neural Se- 2864, Hong Kong, China. Association for Computa- mantic Parsing over Multiple Knowledge-bases. In tional Linguistics. Proceedings of the 55th Annual Meeting of the As- sociation for Computational Linguistics (Volume 2: Vlad Niculae, André F. T. Martins, Mathieu Blondel, Short Papers), volume 2, pages 623–628, Strouds- and Claire Cardie. 2018. Sparsemap: Differentiable burg, PA, USA. sparse structured inference. In Proceedings of the 35th International Conference on Machine Learning, Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant ICML 2018, Stockholmsmässan, Stockholm, Sweden, Krishnamurthy, and Luke Zettlemoyer. 2017. Learn- July 10-15, 2018, volume 80 of Proceedings of Ma- ing a neural semantic parser from user feedback. In chine Learning Research, pages 3796–3805. PMLR. Proceedings of the 55th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Yu Su and Xifeng Yan. 2017. Cross-domain seman- Long Papers), pages 963–973, Vancouver, Canada. tic parsing via paraphrasing. In Proceedings of the Association for Computational Linguistics. 2017 Conference on Empirical Methods in Natu- ral Language Processing, pages 1235–1246, Copen- Robin Jia and Percy Liang. 2016a. Data recombination hagen, Denmark. for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Compu- Richard S Sutton, David A McAllester, Satinder P tational Linguistics (Volume 1: Long Papers), pages Singh, Yishay Mansour, et al. 1999. Policy gradient 12–22, Berlin, Germany. Association for Computa- methods for reinforcement learning with function ap- tional Linguistics. proximation. In NIPs, volume 99, pages 1057–1063. Citeseer. Robin Jia and Percy Liang. 2016b. Data Recombina- tion for Neural Semantic Parsing. In Proceedings Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr of the 54th Annual Meeting of the Association for Polozov, and Matthew Richardson. 2020. RAT-SQL: Computational Linguistics (Volume 1: Long Papers), Relation-aware schema encoding and linking for pages 12–22, Stroudsburg, PA, USA. text-to-SQL parsers. 
In Proceedings of the 58th An- nual Meeting of the Association for Computational Tomáš Kociský,ˇ Gábor Melis, Edward Grefenstette, Linguistics, pages 7567–7578, Online. Association Chris Dyer, Wang Ling, Phil Blunsom, and for Computational Linguistics. Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. In Pro- Bailin Wang, Ivan Titov, and Mirella Lapata. 2019. ceedings of the 2016 Conference on Empirical Meth- Learning semantic parsers from denotations with la- ods in Natural Language Processing, pages 1078– tent structured alignments and abstract programs. In 1087, Austin, Texas. Association for Computational Proceedings of the 2019 Conference on Empirical Linguistics. Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- A Method guage Processing (EMNLP-IJCNLP), pages 3774– 3785, Hong Kong, China. Association for Computa- MML vs. RL Another choice for learning from tional Linguistics. executions is to maximize the expected reward:

Chenglong Wang, Kedar Tatwawadi, Marc JRL(x, θ) = Ep(y|x,θ)[R(y)] (11) Brockschmidt, Po-Sen Huang, Yi Mao, Olek- sandr Polozov, and Rishabh Singh. 2018. Robust Recall that our MML objective is defined as: text-to- generation with execution-guided decoding. arXiv preprint arXiv:1807.03100. X JMML(x, θ) = log R(y)p(y|x, θ) (12) Yushi Wang, Jonathan Berant, and Percy Liang. 2015a. y Building a semantic parser overnight. In Proceed- ings of the 53rd Annual Meeting of the Association As shown in previous work (Guu et al., 2017), the for Computational Linguistics and the 7th Interna- gradients (wrt. θ) of RL and MML have the same tional Joint Conference on Natural Language Pro- form: cessing (Volume 1: Long Papers), pages 1332–1342, Beijing, China. Association for Computational Lin- guistics. X ∇θJ (x, θ) = q(y)∇θ log p(y|x, θ) (13) Yushi Wang, Jonathan Berant, and Percy Liang. 2015b. y Building a Semantic Parser Overnight. In Proceed- ings of the 53rd Annual Meeting of the Association where q can be viewed as a soft-label that is in- for Computational Linguistics and the 7th Interna- duced from p. For RL, qRL(y) = pθ(y|x, θ)R(y); tional Joint Conference on Natural Language Pro- R(y)pθ(y|x) for MML, qMML(y) = P R(y0)p (y0|x) = cessing (Volume 1: Long Papers), volume 1, pages y0 θ 1332–1342, Stroudsburg, PA, USA. pθ(y|x, R(y) = 1). Compared with RL, MML additionally renormalize the p over executable pro- Sam Wiseman and Alexander M. Rush. 2016. grams to obtain q. As a result, MML outputs a Sequence-to-sequence learning as beam-search op- timization. In Proceedings of the 2016 Conference ‘stronger’ gradient than RL due to the renormaliza- on Empirical Methods in Natural Language Process- tion. In our initial experiments, we found that RL ing, pages 1296–1306, Austin, Texas. Association is hard to converge whereas MML is not. for Computational Linguistics. Proof of the E-Step Solution Denote the set of Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie- executable programs as V. Since q(y) is 0 for non- Yan Liu. 2018. A study of reinforcement learning executable programs, we only need to compute it for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Nat- for executable programs in V. We need to solve the ural Language Processing, pages 3612–3621, Brus- following optimization problem: sels, Belgium. Association for Computational Lin- guistics. X X min − q(y) log p(y|x, θ) + q(y) log q(y) q Pengcheng Yin and Graham Neubig. 2018. TRANX: A y∈V y∈V transition-based neural abstract syntax parser for se- X mantic parsing and code generation. In Proceedings s.t. q(y) = 1 of the 2018 Conference on Empirical Methods in y∈V Natural Language Processing: System Demonstra- tions, pages 7–12, Brussels, Belgium. Association q(y) ≥ 0 for Computational Linguistics. (14)

Pengcheng Yin, Chunting Zhou, Junxian He, and Gra- By solving it with KKT conditions, we can see ham Neubig. 2018. StructVAE: Tree-structured la- that q(y) ∝ p(y|x, θ). Since q(y) needs to sum up tent variable models for semi-supervised semantic p(y|x,θ) to 1, it is easy to obtain that q(y) = P . parsing. In Proceedings of the 56th Annual Meet- y∈V p(y|x,θ) ing of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 754–765, Mel- B Experiments bourne, Australia. Association for Computational Linguistics. Relation-Aware Encoding of Semantic Parsers Relation-aware encoding was originally introduced John M Zelle and Raymond J Mooney. 1996. Learn- ing to parse database queries using inductive logic for text-to-SQL parsing in Wang et al.(2020) to ac- programming. In Proceedings of the national con- commodate discrete relations among schema items ference on artificial intelligence, pages 1050–1055. (e.g., tables and columns) and linking between an Type of x Type of y Edge label Description
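For completeness, the stationarity step of the KKT argument can be spelled out as follows (a standard Lagrangian computation written in our notation; the nonnegativity constraints are inactive at the optimum):

```latex
% Lagrangian of problem (14) with multiplier \lambda for \sum_y q(y) = 1:
\mathcal{L}(q,\lambda) = -\sum_{y \in V} q(y)\log p(y|x,\theta)
  + \sum_{y \in V} q(y)\log q(y) + \lambda\Big(\sum_{y \in V} q(y) - 1\Big)
% Setting \partial\mathcal{L}/\partial q(y) = 0:
-\log p(y|x,\theta) + \log q(y) + 1 + \lambda = 0
\quad\Longrightarrow\quad
q(y) = p(y|x,\theta)\, e^{-(1+\lambda)} \propto p(y|x,\theta),
% and the sum-to-one constraint fixes the normalizer:
q(y) = \frac{p(y|x,\theta)}{\sum_{y' \in V} p(y'|x,\theta)}.
```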

B Experiments

Relation-Aware Encoding of Semantic Parsers  Relation-aware encoding was originally introduced for text-to-SQL parsing in Wang et al. (2020) to accommodate discrete relations among schema items (e.g., tables and columns) and linking between an utterance and schema items. Let {x_i^(0)}_{i=0}^{N−1} denote the input to the parser, consisting of the NL tokens and the (linearized version of the) environment; relation-aware encoding changes the multi-head self-attention (with H heads and hidden size d_x) as follows:

    e_ij^(n,h) = x_i^(n) W_Q^(n,h) (x_j^(n) W_K^(n,h) + r_ij^K)ᵀ / sqrt(d_z/H)
    α_ij^(n,h) = softmax_j(e_ij^(n,h))    (15)
    z_i^(n,h) = Σ_{j=1}^{n} α_ij^(n,h) (x_j^(n) W_V^(n,h) + r_ij^V)

where α_ij^(n,h) denotes the attention weights of head h at layer n, 0 ≤ h < H, 0 ≤ n < N, and W_Q^(h), W_K^(h), W_V^(h) ∈ R^{d_x×(d_x/H)}. Most importantly, r_ij^K and r_ij^V are key and value embeddings of the discrete relation r_ij between items i and j. They are incorporated to bias attention towards discrete relations.

In this work, we re-use the relations from Wang et al. (2020) for our text-to-SQL parsing task on the GEOQUERY dataset. For text-to-LF parsing on the OVERNIGHT dataset, we elaborate on how we define discrete relations. The input of text-to-LF parsing is an utterance and a fixed knowledge base K, which is represented as a set of triples of the form (entity1, property, entity2). To feed the input into a transformer, we first linearize it into an ordered sequence which contains the list of utterance tokens, followed by a sequence of entities and properties. To model the relations among the items of this sequence, we define relations between each pair of items, denoted by (x, y), in Table 2.

| Type of x       | Type of y       | Edge label        | Description                                  |
| Entity          | Entity          | RELATED-F         | there exists a property p s.t. (x, p, y) ∈ K |
|                 |                 | RELATED-R         | there exists a property p s.t. (y, p, x) ∈ K |
| Entity          | Property        | HAS-PROPERTY-F    | there exists an entity e s.t. (x, y, e) ∈ K  |
|                 |                 | HAS-PROPERTY-R    | there exists an entity e s.t. (e, y, x) ∈ K  |
| Property        | Entity          | PROP-TO-ENT-F     | there exists an entity e s.t. (y, x, e) ∈ K  |
|                 |                 | PROP-TO-ENT-R     | there exists an entity e s.t. (e, x, y) ∈ K  |
| Utterance Token | Entity          | EXACT-MATCH       | x and y are the same word                    |
|                 |                 | PARTIAL-MATCH     | token x is contained in entity y             |
| Entity          | Utterance Token | EXACT-MATCH-R     | y and x are the same word                    |
|                 |                 | PARTIAL-MATCH-R   | token y is contained in entity x             |
| Utterance Token | Property        | P-EXACT-MATCH     | x and y are the same word                    |
|                 |                 | P-PARTIAL-MATCH   | token x is contained in property y           |
| Property        | Utterance Token | P-EXACT-MATCH-R   | y and x are the same word                    |
|                 |                 | P-PARTIAL-MATCH-R | token y is contained in property x           |

Table 2: Relation types used for text-to-LF parsing.
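To make Equation (15) concrete, a single-head sketch of relation-aware self-attention is given below; dimensions, the number of relation types, and all names are illustrative choices rather than the RAT-SQL configuration.

```python
import torch
import torch.nn as nn

class RelationAwareAttention(nn.Module):
    """Single-head sketch of Equation (15): discrete relations r_ij enter
    attention through learned key/value embeddings r^K_ij and r^V_ij."""
    def __init__(self, d_x: int = 256, n_relations: int = 32):
        super().__init__()
        self.wq = nn.Linear(d_x, d_x, bias=False)
        self.wk = nn.Linear(d_x, d_x, bias=False)
        self.wv = nn.Linear(d_x, d_x, bias=False)
        self.rel_k = nn.Embedding(n_relations, d_x)   # r^K
        self.rel_v = nn.Embedding(n_relations, d_x)   # r^V

    def forward(self, x: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # x: [n, d_x] item states; rel_ids: [n, n] discrete relation ids
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        rk, rv = self.rel_k(rel_ids), self.rel_v(rel_ids)   # [n, n, d_x]
        # e_ij = q_i · (k_j + r^K_ij) / sqrt(d_x)
        scores = (q.unsqueeze(1) * (k.unsqueeze(0) + rk)).sum(-1)
        alpha = torch.softmax(scores / x.size(-1) ** 0.5, dim=-1)
        # z_i = Σ_j α_ij (v_j + r^V_ij)
        return (alpha.unsqueeze(-1) * (v.unsqueeze(0) + rv)).sum(dim=1)
```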

Statistics of Datasets  The numbers of examples in each domain are shown in Table 3. Four domains of OVERNIGHT, as well as GEOQUERY, contain only around 1,000 examples.

Results of Supervised Models  The results of supervised models (upper bounds) are shown in Table 4. Our parser achieves the best performance among models without cross-domain training. This confirms that we use a strong base parser for our experiments on semi-supervised learning.

Beam Size  We tried beam sizes in {4, 8, 16, 32, 64} and finally picked 16, which performs best on the OVERNIGHT dataset. We also tried a larger beam size of 128 during preliminary experiments; however, the model is extremely slow to train and does not outperform the one with beam size 16.

Semi-Supervised Learning with 10% Data  We also investigate a semi-supervised setting where only 10% of the original training data is treated as labeled. This is more challenging, as four domains of OVERNIGHT, as well as GEOQUERY, only have around 100 labeled utterance-program pairs in this setting. The results are shown in Table 5. In terms of average accuracy, Sparse MML achieves the best performance. The margin of improvement over the lower bound is relatively smaller than when using 30% of the data (3% vs 4.2%). As all objectives rely on beam search, a weak base parser, as indicated by the lower bound, is probably harder to improve. By taking unseen programs into account, Repulsion MML and Gentle MML tend to be more useful in domains such as HOUSING, PUBLICATIONS, and RECIPES, where the base parsers are very weak (accuracy around 40%). Moreover, Gentle MML performs best in GEOQUERY.

|     | BASKETBALL | BLOCKS | CALENDAR | HOUSING | PUBLICATIONS | RECIPES | RESTAURANTS | SOCIAL | GEO |
| all | 1952       | 1995   | 837      | 941     | 801          | 1080    | 1657        | 4419   | 880 |

Table 3: Data sizes of the OVERNIGHT and GEOQUERY datasets.

| Model                     | BASKETBALL | BLOCKS | CALENDAR | HOUSING | PUBLICATIONS | RECIPES | RESTAURANTS | SOCIAL | Avg. |
| Wang et al. (2015b)       | 46.3 | 41.9 | 74.4 | 54.5 | 59.0 | 70.8 | 75.9 | 48.2 | 58.8 |
| Jia and Liang (2016b)     | 85.2 | 58.1 | 78.0 | 71.4 | 76.4 | 79.6 | 76.2 | 81.4 | 75.8 |
| Herzig and Berant (2017)  | 85.2 | 61.2 | 77.4 | 67.7 | 74.5 | 79.2 | 79.5 | 80.2 | 75.6 |
| Su and Yan (2017)         | 86.2 | 60.2 | 79.8 | 71.4 | 78.9 | 84.7 | 81.6 | 82.9 | 78.2 |
| Ours                      | 87.7 | 62.9 | 82.1 | 71.4 | 78.9 | 82.4 | 82.8 | 80.8 | 78.6 |
| Herzig and Berant (2017)* | 86.2 | 62.7 | 82.1 | 78.3 | 80.7 | 82.9 | 82.2 | 81.7 | 79.6 |
| Su and Yan (2017)*        | 88.2 | 62.7 | 82.7 | 78.8 | 80.7 | 86.1 | 83.7 | 83.1 | 80.8 |

Table 4: Test accuracy of supervised models on all domains of OVERNIGHT. Models marked with * are augmented with cross-domain training.

| Model         | BASKETBALL | BLOCKS | CALENDAR | HOUSING | PUBLICATIONS | RECIPES | RESTAURANTS | SOCIAL | Avg. | GEO  |
| Lower bound   | 68.0 | 35.3 | 40.5 | 34.4 | 39.1 | 45.8 | 59.3 | 63.7 | 48.3 | 42.7 |
| Self-Training | 66.5 | 37.8 | 48.2 | 39.2 | 41.0 | 41.2 | 63.0 | 63.8 | 50.1 | 44.1 |
| Top-K MML     | 70.3 | 36.8 | 42.3 | 30.7 | 37.3 | 44.4 | 61.1 | 62.2 | 48.1 | 40.1 |
| Repulsion MML | 70.8 | 37.3 | 44.6 | 37.8 | 39.8 | 44.4 | 60.2 | 67.8 | 50.3 | 44.8 |
| Gentle MML    | 68.3 | 38.1 | 45.8 | 39.2 | 43.5 | 44.9 | 59.9 | 64.0 | 50.5 | 46.6 |
| Sparse MML    | 73.4 | 42.1 | 46.4 | 38.1 | 42.2 | 43.1 | 63.0 | 64.9 | 51.3 | 45.2 |
| Upper bound   | 87.7 | 62.9 | 82.1 | 71.4 | 78.9 | 82.4 | 82.8 | 80.8 | 78.6 | 74.2 |

Table 5: Execution accuracy of supervised and semi-supervised models on all domains of OVERNIGHT and GEOQUERY. In semi-supervised learning, 10% of the original training examples are treated as labeled and the remaining 90% as unlabeled. Lower bound refers to supervised models that only use labeled examples and discard unlabeled ones, whereas upper bound refers to supervised models that have access to gold programs of unlabeled examples. Avg. refers to the average accuracy over the eight OVERNIGHT domains.