Learning to Learn Semantic Parsers from Natural Language Supervision

Igor Labutov∗                 Bishan Yang∗                  Tom Mitchell
LAER AI, Inc.                 LAER AI, Inc.                 Carnegie Mellon University
New York, NY                  New York, NY                  Pittsburgh, PA
[email protected]             [email protected]             [email protected]

∗ Work done while at Carnegie Mellon University.

Abstract

As humans, we often rely on language to learn language. For example, when corrected in a conversation, we may learn from that correction, over time improving our language fluency. Inspired by this observation, we propose a learning algorithm for training semantic parsers from supervision (feedback) expressed in natural language. Our algorithm learns a semantic parser from users' corrections such as "no, what I really meant was before his job, not after", by also simultaneously learning to parse this natural language feedback in order to leverage it as a form of supervision. Unlike supervision with gold-standard logical forms, our method does not require the user to be familiar with the underlying logical formalism, and unlike supervision from denotations, it does not require the user to know the correct answer to their query. This makes our learning algorithm naturally scalable in settings where existing conversational logs are available and can be leveraged as training data. We construct a novel dataset of natural language feedback in a conversational setting, and show that our method is effective at learning a semantic parser from such natural language supervision.

1 Introduction

Semantic parsing is the problem of mapping a natural language utterance into a formal meaning representation, e.g., an executable logical form (Zelle and Mooney, 1996). Because the space of all logical forms is large but constrained by an underlying structure (i.e., all trees), the problem of learning a semantic parser is commonly formulated as an instance of structured prediction.

Historically, approaches based on supervised learning of structured prediction models have emerged as some of the first, and still remain common, in the semantic parsing community (Zettlemoyer and Collins, 2005, 2009; Kwiatkowski et al., 2010). A well recognized practical challenge in supervised learning of structured models is that the fully annotated structures (e.g., logical forms) needed for training are often highly labor-intensive to collect. This problem is further exacerbated in semantic parsing by the fact that these annotations can only be done by people familiar with the underlying logical language, making it challenging to construct large scale datasets with non-experts.

Over the years, this practical observation has spurred many creative solutions to training semantic parsers that are capable of leveraging weaker forms of supervision, amenable to non-experts. One such weaker form of supervision relies on logical form denotations (i.e., the results of a logical form's execution), rather than the logical form itself, as the "supervisory" signal (Clarke et al., 2010; Liang et al., 2013; Berant et al., 2013; Pasupat and Liang, 2015; Liang et al., 2016; Krishnamurthy et al., 2017). In Question Answering (QA), for example, this means the annotator needs only to know the answer to a question, rather than the full SQL query needed to obtain that answer. Paraphrasing of utterances already annotated with logical forms is another practical approach to scaling up annotation without requiring experts with knowledge of the underlying logical formalism (Berant and Liang, 2014; Wang et al., 2015).

Although these and similar methods do reduce the difficulty of the annotation task, collecting even these weaker forms of supervision (e.g., denotations and paraphrases) still requires a dedicated annotation event that occurs outside of the normal interactions between the end-user and the semantic parser [1]. Furthermore, expert knowledge may still be required, even if the underlying logical form does not need to be given by the annotator (QA denotations, for example, require the annotator to know the correct answer to the question – an assumption which does not hold for end-users who asked the question with the goal of obtaining the answer). In contrast, our goal is to leverage, as a training signal for learning a semantic parser, the natural language feedback and corrections that occur naturally as part of the continuous interaction with the non-expert end-user.

The core challenge in leveraging natural language feedback as a form of supervision in training semantic parsers, however, is the challenge of correctly parsing that feedback to extract the supervisory signal embedded in it. Parsing the feedback, just like parsing the original utterance, requires its own semantic parser trained to interpret that feedback. Motivated by this observation, our main contribution in this work is a semi-supervised learning algorithm that learns a task parser (e.g., a question parser) from feedback utterances while simultaneously learning a parser to interpret the feedback utterances. Our algorithm relies only on a small number of annotated logical forms, and can continue learning as more feedback is collected from interactions with the user.

Because our model learns from supervision that it simultaneously learns to interpret, we call our approach learning to learn semantic parsers from natural language supervision.

[1] Although in some cases existing datasets can be leveraged to extract this information.

2 Problem Formulation

Formally, the setting proposed in this work can be modelled as follows: (i) the user poses a natural language input u_i (e.g., a question) to the system, (ii) the system parses the user's utterance u_i, producing a logical form ŷ_i, (iii) the system communicates ŷ_i to the user in natural language in the form of a confirmation (i.e., "did you mean ..."), (iv) in response to the system's confirmation, the user generates a feedback utterance f_i, which may be a correction of ŷ_i expressed in natural language. The observed variables in a single interaction are the task utterance u_i, the predicted task logical form ŷ_i and the user's feedback f_i; the true logical form y_i is hidden. See Figure 1 for an illustration.

    User Utterance:  Before [September 10th], how many places was [Brad] employed?
    System NLG:      find employment [Brad] had, prior to [September 10th]
    User Feedback:   Yes, but I also want you to count how many places Brad was employed at.

Figure 1: Example of (i) the user's original utterance, (ii) the confirmation query generated by inverting the original parse, and (iii) the user's feedback towards the confirmation query (i.e., the original parse).

A key observation that we make from the above formulation is that learning from such interactions can be done effectively in an offline (i.e., non-interactive) setting, using only the logs of past interactions with the user. Our aim is to formulate a model that can learn a task semantic parser (i.e., one parsing the original utterance u) from such interaction logs, without access to the true logical forms (or denotations) of the users' requests.

2.1 Modelling conversational logs

Formally, we propose to learn a semantic parser from conversational logs represented as follows:

    D = {(u_i, ŷ_i, f_i)}_{i=1,...,N}

where u_i is the user's task utterance (e.g., a question), ŷ_i is the system's original parse (logical form) of that utterance, f_i is the user's natural language feedback towards that original parse, and N is the number of dialog turns in the log. Note that the original parse ŷ_i of utterance u_i could come from any semantic parser that was deployed at the time the data was logged – no assumptions are made about how ŷ_i was produced.

Contrast this with the traditional learning settings for semantic parsers, where the user (or annotator) provides the correct logical form y_i, or the execution [[y_i]] of the correct logical form (its denotation) (Table 1). In our setting, instead of the correct logical form y_i, we only have access to the logical form produced by whatever semantic parser was interacting with the user at the time the data was collected, i.e., ŷ_i is not necessarily the correct logical form (though it could be). In this work, however, we focus only on corrective feedback (i.e., cases where ŷ ≠ y), as our main objective is to evaluate the ability to interpret rich natural language feedback (which is only given when the original parse is incorrect). Our key hypothesis in this work is that although we do not observe y_i directly, combining ŷ_i with the user's feedback f_i should be sufficient to get a better estimate of the correct logical form (closer to y_i), which we can then leverage as a "training example" to improve the task parser (and the feedback parser).
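To make the shape of this data concrete, the sketch below shows one way a single log record could be represented; the field names and the example logical-form syntax are our own illustration, not the format of any released data.

    from dataclasses import dataclass

    @dataclass
    class Interaction:
        """One dialog turn from the conversational log D = {(u_i, y_hat_i, f_i)}."""
        utterance: str  # u_i: the user's original question
        parse: str      # y_hat_i: logical form from whatever parser was deployed
        feedback: str   # f_i: the user's natural language feedback on that parse

    # A hypothetical record mirroring the example in Figure 1; the logical form
    # below is invented for illustration.
    log = [Interaction(
        utterance="Before [September 10th], how many places was [Brad] employed?",
        parse="find(employment(Brad), before(September 10th))",
        feedback="Yes, but I also want you to count how many places Brad was employed at.",
    )]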

    Supervision               Dataset
    ------------------------  -----------------------------------
    Full logical forms        {(u_i, y_i)}_{i=1,...,N}
    Denotations               {(u_i, [[y_i]])}_{i=1,...,N}
    Binary feedback           {(u_i, [[ŷ_i]] = [[y_i]])}_{i=1,...,N}
    NL feedback (this work)   {(u_i, ŷ_i, f_i)}_{i=1,...,N}

Table 1: Different types of supervision used in the literature for training semantic parsers, and the corresponding data needed for each type of supervision. Notation used in the table: u corresponds to a user utterance (language), y corresponds to a gold-standard logical form parse of u, ŷ corresponds to a predicted logical form, [[·]] is the result of executing the expression inside the brackets, and f is the user's feedback expressed in natural language in the context of an utterance and a predicted logical form.

3 Natural Language Supervision

3.1 Learning problem formulation

In this section, we propose a learning algorithm for training a semantic parser from natural language feedback. We will use the terms task parser and feedback parser to refer to the two distinct parsers used for parsing the user's original task utterance (e.g., a question) and the user's follow-up feedback utterance respectively. Our learning algorithm does not assume any specific underlying model for the two parsers, aside from the requirement that each parser specifies a probability distribution over logical forms given the utterance (for the task parser) and the utterance plus feedback (for the feedback parser):

    Task Parser:     P(y | u; θ_t)
    Feedback Parser: P(y | u, f, ŷ; θ_f)

where θ_t and θ_f parametrize the task and feedback parsers respectively. Note that the key distinction between the task and feedback parser models is that, in addition to the user's original task utterance u, the feedback parser also has access to the user's feedback utterance f and the original parser's prediction ŷ. We now introduce a joint model that combines the task and feedback parsers in a way that encourages the two models to agree:

    P(y | u, f, ŷ; θ_t, θ_f) = (1/Z) · P(y | u; θ_t) · P(y | u, f, ŷ; θ_f)

where the first factor is the task parser, the second factor is the feedback parser, and Z is the normalizer. At training time, our objective is to maximize the above joint likelihood by optimizing the parser parameters θ_t and θ_f of the task and feedback parsers respectively. The intuition behind optimizing the joint objective is that it encourages the "weaker" model that does not have access to the feedback utterance to agree with the "stronger" model that does (i.e., using the feedback parser's prediction as a noisy "label" to bootstrap the task parser), while conversely encouraging the more "complex" model to agree with the "simpler" model (i.e., using the simpler task parser model as a "regularizer" for the more complex feedback parser; the feedback parser generally has higher model complexity compared to the task parser because it incorporates additional parameters to account for processing the feedback input) [2]. Note that the feedback parser's output is not simply the meaning of the feedback utterance in isolation, but instead the revised semantic interpretation of the original utterance, guided by the feedback utterance. In this sense, the task faced by the feedback parser is both to determine the meaning of the feedback, and to apply that feedback to repair the original interpretation of u_i. Note that this model is closely related to co-training (Blum and Mitchell, 1998) (and more generally multi-view learning (Xu et al., 2013)) and is also a special case of a product-of-experts (PoE) model (Hinton, 1999).

[2] Note that in practice, to avoid locally optimal degenerate solutions (e.g., where the feedback parser learns to trivially agree with the task parser by learning to ignore the feedback), some amount of labeled logical forms would be required to pre-train both models.
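As an illustration of the product-of-experts combination, the unnormalized joint score of a candidate parse is just the sum of the two experts' log-probabilities; a minimal sketch, treating both parsers as opaque scoring functions (all names here are ours):

    import math

    def joint_log_score(y, u, f, y_hat, task_logp, feedback_logp):
        """Unnormalized log of P(y | u, f, y_hat; theta_t, theta_f):
        log P(y | u; theta_t) + log P(y | u, f, y_hat; theta_f), dropping log Z."""
        return task_logp(y, u) + feedback_logp(y, u, f, y_hat)

    # Toy example with two candidate parses: the joint score favors the candidate
    # that both experts assign reasonable probability to.
    task_logp = lambda y, u: math.log({"A": 0.6, "B": 0.4}[y])
    feedback_logp = lambda y, u, f, y_hat: math.log({"A": 0.1, "B": 0.9}[y])
    best = max(("A", "B"), key=lambda y: joint_log_score(y, "u", "f", "y_hat",
                                                         task_logp, feedback_logp))
    # best == "B": the feedback expert's strong preference outweighs the task parser's.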

3.2 Learning

The problem of maximizing the joint likelihood P(y | u, f, ŷ; θ_t, θ_f) can be approached as a standard problem of maximum likelihood estimation in the presence of latent variables (i.e., the unobserved logical form y), suggesting the application of the Expectation Maximization (EM) algorithm for learning the parameters θ_t and θ_f. The direct application of EM in our setting, however, is faced with a complication in the E-step. Because the hidden variables (logical forms y) are structured, computing the posterior over the space of all logical forms and taking the expectation of the log-likelihood with respect to that posterior is generally intractable (unless we assume certain factorized forms for the logical form likelihood).

Instead, we propose to approximate the E-step by replacing the expectation of the log joint likelihood with its point estimate, using the maximum a posteriori (MAP) estimate of the posterior over logical forms y to obtain that estimate (this is sometimes referred to as "hard-EM"). Obtaining the MAP estimate of the posterior over logical forms may itself be a difficult optimization problem, and the specific algorithm for obtaining it would depend on the internal models of the task and the feedback parser.

As an alternative, we propose a simple, MCMC-based algorithm for approximating the MAP estimate of the posterior over logical forms that does not require access to the internals of the parser models, i.e., allowing us to conveniently treat both parsers as "black boxes". The only assumption that we make is that it is easy to sample logical forms from the individual parsers (though not necessarily from the joint model), as well as to compute the likelihoods of those sampled logical forms under at least one of the models.

We use Metropolis Hastings to sample from the posterior over logical forms, using one of the parser models as the proposal distribution. Specifically, if we choose the feedback parser as the proposal distribution, and initialize the Markov chain with a logical form sampled from the task parser, it can be shown that the acceptance probability r for the first sample conveniently simplifies to the following expression:

    r = min( 1, P(ŷ_f | u; θ_t) / P(ŷ_t | u; θ_t) )    (1)

where ŷ_t and ŷ_f are logical forms sampled from the task and feedback parsers respectively. Intuitively, the above expression compares the likelihood of the logical form proposed by the task parser to the likelihood of the logical form proposed by the feedback parser, but with both likelihoods computed under the same task parser model (making the comparison fair). The proposed parse is then accepted with a probability proportional to that ratio. See Algorithm 2 for details.

    Algorithm 1: Semantic Parser Training from Natural Language Supervision
      Input:      D = {(u_i, ŷ_i, f_i)}_{1,...,N}
      Output:     task parser parameters θ_t; feedback parser parameters θ_f
      Parameter:  number of training epochs T
      for t = 1 to T do
        for i = 1 to N do
          ŷ_i^f ← MH-MAP(u_i, ŷ_i, f_i, θ_t, θ_f)
          ∇θ_t ← ∇ log P(ŷ_i^f | u_i; θ_t)
          ∇θ_f ← ∇ log P(ŷ_i^f | u_i, f_i, ŷ_i; θ_f)
          θ_t ← SGD(θ_t, ∇θ_t)
          θ_f ← SGD(θ_f, ∇θ_f)

    Algorithm 2: Metropolis Hastings-based MAP estimation of the latent semantic parse
      Input:      u, ŷ, f, θ_t, θ_f
      Output:     latent parse ŷ^f
      Parameter:  number of sampling iterations N
      function MH-MAP(u, ŷ, f, θ_t, θ_f):
        samples ← []
        sample ŷ_t ∼ P(y | u; θ_t)            // sample parse from task parser
        ŷ_curr ← ŷ_t
        for i = 1 to N do
          sample ŷ_f ∼ P(y | u, f, ŷ; θ_f)    // sample parse from feedback parser
          r ← min(1, P(ŷ_f | u; θ_t) / P(ŷ_curr | u; θ_t))
          sample accept ∼ Bernoulli(r)
          if accept then
            p_t ← P(ŷ_f | u; θ_t)
            p_f ← P(ŷ_f | u, f, ŷ; θ_f)
            samples[ŷ_f] ← p_t · p_f
            ŷ_curr ← ŷ_f
        ŷ^f ← argmax samples
        return ŷ^f

Finally, given this MAP estimate of y, we can perform optimization over the parser parameters θ_t and θ_f using a single step of stochastic gradient descent before re-estimating the latent logical form y. We also approximate the gradients of each parser model by ignoring the gradient terms associated with the log of the normalizer Z [3]. See Algorithm 1 for the complete algorithm.

[3] We have experimented with a sampling-based approximation of the true gradients with contrastive divergence (Hinton, 2002); however, we found that our approximation works sufficiently well empirically.
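The following is a minimal executable sketch of Algorithm 2 under the stated black-box assumption; the sampler and likelihood functions are placeholders supplied by the caller, parses are assumed hashable (e.g., strings), and we also include the rejection of proposals equal to ŷ described later in Section 3.4:

    import random

    def mh_map(u, y_hat, f, sample_task, sample_feedback, task_prob, feedback_prob,
               n_iter=100):
        """Metropolis Hastings-based MAP estimation of the latent parse (Algorithm 2).
        The feedback parser serves as the proposal distribution; the chain is
        initialized from the task parser. Assumes task_prob(y, u) > 0 for samples."""
        scores = {}                      # joint (unnormalized) score per accepted parse
        y_curr = sample_task(u)          # initialize the chain from the task parser
        for _ in range(n_iter):
            y_prop = sample_feedback(u, f, y_hat)  # propose from the feedback parser
            if y_prop == y_hat:
                continue                 # reject the original (incorrect) parse (Sec. 3.4)
            # Acceptance ratio of Eq. (1): both parses scored under the task parser.
            r = min(1.0, task_prob(y_prop, u) / task_prob(y_curr, u))
            if random.random() < r:
                scores[y_prop] = task_prob(y_prop, u) * feedback_prob(y_prop, u, f, y_hat)
                y_curr = y_prop
        return max(scores, key=scores.get) if scores else y_curr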
3.3 Task Parser Model

Our task parser model is implemented based on existing attention-based encoder-decoder models (Bahdanau et al., 2014; Luong et al., 2015). The encoder takes in a (tokenized) utterance u and computes a context-sensitive embedding h_i for each token u_i using a bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997), where h_i is computed as the concatenation of the hidden states at position i output by the forward LSTM and the backward LSTM. The decoder generates the output logical form y one token at a time using another LSTM. At each time step j, it generates y_j based on the current LSTM hidden state s_j, a summary of the input context c_j, and attention scores a_ji, which are used for attention-based copying as in (Jia and Liang, 2016). Specifically,

    p(y_j = w, w ∈ V_out | u, y_{1:j-1}) ∝ exp( W_o [s_j; c_j] )
    p(y_j = u_i | u, y_{1:j-1}) ∝ exp( a_ji )

where y_j = w denotes that y_j is chosen from the output vocabulary V_out; y_j = u_i denotes that y_j is a copy of u_i; a_ji = s_j^T W_a h_i is an attention score on the input word u_i; c_j = Σ_i α_i h_i, with α_i ∝ exp(a_ji), is a context vector that summarizes the encoder states; and W_o and W_a are matrix parameters to be learned. After generating y_j, the decoder LSTM updates its hidden state s_{j+1} by taking as input the concatenation of the embedding vector for y_j and the context vector c_j.

An important problem in semantic parsing for conversation is resolving references to the people and things mentioned in the dialog context. Instead of treating coreference resolution as a separate problem, we propose a simple way to resolve it as part of semantic parsing. For each input utterance, we record a list of previously-occurring entity mentions m = {m_1, ..., m_L}. We consider entity mentions of four types: persons, organizations, times, and topics. The encoder now takes in the utterance u concatenated with m [4], and the decoder can generate a referenced mention through copying.

The top parts of Figure 7 and Figure 8 visualize the attention mechanism of the task parser, where the decoder attends to both the utterance and the conversational context during decoding. Note that the conversational context is only useful when the utterance contains referring mentions.

[4] The mentions are ordered based on their types, and each mention is wrapped by the special boundary symbols [ and ].
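A shape-level sketch of one decoder step, combining the write and copy scores into a single normalized distribution (a standard way to realize the two proportionality relations above; the matrix shapes are illustrative, not the paper's exact implementation):

    import numpy as np

    def decoder_step_distribution(s_j, H, W_a, W_o):
        """One decoder step of the attention-based copy model.
        s_j: decoder state, shape (d,); H: encoder states h_i, shape (n, d);
        W_a: (d, d); W_o: (|V_out|, 2d). Returns p over |V_out| writes + n copies."""
        a_j = H @ (W_a @ s_j)                  # attention scores a_ji = s_j^T W_a h_i
        alpha = np.exp(a_j - a_j.max())
        alpha /= alpha.sum()                   # attention weights alpha_i
        c_j = alpha @ H                        # context vector c_j = sum_i alpha_i h_i
        write_logits = W_o @ np.concatenate([s_j, c_j])  # scores for y_j = w in V_out
        logits = np.concatenate([write_logits, a_j])     # copy scores for y_j = u_i
        p = np.exp(logits - logits.max())
        return p / p.sum()

    # Toy shapes: 4 hidden units, 3 input tokens, output vocabulary of 10 symbols.
    rng = np.random.default_rng(0)
    d, n, V = 4, 3, 10
    p = decoder_step_distribution(rng.normal(size=d), rng.normal(size=(n, d)),
                                  rng.normal(size=(d, d)), rng.normal(size=(V, 2 * d)))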

3.4 Feedback Parser Model

Our feedback parser model is an extension of the encoder-decoder model in the previous section. The encoder consists of two bidirectional LSTMs: one encodes the input utterance u along with the history of entity mentions m, and the other encodes the user feedback f. At each time step j during decoding, the decoder computes attention scores over each word u_i in the utterance u as well as each feedback word f_k, based on the decoder hidden state s_j, the context embedding b_k output by the feedback encoder, and a learnable weight matrix W_e: e_jk = s_j^T W_e b_k. The input context vector c_j of the previous section is updated by adding a context vector that attends to both the utterance and the feedback:

    c'_j = Σ_i α_i h_i + Σ_k β_k b_k

where α_i ∝ exp(a_ji) and β_k ∝ exp(e_jk). Accordingly, the decoder is allowed to copy words from the feedback:

    p(y_j = f_k | u, f, y_{1:j-1}) ∝ exp( e_jk )

The bottom parts of Figure 7 and Figure 8 visualize the attention mechanism of the feedback parser, where the decoder attends to the utterance, the conversational context, and the feedback during decoding.

Note that our feedback model does not explicitly incorporate the original logical form ŷ that the user was generating their feedback towards. We experimented with a number of ways to incorporate ŷ in the feedback parser model, but found it most effective to instead use it during MAP inference for the latent parse y. In sampling a logical form in Algorithm 2, we simply reject the sample if the sampled logical form matches ŷ.
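The combined context vector can be sketched in the same shape-level style as before; a minimal numpy illustration of c'_j (the function name and shapes are ours):

    import numpy as np

    def feedback_context(s_j, H, B, W_a, W_e):
        """Combined context c'_j: attention over utterance encoder states H (n, d)
        and feedback encoder states B (m, d). The raw scores a and e are returned
        as well, since e_jk also drives the copy distribution over feedback words."""
        a = H @ (W_a @ s_j)                    # utterance attention scores a_ji
        e = B @ (W_e @ s_j)                    # feedback attention scores e_jk
        alpha = np.exp(a - a.max()); alpha /= alpha.sum()
        beta = np.exp(e - e.max()); beta /= beta.sum()
        c_prime = alpha @ H + beta @ B         # c'_j = sum_i alpha_i h_i + sum_k beta_k b_k
        return c_prime, a, e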
4 Dataset

In this work we focus on the problem of semantic parsing in a conversational setting, and we construct a new dataset for this task. Note that a parser is required to resolve references to the conversational context during parsing, which in turn may further amplify ambiguity and propagate errors to the final logical form. Conveniently, the same conversational setting also offers a natural channel for correcting such errors via the natural language feedback users can express as part of the conversation.

4.1 Dataset construction

We choose conversational search in the domain of email and biographical research as the setting for our dataset. Large enterprises produce large amounts of text during daily operations, e.g., emails, reports, and meeting memos. There is an increasing need for systems that allow users to quickly find information over text through search. Unlike regular tasks like booking airline tickets, search tasks often involve many facets, some of which may or may not be known to the users when the search begins. For example, knowing where a person works may trigger a follow-up question about whether the person communicates with someone internally about a certain topic. Such search tasks will often be handled most naturally through dialog.

Figure 1 shows example dialog turns containing the user's utterance (question) u, the system's confirmation of the original parse ŷ, and the user's feedback f. To simplify and scale data collection, we decouple the dialog turns and collect feedback for each dialog turn in isolation.

We do that by showing a worker on Mechanical Turk the original utterance u, the system's natural language confirmation generated by inverting the original logical form ŷ, and the dialog context summarized in a table. See Figure 2 for a screenshot of the Mechanical Turk interface. The dialog context is summarized by the entities (of types person, organization, time and topic) that were mentioned earlier in the conversation and could be referenced in the original question u shown on the screen [5]. The worker is then asked to type their feedback given (u, ŷ) and the dialog context displayed on the screen. Turkers are instructed to write a natural language feedback utterance that they might otherwise say in a real conversation when attempting to correct another person's incorrect understanding of their question.

Figure 2: Screenshot of the Mechanical Turk web interface used to collect natural language feedback.

We recognize that our data collection process results in only an approximation of a true multi-turn conversation; however, we find that this approach to data collection offers a convenient trade-off for collecting a large number of controlled and diverse context-grounded interactions. Qualitatively, we find that turkers are generally able to imagine themselves in the hypothetical ongoing dialog, and are able to generate realistic contextual feedback utterances using only the context summary table provided to them.

The initial dataset of questions paired with original logical form parses {u_i, ŷ_i}_{1,...,N} that we use to solicit feedback from turkers is prepared offline. In this separate offline task we collect a dataset of 3556 natural language questions, annotated with gold standard logical forms, in the same domain of email and biographical research. We parse each utterance in this dataset with a floating grammar-based semantic parser trained using a structured perceptron algorithm (implemented in SEMPRE (Berant et al., 2013)) on a subset of the questions.

[5] Zero or more of the context entities may actually be referenced in the original utterance.
We then construct the dataset for feedback collection by sampling the logical form ŷ from the first three candidate parses in the beam produced by the grammar-based parser. We use this grammar-based parser intentionally as a very different model from the one that we would ultimately train on the conversational logs produced by the original parser (an LSTM-based parser).

We retain 1285 out of the 3556 annotated questions to form a test set. The remaining 2271 questions we pair with between one and three predicted parses ŷ sampled from the beam produced by the grammar-based parser, and present each pair of original utterance and predicted logical form (u_i, ŷ_i) to a turker, who then generates a feedback utterance f_i. In total, we collect 4321 question/original parse/feedback triples (u_i, ŷ_i, f_i) (averaging approximately 1.9 feedback utterances per question).

5 Experiments

The key hypothesis that we aim to evaluate in our experiments is whether natural language feedback is an effective form of supervision for training a semantic parser. To this end, we control and measure the effect that the number of feedback utterances used during training has on the resulting performance of the task semantic parser on a held-out test set. Across all our experiments we also use a small seed training set (300 questions) that contains gold-standard logical forms to pre-train the task and feedback parsers. The number of "unlabeled" questions [6] (i.e., questions not labeled with gold standard logical forms but that have natural language feedback) ranges over 300, 500, 1000 and 1700, representing different experimental settings. For each experimental setting, we rerun the experiment 10 times, re-sampling the questions in both the training and unlabeled sets, and report the averaged results. The test set remains fixed across all experiments and contains 1285 questions labeled with gold-standard logical forms. The implementation details for the task parser and the feedback parser are included in the appendix.

[6] Note that from here on we will refer to the portion of the data that contains natural language feedback as the only form of supervision as "unlabeled data", to emphasize that it is not labeled with gold standard logical forms (in contrast to the seed training set).
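The experimental protocol just described amounts to a simple sweep; a schematic sketch, where train_and_eval is a stand-in for pre-training on the seed set followed by semi-supervised training and held-out evaluation:

    import random
    import statistics

    SEED_SIZE = 300                          # labeled seed questions
    UNLABELED_SIZES = [300, 500, 1000, 1700]
    RERUNS = 10

    def sweep(train_and_eval, labeled_pool, unlabeled_pool):
        """For each unlabeled-set size, re-sample both sets 10 times and report the
        mean held-out accuracy; the 1285-question test set stays fixed throughout."""
        results = {}
        for n in UNLABELED_SIZES:
            accs = [train_and_eval(random.sample(labeled_pool, SEED_SIZE),
                                   random.sample(unlabeled_pool, n))
                    for _ in range(RERUNS)]
            results[n] = statistics.mean(accs)
        return results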

5.1 Models and evaluation metrics

In our evaluations, we compare the following four models:

• MH (full model): The joint model described in Section 3 and in Algorithm 1 (using the Metropolis Hastings-based MAP inference described in Algorithm 2).

• MH (no feedback): Same as the full model, except we ignore the feedback f and the original logical form ŷ in the feedback parser model. Effectively, this reduces the model of the feedback parser to that of the task parser. Because the two models would be initialized differently during training, we may still expect the resulting model averaging effects to aid learning.

• MH (no feedback + reject ŷ): Same as the above baseline without feedback, but we incorporate the knowledge of the original logical form ŷ during training. We incorporate ŷ using the same method as described in Section 3.4.

• Self-training: Latent logical form inference is performed using only the task parser (using beam search). The feedback utterance f and the original logical form ŷ are ignored. The task parser parameters θ_t are updated in the same way as in Algorithm 1.

Note that all models are exposed to the same training seed set, and differ only in the way they take advantage of the unlabeled data. We perform two types of evaluation for each model:

• Generalization performance: we use the learned task parser to make predictions on held-out data. This type of evaluation tests the ability of the parser trained with natural language feedback to generalize to unseen utterances.

• Unlabeled data performance: we use the learned task parser to make predictions on the unlabeled data that was used in training it. Note that for each experimental condition, we perform this evaluation only on the portion of the unlabeled data that was used during training. This ensures that this evaluation tests the model's ability to "recover" correct logical forms from the questions that have natural language feedback associated with them.

6 Results

6.1 Generalization performance

Figure 3a shows test accuracy as a function of the number of unlabeled questions (i.e., questions containing only feedback supervision without gold standard logical forms) used during training, across all four models. As expected, using more unlabeled data generally improves generalization performance. The self-training baseline is the only exception, where performance starts to deteriorate as the ratio of unlabeled to labeled questions increases beyond a certain point. This behavior is not necessarily surprising – when the unlabeled examples significantly outnumber the labeled examples, the model may more easily veer away to local optima without being strongly regularized by the loss on the small number of labeled examples.

Interestingly, the MH (no feedback) baseline is very similar to self-training, but has a significant performance advantage that does not deteriorate with more unlabeled examples. Recall that the MH (no feedback) model modifies the full model described in Algorithm 1 by ignoring the feedback f and the original logical form ŷ in the feedback parser model P(y | u, ŷ, f; θ_f). This has the effect of reducing the model of the feedback parser to the model of the task parser P(y | u; θ_t). The training on unlabeled data otherwise proceeds in the same way as described in Algorithm 1. As a result, the principle behind the MH (no feedback) model is the same as that behind self-training, i.e., a single model learns from its own predictions. However, the different initializations of the two copies of the task parser, and the combined model averaging, appear to improve the robustness of the model sufficiently to keep it from diverging as the amount of unlabeled data is increased.

The MH (no feedback + reject ŷ) baseline also does not observe the feedback utterance f, but incorporates the knowledge of the original parse ŷ. As described in Section 3.4, this knowledge is incorporated during MAP inference of the latent parse, by rejecting any logical form samples that match ŷ. As expected, incorporating the knowledge of ŷ improves the performance of this baseline over the one that does not.

Finally, the full model, which incorporates both the feedback utterance f and the original logical form ŷ, outperforms the baselines that incorporate only some of that information. The performance gain over these baselines grows as more questions with natural language feedback supervision are made available during training.

Note that both this model and the MH (no feedback + reject ŷ) baseline incorporate the knowledge of the original logical form ŷ; however, the performance gain from incorporating the knowledge of ŷ without the feedback is relatively small, indicating that the gains of the model that observes feedback come primarily from its ability to interpret it.

6.2 Performance on unlabeled data

Figure 3b shows accuracy on the unlabeled data as a function of the number of unlabeled questions used during training, across all four models. The questions used in evaluating the model's accuracy on unlabeled data are the same unlabeled questions used during training in each experimental condition. The general trend and the relationship between baselines are consistent with the generalization performance on held-out data in Figure 3a. One of the main observations is that accuracy on unlabeled training examples remains relatively flat, but consistently high (> 80%), across all models regardless of the amount of unlabeled questions used in training (within the range that we experimented with). This suggests that while the models are able to accurately recover the underlying logical forms of the unlabeled questions regardless of the amount of unlabeled data (within our experimental range), the resulting generalization performance of the learned models is significantly affected by the amount of unlabeled data (more is better).

Figure 3: (a) Parsing accuracy on held-out questions as a function of the number of unlabeled questions used during training. (b) Parsing accuracy on unlabeled questions as a function of the number of unlabeled questions used during training. In both panels, all parsers were initialized using 300 labeled examples consisting of questions and their corresponding logical forms.

6.3 Effect of feedback complexity on performance

Figure 4 reports parsing accuracy on unlabeled questions as a function of the number of corrections expressed in the feedback utterance paired with each question. Our main observation is that the performance of the full model (i.e., the joint model that uses natural language feedback) deteriorates for questions that are paired with more complex feedback (i.e., feedback containing more corrections). Perhaps surprising, however, is that all models (including those that do not incorporate feedback) deteriorate in performance on questions paired with more complex feedback. This is explained by the fact that more complex feedback is generally provided for more difficult-to-parse questions. In Figure 4, we overlay the parsing performance with statistics on the average length of the target logical form (number of predicates). A more important take-away from the results in Figure 4 is that the model that takes advantage of natural language feedback gains an even greater advantage over the models that do not use feedback when parsing more difficult questions. This means that the model is able to take advantage of more complex feedback (i.e., with more corrections) even for more difficult-to-parse questions.

Figure 4: Parsing accuracy on unlabeled questions, partitioned by feedback complexity (i.e., number of corrections expressed in a single feedback utterance).

7 Related Work

Early semantic parsing systems map natural language to logical forms using inductive logic programming (Zelle and Mooney, 1996). Modern systems apply statistical models to learn from pairs of sentences and logical forms (Zettlemoyer and Collins, 2005, 2009; Kwiatkowski et al., 2010). As hand-labeled logical forms are very costly to obtain, different forms of weak supervision have been explored. Example works include learning from pairs of sentences and answers obtained by querying a database (Clarke et al., 2010; Liang et al., 2013; Berant et al., 2013; Pasupat and Liang, 2015; Liang et al., 2016; Krishnamurthy et al., 2017); learning from indirect supervision from a large-scale knowledge base (Reddy et al., 2014; Krishnamurthy and Mitchell, 2012); learning from conversations of systems asking for and confirming information (Artzi and Zettlemoyer, 2011; Thomason et al., 2015; Padmakumar et al., 2017); and learning from interactions with a simulated world environment (Branavan et al., 2009; Artzi and Zettlemoyer, 2013; Goldwasser and Roth, 2014; Misra et al., 2015). The supervision used in these methods is mostly in the form of binary feedback, partial logical forms (e.g., slots) or execution results. In this paper, we explore a new form of supervision – natural language feedback. We demonstrate that such feedback not only provides rich and expressive supervisory signals for learning but also can be easily collected via crowd-sourcing. Recent work (Iyer et al., 2017) trains an online language-to-SQL parser from user feedback. Unlike our work, their collected feedback is structured and is used for acquiring more labeled data during training. Our model jointly learns from questions and feedback and can be trained with limited labeled data.

There has been growing interest in machine learning from natural language instructions. Much work has been done in the setting where an autonomous agent learns to complete a task in an environment, for example, learning to play games by utilizing text manuals (Branavan et al., 2012; Eisenstein et al., 2009; Narasimhan et al., 2015) and guiding policy learning using high-level human advice (Kuhlmann et al., 2004; Squire et al., 2015; Harrison et al., 2017). Recently, natural language explanations have been used to augment labeled examples for concept learning (Srivastava et al., 2017) and to help induce programs that solve algebraic word problems (Ling et al., 2017). Our work is similar in that natural language is used as additional supervision during learning; however, our natural language annotations consist of user feedback on system predictions instead of explanations of the training data.

8 Conclusion and Future Work

In this work, we proposed a novel task of learning a semantic parser directly from end-users' open-ended natural language feedback during a conversation. The key advantage of being able to learn from natural language feedback is that it opens the door to learning continuously through natural interactions with the user, but it also presents the challenge of how to interpret such feedback. In this work we introduced an effective approach that simultaneously learns two parsers: one parser that interprets natural language questions, and a second parser that also interprets natural language feedback regarding errors made by the first parser.

Our work is, however, limited to interpreting feedback contained in a single utterance. A natural generalization of learning from natural language feedback is to view it as part of an integrated dialog system capable of both interpreting feedback and asking appropriate questions to solicit this feedback (e.g., connecting to the work by Padmakumar et al. (2017)). We hope that the problem we introduce in this work, together with the dataset that we release, inspires the community to develop models that can learn language (e.g., semantic parsers) through flexible natural language conversation with end users.

Acknowledgements

This research was supported in part by AFOSR under grant FA95501710218, and in part by the CMU InMind project funded by Oath.
References

Yoav Artzi and Luke Zettlemoyer. 2011. Bootstrapping semantic parsers from conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 421–432. Association for Computational Linguistics.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425.

Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM.

Satchuthananthavale R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 82–90. Association for Computational Linguistics.

S. R. K. Branavan, David Silver, and Regina Barzilay. 2012. Learning to win by reading manuals in a Monte-Carlo framework. Journal of Artificial Intelligence Research, 43:661–704.

James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 18–27. Association for Computational Linguistics.

Jacob Eisenstein, James Clarke, Dan Goldwasser, and Dan Roth. 2009. Reading to learn: Constructing features from semantic abstracts. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 958–967. Association for Computational Linguistics.

Dan Goldwasser and Dan Roth. 2014. Learning from natural instructions. Machine Learning, 94(2):205–232.

Brent Harrison, Upol Ehsan, and Mark O. Riedl. 2017. Guiding reinforcement learning exploration using natural language. arXiv preprint arXiv:1707.08616.

Geoffrey E. Hinton. 1999. Products of experts.

Geoffrey E. Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. arXiv preprint arXiv:1704.08760.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. arXiv preprint arXiv:1606.03622.

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526.

Jayant Krishnamurthy and Tom M. Mitchell. 2012. Weakly supervised training of semantic parsers. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 754–765. Association for Computational Linguistics.

Gregory Kuhlmann, Peter Stone, Raymond Mooney, and Jude Shavlik. 2004. Guiding a reinforcement learner with natural language advice: Initial results in RoboCup soccer. In The AAAI-2004 Workshop on Supervisory Control of Learning and Adaptive Systems.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1223–1233. Association for Computational Linguistics.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2016. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv preprint arXiv:1611.00020.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Dipendra Kumar Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. 2015. Environment-driven lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 992–1002.

Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941.

Aishwarya Padmakumar, Jesse Thomason, and Raymond J. Mooney. 2017. Integrated learning of dialog strategies and semantic parsing. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 547–557.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2(1):377–392.

Shawn Squire, Stefanie Tellex, Dilip Arumugam, and Lei Yang. 2015. Grounding English commands to reward functions. In Robotics: Science and Systems.

Shashank Srivastava, Igor Labutov, and Tom Mitchell. 2017. Joint concept learning and semantic parsing from natural language explanations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1527–1536.

Jesse Thomason, Shiqi Zhang, Raymond J. Mooney, and Peter Stone. 2015. Learning to interpret natural language commands through human-robot dialog. In IJCAI, pages 1923–1929.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198.

Chang Xu, Dacheng Tao, and Chao Xu. 2013. A survey on multi-view learning. arXiv preprint arXiv:1304.5634.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence, pages 1050–1055.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.

Luke S. Zettlemoyer and Michael Collins. 2009. Learning context-dependent mappings from sentences to logical form. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 976–984. Association for Computational Linguistics.

[Figure 5 plot: histogram of the number of question/feedback pairs (y-axis, 0 to 2000) vs. the number of corrections expressed in the feedback utterance (x-axis, 0 to 3+), with separate bars for original parse ŷ correct and original parse ŷ incorrect.]

[Figure 6 plot: % of question/feedback pairs (y-axis, 0 to 7) vs. the number of predicates in the correct logical form (x-axis, 0 to 11), with curves for the overall user feedback error rate, user feedback false positives, and user feedback false negatives.]

Figure 5: A histogram of the number of corrections expressed in a natural language feedback utterance in our data (0 corrections means that the user affirmed the original parse as correct). We partition the feedback utterances according to whether the original parse ŷ that the feedback was provided towards is known to be correct (i.e., matches the gold standard logical form parse).

Figure 6: Analysis of noise in feedback utterances contained in our dataset. Feedback false positives refers to feedback that incorrectly identifies a wrong original parse ŷ as correct. Feedback false negatives refers to feedback that incorrectly identifies a correct original parse ŷ as wrong. Users are more likely to generate false positives (i.e., miss the error) in their feedback for parses of more complex utterances (as measured by the number of predicates in the gold-standard logical form).

A Analysis of natural language feedback

The dataset we collect creates a unique opportunity to study the nature and the limitations of the corrective feedback that users generate in response to an incorrect parse. One of our hypotheses, stated in Section 1, is that natural language allows users to express richer feedback than is possible with, for example, a binary (correct/incorrect) mechanism, by allowing users to explicitly refer to and fix what they see as incorrect in the original prediction. In this section we analyze the feedback utterances in our dataset to gain deeper insight into the types of feedback users generate, and the possible limitations of natural language as a source of supervision for semantic parsers.

Figure 5 breaks down the collected feedback by the number of corrections expressed in the feedback utterance. The number of corrections ranges from 0 (no corrections, i.e., the worker considers the original parse ŷ to be the correct parse of u) to more than 3 corrections [7]. The number of corrections is self-annotated by the workers who write the feedback – we instruct workers to count a correction constrained to a single predicate or a single entity as a single correction, and to tally all such corrections in their feedback utterance after they have written it [8]. Because we also know the ground truth of whether the original parse ŷ (i.e., the parse that the user provided feedback towards) was correct (i.e., whether ŷ matches the gold standard parse y), we can partition the number of corrections by whether the original parse ŷ was correct (green bars in Figure 5) or incorrect (red bars), allowing us to evaluate the accuracy of some of the feedback.

From Figure 5, we observe that users provide feedback that ranges in the number of corrections, with the majority of feedback utterances making one correction to the original logical form ŷ (only 4 feedback utterances expressed more than 3 corrections). Although we do not have gold-standard annotation for the true number of errors in the original logical form ŷ, we can nevertheless obtain some estimate of the noise in natural language feedback by analyzing cases where we know that the original logical form ŷ was correct, yet the user generated feedback with at least one correction. We will refer to such feedback instances as feedback false negatives. Similarly, cases where the original logical form ŷ is incorrect, yet the user provided no corrections in their feedback, we refer to as feedback false positives. The number of feedback false negatives and false positives can be obtained directly from Figure 5.

[7] Note that in cases where the user indicates that they made 0 corrections, the feedback utterance that they write is often a variation of a confirmation such as "that's right" or "that's correct".

[8] We give these instructions to workers in an easy to understand explanation without invoking technical jargon.
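Given per-example annotations of whether ŷ was correct and how many corrections the feedback expressed, these two noise counts are a direct tally; a minimal sketch using our own tuple representation of the annotations:

    def feedback_noise_counts(examples):
        """examples: iterable of (parse_was_correct: bool, num_corrections: int).
        A false negative: the parse was correct, yet the user issued corrections.
        A false positive: the parse was wrong, yet the user issued no corrections."""
        false_neg = sum(1 for correct, k in examples if correct and k > 0)
        false_pos = sum(1 for correct, k in examples if not correct and k == 0)
        return false_pos, false_neg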

Generally, we observe that users are more likely to produce false negatives (≈ 4% of all feedback) than false positives (≈ 1%) in the feedback they generate.

It is also instructive to consider what factors may contribute to the observed noise in user-generated natural language feedback. Our hypothesis is that more complex queries (i.e., longer utterances and longer logical forms) may result in a greater cognitive load when identifying and correcting the error(s) in the original parse ŷ. In Figure 6 we investigate this hypothesis by decomposing the percentage of feedback false positives and false negatives as a function of the number of predicates in the gold standard logical form (i.e., the one that the user is trying to recover by making corrections to the original logical form ŷ). Our main observation is that users tend to generate more false positives (i.e., incorrectly identify an originally incorrect logical form ŷ as correct) when the target logical form is longer (i.e., the query utterance u is more complex). The number of false negatives (i.e., incorrectly identifying a correct logical form as incorrect) is relatively unaffected by the complexity of the query (i.e., the number of predicates in the target logical form). One conclusion that we can draw from this analysis is that we can expect user-generated feedback to miss errors in more complex queries, and models that learn from users' natural language feedback need to have a degree of robustness to such noise.

B Implementation Details

We tokenize user utterances (questions) and feedback using the Stanford CoreNLP package. In all experiments, we use 300-dimensional word embeddings, initialized with word vectors trained using the Paraphrase Database PPDB (Wieting et al., 2015), and we use 128 hidden units for the LSTMs. All parameters are initialized uniformly at random. We train all the models using Adam with an initial learning rate of 10^-4 and apply L2 gradient norm clipping with a threshold of 10. In all the experiments, we pre-train the task parser and the feedback parser for 20 epochs, and then switch to semi-supervised training for 10 epochs. Pre-training takes roughly 30 minutes and the semi-supervised training process takes up to 7 hours on a Titan X GPU.
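For concreteness, the optimization settings above would look roughly as follows in a PyTorch-style API; the paper does not name its framework, so this is an assumed translation:

    import torch

    def make_optimizer(model):
        # Adam with the initial learning rate from Appendix B.
        return torch.optim.Adam(model.parameters(), lr=1e-4)

    def training_step(model, optimizer, loss):
        # One update with L2 gradient-norm clipping at threshold 10.
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
        optimizer.step()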


Figure 7: Visualization of the attention mechanism in the task parser (top) and the feedback parser (bottom) for parsing the same utterance. We partition the input to the parser into three groups: the original utterance u being parsed (blue), the conversational context (green) and the feedback utterance f (red). This example was parsed incorrectly before incorporating feedback, but parsed correctly after its incorporation.


Figure 8: Visualization of the attention mechanism in the task parser (top) and the feedback parser (bottom) for parsing the same utterance. We partition the input to the parser into three groups: the original utterance u being parsed (blue), the conversational context (green) and the feedback utterance f (red). This example was parsed incorrectly before incorporating feedback, but parsed correctly after its incorporation.
