Learning Structured Natural Language Representations for Semantic Parsing

Jianpeng Cheng† Siva Reddy† Vijay Saraswat‡ and Mirella Lapata†
†School of Informatics, University of Edinburgh  ‡IBM T.J. Watson Research
{jianpeng.cheng,siva.reddy}@ed.ac.uk, [email protected], [email protected]

Abstract

We introduce a neural semantic parser which is interpretable and scalable. Our model converts natural language utterances to intermediate, domain-general natural language representations in the form of predicate-argument structures, which are induced with a transition system and subsequently mapped to target domains. The semantic parser is trained end-to-end using annotated logical forms or their denotations. We achieve the state of the art on SPADES and GRAPHQUESTIONS and obtain competitive results on GEOQUERY and WEBQUESTIONS. The induced predicate-argument structures shed light on the types of representations useful for semantic parsing and how these are different from linguistically motivated ones.1

1 Introduction

Semantic parsing is the task of mapping natural language utterances to machine interpretable meaning representations. Despite differences in the choice of meaning representation and model structure, most existing work conceptualizes semantic parsing following two main approaches. Under the first approach, an utterance is parsed and grounded to a meaning representation directly via learning a task-specific grammar (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Wong and Mooney, 2006; Kwiatkowksi et al., 2010; Liang et al., 2011; Berant et al., 2013; Flanigan et al., 2014; Pasupat and Liang, 2015; Groschwitz et al., 2015). Under the second approach, the utterance is first parsed to an intermediate task-independent representation tied to a syntactic parser and then mapped to a grounded representation (Kwiatkowski et al., 2013; Reddy et al., 2014, 2016; Krishnamurthy and Mitchell, 2015; Gardner and Krishnamurthy, 2017). A merit of the two-stage approach is that it creates reusable intermediate interpretations, which potentially enables the handling of unseen words and knowledge transfer across domains (Bender et al., 2015).

The successful application of encoder-decoder models (Bahdanau et al., 2015; Sutskever et al., 2014) to a variety of NLP tasks has provided strong impetus to treat semantic parsing as a sequence transduction problem where an utterance is mapped to a target meaning representation in string format (Dong and Lapata, 2016; Jia and Liang, 2016; Kočiský et al., 2016). Such models still fall under the first approach; however, in contrast to previous work (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011) they reduce the need for domain-specific assumptions, grammar learning, and more generally extensive feature engineering. But this modeling flexibility comes at a cost since it is no longer possible to interpret how meaning composition is performed. Such knowledge plays a critical role in understanding modeling limitations so as to build better semantic parsers. Moreover, without any task-specific prior knowledge, the learning problem is fairly unconstrained, both in terms of the possible derivations to consider and in terms of the target output, which can be ill-formed (e.g., with extra or missing brackets).

In this work, we propose a neural semantic parser that alleviates the aforementioned problems. Our model falls under the second class of approaches where utterances are first mapped to an intermediate representation containing natural language predicates. However, rather than using an external parser (Reddy et al., 2014, 2016) or manually specified CCG grammars (Kwiatkowski et al., 2013), we induce intermediate representations in the form of predicate-argument structures

1 Our code is available at https://github.com/cheng6076/scanner.

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 44–55, Vancouver, Canada, July 30 - August 4, 2017. © 2017 Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1005

Predicate           Usage                                    Sub-categories
answer              denotation wrapper                       —
type                entity type checking                     stateid, cityid, riverid, etc.
all                 querying for an entire set of entities   —
aggregation         one-argument meta predicates for sets    count, largest, smallest, etc.
logical connectors  two-argument meta predicates for sets    intersect, union, exclude

Table 1: List of domain-general predicates.

from data. This is achieved with a transition-based approach which by design yields recursive semantic structures, avoiding the problem of generating ill-formed meaning representations. Compared to most existing semantic parsers which employ a CKY-style bottom-up parsing strategy (Krishnamurthy and Mitchell, 2012; Cai and Yates, 2013; Berant et al., 2013; Berant and Liang, 2014), the transition-based approach we propose does not require feature decomposition over structures and thereby enables the exploration of rich, non-local features. The output of the transition system is then grounded (e.g., to a knowledge base) with a neural mapping model under the assumption that grounded and ungrounded structures are isomorphic.2 As a result, we obtain a neural model that jointly learns to parse natural language semantics and induce a lexicon that helps grounding.

2 We discuss the merits and limitations of this assumption in Section 5.

The whole network is trained end-to-end on natural language utterances paired with annotated logical forms or their denotations. We conduct experiments on four datasets, including GEOQUERY (which has logical forms; Zelle and Mooney 1996), SPADES (Bisk et al., 2016), WEBQUESTIONS (Berant et al., 2013), and GRAPHQUESTIONS (Su et al., 2016) (which have denotations). Our semantic parser achieves the state of the art on SPADES and GRAPHQUESTIONS, while obtaining competitive results on GEOQUERY and WEBQUESTIONS. A side-product of our modeling framework is that the induced intermediate representations can contribute to rationalizing neural predictions (Lei et al., 2016). Specifically, they can shed light on the kinds of representations (especially predicates) useful for semantic parsing. Evaluation of the induced predicate-argument relations against syntax-based ones reveals that they are interpretable and meaningful compared to heuristic baselines, but they sometimes deviate from linguistic conventions.

2 Preliminaries

Problem Formulation Let K denote a knowledge base or, more generally, a reasoning system, and x an utterance paired with a grounded meaning representation G or its denotation y. Our problem is to learn a semantic parser that maps x to G via an intermediate ungrounded representation U. When G is executed against K, it outputs denotation y.

Grounded Meaning Representation We represent grounded meaning representations in FunQL (Kate et al., 2005) amongst many other alternatives such as lambda calculus (Zettlemoyer and Collins, 2005), λ-DCS (Liang, 2013) or graph queries (Holzschuher and Peinl, 2013; Harris et al., 2013). FunQL is a variable-free query language, where each predicate is treated as a function symbol that modifies an argument list. For example, the FunQL representation for the utterance which states do not border texas is:

answer(exclude(state(all), next_to(texas)))

where next_to is a domain-specific binary predicate that takes one argument (i.e., the entity texas) and returns a set of entities (e.g., the states bordering Texas) as its denotation. all is a special predicate that returns a collection of entities. exclude is a predicate that returns the difference between two input sets.

An advantage of FunQL is that the resulting s-expression encodes semantic compositionality and derivation of the logical forms. This property makes FunQL logical forms convenient to be predicted with recurrent neural networks (Vinyals et al., 2015; Choe and Charniak, 2016; Dyer et al., 2016). However, FunQL is less expressive than lambda calculus, partially due to the elimination of variables. A more compact logical formulation which our method also applies to is λ-DCS (Liang, 2013). In the absence of anaphora and composite binary predicates, conversion algorithms exist between FunQL and λ-DCS. However, we leave this to future work.

Ungrounded Meaning Representation We also use FunQL to express ungrounded meaning representations. The latter consist primarily of natural language predicates and domain-general predicates. Assuming for simplicity that domain-general predicates share the same vocabulary

in ungrounded and grounded representations, the ungrounded representation for the example utterance is:

answer(exclude(states(all), border(texas)))

where states and border are natural language predicates. In this work we consider five types of domain-general predicates, illustrated in Table 1. Notice that domain-general predicates are often implicit, or represent extra-sentential knowledge. For example, the predicate all in the above utterance represents all states in the domain, which are not mentioned in the utterance but are critical for working out the utterance denotation. Finally, note that for certain domain-general predicates, it also makes sense to extract natural language rationales (e.g., not is indicative of exclude). But we do not find this helpful in experiments.

In this work we constrain ungrounded representations to be structurally isomorphic to grounded ones. In order to derive the target logical forms, all we have to do is replace predicates in the ungrounded representations with symbols in the knowledge base.

3 Modeling

In this section, we discuss our neural model which maps utterances to target logical forms. The semantic parsing task is decomposed into two stages: we first explain how an utterance is converted to an intermediate representation (Section 3.1), and then describe how it is grounded to a knowledge base (Section 3.2).

3.1 Generating Ungrounded Representations

At this stage, utterances are mapped to intermediate representations with a transition-based algorithm. In general, the transition system generates the representation by following a derivation tree (which contains a set of applied rules) and some canonical generation order (e.g., depth-first). For FunQL, a simple solution exists since the representation itself encodes the derivation. Consider again answer(exclude(states(all), border(texas))), which is tree structured. Each predicate (e.g., border) can be visualized as a non-terminal node of the tree and each entity (e.g., texas) as a terminal. The predicate all is a special case which acts as a terminal directly. We can generate the tree with a top-down, depth-first transition system reminiscent of recurrent neural network grammars (RNNGs; Dyer et al. 2016). Similar to RNNG, our algorithm uses a buffer to store input tokens in the utterance and a stack to store partially completed trees. A major difference in our semantic parsing scenario is that tokens in the buffer are not fetched in a sequential order or removed from the buffer. This is because the lexical alignment between an utterance and its semantic representation is hidden. Moreover, some predicates cannot be clearly anchored to a token span. Therefore, we allow the generation algorithm to pick tokens and combine logical forms in arbitrary orders, conditioning on the entire set of sentential features. Alternative solutions in the traditional semantic parsing literature include a floating chart parser (Pasupat and Liang, 2015) which allows logical predicates to be constructed out of thin air.

Our transition system defines three actions, namely NT, TER, and RED, explained below.

NT(X) generates a Non-Terminal predicate. This predicate is either a natural language expression such as border, or one of the domain-general predicates exemplified in Table 1 (e.g., exclude). The type of predicate is determined by the placeholder X and, once generated, it is pushed onto the stack and represented as a non-terminal followed by an open bracket (e.g., 'border('). The open bracket will be closed by a reduce operation.

TER(X) generates a TERminal entity or the special predicate all. Note that the terminal choice does not include variables (e.g., $0, $1), since FunQL is a variable-free language which sufficiently captures the semantics of the datasets we work with. The framework could be extended to generate directed acyclic graphs by incorporating variables with additional transition actions for handling variable mentions and co-reference.

RED stands for REDuce and is used for subtree completion. It recursively pops elements from the stack until an open non-terminal node is encountered. The non-terminal is popped as well, after which a composite term representing the entire subtree, e.g., border(texas), is pushed back onto the stack. If a RED action results in having no more open non-terminals left on the stack, the transition system terminates. Table 2 shows the transition actions used to generate our running example.

The model generates the ungrounded representation U conditioned on utterance x by recursively calling one of the above three actions. Note that U is defined by a sequence of actions (denoted

Sentence: which states do not border texas
Non-terminal symbols in buffer: which, states, do, not, border
Terminal symbols in buffer: texas

Stack                                                   Action  NT choice  TER choice
                                                        NT      answer
answer (                                                NT      exclude
answer ( exclude (                                      NT      states
answer ( exclude ( states (                             TER                all
answer ( exclude ( states ( all                         RED
answer ( exclude ( states ( all )                       NT      border
answer ( exclude ( states ( all ), border (             TER                texas
answer ( exclude ( states ( all ), border ( texas       RED
answer ( exclude ( states ( all ), border ( texas )     RED
answer ( exclude ( states ( all ), border ( texas ))    RED
answer ( exclude ( states ( all ), border ( texas )))
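The NT/TER/RED derivation above can be replayed with a small stack machine. The sketch below is our illustration, not the authors' released code; the function and variable names are ours. It reproduces the FunQL string for the running example from the action sequence in the table.

```python
# Sketch of the NT/TER/RED transition system (illustrative; names are ours).

def run_transitions(actions):
    """Replay (action, choice) pairs and return the resulting FunQL string."""
    stack = []
    for act, choice in actions:
        if act == "NT":                      # push an open non-terminal, e.g. "border("
            stack.append(choice + "(")
        elif act == "TER":                   # push a terminal entity or the predicate `all`
            stack.append(choice)
        elif act == "RED":                   # pop up to the open non-terminal, close the subtree
            children = []
            while not stack[-1].endswith("("):
                children.append(stack.pop())
            nt = stack.pop()                 # the open non-terminal itself
            stack.append(nt + ", ".join(reversed(children)) + ")")
    assert len(stack) == 1, "ill-formed action sequence"
    return stack[0]

# Action sequence from Table 2 for "which states do not border texas":
actions = [("NT", "answer"), ("NT", "exclude"), ("NT", "states"), ("TER", "all"),
           ("RED", None), ("NT", "border"), ("TER", "texas"), ("RED", None),
           ("RED", None), ("RED", None)]
print(run_transitions(actions))
# answer(exclude(states(all), border(texas)))
```

Note that termination falls out of the RED semantics: the final RED closes the outermost non-terminal, leaving a single completed tree on the stack.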

Table 2: Actions taken by the transition system for generating the ungrounded meaning representation of the example utterance. Symbols in red indicate domain-general predicates.

by a) and a sequence of term choices (denoted by u) as shown in Table 2. The conditional probability p(U|x) is factorized over time steps as:

p(U|x) = p(a, u|x)                                                    (1)
       = ∏_{t=1}^{T} p(a_t | a_{<t}, x) · p(u_t | a_{<t}, x)^{I(a_t ≠ RED)}   (2)

entity) needs to be chosen from the candidate list depending on the specific placeholder X. To select a domain-general term, we use the same representation of the transition system e_t to compute a probability distribution over candidate terms:
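The factorization in equations (1)–(2) scores an action probability at every time step and a term probability only at NT and TER steps, since RED closes a subtree and takes no term argument (our reading of the indicator in the truncated equation). A toy sketch, with invented probabilities purely for illustration:

```python
import math

# Toy sketch of Eqs. (1)-(2): log p(U|x) sums an action log-probability at
# every time step, plus a term log-probability only when the action is NT or
# TER. The probability values below are invented for illustration.
def log_p_U(steps):
    """steps: list of (action, p_action, p_term); p_term is None for RED."""
    logp = 0.0
    for action, p_action, p_term in steps:
        logp += math.log(p_action)          # p(a_t | a_<t, x)
        if action != "RED":                 # indicator I(a_t != RED)
            logp += math.log(p_term)        # p(u_t | a_<t, x)
    return logp

steps = [("NT", 0.9, 0.8),    # e.g. NT(answer)
         ("NT", 0.7, 0.6),    # e.g. NT(border)
         ("TER", 0.8, 0.9),   # e.g. TER(texas)
         ("RED", 0.95, None)]
print(round(log_p_U(steps), 4))
```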

of g_t given u_t with a bi-linear neural network:

p(g_t | u_t) ∝ exp( u_t · W_ug · g_t )                                (5)

where u_t is the contextual representation of the ungrounded term given by the bidirectional LSTM, g_t is the grounded term embedding, and W_ug is the weight matrix.

The above grounding step can be interpreted as learning a lexicon: the model exclusively relies on the intermediate representation U to predict the target meaning representation G without taking into account any additional features based on the utterance. In practice, U may provide sufficient contextual background for closed-domain semantic parsing, where an ungrounded predicate often maps to a single grounded predicate, but it is a relatively impoverished representation for parsing large open-domain knowledge bases like Freebase. In this case, we additionally rely on a discriminative reranker which ranks the grounded representations derived from ungrounded representations (see Section 3.4).

3.3 Training Objective

When the target meaning representation is available, we directly compare it against our predictions and back-propagate. When only denotations are available, we compare surrogate meaning representations against our predictions (Reddy et al., 2014). Surrogate representations are those with the correct denotations. When there exist multiple surrogate representations,3 we select one randomly and back-propagate. The global effect of the above update rule is close to maximizing the marginal likelihood of denotations, which differs from recent work on weakly-supervised semantic parsing based on reinforcement learning (Neelakantan et al., 2017).

3 The average number of Freebase surrogate representations obtained with highest denotation match (F1) is 1.4.

Consider utterance x with ungrounded meaning representation U and grounded meaning representation G. Both U and G are defined with a sequence of transition actions (the same for U and G) and a sequence of terms (different for U and G). Recall that a = [a_1, ..., a_n] denotes the transition action sequence defining U and G; let u = [u_1, ..., u_k] denote the ungrounded terms (e.g., predicates), and g = [g_1, ..., g_k] the grounded terms. We aim to maximize the likelihood of the grounded meaning representation p(G|x) over all training examples. This likelihood can be decomposed into the likelihood of the grounded action sequence p(a|x) and the grounded term sequence p(g|x), which we optimize separately.

For the grounded action sequence (which by design is the same as the ungrounded action sequence and therefore the output of the transition system), we can directly maximize the log likelihood log p(a|x) for all examples:

L_a = Σ_{x∈T} log p(a|x) = Σ_{x∈T} Σ_{t=1}^{n} log p(a_t|x)           (6)

where T denotes the examples in the training data. For the grounded term sequence g, since the intermediate ungrounded terms are latent, we maximize the expected log likelihood of the grounded terms Σ_u [p(u|x) log p(g|u, x)] for all examples, which is a lower bound of the log likelihood log p(g|x):

L_g = Σ_{x∈T} Σ_u [p(u|x) log p(g|u, x)]
    = Σ_{x∈T} Σ_u [p(u|x) Σ_{t=1}^{k} log p(g_t|u_t)]                 (7)

The final objective is the combination of L_a and L_g, denoted as L_G = L_a + L_g. We optimize this objective with the method described in Lei et al. (2016).

3.4 Reranker

As discussed above, for open-domain semantic parsing, solely relying on the ungrounded representation would result in an impoverished model lacking sentential context useful for disambiguation decisions. For all Freebase experiments, we followed previous work (Berant et al., 2013; Berant and Liang, 2014; Reddy et al., 2014) in additionally training a discriminative ranker to re-rank grounded representations globally.

The discriminative ranker is a maximum-entropy model (Berant et al., 2013). The objective is to maximize the log likelihood of the correct answer y given x by summing over all grounded candidates G with denotation y (i.e., [[G]]_K = y):

L_y = Σ_{(x,y)∈T} log Σ_{[[G]]_K = y} p(G|x)                          (8)

p(G|x) ∝ exp{ f(G, x) }                                               (9)

where f(G, x) is a feature function that maps pair (G, x) into a feature vector. We give details on the features we used in Section 4.2.

4 Experiments

In this section, we verify empirically that our semantic parser derives useful meaning representations. We give details on the evaluation datasets and baselines used for comparison. We also describe implementation details and the features used in the discriminative ranker.

4.1 Datasets

We evaluated our model on the following datasets, which cover different domains and use different types of training data, i.e., pairs of natural language utterances and grounded meanings or question-answer pairs.

GEOQUERY (Zelle and Mooney, 1996) contains 880 questions and database queries about US geography. The utterances are compositional, but the language is simple and the vocabulary size small. The majority of questions include at most one entity. SPADES (Bisk et al., 2016) contains 93,319 questions derived from CLUEWEB09 (Gabrilovich et al., 2013) sentences. Specifically, the questions were created by randomly removing an entity, thus producing sentence-denotation pairs (Reddy et al., 2014). The sentences include two or more entities and, although they are not very compositional, they constitute a large-scale dataset for neural network training. WEBQUESTIONS (Berant et al., 2013) contains 5,810 question-answer pairs. Similar to SPADES, it is based on Freebase and the questions are not very compositional. However, they are real questions asked by people on the Web. Finally, GRAPHQUESTIONS (Su et al., 2016) contains 5,166 question-answer pairs which were created by showing 500 Freebase graph queries to Amazon Mechanical Turk workers and asking them to paraphrase them into natural language.

4.2 Implementation Details

Amongst the four datasets described above, GEOQUERY has annotated logical forms which we directly use for training. For the other three datasets, we treat surrogate meaning representations which lead to the correct answer as gold standard. The surrogates were selected from a subset of candidate Freebase graphs, which were obtained by entity linking. Entity mentions in SPADES have been automatically annotated with Freebase entities (Gabrilovich et al., 2013). For WEBQUESTIONS and GRAPHQUESTIONS, we follow the procedure described in Reddy et al. (2016). We identify potential entity spans using seven handcrafted part-of-speech patterns and associate them with Freebase entities obtained from the Freebase/KG API.4 We use a structured perceptron trained on the entities found in WEBQUESTIONS and GRAPHQUESTIONS to select the top 10 non-overlapping entity disambiguation possibilities. We treat each possibility as a candidate input utterance, and use the perceptron score as a feature in the discriminative reranker, thus leaving the final disambiguation to the semantic parser.

4 http://developers.google.com/freebase/

Apart from the entity score, the discriminative ranker uses the following basic features. The first feature is the likelihood score of a grounded representation, aggregating all intermediate representations. The second set of features includes the embedding similarity between the relation and the utterance, as well as the similarity between the relation and the question words. The last set of features includes the answer type as indicated by the last word in the Freebase relation (Xu et al., 2016).

We used the Adam optimizer for training with an initial learning rate of 0.001, two momentum parameters [0.99, 0.999], and batch size 1. The dimensions of the word embeddings, LSTM states, entity embeddings and relation embeddings are [50, 100, 100, 100]. The word embeddings were initialized with GloVe embeddings (Pennington et al., 2014). All other embeddings were randomly initialized.

4.3 Results

Experimental results on the four datasets are summarized in Tables 3–6. We present comparisons of our system, which we call SCANNER (as a shorthand for SymboliC meANiNg rEpResentation), against a variety of models previously described in the literature.

GEOQUERY results are shown in Table 5. The first block contains symbolic systems, whereas neural models are presented in the second block. We report accuracy, which is defined as the proportion of utterances that are correctly parsed to their gold standard logical forms. All previous neural systems (Dong and Lapata, 2016; Jia and Liang, 2016) treat semantic parsing as a sequence transduction problem and use LSTMs to directly map utterances to logical forms. SCANNER yields performance improvements over these

Models                       F1
Berant et al. (2013)         35.7
Yao and Van Durme (2014)     33.0
Berant and Liang (2014)      39.9
Bast and Haussmann (2015)    49.4
Berant and Liang (2015)      49.7
Reddy et al. (2016)          50.3
Bordes et al. (2014)         39.2
Dong et al. (2015)           40.8
Yih et al. (2015)            52.5
Xu et al. (2016)             53.3
Neural Baseline              48.3
SCANNER                      49.4

Table 3: WEBQUESTIONS results.

Models                                F1
SEMPRE (Berant et al., 2013)          10.80
PARASEMPRE (Berant and Liang, 2014)   12.79
JACANA (Yao and Van Durme, 2014)      5.08
Neural Baseline                       16.24
SCANNER                               17.02

Table 4: GRAPHQUESTIONS results. Numbers for comparison systems are from Su et al. (2016).

Models                                 Accuracy
Zettlemoyer and Collins (2005)         79.3
Zettlemoyer and Collins (2007)         86.1
Kwiatkowksi et al. (2010)              87.9
Kwiatkowski et al. (2011)              88.6
Kwiatkowski et al. (2013)              88.0
Zhao and Huang (2015)                  88.9
Liang et al. (2011)                    91.1
Dong and Lapata (2016)                 84.6
Jia and Liang (2016)                   85.0
Jia and Liang (2016) with extra data   89.1
SCANNER                                86.7

Table 5: GEOQUERY results.

Models                                   F1
Unsupervised CCG (Bisk et al., 2016)     24.8
Semi-supervised CCG (Bisk et al., 2016)  28.4
Neural baseline                          28.6
Supervised CCG (Bisk et al., 2016)       30.9
Rule-based system (Bisk et al., 2016)    31.4
SCANNER                                  31.5

Table 6: SPADES results.

systems when using comparable data sources for training. Jia and Liang (2016) achieve better results with synthetic data that expands GEOQUERY; we could adopt their approach to improve model performance; however, we leave this to future work.

Table 6 reports SCANNER's performance on SPADES. For all Freebase-related datasets we use average F1 (Berant et al., 2013) as our evaluation metric. Previous work on this dataset has used a semantic parsing framework similar to ours where natural language is converted to an intermediate syntactic representation and then grounded to Freebase. Specifically, Bisk et al. (2016) evaluate the effectiveness of four different CCG parsers on the semantic parsing task when varying the amount of supervision required. As can be seen, SCANNER outperforms all CCG variants (from unsupervised to fully supervised) without having access to any manually annotated derivations or lexicons. For fair comparison, we also built a neural baseline that encodes an utterance with a recurrent neural network and then predicts a grounded meaning representation directly (Ture and Jojic, 2016; Yih et al., 2016). Again, we observe that SCANNER outperforms this baseline.

Results on WEBQUESTIONS are summarized in Table 3. SCANNER obtains performance on par with the best symbolic systems (see the first block in the table). It is important to note that Bast and Haussmann (2015) develop a question answering system, which contrary to ours cannot produce meaning representations, whereas Berant and Liang (2015) propose a sophisticated agenda-based parser which is trained borrowing ideas from imitation learning. SCANNER is conceptually similar to Reddy et al. (2016), who also learn a semantic parser via intermediate representations which they generate based on the output of a dependency parser. SCANNER performs competitively despite not having access to any linguistically-informed syntactic structures. The second block in Table 3 reports the results of several neural systems. Xu et al. (2016) represent the state of the art on WEBQUESTIONS. Their system uses Wikipedia to prune out erroneous candidate answers extracted from Freebase. Our model would also benefit from a similar post-processing step. As in previous experiments, SCANNER outperforms the neural baseline, too.

Finally, Table 4 presents our results on GRAPHQUESTIONS. We report F1 for SCANNER, the neural baseline model, and three symbolic systems presented in Su et al. (2016). SCANNER achieves a new state of the art on this dataset with a gain of 4.23 F1 points over the best previously reported model.

4.4 Analysis of Intermediate Representations

Since a central feature of our parser is that it learns intermediate representations with natural language predicates, we conducted additional experiments in order to inspect their quality. For GEOQUERY

Metrics          Accuracy
Exact match      79.3
Structure match  89.6
Token match      96.5

Table 7: GEOQUERY evaluation of ungrounded meaning representations. We report accuracy against a manually created gold standard.

Dataset           SCANNER  Baseline
SPADES            51.2     45.5
–conj (1422)      56.1     66.4
–control (132)    28.3     40.5
–pp (3489)        46.2     23.1
–subord (76)      37.9     52.9
WEBQUESTIONS      42.1     25.5
GRAPHQUESTIONS    11.9     15.3

Table 8: Evaluation of predicates induced by SCANNER against EASYCCG. We report F1 (%) across datasets. For SPADES, we also provide a breakdown for various utterance types.

which contains only 280 test examples, we manually annotated intermediate representations for the test instances and evaluated the learned representations against them. The experimental setup aims to show how humans can participate in improving the semantic parser with feedback at the intermediate stage. In terms of evaluation, we use the three metrics shown in Table 7. The first row shows the percentage of exact matches between the predicted representations and the human annotations. The second row refers to the percentage of structure matches, where the predicted representations have the same structure as the human annotations but may not use the same lexical terms. Among structurally correct predictions, we additionally compute how many tokens are correct, as shown in the third row. As can be seen, the induced meaning representations overlap to a large extent with the human gold standard.

We also evaluated the intermediate representations created by SCANNER on the other three (Freebase) datasets. Since creating a manual gold standard for these large datasets is time-consuming, we compared the induced representations against the output of a syntactic parser. Specifically, we converted the questions to event-argument structures with EASYCCG (Lewis and Steedman, 2014), a high coverage and high accuracy CCG parser. EASYCCG extracts predicate-argument structures with a labeled F-score of 83.37%. For further comparison, we built a simple baseline which identifies predicates based on the output of the Stanford POS-tagger (Manning et al., 2014), following the ordering VBD ≻ VBN ≻ VB ≻ VBP ≻ VBZ ≻ MD.

As shown in Table 8, on SPADES and WEBQUESTIONS, the predicates learned by our model match the output of EASYCCG more closely than the heuristic baseline. But for GRAPHQUESTIONS, which contains more compositional questions, the mismatch is higher. However, since the key idea of our model is to capture salient meaning for the task at hand rather than strictly obey syntax, we would not expect the predicates induced by our system to entirely agree with those produced by the syntactic parser. To further analyze how the learned predicates differ from syntax-based ones, we grouped utterances in SPADES into four types of linguistic constructions: coordination (conj), control and raising (control), prepositional phrase attachment (pp), and subordinate clauses (subord). Table 8 also shows the breakdown of matching scores per linguistic construction, with the number of utterances in each type. In Table 9, we provide examples of predicates identified by SCANNER, indicating whether they agree or not with the output of EASYCCG. As a reminder, the task in SPADES is to predict the entity masked by a blank symbol (__).

As can be seen in Table 8, the matching score is relatively high for utterances involving coordination and prepositional phrase attachments. The model will often identify informative predicates (e.g., nouns) which do not necessarily agree with linguistic intuition. For example, in the utterance wilhelm maybach and his son started maybach in 1909 (see Table 9), SCANNER identifies the predicate-argument structure son(wilhelm maybach) rather than started(wilhelm maybach). We also observed that the model struggles with control and subordinate constructions. It has difficulty distinguishing control from raising predicates, as exemplified in the utterance ceo john thain agreed to leave from Table 9, where it identifies the raising predicate agreed. For subordinate clauses, SCANNER tends to take shortcuts, identifying as predicates words closest to the blank symbol.

5 Discussion

We presented a neural semantic parser which converts natural language utterances to grounded meaning representations via intermediate predicate-argument structures. Our model

51 the boeing company was founded in 1916 and is was incorporated in 1947 and is based in headquartered in , illinois . new york city . nstar was founded in 1886 and is based in boston , . the ifbb was formed in 1946 by president ben weider conj the is owned and operated by zuffa , llc , and his brother . headquarted in las vegas , nevada . wilhelm maybach and his son started maybach in hugh attended and then shifted to uppingham school 1909 . in england . was founded in 1996 and is headquartered in chicago . threatened to kidnap russ . agreed to purchase wachovia corp . has also been confirmed to play captain haddock . ceo john thain agreed to leave . hoffenberg decided to leave . so nick decided to create . control is reportedly trying to get impregnated by djimon salva later went on to make the non clown-based horror now . . for right now , are inclined to trust obama to do just eddie dumped debbie to marry when carrie was 2 . that . is the home of the university of tennessee . jobs will retire from . chu is currently a physics professor at . the nab is a strong advocacy group in . pp youtube is based in , near san francisco , california . this one starred robert reed , known mostly as . mathematica is a product of . is positively frightening as detective bud white . founded the , which is now also a designated terrorist the is a national testing board that is based in toronto . group . is a corporation that is wholly owned by the is an online bank that ebay owns . city of edmonton . zoya akhtar is a director , who has directed the subord unborn is a scary movie that stars . upcoming movie . ’s third wife was actress melina mercouri , who died imelda staunton , who plays , is genius . in 1994 . sure , there were who liked the shah . is the important president that american ever had . plus mitt romney is the worst governor that has had . Table 9: Informative predicates identified by SCANNER in various types of utterances. 
Yellow predicates were identified by both SCANNER and EASYCCG, red predicates by SCANNER alone, and green predicates by EASYCCG alone.

essentially jointly learns how to parse natural language semantics and the lexicons that help grounding. Compared to previous neural semantic parsers, our model is more interpretable, as the intermediate structures are useful for inspecting what the model has learned and whether it matches linguistic intuition.

An assumption our model imposes is that ungrounded and grounded representations are structurally isomorphic. An advantage of this assumption is that tokens in the ungrounded and grounded representations are strictly aligned. This allows the neural network to focus on parsing and lexical mapping, sidestepping the challenging structure mapping problem, which would result in a larger search space and higher variance. On the negative side, the structural isomorphism assumption restricts the expressiveness of the model, especially since one of the main benefits of adopting a two-stage parser is the potential of capturing domain-independent semantic information via the intermediate representation. While it would be challenging to handle drastically non-isomorphic structures in the current model, it is possible to perform local structure matching, i.e., when the mapping between natural language and domain-specific predicates is many-to-one or one-to-many. For instance, Freebase does not contain a relation representing daughter, using instead two relations representing female and child. Previous work (Kwiatkowski et al., 2013) models such cases by introducing collapsing (for many-to-one mapping) and expansion (for one-to-many mapping) operators. Within our current framework, these two types of structural mismatches can be handled with semi-Markov assumptions (Sarawagi and Cohen, 2005; Kong et al., 2016) in the parsing (i.e., predicate selection) and grounding steps, respectively. Aside from relaxing strict isomorphism, we would also like to perform cross-domain semantic parsing, where the first stage of the semantic parser is shared across domains.

Acknowledgments

We would like to thank three anonymous reviewers, members of the Edinburgh ILCC and IBM Watson, and Abulhair Saparov for feedback. The support of the European Research Council under award number 681760 "Translating Multiple Modalities into Text" is gratefully acknowledged.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly

learning to align and translate. In Proceedings of ICLR 2015. San Diego, California.

Hannah Bast and Elmar Haussmann. 2015. More accurate question answering on Freebase. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, pages 1431–1440.

Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question answering over Freebase with multi-column convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China, pages 260–269.

Emily M. Bender, Dan Flickinger, Stephan Oepen, Woodley Packard, and Ann Copestake. 2015. Layers of interpretation: On grammar and compositionality. In Proceedings of the 11th International Conference on Computational Semantics. London, UK, pages 239–249.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, pages 1533–1544.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland, pages 1415–1425.

Jonathan Berant and Percy Liang. 2015. Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics 3:545–558.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China, pages 334–343.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California, pages 199–209.

Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the abstract meaning representation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland, pages 1426–1436.

Yonatan Bisk, Siva Reddy, John Blitzer, Julia Hockenmaier, and Mark Steedman. 2016. Evaluating induced CCG parsers on grounded semantic parsing. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 2022–2027.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, pages 615–620.

Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Sofia, Bulgaria, pages 423–433.

Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya. 2013. FACC1: Freebase annotation of ClueWeb corpora, version 1 (release date 2013-06-26, format version 1, correction level 0).

Matt Gardner and Jayant Krishnamurthy. 2017. Open-vocabulary semantic parsing with both distributional statistics and formal knowledge. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, California, pages 3195–3201.

Jonas Groschwitz, Alexander Koller, and Christoph Teichmann. 2015. Graph parsing with s-graph grammars. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China, pages 1481–1490.

Do Kook Choe and Eugene Charniak. 2016. Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 2331–2336.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany, pages 33–43.

Steve Harris, Andy Seaborne, and Eric Prud'hommeaux. 2013. SPARQL 1.1 query language. W3C recommendation 21(10).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Florian Holzschuher and René Peinl. 2013. Performance of graph query languages: comparison of

cypher, gremlin and native access in Neo4j. In Proceedings of the Joint EDBT/ICDT 2013 Workshops. ACM, pages 195–204.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany, pages 12–22.

Rohit J. Kate, Yuk Wah Wong, and Raymond J. Mooney. 2005. Learning to transform natural to formal languages. In Proceedings of the 20th National Conference on Artificial Intelligence. Pittsburgh, Pennsylvania, pages 1062–1068.

Tomáš Kočiský, Gábor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 1078–1087.

Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Segmental recurrent neural networks. In Proceedings of ICLR 2016. San Juan, Puerto Rico.

Jayant Krishnamurthy and Tom Mitchell. 2012. Weakly supervised training of semantic parsers. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Jeju Island, Korea, pages 754–765.

Jayant Krishnamurthy and Tom M. Mitchell. 2015. Learning a compositional semantics for Freebase with an open predicate vocabulary. Transactions of the Association for Computational Linguistics 3:257–270.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Cambridge, MA, pages 1223–1233.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pages 1545–1556.

Mike Lewis and Mark Steedman. 2014. A* CCG parsing with a supertag-factored model. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, pages 990–1000.

Percy Liang. 2013. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408.

Percy Liang, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, pages 590–599.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Baltimore, Maryland, pages 55–60.

Arvind Neelakantan, Quoc V. Le, Martin Abadi, Andrew McCallum, and Dario Amodei. 2017. Learning a natural language interface with neural programmer. In Proceedings of ICLR 2017. Toulon, France.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China, pages 1470–1480.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, pages 1532–1543.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics 2:377–392.

Siva Reddy, Oscar Täckström, Michael Collins, Tom
Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming dependency structures to logical forms for semantic parsing. Transactions of the Association for Computational Linguistics 4:127–140.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland, pages 1512–1523.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 107–117.

Sunita Sarawagi and William W. Cohen. 2005. Semi-Markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems 17. MIT Press, pages 1185–1192.

Yu Su, Huan Sun, Brian Sadler, Mudhakar Srivatsa, Izzeddin Gur, Zenghui Yan, and Xifeng Yan. 2016. On generating characteristic-rich question sets for

QA evaluation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 562–572.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27. MIT Press, pages 3104–3112.

Ferhan Ture and Oliver Jojic. 2016. Simple and effective question answering with recurrent neural networks. arXiv preprint arXiv:1606.05029.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems 28. MIT Press, pages 2773–2781.

Yuk Wah Wong and Raymond Mooney. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. New York City, USA, pages 439–446.

Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany, pages 2326–2336.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland, pages 956–966.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China, pages 1321–1331.

Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Berlin, Germany, pages 201–206.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the 13th National Conference on Artificial Intelligence. Portland, Oregon, pages 1050–1055.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence. Edinburgh, Scotland, pages 658–666.

Luke Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Prague, Czech Republic, pages 678–687.

Kai Zhao and Liang Huang. 2015. Type-driven incremental semantic parsing with polymorphism. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, Colorado, pages 1416–1421.
