Evaluating Semantic Parsing against a Simple Web-based Question Answering Model

Alon Talmor, Mor Geva, Jonathan Berant
Tel-Aviv University
[email protected], [email protected], [email protected]

Abstract

Semantic parsing shines at analyzing complex natural language that involves composition and computation over multiple pieces of evidence. However, datasets for semantic parsing contain many factoid questions that can be answered from a single web document. In this paper, we propose to evaluate semantic parsing-based question answering models by comparing them to a question answering baseline that queries the web and extracts the answer only from web snippets, without access to the target knowledge-base. We investigate this approach on COMPLEXQUESTIONS, a dataset designed to focus on compositional language, and find that our model obtains reasonable performance (∼35 F1 compared to 41 F1 of state-of-the-art). We find in our analysis that our model performs well on complex questions involving conjunctions, but struggles on questions that involve relation composition and superlatives.

1 Introduction

Question answering (QA) has witnessed a surge of interest in recent years (Hill et al., 2015; Yang et al., 2015; Pasupat and Liang, 2015; Chen et al., 2016; Joshi et al., 2017), as it is one of the prominent tests for natural language understanding. QA can be coarsely divided into semantic parsing-based QA, where a question is translated into a logical form that is executed against a knowledge-base (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011; Kwiatkowski et al., 2013; Reddy et al., 2014; Berant and Liang, 2015), and unstructured QA, where a question is answered directly from some relevant text (Voorhees and Tice, 2000; Hermann et al., 2015; Hewlett et al., 2016; Kadlec et al., 2016; Seo et al., 2016).

In semantic parsing, background knowledge has already been compiled into a knowledge-base (KB), and thus the challenge is in interpreting the question, which may contain compositional constructions (“What is the second-highest mountain in Europe?”) or computations (“What is the difference in population between France and Germany?”). In unstructured QA, the model needs to also interpret the language of a document, and thus most datasets focus on matching the question against the document and extracting the answer from some local context, such as a sentence or a paragraph (Onishi et al., 2016; Rajpurkar et al., 2016; Yang et al., 2015).

Since semantic parsing models excel at handling complex linguistic constructions and reasoning over multiple facts, a natural way to examine whether a benchmark indeed requires modeling these properties is to train an unstructured QA model and check whether it under-performs compared to semantic parsing models. If questions can be answered by examining local contexts only, then the use of a knowledge-base is perhaps unnecessary. However, to the best of our knowledge, only models that utilize the KB have been evaluated on common semantic parsing benchmarks.

The goal of this paper is to bridge this evaluation gap. We develop a simple log-linear model, in the spirit of traditional web-based QA systems (Kwok et al., 2001; Brill et al., 2002), that answers questions by querying the web and extracting the answer from returned web snippets. Thus, our evaluation scheme is suitable for semantic parsing benchmarks in which the knowledge required for answering questions is covered by the web (in contrast with virtual assistants, for which the knowledge is specific to an application).

We test this model on COMPLEXQUESTIONS (Bao et al., 2016), a dataset designed to require more compositionality compared to earlier datasets, such as WEBQUESTIONS (Berant et al., 2013) and SIMPLEQUESTIONS (Bordes et al., 2015). We find that a simple QA model, despite having no access to the target KB, performs reasonably well on this dataset (∼35 F1 compared to the state-of-the-art of 41 F1). Moreover, for the subset of questions for which the right answer can be found in one of the web snippets, we outperform the semantic parser (51.9 F1 vs. 48.5 F1). We analyze results for different types of compositionality and find that superlatives and relation composition constructions are challenging for a web-based QA system, while conjunctions and events with multiple arguments are easier.

An important insight is that semantic parsers must overcome the mismatch between natural language and formal language. Consequently, language that can be easily matched against the web may become challenging to express in logical form. For example, the word “wife” is an atomic binary relation in natural language, but is expressed with the complex binary λx.λy.Spouse(x, y) ∧ Gender(x, Female) in knowledge-bases. Thus, some of the complexity of understanding natural language is removed when working with a natural language representation.

To conclude, we propose to evaluate the extent to which semantic parsing-based QA benchmarks require compositionality by comparing semantic parsing models to a baseline that extracts the answer from short web snippets. We obtain reasonable performance on COMPLEXQUESTIONS, and analyze the types of compositionality that are challenging for a web-based QA model. To ensure reproducibility, we release our dataset, which attaches to each example from COMPLEXQUESTIONS the top-100 retrieved web snippets (the data can be downloaded from https://worksheets.codalab.org/worksheets/0x91d77db37e0a4bbbaeb37b8972f4784f/).

2 Problem Setting and Dataset

Given a training set of triples $\{(q^{(i)}, R^{(i)}, a^{(i)})\}_{i=1}^{N}$, where $q^{(i)}$ is a question, $R^{(i)}$ is a web result set, and $a^{(i)}$ is the answer, our goal is to learn a model that produces an answer a for a new question-result set pair (q, R). A web result set R consists of K(=100) web snippets, where each snippet s_i has a title and a text fragment. A training example is shown in Figure 1.

s1: Billy Batts (Character) - Biography - IMDb. Billy Batts (Character) on IMDb: Movies, TV, Celebs, and more... Devino is portrayed by Frank Vincent in the film Goodfellas. Page last updated by !!!de leted!!!
s2: Frank Vincent - Wikipedia. He appeared in Scorsese’s 1990 film Goodfellas, where he played Billy Batts, a made man in the Gambino crime family. He also played a role in Scorsese’s...
...
s100: Voice-over in Goodfellas. In the summer when they played cards all night, nobody ever called the cops. ... But we had a problem with Billy Batts. This was a touchy thing. Tommy had killed a made man. Billy was a part of the Bambino crew and untouchable. Before you...
q: “who played the part of billy batts in goodfellas?”
a: “Frank Vincent”

Figure 1: A training example containing a result set R, a question q and an answer a. The result set R contains 100 web snippets s_i, each including a title (boldface) and text. The answer is underlined.

Semantic parsing-based QA datasets contain question-answer pairs alongside a background KB. To convert such datasets to our setup, we run the question q against Google’s search engine and scrape the top-K web snippets. We use only the web snippets and ignore any boxes or other information returned (see Figure 1 and the full dataset in the supplementary material).
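In outline, a single training triple can be represented as follows. This is a minimal illustrative sketch in Python; the class and field names (Snippet, Example, rank) are illustrative and not taken from the released data format.

```python
# A minimal sketch of one training triple (q, R, a) as described above.
# Class and field names are illustrative, not the authors' released format.
from dataclasses import dataclass
from typing import List

@dataclass
class Snippet:
    title: str  # snippet title returned by the search engine
    text: str   # snippet body text
    rank: int   # 1-based position (Google rank) in the result set R

@dataclass
class Example:
    question: str            # q
    snippets: List[Snippet]  # R, typically K = 100 snippets
    answers: List[str]       # a, the gold answer set (often a singleton)

example = Example(
    question="who played the part of billy batts in goodfellas?",
    snippets=[Snippet(
        title="Frank Vincent - Wikipedia",
        text="He appeared in Scorsese's 1990 film Goodfellas, where he "
             "played Billy Batts, a made man in the Gambino crime family.",
        rank=2)],
    answers=["Frank Vincent"],
)
```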

Compositionality  We argue that if a dataset truly requires a compositional model, then it should be difficult to tackle with methods that only match the question against short web snippets. This is because it is unlikely that all the necessary pieces of evidence can be integrated from such snippets.

We convert COMPLEXQUESTIONS into the aforementioned format, and manually analyze the types of compositionality that occur in 100 random training examples. Table 1 provides an example for each of the question types we found:

- SIMPLE: an application of a single binary relation on a single entity.
- FILTER: a question where the semantic type of the answer is mentioned (“tv shows” in Table 1).
- N-ARY: a question about a single event that involves more than one entity (“juni” and “spy kids 4” in Table 1).
- CONJUNCTION: a question whose answer is the conjunction of more than one binary relation in the question.
- COMPOSITION: a question that involves composing more than one binary relation over an entity (“grandson” and “father” in Table 1).
- SUPERLATIVE: a question that requires sorting or comparing entities based on a numeric property.
- OTHER: any other question.

Type          Example                                                          %
SIMPLE        “who has gone out with cornelis de graeff”                       17%
FILTER        “which tv shows has wayne rostad starred in”                     18%
N-ARY         “who played juni in spy kids 4?”                                 51%
CONJUNCTION   “what has queen latifah starred in that doug mchenry directed”   10%
COMPOSITION   “who was the grandson of king david’s father?”                   7%
SUPERLATIVE   “who is the richest sports woman?”                               9%
OTHER         “what is the name george lopez on the show?”                     8%

Table 1: An example for each compositionality type and the proportion of examples in 100 random examples. A question can fall into multiple types, and thus the sum exceeds 100%.

Table 1 illustrates that COMPLEXQUESTIONS is dominated by N-ARY questions that involve an event with multiple entities. In Section 4 we evaluate the performance of a simple QA model for each compositionality type, and find that N-ARY questions are handled well by our web-based QA system.

3 Model

Our model comprises two parts. First, we extract a set of answer candidates, $\mathcal{A}$, from the web result set. Then, we train a log-linear model that outputs a distribution over the candidates in $\mathcal{A}$, and is used at test time to find the most probable answers.

Candidate Extraction  We extract all 1-grams, 2-grams, 3-grams and 4-grams (lowercased) that appear in R, yielding roughly 5,000 candidates per question. We then discard any candidate that fully appears in the question itself, and define $\mathcal{A}$ to be the top-K candidates based on their tf-idf score, where term frequency is computed on all the snippets in R, and inverse document frequency is computed on a large external corpus.

Candidate Ranking  We define a log-linear model over the candidates in $\mathcal{A}$:

$$p_\theta(a \mid q, R) = \frac{\exp\left(\phi(q, R, a)^\top \theta\right)}{\sum_{a' \in \mathcal{A}} \exp\left(\phi(q, R, a')^\top \theta\right)},$$

where $\theta \in \mathbb{R}^d$ are learned parameters and $\phi(\cdot) \in \mathbb{R}^d$ is a feature function. We train our model by maximizing the regularized conditional log-likelihood objective $\sum_{i=1}^{N} \log p_\theta(a^{(i)} \mid q^{(i)}, R^{(i)}) - \lambda \lVert\theta\rVert_2^2$. At test time, we return the most probable answers based on $p_\theta(a \mid q, R)$ (details in Section 4). While semantic parsers generally return a set, in COMPLEXQUESTIONS 87% of the answers are a singleton set.

Features  A candidate span a often has multiple mentions in the result set R. Therefore, our feature function $\phi(\cdot)$ computes the average of the features extracted from each mention. The main information sources used are the match between the candidate answer itself and the question (top of Table 2), the match between the context of a candidate answer in a specific mention and the question (bottom of Table 2), and the Google rank in which the mention appeared.

Lexicalized features are useful for our task, but the number of training examples is too small to train a fully lexicalized model. Therefore, we define lexicalized features over the 50 most common non-stop words in COMPLEXQUESTIONS. Last, our context features are defined in a 6-word window around the candidate answer mention, where the feature value decays exponentially as the distance from the candidate answer mention grows. Overall, we compute a total of 892 features over the dataset.

Template            Description
SPAN LENGTH         Indicator for the number of tokens in a_m
TF-IDF              Binned and raw tf-idf scores of a_m for every span length
CAPITALIZED         Whether a_m is capitalized
STOPWORD            Fraction of words in a_m that are stop words
IN QUEST            Fraction of words in a_m that are in q
IN QUEST + COMMON   Conjunction of IN QUEST with common words in q
IN QUESTION DIST.   Max./avg. cosine similarity between a_m words and q words
WH + NE             Conjunction of wh-word in q and named entity tags (NE) of a_m
WH + POS            Conjunction of wh-word in q and part-of-speech tags of a_m
NE + NE             Conjunction of NE tags in q and NE tags in a_m
NE + COMMON         Conjunction of NE tags in a_m and common words in q
MAX-NE              Whether a_m is a NE with maximal span (not contained in another NE)
YEAR                Binned indicator for year if a_m is a year
CTXT MATCH          Max./avg. over non-stop words in q, for whether a q word occurs around a_m, weighted by distance from a_m
CTXT SIMILARITY     Max./avg. cosine similarity over non-stop words in q, between q words and words around a_m, weighted by distance
IN TITLE            Whether a_m is in the title part of the snippet
CTXT ENTITY         Indicator for whether a common word appears between a_m and a named entity that appears in q
GOOGLE RANK         Binned snippet rank of a_m in the result set R

Table 2: Feature templates used to extract features from each answer candidate mention a_m. Cosine similarity is computed with pre-trained GloVe embeddings (Pennington et al., 2014). The definition of common words and weighting by distance is in the body of the paper.

4 Experiments

COMPLEXQUESTIONS contains 1,300 training examples and 800 test examples. We performed 5 random 70/30 splits of the training set for development. We computed POS tags and named entities with Stanford CoreNLP (Manning et al., 2014). We did not employ any co-reference resolution tool in this work. If, after candidate extraction, we do not find the gold answer in the top-K(=140) candidates, we discard the example, resulting in a training set of 856 examples.

We compare our model, WEBQA, to STAGG (Yih et al., 2015) and COMPQ (Bao et al., 2016), which are, to the best of our knowledge, the highest performing semantic parsing models on both COMPLEXQUESTIONS and WEBQUESTIONS. For these systems, we only report test F1 numbers that are provided in the original papers, as we do not have access to the code or predictions. We evaluate models by computing average F1, the official evaluation metric defined for COMPLEXQUESTIONS. This measure computes the F1 between the set of answers returned by the system and the set of gold answers, and averages across questions. To allow WEBQA to return a set rather than a single answer, we return the most probable answer $a^*$ as well as any answer a such that $\phi(q, R, a^*)^\top\theta - \phi(q, R, a)^\top\theta < 0.5$. We also compute precision@1 and Mean Reciprocal Rank (MRR) for WEBQA, since we have a ranking over answers. To compute metrics we lowercase the gold and predicted spans and perform exact string match.

Table 3 presents the results of our evaluation. WEBQA obtained 32.6 F1 (33.5 p@1, 42.4 MRR) compared to 40.9 F1 of COMPQ. Our candidate extraction step finds the correct answer in the top-K candidates in 65.9% of development examples and 62.7% of test examples. Thus, our test F1 on examples for which candidate extraction succeeded (WEBQA-SUBSET) is 51.9 (53.4 p@1, 67.5 MRR).

System            Dev F1   Dev p@1   Test F1   Test p@1   Test MRR
STAGG             -        -         37.7      -          -
COMPQ             -        -         40.9      -          -
WEBQA             35.3     36.4      32.6      33.5       42.4
WEBQA-EXTRAPOL    -        -         34.4      -          -

COMPQ-SUBSET      -        -         48.5      -          -
WEBQA-SUBSET      53.6     55.1      51.9      53.4       67.5

Table 3: Results on the development set (average over random splits) and the test set. Middle: results on all examples. Bottom: results on the subset where candidate extraction succeeded.

We were able to indirectly compare WEBQA-SUBSET to COMPQ: Bao et al. (2016) graciously provided us with the predictions of COMPQ when it was trained on COMPLEXQUESTIONS, WEBQUESTIONS, and SIMPLEQUESTIONS. In this setup, COMPQ obtained 42.2 F1 on the test set (compared to 40.9 F1 when training on COMPLEXQUESTIONS only, as we do). Restricting the predictions to the subset for which candidate extraction succeeded, the F1 of COMPQ-SUBSET is 48.5, which is 3.4 F1 points lower than WEBQA-SUBSET, even though the latter was trained on less data.

Not using a KB results in a considerable disadvantage for WEBQA. KB entities have normalized descriptions, and the answers have been annotated according to those descriptions. We, conversely, find answers on the web and often predict a correct answer, but get penalized due to small string differences. E.g., for “what is the longest river in China?” we answer “yangtze river”, while the gold answer is “yangtze”. To quantify this effect we manually annotated all 258 examples in the first random development set split, and determined whether string matching failed and we actually returned the gold answer (we also publicly release our annotations). This improved performance from 53.6 F1 to 56.6 F1 (on examples that passed candidate extraction). Further normalizing gold and predicted entities, such that “Hillary Clinton” and “Hillary Rodham Clinton” are unified, improved F1 to 57.3 F1. Extrapolating this to the test set would result in an F1 of 34.4 (WEBQA-EXTRAPOL in Table 3) and 34.9, respectively.

Last, to determine the contribution of each feature template, we performed ablation tests, and we present the five feature templates that resulted in the largest drop in performance on the development set in Table 4. Note that TF-IDF is by far the most impactful feature, leading to a large drop of 12 points in performance. This shows the importance of using the redundancy of the web for our QA system.
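The answer-set selection rule and the average F1 metric used above can be sketched as follows. This is an illustrative sketch only: scores maps each candidate to its linear score φ(q, R, a)ᵀθ, and the 0.5 margin follows the rule stated earlier.

```python
def select_answer_set(scores, margin=0.5):
    """Return the top candidate plus any candidate whose linear score is
    within `margin` of the best one."""
    best = max(scores.values())
    return {a for a, s in scores.items() if best - s < margin}

def f1(predicted, gold):
    """Set-level F1 between lowercased predicted and gold answer strings."""
    predicted = {p.lower() for p in predicted}
    gold = {g.lower() for g in gold}
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def average_f1(all_predicted, all_gold):
    """Average F1 across questions, as in the official metric."""
    return sum(f1(p, g) for p, g in zip(all_predicted, all_gold)) / len(all_gold)
```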

Feature Template   F1     Δ
WEBQA              53.6
-MAX-NE            51.8   -1.8
-NE+COMMON         51.8   -1.8
-GOOGLE RANK       51.4   -2.2
-IN QUEST          50.1   -3.5
-TF-IDF            41.5   -12

Table 4: Feature ablation results. The five features that lead to the largest drop in performance are displayed.

Figure 2 (bar chart; panels omitted): Proportion of examples that passed or failed candidate extraction for each compositionality type (Simple, Filter, N-ary, Conj., Compos., Superl., Other), as well as average dev F1 for each compositionality type. COMPOSITION and SUPERLATIVE questions are difficult for WEBQA.

Analysis  To understand the success of WEBQA on different compositionality types, we manually annotated the compositionality type of 100 random examples that passed candidate extraction and 50 random examples that failed candidate extraction. Figure 2 presents the results of this analysis, as well as the average F1 obtained for each compositionality type on the 100 examples that passed candidate extraction (note that a question can belong to multiple compositionality types). We observe that COMPOSITION and SUPERLATIVE questions are challenging for WEBQA, while SIMPLE, FILTER, and N-ARY questions are easier (recall that a large fraction of the questions in COMPLEXQUESTIONS are N-ARY). Interestingly, WEBQA performs well on CONJUNCTION questions (“what film victor garber starred in that rob marshall directed”), possibly because the correct answer can obtain signal from multiple snippets.

An advantage of finding answers to questions from web documents compared to semantic parsing is that we do not need to learn the “language of the KB”. For example, the question “who is the governor of California 2010” can be matched directly to web snippets, while in Freebase (Bollacker et al., 2008) the word “governor” is expressed by the complex predicate λx.∃z.GoverPos(x, z) ∧ PosTitle(z, Governor). This could provide a partial explanation for the reasonable performance of WEBQA.

5 Related Work

Our model WEBQA performs QA using web snippets, similar to traditional QA systems like MULDER (Kwok et al., 2001) and AskMSR (Brill et al., 2002). However, it enjoys the advances in commercial search engines of the last decade, and uses a simple log-linear model, which has become standard in Natural Language Processing.

Similar to this work, Yao et al. (2014) analyzed a semantic parsing benchmark with a simple QA system. However, they employed a semantic parser that is limited to applying a single binary relation on a single entity, while we develop a QA system that does not use the target KB at all.

Last, in parallel to this work, Chen et al. (2017) evaluated an unstructured QA system against semantic parsing benchmarks. However, their focus was on examining the contributions of multi-task learning and distant supervision to training rather than on comparing to state-of-the-art semantic parsers.

6 Conclusion

We propose in this paper to evaluate semantic parsing-based QA systems by comparing them to a web-based QA baseline. We evaluate such a QA system on COMPLEXQUESTIONS and find that it obtains reasonable performance. We analyze performance and find that COMPOSITION and SUPERLATIVE questions are challenging for a web-based QA system, while CONJUNCTION and N-ARY questions can often be handled by our QA model.

Reproducibility  Code, data, annotations, and experiments for this paper are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x91d77db37e0a4bbbaeb37b8972f4784f/.

Acknowledgments

We thank Junwei Bao for providing us with the test predictions of his system. We thank the anonymous reviewers for their constructive feedback. This work was partially supported by the Israel Science Foundation, grant 942/16.

References

J. Bao, N. Duan, Z. Yan, M. Zhou, and T. Zhao. 2016. Constraint-based question answering with knowledge graph. In International Conference on Computational Linguistics (COLING).

J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).

J. Berant and P. Liang. 2015. Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics (TACL) 3:545–558.

K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In International Conference on Management of Data (SIGMOD). pages 1247–1250.

A. Bordes, N. Usunier, S. Chopra, and J. Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

E. Brill, S. Dumais, and M. Banko. 2002. An analysis of the AskMSR question-answering system. In Association for Computational Linguistics (ACL). pages 257–264.

D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN / Daily Mail reading comprehension task. In Association for Computational Linguistics (ACL).

D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

D. Hewlett, A. Lacoste, L. Jones, I. Polosukhin, A. Fandrianto, J. Han, M. Kelcey, and D. Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Association for Computational Linguistics (ACL).

F. Hill, A. Bordes, S. Chopra, and J. Weston. 2015. The goldilocks principle: Reading children's books with explicit memory representations. In International Conference on Learning Representations (ICLR).

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.

R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst. 2016. Text understanding with the attention sum reader network. In Association for Computational Linguistics (ACL).

T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Empirical Methods in Natural Language Processing (EMNLP).

C. Kwok, O. Etzioni, and D. S. Weld. 2001. Scaling question answering to the web. ACM Transactions on Information Systems (TOIS) 19:242–262.

P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL). pages 590–599.

C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL system demonstrations.

T. Onishi, H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Empirical Methods in Natural Language Processing (EMNLP).

P. Pasupat and P. Liang. 2015. Compositional semantic parsing on semi-structured tables. In Association for Computational Linguistics (ACL).

J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP).

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).

S. Reddy, M. Lapata, and M. Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics (TACL) 2(10):377–392.

M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv.

E. M. Voorhees and D. M. Tice. 2000. Building a question answering test collection. In ACM Special Interest Group on Information Retrieval (SIGIR). pages 200–207.

Y. Yang, W. Yih, and C. Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Empirical Methods in Natural Language Processing (EMNLP). pages 2013–2018.

X. Yao, J. Berant, and B. Van Durme. 2014. Freebase QA: Information extraction or semantic parsing? In Workshop on Semantic Parsing.

W. Yih, M. Chang, X. He, and J. Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Association for Computational Linguistics (ACL).

M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI). pages 1050–1055.

L. S. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI). pages 658–666.
